A jQuery text extractor (via Java proxy)

Developing online services seems always to bring about new challenges: yesterday’ problem was to extract plain text from a generic web page.

Using  jQuery you can easily extract text from an element by using $(..).text(), so I wanted to put this function to use; the problem then was to make the entire web page available to jQuery.

Before extracting text, you should at first load the content of the intended page somewhere; if you want to preserve the integrity of your calling page, the only suitable solution is to use  an iframe as container:

here are the html basic elements:

  <input type="text" name="FILLFROMHERE" >
  <span onclick="fillFromURL()"></span>
  <iframe id="pageExample" style="display:none;"></iframe>


In order to fill the iframe:

function fillSpamFromURL(){
  var url=$("#FILLFROMHERE").val();
  if (url && url!=""){
    $("#pageExample").attr('src', encodeURIComponent(url));
  }
}

You should wait until the iframe loads the page before starting the text extraction; bind the "load" event on the iframe and it will be raised at the right time:

  $(function(){     $("#pageExample").load(extractInfos);   })
At this point you can access the iframe content by
$(this).contents()
Sadly I discovered what probably I should have known already: cross site DOM inspection is forbidden (see Cross-Domain Communication with IFrames).
To work around this limitation I wrote a really simple ". jsp" proxy page in order to load the "external" page in my domain:

<%@ page import="java.io.InputStream,
                 org.apache.commons.httpclient.*,
                 org.apache.commons.httpclient.methods.GetMethod,
                 java.io.InputStreamReader, java.io.BufferedReader" %><%
  String url=request.getParameter("url");
  HttpClient client = new HttpClient();
  HttpMethod method = new GetMethod(url);
  client.executeMethod(method);
  InputStream bodyAsStream = method.getResponseBodyAsStream();
  StringBuffer sb = new StringBuffer();
  InputStreamReader streamReader = new InputStreamReader(bodyAsStream, "UTF-8");
  BufferedReader reader = new BufferedReader(streamReader);
  while (true) {
    int cr = reader.read();
    if (cr < 0)
      break;
    sb.append((char) cr);
  }
  String thePage= sb.toString();
  bodyAsStream.close();
  method.releaseConnection();
%><%=thePage%>

Now the iframe load function should be changed to :

$("#pageExample").attr('src', "proxy.jsp?url="+encodeURIComponent(url));

Warning: the proxied "page" can’t load relative resources, so this solution may not be a  suitable  for everyone.
The jQuery .text() function will retrieve text even in the script tags, so you can use a little trick to remove unwanted tags:

function extractInfos(){
  var ifraBody = $(this).contents().find("body");
  ifraBody.find("script,style,object,link,embed").remove();
  var text=ifraBody.text()+"";
  var re = new RegExp("(\\s){2,}", "g");
  text =text.replace(re,"$1");
  $("#EXCERPT").val(text);
}

The regular expression replacement will remove replications in "space like" chars.

I’m using this code to instruct anti-spam services (like Defensio or Akismet), supplying them sample contents to tune spam detection, which we will use for Patapage (patapage.com)’s comments. These services usually need to know the content of the main article before rating a comment as spam or not.