Developing online services always seems to bring new challenges: yesterday's problem was to extract plain text from a generic web page.
With jQuery you can easily extract the text of an element with $(..).text(), so I wanted to put this function to use; the problem then was to make the entire web page available to jQuery.
Before extracting text, you first have to load the content of the target page somewhere; if you want to preserve the integrity of the calling page, the only suitable solution is to use an iframe as the container. Here are the basic HTML elements:
<input type="text" id="FILLFROMHERE" name="FILLFROMHERE">
<span onclick="fillSpamFromURL()"></span>
<iframe id="pageExample" style="display:none;"></iframe>
<textarea id="EXCERPT"></textarea>
In order to fill the iframe:
function fillSpamFromURL(){
  var url = $("#FILLFROMHERE").val();
  if (url){
    // The URL goes into src as-is: encoding the whole address here would turn it into a relative path.
    $("#pageExample").attr('src', url);
  }
}
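Since whatever is typed into the field ends up as the iframe source, a check stricter than "not empty" may be worth having. The following is only a sketch: isLoadableUrl is a hypothetical helper that is not part of the original page, and it assumes an environment that provides the standard URL constructor.

```javascript
// Hypothetical helper (not in the original page): accept only absolute http(s) URLs.
function isLoadableUrl(value) {
  if (typeof value !== "string" || value.trim() === "") {
    return false;
  }
  try {
    var parsed = new URL(value.trim());
    // Reject schemes such as javascript: or ftp: that we do not want in the iframe.
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch (e) {
    // new URL() throws on anything that is not an absolute URL, e.g. "example.com".
    return false;
  }
}
```

In fillSpamFromURL the test `if (url)` could then become `if (isLoadableUrl(url))`.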
You should wait until the iframe loads the page before starting the text extraction; bind the "load" event on the iframe and it will be raised at the right time:
// Run extractInfos every time the iframe finishes loading a page.
$(function(){ $("#pageExample").on("load", extractInfos); });
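There is a catch: the browser's same-origin policy does not let a script read the contents of an iframe loaded from another domain, so the page has to be served from your own domain. A small JSP proxy (proxy.jsp), built on Commons HttpClient, does the job:
<%@ page import="java.io.InputStream,
org.apache.commons.httpclient.*,
org.apache.commons.httpclient.methods.GetMethod,
java.io.InputStreamReader, java.io.BufferedReader" %><%
// Fetch the page named by the "url" parameter and echo it back from our own domain.
String url = request.getParameter("url");
// Only proxy absolute http(s) URLs; anything else is a bad request.
if (url == null || !(url.startsWith("http://") || url.startsWith("https://"))) {
  response.sendError(HttpServletResponse.SC_BAD_REQUEST, "Missing or invalid url parameter");
  return;
}
HttpClient client = new HttpClient();
HttpMethod method = new GetMethod(url);
StringBuilder sb = new StringBuilder();
try {
  client.executeMethod(method);
  InputStream bodyAsStream = method.getResponseBodyAsStream();
  if (bodyAsStream != null) {
    // Assumes the remote page is UTF-8 encoded.
    BufferedReader reader = new BufferedReader(new InputStreamReader(bodyAsStream, "UTF-8"));
    int cr;
    while ((cr = reader.read()) >= 0) {
      sb.append((char) cr);
    }
    reader.close();
  }
} finally {
  // Always give the connection back, even if the request failed.
  method.releaseConnection();
}
String thePage = sb.toString();
%><%=thePage%>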
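The iframe then has to point at the proxy instead of at the remote page, so the line that sets its src becomes:
$("#pageExample").attr('src', "proxy.jsp?url="+encodeURIComponent(url));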
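To see why encodeURIComponent matters here, this is a small sketch of what the proxy receives with and without it. buildProxyUrl and queryParam are hypothetical helpers, the example.com address is made up, and URLSearchParams only stands in for the servlet container's parameter parsing, which I assume behaves the same way for this case.

```javascript
// Hypothetical helper: build the same-domain proxy URL for a remote page.
function buildProxyUrl(remoteUrl) {
  return "proxy.jsp?url=" + encodeURIComponent(remoteUrl);
}

// Hypothetical helper: read one query parameter the way a server would.
function queryParam(url, name) {
  var query = url.slice(url.indexOf("?") + 1);
  return new URLSearchParams(query).get(name);
}

var remote = "http://example.com/a?b=1&c=2";

// Encoded: the whole remote address travels inside the single "url" parameter.
var proxied = buildProxyUrl(remote);

// Not encoded: "&c=2" is read as a second parameter of proxy.jsp itself.
var unencoded = "proxy.jsp?url=" + remote;

console.log(queryParam(proxied, "url"));   // http://example.com/a?b=1&c=2
console.log(queryParam(unencoded, "url")); // http://example.com/a?b=1
```

Without the encoding, the remote page's own query string gets cut at the first "&", and the proxy would fetch the wrong address.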
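Once the proxied page has loaded, the handler strips the non-textual elements and extracts the text:
function extractInfos(){
  // "this" is the iframe; contents() gives access to its document.
  var ifraBody = $(this).contents().find("body");
  // Drop the elements whose content is not readable text.
  ifraBody.find("script,style,object,link,embed").remove();
  var text = ifraBody.text();
  // Collapse every run of two or more whitespace characters into a single one.
  var re = /(\s){2,}/g;
  text = text.replace(re, "$1");
  $("#EXCERPT").val(text);
}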
The regular expression replacement collapses each run of consecutive whitespace characters (spaces, tabs, newlines) into a single one.
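The whitespace cleanup can be tried on its own, without any page loaded. This is a minimal sketch; collapseWhitespace is a hypothetical name wrapping the same pattern used in extractInfos.

```javascript
// Hypothetical wrapper around the same pattern used in extractInfos.
function collapseWhitespace(text) {
  // (\s){2,} matches a run of two or more whitespace characters;
  // $1 holds the last character of the run, which is the one that is kept.
  return text.replace(/(\s){2,}/g, "$1");
}

console.log(JSON.stringify(collapseWhitespace("Hello   world")));    // "Hello world"
console.log(JSON.stringify(collapseWhitespace("Title\n\n\nBody")));  // "Title\nBody"
console.log(JSON.stringify(collapseWhitespace("one two")));          // "one two"
```

Note that the surviving character is whichever one ends the run, so a block of blank lines stays a line break while a row of spaces stays a space.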
I’m using this code to train anti-spam services (like Defensio or Akismet), supplying them with sample content to tune the spam detection we will use for comments on Patapage (patapage.com). These services usually need to know the content of the main article before rating a comment as spam or not.