
Processing Unstructured Text with OpenCalais

January 5th, 2009

OpenCalais provides a service for analyzing data and returning various RDF derivatives. OpenCalais is not designed to process the unstructured data typically found on web pages. Most research in this area, Information Extraction (IE), is about extracting text from corpora that contain more prose than a typical web page. As part of my doctoral research I had to compare various techniques and processes for information extraction, starting with named entity extraction and relations. One of the approaches I have been working on compares the following.

  1. OpenCalais analysis of a web page
  2. OpenCalais analysis of a web page that has been chunked into sentences
  3. LingPipe analysis of a web page
  4. LingPipe analysis of a web page that has been chunked into sentences
  5. Mallet analysis of a web page
  6. Mallet analysis of a web page that has been chunked into sentences
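
The even-numbered variants depend on splitting the page text into sentences before each one is handed to the extractor. The post does not show how that step is done, so the following is only a minimal sketch: it assumes the markup has already been stripped from the page, and it uses the JDK's own java.text.BreakIterator rather than the sentence models in LingPipe or Mallet. The class name SentenceChunker is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: splits already-extracted page text into sentences. */
final class SentenceChunker {

    private SentenceChunker() {}

    /** Returns the sentences found in text, trimmed, skipping blank spans. */
    static List<String> chunk(String text) {
        List<String> sentences = new ArrayList<>();
        // Rule-based sentence boundaries from the JDK; the locale is an assumption.
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text);

        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

Each element of SentenceChunker.chunk(pageText) could then be submitted to the extractor on its own and the results compared with a single call on the whole page. BreakIterator is rule based and can mis-split around abbreviations, so a trained sentence model may give different chunks.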

I intend to publish these code segments here and eventually make them all available on Kenai for general download. The current project, in whatever state it is in, can be found here.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

To run this code try something like this and provide a valid configuration.xml file.

A short aside on parsing web pages for data. Many years ago we were doing this in Sun Labs, at about the same time we were working on something similar to REST called the URL Programming Interface (UPI), as was Berkeley, and we ran into constant opposition from many in the industry who thought that it was too brittle and too heuristic. Each year I see more and more use of this approach, and a general realization that there is quite a bit of unstructured data out there that is below the level of a news article. I have been working on a very simple classification scheme in which pages are assigned a prose index; the index is yet to be named. Some of the classifications:

  1. Sloppy HTML page (fails DOM)
  2. Neat HTML page (passes DOM)
  3. Mainly text page
  4. All text
  5. Link page
  6. Fully REST
  7. Bad REST page
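
The first four classes above can be sketched in code, with heavy caveats. The post does not say how the DOM test or the prose index is computed, so everything here is an assumption: "passes DOM" is approximated by a strict XML parse with the JDK's DocumentBuilder, the text ratio is a crude stand-in for the unnamed prose index, and the 0.8 threshold and the name PageClassifier are hypothetical. The link and REST classes are not covered.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of the first four page classes; not the author's scheme. */
final class PageClassifier {

    enum Kind { SLOPPY_HTML, NEAT_HTML, MAINLY_TEXT, ALL_TEXT }

    /** Assumed cut-off above which a page counts as "mainly text". */
    static final double MAINLY_TEXT_THRESHOLD = 0.8;

    private PageClassifier() {}

    /** Approximates "passes DOM" as: the page parses as well-formed XML. */
    static boolean passesDom(String page) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler keeps parse errors off stderr; fatal errors still throw.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(page)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    /** Share of non-whitespace characters that lie outside of tags. */
    static double textRatio(String page) {
        String withoutTags = page.replaceAll("<[^>]*>", "");
        int total = countNonWhitespace(page);
        return total == 0 ? 0.0 : (double) countNonWhitespace(withoutTags) / total;
    }

    private static int countNonWhitespace(String s) {
        return s.replaceAll("\\s+", "").length();
    }

    /** Text-heavy pages are classified by ratio, the rest by the DOM check. */
    static Kind classify(String page) {
        double ratio = textRatio(page);
        if (ratio == 1.0) {
            return Kind.ALL_TEXT;      // no markup was removed at all
        }
        if (ratio >= MAINLY_TEXT_THRESHOLD) {
            return Kind.MAINLY_TEXT;
        }
        return passesDom(page) ? Kind.NEAT_HTML : Kind.SLOPPY_HTML;
    }
}
```

A strict XML parse rejects a great deal of HTML that browsers accept, so a lenient HTML parser would probably move many pages from the sloppy class to the neat one.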

Obviously there are advantages to structured data and the semantic web, but how do you get there without throwing out all the content we currently have?

  1. January 6th, 2009 at 14:22 | #1

    Rinaldo:

    Tom Tague from Calais here.

    Just a quick suggestion. If you’re trying to work directly with formatted HTML pages and Calais you might want to take a look at semanticproxy.com. This tool fetches the page for you and attempts to do basic HTML cleansing before handing it to Calais for processing.

    As I’m sure you’re aware cleansing is hard – but it’s doing fairly well and we’ll continue to improve it over time.

    Regards,

  2. January 6th, 2009 at 20:12 | #2

    I have tried it and I will try it again. I have a few thousand pages, and semanticproxy failed to parse quite a few of them at all. I will send you the results privately if you want them. I am also doing cleansing, so it will be informative to see the difference; perhaps we can learn something from the differences, if any.

  3. January 8th, 2009 at 11:07 | #3

    Hi, it would be interesting to hear your thoughts on Zemanta API as well, because it is more focused on user generated content analysis…

  4. March 2nd, 2009 at 07:41 | #4

    OpenCalais is great; I use it in a new website I created recently: http://www.klezio.com
    News items are automatically classified and news metadata is extracted; contextual information is fetched from apps such as Wikipedia, Flickr, Twitter or Delicious.
    Hope it’ll serve.
    Regards,
