Archive for the ‘doctorate’ Category

BestBuy RDF data — SPARQL access

September 7th, 2009

Recently Best Buy made product and other information available for download; read about it here. I downloaded some of the data and made it available for SPARQL queries. Here are some of the tasks I have performed with the data, along with some observations and open questions.

  1. Data is made available as a series of files with little or no description of what the breakdown is.
  2. I downloaded the first 20,000 RDF descriptions, and 14,610 of them had errors.
  3. I proceeded to load the remaining 5,390 into the OpenRDF (formerly Sesame) engine.
  4. I am deciding whether to write a general process that takes the data directly from the sitemap files and loads it into the triple store.
  5. Why is some of this data so old?
  6. Will Best Buy make updates available?
  7. What will Best Buy’s policy be on downloads?
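
The download-and-filter steps above could be sketched roughly as follows. This is a minimal illustration, not the process I actually ran: it assumes the sitemap files follow the standard sitemaps.org format, it treats "had errors" as "is not well-formed XML" (the real errors may have been of a different kind), and the example.com URLs and sample documents are made up.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org files (assumed for Best Buy's sitemaps).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml):
    """Return every <loc> URL listed in one sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def partition_rdf(documents):
    """Split {url: rdf_xml_text} into (well_formed, broken).

    Only XML well-formedness is checked here; a real triple store
    may reject documents for other reasons as well.
    """
    good, bad = {}, {}
    for url, text in documents.items():
        try:
            ET.fromstring(text)
        except ET.ParseError as err:
            bad[url] = str(err)   # keep the parser message for later inspection
        else:
            good[url] = text
    return good, bad


# Hypothetical sample data standing in for a sitemap and two downloads.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/products/1.rdf</loc></url>
  <url><loc>http://example.com/products/2.rdf</loc></url>
</urlset>"""

SAMPLE_DOCS = {
    "http://example.com/products/1.rdf":
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>',
    "http://example.com/products/2.rdf":  # root element is never closed
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">',
}

if __name__ == "__main__":
    urls = sitemap_urls(SAMPLE_SITEMAP)
    good, bad = partition_rdf(SAMPLE_DOCS)
    print(len(urls), "URLs listed;", len(good), "loadable;", len(bad), "rejected")
```

The documents that survive the check would then be handed to the triple store's own loader, which is left out here.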

This is an interesting trend to follow, especially if the following happens:

  1. eBay makes data available in some RDF-like format
  2. Amazon makes data available in some RDF-like format
  3. Netflix makes data available in some RDF-like format
  4. More applications appear that are not basically FOAF or social networking based
  5. Valuable everyday data becomes available this way. For something simple, say you go to the supermarket and buy some groceries: as part of your credit card service you get an RDF representation of what you bought, so you can look at your purchases from the dimensions that interest you, such as food composition, quality, calories, cost, environmental impact and so on.

The current trend of creating mashups and APIs is yet another Tower of Babel in the computer industry. Each individual application is usually very nice, but when you want a more global scope it doesn’t work very well. General adoption of RDF would provide a common language and shift the effort to some more tractable problems, such as:

  1. Vocabulary differences
  2. SPARQL performance and usage

I believe these are more tractable problems than trying to get all the different APIs to work together. Having a different mashup for each site that doesn’t work with other sites is not an internet I look forward to. Besides, most of the data already exists in DB2, MySQL and Oracle databases; making it available as RDF for corporate customers is a useful and not trivial service that would allow websites to instantly participate in the Semantic Web experiment.
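
To make the idea of exposing existing relational data as RDF a little more concrete, here is a minimal sketch that turns the rows of one table into N-Triples lines. Everything in it is an assumption for illustration: the in-memory SQLite database stands in for DB2, MySQL or Oracle, the `product` table and its columns are invented, the `http://example.org/` namespace is hypothetical, and every value is emitted as a plain string literal rather than being mapped to a shared vocabulary such as GoodRelations.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, not a real vocabulary


def escape_literal(value):
    """Escape a value for use inside a double-quoted N-Triples literal."""
    return (str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r"))


def rows_to_ntriples(conn, table, key_column):
    """Emit one triple per non-null, non-key column of every row.

    The table name is interpolated into the SQL, so it must come from
    trusted configuration, never from user input.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    lines = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key_column]}>"
        for column, value in record.items():
            if column == key_column or value is None:
                continue
            predicate = f"<{BASE}{table}#{column}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


def demo_connection():
    """Build a tiny made-up product table for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (sku TEXT PRIMARY KEY, name TEXT, price REAL)")
    conn.execute("INSERT INTO product VALUES ('1001', 'Plasma TV', 499.99)")
    return conn


if __name__ == "__main__":
    for line in rows_to_ntriples(demo_connection(), "product", "sku"):
        print(line)
```

The hard part, as noted above, is not this mechanical step but agreeing on the vocabulary the predicates come from.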

Some of the more relevant links, also repeated at GoodRelations, a promising application of the Semantic Web to commerce, products and services:

  • http://www.mail-archive.com/public-lod@w3.org/msg03445.html
  • Company data in RDF/XML using Goodrelations
  • A product RDF representation
  • I have quite a few experiments planned; the simpler ones use Tabulator, Operator and OpenRDF.

    Processing Unstructured Text with OpenCalais

    January 5th, 2009

    OpenCalais provides a service for analyzing data and returning various RDF derivatives. OpenCalais is not designed to process the unstructured data typically found on web pages. Most research in this area, Information Extraction (IE), is about extracting text from corpora that have a higher prose number than a typical web page. As part of my doctoral research I had to compare various techniques and processes, starting with named entity extraction and relation extraction. One of the approaches I have been working on compares the following.

    1. OpenCalais analysis of a web page
    2. OpenCalais analysis of a web page that has been chunked into sentences
    3. LingPipe analysis of a web page
    4. LingPipe analysis of a web page that has been chunked into sentences
    5. Mallet analysis of a web page
    6. Mallet analysis of a web page that has been chunked into sentences
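
The "chunked into sentences" variants in the list above need a preprocessing step that pulls the visible text out of a page and splits it into sentences. The following is only a rough sketch of that step, not the project code: it uses the Python standard library, the class and function names are made up for this example, and the sentence splitter is a naive regular expression that will mis-split abbreviations such as "Dr." or "U.S.". The calls to OpenCalais, LingPipe or Mallet are not shown.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def page_text(html):
    """Return the page's visible text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def chunk_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Made-up page used only to exercise the functions.
SAMPLE_PAGE = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>Best Buy published RDF data. Is it current?</p><p>Nobody knows!</p>"
    "</body></html>"
)

if __name__ == "__main__":
    for sentence in chunk_sentences(page_text(SAMPLE_PAGE)):
        print(sentence)
```

Each sentence would then be sent to the extractor on its own, and the results compared with those from submitting the whole page at once.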

    I intend to publish these code segments here and eventually make it all available on Kenai for general download. The current project, in whatever state it is in, can be found here.
    This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

    To run this code try something like this and provide a valid configuration.xml file.

    A short aside on parsing web pages for data. Many years ago we were doing this in Sun Labs, at about the same time we were working on something similar to REST called URL Programming Interface (UPI), as was Berkeley, and we ran into constant opposition from many in the industry who thought that it was too brittle and too heuristic. Each year I see more and more use of this approach, and a general realization that there is quite a bit of unstructured data out there that is below the level of a news article.

    I have been working on a very simple classification scheme in which pages are assigned a prose index. This prose index is yet to be named. Some of the sites and/or classifications:

    1. Sloppy HTML page (fails DOM)
    2. Neat HTML page (passes DOM)
    3. Mainly Text page
    4. All Text
    5. Link page
    6. Fully REST
    7. Bad REST page
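
The prose index itself is not defined yet, so the following is purely a guess at what a first approximation might look like, not the scheme described above. It assumes two measures: the share of a page's characters that are visible text, and the share of that text that sits inside links. The function names and the 0.5 cutoffs are arbitrary placeholders, only three of the seven classes are attempted, and script or style contents are counted as text for simplicity.

```python
from html.parser import HTMLParser


class ProseCounter(HTMLParser):
    """Count characters of text overall and characters inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.link_chars = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        count = len(data.strip())
        self.text_chars += count
        if self._link_depth:
            self.link_chars += count


def prose_measures(html):
    """Return (text_ratio, link_ratio) for a page; both lie in [0, 1]."""
    counter = ProseCounter()
    counter.feed(html)
    counter.close()
    text_ratio = counter.text_chars / len(html) if html else 0.0
    link_ratio = counter.link_chars / counter.text_chars if counter.text_chars else 0.0
    return text_ratio, link_ratio


def rough_class(html, link_cutoff=0.5, text_cutoff=0.5):
    """Map the two measures to a label; the cutoffs are placeholder guesses."""
    text_ratio, link_ratio = prose_measures(html)
    if text_ratio == 1.0:
        return "All Text"
    if link_ratio >= link_cutoff:
        return "Link page"
    if text_ratio >= text_cutoff:
        return "Mainly Text page"
    return "other"


if __name__ == "__main__":
    print(rough_class('<ul><li><a href="/a">First link</a></li></ul>'))
```

Distinguishing sloppy from neat HTML, or good from bad REST pages, would need checks this sketch does not attempt.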

    Obviously there are advantages to structured data and the Semantic Web, but how do you get there without throwing out all the content we currently have?