Our group discussed the problem of 'web content extraction and mining'.
Mainly, we focused on the following questions:
What is Content Extraction?
Content extraction (CE) identifies parts of a document for extraction (see also logical labeling or
document structure analysis). It is not the same as Information Retrieval (IR). Nowadays, it is
moving more toward the extraction of semantic information.
Content extraction deals with locating predefined information types, e.g. names, addresses, ...
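As a minimal illustration of locating predefined information types, one can match simple patterns in text. This is only a sketch, not a real CE system; the two patterns and the sample string are my own assumptions:

```python
import re

# Simple illustrative patterns for two predefined information types.
# Real content-extraction systems use far more robust techniques.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d ()/-]{6,}\d"),
}

def extract(text):
    """Return a dict mapping each information type to the strings found."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}

found = extract("Contact jane.doe@example.org or +49 631 12345.")
print(found["email"])  # ['jane.doe@example.org']
```

Of course, real pages require much more than regular expressions; the point is only that CE targets *predefined* types of information rather than whole relevant documents, as IR does.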
What is Mining?
Mining deals with inferring information from extracted data, e.g. trends. It is used to find things
which are not explicit in the data, e.g. relationships between objects within one or several documents.
For example, defects in a product can be found by applying text mining to customers' e-mails (or letters):
if many customers complain about similar things concerning a particular product, then something is probably wrong with it.
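The e-mail example above can be sketched as counting how often complaint terms co-occur with each product name across messages. This is a toy illustration; the product names, messages, and complaint vocabulary are all invented:

```python
from collections import Counter

# Hypothetical customer messages (invented for illustration).
emails = [
    "My X200 printer jams on every second page.",
    "The X200 jams constantly, please help.",
    "Very happy with the S10 scanner, works fine.",
]

COMPLAINT_TERMS = {"jams", "broken", "crash"}

def complaints_per_product(messages, products=("X200", "S10")):
    """Count messages that mention a product together with a complaint term."""
    counts = Counter()
    for msg in messages:
        words = set(msg.lower().replace(",", " ").replace(".", " ").split())
        for product in products:
            if product.lower() in words and words & {t.lower() for t in COMPLAINT_TERMS}:
                counts[product] += 1
    return counts

print(complaints_per_product(emails))  # Counter({'X200': 2})
```

A spike in the count for one product would suggest a defect, even though no single e-mail states "this product is defective" explicitly.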
What is the difference from 'traditional' Information Extraction?
The main difference is the use of structure. Traditional Information Extraction (IE) uses no structure;
it relies only on the text data. On the web, the document structure can be used for several purposes,
e.g. to infer the content (or document) type.
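For instance, the tag structure of a page can hint at its document type: a page dominated by table cells probably carries tabular data such as a timetable. A rough sketch using Python's standard `html.parser`; the threshold and the type labels are arbitrary assumptions of mine:

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count start tags to get a crude structural profile of a page."""
    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def guess_type(html):
    p = TagCounter()
    p.feed(html)
    # Arbitrary heuristic: many table cells -> treat as a table-like document.
    if p.tags["td"] >= 4:
        return "table-like"
    return "text-like"

page = ("<table><tr><td>Mon</td><td>Tue</td></tr>"
        "<tr><td>9:00</td><td>10:00</td></tr></table>")
print(guess_type(page))  # table-like
```

Plain-text IE has no access to such structural cues, which is exactly the difference the group discussed.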
But web documents also bring some problems. Document Analysis has several things to offer here!
Web Document Analysis is a new field, and DA techniques can help to solve some WDA problems:
The way information is encoded within a page is not uniform, e.g. the timetables of different
universities all look different!
Furthermore, the big encoding problem is that people want fancy pages, with a picture here and a blinking object there...
Another aspect is that a great variety of document formats exists on the WWW, e.g. ASCII text, HTML, XML, PDF, CGI-generated pages, JavaScript, ... The only standard format seems to be a TIFF image. "Some existing systems first convert a web page into a TIFF image and then use 'traditional' DA techniques" (Bertin Klein).
But the question remains: "How to handle 'real' HTML, XML, ... pages?"
What influence does the 'Semantic Web' have on us?
Everybody in the group agreed that including semantics will improve a system, and that
ontologies, WordNet, CYC, etc. are helpful. (Although Michael Rys has shown that
an ontology for the complete world is not practicable!) But the problem with all of these is
that they are hand-coded. What we need is an 'online'-learning system.
Another thing to keep in mind are collaborative aspects, like third-party comments or ratings at Amazon's shop, Direct Hit, or CiteSeer.
For those who want to read my handwritten notes and
for those who want to test their handwriting recognition engine...
(Please do not ask for ground-truth data ;-)
notes page 1
notes page 2
notes page 3
[Note: The following names are listed as well as I could remember them; the list may not be complete. Please contact me if I have forgotten you.]
University of Liverpool
Roger B. Bradford
IBM Almaden Research Center
University of Maryland
WhizBang! Labs Inc.
Dept. of Computer and Systems Sciences,
Osaka Prefecture University
Ahmad Fuad R Rahman
BCL Computers Inc.