Our group discussed the problem of 'web content extraction and mining'.
Mainly, we focused on the following questions:
What is Content Extraction?
Content extraction (CE) identifies parts of a document for extraction (see also logical labeling or
document structure analysis). It is not the same as Information Retrieval (IR). Nowadays, it is
moving more towards the extraction of semantic information.
Content extraction deals with locating predefined information types, e.g. names, addresses,
dates, etc.
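As a small illustration, locating such predefined information types can be as simple as pattern
matching. The following Python sketch uses made-up regular expressions and sample text; a real
system would need far more robust patterns:

    import re

    # Illustrative patterns for two predefined information types.
    DATE_PATTERN = re.compile(r'\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b')
    EMAIL_PATTERN = re.compile(r'\b[\w.+-]+@[\w-]+\.[\w.-]+\b')

    def extract_entities(text):
        """Locate predefined information types in raw text."""
        return {
            'dates': DATE_PATTERN.findall(text),
            'emails': EMAIL_PATTERN.findall(text),
        }

    sample = "Contact klein@dfki.uni-kl.de before 15.09.2001 for details."
    print(extract_entities(sample))
    # {'dates': ['15.09.2001'], 'emails': ['klein@dfki.uni-kl.de']}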
What is Mining?
Mining deals with inferring information from extracted data, e.g. trends. It is used to find things
which are not explicit in the data, e.g. relationships between objects within one or several documents.
For example, defects in a product can be found by applying text mining to customers' e-mails (or letters).
If many customers complain about similar things concerning a particular product, then something is probably wrong with it.
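A toy sketch of that idea in Python, assuming the complaints have already been extracted from the
e-mails as (product, symptom) pairs; the data and the threshold are invented for illustration:

    from collections import Counter

    # Invented complaint data: (product, symptom) pairs produced by
    # some upstream extraction step over customer e-mails.
    complaints = [
        ("printer-x", "paper jam"), ("printer-x", "paper jam"),
        ("printer-x", "paper jam"), ("scanner-y", "driver crash"),
    ]

    THRESHOLD = 3  # assumed: this many similar complaints suggest a defect

    counts = Counter(complaints)
    for (product, symptom), n in counts.items():
        if n >= THRESHOLD:
            print(f"Possible defect: {product} -> {symptom} ({n} complaints)")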
What is the difference from 'traditional' Information Extraction?
The main difference is the use of structure. Traditional Information Extraction (IE) uses no
structure; it relies only on the text data. On the web, the document structure can be used for
several purposes, e.g. to infer the content (or document) type.
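A minimal sketch of exploiting structure, using Python's standard HTML parser; the tag counts and
the classification rule are a hand-made assumption, not an established method:

    from collections import Counter
    from html.parser import HTMLParser

    class TagCounter(HTMLParser):
        """Count the structural tags occurring in a page."""
        def __init__(self):
            super().__init__()
            self.tags = Counter()

        def handle_starttag(self, tag, attrs):
            self.tags[tag] += 1

    def guess_page_type(html):
        parser = TagCounter()
        parser.feed(html)
        # Crude heuristic: more table markup than paragraphs
        # suggests a table-oriented page such as a timetable.
        if parser.tags['table'] + parser.tags['td'] > parser.tags['p']:
            return 'table-oriented (e.g. a timetable)'
        return 'text-oriented (e.g. an article)'

    page = "<table><tr><td>9:00</td><td>DA lecture</td></tr></table>"
    print(guess_page_type(page))  # table-oriented (e.g. a timetable)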
But web documents also bring some problems. Document Analysis has a lot to offer here!
Web Document Analysis is a new field, and DA techniques can help to solve some WDA problems:
The way information is encoded within a page is not uniform, e.g. the timetables of different
universities all look different!
Furthermore, the big encoding problem is that people want fancy pages, with a picture
here and a blinking object there...
Another aspect is that a great variety of document formats exists on the WWW, e.g. ASCII text,
HTML, XML, PDF, CGI-generated pages, JavaScript, ... The only standard format seems to be
a TIFF image. "Some existing systems first convert a web page into a TIFF image and then use
'traditional' DA techniques" (Bertin Klein).
But the question remains: "How to handle 'real' HTML, XML, ... pages?"
What influence does the 'Semantic Web' have on us?
Everybody in the group agreed that including semantics will improve a system and that
ontologies, WordNet, CYC, etc. are helpful. (Although Michael Rys has shown that
an ontology for the complete world is not practicable!) But the problem with all of these is
that they are hand-coded. What we need is an 'online' learning system.
Another thing to keep in mind are collaborative aspects like 3rd-party
comments or ratings in Amazon's shop, Direct Hit, or CiteSeer.
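As one small example of how such a resource can be used, here is a sketch that looks up
semantically related terms in WordNet via NLTK (this assumes NLTK is installed and the WordNet
corpus has been fetched with nltk.download('wordnet'); the query term is arbitrary):

    from nltk.corpus import wordnet

    def related_terms(word):
        """Collect synonyms and broader concepts (hypernyms) for a word."""
        terms = set()
        for synset in wordnet.synsets(word):
            terms.update(synset.lemma_names())   # synonyms
            for hyper in synset.hypernyms():     # broader concepts
                terms.update(hyper.lemma_names())
        return terms - {word}

    print(sorted(related_terms('car')))
    # e.g. ['auto', 'automobile', 'motor_vehicle', ...]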
Handwritten notes:
For those who want to read my handwritten notes and
for those who want to test their handwriting recognition engine...
(Please do not ask for ground-truth data ;-)
notes page 1
notes page 2
notes page 3
presentation slide
Participants
[Note: The following names are listed as far as I could remember them. The list may be incomplete.
Please contact me if I have forgotten you.]
Apostolos Antonacopoulos
University of Liverpool
A.Antonacopoulos@csc.liv.ac.uk
Roger B. Bradford
IBM Almaden Research Center
roger.b.bradford@saic.com
Andreas Dengel
DFKI GmbH
Andreas.Dengel@dfki.de
David Doermann
University of Maryland
doermann@umiacs.umd.edu
Matthew Hurst
WhizBang! Labs Inc.
mhurst@whizbang.com
Koichi Kise
Dept. of Computer and Systems Sciences,
Osaka Prefecture University
kise@cs.osakafu-u.ac.jp
Bertin Klein
DFKI GmbH
klein@dfki.uni-kl.de
Stefan Klink
DFKI GmbH
Stefan.Klink@dfki.de
Ahmad Fuad R Rahman
BCL Computers Inc.
fuad@bcl-computers.com
...