First International Workshop on
Web Document Analysis (WDA 2001)

Report on Discussion Group III
Web Content Extraction and Mining

Stefan Klink
German Research Center for Artificial Intelligence (DFKI GmbH)

Matthew Hurst
WhizBang! Labs Inc.

Our group discussed the problem of 'web content extraction and mining'. We focused mainly on the following questions:

What is Content Extraction?

Content extraction (CE) identifies parts of the document for extraction (see also logical labeling or document structure analysis). It is not the same as Information Retrieval (IR). Nowadays, the trend is towards extracting semantic information. Content extraction deals with locating predefined information types, e.g. names, addresses, dates, etc.
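As a toy illustration of locating predefined information types, here is a minimal sketch using only regular expressions. The patterns and the sample text are illustrative only; real CE systems use far more robust grammars or learned models.

```python
import re

# Toy patterns for two predefined information types.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def extract(text):
    """Locate predefined information types (dates, e-mail addresses)."""
    return {
        "dates": DATE_PATTERN.findall(text),
        "emails": EMAIL_PATTERN.findall(text),
    }

# Hypothetical sample input, for illustration only.
sample = "Contact klink@dfki.de before 15/07/2001 about WDA."
print(extract(sample))
```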

What is Mining?

Mining deals with inferring information from extracted data, e.g. trends. It is used to find things that are not explicit in the data, e.g. relationships between objects within one or several documents. For example, defects in a product can be found by applying text mining to customers' e-mails (or letters): if many customers complain about similar things regarding a particular product, then something must be wrong with it.
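The complaint example can be sketched as a simple term-frequency analysis per product. The e-mails and product names below are invented for illustration; a real text-mining system would work over a mail archive with proper tokenization and statistics.

```python
from collections import Counter

# Hypothetical customer e-mails as (product, text) pairs.
emails = [
    ("X-100", "the battery dies after one hour"),
    ("X-100", "battery life is terrible"),
    ("X-100", "my battery stopped charging"),
    ("Y-200", "great product, no complaints"),
]

def complaint_terms(emails, min_count=2):
    """Surface terms that recur across complaints about one product."""
    by_product = {}
    for product, text in emails:
        by_product.setdefault(product, Counter()).update(text.split())
    return {
        product: [w for w, c in counts.items() if c >= min_count]
        for product, counts in by_product.items()
    }

print(complaint_terms(emails))  # 'battery' recurs for X-100
```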

What is the difference from 'traditional' Information Extraction?

The main difference is the use of structure. Traditional Information Extraction (IE) uses no structure; it relies only on the text data. On the Web, the document structure can be used for several purposes, e.g. to infer the content (or document) type.
But web documents also introduce some problems of their own.
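Inferring a document type from structure can be sketched with a crude tag-counting heuristic. The thresholds and type labels are invented for illustration; a real system would learn a classifier over richer structural features.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count start tags as a crude structural fingerprint of a page."""
    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def guess_type(html):
    """Toy heuristic: table-heavy pages are 'data', paragraph-heavy 'article'."""
    parser = TagCounter()
    parser.feed(html)
    if parser.tags["table"] + parser.tags["td"] > parser.tags["p"]:
        return "data page"
    return "article"

print(guess_type("<table><tr><td>9:00</td><td>DB lecture</td></tr></table>"))
```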

What does Document Analysis have to offer?

Document Analysis has several things to offer! Although Web Document Analysis is a new field, DA techniques can help to solve some WDA problems.

How to deal with encodings on the Web?

The way information is encoded within pages is not uniform, e.g. university timetables all look different!
Furthermore, a big encoding problem is that people want fancy pages, with a picture here and a blinking object there...
Another aspect is that a large variety of document formats exists on the WWW, e.g. ASCII text, HTML, XML, PDF, CGI-generated pages, JavaScript, ... The only standard format seems to be the TIFF image. "Some existing systems first convert a web page into a TIFF image and then use 'traditional' DA techniques" (Bertin Klein).
But the question remains: "How to handle 'real' HTML, XML, ... pages?"
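One way to cope with the format variety is to dispatch each document to a format-specific handler, with rendering plus 'traditional' DA as the fallback. The handler names below are placeholders; real systems would plug in an HTML parser, an XML parser, a PDF converter, etc.

```python
# Placeholder handlers; each returns the analysis strategy it stands for.
def handle_html(data):
    return "parse tag structure"

def handle_xml(data):
    return "parse element tree"

def handle_plain(data):
    return "plain-text IE"

HANDLERS = {
    ".html": handle_html,
    ".htm": handle_html,
    ".xml": handle_xml,
    ".txt": handle_plain,
}

def dispatch(filename, data=b""):
    """Pick an analysis strategy from the document format."""
    for suffix, handler in HANDLERS.items():
        if filename.endswith(suffix):
            return handler(data)
    # Unknown format: render to an image and apply 'traditional' DA.
    return "fallback: render and apply traditional DA"

print(dispatch("timetable.html"))
```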

What influence does the 'Semantic Web' have on us?

Everybody in the group agreed that including semantics will improve a system and that ontologies, WordNet, CYC, etc. are helpful. (Although Michael Rys has shown that an ontology for the complete world is not practicable!) But the problem with all of these is that they are hand-coded. What we need is an 'online' learning system.
Another thing to keep in mind are collaborative aspects like third-party comments or ratings in Amazon's shop, Direct Hit, or CiteSeer.
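How semantics can improve matching may be sketched with a tiny hand-coded term-to-broader-concept table, standing in for a resource like WordNet or CYC. The entries are invented for illustration; that real resources are much larger and hand-coded is exactly the bottleneck discussed above.

```python
# Toy hand-coded "ontology": each term maps to broader concepts.
# Stand-in for WordNet/CYC; entries here are purely illustrative.
HYPERNYMS = {
    "laptop": ["computer", "device"],
    "printer": ["device"],
    "computer": ["device"],
}

def expand(term):
    """Return the term together with its broader concepts."""
    return [term] + HYPERNYMS.get(term, [])

def semantic_match(query, document_terms):
    """Match if the query or any broader concept occurs in the document."""
    return any(t in document_terms for t in expand(query))

print(semantic_match("laptop", {"device", "sale"}))  # True, via 'device'
```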

Handwritten notes:

For those who want to read my handwritten notes and for those who want to test their handwriting recognition engine... (Please do not ask for ground-truth data ;-)

notes page 1
notes page 2
notes page 3
presentation slide


[Note: The following names are those I could remember. The list may not be complete. Please contact me if I have forgotten you.]

Apostolos Antonacopoulos
University of Liverpool

Roger B. Bradford
IBM Almaden Research Center

Andreas Dengel

David Doermann
University of Maryland

Matthew Hurst
WhizBang! Labs Inc.

Koichi Kise
Dept. of Computer and Systems Sciences,
Osaka Prefecture University

Bertin Klein

Stefan Klink

Ahmad Fuad R Rahman
BCL Computers Inc.