WDA 2001 Report on discussion group III

First International Workshop on Web Document Analysis (WDA 2001)
Report on Discussion Group III Web Content Extraction and Mining
Stefan Klink German Research Center for Artificial Intelligence (DFKI GmbH) http://www.dfki.de/~klink
Matthew Hurst WhizBang! Labs Inc. http://www.research.whizbang.com/

Our group discussed the problem of 'web content extraction and mining'. Mainly, we focused on the following questions:

What is Content Extraction?

Content extraction (CE) identifies the parts of a document that are to be extracted (see also logical labeling or document structure analysis). It is not the same as Information Retrieval (IR). Nowadays, it is moving more and more towards the extraction of semantic information. Content extraction deals with locating predefined information types, e.g. names, addresses, dates, etc.
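As a minimal sketch of "locating predefined information types", the following Python snippet pulls dates and e-mail addresses out of plain text with regular expressions. The two patterns, the function name extract_fields and the sample sentence are assumptions made for illustration; they were not discussed in the group, and real systems need far more robust patterns.

```python
import re

# Hypothetical patterns for two predefined information types.
# They are deliberately simple and will miss many real-world variants.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract_fields(text):
    """Return every match of each predefined type found in the text."""
    return {
        "dates": DATE_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
    }


# Invented sample sentence, used only to exercise the patterns.
sample = "Contact jane@example.org before 2001-09-08 for details."
fields = extract_fields(sample)
print(fields)
```

Note that the sketch works on text only; it ignores the document structure that the following sections argue is important on the web.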

What is Mining?

Mining deals with inferring information, e.g. trends, from extracted data. It is used for finding things that are not explicit in the data, e.g. relationships between objects within one or several documents. For example, defects in a product can be found by applying text mining to customers' e-mails (or letters). If many customers complain about similar things in a particular product, then something must be wrong with it.
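The complaint example can be sketched in a few lines of Python. The sketch assumes that a content extraction step has already reduced each e-mail to a (product, issue) pair; the data, the function name frequent_issues and the threshold min_count are invented for illustration.

```python
from collections import Counter

# Hypothetical output of a content extraction step: one (product, issue)
# pair per complaint e-mail. The data is invented for illustration.
complaints = [
    ("toaster", "overheats"),
    ("toaster", "overheats"),
    ("kettle", "leaks"),
    ("toaster", "overheats"),
    ("toaster", "broken lever"),
]


def frequent_issues(pairs, min_count=3):
    """Return the (product, issue) pairs reported at least min_count times."""
    counts = Counter(pairs)
    return [pair for pair, n in counts.items() if n >= min_count]


suspects = frequent_issues(complaints)
print(suspects)
```

Simple counting like this only surfaces identical issue labels; recognizing that differently worded complaints describe the same defect is the harder part of the mining task.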

What is the difference from 'traditional' Information Extraction?

The main difference is the use of structure. Traditional Information Extraction (IE) uses no structure; it relies only on the text data. On the web, the document structure can be used for several purposes, e.g. to infer the content (or document) type.
But web documents also raise some problems:

Web documents vary enormously in their document structure. It is practically impossible to build a universal classifier for all document types, nor is it possible to find all kinds of objects within an arbitrary web document. Furthermore, the encoding of the structure differs greatly.
The World Wide Web contains a huge amount of information (billions of pages), but even modern search engines index only a fraction of all pages.
The large amount of information (pages) brings another problem: the WWW contains a lot of junk pages, because everyone puts their (private) pages on the web, and these are more or, mostly, less useful.
More and more pages are not written by an HTML expert but are generated automatically by authoring tools. The output (HTML source code) of such tools can no longer be processed easily by traditional IE algorithms, because the pages contain much more HTML syntax than text content. Sometimes the text is even embedded within HTML code.
Furthermore, the increasing use of pictures and particularly JavaScript causes more and more problems for IE and CE tools.
Last but not least, the currency and age of web documents are problems that have to be considered:

Even if a system is able to handle pictures within a document page, it will run into problems with animated pictures, movies or other objects which are not 'stable', e.g. moving objects.
The WWW is a living corpus! Some documents are only temporary and result in a dead link if a system tries to access the page later. Other documents are under construction or are updated (more or less) frequently, so the system gets different information from the newer page.
And sometimes a page is simply useless because it contains only 'old', out-of-date information.
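The remark above that generated pages contain much more HTML syntax than text content can be made concrete with a rough text-to-markup ratio. The following is a sketch using Python's standard html.parser module; the class TextCollector, the function text_ratio and the two sample snippets are assumptions for illustration, not a method proposed in the discussion.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of script and style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def text_ratio(html):
    """Return the share of characters in the source that are visible text."""
    if not html:
        return 0.0
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return len(text) / len(html)


# Invented snippets: the same content written by hand and as tool output.
hand_written = "<p>Price: 10 EUR</p>"
generated = (
    '<table><tr><td><font face="Arial" size="2">'
    "<b>Price: 10 EUR</b></font></td></tr></table>"
)
print(text_ratio(hand_written), text_ratio(generated))
```

A low ratio does not by itself make a page unusable, but it indicates how much markup an extraction algorithm has to see through before it reaches the text.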

What does Document Analysis have to offer?

Document Analysis has several things to offer! Web Document Analysis (WDA) is a new field, and Document Analysis (DA) techniques can help to solve some WDA problems:

Grammars are too low-level for dealing with web pages. The big question is how to build a model.
But statistical finite-state models can help, e.g. by capturing the constraint that the price has to be near the product description.
Statistical NLP techniques can improve traditional DA systems, but then the problem of language and cultural dependency arises.
In particular, structure analysis (and comparison) and the analysis of the document layout improve WDA systems.
So, "deal with structures and comparison of structures!".
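One very simple way to compare structures is to reduce each page to its sequence of opening tags and measure how similar two sequences are. The sketch below does this with Python's standard html.parser and difflib modules. The names TagSequence, tag_sequence and structure_similarity and the sample pages are hypothetical, and the measure is only an illustration of the idea, not a technique the group endorsed.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Record the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def structure_similarity(html_a, html_b):
    """Return a similarity in [0, 1] based only on the tag sequences."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio()


# Invented pages: two timetables encoded as tables, one encoded as a list.
page_a = "<table><tr><td>Mon</td><td>Math</td></tr></table>"
page_b = "<table><tr><td>Tue</td><td>Physics</td></tr></table>"
page_c = "<ul><li>Mon: Math</li></ul>"

print(structure_similarity(page_a, page_b))
print(structure_similarity(page_a, page_c))
```

The sketch treats pages with different text but the same markup as identical, which is the point of a structural comparison. It also shows the limitation raised in the next section: the same information encoded as a list instead of a table looks completely different to such a measure.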

How to deal with encodings on the Web?

The way information is encoded within pages is not uniform, e.g. the timetables of different universities all look different!
Furthermore, the big encoding problem is that people want to have fancy pages, with a picture here and a blinking object there...
Another aspect is that a great variety of document formats exists on the WWW, e.g. ASCII text, HTML, XML, PDF, CGI-generated pages, JavaScript, ... The only standard format seems to be a TIFF image. "Some existing systems first convert a web page into a TIFF-image and then use 'traditional' DA techniques" (Bertin Klein).
But the question remains: "How to handle 'real' HTML, XML, ... pages?"

What influence does the 'Semantic Web' have on us?

Everybody in the group agreed that including semantics will improve a system and that ontologies, WordNet, CYC, etc. are helpful. (Although Michael Rys has shown that an ontology for the complete world is not practicable!) But the problem with all of these is that they are hand-coded. What we need is an 'online' learning system.
Another thing to keep in mind is collaborative aspects such as third-party comments or ratings, as found in Amazon's shop, Direct Hit or CiteSeer.

Handwritten notes:

For those who want to read my handwritten notes and for those who want to test their handwriting recognition engine... (Please do not ask for ground-truth data ;-)

notes page 1
notes page 2
notes page 3
presentation slide

Participants

[Note: The following names are listed as far as I could remember them. The list may not be complete. Please contact me if I have forgotten you.]

Apostolos Antonacopoulos
University of Liverpool
A.Antonacopoulos@csc.liv.ac.uk

Roger B. Bradford
IBM Almaden Research Center
roger.b.bradford@saic.com

Andreas Dengel
DFKI GmbH
Andreas.Dengel@dfki.de

David Doermann
University of Maryland
doermann@umiacs.umd.edu

Matthew Hurst
WhizBang! Labs Inc.
mhurst@whizbang.com

Koichi Kise
Dept. of Computer and Systems Sciences,
Osaka Prefecture University
kise@cs.osakafu-u.ac.jp

Bertin Klein
DFKI GmbH
klein@dfki.uni-kl.de

Stefan Klink
DFKI GmbH
Stefan.Klink@dfki.de

Ahmad Fuad R Rahman
BCL Computers Inc.
fuad@bcl-computers.com

...