4. Index building
In the Information Retrieval community, it is assumed that discriminating word
senses in queries and documents significantly improves the performance of
information retrieval systems. Experiments conducted on semantically
hand-annotated documents and queries (Gonzalo et al., 1998) showed that
precision was greatly enhanced. As we saw in section 3, manually annotating web
pages with an ontology is a titanic task. We therefore propose a semi-automatic
indexer of web pages. The main problem is to determine, semi-automatically, the
specific content of documents. It is commonly admitted in the natural language
processing community that the terms occurring in a text carry relevant
information about its content. Several works (Bourigault, 1994; Daille, 1994)
underline that term patterns and frequencies are decisive for extracting terms
from large single-domain corpora. Unfortunately, web pages are often small, and
web sites often cover multiple domains. However, web pages are structured by
HTML markers, which can be used to weight the importance of terms in a page and
thus to characterize its content.
4.1 Term extraction
Term extraction begins by (1) removing HTML markers from the web page,
(2) dividing the text into independent sentences, and (3) lemmatizing the words
of the page. Next, web pages are annotated (figure 1) with part-of-speech tags
using the Brill tagger (Brill, 1995). As a result, each word in a page is
annotated with its grammatical category (noun, modifier...).
Figure 1. Term extraction
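For concreteness, these preprocessing steps can be sketched as follows. This is a minimal sketch using BeautifulSoup and NLTK as stand-in tools; the paper relies on the Brill tagger, whereas NLTK's default part-of-speech tagger is used here, and tagging is done before lemmatization for the tagger's benefit.

```python
# Minimal preprocessing sketch: strip HTML markers, split into sentences,
# tag parts of speech, and lemmatize. Requires the NLTK data packages
# punkt, averaged_perceptron_tagger and wordnet to be installed.
from bs4 import BeautifulSoup
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def preprocess(html):
    text = BeautifulSoup(html, "html.parser").get_text(" ")  # (1) remove HTML markers
    tagged_sentences = []
    for sentence in nltk.sent_tokenize(text):                # (2) split into sentences
        tokens = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokens)                         # part-of-speech annotation
        tagged_sentences.append(
            [(lemmatizer.lemmatize(tok.lower()), tag) for tok, tag in tagged]  # (3) lemmatize
        )
    return tagged_sentences
```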
The surface structure of the sentences is then analysed using term patterns in
order to produce well-formed terms. Each well-formed term is assigned a
coefficient C computed from the term frequency and the weights of the HTML
markers; this coefficient is called the weighted frequency.
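A minimal sketch of the pattern matching and of the weighted frequency is given below. The part-of-speech patterns and the HTML marker weights are assumptions made for the example; the paper does not list the values it actually uses.

```python
# Illustrative well-formed term extraction and weighted frequency.
from collections import defaultdict

PATTERNS = [("JJ", "NN"), ("NN", "NN"), ("NN",)]   # e.g. adjective-noun, noun-noun, single noun
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5}  # hypothetical HTML marker weights

def extract_terms(tagged_sentence):
    """tagged_sentence: list of (lemma, pos_tag) pairs, as produced above."""
    terms = []
    for i in range(len(tagged_sentence)):
        for pattern in PATTERNS:
            window = tagged_sentence[i:i + len(pattern)]
            if len(window) == len(pattern) and all(
                tag.startswith(p) for (_, tag), p in zip(window, pattern)
            ):
                terms.append(" ".join(lemma for lemma, _ in window))
    return terms

def weighted_frequency(term_occurrences):
    """term_occurrences: (term, enclosing_html_tag) pairs; in a complete
    implementation the enclosing marker would be recorded before HTML removal."""
    scores = defaultdict(float)
    for term, html_tag in term_occurrences:
        scores[term] += TAG_WEIGHTS.get(html_tag, 1.0)  # frequency weighted by HTML marker
    return dict(scores)
```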
Table 1 shows some results of an experiment on the web site
http://www.cs.washington.edu/, the homepage of the Department of Computer
Science of the University of Washington. A coefficient of 1.0 means that the
term (here "washington") is the most relevant index; the other coefficients are
computed relative to it. Both single terms ("science"...) and complex terms
("university washington"...) are extracted. Note that the term "university
washington" also yields the term "university of washington": several forms of a
term are kept to improve term retrieval in the thesaurus.
Weighted coefficient | Terms
1.00 | washington
0.83 | science
0.67 | university washington
0.67 | university of washington
0.67 | university
0.67 | engineering
0.50 | program
0.50 | computer science
0.50 | computer
0.50 | college
0.33 | uw
0.33 | student
0.33 | seattle
0.33 | member
0.33 | major
0.33 | field
... | ...
Table 1. Terms extracted from a web page
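The normalization of the coefficients (most relevant term scored 1.0) and the generation of alternative forms such as "university of washington" could, for instance, be performed as follows. This is illustrative only; the exact rules used in our system are not detailed here, and inserting "of" into a two-word term is just one example of variant generation for thesaurus lookup.

```python
# Illustrative normalization and variant generation.
def normalize(scores):
    top = max(scores.values(), default=1.0)       # most relevant term -> 1.0
    return {term: round(s / top, 2) for term, s in scores.items()}

def of_variant(term):
    words = term.split()
    if len(words) == 2:                           # e.g. "university washington"
        return f"{words[0]} of {words[1]}"        # -> "university of washington"
    return None
```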
4.2. Page concept determination
During the term extraction process, well-formed terms were extracted and their
coefficients calculated. A well-formed term is a form representing a particular
concept, and different forms may represent the same concept (e.g. chair,
professorship). In order to determine not only the set of terms contained in a
page but also the set of concepts it conveys, linguistic ontologies are used. A
linguistic ontology can be viewed as a machine-readable dictionary structured
around groups of synonymous words, each group representing a concept. Moreover,
it provides explicit relationships among these groups (e.g. hypernym and
meronym relationships...). A linguistic ontology thus allows concepts to be
mapped to words or terms. Therefore, in our context, the linguistic ontology is
used to generate all the concepts corresponding to a well-formed term. In our
experiments, we again used the WordNet linguistic ontology (Miller, 1990).
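As an illustration of this mapping, NLTK's WordNet interface can be queried for the synsets shared by two forms; "chair" and "professorship" are expected to share the synset representing the academic-post concept. This is only a quick sketch of the lookup, not our indexing code.

```python
# Forms that share a synset denote the same concept in WordNet.
from nltk.corpus import wordnet as wn

def shared_concepts(form_a, form_b):
    return set(wn.synsets(form_a)) & set(wn.synsets(form_b))

# "chair" and "professorship" should share the academic-post sense; relations
# such as synset.hypernyms() then give the explicit links between concepts.
print(shared_concepts("chair", "professorship"))
```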
The process of generating candidate synsets is quite simple: for every
extracted term, all candidate concepts (all its senses) are generated using
WordNet. This thesaurus is a broad-coverage linguistic ontology, but it does
not cover all the terms found in web pages; if a term does not exist in
WordNet, a specific sense is generated for it. Then, a convenience coefficient
is calculated using a semantic similarity measure (Figure 2): for a given term
(a form), it measures the appropriateness of each of the concepts the term
could represent. The calculation takes the term's context into account (the
page in which it appears and its neighbourhood).
Figure 2. Concept generating and weighting
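A possible instantiation of this step is sketched below, assuming NLTK's WordNet interface and the Wu-Palmer measure as the semantic similarity; the measure actually used is not fixed by this description, so this is only one way to compute such a convenience coefficient.

```python
# Candidate concept generation and convenience weighting (illustrative sketch).
from nltk.corpus import wordnet as wn

def candidate_concepts(term):
    synsets = wn.synsets(term.replace(" ", "_"))  # multi-word terms use underscores
    return synsets if synsets else [term]         # term unknown to WordNet: keep a specific sense

def convenience(candidate, context_terms):
    """Score one candidate WordNet sense against the other terms of the page."""
    scores = []
    for other in context_terms:
        best = 0.0
        for other_sense in wn.synsets(other.replace(" ", "_")):
            sim = candidate.wup_similarity(other_sense)  # Wu-Palmer similarity (assumption)
            if sim is not None:
                best = max(best, sim)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```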
Figure 3 shows an extract of the XML file containing the candidate concepts
(the DTD is presented in appendix 1).
Figure 3. Extract of a generated index