Associating concepts and synsets
At this point, we have on the one hand an ontology whose concept labels
are disambiguated, and on the other hand the possible senses in each HTML
page of the Web site, with their relative frequency and their evaluated
convenience. In the next step, the indexes are matched with the concepts of
the ontologies. For each concept sense, we search for the same sense in the
index. If it exists, the corresponding web pages and coefficients are
attached to the concept.
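The matching step can be sketched as follows. This is only an illustrative sketch, not the implementation used in this work: the data layout (a dictionary from synset identifier to a list of page entries), the function name associate, and the sample identifiers such as "car.n.01" are all assumptions made for the example.

```python
# Illustrative sketch only: the data layout and all names are hypothetical.

def associate(concept_synsets, page_index):
    """Attach to each concept the index entries recorded for its synset.

    concept_synsets: dict mapping a concept label to its disambiguated synset id.
    page_index: dict mapping a synset id to a list of
                (page, weighted_frequency, convenience) tuples.
    """
    matches = {}
    for concept, synset in concept_synsets.items():
        if synset in page_index:
            # Copy the entries so later filtering does not alter the index.
            matches[concept] = list(page_index[synset])
    return matches


# Hypothetical sample data.
concept_synsets = {
    "Car": "car.n.01",
    "Engine": "engine.n.01",
    "Wheel": "wheel.n.01",
}
page_index = {
    "car.n.01": [("index.html", 0.6, 0.9), ("models.html", 0.2, 0.7)],
    "engine.n.01": [("models.html", 0.4, 0.8)],
    "tree.n.01": [("garden.html", 0.5, 0.6)],
}

matches = associate(concept_synsets, page_index)

# Index entries that correspond to no concept of the ontology.
unmatched_synsets = sorted(set(page_index) - set(concept_synsets.values()))
```

In this sample, "Wheel" receives no page because its synset is absent from the index, and the synset "tree.n.01" remains in the index without any matching concept.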
At this stage, our process retrieves the pages containing concepts of the
ontology. However, it does not take the weighted frequency of synsets into
account. Consequently, a concept that appears only once in a page is enough
for this page to be referenced by the ontology. For this reason, we added a
frequency threshold: a concept is considered only if its weighted frequency
(section 4.1) is greater than or equal to this threshold. In the next
section, we present several indexation processes obtained by varying the
threshold.
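The effect of the threshold can be illustrated with a minimal sketch. The entry layout (page, weighted frequency, convenience), the function name filter_by_threshold, and the sample values are assumptions made for the example; the weighted frequency itself is the one defined in section 4.1 and is taken here as given.

```python
# Illustrative sketch only: names, layout, and values are hypothetical.

def filter_by_threshold(matches, threshold):
    """Keep, for each concept, only the pages where its weighted
    frequency is greater than or equal to the threshold."""
    kept = {}
    for concept, entries in matches.items():
        selected = [entry for entry in entries if entry[1] >= threshold]
        if selected:  # drop concepts left without any page
            kept[concept] = selected
    return kept


# Each entry is (page, weighted_frequency, convenience).
sample_matches = {
    "Car": [("index.html", 0.6, 0.9), ("models.html", 0.2, 0.7)],
    "Engine": [("models.html", 0.4, 0.8)],
}

loose = filter_by_threshold(sample_matches, 0.0)
medium = filter_by_threshold(sample_matches, 0.3)
strict = filter_by_threshold(sample_matches, 0.5)
```

With a threshold of 0.0 every entry is kept; raising it to 0.3 removes the page where "Car" is only weakly present, and at 0.5 the concept "Engine" no longer references any page.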
To evaluate the appropriateness of an ontology with respect to a set of HTML
pages, four typical coefficients are calculated:
- the rate of pages concerned with its concepts, called the covering degree,
  that is, the proportion of web pages that involve at least one concept of
  the ontology,
- the rate of its concepts directly involved in HTML pages, called the
  direct indexing degree,
- the rate of its concepts involved (directly or by way of more specialized
  concepts), called the global indexing degree,
- the average convenience degree of candidate concepts from pages selected
  by the ontology.
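One possible reading of these four coefficients is sketched below. The exact formulas are not given in this section, so the ratios used here are assumptions: each degree is computed as a simple proportion, and the average convenience is taken as the mean convenience over all retained entries. The names coefficients and descendants, the children mapping, and the sample data are hypothetical.

```python
# Illustrative sketch only: formulas and names are assumptions.

def descendants(concept, children):
    """Return all concepts more specialized than `concept`."""
    found = set()
    stack = list(children.get(concept, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children.get(node, []))
    return found


def coefficients(concepts, children, matches, all_pages):
    """Compute the four degrees for one ontology and one set of pages.

    matches maps a concept to (page, weighted_frequency, convenience) entries.
    """
    direct = {c for c in concepts if matches.get(c)}
    covered = {page for c in direct for page, _freq, _conv in matches[c]}
    # A concept is globally indexed if it, or one of its descendants, is matched.
    indexed = {
        c for c in concepts
        if c in direct or descendants(c, children) & direct
    }
    convs = [conv for c in direct for _page, _freq, conv in matches[c]]
    return {
        "covering": len(covered) / len(all_pages),
        "direct_indexing": len(direct) / len(concepts),
        "global_indexing": len(indexed) / len(concepts),
        "average_convenience": sum(convs) / len(convs) if convs else 0.0,
    }


# Hypothetical sample: "Car" specializes "Vehicle".
concepts = ["Vehicle", "Car", "Engine", "Wheel"]
children = {"Vehicle": ["Car"]}
matches = {
    "Car": [("index.html", 0.6, 0.9), ("models.html", 0.2, 0.7)],
    "Engine": [("models.html", 0.4, 0.8)],
}
all_pages = ["index.html", "models.html", "garden.html", "contact.html"]

result = coefficients(concepts, children, matches, all_pages)
```

In this sample, two of the four pages are covered, two of the four concepts are directly matched, and "Vehicle" is counted in the global indexing degree only through its more specialized concept "Car".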
Finally, a relevant ontology is one for which these coefficients are close
to 1.0. A high covering degree implies that a large proportion of the pages
contain concepts of the ontology. A high direct indexing degree implies that
many concepts can be found in the pages. It is important that both
coefficients are high together. Indeed, we can have a site where only one
page contains all the concepts of the ontology (this gives an indexing
degree of 1 and a low covering degree). Conversely, all pages can contain a
general concept such as "Entity" in their head (this gives a low indexing
degree but a covering degree equal to 1).
The indexation process can also highlight index entries that do not match
any concept of the ontologies. In this case, we may search for ontologies
related to these entries. In the future, the indexation process can be rerun
either when the site content evolves notably or when the ontologies in use
are updated. This process can be restricted to the modified pages.