We have chosen XML to store our results and our ontologies. The SHOE DTD is convenient for our purposes, but it lacks some important information that we present later, so we have extended it with the elements we need. Our ontology DTD is described in appendix 1. Currently, we handle only the "isa" relationship, but other relationships can, and should, be supported. As an example, an extract of Thoth's University ontology is shown in appendix 2.
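To make the storage idea concrete, here is a minimal sketch of how an XML-encoded ontology restricted to the "isa" relationship might be loaded. The element and attribute names (`ontology`, `concept`, `label`, `isa`) and the sample concepts are assumptions made for illustration; they are not taken from the DTD of appendix 1 or from the SHOE DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are illustrative only,
# not those of the DTD described in appendix 1.
ONTOLOGY_XML = """
<ontology name="university">
  <concept label="Person"/>
  <concept label="Student" isa="Person"/>
  <concept label="PhDStudent" isa="Student"/>
</ontology>
"""

def load_isa(xml_text):
    """Return a dict mapping each concept label to its direct parent (or None)."""
    root = ET.fromstring(xml_text)
    return {c.get("label"): c.get("isa") for c in root.iter("concept")}

def ancestors(label, isa):
    """Follow "isa" links upward; return the ancestors, nearest first."""
    chain = []
    parent = isa.get(label)
    while parent is not None:
        chain.append(parent)
        parent = isa.get(parent)
    return chain

isa = load_isa(ONTOLOGY_XML)
print(ancestors("PhDStudent", isa))
```

The chain of ancestors obtained this way is what we later call the context of a concept according to the "isa" relationship.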
A major problem with ontologies is where to find them and how to build them. First, ontologies already exist for specialized fields of interest, for example in the SHOE project, in the KA2 project, and in the Knowledge Sharing Effort public Library (KSEL). Similarly, general ontologies can be found in Web indexes such as AltaVista or Yahoo! (Labrou and Finin, 1999). Second, a subset of a thesaurus can be used and extended to build an ontology. For Martin (1995), such a thesaurus is a wide linguistic ontology: users should rarely have to add intermediate types, but should rather specialize precise WordNet types in order to express the shades of meaning needed for the application (the disambiguation process described in the next section is then easier). Finally, tools (manual, semi-automatic, or automatic) can be used to build ontologies from a set of typical data (KA2 project).
A thesaurus can be viewed as a linguistic ontology (Guarino et al., 1999; Borgo et al., 1997). In WordNet, a "concept", called a sense, is defined by a single set of synonyms, called a synset. A concept in the thesaurus is therefore unambiguous. For example, the first sense (sense 1) of the term "Person" is the synset {<"individual#1", "person#1", "human#1", "mortal#1", "soul#2", "somebody#1", "someone#1">} (for each term, its own sense number is given after the '#' symbol). We use WordNet because it is freely available for research purposes and is a broad-coverage linguistic ontology (70 000 nodes). However, for text retrieval it has drawbacks: it does not include cross-part-of-speech semantic relationships, it makes too many fine-grained sense distinctions, and it lacks domain information (O'Hara et al., 1998).
Our main goal is to associate each concept of the ontology with the right synset of the thesaurus (figure 1). An ontology concept has a label, which is only one of its possible lexical forms. This lexical form is used to select the corresponding synsets in the thesaurus; since a term often has several meanings, the thesaurus generally provides several candidate synsets for each concept label. Among these candidates, we keep only the most relevant ones. To select the relevant synset, we check whether the context of the synset, given by the hypernym relationship in the thesaurus, is similar to the context of the concept, given by the "isa" relationship in the ontology. During this process, we measure the matching degree between a synset and a concept of the ontology. It is computed from the result of the matching process, namely the number of related concepts, the type of relationship, the depth of the different relationships... Our ontology is therefore a terminologically oriented ontology (Martin, 1999), intended to ease rapid and simple knowledge representation, management, and use.
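The comparison of contexts described above can be sketched as follows. This is only an illustration under simplifying assumptions, not the measure used in our system: the score below considers only the number of shared ancestors and their depth in the ontology, the weighting `1 / (depth + 1)` is an arbitrary choice, and the hypernym chains are invented toy data rather than real WordNet content.

```python
# Toy data, invented for illustration (not real WordNet content).
# Each candidate sense maps to its hypernym chain, nearest hypernym first.
CANDIDATE_HYPERNYMS = {
    "school#1": ["educational_institution", "institution", "organization"],
    "school#2": ["building", "structure", "artifact"],
    "school#3": ["body", "organization"],
}

def matching_degree(concept_ancestors, hypernyms):
    """Score in [0, 1]: weighted share of "isa" ancestors found among hypernyms.

    An ancestor at depth i (0 = direct parent) has weight 1 / (i + 1), so
    close ancestors count more than distant ones (an assumed weighting).
    """
    weights = [1.0 / (i + 1) for i in range(len(concept_ancestors))]
    total = sum(weights)
    if total == 0:
        return 0.0  # a root concept has no context to compare
    found = set(hypernyms)
    matched = sum(w for a, w in zip(concept_ancestors, weights) if a in found)
    return matched / total

def rank_candidates(concept_ancestors, candidates):
    """Return (sense, degree) pairs with a non-zero degree, best first."""
    scored = [(s, matching_degree(concept_ancestors, h)) for s, h in candidates.items()]
    kept = [(s, d) for s, d in scored if d > 0]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Assumed "isa" context of a concept labelled "School" (nearest ancestor first).
school_context = ["institution", "organization"]
for sense, degree in rank_candidates(school_context, CANDIDATE_HYPERNYMS):
    print(sense, round(degree, 2))
```

In this sketch a candidate whose hypernym chain shares nothing with the concept context is discarded, which reduces the number of senses left for the user to check.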
Given these results, even when the disambiguation process does not fully resolve a given concept, the user has a good chance of finding the right label sense first (or among the first ones). This chance increases when the ontology is well aligned with the thesaurus. Indeed, the Thoth ontology was built by substantially modifying the SHOE ontology according to WordNet. As a result, the disambiguation process is more effective on the Thoth ontology than on the SHOE ontology.
Table 6 gives several sample concepts for each level. In this table, a concept is described by its label, its right label sense (after the '#' sign), the number of candidate label senses remaining at the end of the disambiguation process compared with the number of all its possible label senses, and a list giving, for each remaining label sense, its sense number and its matching degree.
School #1 - 3/7 (6/0.33 4/0.95 1/0.95)
Information #1 - 2/5 (1/0.95 5/0.38)
EmailAddress #1 - 1/1 (1/1.0)
Address #2 - 1/7 (2/0.95)
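The entries of Table 6 follow a regular pattern, so they can be read mechanically. The sketch below parses one entry and reports where the right sense stands among the remaining candidates. The function names, the dictionary keys, and the tie-handling rule (tied senses share the best rank) are assumptions made for illustration.

```python
import re

# Pattern for entries such as "School #1 - 3/7 (6/0.33 4/0.95 1/0.95)".
ROW = re.compile(
    r"^(?P<label>\w+)\s*#(?P<sense>\d+)\s*-\s*"
    r"(?P<kept>\d+)/(?P<total>\d+)\s*\((?P<scores>[^)]*)\)"
)

def parse_row(line):
    """Split a Table 6 entry into its label, senses, counts, and degrees."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    degrees = {}
    for item in m["scores"].split():
        sense, degree = item.split("/")
        degrees[int(sense)] = float(degree)
    return {
        "label": m["label"],
        "right_sense": int(m["sense"]),
        "candidates": int(m["kept"]),   # senses left after disambiguation
        "possible": int(m["total"]),    # all senses of the label
        "degrees": degrees,             # sense number -> matching degree
    }

def best_rank(row):
    """Rank of the right sense (1 = best); tied senses share a rank.

    Returns None when the right sense is not among the candidates.
    """
    own = row["degrees"].get(row["right_sense"])
    if own is None:
        return None
    return 1 + sum(1 for d in row["degrees"].values() if d > own)

row = parse_row("School #1 - 3/7 (6/0.33 4/0.95 1/0.95)")
print(row["label"], row["candidates"], "of", row["possible"], "rank", best_rank(row))
```

For the "School" entry, the right sense (sense 1) has the highest matching degree, tied with sense 4, so the user finds it among the first candidates.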
This process also allows the user to verify and correct the ontology. Examining the matching results helps us correct the ontology with respect to the reference thesaurus. Table 5 shows that manipulating the ontology can improve the disambiguation process. These manipulations include adding or removing concepts, changing concept labels or descriptions...