We began applying our ideas about the use of background information by developing a system that performs WSD and entity linking with this kind of information. To test our hypothesis, we selected task number 13 of the SemEval competition: Multilingual WSD and Entity Linking.
Our main idea was to build a background corpus representative of the target domain (the domain of the texts to be processed within the task) and to apply an algorithm that extracts the predominant sense of every word in this domain. We established two possible scenarios for creating the background corpus:
- Offline corpus: we know the target domain in advance and can choose documents belonging to that domain. These documents form our starting set.
- Online corpus: the target domain is not known in advance, so the starting documents are the documents to be processed themselves.
In either case, the first step is to get all the possible entities and their links to DBpedia, selecting only those that are specific concepts (according to the DBpedia ontology). Once we have this filtered list of DBpedia links, we can follow two approaches to expand them and build our background corpus:
- Sibling expansion: all the DBpedia entries belonging to the same category as the selected entries are first retrieved and then compared against an LDA model built from the seed documents. Only the similar DBpedia entries are kept.
- Entity overlapping: for each selected DBpedia entry, all the Wikipedia links on the related Wikipedia page are obtained. For each of these links, we fetch the linked Wikipedia page, extract all of its Wikipedia links in turn (which can be considered entities) and count how many of these entities overlap with the entities of the original set of documents. Only Wikipedia pages with more than a certain number of entities in common are selected.
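The entity overlapping filter can be illustrated with a minimal sketch. It assumes the outgoing links of each candidate Wikipedia page have already been extracted; the function name `select_by_entity_overlap`, the `min_overlap` threshold and the sample entities are hypothetical and not taken from our system.

```python
def select_by_entity_overlap(seed_entities, candidate_pages, min_overlap=2):
    """Keep the candidate pages that share more than `min_overlap`
    entities with the seed documents.

    seed_entities   -- entities found in the original set of documents
    candidate_pages -- dict mapping a page title to the links (entities)
                       extracted from that page (assumed precomputed)
    Returns a dict mapping each selected page title to its overlap count.
    """
    seed = set(seed_entities)
    selected = {}
    for title, links in candidate_pages.items():
        overlap = len(seed & set(links))
        # "More than a certain number": strictly greater than the threshold.
        if overlap > min_overlap:
            selected[title] = overlap
    return selected


# Hypothetical example data for a football-related target domain.
seed_entities = {"Football", "Goalkeeper", "FIFA", "Penalty_kick"}
candidate_pages = {
    "Offside_(association_football)": ["Football", "FIFA", "Goalkeeper", "Referee"],
    "Bank": ["Money", "Loan", "FIFA"],
}

background_pages = select_by_entity_overlap(seed_entities, candidate_pages)
print(background_pages)
```

With these sample values only the first page is kept, since it shares three entities with the seed set while the second shares one. The threshold would have to be tuned for each domain.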
Once the background corpus has been composed (following either of the two previous approaches), an unsupervised algorithm is applied to obtain the predominant sense of every word in the background corpus (which should belong to the target domain).
With all this information and the sense ranking provided by a supervised state-of-the-art WSD system, we decide which sense is the most likely for every polysemous word. This choice depends on the confidences assigned by the supervised system and by the predominant sense algorithm. In some cases the word will take its predominant sense in the domain, while in other cases a general-domain sense will apply. The systems are still under evaluation by the organisers of the task.
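One possible way to combine the two sources of evidence is sketched below. The decision rule is an assumption made for illustration only: it presumes that both confidences lie on a comparable scale and simply prefers the domain-predominant sense when its confidence exceeds that of the top-ranked supervised sense. The function name `choose_sense` and the sense identifiers are hypothetical.

```python
def choose_sense(supervised_ranking, predominant_sense, predominant_conf):
    """Pick a sense for one polysemous word.

    supervised_ranking -- non-empty list of (sense, confidence) pairs from
                          the supervised WSD system, in any order
    predominant_sense  -- sense proposed by the predominant sense algorithm
                          for the target domain, or None if it has none
    predominant_conf   -- confidence of that proposal (assumed comparable
                          to the supervised confidences)
    """
    best_sense, best_conf = max(supervised_ranking, key=lambda pair: pair[1])
    # Assumed rule: follow the domain only when it is the more confident source.
    if predominant_sense is not None and predominant_conf > best_conf:
        return predominant_sense
    return best_sense


# Hypothetical sense identifiers for the word "bank".
ranking = [("bank.finance", 0.55), ("bank.river", 0.45)]

domain_choice = choose_sense(ranking, "bank.river", 0.80)
general_choice = choose_sense(ranking, "bank.river", 0.30)
print(domain_choice, general_choice)
```

In the first call the domain evidence is stronger, so the predominant sense is chosen. In the second call the supervised ranking prevails, which corresponds to the general-domain case.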