Entity Disambiguation based on Solr MLT
---------------------------------------

                 Key: STANBOL-223
                 URL: https://issues.apache.org/jira/browse/STANBOL-223
             Project: Stanbol
          Issue Type: New Feature
          Components: Enhancer
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


In short:

The idea is to use sentences that link to an Entity in a dataset (e.g. 
Wikipedia) as context and to compare them with the text surrounding an Entity 
mention extracted from the analyzed content. Solr More Like This (MLT) queries 
will be used for the ranking.
 
More details:

Sentences containing occurrences of an Entity can be extracted with 
https://github.com/ogrisel/pignlproc. Functionality will be added to output the 
results (entity -[0..*]-> "sentence") as N-Triples (an RDF serialization). This 
will allow the information to be indexed (together with all the other 
information about Entities) by using the indexing tools provided by the Stanbol 
Entityhub (e.g. entityhub/indexing/dbpedia).
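
As an illustration of the intended output, the following minimal Python sketch 
serializes (entity, sentence) pairs as N-Triples. It is not the pignlproc 
implementation. The predicate URI and the function names are hypothetical, 
since the issue does not name the property that will hold the sentences.

```python
# Hypothetical predicate; the issue does not specify which property is used.
OCCURRENCE = "http://example.org/ontology/occurrence"


def escape_literal(text):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (text.replace("\\", "\\\\")   # backslash must be escaped first
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t"))


def to_ntriples(entity_uri, sentences, predicate=OCCURRENCE):
    """Return one N-Triples line per sentence (entity -[0..*]-> "sentence")."""
    return ['<%s> <%s> "%s" .' % (entity_uri, predicate, escape_literal(s))
            for s in sentences]


if __name__ == "__main__":
    for line in to_ntriples("http://dbpedia.org/resource/Paris",
                            ['Paris is called the "City of Light".']):
        print(line)
```

An Entity without any extracted sentence simply yields no triples, which 
matches the 0..* cardinality given above.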

The following information will be used for entity disambiguation:

(1) TextAnnotations providing the label, the type as detected by the NLP 
framework, and the context of the extraction
(1b) In addition, links to other TextAnnotations about the same Entity could be 
used to extend the context
(2) A Solr index (ReferencedSite of the Stanbol Entityhub) providing at least 
the labels, types and occurrences of the Entities

Entity disambiguation will select candidate Entities by label and type (filter 
query) and rank the selected Entities with a "More Like This" query that 
compares the context against their indexed occurrences.
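
The filter-then-rank step could be expressed as a single request to a Solr 
MoreLikeThis handler. The Python sketch below only builds the request 
parameters. It assumes the handler accepts the context as a content stream 
(stream.body) and honours fq filters. The field names "label", "type" and 
"occurrence" are hypothetical and would depend on the schema of the index.

```python
from urllib.parse import urlencode


def quote_term(value):
    """Quote a value as a Solr phrase, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_mlt_params(label, entity_type, context,
                     label_field="label", type_field="type",
                     occurrence_field="occurrence", rows=10):
    """Build parameters for a More Like This request.

    The fq entries restrict the candidates to Entities with a matching label
    and type; the context is compared against the occurrence field.
    All field names are assumptions, not taken from the Stanbol schema.
    """
    return [
        ("stream.body", context),         # text surrounding the mention
        ("mlt.fl", occurrence_field),     # field used for the similarity
        ("mlt.mintf", "1"),               # keep terms that occur only once
        ("mlt.mindf", "1"),
        ("fq", "%s:%s" % (label_field, quote_term(label))),
        ("fq", "%s:%s" % (type_field, quote_term(entity_type))),
        ("fl", "*,score"),
        ("rows", str(rows)),
    ]


if __name__ == "__main__":
    params = build_mlt_params("Paris", "http://dbpedia.org/ontology/Place",
                              "The mayor of Paris opened the new metro line.")
    print(urlencode(params))
```

Keeping the label and type restrictions in filter queries means they only 
narrow the candidate set, so the ranking is driven by the context alone.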

A first prototype of this engine was implemented during the bbuzz Semantic 
Hackathon (http://berlinbuzzwords.de/wiki/semantic-nlp-hackathon) as a 
standalone EnhancementEngine that uses a separate Solr index for the MLT queries.

The plan is to implement this as an optional (configurable) feature of the 
existing ReferencedSiteEntityTaggingEnhancementEngine. Users will be able to 
activate/deactivate entity disambiguation via the OSGi console if the required 
data are available for a ReferencedSite.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
