Hi all, let me add some information about the broader context of Kritarth's contributions.
Kritarth's GSoC project has focused on disambiguation that can be applied to custom vocabularies. The working assumption was a Stanbol user who converts his CRM (Customer Relationship Management) system data to RDF and uses it for analyzing his content. The goal was that the Disambiguation Engine resulting from the GSoC project would help such users to disambiguate e.g. two customers with the same name. In more general terms the assumption was to have a "Shallow KB" (as Pablo described it in [1]) over Entities provided by a Stanbol Entityhub ReferencedSite.

### Managing "Shallow KB"s with the Stanbol Entityhub:

While out of scope of Kritarth's GSoC project, this is an important prerequisite for using the Disambiguation Engine. The main component for creating a "Shallow KB" for a vocabulary is the Entityhub Indexing Tool. It needs to be configured to collect the necessary contextual information for Entities so that the Disambiguation Engine can disambiguate them. Typically users will want to index the following contextual information:

1. the textual context: labels and descriptions of the Entity and of related Entities (e.g. the names of all projects an employee is working on, the names of all products a customer has bought, all album and track titles a music artist has released, ...)
2. the semantic context: URIs of other Entities that are linked within the knowledge base (e.g. the broader/related concepts within a thesaurus, parent administrative regions, or - for a project - its work packages and tasks, and for every task the work package, project, assigned employees and partners, ...)

While (1) is useful to disambiguate based on the surrounding text of the mention (the fise:TextAnnotation), (2) is better suited to disambiguate based on other linked Entities (e.g. those that are not ambiguous and do not need to be disambiguated). See also the section "2. The Context Procurement" of Kritarth's mail.

With the current version of the Entityhub it is already possible to build such contexts by using LDPath statements when indexing the data with the Entityhub Indexing Tool (a sketch of such an LDPath program follows at the end of this section). It is also possible to disambiguate against these contexts by using Solr MLT (as exposed by the Entityhub FieldQuery interface). So every Stanbol user should be able to use this Disambiguation Engine not only for DBpedia but also for his own vocabularies, as long as he configures the Entityhub Indexing Tool accordingly.

### Next Steps

Based on the results of the GSoC [2] I will move the Disambiguation Engine to the Stanbol code base. I plan to do this work in its own branch (similar to the CELI and DBpedia Spotlight engines). This should make it easier for Kritarth to provide further contributions and for others to test/use the Engine during this phase. In parallel I will also adapt the default configuration of the Entityhub Indexing Tool so that it creates indexes suited for disambiguation; users who index vocabularies that follow a well known schema (e.g. SKOS thesauri) will then be able to use disambiguation without changing the configuration. This will also require updating some of the usage scenarios on the Stanbol webpage. In addition I plan to provide updated versions of some of the available indexes [3] that are better optimized for use with the Disambiguation Engine (especially the ehealth demo would be a good candidate for this).
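To make (1) and (2) from the previous section more concrete, here is a minimal sketch of an LDPath program as it could be used with the Entityhub Indexing Tool for a SKOS thesaurus. The field names "context" and "related" as well as the selected properties are only assumptions; they would need to be aligned with the vocabulary at hand and with what the Disambiguation Engine expects:

    @prefix skos : <http://www.w3.org/2004/02/skos/core#> ;

    /* (1) textual context: labels of the concept itself plus the
       labels of directly linked concepts */
    context = skos:prefLabel | skos:altLabel | skos:definition
        | (skos:broader | skos:narrower | skos:related) / skos:prefLabel
        :: xsd:string ;

    /* (2) semantic context: URIs of linked Entities */
    related = skos:broader | skos:narrower | skos:related :: xsd:anyURI ;

Indexing with such a program would give every Entity both a textual and a semantic context field that a Disambiguation Engine can query against.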
I would also like to have the Disambiguation Engine (combined with a fitting vocabulary) as part of the default configuration, but I do not yet have an idea what such a fitting vocabulary could be. Suggestions welcome.

### Further Outlook:

In this section I will try to provide an overview of possible future improvements to disambiguation, based on already ongoing and planned additions to Stanbol components - mainly the Stanbol Entityhub.

Currently I am working on adapting the "two layered storage" architecture, as described by STANBOL-471 for the Stanbol Contenthub, also for the Entityhub (see STANBOL-704). This will separate storage and indexing into two components (currently the Entityhub Yard has both roles). It will also allow bringing the functionality of the Entityhub Indexing Tool directly to the Entityhub (as this requires differentiating between an IndexingSource - the EntityStore - and an IndexingTarget - the EntityIndex). As soon as this work is completed it will bring three major improvements for Disambiguation Engines:

1. It will allow efficiently managing a "Shallow KB" also for vocabularies that are managed in the Entityhub or by a ManagedSite (see STANBOL-673), because batch processing with the Entityhub Indexing Tool will no longer be required to build a good "Shallow KB".
2. The separation of "Entity Store" and "Entity Index" (which is what "two layered" in STANBOL-471 refers to) will also allow having several Entity Indexes for a single Entity. This would e.g. allow building special indexes (such as a temporal and a spatial index) that cover the Entities of several/all vocabularies. Those additional indexes could then be used to disambiguate along additional dimensions, which should improve the disambiguation results.
3. "Entity Indexes" could also collect Entity information from different sources (multiple IndexingSources). This would allow combining the information available for the Entity in the vocabulary with additional information, e.g. mentions of the Entity as collected by some feedback service or as available via annotated documents in the Contenthub. This would allow disambiguation to work on "Occurrence/Mention-based" contexts (again see Pablo's mail [1]).

I assume that those improvements will result in the implementation of more advanced Disambiguation Engines for the Stanbol Enhancer.

A big thanks to Kritarth for advancing on the bumpy road of bringing disambiguation to Apache Stanbol. I am very pleased that he shows interest in contributing further now that GSoC is finally coming to an end.
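As mentioned in the "Managing Shallow KBs" section, disambiguating against the indexed contexts boils down to a MoreLikeThis query against the "Shallow KB". Within Stanbol this goes through the Entityhub FieldQuery interface, but the following SolrJ sketch may help to illustrate what happens underneath. The core URL, the "/mlt" handler (which would need to be registered in solrconfig.xml, and stream.body requires remote streaming to be enabled) and the field names "context" and "uri" are all assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.MoreLikeThisParams;

    public class ShallowKbMltQuery {
        public static void main(String[] args) throws Exception {
            // Solr core holding the "Shallow KB" created by the Entityhub
            // Indexing Tool (URL is an assumption)
            SolrServer solr = new HttpSolrServer(
                    "http://localhost:8983/solr/vocabulary");

            // textual context collected around the fise:TextAnnotation
            String context = "Paris is a small city in the United States";

            SolrQuery query = new SolrQuery();
            query.setRequestHandler("/mlt");   // MoreLikeThis request handler
            query.set("stream.body", context); // context acts as query document
            // 'context' is the (assumed) field with the indexed textual context
            query.set(MoreLikeThisParams.SIMILARITY_FIELDS, "context");
            query.set(MoreLikeThisParams.MIN_TERM_FREQ, 1);
            query.set(MoreLikeThisParams.MIN_DOC_FREQ, 1);
            query.setFields("uri", "score");   // 'uri' is an assumed field name
            query.setRows(10);

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("uri")
                        + " -> " + doc.getFieldValue("score"));
            }
        }
    }

The returned similarity scores could then be used to re-rank the fise:confidence values of the suggested fise:EntityAnnotation instances.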
best
Rupert

[1] http://markmail.org/message/udorfbzibfx7zfuo
[2] https://issues.apache.org/jira/browse/STANBOL-723
[3] http://dev.iks-project.eu/downloads/stanbol-indices/

On Thu, Aug 23, 2012 at 3:23 PM, kritarth anand <kritarth.an...@gmail.com> wrote:
> Dear members of the Stanbol community,
>
> I would like to discuss the next few iterations of the Disambiguation
> Engine. A few versions of the Engine have been prepared; I will briefly
> describe them below. I hope to become a permanent committer for Stanbol if
> my contribution is considered after this GSoC period. I will be committing
> the code versions and applying the patch to JIRA soon.
>
> 1. How the disambiguation problem was approached:
> For certain text annotations there might be many entity annotations
> mapped; these need to be ranked in the order of their likelihood.
> Example: "Paris is a small city in the United States."
>
> a. Consider the "Paris" in this sentence without disambiguation (using
> DBpedia as vocabulary). There are three entity annotations mapped:
> 1. Paris, France, 2. Paris, Texas, 3. Paris, *Something*. (The entity
> mapped with the highest fise:confidence is Paris, France.)
> b. Now how would disambiguation by humans take place? On reading the line
> an individual thinks of the context the text is referring to. Doing so he
> realizes that since the text talks about Paris and also about the United
> States, the Paris mentioned here is more likely Paris, Texas (which is in
> the United States) and therefore the mention must refer to it.
> c. The approach followed in the implementation takes inspiration from
> this example and somewhat follows the pseudo code below:
>
> for (TextAnnotation k : textAnnotations) {
>     List<EntityAnnotation> entityAnnotations = getEntityAnnotationsRelated(k);
>     Context context = getContextInformation(k);
>     List<Result> results = queryMLTVocabularies(k, context);
>     updateConfidences(results, entityAnnotations);
> }
>
> d. My current approach to handling disambiguation involved a lot of
> variations; however, for the purpose of simplicity I will talk only about
> the differences in obtaining the "Context".
>
> 2. The Context Procurement:
> a. All Entity Context: the context is determined by all the text
> annotations of the text. It shows good results for shorter texts, but
> introduces a lot of redundant annotations in longer ones, making the
> context less useful.
> b. All Link Context: the context is determined on the basis of the site
> or reference link associated with the text annotations, which of course
> can itself require disambiguation. So it does not behave in a very good
> fashion.
> c. Selection Context: the selection context basically contains the text
> one sentence prior to and one sentence after the current one. Another
> version also worked with the text annotations in this region of text.
> d. Vicinity Entity Context: the vicinity annotation detection measures
> distance in the neighborhood of the text annotation.
>
> 3. Future:
> a. With a running POC of this Engine it can be used to create a more
> advanced version, like the Spotlight approach or one using the Markov
> Logic Networks discussed earlier.

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen