Hi all,
We are interested in joining the Early Adopters Programme (EAP) as a way to
seed a long-lasting collaboration with the Stanbol community.

We are the creators of DBpedia Spotlight, a Java/Scala open-source
Enhancement Engine (Apache License 2.0) that is complementary to Stanbol.
DBpedia Spotlight has the ambitious goal of annotating any of the 3.5M
entities from all 320 classes in the DBpedia Ontology. At the core of our
proposal is the idea of remaining generic and configurable for many use
cases. Besides the open source code, we also provide a freely available
REST service that has been used to annotate cultural goods [1], generate
RDFa annotations in WordPress [2], and enhance the content in Wikipedia
through a MediaWiki toolbar [3], among others [4].

[1] http://dme.ait.ac.at/annotation
[2] http://aksw.org/Projects/RDFaCE
[3] http://pedia.sztaki.hu/
[4] More at: http://wiki.dbpedia.org/spotlight/knownuses

We have a demo interface that lets you tweak some parameters and see how
the system works in practice:
http://spotlight.dbpedia.org/demo

As a first step through the EAP, should our proposal be selected, our
intention is to provide Stanbol enhancement engines based on the different
strategies that DBpedia Spotlight uses for term recognition and
disambiguation (more technical details below). For the validation part, one
idea is to provide a benchmark comparing the performance (esp. accuracy) of
the different enhancement engines on different annotated corpora that we
have already collected. Would this be interesting for IKS/Stanbol? Is there
another type of validation that would be more appealing to the community?

Looking forward to discussing possibilities with you.

Best regards,
Pablo

For the More Technical Folks

Our content enhancement is performed in 4 stages:
- Spotting recognizes terms in some input text. It can be done via
substring matches in a dictionary, or with more sophisticated approaches
such as NER and keyphrase extraction.
- Candidate mapping matches the "spotted" terms with their possible
interpretations (entity identifiers). This can also be done with a
dictionary (hashmap), but the step also allows fancier matching over name
variations: acronyms, approximate matching, etc.
- Disambiguation ranks the "candidates" given the context (e.g. the words
around the spotted phrase). This can also be done in many ways: locally or
globally, with different scoring functions, etc.
- Linking decides which of the spots to keep, given that after the previous
steps we have more information about confidence, topical pertinence, etc.
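To make the four stages concrete, here is a minimal, self-contained Java sketch. All class and method names are illustrative, not DBpedia Spotlight's actual API: spotting is reduced to naive dictionary lookup, candidate mapping to a hashmap, and disambiguation to simple context-word overlap.

```java
import java.util.*;

// Illustrative sketch of the four-stage pipeline; names are hypothetical.
public class AnnotationPipeline {
    record Spot(String surfaceForm, int offset) {}
    record Candidate(Spot spot, String entityUri, double score) {}

    // Stage 1: spotting via simple dictionary substring matching.
    static List<Spot> spot(String text, Set<String> dictionary) {
        List<Spot> spots = new ArrayList<>();
        for (String term : dictionary) {
            int idx = text.indexOf(term);
            if (idx >= 0) spots.add(new Spot(term, idx));
        }
        return spots;
    }

    // Stage 2: candidate mapping via a surface-form -> entity-URIs hashmap.
    static List<Candidate> mapCandidates(List<Spot> spots,
                                         Map<String, List<String>> sf2uri) {
        List<Candidate> candidates = new ArrayList<>();
        for (Spot s : spots)
            for (String uri : sf2uri.getOrDefault(s.surfaceForm(), List.of()))
                candidates.add(new Candidate(s, uri, 0.0));
        return candidates;
    }

    // Stage 3: disambiguation by scoring each candidate against the context
    // words (here: naive overlap with a per-entity keyword set).
    static List<Candidate> disambiguate(List<Candidate> cands, Set<String> context,
                                        Map<String, Set<String>> entityContext) {
        List<Candidate> scored = new ArrayList<>();
        for (Candidate c : cands) {
            long overlap = entityContext.getOrDefault(c.entityUri(), Set.of())
                    .stream().filter(context::contains).count();
            scored.add(new Candidate(c.spot(), c.entityUri(), overlap));
        }
        return scored;
    }

    // Stage 4: linking keeps, per spot, the best-scored candidate, and drops
    // spots whose best candidate falls below a confidence threshold.
    static Map<String, String> link(List<Candidate> scored, double threshold) {
        Map<String, Candidate> best = new HashMap<>();
        for (Candidate c : scored)
            best.merge(c.spot().surfaceForm(), c,
                    (a, b) -> b.score() > a.score() ? b : a);
        Map<String, String> links = new HashMap<>();
        for (Candidate c : best.values())
            if (c.score() >= threshold)
                links.put(c.spot().surfaceForm(), c.entityUri());
        return links;
    }
}
```

In practice the spotter would of course use an efficient automaton over millions of surface forms rather than scanning the dictionary, but the stage boundaries stay the same.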

Other potentially interesting technical details:
- Our Web service uses Jersey (JAX-RS)
- The Web Service is CORS-enabled, and we have both pure JS and jQuery
clients. We also have Java, Scala and PHP clients.
- Users can provide SPARQL queries to blacklist/whitelist results
(currently in the Linking step only, but work in progress for other steps).
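As an illustration of the blacklist/whitelist idea (hypothetical names; in the real service the user supplies the SPARQL query as a request parameter and it is executed against DBpedia), the Linking step can keep or drop a result depending on whether its URI appears in the query's result set:

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical sketch: we stand in for the SPARQL query's result set with a
// precomputed set of URIs, and apply it as a filter during Linking.
public class SparqlFilter {
    // Example of the kind of whitelist query a user might send
    // (keep only entities typed as dbo:Person):
    static final String EXAMPLE_WHITELIST_QUERY =
        "SELECT ?uri WHERE { ?uri a <http://dbpedia.org/ontology/Person> }";

    // Whitelist: keep only URIs that appear in the query's result set.
    static Predicate<String> whitelist(Set<String> queryResultUris) {
        return queryResultUris::contains;
    }

    // Blacklist: drop any URI that appears in the query's result set.
    static Predicate<String> blacklist(Set<String> queryResultUris) {
        return uri -> !queryResultUris.contains(uri);
    }

    // Applied during the Linking step: retain only links the predicate accepts.
    static List<String> filterLinks(List<String> linkedUris, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String uri : linkedUris)
            if (keep.test(uri)) out.add(uri);
        return out;
    }
}
```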
