Hi all, let me add some information about the broader context of Kritarth's contributions.
Kritarth's GSoC project has focused on disambiguation that can be applied to custom vocabularies. The working assumption was a Stanbol user who converts his CRM (Customer Relationship Management) system data to RDF and uses it for analyzing his content. The goal was that the Disambiguation Engine resulting from the GSoC project would help such users to disambiguate e.g. two customers with the same name. In more general terms the assumption was to have a "Shallow KB" (as Pablo described it in [1]) over Entities provided by a Stanbol Entityhub ReferencedSite.

### Managing "Shallow KB"s with the Stanbol Entityhub:

While out of scope of Kritarth's GSoC project, this is an important prerequisite for using the Disambiguation Engine. The main component for creating a "Shallow KB" for a vocabulary is the Entityhub Indexing Tool. It needs to be configured to collect the necessary contextual information for Entities so that the Disambiguation Engine can disambiguate them. Typically users will want to index the following contextual information:

1. the textual context: labels and descriptions of the Entity and of related Entities (e.g. the names of all projects an employee is working on, the names of all products a customer has bought, all album and track titles a music artist has released, ...)
2. the semantic context: URIs of other Entities that are linked within the knowledge base (e.g. the broader/related concepts within a thesaurus, parent administrative regions, or - for a project - its work packages and tasks, and for every task the work package, project, assigned employees and partners, ...)

While (1) is useful to disambiguate based on the surrounding text of the mention (the fise:TextAnnotation), (2) is better suited to disambiguate based on other linked Entities (e.g. those that are not ambiguous and do not need to be disambiguated). See also the section "2. The Context Procurement" of Kritarth's mail.

With the current version of the Entityhub it is already possible to build such contexts by using LDPath statements when indexing the data with the Entityhub Indexing Tool (a sketch of such an LDPath program follows at the end of this section). It is also possible to disambiguate against these contexts by using Solr MLT (as exposed by the Entityhub FieldQuery interface). So every Stanbol user should be able to use this Disambiguation Engine not only for DBpedia but also for his own vocabularies, as long as he configures the Entityhub Indexing Tool accordingly.

### Next Steps

Based on the results of the GSoC [2] I will move the Disambiguation Engine to the Stanbol code base. I plan to do this work in its own branch (similar to the CELI and DBpedia Spotlight engines). This should make it easier for Kritarth to provide further contributions and for others to test/use the Engine during this phase. In parallel I will also adapt the default configuration of the Entityhub Indexing Tool so that it creates indexes suited for disambiguation; users who index vocabularies that follow a well known schema (e.g. SKOS thesauri) will then be able to use disambiguation without changing the configuration. This will also require updating some of the usage scenarios on the Stanbol webpage. In addition I plan to provide updated versions of some of the available indexes [3] that are better optimized for use with the Disambiguation Engine (especially the ehealth demo would be a good candidate for this).
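To make (1) and (2) from the previous section more concrete, here is a minimal sketch of an LDPath program as it could be used with the Entityhub Indexing Tool for a SKOS thesaurus. The field names "context" and "related" as well as the selected properties are only assumptions; they would need to be aligned with the vocabulary at hand and with what the Disambiguation Engine expects:

    @prefix skos : <http://www.w3.org/2004/02/skos/core#> ;

    /* (1) textual context: labels of the concept itself plus the
       labels of directly linked concepts */
    context = skos:prefLabel | skos:altLabel | skos:definition
        | (skos:broader | skos:narrower | skos:related) / skos:prefLabel
        :: xsd:string ;

    /* (2) semantic context: URIs of linked Entities */
    related = skos:broader | skos:narrower | skos:related :: xsd:anyURI ;

Indexing with such a program would give every Entity both a textual and a semantic context field that a Disambiguation Engine can query against.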
I would also like to have the Disambiguation Engine (combined with a fitting vocabulary) as part of the default configuration, but I do not yet have an idea what such a fitting vocabulary could be. Suggestions welcome.

### Further Outlook:

In this section I will try to provide an overview of possible future improvements to disambiguation, based on already ongoing and planned additions to Stanbol components - mainly the Stanbol Entityhub.

Currently I am working on adapting the "two layered storage" architecture, as described by STANBOL-471 for the Stanbol Contenthub, also for the Entityhub (see STANBOL-704). This will separate storage and indexing into two components (currently the Entityhub Yard has both roles). It will also allow bringing the functionality of the Entityhub Indexing Tool directly to the Entityhub (as this requires differentiating between an IndexingSource - the EntityStore - and an IndexingTarget - the EntityIndex). As soon as this work is completed it will bring three major improvements for Disambiguation Engines:

1. It will allow efficiently managing a "Shallow KB" also for vocabularies that are managed in the Entityhub or by a ManagedSite (see STANBOL-673), because batch processing with the Entityhub Indexing Tool will no longer be required to build a good "Shallow KB".
2. The separation of "Entity Store" and "Entity Index" (which is what "two layered" in STANBOL-471 refers to) will also allow having several Entity Indexes for a single Entity. This would e.g. allow building special indexes (such as a temporal and a spatial index) that cover the Entities of several/all vocabularies. Those additional indexes could then be used to disambiguate along additional dimensions, which should improve the disambiguation results.
3. "Entity Indexes" could also collect Entity information from different sources (multiple IndexingSources). This would allow combining the information available for the Entity in the vocabulary with additional information, e.g. mentions of the Entity as collected by some feedback service or as available via annotated documents in the Contenthub. This would allow disambiguation to work on "Occurrence/Mention-based" contexts (again see Pablo's mail [1]).

I assume that those improvements will result in the implementation of more advanced Disambiguation Engines for the Stanbol Enhancer.

A big thanks to Kritarth for advancing on the bumpy road of bringing disambiguation to Apache Stanbol. I am very pleased that he shows interest in contributing further now that GSoC is finally coming to an end.
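As mentioned in the "Managing Shallow KBs" section, disambiguating against the indexed contexts boils down to a MoreLikeThis query against the "Shallow KB". Within Stanbol this goes through the Entityhub FieldQuery interface, but the following SolrJ sketch may help to illustrate what happens underneath. The core URL, the "/mlt" handler (which would need to be registered in solrconfig.xml, and stream.body requires remote streaming to be enabled) and the field names "context" and "uri" are all assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.MoreLikeThisParams;

    public class ShallowKbMltQuery {
        public static void main(String[] args) throws Exception {
            // Solr core holding the "Shallow KB" created by the Entityhub
            // Indexing Tool (URL is an assumption)
            SolrServer solr = new HttpSolrServer(
                    "http://localhost:8983/solr/vocabulary");

            // textual context collected around the fise:TextAnnotation
            String context = "Paris is a small city in the United States";

            SolrQuery query = new SolrQuery();
            query.setRequestHandler("/mlt");   // MoreLikeThis request handler
            query.set("stream.body", context); // context acts as query document
            // 'context' is the (assumed) field with the indexed textual context
            query.set(MoreLikeThisParams.SIMILARITY_FIELDS, "context");
            query.set(MoreLikeThisParams.MIN_TERM_FREQ, 1);
            query.set(MoreLikeThisParams.MIN_DOC_FREQ, 1);
            query.setFields("uri", "score");   // 'uri' is an assumed field name
            query.setRows(10);

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("uri")
                        + " -> " + doc.getFieldValue("score"));
            }
        }
    }

The returned similarity scores could then be used to re-rank the fise:confidence values of the suggested fise:EntityAnnotation instances.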
best
Rupert

[1] http://markmail.org/message/udorfbzibfx7zfuo
[2] https://issues.apache.org/jira/browse/STANBOL-723
[3] http://dev.iks-project.eu/downloads/stanbol-indices/

On Thu, Aug 23, 2012 at 3:23 PM, kritarth anand <kritarth.an...@gmail.com> wrote:
> Dear members of the Stanbol community,
>
> I would like to discuss the next few iterations of the Disambiguation
> Engine. A few versions of the Engine have been prepared; I will briefly
> describe them below. I hope to become a permanent committer for Stanbol if
> my contribution is considered after this GSoC period. I will be committing
> the code versions and applying the patch to JIRA soon.
>
> 1. How the disambiguation problem was approached:
> For certain text annotations there might be many entity annotations
> mapped; these need to be ranked in the order of their likelihood.
> Example: "Paris is a small city in the United States."
>
> a. Consider the "Paris" in this sentence without disambiguation (using
> DBpedia as vocabulary). There are three entity annotations mapped:
> 1. Paris, France, 2. Paris, Texas, 3. Paris, *Something*. (The entity
> mapped with the highest fise:confidence is Paris, France.)
> b. Now how would disambiguation by humans take place? On reading the line
> an individual thinks of the context the text is referring to. Doing so he
> realizes that since the text talks about Paris and also about the United
> States, the Paris mentioned here is more likely Paris, Texas (which is in
> the United States) and therefore the mention must refer to it.
> c. The approach followed in the implementation takes inspiration from
> this example and somewhat follows the pseudo code below:
>
> for (TextAnnotation k : textAnnotations) {
>     List<EntityAnnotation> entityAnnotations = getEntityAnnotationsRelated(k);
>     Context context = getContextInformation(k);
>     List<Result> results = queryMLTVocabularies(k, context);
>     updateConfidences(results, entityAnnotations);
> }
>
> d. My current approach to handling disambiguation involved a lot of
> variations; however, for the purpose of simplicity I will talk only about
> the differences in obtaining the "Context".
>
> 2. The Context Procurement:
> a. All Entity Context: the context is determined by all the text
> annotations of the text. It shows good results for shorter texts, but
> introduces a lot of redundant annotations in longer ones, making the
> context less useful.
> b. All Link Context: the context is determined on the basis of the site
> or reference link associated with the text annotations, which of course
> can itself require disambiguation. So it does not behave in a very good
> fashion.
> c. Selection Context: the selection context basically contains the text
> one sentence prior to and one sentence after the current one. Another
> version also worked with the text annotations in this region of text.
> d. Vicinity Entity Context: the vicinity annotation detection measures
> distance in the neighborhood of the text annotation.
>
> 3. Future:
> a. With a running POC of this Engine it can be used to create a more
> advanced version, like the Spotlight approach or one using the Markov
> Logic Networks discussed earlier.

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen