Hi all, I have made a lot of progress with the Disambiguation Engine (STANBOL-723) this week. So let me provide you with an update.
All work described in this mail takes place in the "disambiguation-engine" branch [1]. So if you want to test the features described in this mail you will need to check out this branch.

### Disambiguation Engine

This Engine disambiguates (modifies the fise:confidence values of) existing Entity suggestions (fise:EntityAnnotations with a dc:relation to a fise:TextAnnotation). It does not create any new suggestions. So the Disambiguation Engine MUST BE used in combination with some other engine that suggests Entities managed by the Stanbol Entityhub (NamedEntityTaggingEngine or KeywordLinkingEngine).

This Engine is based on the SimilarityConstraint [2] supported by the FieldQuery interface implemented by the Stanbol Entityhub. The implementation of this constraint is based on Solr MLT [3].

The Engine can disambiguate with any Entityhub Site. By default the "full text field" is used for the similarity. The Entityhub Site used to disambiguate suggested Entities does not need to be configured, as the fise:EntityAnnotations provide this information via the value of the "entityhub:site" property [4]. This means that if you have an Enhancement Chain that suggests Entities from different Entityhub Sites, the Disambiguation Engine will be able to disambiguate Entities from any of those sites.

The confidence of a disambiguated Entity is calculated by combining the original confidence with the disambiguation score. For this a user configured ratio '{disambiguation-weight}:{original-confidence-weight}' (default is '2:1') is used. The algorithm uses:

    dc := (oc * cw / (cw + dw)) + (ds * dw / (cw + dw))

    oc ... original-confidence [0..1]
    ds ... disambiguation-score [0..1]
    dc ... disambiguated-confidence [0..1]
    cw ... original-confidence-weight
    dw ... disambiguation-weight

Notes:

* If not a single suggestion of a fise:TextAnnotation is found by the Disambiguation Engine, the confidences of its suggestions are currently not modified.
* The Disambiguation Engine currently ignores all fise:TextAnnotations with only a single suggestion.
* Currently the Disambiguation Engine can not be configured. However this will change in the near future.
* No updates to the semantic contexts. The Engine uses all 'fise:selected-text' values of other fise:TextAnnotations within a window of 100 characters surrounding the currently processed fise:TextAnnotation.

### Stanbol Enhancer UI

In the disambiguation branch I implemented a lot of improvements to the Web UI of the Stanbol Enhancer, as the current UI (in the trunk version) was not able to visualize disambiguation results. Most importantly, the new version shows multiple entries for fise:TextAnnotations with the same "fise:selected-text" if they have different sets of suggested entities. In addition the new interface shows additional metadata (mentions, occurrence, confidence) and lists all mentions if an entity was found several times in the text (with the same list of suggested entities).

### KeywordLinkingEngine

The version of the KeywordLinkingEngine in the trunk uses a slightly different algorithm to calculate matches. The main differences are:

* Only "processable" Tokens are counted as matches. "Processable" are only Tokens that are Nouns, or - if no POS tagging is available or the confidence of the POS tag is too low - all Tokens that are equal to or longer than the configured "Min Token Length".
* No restriction on the minimum number of matching Tokens relative to the overall number of Tokens in the matched Label.

Both those changes improve the performance of the engine with configurations that allow a lot of Entities to match (e.g. when setting the "Minimum Found Tokens" to 1).
While those configurations are not typical in current settings, they become much more desirable assuming that a DisambiguationEngine post-processes the results.

### Default Configuration

The disambiguation branch also provides a modified default configuration. This configuration adds the Disambiguation Engine to the default chain and also provides an additional Enhancement Chain with the name "dbpedia-keyword-disambiguation".

While the modified default chain just adds the DisambiguationEngine at the end of the default "langdetect, ner, dbpediaLinking" chain, the "dbpedia-keyword-disambiguation" chain is intended to validate the performance of the Disambiguation Engine, as it uses a configuration of the KeywordLinkingEngine that suggests up to 20 Entities and only requires a single Token to match.

NOTE that in both cases disambiguation is based on Solr MLT queries over the short abstract of DBpedia entities. It is planned to provide other vocabularies with better disambiguation contexts (see also section "Managing "Shallow KB"s with the Stanbol Entityhub" in the previous mail of this thread).

### Testing the Branch

This explains how to test the changes of this branch. The following steps are required (current Stanbol users might have already completed 1. and 2.):

1. check out the Apache Stanbol trunk

    svn co http://svn.apache.org/repos/asf/incubator/stanbol/trunk/ stanbol-trunk

2. build the Stanbol trunk

    cd stanbol-trunk
    export MAVEN_OPTS="-Xmx512M -XX:MaxPermSize=128M"
    mvn clean install

3. check out the disambiguation branch

    cd ..
    svn co http://svn.apache.org/repos/asf/incubator/stanbol/branches/disambiguation-engine/ stanbol-disambiguation

4. build the Stanbol disambiguation branch

    cd stanbol-disambiguation
    mvn clean install

5. build the full launcher of the stanbol-trunk a 2nd time - this will now use/add the modified bundles of the stanbol-disambiguation branch installed to the local repository as part of step (4).

    cd ..
    cd stanbol-trunk/launchers/full
    mvn clean install

6. run the full launcher

    cd target
    java -Xmx1024m -XX:MaxPermSize=256m -jar org.apache.stanbol.launchers.full-0.10.0-incubating-SNAPSHOT.jar

7. Install the bigger DBpedia Index available at [5] by copying it to "{stanbol-working-dir}/stanbol/datafiles". While this is not required, it is still recommended, as the bigger index contains many more Entities and is therefore much better suited to test disambiguation. However, note that the examples in (8) also work with the small index included in the Stanbol launcher.

8. Try the Disambiguation Engine at http://localhost:8080/enhancer/chain/dbpedia-keyword-disambiguation using texts like

    "Apple is a company based in California"
    "A Jaguar would not eat an Apple"
    "I am impressed by the performance of Jaguar in this year's F1 season."

If you want to have details, open the Stanbol log file ({stanbol-working-dir}/stanbol/logs/error.log) and look for log messages of the "org.apache.stanbol.enhancer.engine.disambiguation.mlt.DisambiguatorEngine" component. For each disambiguated fise:TextAnnotation the following messages are logged:

1. "Use Window: '{window}'" - the text of the window
2. "Query '{site-name}' for {selected-text}@{language} with context '{context}'": the Entityhub {site-name}, the {selected-text} of the EntityAnnotation as well as the {context} extracted from {window}
3. "disambiguate {label}: " with the results in the following lines:
    " - not found {uri}" means that this Entity was returned by the Solr MLT query, but was not part of the suggested Entities
    " - found {uri} origConf:{oc}, disScore:{ds}, disConf:{dc}" if an Entity was disambiguated
    " - none found" in case none of the MLT results match the suggestions
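To check the origConf, disScore and disConf values of a " - found" log line by hand, the weighted combination described in the Disambiguation Engine section can be sketched in a few lines of Python. This is only an illustration of the formula, not the Java code of the engine: the function names combine_confidence and parse_ratio are made up for this example, and the ratio string is assumed to have the '{disambiguation-weight}:{original-confidence-weight}' form with the default '2:1'.

```python
def parse_ratio(ratio="2:1"):
    """Split '{disambiguation-weight}:{original-confidence-weight}' into (dw, cw).

    Hypothetical helper for this example only.
    """
    dw_text, cw_text = ratio.split(":")
    return float(dw_text), float(cw_text)


def combine_confidence(oc, ds, dw=2.0, cw=1.0):
    """Return dc := (oc * cw / (cw + dw)) + (ds * dw / (cw + dw)).

    oc ... original-confidence [0..1]
    ds ... disambiguation-score [0..1]
    dw ... disambiguation-weight
    cw ... original-confidence-weight
    """
    if not (0.0 <= oc <= 1.0 and 0.0 <= ds <= 1.0):
        raise ValueError("oc and ds must be in the range [0..1]")
    total = cw + dw
    if total <= 0:
        raise ValueError("the sum of the weights must be positive")
    return (oc * cw / total) + (ds * dw / total)


if __name__ == "__main__":
    dw, cw = parse_ratio("2:1")
    # origConf 0.6 and disScore 0.9 with the default ratio 2:1
    print(f"{combine_confidence(0.6, 0.9, dw=dw, cw=cw):.3f}")  # prints 0.800
```

With the default ratio the disambiguation score counts twice as much as the original confidence, so a suggestion with a low original confidence but a high disambiguation score can overtake the previously top-ranked suggestion.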
Happy testing
Rupert Westenthaler

[1] http://svn.apache.org/repos/asf/incubator/stanbol/branches/disambiguation-engine/
[2] see STANBOL-202, STANBOL-589 and STANBOL-596
[3] http://wiki.apache.org/solr/MoreLikeThis
[4] http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/enhancementstructure.html#fiseentityannotation
[5] http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.7/dbpedia.solrindex.zip

--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen