Hi all,

In the last two weeks I considerably improved the validation of the Enhancements created by the different Stanbol Enhancement Engines. Here is the list of related issues:

* STANBOL-613: Define how to retrieve the language of the parsed content
* STANBOL-617: Define how to encode fise:TopicEnhancements
* STANBOL-625: Add link to the entityhub:site if suggested Entity is available via the Entityhub

Note also STANBOL-612, which provides a utility class that makes it easy to validate created enhancements in unit tests of EnhancementEngines. All existing engines now use this utility to validate Enhancements. This is also true for the contributed CELI engine (STANBOL-583), which already conforms to those tests.

The next thing I would like to make clearer (and easier to use/understand) is how confidence is represented for Stanbol Enhancements. Related to this I would like to discuss the following two suggestions:

### Suggestion 1: Require confidence values to be in the range [0..1]

This is a long-running discussion, but I would really like to add a check that enforces confidence values to be within the range [0..1].

I think this change is necessary, because it moves the responsibility for interpreting confidence values from the Stanbol users to the implementors of the Engines. I know that providing confidence values is a hard thing to do, but while it may be hard for Engine developers, it is nearly impossible for Stanbol users to do so.

Note that an EnhancementEngine would still be free to create Enhancements with no "fise:confidence" value.

Surprisingly, a lot of the existing Engines already conform to this rule. The most prominent exception is the Named Entity Tagging Engine (o.a.s.enhancer.engine.entitytagging). Because of this I have already implemented an algorithm that normalizes confidence values by a combination of the Levenshtein distance (selected-text <-> entity label) and the Solr result score for the Entity (see STANBOL-624 for details).

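To illustrate the idea (this is not the actual STANBOL-624 code), a minimal sketch could look like the following. The class `ConfidenceSketch`, its method names, the way the two factors are combined (a simple product), and the use of the best Solr score of the result list as the reference value are all assumptions made for illustration only.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Stanbol. */
final class ConfidenceSketch {

    private ConfidenceSketch() {}

    /** A missing value (null) is allowed; a present value must lie in [0..1]. */
    static boolean isValidConfidence(Double confidence) {
        return confidence == null || (confidence >= 0.0 && confidence <= 1.0);
    }

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Case-insensitive similarity in [0..1]: 1.0 means identical strings. */
    static double labelSimilarity(String selectedText, String label) {
        String a = selectedText.toLowerCase(Locale.ROOT);
        String b = label.toLowerCase(Locale.ROOT);
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    /**
     * Assumed combination: label similarity multiplied by the Solr score
     * relative to the best score of the result list (maxScore).
     */
    static double normalize(String selectedText, String label,
                            double score, double maxScore) {
        if (!(maxScore > 0.0) || Double.isNaN(score)) {
            return 0.0;
        }
        double relative = Math.max(0.0, Math.min(1.0, score / maxScore));
        return labelSimilarity(selectedText, label) * relative;
    }
}
```

Because both factors are in [0..1], their product is too, so a value produced by `normalize` would always pass `isValidConfidence`.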
If we could agree on this rule, I would use a similar approach for other Engines that do not yet normalize confidence values to [0..1].

### Suggestion 2: Add fise:confidence-level property

The "confidence-level" is intended to make it easier for clients to decide how to process Enhancements. It would not use a numerical range but four distinct values:

* confident: Meaning that a match is very likely, indicating that those annotations can typically be accepted automatically (e.g. if the EntityLinking engine finds a single Entity that exactly matches the text selected by a text annotation)
* ambiguous: Meaning that there are several possibilities, but it is still likely that one of them matches (e.g. Paris, Paris (Texas))
* suggestion: Meaning that the match is not completely certain, but there are no competing options (e.g. Germans -> Germany)
* uncertain: Meaning that Entities do match, but the probability of a match is rather speculative (e.g. John -> Elton John)

IMHO this classification would fit a lot of engines much better than the numeric "fise:confidence" property, as it does not raise the expectation in users that confidence values are on a ratio scale (e.g. an Enhancement with a confidence of "0.8" is not two times as likely as one with "0.4").

Engines would have the possibility to manually add this information to enhancements. For enhancements that do not define it, we could implement a post-processing engine that adds it based on generic rules, e.g.

* ignore Enhancements with an existing "confidence-level" assignment
* TextAnnotations with a confidence value > 0.8 => confident
* TextAnnotations with a confidence value between 0.5 and 0.8 => suggestion
* TextAnnotations with a confidence value < 0.5 => uncertain
* TextAnnotations with a single linked EntityAnnotation with a confidence > 0.8 => confident
* TextAnnotations with several linked EntityAnnotations with a confidence > 0.8 => ambiguous *)
* TextAnnotations with several linked EntityAnnotations with a confidence > 0.5 but none > 0.8 => ambiguous *)
* TextAnnotations with a single linked EntityAnnotation with a confidence between 0.5 and 0.8 => suggestion
* TextAnnotations with EntityAnnotations with confidence values < 0.5 => uncertain
* TopicAnnotation with a confidence value > 0.8 => confident
* TopicAnnotation with a confidence value between 0.5 and 0.8 => suggestion
* TopicAnnotation with a confidence value < 0.5 => uncertain

*) NOTE that in those cases only EntityAnnotations with a confidence value > 0.5 would be marked as "ambiguous". Additional EntityAnnotations with confidence values < 0.5 would be marked as "uncertain".

The values '0.8' and '0.5' should be configurable.

Note that "fise:confidence-level" could also be used by Engines that cannot provide fise:confidence values (e.g. the langid engine could mark detected languages as "uncertain" if the parsed text was too short).

WDYT

Rupert

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
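
P.S. To make the generic rules above a bit more concrete, here is a rough sketch of how such a post-processing step might classify values. The names `ConfidenceLevel` and `ConfidenceLevelRules` are hypothetical. The rules do not say what happens for values of exactly 0.8 or 0.5, so this sketch assumes they fall into the lower level. It also assumes that "single" and "several" refer to the number of linked EntityAnnotations above the respective threshold, and that a level is assigned to each linked EntityAnnotation. Skipping Enhancements that already carry a "confidence-level" is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical enum mirroring the four proposed values. */
enum ConfidenceLevel { CONFIDENT, AMBIGUOUS, SUGGESTION, UNCERTAIN }

/** Hypothetical rule set with configurable thresholds (e.g. 0.8 and 0.5). */
final class ConfidenceLevelRules {

    private final double high;
    private final double low;

    ConfidenceLevelRules(double high, double low) {
        if (!(low >= 0.0 && low < high && high <= 1.0)) {
            throw new IllegalArgumentException("expected 0 <= low < high <= 1");
        }
        this.high = high;
        this.low = low;
    }

    /** Rule for a single confidence value (TextAnnotation, TopicAnnotation). */
    ConfidenceLevel forValue(double confidence) {
        if (confidence > high) {
            return ConfidenceLevel.CONFIDENT;
        }
        if (confidence > low) {
            return ConfidenceLevel.SUGGESTION; // assumption: exactly 'high' lands here
        }
        return ConfidenceLevel.UNCERTAIN;      // assumption: exactly 'low' lands here
    }

    /** One level per EntityAnnotation linked to the same TextAnnotation. */
    List<ConfidenceLevel> forLinkedEntities(List<Double> confidences) {
        int aboveHigh = 0;
        int aboveLow = 0; // also counts the values above 'high'
        for (double c : confidences) {
            if (c > high) {
                aboveHigh++;
            }
            if (c > low) {
                aboveLow++;
            }
        }
        // Several strong candidates, or several mid-range ones and no strong one.
        boolean ambiguous = aboveHigh > 1 || (aboveHigh == 0 && aboveLow > 1);

        List<ConfidenceLevel> levels = new ArrayList<>(confidences.size());
        for (double c : confidences) {
            if (ambiguous && c > low) {
                levels.add(ConfidenceLevel.AMBIGUOUS);
            } else {
                levels.add(forValue(c)); // values at or below 'low' become UNCERTAIN
            }
        }
        return levels;
    }
}
```

With `new ConfidenceLevelRules(0.8, 0.5)`, the linked confidences 0.9, 0.85 and 0.3 would be classified as ambiguous, ambiguous and uncertain, while a single linked value of 0.9 would be classified as confident.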
