Hi Rupert,

We feel this is needed in order to provide a better user experience to our end users.
Is there also a way to formally express the *relevance* (how relevant a matched entity is in the provided context)?

+1

David

On Thu, May 24, 2012 at 8:18 AM, Rupert Westenthaler <[email protected]> wrote:

> Hi all,
>
> In the last two weeks I considerably improved the validation of the
> Enhancements created by the different Stanbol Enhancement Engines.
> Here is the list of related issues:
>
> * STANBOL-613: Define how to retrieve the language of the parsed content
> * STANBOL-617: Define how to encode fise:TopicEnhancements
> * STANBOL-625: Add link to the entityhub:site if suggested Entity is
>   available via the Entityhub
>
> Note also STANBOL-612 - providing a utility class that makes it easy
> to validate created enhancements in unit tests of EnhancementEngines.
> All existing engines now use this utility to validate Enhancements.
> This is also true for the contributed CELI engine (STANBOL-583), which
> already conforms to those tests.
>
> The next thing I would like to make clearer (and easier to
> use/understand) is how confidence is represented for Stanbol
> Enhancements. Related to this I would like to discuss the following
> two suggestions:
>
> ### Suggestion 1: Require confidence values to be in the range [0..1]
>
> This is a long-running discussion, but I would really like to add a
> check that enforces confidence values to be in the range [0..1].
>
> I think this change is necessary, because it moves the responsibility
> for interpreting confidence values from the Stanbol users to the
> implementors of the Engines. I know that providing confidence values
> is a hard thing to do, but while it may be hard for Engine developers
> it is nearly impossible for Stanbol users to do so.
>
> Note that an EnhancementEngine would still be free to create
> Enhancements with no "fise:confidence" value.
>
> Surprisingly, a lot of the existing Engines already conform to this
> rule. The most prominent exception is the Named Entity Tagging Engine
> (o.a.s.enhancer.engine.entitytagging). Because of this I have already
> implemented an algorithm that normalizes confidence values by a
> combination of the Levenshtein distance (selected-text <-> entity
> label) and the Solr result score for the Entity (see STANBOL-624 for
> details).
>
> If we could agree to this rule I would use a similar approach also for
> other Engines that do not yet normalize confidence values to [0..1].
>
> ### Suggestion 2: Add fise:confidence-level property
>
> The "confidence-level" is intended to make it easier for clients to
> decide how to process Enhancements. It would not use a numerical range
> but four distinct values:
>
> * confident: Meaning that a match is very likely - indicating that
>   those annotations typically can be accepted automatically (e.g. if the
>   EntityLinking engine finds a single Entity that exactly matches the
>   text selected by a text annotation)
> * ambiguous: Meaning that there are several possibilities but it is
>   still likely that one of them matches (e.g. Paris, Paris (Texas))
> * suggestion: Meaning that the match is not completely certain, but
>   there are not several options (e.g. Germans -> Germany)
> * uncertain: Meaning that Entities do match, but the probability of a
>   match is rather speculative (e.g. John -> Elton John)
>
> IMHO using this classification would fit a lot of engines much better
> than the numeric "fise:confidence" property, as it does not raise the
> expectation in users that confidence values are on a ratio scale
> (e.g. an Enhancement with a confidence of "0.8" is not two times as
> likely as one with "0.4").
>
> Engines would have the possibility to manually add this information
> to enhancements. For enhancements that do not define it we could
> implement a post-processing engine that adds it based on generic
> rules.
>
> e.g.
>
> * ignore Enhancements with an existing "confidence-level" assignment
> * TextAnnotations with a confidence value > 0.8 => confident
> * TextAnnotations with a confidence value < 0.8 and > 0.5 => suggestion
> * TextAnnotations with a confidence value < 0.5 => uncertain
> * TextAnnotations with a single linked EntityAnnotation with a
>   confidence > 0.8 => confident
> * TextAnnotations with several linked EntityAnnotations with a
>   confidence > 0.8 => ambiguous *)
> * TextAnnotations with several linked EntityAnnotations with a
>   confidence > 0.5 but none > 0.8 => ambiguous *)
> * TextAnnotations with a single linked EntityAnnotation with a
>   confidence < 0.8 and > 0.5 => suggestion
> * TextAnnotations with EntityAnnotations with confidence values < 0.5
>   => uncertain
> * TopicAnnotation with a confidence value > 0.8 => confident
> * TopicAnnotation with a confidence value < 0.8 and > 0.5 => suggestion
> * TopicAnnotation with a confidence value < 0.5 => uncertain
>
> *) NOTE that in those cases only EntityAnnotations with a confidence
> value > 0.5 would be marked as "ambiguous". Additional
> EntityAnnotations with confidence values < 0.5 would be marked as
> "uncertain".
>
> The values '0.8' and '0.5' should be configurable.
>
> Note that "fise:confidence-level" could also be used by Engines that
> cannot provide fise:confidence values (e.g. the langid engine could
> mark detected languages as "uncertain" if the parsed text was too
> short).
>
> WDYT
> Rupert
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>

--
David Riccitelli
InsideOut10 s.r.l.
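To make the generic rules from Suggestion 2 a bit more concrete, here is a minimal sketch of how a post-processing step might map confidence values to the four proposed levels. It is only an illustration under stated assumptions: the class `ConfidenceLevels` and its methods are hypothetical and not part of Stanbol, the level is returned per linked EntityAnnotation (following the "*)" note), a value exactly on a threshold falls into the lower class (the mail does not say), and a TextAnnotation with one EntityAnnotation above 0.8 plus another between 0.5 and 0.8 is treated as ambiguous (the mail does not cover that mix). The rule that skips Enhancements with an existing "confidence-level" is left to the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a Stanbol API): maps confidence values to the proposed levels. */
final class ConfidenceLevels {

    /** The four values proposed for fise:confidence-level. */
    enum Level { CONFIDENT, AMBIGUOUS, SUGGESTION, UNCERTAIN }

    private final double high; // '0.8' in the mail, meant to be configurable
    private final double low;  // '0.5' in the mail, meant to be configurable

    ConfidenceLevels(double high, double low) {
        if (!(0.0 <= low && low <= high && high <= 1.0)) {
            throw new IllegalArgumentException("expected 0 <= low <= high <= 1");
        }
        this.high = high;
        this.low = low;
    }

    /** Level for an annotation judged only by its own value (e.g. a TopicAnnotation). */
    Level forValue(double confidence) {
        checkRange(confidence);
        if (confidence > high) return Level.CONFIDENT;
        if (confidence > low) return Level.SUGGESTION;
        return Level.UNCERTAIN; // assumption: boundary values fall into the lower class
    }

    /** Levels for the EntityAnnotations linked to one TextAnnotation, in input order. */
    List<Level> forLinkedEntities(List<Double> confidences) {
        int candidates = 0; // EntityAnnotations above the lower threshold
        for (double c : confidences) {
            checkRange(c);
            if (c > low) candidates++;
        }
        List<Level> levels = new ArrayList<>(confidences.size());
        for (double c : confidences) {
            if (c <= low) {
                levels.add(Level.UNCERTAIN);   // per the "*)" note
            } else if (candidates > 1) {
                levels.add(Level.AMBIGUOUS);   // several plausible candidates
            } else {
                levels.add(forValue(c));       // single candidate: confident or suggestion
            }
        }
        return levels;
    }

    /** Enforces Suggestion 1: values must lie in [0..1] (also rejects NaN). */
    private static void checkRange(double c) {
        if (!(c >= 0.0 && c <= 1.0)) {
            throw new IllegalArgumentException("confidence out of [0..1]: " + c);
        }
    }

    public static void main(String[] args) {
        ConfidenceLevels rules = new ConfidenceLevels(0.8, 0.5);
        System.out.println(rules.forValue(0.9));                                   // CONFIDENT
        System.out.println(rules.forLinkedEntities(Arrays.asList(0.9, 0.85, 0.3))); // [AMBIGUOUS, AMBIGUOUS, UNCERTAIN]
    }
}
```

Passing the thresholds to the constructor reflects the request that '0.8' and '0.5' be configurable. The range check only works if Suggestion 1 is adopted, since raw scores outside [0..1] would be rejected instead of classified.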
