Hi Rupert,

We feel this is needed in order to provide a better user experience to our
end users.

Is there also a way to formally express the *relevance* (how relevant a
matched entity is in the provided context)?

+1

David

On Thu, May 24, 2012 at 8:18 AM, Rupert Westenthaler <
[email protected]> wrote:

> Hi all,
>
> In the last two weeks I considerably improved the validation of the
> Enhancements created by the different Stanbol Enhancement Engines.
> Here is the list of related issues:
>
> * STANBOL-613: Define how to retrieve the language of the parsed content
> * STANBOL-617: Define how to encode fise:TopicEnhancements
> * STANBOL-625: Add link to the entityhub:site if suggested Entity is
> available via the Entityhub
>
> Note also STANBOL-612, which provides a utility class that makes it easy
> to validate created enhancements in unit tests of EnhancementEngines.
> All existing engines now use this utility to validate Enhancements.
> This is also true for the contributed CELI engine (STANBOL-583), which
> already conforms to those tests.
>
> The next thing I would like to make clearer (and easier to
> use/understand) is how confidence is represented for Stanbol
> Enhancements. Related to this I would like to discuss the following
> two suggestions:
>
> ### Suggestion 1: Require confidence values to be in the range [0..1]
>
> This is a long-running discussion, but I would really like to add a
> check that enforces confidence values to be in the range
> [0..1].
>
> I think this change is necessary, because it moves the responsibility
> for interpreting confidence values from the Stanbol users to the
> implementors of the Engines. I know that providing confidence values
> is a hard thing to do, but while it may be hard for Engine developers
> it is nearly impossible for Stanbol users to do so.
>
> Note that EnhancementEngines would still be free to create Enhancements
> with no "fise:confidence" value.
>
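
A minimal sketch of the check described above, written in Python for brevity. The function name is hypothetical and not part of Stanbol; it only illustrates the rule that a confidence value may be absent, but if present must lie in [0..1].

```python
from typing import Optional


def is_valid_confidence(value: Optional[float]) -> bool:
    """Return True if the value is absent (None) or lies within [0, 1].

    Hypothetical helper for illustration only, not a Stanbol API.
    """
    if value is None:
        # Enhancements without a fise:confidence value remain allowed.
        return True
    # NaN fails both comparisons and is therefore rejected as well.
    return 0.0 <= value <= 1.0
```

A validation utility could call such a check for every enhancement and fail the unit test of the engine when it returns False.
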
> Surprisingly, a lot of the existing Engines already conform to this
> rule. The most prominent exception is the Named Entity Tagging Engine
> (o.a.s.enhancer.engine.entitytagging). Because of this I have already
> implemented an algorithm that normalizes confidence values by a
> combination of the Levenshtein distance (selected-text <-> entity
> label) and the Solr result score for the Entity (see STANBOL-624 for
> details).
>
> If we can agree on this rule I would use a similar approach for
> other Engines that do not yet normalize confidence values to
> [0..1].
>
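
To make the idea of such a normalization concrete, here is a rough sketch in Python. It is not the algorithm from STANBOL-624, whose details are not given in this mail. The way the two parts are combined (a simple product), the division by the highest score in the result list, and all function names are assumptions made purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def label_similarity(selected_text: str, label: str) -> float:
    """Map the edit distance to [0, 1]; 1.0 means identical (ignoring case)."""
    a, b = selected_text.lower(), label.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def normalized_confidence(selected_text: str, label: str,
                          score: float, max_score: float) -> float:
    """Assumed combination: label similarity times the relative result score."""
    relative = score / max_score if max_score > 0 else 0.0
    relative = min(max(relative, 0.0), 1.0)  # clamp to [0, 1]
    return label_similarity(selected_text, label) * relative
```

Because both factors lie in [0..1], the product does too, which is the only property Suggestion 1 asks for. Other combinations (for example a weighted average) would satisfy it equally well.
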
> ### Suggestion 2: Add fise:confidence-level property
>
> The "confidence-level" is intended to make it easier for clients to
> decide how to process Enhancements. It would not use a numerical range
> but four distinct values:
>
> * confident: Meaning that a match is very likely - indicating that
> those annotations typically can be accepted automatically (e.g. if the
> EntityLinking engine finds a single Entity that exactly matches the
> text selected by a text annotation)
> * ambiguous: Meaning that there are several possibilities but it is
> still likely that one of them matches (e.g. Paris, Paris (Texas))
> * suggestion: Meaning that the match is not completely certain, but
> there are not several options (e.g. Germans -> Germany)
> * uncertain: Meaning that Entities do match, but the probability of a
> match is rather speculative (e.g. John -> Elton John)
>
> IMHO using this classification would fit a lot of engines much better
> than the numeric "fise:confidence" property as it does not raise the
> expectation in users that confidence values are on a ratio scale
> (e.g. an Enhancement with a confidence of "0.8" is not two times as
> likely as one with "0.4").
>
> Engines would have the possibility to manually add this information
> to enhancements. For enhancements that do not define it we could
> implement a post-processing engine that adds it based on generic
> rules.
>
> e.g.
>
> * ignore Enhancements with an existing "confidence-level" assignment
> * TextAnnotations with a confidence value > 0.8 => confident
> * TextAnnotations with a confidence value between 0.5 and 0.8 => suggestion
> * TextAnnotations with a confidence value < 0.5 => uncertain
> * TextAnnotations with a single linked EntityAnnotation with a
> confidence > 0.8 => confident
> * TextAnnotations with several linked EntityAnnotations with a
> confidence > 0.8 => ambiguous *)
> * TextAnnotations with several linked EntityAnnotations with a
> confidence > 0.5 but no one > 0.8 => ambiguous *)
> * TextAnnotations with a single linked EntityAnnotation with a
> confidence between 0.5 and 0.8 => suggestion
> * TextAnnotations with EntityAnnotations with confidence values < 0.5
> => uncertain
> * TopicAnnotation with a confidence value > 0.8 => confident
> * TopicAnnotation with a confidence value between 0.5 and 0.8 => suggestion
> * TopicAnnotation with a confidence value < 0.5 => uncertain
>
> *) NOTE that in those cases only EntityAnnotations with a confidence
> value > 0.5 would be marked as "ambiguous". Additional
> EntityAnnotations with confidence values < 0.5 would be marked as
> "uncertain"
>
> The values '0.8' and '0.5' should be configurable.
>
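
The generic rules above could be sketched roughly as follows (Python, hypothetical function names, not a Stanbol API). Two points are not covered by the rules and are assumptions of this sketch: values exactly equal to a threshold fall into the lower band, and a mixed set (one EntityAnnotation above 0.8 plus others between 0.5 and 0.8) is treated as ambiguous.

```python
from typing import List, Optional

UPPER = 0.8  # configurable threshold for "confident"
LOWER = 0.5  # configurable threshold for "suggestion"


def level_for_value(confidence: float, existing: Optional[str] = None,
                    upper: float = UPPER, lower: float = LOWER) -> str:
    """Level for a Text- or TopicAnnotation that has no linked entities."""
    if existing is not None:
        # Enhancements with an existing confidence-level are left alone.
        return existing
    if confidence > upper:
        return "confident"
    if confidence > lower:
        return "suggestion"
    return "uncertain"


def levels_for_linked_entities(confidences: List[float],
                               upper: float = UPPER,
                               lower: float = LOWER) -> List[str]:
    """Levels for the EntityAnnotations linked to one TextAnnotation."""
    # Candidates are all annotations above the lower threshold.
    candidates = sum(1 for c in confidences if c > lower)
    levels = []
    for c in confidences:
        if c <= lower:
            levels.append("uncertain")   # see the *) note
        elif candidates > 1:
            levels.append("ambiguous")   # several plausible entities
        else:
            levels.append(level_for_value(c, upper=upper, lower=lower))
    return levels
```

For example, `levels_for_linked_entities([0.9, 0.85, 0.2])` marks the first two annotations as "ambiguous" and the third as "uncertain", matching the *) note.
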
> Note that "fise:confidence-level" could also be used by Engines that
> cannot provide fise:confidence values (e.g. the langid engine could
> mark detected languages as "uncertain" if the parsed text was too
> short).
>
> WDYT
> Rupert
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>



-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner 
Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************
