Fwd: [Suggestion] Enhancement confidence range [0..1] and addition of confidence-levels

Alessandra Donnini Thu, 24 May 2012 21:41:55 -0700

Typo


> Hi Rupert 
> I tried the two options but I have some doubts:
> 1) Short option
> get the CELI engine bundle and install it to a recent Stanbol 0.10.0
> launcher: does "install" mean copying {branch}/celi-enhancement-engines in
> place of the {trunk}/enhancer directory?
> 2) Complete workflow
> what do you mean by "check out the branch [1]" in the complete workflow list?
> Do I need to replace the {trunk}/enhancer directory with the
> {branch}/celi-enhancement-engines directory?
> 
> thanks
> Alessandra
> On 24 May 2012, at 11:58, Rupert Westenthaler wrote:
> 
>> Hi
>> 
>> On 24.05.2012 at 08:50, Alessandra Donnini <[email protected]> wrote:
>> 
>>> Are the new CELI enhancement engines included in the latest release,
>>> apache-stanbol-0.9.0-incubating (2012/05/08), available at
>>> http://incubator.apache.org/stanbol/downloads/releases.html?
>>> Do I need to download the files from
>>> https://issues.apache.org/jira/browse/STANBOL-583 and install them? If so,
>>> how should I do that?
>>> thanks
>>> Alessandra Donnini
>>> 
>> 
>> Currently the plan is to include the CELI engines only in 0.10.0.
>> The engines are not yet included in the trunk, but are available in a
>> branch [1], as there are still two remaining issues with the NER engine.
>> Once those are solved, the engines should be included in the trunk within days.
>> 
>> To use the CELI engines in the current state you will need to
>> 
>> Short option (should work)
>> 
>> * check out the branch 
>> http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/
>> * call mvn install in the branch
>> * get the CELI engine bundle and install it to a recent Stanbol 0.10.0 
>> launcher
>> 
>> The complete workflow would be to
>> 
>> * check out and "mvn install" the trunk
>> * check out the branch [1]
>> * call mvn install in the branch
>> * go back to {trunk}/launchers/full
>> * call mvn clean install - this will create a full launcher that includes
>> the bundles as built in the branch
>> * use this launcher to start Stanbol
>> 
>> best
>> Rupert
>> 
>> [1] 
>> http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/
>> 
>>> 
>>> 
>>> 
>>> On 24 May 2012, at 08:18, Rupert Westenthaler wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> In the last two weeks I considerably improved the validation of the
>>>> Enhancements created by the different Stanbol Enhancement Engines.
>>>> Here is the list of related issues:
>>>> 
>>>> * STANBOL-613: Define how to retrieve the language of the parsed content
>>>> * STANBOL-617: Define how to encode fise:TopicEnhancements
>>>> * STANBOL-625: Add link to the entityhub:site if suggested Entity is
>>>> available via the Entityhub
>>>> 
>>>> Note also STANBOL-612, which provides a utility class that makes it easy
>>>> to validate created enhancements in unit tests of EnhancementEngines.
>>>> All existing engines now use this utility to validate Enhancements.
>>>> The contributed CELI engines (STANBOL-583) also already conform to
>>>> those tests.
>>>> 
>>>> The next thing I would like to make clearer (and easier to
>>>> use/understand) is how confidence is represented for Stanbol
>>>> Enhancements. Related to this I would like to discuss the following
>>>> two suggestions:
>>>> 
>>>> ### Suggestion 1: Require confidence values to be in the range [0..1]
>>>> 
>>>> This is a long-running discussion, but I would really like to add a
>>>> check that enforces confidence values to be in the range
>>>> [0..1].
>>>> 
>>>> I think this change is necessary, because it moves the responsibility
>>>> for interpreting confidence values from the Stanbol users to the
>>>> implementors of the Engines. I know that providing confidence values
>>>> is a hard thing to do, but while it may be hard for Engine developers,
>>>> it is close to impossible for Stanbol users to do so.
>>>> 
>>>> Note that EnhancementEngine would still be free to create Enhancements
>>>> with no "fise:confidence" value.
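
A minimal sketch of such a check, written in Python for brevity (Stanbol itself is Java). The function name validate_confidence is hypothetical and not part of Stanbol; the sketch only assumes that a missing "fise:confidence" value is represented as None and stays valid, as described above.

```python
from typing import Optional


def validate_confidence(value: Optional[float]) -> Optional[float]:
    """Return the value unchanged if it is absent or within [0..1]."""
    if value is None:
        # Enhancements without a fise:confidence value remain valid.
        return None
    # The chained comparison is also False for NaN, so NaN is rejected too.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"confidence {value!r} is outside the range [0..1]")
    return value
```

A unit-test utility could call such a function for every created enhancement, so an engine that emits raw scores fails its own tests instead of confusing users later.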
>>>> 
>>>> Surprisingly, a lot of the existing Engines already conform to this
>>>> rule. The most prominent exception is the Named Entity Tagging Engine
>>>> (o.a.s.enhancer.engine.entitytagging). Because of this I have already
>>>> implemented an algorithm that normalizes confidence values by a
>>>> combination of the Levenshtein distance (selected-text <-> entity
>>>> label) and the Solr result score for the Entity (see STANBOL-624 for
>>>> details).
>>>> 
>>>> If we can agree on this rule, I would also use a similar approach for
>>>> other Engines that do not yet normalize confidence values to
>>>> [0..1]
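
To illustrate the general idea of combining a label distance with a search score, here is a rough Python sketch. It is not the algorithm from STANBOL-624: the function names, the case-insensitive comparison, the division by the best score in the result list, and the plain product of the two factors are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def label_similarity(selected_text: str, label: str) -> float:
    """Map the edit distance to [0..1]; 1.0 means identical (ignoring case)."""
    longest = max(len(selected_text), len(label))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(selected_text.lower(), label.lower()) / longest


def normalized_confidence(selected_text: str, label: str,
                          score: float, max_score: float) -> float:
    """Combine label similarity and relative result score (assumed: product)."""
    relative_score = score / max_score if max_score > 0 else 0.0
    relative_score = min(max(relative_score, 0.0), 1.0)  # clamp to [0..1]
    return label_similarity(selected_text, label) * relative_score
```

With these assumptions an exact label match on the top-ranked result gets 1.0, while a longer label such as "Paris (Texas)" for the selected text "Paris" with half the top score ends up well below 0.5.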
>>>> 
>>>> ### Suggestion 2: Add fise:confidence-level property
>>>> 
>>>> The "confidence-level" is intended to make it easier for clients to
>>>> decide how to process Enhancements. It would not use a numerical range
>>>> but four distinct values:
>>>> 
>>>> * confident: Meaning that a match is very likely - indicating that
>>>> those annotations typically can be accepted automatically (e.g. if the
>>>> EntityLinking engine finds a single Entity that exactly matches the
>>>> text selected by a text annotation)
>>>> * ambiguous: Meaning that there are several possibilities but it is
>>>> still likely that one of them matches (e.g. Paris, Paris (Texas))
>>>> * suggestion: Meaning that the match is not completely certain, but
>>>> there are not several options (e.g. Germans -> Germany)
>>>> * uncertain: Meaning that Entities do match, but the probability of a
>>>> match is rather speculative (e.g. John -> Elton John)
>>>> 
>>>> IMHO this classification would fit a lot of engines much better
>>>> than the numeric "fise:confidence" property, as it does not raise the
>>>> expectation in users that confidence values are on a ratio scale
>>>> (e.g. an Enhancement with a confidence of "0.8" is not two times as
>>>> likely as one with "0.4").
>>>> 
>>>> Engines would have the possibility to add this information to
>>>> enhancements explicitly. For enhancements that do not define it we could
>>>> implement a post-processing engine that adds it based on generic
>>>> rules.
>>>> 
>>>> e.g.
>>>> 
>>>> * ignore Enhancements with an existing "confidence-level" assignment
>>>> * TextAnnotations with a confidence value > 0.8 => confident
>>>> * TextAnnotations with a confidence value between 0.5 and 0.8 => suggestion
>>>> * TextAnnotations with a confidence value < 0.5 => uncertain
>>>> * TextAnnotations with a single linked EntityAnnotation with a
>>>> confidence > 0.8 => confident
>>>> * TextAnnotations with several linked EntityAnnotations with a
>>>> confidence > 0.8 => ambiguous *)
>>>> * TextAnnotations with several linked EntityAnnotations with a
>>>> confidence > 0.5 but none > 0.8 => ambiguous *)
>>>> * TextAnnotations with a single linked EntityAnnotation with a
>>>> confidence between 0.5 and 0.8 => suggestion
>>>> * TextAnnotations with EntityAnnotations with confidence values < 0.5
>>>> => uncertain
>>>> * TopicAnnotations with a confidence value > 0.8 => confident
>>>> * TopicAnnotations with a confidence value between 0.5 and 0.8 => suggestion
>>>> * TopicAnnotations with a confidence value < 0.5 => uncertain
>>>> 
>>>> *) NOTE that in those cases only EntityAnnotations with a confidence
>>>> value > 0.5 would be marked as "ambiguous". Additional
>>>> EntityAnnotations with confidence values < 0.5 would be marked as
>>>> "uncertain"
>>>> 
>>>> The values '0.8' and '0.5' should be configurable.
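
The rules above could be sketched roughly as follows, in Python for brevity. All names (ConfidenceLevel, level_for_value, level_for_linked_entities, assign_level) are hypothetical. The rules leave some cases open, so the sketch makes assumptions: a value exactly equal to a threshold falls into the lower band for 0.8 and the upper band for 0.5, and a TextAnnotation with one EntityAnnotation above 0.8 plus further ones above 0.5 is treated as ambiguous.

```python
from enum import Enum
from typing import Optional, Sequence


class ConfidenceLevel(Enum):
    CONFIDENT = "confident"
    AMBIGUOUS = "ambiguous"
    SUGGESTION = "suggestion"
    UNCERTAIN = "uncertain"


def level_for_value(confidence: float, high: float = 0.8,
                    low: float = 0.5) -> ConfidenceLevel:
    """Level for a TextAnnotation or TopicAnnotation judged by its own value."""
    if confidence > high:
        return ConfidenceLevel.CONFIDENT
    if confidence >= low:  # assumption: exactly `low` counts as suggestion
        return ConfidenceLevel.SUGGESTION
    return ConfidenceLevel.UNCERTAIN


def level_for_linked_entities(entity_confidences: Sequence[float],
                              high: float = 0.8,
                              low: float = 0.5) -> ConfidenceLevel:
    """Level for a TextAnnotation judged by its linked EntityAnnotations."""
    plausible = [c for c in entity_confidences if c >= low]
    if not plausible:
        return ConfidenceLevel.UNCERTAIN
    if len(plausible) > 1:
        # Several candidates above `low`, whether or not any exceeds `high`.
        return ConfidenceLevel.AMBIGUOUS
    return level_for_value(plausible[0], high, low)


def assign_level(existing: Optional[ConfidenceLevel],
                 confidence: Optional[float],
                 entity_confidences: Sequence[float] = (),
                 high: float = 0.8,
                 low: float = 0.5) -> Optional[ConfidenceLevel]:
    """Post-processing step: keep existing levels, otherwise apply the rules."""
    if existing is not None:
        return existing  # levels set by an engine are never overwritten
    if entity_confidences:
        return level_for_linked_entities(entity_confidences, high, low)
    if confidence is None:
        return None  # nothing to base a level on
    return level_for_value(confidence, high, low)
```

Passing `high` and `low` as parameters keeps the two thresholds configurable, e.g. `assign_level(None, 0.75, high=0.7)` yields "confident".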
>>>> 
>>>> Note that "fise:confidence-level" could also be used by Engines that
>>>> cannot provide fise:confidence values (e.g. the langid engine could
>>>> mark detected languages as "uncertain" if the parsed text was too
>>>> short).
>>>> 
>>>> WDYT
>>>> Rupert
>>>> 
>>>> 
>>>> -- 
>>>> | Rupert Westenthaler             [email protected]
>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>> | A-5500 Bischofshofen
>>> 
>
