Re: [Suggestion] Enhancement confidence range [0..1] and addition of confidence-levels

Rupert Westenthaler Thu, 24 May 2012 02:58:45 -0700

Hi

On 24.05.2012 at 08:50, Alessandra Donnini <[email protected]> wrote:


> Are the new CELI enhancement engines included in the latest release,
> apache-stanbol-0.9.0-incubating (2012/05/08), available at
> http://incubator.apache.org/stanbol/downloads/releases.html?
> Do I need to download the files from
> https://issues.apache.org/jira/browse/STANBOL-583 and install them? If so,
> how should I do that?
> Thanks,
> Alessandra Donnini
> 

Currently the plan is to include the CELI engines only in 0.10.0.
The engines are not yet included in the trunk, but they are available in a
branch [1], as there are still two remaining issues with the NER engine. Once
those are solved, the engines should be included in the trunk within days.

To use the CELI engines in their current state you have two options.

Short option (should work):

 * check out the branch
   http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/
 * call "mvn install" in the branch
 * take the resulting CELI engine bundle and install it into a recent
   Stanbol 0.10.0 launcher

The complete workflow would be to:

 * check out and "mvn install" the trunk
 * check out the branch [1]
 * call "mvn install" in the branch
 * go back to {trunk}/launchers/full
 * call "mvn clean install" - this will create a full launcher that includes
   the bundles as built in the branch
 * use this launcher to start Stanbol

best
Rupert

[1] http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/

> 
> 
> 
> On 24 May 2012, at 08:18, Rupert Westenthaler wrote:
> 
>> Hi all,
>> 
>> In the last two weeks I considerably improved the validation of the
>> Enhancements created by the different Stanbol Enhancement Engines.
>> Here is the list of related issues:
>> 
>> * STANBOL-613: Define how to retrieve the language of the parsed content
>> * STANBOL-617: Define how to encode fise:TopicEnhancements
>> * STANBOL-625: Add link to the entityhub:site if suggested Entity is
>> available via the Entityhub
>> 
>> Note also STANBOL-612, which provides a utility class that makes it
>> easy to validate created enhancements in unit tests of
>> EnhancementEngines. All existing engines now use this utility to
>> validate Enhancements. This also holds for the contributed CELI
>> engines (STANBOL-583), which already conform to those tests.
>> 
>> The next thing I would like to make clearer (and easier to
>> use/understand) is how confidence is represented for Stanbol
>> Enhancements. Related to this I would like to discuss the following
>> two suggestions:
>> 
>> ### Suggestion 1: Require confidence values to be in the range [0..1]
>> 
>> This is a long-running discussion, but I would really like to add a
>> check that enforces confidence values to be in the range [0..1].
>> 
>> I think this change is necessary, because it moves the responsibility
>> for interpreting confidence values from the Stanbol users to the
>> implementors of the Engines. I know that providing confidence values
>> is a hard thing to do, but while it may be hard for Engine developers,
>> it is close to impossible for Stanbol users to do so.
>> 
>> Note that EnhancementEngines would still be free to create
>> Enhancements with no "fise:confidence" value.
>> 
>> Surprisingly, a lot of the existing Engines already conform to this
>> rule. The most prominent exception is the Named Entity Tagging Engine
>> (o.a.s.enhancer.engine.entitytagging). Because of this I have already
>> implemented an algorithm that normalizes confidence values by a
>> combination of the Levenshtein distance (selected-text <-> entity
>> label) and the Solr result score for the Entity (see STANBOL-624 for
>> details).
>> 
>> If we can agree on this rule, I would use a similar approach for the
>> other Engines that do not yet normalize confidence values to the
>> range [0..1].
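
As a rough illustration of such a normalization, the sketch below multiplies a label similarity derived from the Levenshtein distance by the Solr score relative to the best score in the result list. This is not the algorithm from STANBOL-624, whose details are not given in this thread: the function names, the scaling by the maximum score, and the use of a product are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_confidence(selected_text: str, label: str,
                          score: float, max_score: float) -> float:
    """Hypothetical normalization to [0..1]; NOT the STANBOL-624 formula."""
    longest = max(len(selected_text), len(label))
    if longest == 0:
        similarity = 1.0
    else:
        distance = levenshtein(selected_text.lower(), label.lower())
        similarity = 1.0 - distance / longest
    # Assumption: scale the raw Solr score by the best score of the result list.
    if max_score <= 0:
        relative_score = 0.0
    else:
        relative_score = min(max(score / max_score, 0.0), 1.0)
    return similarity * relative_score
```

Because both factors are clamped to [0..1], the product stays in that range as well, which is the only property the suggested check would enforce.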
>> 
>> ### Suggestion 2: Add fise:confidence-level property
>> 
>> The "confidence-level" is intended to make it easier for clients to
>> decide how to process Enhancements. It would not use a numerical range
>> but four distinct values:
>> 
>> * confident: Meaning that a match is very likely - indicating that
>> those annotations typically can be accepted automatically (e.g. if the
>> EntityLinking engine finds a single Entity that exactly matches the
>> text selected by a text annotation)
>> * ambiguous: Meaning that there are several possibilities but it is
>> still likely that one of them matches (e.g. Paris, Paris (Texas))
>> * suggestion: Meaning that the match is not completely certain, but
>> there are not several options (e.g. Germans -> Germany)
>> * uncertain: Meaning that Entities do match, but the probability of a
>> match is rather speculative (e.g. John -> Elton John)
>> 
>> IMHO this classification would fit a lot of engines much better
>> than the numeric "fise:confidence" property, as it does not raise the
>> expectation in users that confidence values are on a ratio scale
>> (e.g. an Enhancement with a confidence of "0.8" is not two times as
>> likely as one with "0.4").
>> 
>> Engines would have the possibility to add this information to
>> enhancements themselves. For enhancements that do not define it we
>> could implement a post-processing engine that adds it based on
>> generic rules.
>> 
>> e.g.
>> 
>> * ignore Enhancements with an existing "confidence-level" assignment
>> * TextAnnotations with a confidence value > 0.8 => confident
>> * TextAnnotations with a confidence value between 0.5 and 0.8
>> => suggestion
>> * TextAnnotations with a confidence value < 0.5 => uncertain
>> * TextAnnotations with a single linked EntityAnnotation with a
>> confidence > 0.8 => confident
>> * TextAnnotations with several linked EntityAnnotations with a
>> confidence > 0.8 => ambiguous *)
>> * TextAnnotations with several linked EntityAnnotations with a
>> confidence > 0.5 but none > 0.8 => ambiguous *)
>> * TextAnnotations with a single linked EntityAnnotation with a
>> confidence between 0.5 and 0.8 => suggestion
>> * TextAnnotations with EntityAnnotations with confidence values < 0.5
>> => uncertain
>> * TopicAnnotations with a confidence value > 0.8 => confident
>> * TopicAnnotations with a confidence value between 0.5 and 0.8
>> => suggestion
>> * TopicAnnotations with a confidence value < 0.5 => uncertain
>> 
>> *) NOTE that in those cases only EntityAnnotations with a confidence
>> value > 0.5 would be marked as "ambiguous". Additional
>> EntityAnnotations with confidence values < 0.5 would be marked as
>> "uncertain"
>> 
>> The values '0.8' and '0.5' should be configurable.
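
The generic rules above could be sketched roughly as follows. This is only one possible reading, not an existing Stanbol API: all names are hypothetical, values exactly equal to a threshold are assumed to fall into the lower band (the rules do not say), and a mix of one EntityAnnotation above 0.8 with others above 0.5 is assumed to count as "ambiguous".

```python
CONFIDENT, AMBIGUOUS = "confident", "ambiguous"
SUGGESTION, UNCERTAIN = "suggestion", "uncertain"


def level_for_value(confidence, high=0.8, low=0.5):
    """Level for a single value (TextAnnotation or TopicAnnotation rules)."""
    if confidence > high:
        return CONFIDENT
    if confidence > low:
        return SUGGESTION
    return UNCERTAIN  # assumption: values equal to `low` land here


def resolve_level(existing_level, confidence, high=0.8, low=0.5):
    """Keep a level already assigned by an engine, otherwise derive one."""
    if existing_level is not None:
        return existing_level
    return level_for_value(confidence, high, low)


def levels_for_linked_entities(confidences, high=0.8, low=0.5):
    """Levels for all EntityAnnotations linked to one TextAnnotation."""
    candidates = [c for c in confidences if c > low]
    if len(candidates) > 1:
        # Several plausible entities: those above `low` are ambiguous,
        # the remaining ones are uncertain (see the *) note).
        return [AMBIGUOUS if c > low else UNCERTAIN for c in confidences]
    # At most one plausible entity: judge each value on its own.
    return [level_for_value(c, high, low) for c in confidences]
```

For example, `levels_for_linked_entities([0.9, 0.85, 0.3])` marks the first two suggestions as ambiguous and the third as uncertain, while `levels_for_linked_entities([0.9])` yields a single confident match. Passing `high` and `low` as parameters reflects the requirement that both thresholds be configurable.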
>> 
>> Note that "fise:confidence-level" could also be used by Engines that
>> cannot provide fise:confidence values (e.g. the langid engine could
>> mark detected languages as "uncertain" if the parsed text was too
>> short).
>> 
>> WDYT
>> Rupert
>> 
>> 
>> -- 
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>
