Re: [Suggestion] Enhancement confidence range [0..1] and addition of confidence-levels

Rupert Westenthaler Fri, 25 May 2012 14:16:02 -0700

Hi
On 25.05.2012, at 17:51, Alessandra Donnini wrote:

> thanks it works
> Are there any details about the CELI engines' contribution to enhancement?


I have written a good overview about the current state in 

    http://markmail.org/thread/qycq37xmfozfonwp

Enhancements created by the CELI

* Language Identification
* Named Entity Recognition
* Topic Classification Engine

engines will be fully compatible with the current Engines. So current users will 
be able to use CELI as a 1:1 replacement.

The Lemmatizer Engine will not be usable to improve results for the 
KeywordLinkingEngine, because the KeywordLinkingEngine is currently unable to 
consume the fise:TextAnnotations with "fise:hasLemmaForm" and 
"fise:hasMorphologicalFeature" created by the CELI Lemmatizer Engine.

best
Rupert


> I'm working with the Italian language for our demo, also using our SKOS 
> thesaurus, and I would like to improve enhancement quality. As a service user, 
> I find suggestion 2 a very interesting way to interpret numbers coming from an 
> engine that, for us, is a black box.
> regards
> Alessandra
> 
> 
> On 25 May 2012, at 08:40, Rupert Westenthaler wrote:
> 
>> 
>> On 24.05.2012, at 17:15, Alessandra Donnini wrote:
>> 
>>> Hi Rupert 
>>> I tried the two options but I have some doubts:
>>> 1) Short option
>>> get the CELI engine bundle and install it to a recent Stanbol 0.10.0 
>>> launcher: does "install" mean copying it in place of the {trunk}/enhancer directory?
>> 
>> sorry for not being more precise ...
>> ... with "install" I referred to installing the bundle of the CELI engines 
>> to the running OSGi environment. 
>> 
>> Assuming that you have already checked out and built the branch, you have two 
>> options to do that
>> 
>> 1. via the console
>> 
>> go to directory  "{branch}/engines/celi" and call
>> 
>>   mvn clean install -PinstallBundle -Dsling.url=http://localhost:8080/system/console
>> 
>> 2. via the Apache Felix Web Console
>> 
>> go in the browser to "http://localhost:8080/system/console/bundles"
>> press the "Install/Update..." Button (top right corner)
>> add the CELI bundle from "{branch}/engines/celi/target"
>> 
>> 
>> Both options assume that Stanbol 0.10.0 runs at localhost port 8080.
>> 
>>> 2) Complete workflow
>>> what do you mean by "check out the branch [1]" in the complete workflow 
>>> list? Do I need to replace the {trunk}/enhancer directory with the 
>>> {branch}/celi-enhancement-engines directory? 
>>> 
>> 
>> no, you should check out the branch to another directory.
>> 
>>  mvn install
>> 
>> copies the compiled modules to your local Maven repository
>> 
>> ~/.m2/repository
>> 
>> so if you compile the branch it will override the modules of the trunk in 
>> your local repository (with the version of the branch). Because of this, if 
>> you afterwards compile only the Full Launcher module in the trunk, it will 
>> pick up the jar versions from the branch.
>> 
>> best
>> Rupert
>> 
>> 
>>> thanks
>>> Alessandra
>>> On 24 May 2012, at 11:58, Rupert Westenthaler wrote:
>>> 
>>>> Hi
>>>> 
>>>> On 24.05.2012, at 08:50, Alessandra Donnini <[email protected]> wrote:
>>>> 
>>>>> Are the new CELI enhancement engines included in the latest release, 
>>>>> apache-stanbol-0.9.0-incubating (2012/05/08), available at 
>>>>> http://incubator.apache.org/stanbol/downloads/releases.html?
>>>>> Do I need to download files from 
>>>>> https://issues.apache.org/jira/browse/STANBOL-583 and install them? If so, 
>>>>> how should I do that?
>>>>> thanks
>>>>> Alessandra Donnini
>>>>> 
>>>> 
>>>> Currently the plan is to include the CELI engines only in 0.10.0.
>>>> The engines are not yet included in the trunk, but are available in a branch 
>>>> [1], as there are still two remaining issues with the NER engine. Once those 
>>>> are solved the engines should be included in the trunk within days.
>>>> 
>>>> To use the CELI engines in the current state you will need to
>>>> 
>>>> Short option (should work)
>>>> 
>>>> * check out the branch 
>>>> http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/
>>>> * call mvn install in the branch
>>>> * get the CELI engine bundle and install it to a recent Stanbol 0.10.0 
>>>> launcher
>>>> 
>>>> The complete workflow would be to
>>>> 
>>>> * check out and "mvn install" the trunk
>>>> * check out the branch [1]
>>>> * call mvn install in the branch
>>>> * go back to {trunk}/launchers/full
>>>> * call mvn clean install - this will create a full launcher that includes 
>>>> the bundles as built in the branch
>>>> * use this launcher to start Stanbol
>>>> 
>>>> best
>>>> Rupert
>>>> 
>>>> [1] 
>>>> http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/
>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 24 May 2012, at 08:18, Rupert Westenthaler wrote:
>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> In the last two weeks I considerably improved the validation of the
>>>>>> Enhancements created by the different Stanbol Enhancement Engines.
>>>>>> Here is the list of related issues:
>>>>>> 
>>>>>> * STANBOL-613: Define how to retrieve the language of the parsed content
>>>>>> * STANBOL-617: Define how to encode fise:TopicEnhancements
>>>>>> * STANBOL-625: Add link to the entityhub:site if suggested Entity is
>>>>>> available via the Entityhub
>>>>>> 
>>>>>> Note also STANBOL-612, which provides a utility class that makes it easy
>>>>>> to validate created enhancements in unit tests of EnhancementEngines.
>>>>>> All existing engines now use this utility to validate Enhancements.
>>>>>> This is also true for the contributed CELI engines (STANBOL-583), which
>>>>>> already conform to those tests.
>>>>>> 
>>>>>> The next thing I would like to make clearer (and easier to
>>>>>> use/understand) is how confidence is represented for Stanbol
>>>>>> Enhancements. Related to this I would like to discuss the following
>>>>>> two suggestions:
>>>>>> 
>>>>>> ### Suggestion 1: Require confidence values to be in the range [0..1]
>>>>>> 
>>>>>> This is a long-running discussion, but I would really like to add a
>>>>>> check that enforces confidence values to be in the range
>>>>>> [0..1].
>>>>>> 
>>>>>> I think this change is necessary, because it moves the responsibility
>>>>>> for interpreting confidence values from the Stanbol users to the
>>>>>> implementors of the Engines. I know that providing confidence values
>>>>>> is a hard thing to do, but while it may be hard for Engine developers,
>>>>>> it is nearly impossible for Stanbol users to do so.
>>>>>> 
>>>>>> Note that EnhancementEngines would still be free to create Enhancements
>>>>>> with no "fise:confidence" value.
>>>>>> 
>>>>>> Surprisingly, a lot of the existing Engines already conform to this
>>>>>> rule. The most prominent exception is the Named Entity Tagging Engine
>>>>>> (o.a.s.enhancer.engine.entitytagging). Because of this I have already
>>>>>> implemented an algorithm that normalizes confidence values by a
>>>>>> combination of the Levenshtein distance (selected-text <-> entity
>>>>>> label) and the Solr result score for the Entity (see STANBOL-624 for
>>>>>> details).
>>>>>> 
>>>>>> If we could agree on this rule I would use a similar approach for
>>>>>> other Engines that do not yet normalize confidence values to
>>>>>> [0..1].
>>>>>> 
>>>>>> ### Suggestion 2: Add fise:confidence-level property
>>>>>> 
>>>>>> The "confidence-level" is intended to make it easier for clients to
>>>>>> decide how to process Enhancements. It would not use a numerical range
>>>>>> but four distinct values:
>>>>>> 
>>>>>> * confident: Meaning that a match is very likely - indicating that
>>>>>> those annotations typically can be accepted automatically (e.g. if the
>>>>>> EntityLinking engine finds a single Entity that exactly matches the
>>>>>> text selected by a text annotation)
>>>>>> * ambiguous: Meaning that there are several possibilities but it is
>>>>>> still likely that one of them matches (e.g. Paris, Paris (Texas))
>>>>>> * suggestion: Meaning that the match is not completely certain, but
>>>>>> there are not several options (e.g. Germans -> Germany)
>>>>>> * uncertain: Meaning that Entities do match, but the probability of a
>>>>>> match is rather speculative (e.g. John -> Elton John)
>>>>>> 
>>>>>> IMHO using this classification would fit a lot of engines much better
>>>>>> than the numeric "fise:confidence" property, as it does not raise the
>>>>>> expectation in users that confidence values are on a ratio scale
>>>>>> (e.g. an Enhancement with a confidence of "0.8" is not two times as
>>>>>> likely as one with "0.4").
>>>>>> 
>>>>>> Engines would have the possibility to manually add this information
>>>>>> to enhancements. For enhancements that do not define it we could
>>>>>> implement a post-processing engine that adds it based on generic
>>>>>> rules.
>>>>>> 
>>>>>> e.g.
>>>>>> 
>>>>>> * ignore Enhancements with an existing "confidence-level" assignment
>>>>>> * TextAnnotations with a confidence value > 0.8 => confident
>>>>>> * TextAnnotations with a confidence value between 0.5 and 0.8 => suggestion
>>>>>> * TextAnnotations with a confidence value < 0.5 => uncertain
>>>>>> * TextAnnotations with a single linked EntityAnnotation with a
>>>>>> confidence > 0.8 => confident
>>>>>> * TextAnnotations with several linked EntityAnnotations with a
>>>>>> confidence > 0.8 => ambiguous *)
>>>>>> * TextAnnotations with several linked EntityAnnotations with a
>>>>>> confidence > 0.5 but none > 0.8 => ambiguous *)
>>>>>> * TextAnnotations with a single linked EntityAnnotation with a
>>>>>> confidence between 0.5 and 0.8 => suggestion
>>>>>> * TextAnnotations with EntityAnnotations with confidence values < 0.5
>>>>>> => uncertain
>>>>>> * TopicAnnotation with a confidence value > 0.8 => confident
>>>>>> * TopicAnnotation with a confidence value between 0.5 and 0.8 => suggestion
>>>>>> * TopicAnnotation with a confidence value < 0.5 => uncertain
>>>>>> 
>>>>>> *) NOTE that in those cases only EntityAnnotations with a confidence
>>>>>> value > 0.5 would be marked as "ambiguous". Additional
>>>>>> EntityAnnotations with confidence values < 0.5 would be marked as
>>>>>> "uncertain"
>>>>>> 
>>>>>> The values '0.8' and '0.5' should be configurable.
>>>>>> 
>>>>>> Note that "fise:confidence-level" could also be used by Engines that
>>>>>> cannot provide fise:confidence values (e.g. the langid engine could
>>>>>> mark detected languages as "uncertain" if the parsed text was too
>>>>>> short).
>>>>>> 
>>>>>> WDYT
>>>>>> Rupert
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> | Rupert Westenthaler             [email protected]
>>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>>>> | A-5500 Bischofshofen
>>>>> 
>>> 
>> 
>
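
The generic rules quoted above for Suggestion 2, together with the [0..1] range check from Suggestion 1, could be sketched roughly as follows. This is only an illustration in Python, not Stanbol code: the function names are hypothetical, and two points the mail leaves open are filled in by assumption. First, a value exactly equal to a threshold falls into the lower level. Second, "single" and "several" are counted among the linked EntityAnnotations whose confidence is above the lower threshold, following the *) note.

```python
from typing import List, Optional

CONFIDENT = "confident"
AMBIGUOUS = "ambiguous"
SUGGESTION = "suggestion"
UNCERTAIN = "uncertain"


def check_confidence(value: float) -> float:
    """Suggestion 1: reject confidence values outside the range [0..1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"confidence {value} is outside [0..1]")
    return value


def level_for_value(confidence: float, existing: Optional[str] = None,
                    high: float = 0.8, low: float = 0.5) -> str:
    """Level for a TextAnnotation or TopicAnnotation based on its own value.

    An already assigned level is kept. 'high' and 'low' are the
    configurable thresholds (0.8 and 0.5 in the mail).
    """
    if existing is not None:
        return existing
    check_confidence(confidence)
    if confidence > high:
        return CONFIDENT
    if confidence > low:  # assumption: exactly 'high' counts as suggestion
        return SUGGESTION
    return UNCERTAIN      # assumption: exactly 'low' counts as uncertain


def levels_for_entities(confidences: List[float],
                        high: float = 0.8, low: float = 0.5) -> List[str]:
    """Levels for the EntityAnnotations linked to one TextAnnotation."""
    for value in confidences:
        check_confidence(value)
    # Assumption: only entities above 'low' count as competing candidates.
    candidates = [value for value in confidences if value > low]
    levels = []
    for value in confidences:
        if value <= low:
            levels.append(UNCERTAIN)
        elif len(candidates) > 1:
            levels.append(AMBIGUOUS)
        elif value > high:
            levels.append(CONFIDENT)
        else:
            levels.append(SUGGESTION)
    return levels


# Two strong candidates plus one weak one, e.g. "Paris":
print(levels_for_entities([0.9, 0.85, 0.3]))
# → ['ambiguous', 'ambiguous', 'uncertain']
```

Keeping the thresholds as parameters mirrors the remark that '0.8' and '0.5' should be configurable.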
