Hi
On 25.05.2012, at 17:51, Alessandra Donnini wrote:
> thanks, it works.
> Are there any details about what the CELI engines contribute to the
> enhancements?
I wrote an overview of the current state in
http://markmail.org/thread/qycq37xmfozfonwp
Enhancements created by the CELI
* Language Identification
* Named Entity Recognition
* Topic Classification
engines will be fully compatible with those of the current Engines, so current
users will be able to use the CELI engines as a 1:1 replacement.
The Lemmatizer Engine cannot yet be used to improve the results of the
KeywordLinkingEngine, because the KeywordLinkingEngine is currently unable to
consume the fise:TextAnnotations with "fise:hasLemmaForm" and
"fise:hasMorphologicalFeature" created by the CELI Lemmatizer Engine.
best
Rupert
> I'm working with the Italian language for our demo, also using our SKOS
> thesaurus, and I would like to improve the enhancement quality. As a service
> user, I find suggestion 2 a very interesting way to interpret numbers coming
> from an engine that, for us, is a black box.
> regards
> Alessandra
>
>
> On 25 May 2012, at 08:40, Rupert Westenthaler wrote:
>
>>
>> On 24.05.2012, at 17:15, Alessandra Donnini wrote:
>>
>>> Hi Rupert
>>> I tried the two options but I have some doubts:
>>> 1) Short option
>>> get the CELI engine bundle and install it to a recent Stanbol 0.10.0
>>> launcher: does "install" mean copying it in place of the {trunk}/enhancer
>>> directory?
>>
>> sorry for not being more precise ...
>> ... by "install" I meant installing the bundle of the CELI engines into
>> the running OSGi environment.
>>
>> Assuming that you have already checked out and built the branch, you have two
>> options to do that:
>>
>> 1. via the console
>>
>> go to the directory "{branch}/engines/celi" and call
>>
>> mvn clean install -PinstallBundle \
>>     -Dsling.url=http://localhost:8080/system/console
>>
>> 2. via the Apache Felix Web Console
>>
>> open "http://localhost:8080/system/console/bundles" in the browser,
>> press the "Install/Update..." button (top right corner) and
>> add the CELI bundle from "{branch}/engines/celi/target"
>>
>>
>> Both options assume that Stanbol 0.10.0 runs at localhost port 8080.
>>
>>> 2) Complete workflow
>>> what do you mean by "check out the branch [1]" in the complete workflow
>>> list? Do I need to replace the {trunk}/enhancer directory with the
>>> {branch}/celi-enhancement-engines directory?
>>>
>>
>> no, you should check out the branch to another directory.
>>
>> mvn install
>>
>> copies the compiled modules to your local Maven repository
>>
>> ~/.m2/repository
>>
>> so if you compile the branch, it will overwrite the modules of the trunk in
>> your local repository (with the version of the branch). Because of this, if
>> you afterwards compile only the Full Launcher module in the trunk, it will
>> pick up the jar versions built from the branch.
>>
>> best
>> Rupert
>>
>>
>>> thanks
>>> Alessandra
>>> On 24 May 2012, at 11:58, Rupert Westenthaler wrote:
>>>
>>>> Hi
>>>>
>>>> On 24.05.2012 at 08:50, Alessandra Donnini <[email protected]> wrote:
>>>>
>>>>> Are the new CELI enhancement engines included in the latest release,
>>>>> apache-stanbol-0.9.0-incubating (2012/05/08), available at
>>>>> http://incubator.apache.org/stanbol/downloads/releases.html?
>>>>> Do I need to download files from
>>>>> https://issues.apache.org/jira/browse/STANBOL-583 and install them? If so,
>>>>> how should I do that?
>>>>> thanks
>>>>> Alessandra Donnini
>>>>>
>>>>
>>>> Currently the plan is to include the CELI engines only in 0.10.0.
>>>> The engines are not yet included in the trunk, but are available in a branch
>>>> [1], as there are still two remaining issues with the NER engine. Once those
>>>> are solved, the engines should be included in the trunk within days.
>>>>
>>>> To use the CELI engines in their current state you have two options:
>>>>
>>>> Short option (should work)
>>>>
>>>> * check out the branch
>>>> http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/
>>>> * call mvn install in the branch
>>>> * get the CELI engine bundle and install it to a recent Stanbol 0.10.0
>>>> launcher
>>>>
>>>> The complete workflow would be to
>>>>
>>>> * check out and "mvn install" the trunk
>>>> * check out the branch [1]
>>>> * call mvn install in the branch
>>>> * go back to {trunk}/launchers/full
>>>> * call mvn clean install - this will create a full launcher that includes
>>>> the bundles as built in the branch
>>>> * use this launcher to start Stanbol
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> [1]
>>>> http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 24 May 2012, at 08:18, Rupert Westenthaler wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> In the last two weeks I considerably improved the validation of the
>>>>>> Enhancements created by the different Stanbol Enhancement Engines.
>>>>>> Here is the list of related issues:
>>>>>>
>>>>>> * STANBOL-613: Define how to retrieve the language of the parsed content
>>>>>> * STANBOL-617: Define how to encode fise:TopicEnhancements
>>>>>> * STANBOL-625: Add link to the entityhub:site if suggested Entity is
>>>>>> available via the Entityhub
>>>>>>
>>>>>> Note also STANBOL-612, which provides a utility class that makes it easy
>>>>>> to validate created enhancements in unit tests of EnhancementEngines.
>>>>>> All existing engines now use this utility to validate Enhancements.
>>>>>> The contributed CELI engines (STANBOL-583) also already conform to
>>>>>> those tests.
>>>>>>
>>>>>> The next thing I would like to make clearer (and easier to
>>>>>> use/understand) is how confidence is represented for Stanbol
>>>>>> Enhancements. Related to this I would like to discuss the following
>>>>>> two suggestions:
>>>>>>
>>>>>> ### Suggestion 1: Require confidence values to be in the range [0..1]
>>>>>>
>>>>>> This is a long-running discussion, but I would really like to add a
>>>>>> check that enforces confidence values to be in the range
>>>>>> [0..1].
>>>>>>
>>>>>> I think this change is necessary, because it moves the responsibility
>>>>>> for interpreting confidence values from the Stanbol users to the
>>>>>> implementors of the Engines. I know that providing confidence values
>>>>>> is a hard thing to do, but while it may be hard for Engine developers
>>>>>> it is nearly impossible for Stanbol users to do so.
>>>>>>
>>>>>> Note that EnhancementEngines would still be free to create Enhancements
>>>>>> with no "fise:confidence" value.
>>>>>>
>>>>>> Surprisingly, a lot of the existing Engines already conform to this
>>>>>> rule. The most prominent exception is the Named Entity Tagging Engine
>>>>>> (o.a.s.enhancer.engine.entitytagging). Because of this I have already
>>>>>> implemented an algorithm that normalizes confidence values by a
>>>>>> combination of the Levenshtein distance (selected-text <-> entity
>>>>>> label) and the Solr result score for the Entity (see STANBOL-624 for
>>>>>> details).
>>>>>>
>>>>>> If we can agree on this rule, I would use a similar approach for the
>>>>>> other Engines that do not yet normalize confidence values to
>>>>>> [0..1].
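
To make Suggestion 1 a bit more concrete, here is a minimal sketch in Python
(for brevity, not Stanbol code). It is NOT the algorithm from STANBOL-624: the
function names, the equal weighting of the two inputs and the division by the
best score of the result list are all assumptions made only for illustration.
It shows a range check that still allows a missing confidence, and one possible
way to combine label similarity and a search score into [0..1].

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def label_similarity(selected_text: str, label: str) -> float:
    """Map the edit distance to [0..1]; 1.0 means identical (ignoring case)."""
    longest = max(len(selected_text), len(label))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(selected_text.lower(), label.lower()) / longest


def normalized_confidence(selected_text, label, score, max_score,
                          label_weight=0.5):
    """Hypothetical blend of label similarity and relative search score."""
    relative_score = score / max_score if max_score > 0 else 0.0
    value = (label_weight * label_similarity(selected_text, label)
             + (1.0 - label_weight) * relative_score)
    return min(1.0, max(0.0, value))  # clamp to guard against odd inputs


def check_confidence(value):
    """Enforce the proposed rule; None stands for 'no fise:confidence'."""
    if value is not None and not 0.0 <= value <= 1.0:
        raise ValueError("confidence %r is outside [0..1]" % (value,))
    return value
```

An engine that cannot provide a meaningful value would simply pass None to
check_confidence, which matches the idea that "fise:confidence" stays optional.
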
>>>>>>
>>>>>> ### Suggestion 2: Add fise:confidence-level property
>>>>>>
>>>>>> The "confidence-level" is intended to make it easier for clients to
>>>>>> decide how to process Enhancements. It would not use a numerical range
>>>>>> but four distinct values:
>>>>>>
>>>>>> * confident: Meaning that a match is very likely - indicating that
>>>>>> those annotations can typically be accepted automatically (e.g. if the
>>>>>> EntityLinking engine finds a single Entity that exactly matches the
>>>>>> text selected by a text annotation)
>>>>>> * ambiguous: Meaning that there are several possibilities, but it is
>>>>>> still likely that one of them matches (e.g. Paris, Paris (Texas))
>>>>>> * suggestion: Meaning that the match is not completely certain, but
>>>>>> there are no competing options (e.g. Germans -> Germany)
>>>>>> * uncertain: Meaning that Entities do match, but the
>>>>>> match is rather speculative (e.g. John -> Elton John)
>>>>>>
>>>>>> IMHO this classification would fit a lot of engines much better
>>>>>> than the numeric "fise:confidence" property, as it does not raise the
>>>>>> expectation in users that confidence values are on a ratio scale
>>>>>> (e.g. an Enhancement with a confidence of "0.8" is not twice as
>>>>>> likely as one with "0.4").
>>>>>>
>>>>>> Engines would be able to add this information explicitly
>>>>>> to enhancements. For enhancements that do not define it, we could
>>>>>> implement a post-processing engine that adds it based on generic
>>>>>> rules.
>>>>>>
>>>>>> e.g.
>>>>>>
>>>>>> * ignore Enhancements with an existing "confidence-level" assignment
>>>>>> * TextAnnotations with a confidence value > 0.8 => confident
>>>>>> * TextAnnotations with a confidence value between 0.5 and 0.8 =>
>>>>>> suggestion
>>>>>> * TextAnnotations with a confidence value < 0.5 => uncertain
>>>>>> * TextAnnotations with a single linked EntityAnnotation with a
>>>>>> confidence > 0.8 => confident
>>>>>> * TextAnnotations with several linked EntityAnnotations with a
>>>>>> confidence > 0.8 => ambiguous *)
>>>>>> * TextAnnotations with several linked EntityAnnotations with a
>>>>>> confidence > 0.5 but none > 0.8 => ambiguous *)
>>>>>> * TextAnnotations with a single linked EntityAnnotation with a
>>>>>> confidence between 0.5 and 0.8 => suggestion
>>>>>> * TextAnnotations with EntityAnnotations with confidence values < 0.5
>>>>>> => uncertain
>>>>>> * TopicAnnotations with a confidence value > 0.8 => confident
>>>>>> * TopicAnnotations with a confidence value between 0.5 and 0.8 =>
>>>>>> suggestion
>>>>>> * TopicAnnotations with a confidence value < 0.5 => uncertain
>>>>>>
>>>>>> *) NOTE that in those cases only EntityAnnotations with a confidence
>>>>>> value > 0.5 would be marked as "ambiguous". Additional
>>>>>> EntityAnnotations with confidence values < 0.5 would be marked as
>>>>>> "uncertain"
>>>>>>
>>>>>> The values '0.8' and '0.5' should be configurable.
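
The generic rules above can be sketched roughly as follows, again in Python
and only as an illustration. The function names and the level constants are
made up for this example. The list does not say what happens for values of
exactly 0.8 or 0.5, nor for one EntityAnnotation above 0.8 next to another
between 0.5 and 0.8; the sketch assumes boundary values fall into the lower
class and that any two candidates above 0.5 count as "ambiguous". The first
rule (skip enhancements that already carry a level) is left out.

```python
CONFIDENT = "confident"
AMBIGUOUS = "ambiguous"
SUGGESTION = "suggestion"
UNCERTAIN = "uncertain"


def level_for_value(confidence, high=0.8, low=0.5):
    """Rule for a TextAnnotation or TopicAnnotation judged by its own value.

    Assumption: a value equal to a threshold falls into the lower class.
    """
    if confidence > high:
        return CONFIDENT
    if confidence > low:
        return SUGGESTION
    return UNCERTAIN


def levels_for_entities(confidences, high=0.8, low=0.5):
    """Levels for the EntityAnnotations linked to one TextAnnotation.

    'confidences' holds one value per linked EntityAnnotation; the result
    holds one level per value, in the same order.
    """
    # Only candidates above the lower threshold compete with each other.
    candidates = [c for c in confidences if c > low]
    levels = []
    for c in confidences:
        if c <= low:
            levels.append(UNCERTAIN)      # see the *) note in the list
        elif len(candidates) > 1:
            levels.append(AMBIGUOUS)      # several plausible entities
        elif c > high:
            levels.append(CONFIDENT)      # single strong candidate
        else:
            levels.append(SUGGESTION)     # single, weaker candidate
    return levels
```

For example, levels_for_entities([0.9, 0.85, 0.3]) gives
['ambiguous', 'ambiguous', 'uncertain'], and passing high=0.9 turns a
TopicAnnotation with 0.85 into a "suggestion" instead of "confident", which is
what making the thresholds configurable would look like.
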
>>>>>>
>>>>>> Note that "fise:confidence-level" could also be used by Engines that
>>>>>> cannot provide fise:confidence values (e.g. the langid engine could
>>>>>> mark detected languages as "uncertain" if the parsed text was too
>>>>>> short).
>>>>>>
>>>>>> WDYT
>>>>>> Rupert
>>>>>>
>>>>>>
>>>>>> --
>>>>>> | Rupert Westenthaler [email protected]
>>>>>> | Bodenlehenstraße 11 ++43-699-11108907
>>>>>> | A-5500 Bischofshofen
>>>>>
>>>
>>
>