Hi all,

In the last two weeks I considerably improved the validation of the
Enhancements created by the different Stanbol Enhancement Engines.
Here is the list of related issues:

* STANBOL-613: Define how to retrieve the language of the parsed content
* STANBOL-617: Define how to encode fise:TopicEnhancements
* STANBOL-625: Add link to the entityhub:site if suggested Entity is
available via the Entityhub

Note also STANBOL-612, which provides a utility class that makes it
easy to validate created enhancements in the unit tests of
EnhancementEngines. All existing engines now use this utility to
validate their Enhancements. This is also true for the contributed
CELI engine (STANBOL-583), which already conforms to those tests.

The next thing I would like to make clearer (and easier to
use/understand) is how confidence is represented for Stanbol
Enhancements. Related to this, I would like to discuss the following
two suggestions:

### Suggestion 1: Require confidence values to be in the range [0..1]

This is a long-running discussion, but I would really like to add a
check that enforces confidence values to be in the range [0..1].

I think this change is necessary, because it moves the responsibility
for interpreting confidence values from the Stanbol users to the
implementors of the Engines. I know that providing confidence values
is a hard thing to do, but while it may be hard for Engine developers,
it is close to impossible for Stanbol users to do so.

Note that EnhancementEngines would still be free to create Enhancements
with no "fise:confidence" value.
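A minimal sketch of what such a check could look like, assuming the confidence is available as a nullable `Double`. The class and method names are hypothetical and not part of the existing Stanbol validation utility:

```java
/** Hypothetical helper; not part of the Stanbol API. */
final class ConfidenceRangeCheck {
    private ConfidenceRangeCheck() {}

    /**
     * An absent confidence (null) is accepted, because Engines stay free
     * to omit fise:confidence. A present value must lie in [0..1].
     */
    static boolean isValid(Double confidence) {
        if (confidence == null) {
            return true; // no fise:confidence value at all
        }
        return !confidence.isNaN() && confidence >= 0.0 && confidence <= 1.0;
    }
}
```

A unit test would call `ConfidenceRangeCheck.isValid(...)` for every Enhancement an Engine creates and fail on the first `false`.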

Surprisingly, a lot of the existing Engines already conform to this
rule. The most prominent exception is the Named Entity Tagging Engine
(o.a.s.enhancer.engine.entitytagging). Because of this I have already
implemented an algorithm that normalizes confidence values by a
combination of the Levenshtein distance (selected-text <-> entity
label) and the Solr result score for the Entity (see STANBOL-624 for
details).

If we can agree on this rule, I would use a similar approach for
other Engines that do not yet normalize confidence values to the
range [0..1].
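To illustrate the general idea of such a normalization (this is not the STANBOL-624 implementation, whose details are in the issue), here is a sketch that multiplies a Levenshtein-based text similarity by the Solr score relative to the best score in the result list. The class name, the multiplication, the lower-casing and the use of the maximum score are all assumptions made for this example:

```java
import java.util.Locale;

/** Illustrative only; not the algorithm implemented for STANBOL-624. */
final class ConfidenceNormalizer {
    private ConfidenceNormalizer() {}

    /** Two-row dynamic-programming Levenshtein (edit) distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Maps the edit distance to a similarity in [0..1]; 1.0 means identical. */
    static double textSimilarity(String selectedText, String label) {
        String s = selectedText.toLowerCase(Locale.ROOT);
        String l = label.toLowerCase(Locale.ROOT);
        int maxLen = Math.max(s.length(), l.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(s, l) / maxLen;
    }

    /**
     * Assumed combination: text similarity times the score relative to the
     * best score of the result list, clamped to [0..1].
     */
    static double normalize(String selectedText, String label,
                            double score, double maxScore) {
        double relativeScore = maxScore > 0 ? score / maxScore : 0.0;
        double confidence = textSimilarity(selectedText, label) * relativeScore;
        return Math.max(0.0, Math.min(1.0, confidence));
    }
}
```

With this combination only an exact label match on the top-ranked result reaches 1.0; other ways of weighting the two inputs are equally possible.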

### Suggestion 2: Add fise:confidence-level property

The "confidence-level" is intended to make it easier for clients to
decide how to process Enhancements. It would not use a numerical range
but four distinct values:

* confident: Meaning that a match is very likely, indicating that
those annotations can typically be accepted automatically (e.g. if the
EntityLinking engine finds a single Entity that exactly matches the
text selected by a text annotation)
* ambiguous: Meaning that there are several possibilities, but it is
still likely that one of them matches (e.g. Paris, Paris (Texas))
* suggestion: Meaning that the match is not completely certain, but
there are not several options (e.g. Germans -> Germany)
* uncertain: Meaning that Entities do match, but the probability of a
match is rather speculative (e.g. John -> Elton John)

IMHO this classification would fit a lot of engines much better
than the numeric "fise:confidence" property, as it does not raise the
expectation in users that confidence values are on a ratio scale
(e.g. an Enhancement with a confidence of "0.8" is not twice as
likely as one with "0.4").

Engines would have the possibility to add this information to
enhancements themselves. For enhancements that do not define it, we
could implement a post-processing engine that adds it based on generic
rules.

e.g.

* ignore Enhancements with an existing "confidence-level" assignment
* TextAnnotations with a confidence value > 0.8 => confident
* TextAnnotations with a confidence value between 0.5 and 0.8 =>
suggestion
* TextAnnotations with a confidence value < 0.5 => uncertain
* TextAnnotations with a single linked EntityAnnotation with a
confidence > 0.8 => confident
* TextAnnotations with several linked EntityAnnotations with a
confidence > 0.8 => ambiguous *)
* TextAnnotations with several linked EntityAnnotations with a
confidence > 0.5 but none > 0.8 => ambiguous *)
* TextAnnotations with a single linked EntityAnnotation with a
confidence between 0.5 and 0.8 => suggestion
* TextAnnotations with EntityAnnotations with confidence values < 0.5
=> uncertain
* TopicAnnotations with a confidence value > 0.8 => confident
* TopicAnnotations with a confidence value between 0.5 and 0.8 =>
suggestion
* TopicAnnotations with a confidence value < 0.5 => uncertain

*) NOTE that in those cases only EntityAnnotations with a confidence
value > 0.5 would be marked as "ambiguous". Additional
EntityAnnotations with confidence values < 0.5 would be marked as
"uncertain".

The values '0.8' and '0.5' should be configurable.
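The rules above could be sketched roughly as follows. The class, enum and method names are hypothetical. The sketch also makes two assumptions the rules leave open: a value exactly on a threshold falls into the lower level, and a mix of one candidate above the upper threshold with others between the two thresholds is treated as "ambiguous":

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the generic post-processing rules. */
final class ConfidenceLevelRules {
    enum Level { CONFIDENT, AMBIGUOUS, SUGGESTION, UNCERTAIN }

    private final double low;   // e.g. 0.5, configurable
    private final double high;  // e.g. 0.8, configurable

    ConfidenceLevelRules(double low, double high) {
        if (!(0.0 <= low && low < high && high <= 1.0)) {
            throw new IllegalArgumentException("expected 0 <= low < high <= 1");
        }
        this.low = low;
        this.high = high;
    }

    /** Plain TextAnnotations and TopicAnnotations: thresholds only. */
    Level classify(double confidence) {
        if (confidence > high) {
            return Level.CONFIDENT;
        }
        if (confidence > low) {
            return Level.SUGGESTION;
        }
        return Level.UNCERTAIN;
    }

    /**
     * EntityAnnotations linked to one TextAnnotation: returns one level per
     * candidate, in input order. Several candidates above the lower
     * threshold make each of them AMBIGUOUS; candidates at or below it are
     * always UNCERTAIN.
     */
    List<Level> classifyCandidates(List<Double> confidences) {
        int aboveLow = 0;
        for (double c : confidences) {
            if (c > low) {
                aboveLow++;
            }
        }
        List<Level> levels = new ArrayList<>(confidences.size());
        for (double c : confidences) {
            if (c <= low) {
                levels.add(Level.UNCERTAIN);
            } else if (aboveLow > 1) {
                levels.add(Level.AMBIGUOUS);
            } else {
                levels.add(classify(c)); // single candidate above 'low'
            }
        }
        return levels;
    }
}
```

For example, `new ConfidenceLevelRules(0.5, 0.8).classifyCandidates(...)` with the confidences 0.9, 0.85 and 0.3 marks the first two candidates as ambiguous and the third as uncertain. A real post-processing engine would additionally skip Enhancements that already carry a "confidence-level".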

Note that "fise:confidence-level" could also be used by Engines that
cannot provide fise:confidence values (e.g. the langid engine could
mark detected languages as "uncertain" if the parsed text was too
short).

WDYT
Rupert


-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
