On 09.09.2011 at 10:18, Reto Bachmann-Gmür <[email protected]> wrote:
On Thu, Sep 8, 2011 at 11:33 AM, Rupert Westenthaler
<[email protected]> wrote:
On 07.09.2011 at 10:10, Reto Bachmann-Gmür <[email protected]> wrote:
Hi Aldo,
The functionality the enhancer engines currently offer is with no
doubt useful, especially when dealing with content from different
domains and from different sources.
While integrating the enhancer engines into CQ5, I noticed that one use
case is simply to associate content items with a category out of a small
set of predefined ones. For this use case it seems quite inefficient to
get all concepts just to see if one happens to match a local
category/concept. It is also frustrating if the system keeps repeating
the same error ("no, our press releases have nothing to do with physics
even if our spokesperson's name is Einstein").
Using NamedEntities for the categorization of Documents is not very
efficient.
Yet, even for enrichment we have the Einstein-problem.
For doing this the Topic Classification Engine Olivier is currently
working on should be much better suited.
I can't find any information about this. Is the Enricher engine for
adding annotations to parts of the document, and the topic
classification engine for tagging a text as a whole?
Yes. Olivier showed it in Paris. I think he is still working on it.
AFAIK it is currently not in the SVN.
My proposal is to extend the Engine API so that clients may give
feedback on which Enhancements were of actual use and the engines may
use this information. This would allow an engine that delegates to other
engines to weigh those engines based on their success rate. It would
also allow engines based on other trainable text classification
algorithms such as naive Bayes.
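As a rough sketch of the weighting idea above: a delegating engine could keep a per-engine tally of accepted versus rejected enhancements and derive a weight from it. The class and method names below (EngineSuccessTracker, record, weight) are hypothetical and not part of the Stanbol API; the smoothing formula is just one possible choice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, NOT part of the Stanbol API: tracks how often the
// enhancements of each delegate engine were confirmed by client feedback.
class EngineSuccessTracker {
    // engine name -> {accepted count, total count}
    private final Map<String, int[]> counts = new HashMap<>();

    // Record one piece of client feedback for the named engine.
    void record(String engineName, boolean accepted) {
        int[] c = counts.computeIfAbsent(engineName, k -> new int[2]);
        if (accepted) {
            c[0]++;
        }
        c[1]++;
    }

    // Laplace-smoothed success rate, so that an engine without any
    // feedback starts at a neutral weight of 0.5.
    double weight(String engineName) {
        int[] c = counts.getOrDefault(engineName, new int[2]);
        return (c[0] + 1.0) / (c[1] + 2.0);
    }
}
```

With three accepted and one rejected enhancement, an engine would get the weight (3 + 1) / (4 + 2), i.e. about 0.67, while an engine nobody has rated yet stays at 0.5.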
This would be an interesting feature; however, I cannot clearly see how
it could work out (especially in a stateless fashion).
Well, the learning outcome should be instance-wide and not just limited
to a user "session". Of course, if the engine doesn't store the
content and the content is not associated with a URI, it has to be
resubmitted with the feedback, i.e. in the simplest case one submits a
text and a set of categories this text isn't related to.
My comment related to Wernher's remark on having useful, limited
content understanding for customers in niche businesses. If I understand
things correctly, the proposal to understand verbs is a proposal for a
new enhancer engine but wouldn't require a change of the API (the
interface org.apache.stanbol.enhancer.servicesapi.EnhancementEngine).
This is true. The API will not need to be extended. However this Engine
will use some new Annotation types that are currently not part of the
Enhancement Structure.
My impression is that, to provide useful content classification for
typical CMS customers, our API should support trainable engines, and for
this we should extend the API.
Maybe add a Trainable interface that can be optionally implemented by
an EnhancementEngine. It would then be possible to send feedback to
those engines.
Yes, I considered this; however, this would require a client to check
the type and distinguish between instances of EnhancementEngine and
TrainableEnhancementEngine. On the other hand, if all engines are
trainable, an implementation that doesn't support training can just
provide an empty implementation of the new method. I think having only
one interface is to be preferred, to keep API and client implementation
simpler.
My intent was to add a RESTful service that allows sending feedback to
the Stanbol Enhancer. The Enhancer would then forward feedback only to
Engines that also implement the TrainableEngine interface.
Clients do not interact with single EnhancementEngines anyway. The
JobManager takes care of that.
Having a separate interface would also allow registering components that
are only interested in feedback (e.g. a component that manages a
controlled vocabulary of all Entities used by Users and may even adapt
the ranks by the number of usages).
Such a component might not be specific to a particular EnhancementEngine,
but to a specific type of Feedback provided by Users (e.g. a confirmed
Suggestion).
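To make the separate-interface variant a bit more concrete, here is a minimal sketch. Everything in it is hypothetical: Engine is only a stand-in for org.apache.stanbol.enhancer.servicesapi.EnhancementEngine (whose real methods are omitted), and FeedbackListener, FeedbackDispatcher, LearningEngine and UsageCounter are illustrative names, not existing Stanbol types. The dispatcher plays the role of the Enhancer behind the proposed RESTful feedback service.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the real EnhancementEngine interface (assumption: its
// actual methods are not needed to illustrate the feedback path).
interface Engine {
    String name();
}

// Hypothetical optional interface for anything that wants feedback.
interface FeedbackListener {
    void feedback(String text, String suggestion, boolean confirmed);
}

// An engine without training support: no extra method to implement.
class PlainEngine implements Engine {
    public String name() { return "plain"; }
}

// A trainable engine: implements both interfaces.
class LearningEngine implements Engine, FeedbackListener {
    private int received = 0;
    public String name() { return "learning"; }
    public void feedback(String text, String suggestion, boolean confirmed) {
        received++; // a real engine would update its model here
    }
    int received() { return received; }
}

// A feedback-only component: counts how often a suggestion was confirmed.
class UsageCounter implements FeedbackListener {
    private final Map<String, Integer> usages = new HashMap<>();
    public void feedback(String text, String suggestion, boolean confirmed) {
        if (confirmed) {
            usages.merge(suggestion, 1, Integer::sum);
        }
    }
    int usages(String suggestion) { return usages.getOrDefault(suggestion, 0); }
}

// Forwards feedback only to registered components that implement
// FeedbackListener, so clients never check engine types themselves.
class FeedbackDispatcher {
    private final List<Object> components = new ArrayList<>();

    void register(Object component) { components.add(component); }

    // Returns the number of components that were notified.
    int dispatch(String text, String suggestion, boolean confirmed) {
        int notified = 0;
        for (Object c : components) {
            if (c instanceof FeedbackListener) {
                ((FeedbackListener) c).feedback(text, suggestion, confirmed);
                notified++;
            }
        }
        return notified;
    }
}
```

The type check happens once, inside the dispatcher, which is the point made above: clients send feedback to the Enhancer and never need to distinguish trainable from non-trainable engines.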
BTW we should move this discussion to the stanbol-dev mailing list!
I have already added stanbol-dev as cc
best
Rupert
Cheers,
Reto
best
Rupert
Cheers,
Reto
On Fri, Sep 2, 2011 at 1:38 PM, Aldo Gangemi <[email protected]> wrote:
Hi Reto, do you mean you would like a new text classification engine in
Stanbol, or you would like to use a text classification algorithm to
support domain adaptation for verb semantics (for a possible future
relation enhancement engine)?
The first one is not trivial, because domain-specific text
classification requires training from a specialized corpus and given
categories. Some (globally trained) general-purpose text classifiers do
exist (e.g. Alchemy, our SemioSearch, etc.), but they are too broad to
perform well in a domain-specific environment.
The second would be great if lexical knowledge existed in modularized,
domain-oriented form, which is not the case, with the exception of some
wordnets. In this case text classification categories need to match
some specialized lexical knowledge.
My two cents
Aldo
On 2 Sep 2011, at 12:07, Reto Bachmann-Gmür wrote:
On Thu, Sep 1, 2011 at 7:53 AM, Wernher Behrendt
<[email protected]> wrote:
I would like to see work on IKS in this "focus on the problem"
tradition, and the goal is that a CMS provider can then offer tools for
LIMITED, yet useful, understanding of content, to a customer in some
(niche) business domain.
A side note: I don't know if this has been discussed before, but for
this adaptation to niche business domains it would be useful to
support learning EnhancementEngines. On the same CMS instance there is
typically a very specific set of terms used when writing about a
specific topic, and it would be good if Stanbol could learn that
whenever I mention "bnode", "triple" or "clerezza" chances are very
high that the article I'm writing belongs in the Semweb category. It
seems that the current EnhancementEngine API doesn't support feedback
and learning.
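To illustrate the kind of learning meant here, a minimal sketch of a multinomial naive Bayes categorizer follows. It is not an existing Stanbol component; the class name NaiveBayesCategorizer and its methods train and classify are made up for this example, the tokenizer is deliberately crude, and the add-one smoothing (with one extra slot for unseen words) is just one common choice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical example class, not part of Stanbol: learns which words
// are typical for which category from (text, category) feedback.
class NaiveBayesCategorizer {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Learn from one text that a user assigned to the given category.
    void train(String text, String category) {
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Return the most probable category, or null if nothing was learned.
    String classify(String text) {
        List<String> words = tokenize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            // log prior: share of training texts in this category
            double score = Math.log(docCounts.get(category) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String word : words) {
                int c = counts.getOrDefault(word, 0);
                // add-one smoothing, one extra slot for unseen words
                score += Math.log((c + 1.0) / (total + vocabulary.size() + 1.0));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    // Crude tokenizer: lower-case, split on anything but ASCII letters/digits.
    private static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }
}
```

After training on a couple of articles that mention "bnode", "triple" and "clerezza" under the Semweb category and a press release under another category, a new text using those terms would be scored higher for Semweb. The state lives in the categorizer instance, which matches the idea that learning is specific to one CMS instance.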
Cheers,
Reto
_______________________________________________
iks-wip mailing list
[email protected]
http://lists.interactive-knowledge.org/cgi-bin/mailman/listinfo/iks-wip
_____________________________________
Aldo Gangemi
Senior Researcher
Semantic Technology Lab (STLab)
Institute for Cognitive Science and Technology,
National Research Council (ISTC-CNR)
Via Nomentana 56, 00161, Roma, Italy
Tel: +390644161535
Fax: +390644161513
[email protected]
http://www.stlab.istc.cnr.it
http://www.istc.cnr.it/createhtml.php?nbr=71
skype aldogangemi
okkam ID: http://www.okkam.org/entity/ok200707031186131660596
|--
| Rupert Westenthaler [email protected]
| Salzburg Research Forschungsgesellschaft http://www.salzburgresearch.at
| Knowledge Based Information Systems +43 662 2288 413
| Jakob-Haringer Strasse 5/II Skype-Name: westei
| A-5020 Salzburg