Thanks very much Rupert, you helped me a lot in clarifying my ideas :-)

I think I'll follow your suggestion and try to use my thesaurus
with workflow option (2).
I already use Solr as well, so it's probably the best choice for my needs.
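
A minimal Python sketch of the idea behind option (2), as described in the
quoted message below: labels of a controlled vocabulary are matched directly
against the words of the text. This is only a toy and not Stanbol's
KeywordLinkingEngine; the vocabulary, the example.org URIs and the function
name `link_keywords` are all invented for illustration.

```python
import re

# Hypothetical thesaurus: label -> concept URI (invented for illustration).
VOCABULARY = {
    "semantic web": "http://example.org/concept/semantic-web",
    "thesaurus": "http://example.org/concept/thesaurus",
    "note": "http://example.org/concept/note",  # label that is also a common word
}


def link_keywords(text, vocabulary):
    """Return sorted (label, uri) pairs for every label that occurs in text."""
    lowered = text.lower()
    matches = []
    for label, uri in vocabulary.items():
        # Case-insensitive, whole-word match of the label.
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            matches.append((label, uri))
    return sorted(matches)


text = "Please note that the thesaurus covers Semantic Web topics."
for label, uri in link_keywords(text, VOCABULARY):
    print(label, "->", uri)
```

The label "note" matches although it is used here as an ordinary verb, which
is the kind of false positive mentioned for labels that are also common words.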

On the other hand, I'm still interested in trying to build an OpenNLP
model for Italian, but I can do those experiments externally, if I
understand correctly.
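
A small Python sketch for checking model file names before copying them into
the datafiles folder, assuming the "{lang}-ner-{type}.bin" pattern from the
quoted message below. The two-letter lowercase language code and the helper
name `parse_model_name` are assumptions made for this sketch, not something
taken from the Stanbol or OpenNLP documentation.

```python
import re

# Assumed pattern: {lang}-ner-{type}.bin, with a two-letter language code.
MODEL_NAME = re.compile(r"^(?P<lang>[a-z]{2})-ner-(?P<type>[a-z]+)\.bin$")


def parse_model_name(filename):
    """Return (lang, type) if filename follows the assumed scheme, else None."""
    match = MODEL_NAME.match(filename)
    if match is None:
        return None
    return match.group("lang"), match.group("type")


for name in ["it-ner-location.bin", "it-location.bin"]:
    print(name, "->", parse_model_name(name))
```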

Thanks very much, I'll try to make some progress.
Alfredo


2012/3/22 Rupert Westenthaler <[email protected]>

> Hi Alfredo
>
> On 22.03.2012, at 12:24, seralf wrote:
>
> > Hi, I'm new to Stanbol. I'm reading the documentation and examples, and
> > I'd like to start testing it on the Italian language, if possible.
> >
> > Could someone give me some hints on the steps needed to construct my
> > model (Italian) and configure it inside the platform? I suppose it's
> > possible, and it should not be very far from the steps taken to build,
> > let's say, the Spanish integration.
> > What do I need to do? I know it may sound like a very generic question,
> > but it's not so clear from the documentation, so I need help.
> > For my test I would like to use a text corpus from a client's
> > database and a SKOS thesaurus from the same domain.
> >
> > Thanks in advance for any help (suggestions, code examples, ideas, etc.)
> >
>
> In principle there are two different workflows for extracting entities
> from text:
>
> (1) NamedEntityExtraction (NER) [3] => NamedEntityLinking [4]
> (2) KeywordLinking [5]
>
>
> (1) requires an OpenNLP [1] NER model for the language of your documents.
> However, OpenNLP currently does not distribute any models for the Italian
> language, so you would need to build your own. For more information on
> how to do that, please see the documentation of OpenNLP [1]. As soon as
> you have such models, you only need to copy them into the
> {stanbol-workingdir}/sling/datafiles folder. If they follow the naming
> scheme used by OpenNLP ("{lang}-ner-{type}.bin", e.g. "it-ner-location.bin"
> for the model that detects locations in Italian), Stanbol will pick them
> up automatically.
>
> (2) directly matches words of the text with labels of entities within the
> controlled vocabulary. This process can be improved by Natural Language
> Processing (e.g. Part-of-Speech tagging), but this is not a requirement.
> Typically this works fine for datasets that contain named entities, such
> as concepts of a thesaurus, contacts of a company, projects, products and
> so on. It does not work well with datasets that contain entities whose
> labels are also used as common words in the given language, as this will
> result in a lot of false positives.
>
> Based on the information you provided about your use case, I suggest that
> (2) should work just fine for you. This user scenario [2] should provide
> you with all the information needed to configure Stanbol for your use
> case.
>
> I hope this helps. If you have any further questions, feel free to ask.
>
> best
> Rupert Westenthaler
>
> [1] http://opennlp.apache.org/
> [2] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html
>
> [3] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/namedentityextractionengine.html
> [4] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/namedentitytaggingengine.html
> [5] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
>
> > cheers,
> > Alfredo Serafini
>
>
