Thanks very much Rupert, you helped me a lot in clarifying my ideas :-)

I think I'll follow your suggestion and try to use my thesaurus with workflow option (2). I already use Solr as well, so it's probably the best choice for my needs.
On the other hand, I'm still interested in trying to build an OpenNLP model for Italian, but I can do those experiments externally, if I understand correctly.

Thanks very much, I'll try to make some progress.

Alfredo

2012/3/22 Rupert Westenthaler <[email protected]>

> Hi Alfredo
>
> On 22.03.2012, at 12:24, seralf wrote:
>
> > Hi, I'm new to Stanbol. I'm reading the documentation and examples, and
> > I'd like to start some testing with it on the Italian language, if
> > that's possible.
> >
> > Could someone give me some hints on the steps needed to construct my
> > model (Italian) and configure it inside the platform? I suppose it's
> > possible, and it should not be very far from the steps taken to
> > construct -let's say- the Spanish integration.
> > What do I need to do? I know it could sound like a very generic
> > question, but it's not so clear from the documentation, so I need help.
> > For my test I would like to be able to use a text corpus from the
> > database of a client, and a SKOS thesaurus from the same domain.
> >
> > Thanks in advance for any help (suggestions, code examples, ideas, etc.)
> >
>
> In principle there are two different workflows for extracting entities
> from text:
>
> (1) NamedEntityExtraction (NER) [3] => NamedEntityLinking [4]
> (2) KeywordLinking [5]
>
> (1) requires an OpenNLP [1] NER model for the language of your documents.
> However, currently there are no models for the Italian language
> distributed by OpenNLP. This would require you to build your own models.
> For more information on how to do that, please see the documentation of
> OpenNLP [1]. As soon as you have such models, you only need to copy them
> into the {stanbol-workingdir}/sling/datafiles folder. If they follow the
> naming scheme used by OpenNLP ("{lang}-ner-{type}.bin", e.g.
> "it-ner-location.bin" for the model that detects locations for Italian),
> Stanbol will pick them up automatically.
>
> (2) directly matches words of the text with labels of entities within the
> controlled vocabulary. This process can be improved by Natural Language
> Processing (e.g. Part-of-Speech tagging), but this is not a requirement.
> Typically this works fine for datasets that contain named entities such
> as concepts of a thesaurus, contacts of a company, projects, products …
> It does not work well with datasets that contain entities with labels
> that are also used as common words in the given language, as this will
> result in a lot of false positives.
>
> Based on the information you provided on your use case, I suggest that
> (2) should work just fine for you. This user scenario [2] should provide
> you with all the information needed to configure Stanbol for your use
> case.
>
> I hope this helps. If you have any further questions, feel free to ask.
>
> best
> Rupert Westenthaler
>
> [1] http://opennlp.apache.org/
> [2] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html
> [3] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/namedentityextractionengine.html
> [4] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/namedentitytaggingengine.html
> [5] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
>
> > cheers,
> > Alfredo Serafini
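
For workflow (1), the file placement described in the thread could be scripted roughly as follows. This is only a sketch: the helper names `model_file_name` and `install_model` are made up for illustration, and the only things taken from the thread are the "{lang}-ner-{type}.bin" naming scheme and the {stanbol-workingdir}/sling/datafiles folder. Training the model itself is not covered here.

```python
import shutil
from pathlib import Path


def model_file_name(lang: str, entity_type: str) -> str:
    """Build a name following the "{lang}-ner-{type}.bin" scheme from the thread."""
    return f"{lang}-ner-{entity_type}.bin"


def install_model(model_path: Path, stanbol_workingdir: Path,
                  lang: str, entity_type: str) -> Path:
    """Copy a trained model into {stanbol-workingdir}/sling/datafiles.

    Hypothetical helper: it only copies the file under the expected name,
    it does not validate that the file is a real OpenNLP model.
    """
    datafiles = stanbol_workingdir / "sling" / "datafiles"
    datafiles.mkdir(parents=True, exist_ok=True)
    target = datafiles / model_file_name(lang, entity_type)
    shutil.copyfile(model_path, target)
    return target
```

For example, `install_model(Path("my-model.bin"), Path("/opt/stanbol"), "it", "location")` would place the file under `/opt/stanbol/sling/datafiles/` with a name built from the scheme; both paths here are placeholders.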

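
To make the idea behind workflow (2) concrete, here is a toy sketch of matching words of a text against the labels of a controlled vocabulary. It is not how Stanbol's KeywordLinking engine is implemented; the function name `link_keywords`, the `ex:` identifiers and the sample labels are all assumptions made up for illustration.

```python
import re


def link_keywords(text, labels):
    """Return (label, concept_id) pairs for vocabulary labels found in the text.

    `labels` maps a lower-cased, single-space-separated label to a
    hypothetical concept identifier.
    """
    tokens = re.findall(r"\w+", text.lower())
    max_len = max((len(label.split()) for label in labels), default=0)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at this token first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in labels:
                matches.append((candidate, labels[candidate]))
                i += n
                break
        else:
            # No label starts at this token; move on.
            i += 1
    return matches


# Made-up thesaurus labels and identifiers, for illustration only.
vocabulary = {
    "diritto civile": "ex:c1",
    "diritto": "ex:c2",
    "contratto": "ex:c3",
}
found = link_keywords("Il contratto è regolato dal diritto civile.", vocabulary)
```

With this input, `found` contains the matches for "contratto" and "diritto civile". If the vocabulary also contained a label that is a common word (say "il"), it would match as well, which is the false-positive problem mentioned in the thread.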