Dear Rupert,
thanks ... things are getting clearer and clearer.

Best regards,
A

2012/6/21 Rupert Westenthaler <[email protected]>

> Hi Andrea
>
> The CELI Lemmatizer engine (see STANBOL-583) does exactly that. It
> creates TextAnnotations for each word and adds the POS and Lemma (if
> you enable the "Full Morphological Analysis" in its configuration).
>
> Here is an example for "Tagen" in the German sentence "An Tagen wie
> diesen würde man lieber baden gehen!"
>
> <urn:enhancement-3bf15662-f87a-dcba-e4cb-92024b167d30>
>      a       <http://fise.iks-project.eu/ontology/TextAnnotation> ,
> <http://fise.iks-project.eu/ontology/Enhancement> ;
>      <http://fise.iks-project.eu/ontology/selected-text>
>              "Tagen"@de ;
>      <http://fise.iks-project.eu/ontology/selection-context>
>              "An Tagen wie diesen würde man lieber baden gehen!"@de ;
>      <http://fise.iks-project.eu/ontology/start>
>              "3"^^<http://www.w3.org/2001/XMLSchema#int> ;
>      <http://fise.iks-project.eu/ontology/end>
>              "8"^^<http://www.w3.org/2001/XMLSchema#int> ;
>      <http://fise.iks-project.eu/ontology/hasLemmaForm>
>              "tagen"@de , "Tag"@de ;
>      <http://fise.iks-project.eu/ontology/hasMorphologicalFeature>
>              "MOOD=SUB" , "MOOD=INF" , "POS=N" , "PERSON=P3" ,
>              "CASE=DAT" , "POS=V" , "TENSE=PRS" , "GENDER=MAS" , "NUMBER=PLU" ;
>
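The offsets in the example above can be checked by hand. This is only a small sketch: it assumes that fise:start is inclusive and fise:end is exclusive (as the values 3 and 8 suggest) and that the analysed text is exactly the single example sentence. The class and method names are made up for illustration and are not part of Stanbol.

```java
final class OffsetCheck {

    /**
     * Returns the text covered by a start (inclusive) / end (exclusive)
     * character offset pair, as used in the example annotation.
     */
    static String selectedText(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "An Tagen wie diesen würde man lieber baden gehen!";
        // fise:start = 3, fise:end = 8 in the example annotation
        System.out.println(selectedText(text, 3, 8)); // prints "Tagen"
    }
}
```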
>
> This engine uses the properties
>
> * fise:hasLemmaForm
> * fise:hasMorphologicalFeature: values are {key}={value}
>
> to encode the results of the morphological analysis. Note, however, that
> these two properties are NOT specified in the Stanbol Enhancement
> Structure.
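On the consumer side, the {key}={value} literals can be grouped by key. The following is a minimal sketch in plain Java with no Stanbol dependency. The class name MorphFeatures and its parse method are hypothetical, and the input is assumed to be the plain string values of fise:hasMorphologicalFeature, already read from the RDF graph.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class MorphFeatures {

    /** Groups "KEY=VALUE" literals by key; malformed entries are skipped. */
    static Map<String, Set<String>> parse(Collection<String> literals) {
        Map<String, Set<String>> features = new TreeMap<>();
        for (String literal : literals) {
            int sep = literal.indexOf('=');
            if (sep <= 0 || sep == literal.length() - 1) {
                continue; // no key or no value: ignore this literal
            }
            String key = literal.substring(0, sep);
            String value = literal.substring(sep + 1);
            features.computeIfAbsent(key, k -> new TreeSet<>()).add(value);
        }
        return features;
    }

    public static void main(String[] args) {
        // the values from the "Tagen" example
        Map<String, Set<String>> features = parse(Arrays.asList(
                "MOOD=SUB", "MOOD=INF", "POS=N", "PERSON=P3", "CASE=DAT",
                "POS=V", "TENSE=PRS", "GENDER=MAS", "NUMBER=PLU"));
        System.out.println(features.get("POS"));  // [N, V]
        System.out.println(features.get("MOOD")); // [INF, SUB]
    }
}
```

Note that in the example the features of the noun reading ("Tag") and the verb reading ("tagen") end up in one flat set, so a consumer of this encoding cannot tell from the literals alone which feature belongs to which lemma.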
>
> Doing the same with the POSTagger of OpenNLP would be quite easy,
> especially if you use the
> "org.apache.stanbol.commons.opennlp.TextAnalyzer" as the
> KeywordLinkingEngine does.
>
>   @Reference
>   OpenNLP openNLP; //injected -> loads models from config
>
>    //get the plain text from the ContentItem
>    Entry<UriRef,Blob> contentPart = ContentItemHelper.getBlob(ci,
> Collections.singleton("text/plain"));
>    String text = ContentItemHelper.getText(contentPart.getValue());
>    //get the language of the Text
>    String lang = EnhancementEngineHelper.getLanguage(ci);
>
>    //Analyze the text
>    //config for the TextAnalyzer ... you may expose some of them
>    //in the Engine config
>    TextAnalyzerConfig config = new TextAnalyzerConfig(); //uses defaults
>
>    //create the TextAnalyzer
>    TextAnalyzer analyzer = new TextAnalyzer(openNLP, lang, config);
>    //process the text
>    Iterator<AnalysedText> analysedSentences = analyzer.analyse(text);
>    while(analysedSentences.hasNext()){
>        AnalysedText analysed = analysedSentences.next();
>        //NOTE: depending on the config and the available models
>        //           Tokens and/or Chunks might not be present
>        for(Token token : analysed.getTokens()){
>            String posTag = token.getPosTag();
>            double posProb = token.getPosProbability();
>        }
>        for(Chunk chunk : analysed.getChunks()){
>            //similar things for chunks
>        }
>    }
>
> While iterating over the sentences, tokens and chunks you could create
> TextAnnotations similar to those created by the CELI engine.
>
> However, note that - as Olivier mentioned - this creates a lot of RDF
> triples, so it will not scale to very long texts. Assume about 20
> triples per word: texts with a few thousand words should still be fine,
> but if you analyze longer texts you will run into performance and
> memory issues.
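To get a feeling for these numbers, here is a back-of-the-envelope sketch. The factor of 20 triples per word is the rough assumption from the paragraph above, not a measured value, and the class and method names are made up for illustration.

```java
final class TripleEstimate {

    /** Rough assumption taken from the discussion above, not a measurement. */
    static final int ASSUMED_TRIPLES_PER_WORD = 20;

    /** Estimates the number of RDF triples for a text of the given length. */
    static long estimateTriples(long wordCount) {
        return wordCount * ASSUMED_TRIPLES_PER_WORD;
    }

    public static void main(String[] args) {
        long[] wordCounts = {1_000, 5_000, 100_000};
        for (long words : wordCounts) {
            System.out.println(words + " words -> ~" + estimateTriples(words) + " triples");
        }
    }
}
```

Under this assumption a text of 5,000 words yields roughly 100,000 triples, while 100,000 words would already mean about two million triples for a single ContentItem.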
>
> best
> Rupert
>
> On Thu, Jun 21, 2012 at 2:52 PM, Olivier Grisel
> <[email protected]> wrote:
> > 2012/6/21 Andrea Taurchini <[email protected]>:
> >> Dear Olivier,
> >> thanks for your reply.
> >> Ok, so it is possible, but I have to implement it as a new Engine on my
> >> own.
> >> As for the "Tagging Server", it is a new RESTful interface to OpenNLP
> >> that exposes its algorithms over HTTP.
> >
> > Alright, then you can indeed write a set of new low-level, pure NLP
> > engines and delegate the semantic interpretation of such
> > annotations to the caller.
> >
> > The Stanbol RDF-based output format might be a little bit verbose for
> > this kind of low-level annotation, though.
> >
> > --
> > Olivier
> > http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>
