Rupert Westenthaler created STANBOL-734:
-------------------------------------------

             Summary: ContentPart for NLP data - AnalyzedText
                 Key: STANBOL-734
                 URL: https://issues.apache.org/jira/browse/STANBOL-734
             Project: Stanbol
          Issue Type: Sub-task
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


Because managing NLP metadata, which is usually available at word
granularity, is not feasible using the RDF metadata, this issue describes the
addition of a special ContentPart to Stanbol. This ContentPart will have the
name AnalysedText.

AnalysedText
=====

* It wraps the text/plain ContentPart of a ContentItem
* It allows the definition of Spans (type, start, end, spanText). Type
is an Enum: Text, TextSection, Sentence, Chunk, Span
* Spans are sorted naturally by type, start and end. This makes it
possible to use a NavigableSet (e.g. TreeSet) and its #subSet()
functionality to work with contained Tokens. The #higher and #lower
methods of NavigableSet even make it possible to build Iterators that
tolerate concurrent modifications (e.g. adding Chunks while iterating
over the Tokens of a Sentence).
* One can attach Annotations to Spans. Basically this is a multi-valued
Map with Object keys and Value<valueType> value(s) that supports a
type-safe view by using generically typed Annotation<key,valueType>
* The Value<valueType> object natively supports confidence. Because the
confidence is kept on the Value and not on the wrapped object, the same
instance (e.g. of the POS tag for Noun) can be used for all noun
annotations.

* Note that the AnalysedText does NOT use RDF, because representing this
kind of data as RDF is not scalable enough. This also means that the
data of the AnalysedText are NOT available in the Enhancement Metadata
of the ContentItem. However, EnhancementEngines are free to write
all/some results to the AnalysedText AND the RDF metadata of the
ContentItem.
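
The span ordering and the NavigableSet usage described in the list above can be sketched with plain JDK collections. This is only a minimal illustration under stated assumptions, not the Stanbol implementation: `SpanSketch`, its `Type` enum, its `Span` class and both helper methods are hypothetical names, and the enum lists only four illustrative types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical stand-in for the spans described above; not the Stanbol API.
final class SpanSketch {

    // Illustrative subset of span types, declared from coarse to fine.
    enum Type { TEXT, SENTENCE, CHUNK, TOKEN }

    static final class Span implements Comparable<Span> {
        final Type type;
        final int start;
        final int end;

        Span(Type type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        // Natural order: type first, then start, then end.
        @Override
        public int compareTo(Span o) {
            int c = type.compareTo(o.type);
            if (c == 0) c = Integer.compare(start, o.start);
            if (c == 0) c = Integer.compare(end, o.end);
            return c;
        }
    }

    // Tokens enclosed by the sentence, read through a subSet() view.
    static NavigableSet<Span> tokensOf(NavigableSet<Span> spans, Span sentence) {
        Span from = new Span(Type.TOKEN, sentence.start, sentence.start);
        Span to = new Span(Type.TOKEN, sentence.end, sentence.end);
        return spans.subSet(from, true, to, true);
    }

    // Walks the tokens of a sentence with ceiling()/higher() instead of an
    // Iterator, so spans can be added to the set during the walk without a
    // ConcurrentModificationException. Adds one chunk per token as a demo.
    static List<Span> chunkEveryToken(NavigableSet<Span> spans, Span sentence) {
        List<Span> added = new ArrayList<>();
        Span first = new Span(Type.TOKEN, sentence.start, sentence.start);
        Span last = new Span(Type.TOKEN, sentence.end, sentence.end);
        Span current = spans.ceiling(first);
        while (current != null && current.compareTo(last) <= 0) {
            Span chunk = new Span(Type.CHUNK, current.start, current.end);
            spans.add(chunk); // modification while "iterating"
            added.add(chunk);
            current = spans.higher(current);
        }
        return added;
    }
}
```

Because the type is the first sort criterion, all tokens are contiguous in the set, so the tokens of a sentence can be selected with two artificial boundary spans built from the sentence offsets.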

Here is a code sample:

    AnalysedText at; //the ContentPart
    Iterator<Sentence> sentences = at.getSentences();
    while(sentences.hasNext()){
        Sentence sentence = sentences.next();
        String sentText = sentence.getSpan();
        Iterator<Token> tokens = sentence.getTokens();
        while(tokens.hasNext()){
            Token token = tokens.next();
            String tokenText = token.getSpan();
            //the POS annotation with its confidence
            Value<PosTag> pos = token.getAnnotation(
                NlpAnnotations.POSAnnotation);
            String tag = pos.value().getTag();
            double confidence = pos.probability();
        }
    }
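
To illustrate how a multi-valued Map with Object keys can still offer the type-safe view used by `getAnnotation` in the sample above, here is a minimal sketch. It is an assumption about one possible design, not the Stanbol classes: `AnnotationSketch` and the exact shape of its `Value`, `Annotation` and `Annotated` classes are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of type-safe annotations; not the Stanbol classes.
final class AnnotationSketch {

    // A value together with the confidence of the component that produced it.
    static final class Value<V> {
        private final V value;
        private final double probability;

        Value(V value, double probability) {
            this.value = value;
            this.probability = probability;
        }

        V value() { return value; }
        double probability() { return probability; }
    }

    // Typed key: couples the map key with the class of the values stored under it.
    static final class Annotation<K, V> {
        final K key;
        final Class<V> valueType;

        Annotation(K key, Class<V> valueType) {
            this.key = key;
            this.valueType = valueType;
        }
    }

    // Anything that can carry annotations, for example a token.
    static final class Annotated {
        private final Map<Object, List<Value<?>>> annotations = new HashMap<>();

        <K, V> void addAnnotation(Annotation<K, V> annotation, Value<V> value) {
            annotations.computeIfAbsent(annotation.key, k -> new ArrayList<>()).add(value);
        }

        // Returns all values stored under the key, in insertion order.
        <K, V> List<Value<V>> getAnnotations(Annotation<K, V> annotation) {
            List<Value<V>> typed = new ArrayList<>();
            for (Value<?> v : annotations.getOrDefault(annotation.key, Collections.emptyList())) {
                // cast() fails fast if a foreign type was stored under this key.
                typed.add(new Value<>(annotation.valueType.cast(v.value()), v.probability()));
            }
            return typed;
        }

        // Returns the first value, or null if the key was never annotated.
        <K, V> Value<V> getAnnotation(Annotation<K, V> annotation) {
            List<Value<V>> all = getAnnotations(annotation);
            return all.isEmpty() ? null : all.get(0);
        }
    }
}
```

In this sketch the caller never casts: the generic parameters of the `Annotation` key decide the static type of the returned `Value`, while the untyped map stays an internal detail.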

NLP annotations
=====

* TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
contains Tags of a specific generic type. The Tag itself only defines a
String "tag" property
* Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
defined. Both also define an optional LexicalCategory. This is an enum
with the 12 top level concepts defined by the
[OLIA](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
Adjective, Adposition, Adverb ...)
* TagSets (including mapped LexicalCategories) are defined for all
languages for which OpenNLP POS taggers are available. This also
includes the "penn.owl", "stts.owl" and "parole_es_cat.owl" provided by
OLIA. The other TagSets used by OpenNLP are currently not available
from OLIA.
* Note that the LexicalCategory can be used to process POS annotations
of different languages in the same way

TagSet:
https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
POS:
https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos

A code sample:

    TagSet<PosTag> tagSet; //the used TagSet
    Map<String,PosTag> unknown; //missing tags in the TagSet

    Token token; //the token
    String tag; //the detected tag
    double prob; //the probability

    PosTag pos = tagSet.getTag(tag);
    if(pos == null){ //unknown tag
        pos = unknown.get(tag);
    }
    if(pos == null) {
        pos = new PosTag(tag);
        //this tag will not have a LexicalCategory
        unknown.put(tag, pos); //only one instance per tag
    }
    token.addAnnotation(
        NlpAnnotations.POSAnnotation,
        new Value<PosTag>(pos, prob));
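
The remark that the LexicalCategory can be used to process POS annotations of different languages can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration and not the Stanbol sources: `TagSetSketch` and its nested classes are hypothetical, the enum shows only three categories, and each tag set holds only a handful of Penn Treebank and STTS tags.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the names mirror the description but are not the Stanbol sources.
final class TagSetSketch {

    // Illustrative subset of the top level categories.
    enum LexicalCategory { NOUN, VERB, ADJECTIVE }

    static final class PosTag {
        final String tag;
        final LexicalCategory category; // null when the tag is not mapped

        PosTag(String tag, LexicalCategory category) {
            this.tag = tag;
            this.category = category;
        }
    }

    static final class TagSet {
        final List<String> languages;
        private final Map<String, PosTag> tags = new HashMap<>();

        TagSet(String... languages) {
            this.languages = Arrays.asList(languages);
        }

        TagSet add(String tag, LexicalCategory category) {
            tags.put(tag, new PosTag(tag, category));
            return this;
        }

        // Returns the single shared instance for the tag, or null if unknown.
        PosTag getTag(String tag) {
            return tags.get(tag);
        }
    }

    // A few Penn Treebank (English) tags.
    static final TagSet PENN = new TagSet("en")
            .add("NN", LexicalCategory.NOUN)
            .add("VBD", LexicalCategory.VERB)
            .add("JJ", LexicalCategory.ADJECTIVE);

    // A few STTS (German) tags.
    static final TagSet STTS = new TagSet("de")
            .add("NN", LexicalCategory.NOUN)
            .add("VVFIN", LexicalCategory.VERB)
            .add("ADJA", LexicalCategory.ADJECTIVE);

    // Language-independent check: looks at the category, never at the tag string.
    static boolean isNoun(PosTag tag) {
        return tag != null && tag.category == LexicalCategory.NOUN;
    }

    static int countNouns(TagSet tagSet, String... detectedTags) {
        int nouns = 0;
        for (String t : detectedTags) {
            if (isNoun(tagSet.getTag(t))) nouns++;
        }
        return nouns;
    }
}
```

Code written against `isNoun` works for both tag sets without knowing that English adjectives are tagged "JJ" while German attributive adjectives are tagged "ADJA"; tags without a mapped category simply fail the check.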


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
