Rupert Westenthaler created STANBOL-734:
-------------------------------------------
Summary: ContentPart for NLP data - AnalyzedText
Key: STANBOL-734
URL: https://issues.apache.org/jira/browse/STANBOL-734
Project: Stanbol
Issue Type: Sub-task
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler

Managing NLP metadata, which is usually available at word granularity, is not feasible with the RDF metadata. This issue therefore describes the addition of a special ContentPart to Stanbol. This ContentPart will have the name AnalysedText.

AnalysedText
=====

* It wraps the text/plain ContentPart of a ContentItem.
* It allows the definition of Spans (type, start, end, spanText). Type is an Enum: Text, TextSection, Sentence, Chunk, Token.
* Spans are sorted naturally by type, start and end. This allows the use of a NavigableSet (e.g. TreeSet) and its #subSet() functionality to work with contained Tokens. The #higher() and #lower() methods of NavigableSet even allow building Iterators that tolerate concurrent modifications (e.g. adding Chunks while iterating over the Tokens of a Sentence).
* One can attach Annotations to Spans. Basically this is a multi-valued Map with Object keys and Value<valueType> value(s) that supports a type-safe view through generically typed Annotation<key,valueType> objects.
* The Value<valueType> object natively supports confidence. This allows (e.g. for POS tags) the same instance (e.g. the POS tag for Noun) to be used for all noun annotations.
* Note that the AnalysedText does NOT use RDF, because representing this kind of data as RDF is not scalable enough. This also means that the data of the AnalysedText is NOT available in the Enhancement Metadata of the ContentItem. However, EnhancementEngines are free to write all or some results to both the AnalysedText AND the RDF metadata of the ContentItem.
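The span ordering and the NavigableSet points above can be illustrated with a minimal, self-contained sketch. It is not the Stanbol implementation: `SpanOrderingSketch`, `SpanType`, `Span`, `tokensIn` and `chunkEachToken` are hypothetical names, and the (type, start, end) ordering is taken literally from the description. The sketch shows how `subSet()` yields the tokens of a sentence, and how walking with `ceiling()`/`higher()` lets Chunks be added during the walk, something a plain TreeSet Iterator would typically reject with a ConcurrentModificationException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical stand-ins for illustration only; not the Stanbol classes.
final class SpanOrderingSketch {

    // Declaration order doubles as sort order: enclosing types come first.
    enum SpanType { TEXT, TEXT_SECTION, SENTENCE, CHUNK, TOKEN }

    static final class Span implements Comparable<Span> {
        final SpanType type;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset

        Span(SpanType type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        // Natural order: type, then start, then end.
        @Override
        public int compareTo(Span other) {
            int c = type.compareTo(other.type);
            if (c == 0) c = Integer.compare(start, other.start);
            if (c == 0) c = Integer.compare(end, other.end);
            return c;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    // Live view of the tokens that start inside the given parent span.
    static NavigableSet<Span> tokensIn(NavigableSet<Span> spans, Span parent) {
        Span from = new Span(SpanType.TOKEN, parent.start, parent.start);
        Span to = new Span(SpanType.TOKEN, parent.end, parent.end);
        return spans.subSet(from, true, to, false);
    }

    // Visits the tokens of a sentence via ceiling()/higher() instead of an
    // Iterator, so one Chunk per token can be added during the walk.
    static List<Span> chunkEachToken(NavigableSet<Span> spans, Span sentence) {
        List<Span> visited = new ArrayList<>();
        Span upper = new Span(SpanType.TOKEN, sentence.end, sentence.end);
        Span current = spans.ceiling(
                new Span(SpanType.TOKEN, sentence.start, sentence.start));
        while (current != null && current.compareTo(upper) < 0) {
            visited.add(current);
            spans.add(new Span(SpanType.CHUNK, current.start, current.end));
            current = spans.higher(current);
        }
        return visited;
    }

    // Offsets for the example text "Hello world. Bye now."
    static NavigableSet<Span> sample() {
        NavigableSet<Span> spans = new TreeSet<>();
        spans.add(new Span(SpanType.SENTENCE, 0, 12));
        spans.add(new Span(SpanType.SENTENCE, 13, 21));
        int[][] tokens = {{0, 5}, {6, 11}, {11, 12}, {13, 16}, {17, 20}, {20, 21}};
        for (int[] t : tokens) {
            spans.add(new Span(SpanType.TOKEN, t[0], t[1]));
        }
        return spans;
    }

    public static void main(String[] args) {
        NavigableSet<Span> spans = sample();
        Span first = spans.first(); // the sentence covering [0,12)
        System.out.println(tokensIn(spans, first));
        System.out.println(chunkEachToken(spans, first).size() + " tokens visited");
        System.out.println(spans.size() + " spans after adding chunks");
    }
}
```

Because all spans of one type are contiguous under this ordering, a pair of zero-length probe spans is enough to bound the range; no separate per-sentence index is needed.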
Here is some sample code:

    AnalysedText at; // the contentPart
    Iterator<Sentence> sentences = at.getSentences();
    while (sentences.hasNext()) {
        Sentence sentence = sentences.next();
        String sentText = sentence.getSpan();
        // SentenceToken is assumed to be a subtype of Token
        Iterator<SentenceToken> tokens = sentence.getTokens();
        while (tokens.hasNext()) {
            Token token = tokens.next();
            String tokenText = token.getSpan();
            Value<PosTag> pos = token.getAnnotation(
                NlpAnnotations.posAnnotation);
            String tag = pos.value().getTag();
            double confidence = pos.probability();
        }
    }

NLP annotations
=====

* TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and contains Tags of a specific generic type. The Tag itself only defines a String "tag" property.
* Currently Tags for POS (PosTag) and Chunking (PhraseTag) are defined. Both also define an optional LexicalCategory. This is an enum with the 12 top-level concepts defined by the [OLIA](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb, Adjective, Adposition, Adverb ...).
* TagSets (including mapped LexicalCategories) are defined for all languages for which OpenNLP POS taggers are available. This also covers the "penn.owl", "stts.owl" and "parole_es_cat.owl" models provided by OLIA. The other TagSets used by OpenNLP are currently not available from OLIA.
* Note that the LexicalCategory can be used to process POS annotations of different languages.

TagSet: https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
POS: https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos

A code sample:

    TagSet<PosTag> tagSet;      // the used TagSet
    Map<String,PosTag> unknown; // tags missing from the TagSet
    Token token;                // the token
    String tag;                 // the detected tag
    double prob;                // the probability

    PosTag pos = tagSet.getTag(tag);
    if (pos == null) { // unknown tag
        pos = unknown.get(tag);
    }
    if (pos == null) {
        pos = new PosTag(tag);  // this tag will not have a LexicalCategory
        unknown.put(tag, pos);  // only one instance per unknown tag
    }
    token.addAnnotation(
        NlpAnnotations.POSAnnotation,
        new Value<PosTag>(pos, prob));
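The note above on using the LexicalCategory across languages can be sketched as follows. This is a simplified illustration under stated assumptions, not the classes linked above: `LexicalCategorySketch`, its nested `TagSet`/`PosTag` types, and the helpers `isNoun` and `countNouns` are hypothetical, the enum lists only three of the twelve categories, and each tag set holds just a handful of Penn Treebank (English) and STTS (German) tags.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; not the Stanbol classes.
final class LexicalCategorySketch {

    // Only three of the twelve categories, to keep the sketch short.
    enum LexicalCategory { NOUN, VERB, ADJECTIVE }

    static final class PosTag {
        final String tag;
        final LexicalCategory category; // null if the tag is not mapped

        PosTag(String tag, LexicalCategory category) {
            this.tag = tag;
            this.category = category;
        }
    }

    static final class TagSet {
        private final Map<String, PosTag> tags = new HashMap<>();

        TagSet add(String tag, LexicalCategory category) {
            tags.put(tag, new PosTag(tag, category));
            return this;
        }

        // Returns the shared PosTag instance, or null for an unknown tag.
        PosTag getTag(String tag) {
            return tags.get(tag);
        }
    }

    // A few Penn Treebank tags (English).
    static TagSet penn() {
        return new TagSet()
                .add("NN", LexicalCategory.NOUN)
                .add("NNS", LexicalCategory.NOUN)
                .add("VB", LexicalCategory.VERB)
                .add("JJ", LexicalCategory.ADJECTIVE);
    }

    // A few STTS tags (German).
    static TagSet stts() {
        return new TagSet()
                .add("NN", LexicalCategory.NOUN)
                .add("NE", LexicalCategory.NOUN)
                .add("VVFIN", LexicalCategory.VERB)
                .add("ADJA", LexicalCategory.ADJECTIVE);
    }

    // Language-independent check: only the category is inspected,
    // never the language-specific tag string.
    static boolean isNoun(TagSet tagSet, String tag) {
        PosTag pos = tagSet.getTag(tag);
        return pos != null && pos.category == LexicalCategory.NOUN;
    }

    static int countNouns(TagSet tagSet, String... tags) {
        int count = 0;
        for (String tag : tags) {
            if (isNoun(tagSet, tag)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNouns(penn(), "NN", "VB", "NNS", "JJ"));
        System.out.println(countNouns(stts(), "NE", "VVFIN", "NN", "ADJA"));
    }
}
```

Downstream code such as `countNouns` never needs to know which tagger or language produced a tag, which is the point of mapping every TagSet onto the shared categories.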