Hi all,
First, thanks to Sebastian for writing this mail. I will try to add
some additional information to it.
Let me start with an overview of the AnalysedText API.
AnalysedText ContentPart
=====
You can find the source discussed in this part at
https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/model
* It wraps the text/plain ContentPart of a ContentItem
* It allows the definition of Spans (type, start, end, spanText). The
type is an enum: Text, TextSection, Sentence, Chunk, Token
* Spans are sorted naturally by type, start and end. This makes it
possible to use a NavigableSet (e.g. TreeSet) and the #subSet()
functionality to work with contained Tokens. The #higher and #lower
methods of NavigableSet even make it possible to build Iterators that
tolerate concurrent modifications (e.g. adding Chunks while iterating
over the Tokens of a Sentence).
* One can attach Annotations to Spans. This is basically a
multi-valued Map with Object keys and Value<valueType> value(s) that
supports a type-safe view via generically typed
Annotation<key,valueType> objects.
* The Value<valueType> object natively supports confidence values.
This allows (e.g. for POS tags) the same instance (e.g. the PosTag for
Noun) to be reused for all noun annotations.
* Note that the AnalysedText does NOT use RDF, as representing this
kind of data in RDF does not scale well enough. This also means that
the data of the AnalysedText is NOT available in the enhancement
metadata of the ContentItem. However, EnhancementEngines are free to
write all/some results to both the AnalysedText AND the RDF metadata
of the ContentItem.
Here is a code sample:
AnalysedText at; //the ContentPart
Iterator<Sentence> sentences = at.getSentences();
while(sentences.hasNext()){
    Sentence sentence = sentences.next();
    String sentText = sentence.getSpan();
    Iterator<Token> tokens = sentence.getTokens();
    while(tokens.hasNext()){
        Token token = tokens.next();
        String tokenText = token.getSpan();
        Value<PosTag> pos = token.getAnnotation(
            NlpAnnotations.POSAnnotation);
        String tag = pos.value().getTag();
        double confidence = pos.probability();
    }
}
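To illustrate the NavigableSet based design mentioned above, here is a
minimal, self-contained sketch. The Span class and SpanType enum below
are simplified stand-ins for the real API, and the exact sort order
(by start offset, enclosing spans first) is my assumption, not taken
from the implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class SpanSubSetDemo {

    // simplified stand-in for the real span type enum
    enum SpanType { SENTENCE, TOKEN }

    // minimal Span: ordered by start offset, enclosing (longer) spans first
    static final class Span implements Comparable<Span> {
        final SpanType type;
        final int start, end;
        Span(SpanType type, int start, int end) {
            this.type = type; this.start = start; this.end = end;
        }
        @Override public int compareTo(Span o) {
            if (start != o.start) return Integer.compare(start, o.start);
            if (end != o.end) return Integer.compare(o.end, end); // longer span first
            return type.compareTo(o.type);
        }
        @Override public String toString() {
            return type + "[" + start + "," + end + "]";
        }
    }

    /** Collect the Tokens of the first Sentence via NavigableSet#subSet. */
    static List<String> firstSentenceTokens() {
        NavigableSet<Span> spans = new TreeSet<>();
        Span sent1 = new Span(SpanType.SENTENCE, 0, 11);
        Span sent2 = new Span(SpanType.SENTENCE, 12, 20);
        spans.add(sent1);
        spans.add(sent2);
        spans.add(new Span(SpanType.TOKEN, 0, 5));
        spans.add(new Span(SpanType.TOKEN, 6, 11));
        spans.add(new Span(SpanType.TOKEN, 12, 20));
        // everything strictly between the two Sentence spans are the
        // Tokens contained in the first Sentence
        List<String> tokens = new ArrayList<>();
        for (Span t : spans.subSet(sent1, false, sent2, false)) {
            tokens.add(t.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(firstSentenceTokens()); // [TOKEN[0,5], TOKEN[6,11]]
    }
}
```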
NLP annotations
=====
* TagSet and Tag<tagType>: a TagSet can be used for 1..n languages and
contains Tags of a specific generic type. The Tag itself only defines
a String "tag" property
* Currently Tags for POS (PosTag) and chunking (PhraseTag) are
defined. Both also define an optional LexicalCategory. This is an enum
with the 12 top-level concepts defined by the
[OLiA](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
Adjective, Adposition, Adverb ...)
* TagSets (including mapped LexicalCategories) are defined for all
languages for which POS taggers are available in OpenNLP. This
includes the "penn.owl", "stts.owl" and "parole_es_cat.owl" models
provided by OLiA. The other TagSets used by OpenNLP are currently not
available from OLiA.
* Note that the LexicalCategory can be used to process POS
annotations across different languages
TagSet:
https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
POS:
https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos
A code sample:
TagSet<PosTag> tagSet; //the used TagSet
Map<String,PosTag> unknown; //tags missing from the TagSet
Token token; //the token
String tag; //the detected tag
double prob; //the probability
PosTag pos = tagSet.getTag(tag);
if(pos == null){ //unknown tag
    pos = unknown.get(tag);
}
if(pos == null) {
    pos = new PosTag(tag);
    //this tag will not have a LexicalCategory
    unknown.put(tag, pos); //keep a single instance per tag
}
token.addAnnotation(
    NlpAnnotations.POSAnnotation,
    new Value<PosTag>(pos, prob));
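To illustrate the language-independent use of the LexicalCategory,
here is a self-contained sketch. The LexicalCategory enum and PosTag
class below are simplified stand-ins for the real classes, and the
tag/category pairings are only examples:

```java
public class CategoryDemo {

    // simplified stand-in: the real enum has 12 top-level OLiA concepts
    enum LexicalCategory { Noun, Verb, Adjective }

    // simplified stand-in for the real PosTag class
    static final class PosTag {
        final String tag;
        final LexicalCategory category; // may be null for unmapped tags
        PosTag(String tag, LexicalCategory category) {
            this.tag = tag; this.category = category;
        }
    }

    /** Language-independent check: works regardless of the TagSet used. */
    static boolean isNoun(PosTag pos) {
        return pos.category == LexicalCategory.Noun;
    }

    public static void main(String[] args) {
        PosTag penn = new PosTag("NN", LexicalCategory.Noun);   // English (Penn)
        PosTag parole = new PosTag("NC", LexicalCategory.Noun); // Spanish (PAROLE)
        // the same check works for tags of both languages
        System.out.println(isNoun(penn) && isNoun(parole)); // prints "true"
    }
}
```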
In the second part I will try to lay out future plans and TODOs.
1. Next Steps:
* The most important thing was already started by this mail thread
- discussing this within the Stanbol community. I am on vacation for
the next two weeks, but I will have time to participate in such a
discussion.
* Migrate the sentiment engine to the recent API changes of the
AnalysedText ContentPart? Does anyone know of a Sentiment Ontology?
* AnalysedText and Annotations currently do not keep
creator/contributor and creation/modification date information. Those
might be needed to convert them to fise:Enhancements - are there any
use cases where one would want to add such memory-consuming
information?
2. near-term TODOs: things I would like to start in August
* contribute this work to Apache Stanbol: based on the
feedback/discussion we plan to do this as one of the first things
after vacation. Having this feature within Stanbol is important, as it
opens up a lot of opportunities for existing components (see 3.)
* adapt the KeywordLinkingEngine to use the AnalysedText: this
would allow any NLP framework to be used for preprocessing the text
before linking its Tokens with a vocabulary. It would also solve the
issue that text needs to be processed n times for n configured
KeywordLinkingEngines. In addition, this would allow lemma information
(if available) to be used for linking.
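A rough sketch of how lemma-based linking could look. The vocabulary,
entity URI and helper method below are hypothetical and not part of
any existing engine:

```java
import java.util.Map;
import java.util.Optional;

public class LemmaLinkingSketch {

    // hypothetical vocabulary: label -> entity URI
    static final Map<String, String> VOCABULARY = Map.of(
        "city", "http://example.org/entity/City");

    /**
     * Look up a Token in the vocabulary, preferring its lemma
     * (if a lemmatizer ran earlier in the chain) over the surface form.
     */
    static Optional<String> link(String surface, String lemma) {
        String key = (lemma != null) ? lemma : surface;
        return Optional.ofNullable(VOCABULARY.get(key.toLowerCase()));
    }

    public static void main(String[] args) {
        // "cities" does not match the vocabulary label, but its lemma "city" does
        System.out.println(link("cities", "city")); // Optional[http://example.org/entity/City]
        System.out.println(link("cities", null));   // Optional.empty
    }
}
```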
3. mid-term improvements and opportunities:
* nlp2rdf (NIF): I am confident that one could implement an
EnhancementEngine that converts the data of the AnalysedText to RDF
data compatible with NIF, as suggested by Sebastian Hellmann here on
the list (see [1]). While converting all NLP-related information to
RDF is not something one would want to do in a typical text
enhancement chain, this is an important feature for some use cases AND
it might also help during development/configuration and debugging.
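For illustration: NIF identifies text spans with RFC 5147 style
character offset fragments, so a conversion engine could derive span
URIs directly from the start/end offsets of the AnalysedText Spans.
The document URI and helper below are hypothetical:

```java
public class NifUriSketch {

    /** Build an RFC 5147 style offset URI as used by NIF, e.g. doc#char=0,5 */
    static String spanUri(String docUri, int start, int end) {
        return docUri + "#char=" + start + "," + end;
    }

    public static void main(String[] args) {
        String doc = "http://example.org/doc1"; // hypothetical ContentItem URI
        System.out.println(spanUri(doc, 0, 5)); // prints "http://example.org/doc1#char=0,5"
    }
}
```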
* CELI lemmatizer: currently this engine can provide POS tags and
lemmas as RDF in the metadata. Migrating this engine to the
AnalysedText would e.g. allow its results to be used by the
KeywordLinkingEngine. In addition, the AnalysedText ContentPart would
also make it much simpler to add the discussed CELI sentiment engine
[2].
* addition of new kinds of EnhancementEngines (as mentioned in
Sebastian's mail)
best
Rupert
[1] http://markmail.org/message/oq3y4ae2rhtbmpri
[2] http://markmail.org/message/m3m6vox46vewgomi
On Mon, Aug 6, 2012 at 11:10 AM, Sebastian Schaffert
<[email protected]> wrote:
> Dear all,
>
> Rupert and I have been working on porting some of our OpenNLP based natural
> language processing to Apache Stanbol. While not yet completely finished, we
> decided it might be worthwhile for you all to have a look on it and maybe
> even contribute. I will try to briefly summarise the goals and current state
> of implementation:
>
> Goals
> =====
>
> 1. provide a modular infrastructure for NLP-related things
>
> Many tasks in NLP can be computationally intensive, and there is no "one fits
> all" NLP approach when analysing text. Therefore, we wanted to have a NLP
> infrastructure that can be configured and wired together as needed for the
> specific use case, with several specialised modules that can build upon each
> other but many of which are optional.
>
> 2. provide a unified data model for representing NLP text annotations
>
> In many scenarios, it will be necessary to implement custom engines building
> on the results of a previous "generic" analysis of the text (e.g. POS tagging
> and chunking). For example, in a project we are identifying so-called "noun
> phrases", use a lemmatizer to build the ground form, then convert this to
> singular nominative form to have a grammatically correct label to use in a tag
> cloud. Most of this builds on generic NLP functionality, but the last step is
> very specific to the use case.
>
> Therefore, we wanted also to implement a generic NLP data model that allows
> representing text annotations attached to individual words or also to spans
> of words.
>
>
> Current State
> =============
>
> Currently, the unified data model has been implemented by Rupert in a first
> version. He has tested it thoroughly and it is reliable and useful for the
> scenarios we had in mind. The current enhancement engines are using OpenNLP
> for analysis, but the model can in general be used by any NLP engine that
> associates tags with tokens or spans of tokens.
>
> I have in the meantime concentrated on implementing modules for different
> NLP tasks. The following modules are already finished:
>
> - POS Tagger: takes text/plain from a content item and stores an AnalyzedText
> content part in the content item where each token is assigned its grammar POS
> tag
> - Chunker (Noun Phrase Detector): takes a content item with AnalyzedText
> content part (from POS tagger) and applies noun phrase chunking on the token
> stream; results are annotated token spans that are stored in the AnalyzedText
> - Sentiment Analyzer (English/German): takes a content item with AnalyzedText
> content part (from POS tagger) and assigns sentiment values to each token in
> the stream; results are annotated tokens that are stored in the AnalyzedText
>
> In progress:
> - Lemmatizer (English/German): takes a token stream (POS tagged AnalyzedText)
> and adds the lemma for each token to the AnalyzedText content part
>
>
> Future work
> ===========
>
> Based on these generic modules, we intend to implement a number of "NLP
> result summarizers" that take the results in an AnalyzedText and perform some
> post processing on them, storing them as RDF in the metadata associated with
> the content item. Some ideas:
> - Average Sentiment: compute the average sentiment value for the text by
> summing all sentiment values and dividing them by the number of annotated
> tokens
> - Improved Sentiment: take into account negations in a sentence before a
> sentiment value and invert the values in this case; otherwise like average
> sentiment.
> - Per-Noun Sentiment: associate sentiment values with each noun occurring in
> the text by taking into account the sentiment values of adjectives associated
> with the noun in a noun phrase and negations before them; result are text
> annotations where each noun is associated with a sentiment value, so you
> could say "Product XYZ is typically mentioned with an average sentiment of
> 0.N"
> - Noun Adjectives: collect the adjectives that are commonly used in
> association with a noun by using the noun phrases and taking the adjectives
> - Simple Tag Cloud: take nouns, build lemmatized form, generate a tag cloud
> in the metadata
> - Noun Phrase Cloud: take noun phrases, build lemmatized form, build
> nominative singular form, generate tag cloud; this is useful when you want to
> provide more context for the tags, e.g. in faceted search ("red car", "blue
> car").
>
> The possibilities are literally endless… feel free to think about other
> options :)
>
>
> Availability
> ============
>
> Since this is still experimental code, we have for the time being set up a
> separate (public) repository:
>
> https://bitbucket.org/srfgkmt/stanbol-nlp
>
> When it is more-or-less finished, we would however like to include this into
> the main Stanbol code base so others can more easily benefit from it. Feel
> free to look at what we have implemented there!
>
> ;-)
>
> Sebastian
> --
> | Dr. Sebastian Schaffert [email protected]
> | Salzburg Research Forschungsgesellschaft http://www.salzburgresearch.at
> | Head of Knowledge and Media Technologies Group +43 662 2288 423
> | Jakob-Haringer Strasse 5/II
> | A-5020 Salzburg
>
--
| Rupert Westenthaler [email protected]
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen