Hi,

thanks for sharing these ideas. It fits perfectly into Stanbol as an
important part of the content enhancement process and would enable
people to dig into the art of programming a high-quality engine.

I never had the time to play with UIMA, but have you checked out how
this task is handled there? Is there anything, good or bad, we can
learn from it?

Best,
 - Fabian

2012/8/6 harish suvarna <[email protected]>:
> Really, really exciting to hear all this. Quality enhancement engines are
> key to this great platform.
> Count me in as a developer, tester, reviewer, contributor, or in any other
> role.
> More on this subject later this week when I have some free time.
>
> Thanks,
> Harish
>
>
> On Mon, Aug 6, 2012 at 4:52 AM, Rupert Westenthaler <
> [email protected]> wrote:
>
>> Hi all,
>>
>> First, thanks to Sebastian for writing this mail. I will try to add
>> some additional information to it.
>>
>> Let me start with an overview of the AnalysedText API.
>>
>> AnalysedText ContentPart
>> =====
>>
>> You can find the source discussed in this part at
>>
>>
>> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/model
>>
>> * It wraps the text/plain ContentPart of a ContentItem
>> * It allows the definition of Spans (type, start, end, spanText). The
>> type is an enum: Text, TextSection, Sentence, Chunk, Token.
>> * Spans have a natural ordering by type, start and end. This makes it
>> possible to use a NavigableSet (e.g. TreeSet) and its #subSet()
>> functionality to work with contained Tokens. The #higher and #lower
>> methods of NavigableSet even allow building Iterators that tolerate
>> concurrent modifications (e.g. adding Chunks while iterating over the
>> Tokens of a Sentence); see the sketch below this list.
>> * One can attach Annotations to Spans. This is basically a multi-valued
>> Map with Object keys and Value<valueType> value(s) that supports a
>> type-safe view via the generically typed Annotation<key,valueType>.
>> * The Value<valueType> object natively supports a confidence. This
>> allows (e.g. for POS tags) the same instance (e.g. the PosTag for
>> Noun) to be reused for all noun annotations.
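>>
>> To illustrate the ordering and the #subSet() usage, here is a minimal,
>> self-contained sketch. Note that it uses a made-up SimpleSpan class,
>> NOT the classes from the repository (the real Spans additionally order
>> by their type):
>>
>>     import java.util.NavigableSet;
>>     import java.util.TreeSet;
>>
>>     public class SpanOrderingSketch {
>>
>>         static class SimpleSpan implements Comparable<SimpleSpan> {
>>             final int start, end;
>>             SimpleSpan(int start, int end) { this.start = start; this.end = end; }
>>             //order by start ascending, then end descending, so that a
>>             //section sorts directly before the spans it contains
>>             public int compareTo(SimpleSpan o) {
>>                 int c = Integer.compare(start, o.start);
>>                 return c != 0 ? c : Integer.compare(o.end, end);
>>             }
>>             public String toString() { return "[" + start + "," + end + ")"; }
>>         }
>>
>>         public static void main(String[] args) {
>>             NavigableSet<SimpleSpan> spans = new TreeSet<SimpleSpan>();
>>             SimpleSpan sentence = new SimpleSpan(0, 20);
>>             spans.add(sentence);
>>             spans.add(new SimpleSpan(0, 5));   //token
>>             spans.add(new SimpleSpan(6, 11));  //token
>>             spans.add(new SimpleSpan(12, 20)); //token
>>             spans.add(new SimpleSpan(21, 40)); //next sentence
>>
>>             //everything sorted after the sentence span but starting
>>             //before offset 20: exactly the tokens of that sentence
>>             for (SimpleSpan token : spans.subSet(sentence, false,
>>                     new SimpleSpan(20, Integer.MAX_VALUE), false)) {
>>                 System.out.println(token); //[0,5) [6,11) [12,20)
>>             }
>>         }
>>     }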
>>
>> * Note that the AnalysedText does NOT use RDF for representing this
>> kind of data, as RDF does not scale well enough for it. This also means
>> that the data of the AnalysedText are NOT available in the Enhancement
>> Metadata of the ContentItem. However, EnhancementEngines are free to
>> write all/some results to both the AnalysedText AND the RDF metadata of
>> the ContentItem.
>>
>> Here is some sample code:
>>
>>     AnalysedText at; //the contentPart
>>     Iterator<Sentence> sentences = at.getSentences();
>>     while(sentences.hasNext()){
>>         Sentence sentence = sentences.next();
>>         String sentText = sentence.getSpan();
>>         Iterator<Token> tokens = sentence.getTokens();
>>         while(tokens.hasNext()){
>>             Token token = tokens.next();
>>             String tokenText = token.getSpan();
>>             Value<PosTag> pos = token.getAnnotation(
>>                 NlpAnnotations.posAnnotation);
>>             String tag = pos.value().getTag();
>>             double confidence = pos.probability();
>>         }
>>     }
>>
>> NLP annotations
>> =====
>>
>> * TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
>> contains Tags of a specific generic type. The Tag itself only defines a
>> String "tag" property.
>> * Currently, Tags for POS (PosTag) and chunking (PhraseTag) are
>> defined. Both also define an optional LexicalCategory. This is an enum
>> with the 12 top-level concepts defined by the
>> [Olia](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
>> Adjective, Adposition, Adverb ...).
>> * TagSets (including mapped LexicalCategories) are defined for all
>> languages for which POS taggers are available in OpenNLP. This also
>> includes the "penn.owl", "stts.owl" and "parole_es_cat.owl" provided by
>> OLiA. The other TagSets used by OpenNLP are currently not available
>> from OLiA.
>> * Note that the LexicalCategory can be used to process POS annotations
>> across different languages (see the sketch after the code sample below).
>>
>> TagSet:
>> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
>> POS:
>> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos
>>
>>
>> A code sample:
>>
>>     TagSet<PosTag> tagSet; //the used TagSet
>>     Map<String,PosTag> unknown; //missing tags in the TagSet
>>
>>     Token token; //the token
>>     String tag; //the detected tag
>>     double prob; //the probability
>>
>>     PosTag pos = tagSet.getTag(tag);
>>     if(pos == null){ //unknown tag
>>         pos = unknown.get(tag);
>>     }
>>     if(pos == null) {
>>         pos = new PosTag(tag);
>>         //this tag will not have a LexicalCategory
>>         unknown.put(tag, pos); //keep only one instance per tag
>>     }
>>     token.addAnnotation(
>>         NlpAnnotations.POSAnnotation,
>>         new Value<PosTag>(pos, prob));
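>>
>> And here is a small sketch of how the LexicalCategory allows
>> language-independent processing of POS annotations. Note that the
>> getCategory() accessor and the processNoun(..) handler are only
>> assumed for illustration and may not match the actual API:
>>
>>     Token token; //a POS tagged token (any language)
>>     Value<PosTag> pos = token.getAnnotation(
>>         NlpAnnotations.posAnnotation);
>>     //ASSUMPTION: PosTag exposes its optional LexicalCategory via
>>     //getCategory()
>>     if(pos != null && pos.value().getCategory() == LexicalCategory.Noun
>>             && pos.probability() > 0.75){ //arbitrary confidence threshold
>>         //treat the token as a noun, regardless of whether the tag came
>>         //from the penn, stts or parole tag set
>>         processNoun(token); //hypothetical downstream handler
>>     }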
>>
>>
>> In the second part I will try to lay out future plans and TODOs
>>
>> 1. Next Steps:
>>
>>     * The most important thing was already started by this mail thread
>> - to discuss this within the Stanbol community. I am on vacation for
>> the next two weeks, but I will still have time to participate in the
>> discussion.
>>
>>     * Migrate the sentiment engine to the recent API changes of the
>> AnalysedText ContentPart? Does anyone know of a sentiment ontology?
>>
>>     * AnalyzedText and Annotations currently do not keep
>> creator/contributor and creation/modification date information. Those
>> might be needed to convert them to fise:Enhancements - are there any
>> use cases that would justify adding this memory-consuming information?
>>
>> 2. near-term TODOs: things I would like to start in August
>>
>>     * contribute this work to Apache Stanbol: based on the
>> feedback/discussion we plan to do this as one of the first things
>> after vacation. Having this feature within Stanbol is important, as it
>> opens up a lot of opportunities for existing components (see 3.).
>>
>>     * adapt the KeywordLinkingEngine to use the AnalyzedText: This
>> would allow using any NLP framework to preprocess the text before
>> linking its Tokens with a vocabulary. It would also solve the issue
>> that the text currently needs to be processed n times for n configured
>> KeywordLinkingEngines. In addition, this would allow using lemma
>> information (if available) for linking.
>>
>> 3. mid-term improvements and opportunities:
>>
>>     * nlp2rdf (NIF): I am confident that one could implement an
>> EnhancementEngine that converts the data of the AnalyzedText to RDF
>> data compatible with NIF, as suggested by Sebastian Hellmann here on
>> the list (see [1]). While converting all NLP-related information to RDF
>> is not something one would want to do in a typical text enhancement
>> chain, this is an important feature for some use cases AND it might
>> also help during development/configuration and debugging.
>>
>>     * CELI lemmatizer: Currently this engine can provide POS tags and
>> lemmas as RDF in the metadata. Migrating this engine to the
>> AnalyzedText would e.g. allow its results to be used by the
>> KeywordLinkingEngine. In addition, the AnalysedText ContentPart would
>> also make it much simpler to add the discussed CELI sentiment engine
>> [2].
>>
>>     * Addition of new kinds of EnhancementEngines (as mentioned in
>> Sebastian's mail)
>>
>> best
>> Rupert
>>
>> [1] http://markmail.org/message/oq3y4ae2rhtbmpri
>> [2] http://markmail.org/message/m3m6vox46vewgomi
>>
>> On Mon, Aug 6, 2012 at 11:10 AM, Sebastian Schaffert
>> <[email protected]> wrote:
>> > Dear all,
>> >
>> > Rupert and I have been working on porting some of our OpenNLP-based
>> natural language processing to Apache Stanbol. While not yet completely
>> finished, we decided it might be worthwhile for you all to have a look at
>> it and maybe even contribute. I will try to briefly summarise the goals and
>> current state of implementation:
>> >
>> > Goals
>> > =====
>> >
>> > 1. provide a modular infrastructure for NLP-related things
>> >
>> > Many tasks in NLP can be computationally intensive, and there is no
>> "one size fits all" NLP approach when analysing text. Therefore, we wanted
>> to have an NLP infrastructure that can be configured and wired together as
>> needed for the specific use case, with several specialised modules that can
>> build upon each other but many of which are optional.
>> >
>> > 2. provide a unified data model for representing NLP text annotations
>> >
>> > In many scenarios, it will be necessary to implement custom engines
>> building on the results of a previous "generic" analysis of the text (e.g.
>> POS tagging and chunking). For example, in one project we identify
>> so-called "noun phrases", use a lemmatizer to build the base form, and then
>> convert this to the nominative singular form to get a grammatically correct
>> label to use in a tag cloud. Most of this builds on generic NLP
>> functionality, but the last step is very specific to the use case.
>> >
>> > Therefore, we also wanted to implement a generic NLP data model that
>> allows representing text annotations attached to individual words or to
>> spans of words.
>> >
>> >
>> > Current State
>> > =============
>> >
>> > Currently, the unified data model has been implemented by Rupert in a
>> first version. He has tested it thoroughly and it is reliable and useful
>> for the scenarios we had in mind. The current enhancement engines are using
>> OpenNLP for analysis, but the model can in general be used by any NLP
>> engine that associates tags with tokens or spans of tokens.
>> >
>> >  I have in the meantime concentrated on implementing modules for
>> different NLP tasks. The following modules are already finished:
>> >
>> > - POS Tagger: takes text/plain from a content item and stores an
>> AnalyzedText content part in the content item where each token is assigned
>> its grammatical part-of-speech (POS) tag
>> > - Chunker (Noun Phrase Detector): takes a content item with AnalyzedText
>> content part (from POS tagger) and applies noun phrase chunking on the
>> token stream; results are annotated token spans that are stored in the
>> AnalyzedText
>> > - Sentiment Analyzer (English/German): takes a content item with
>> AnalyzedText content part (from POS tagger) and assigns sentiment values to
>> each token in the stream; results are annotated tokens that are stored in
>> the AnalyzedText
>> >
>> > In progress:
>> > - Lemmatizer (English/German): takes a token stream (POS tagged
>> AnalyzedText) and adds the lemma for each token to the AnalyzedText content
>> part
>> >
>> >
>> > Future work
>> > ===========
>> >
>> > Based on these generic modules, we intend to implement a number of "NLP
>> result summarizers" that take the results in an AnalyzedText and perform
>> some post processing on them, storing them as RDF in the metadata
>> associated with the content item. Some ideas:
>> > - Average Sentiment: compute the average sentiment value for the text by
>> summing all sentiment values and dividing by the number of annotated
>> tokens (a rough sketch follows after this list)
>> > - Improved Sentiment: take into account negations in a sentence before a
>> sentiment value and invert the value in that case; otherwise like Average
>> Sentiment.
>> > - Per-Noun Sentiment: associate sentiment values with each noun
>> occurring in the text by taking into account the sentiment values of
>> adjectives associated with the noun in a noun phrase and negations before
>> them; the results are text annotations where each noun is associated with a
>> sentiment value, so you could say "Product XYZ is typically mentioned with
>> an average sentiment of 0.N"
>> > - Noun Adjectives: collect the adjectives that are commonly used in
>> association with a noun by extracting them from the detected noun phrases
>> > - Simple Tag Cloud: take nouns, build lemmatized form, generate a tag
>> cloud in the metadata
>> > - Noun Phrase Cloud: take noun phrases, build lemmatized form, build
>> nominative singular form, generate tag cloud; this is useful when you want
>> to provide more context for the tags, e.g. in faceted search ("red car",
>> "blue car").
>> >
>> > The possibilities are literally endless… feel free to think about other
>> options :)
>> >
>> >
>> > Availability
>> > ============
>> >
>> > Since this is still experimental code, we have for the time being set up
>> a separate (public) repository:
>> >
>> > https://bitbucket.org/srfgkmt/stanbol-nlp
>> >
>> > When it is more or less finished, we would, however, like to include it
>> in the main Stanbol code base so others can more easily benefit from it.
>> Feel free to look at what we have implemented there!
>> >
>> > ;-)
>> >
>> > Sebastian
>> > --
>> > | Dr. Sebastian Schaffert
>> [email protected]
>> > | Salzburg Research Forschungsgesellschaft
>> http://www.salzburgresearch.at
>> > | Head of Knowledge and Media Technologies Group          +43 662 2288
>> 423
>> > | Jakob-Haringer Strasse 5/II
>> > | A-5020 Salzburg
>> >
>>
>>
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>



-- 
Fabian
http://twitter.com/fctwitt
