Hi all,

In preparation for the upcoming code contribution I created a set of issues describing the features discussed in this thread:
https://issues.apache.org/jira/browse/STANBOL-733

As soon as a patch with the current state of the development is attached to the issue (hopefully in the coming days) I will create a separate branch for the further development.

best
Rupert

On Tue, Aug 7, 2012 at 1:27 PM, Fabian Christ <[email protected]> wrote:
> Hi,
>
> thanks for sharing these ideas. It totally fits into Stanbol as an
> important part of the content enhancement process. This would enable
> people to dig into the art of programming a high quality engine.
>
> I have never had the time to play with UIMA, but have you guys checked
> out how this task is performed there? Is there something good or bad
> we can learn from them?
>
> Best,
> - Fabian
>
> 2012/8/6 harish suvarna <[email protected]>:
>> Really, really exciting to hear all this. Quality enhancement engines
>> are key to this great platform.
>> Count me in as developer, tester, reviewer and contributor or in any
>> other role.
>> More on this subject later this week when I have some free time.
>>
>> Thanks,
>> Harish
>>
>>
>> On Mon, Aug 6, 2012 at 4:52 AM, Rupert Westenthaler <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> First, thanks to Sebastian for writing this mail. I will try to add
>>> some additional information to it.
>>>
>>> First let me provide an overview of the AnalysedText API.
>>>
>>> AnalysedText ContentPart
>>> =====
>>>
>>> You can find the source discussed in this part at
>>>
>>> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/model
>>>
>>> * It wraps the text/plain ContentPart of a ContentItem.
>>> * It allows the definition of Spans (type, start, end, spanText). The
>>> type is an enum: Text, TextSection, Sentence, Chunk, Token.
>>> * Spans are sorted naturally by type, start and end. This makes it
>>> possible to use a NavigableSet (e.g. TreeSet) and the #subSet()
>>> functionality to work with contained Tokens. The #higher and #lower
>>> methods of NavigableSet even allow building Iterators that tolerate
>>> concurrent modifications (e.g. adding Chunks while iterating over the
>>> Tokens of a Sentence).
>>> * One can attach Annotations to Spans: basically a multi-valued Map
>>> with Object keys and Value<valueType> value(s) that supports a
>>> type-safe view via the generically typed Annotation<key,valueType>.
>>> * The Value<valueType> object natively supports confidence. This
>>> allows (e.g. for POS tags) the same instance (e.g. the PosTag for
>>> Noun) to be reused for all noun annotations.
>>>
>>> * Note that the AnalysedText does NOT use RDF for representing this
>>> kind of data, as RDF is not scalable enough. This also means that the
>>> data of the AnalysedText is NOT available in the Enhancement Metadata
>>> of the ContentItem. However, EnhancementEngines are free to write
>>> all/some results to the AnalysedText AND to the RDF metadata of the
>>> ContentItem.
>>>
>>> Here is some sample code:
>>>
>>> AnalysedText at; //the AnalysedText ContentPart
>>> Iterator<Sentence> sentences = at.getSentences();
>>> while(sentences.hasNext()){
>>>     Sentence sentence = sentences.next();
>>>     String sentText = sentence.getSpan();
>>>     Iterator<Token> tokens = sentence.getTokens();
>>>     while(tokens.hasNext()){
>>>         Token token = tokens.next();
>>>         String tokenText = token.getSpan();
>>>         //read the POS annotation of the token
>>>         Value<PosTag> pos = token.getAnnotation(
>>>             NlpAnnotations.POSAnnotation);
>>>         String tag = pos.value().getTag();
>>>         double confidence = pos.probability();
>>>     }
>>> }
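>>>
>>> To make the ordering idea concrete, here is a minimal self-contained
>>> sketch (plain Java, deliberately independent of the actual model API
>>> above) of how the natural ordering lets #subSet() select the spans
>>> enclosed by a Sentence:
>>>
>>> import java.util.NavigableSet;
>>> import java.util.TreeSet;
>>>
>>> public class SpanOrderingDemo {
>>>
>>>     enum SpanType { TEXT, TEXT_SECTION, SENTENCE, CHUNK, TOKEN }
>>>
>>>     //simplified stand-in for the real Span hierarchy
>>>     static class Span implements Comparable<Span> {
>>>         final SpanType type;
>>>         final int start, end;
>>>         Span(SpanType type, int start, int end) {
>>>             this.type = type; this.start = start; this.end = end;
>>>         }
>>>         //start offset first, then longer (enclosing) spans, then type,
>>>         //so a Sentence sorts directly before the Tokens it contains
>>>         public int compareTo(Span o) {
>>>             if (start != o.start) return Integer.compare(start, o.start);
>>>             if (end != o.end) return Integer.compare(o.end, end);
>>>             return type.compareTo(o.type);
>>>         }
>>>         public String toString() {
>>>             return type + "[" + start + "," + end + "]";
>>>         }
>>>     }
>>>
>>>     public static void main(String[] args) {
>>>         NavigableSet<Span> spans = new TreeSet<Span>();
>>>         Span sentence = new Span(SpanType.SENTENCE, 0, 11);
>>>         spans.add(sentence);
>>>         spans.add(new Span(SpanType.TOKEN, 0, 5));   // "Hello"
>>>         spans.add(new Span(SpanType.TOKEN, 6, 11));  // "world"
>>>         spans.add(new Span(SpanType.TOKEN, 12, 15)); // next sentence
>>>         //all spans enclosed by the sentence, without a full scan
>>>         System.out.println(spans.subSet(
>>>             sentence, false,
>>>             new Span(SpanType.TOKEN, sentence.end, sentence.end), false));
>>>         //prints [TOKEN[0,5], TOKEN[6,11]]
>>>     }
>>> }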
>>>
>>> NLP annotations
>>> =====
>>>
>>> * TagSet and Tag<tagType>: A TagSet can be used for 1..n languages
>>> and contains Tags of a specific generic type. The Tag only defines a
>>> String "tag" property.
>>> * Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
>>> defined. Both also define an optional LexicalCategory. This is an
>>> enum with the 12 top level concepts defined by the
>>> [Olia](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
>>> Adjective, Adposition, Adverb ...).
>>> * TagSets (including mapped LexicalCategories) are defined for all
>>> languages for which POS taggers are available in OpenNLP. This also
>>> includes the "penn.owl", "stts.owl" and "parole_es_cat.owl" provided
>>> by OLIA. The other TagSets used by OpenNLP are currently not covered
>>> by Olia.
>>> * Note that the LexicalCategory can be used to process POS
>>> annotations across different languages.
>>>
>>> TagSet:
>>> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
>>> POS:
>>> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos
>>>
>>> A code sample:
>>>
>>> TagSet<PosTag> tagSet;      //the used TagSet
>>> Map<String,PosTag> unknown; //tags missing in the TagSet
>>>
>>> Token token; //the token
>>> String tag;  //the detected tag
>>> double prob; //the probability
>>>
>>> PosTag pos = tagSet.getTag(tag);
>>> if(pos == null){ //unknown tag
>>>     pos = unknown.get(tag);
>>> }
>>> if(pos == null){
>>>     //this tag will not have a LexicalCategory
>>>     pos = new PosTag(tag);
>>>     unknown.put(tag, pos); //keep only one instance per tag
>>> }
>>> token.addAnnotation(
>>>     NlpAnnotations.POSAnnotation,
>>>     new Value<PosTag>(pos, prob));
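>>>
>>> For the producer side, a short sketch of how such a TagSet with
>>> mapped LexicalCategories might be populated - the TagSet and PosTag
>>> constructor signatures used here are assumptions based on the
>>> description above, not verified against the Bitbucket code:
>>>
>>> import org.apache.stanbol.enhancer.nlp.TagSet;
>>> import org.apache.stanbol.enhancer.nlp.pos.LexicalCategory;
>>> import org.apache.stanbol.enhancer.nlp.pos.PosTag;
>>>
>>> //a TagSet for the English Penn Treebank tags
>>> TagSet<PosTag> penn = new TagSet<PosTag>("Penn Treebank", "en");
>>> penn.addTag(new PosTag("NN", LexicalCategory.Noun));      //singular noun
>>> penn.addTag(new PosTag("VB", LexicalCategory.Verb));      //base form verb
>>> penn.addTag(new PosTag("JJ", LexicalCategory.Adjective)); //adjective
>>> penn.addTag(new PosTag("FW")); //foreign word: no LexicalCategory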
>>>
>>> In the second part I will try to lay out future plans and TODOs.
>>>
>>> 1. Next Steps:
>>>
>>> * The most important thing was already started by this mail thread
>>> - to discuss this within the Stanbol Community. I am on vacation for
>>> the next two weeks, but I will have time to participate in such a
>>> discussion.
>>>
>>> * Migrate the sentiment engine to the recent API changes of the
>>> AnalysedText ContentPart? Does anyone know of a Sentiment Ontology?
>>>
>>> * AnalyzedText and Annotations currently do not keep
>>> creator/contributor and creation/modification date information. Those
>>> might be needed to convert them to fise:Enhancements - are there any
>>> use cases for which one would want to add this memory-consuming
>>> information?
>>>
>>> 2. near-term TODOs: things I would like to start in August
>>>
>>> * contribute this work to Apache Stanbol: Based on the
>>> feedback/discussion we plan to do this as one of the first things
>>> after vacation. Having this feature within Stanbol is important as it
>>> offers a lot of opportunities for existing Components (see 3.).
>>>
>>> * adapt the KeywordLinkingEngine to use the AnalyzedText: This
>>> would allow any NLP framework to be used for preprocessing the text
>>> before linking its Tokens with a vocabulary. It would also solve the
>>> issue that the text needs to be processed n times for n configured
>>> KeywordLinkingEngines. In addition this would also allow lemma
>>> information (if available) to be used for linking.
>>>
>>> 3. mid-term improvements and opportunities:
>>>
>>> * nlp2rdf (NIF): I am confident that one could implement an
>>> EnhancementEngine that converts the data of the AnalyzedText to RDF
>>> data compatible with NIF, as suggested by Sebastian Hellmann here on
>>> the list (see [1]). While converting all NLP related information to
>>> RDF is not something one would want to do in a typical text
>>> enhancement chain, this is an important feature for some use cases
>>> AND it might also help during development/configuration and
>>> debugging.
>>>
>>> * CELI lemmatizer: Currently this Engine can provide POS tags and
>>> Lemmas as RDF in the metadata. Migrating this engine to the
>>> AnalyzedText would e.g. allow its results to be used by the
>>> KeywordLinkingEngine. In addition the AnalysedText ContentPart would
>>> also make it much simpler to add the discussed CELI sentiment engine
>>> [2].
>>>
>>> * Addition of new kinds of EnhancementEngines (as mentioned in
>>> Sebastian's mail).
>>>
>>> best
>>> Rupert
>>>
>>> [1] http://markmail.org/message/oq3y4ae2rhtbmpri
>>> [2] http://markmail.org/message/m3m6vox46vewgomi
>>>
>>> On Mon, Aug 6, 2012 at 11:10 AM, Sebastian Schaffert
>>> <[email protected]> wrote:
>>> > Dear all,
>>> >
>>> > Rupert and I have been working on porting some of our OpenNLP based
>>> > natural language processing to Apache Stanbol. While not yet
>>> > completely finished, we decided it might be worthwhile for you all
>>> > to have a look at it and maybe even contribute. I will try to
>>> > briefly summarise the goals and current state of the
>>> > implementation:
>>> >
>>> > Goals
>>> > =====
>>> >
>>> > 1. provide a modular infrastructure for NLP-related things
>>> >
>>> > Many tasks in NLP can be computationally intensive, and there is no
>>> > "one size fits all" NLP approach when analysing text. Therefore, we
>>> > wanted to have an NLP infrastructure that can be configured and
>>> > wired together as needed for the specific use case, with several
>>> > specialised modules that can build upon each other but many of
>>> > which are optional.
>>> >
>>> > 2. provide a unified data model for representing NLP text
>>> > annotations
>>> >
>>> > In many scenarios, it will be necessary to implement custom engines
>>> > building on the results of a previous "generic" analysis of the
>>> > text (e.g. POS tagging and chunking). For example, in a project we
>>> > identify so-called "noun phrases", use a lemmatizer to build the
>>> > ground form, then convert this to the singular nominative form to
>>> > have a grammatically correct label to use in a tag cloud. Most of
>>> > this builds on generic NLP functionality, but the last step is very
>>> > specific to the use case.
>>> >
>>> > Therefore, we also wanted to implement a generic NLP data model
>>> > that allows representing text annotations attached to individual
>>> > words or to spans of words.
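>>> >
>>> > As an illustration of what such a data model boils down to, here is
>>> > a small self-contained sketch (plain Java, mirroring the
>>> > Annotation/Value idea from Rupert's mail rather than the actual
>>> > implementation) of type-safe annotations attached to spans:
>>> >
>>> > import java.util.ArrayList;
>>> > import java.util.HashMap;
>>> > import java.util.List;
>>> > import java.util.Map;
>>> >
>>> > public class AnnotatedSpanDemo {
>>> >
>>> >     //generically typed key, so lookups need no casts
>>> >     static final class Annotation<V> {
>>> >         final String name;
>>> >         Annotation(String name) { this.name = name; }
>>> >     }
>>> >
>>> >     //an annotation value with an attached confidence
>>> >     static final class Value<V> {
>>> >         final V value;
>>> >         final double probability;
>>> >         Value(V value, double probability) {
>>> >             this.value = value; this.probability = probability;
>>> >         }
>>> >     }
>>> >
>>> >     //a span of text with a multi-valued annotation map
>>> >     static class Span {
>>> >         final int start, end;
>>> >         private final Map<Annotation<?>, List<Value<?>>> annotations =
>>> >             new HashMap<Annotation<?>, List<Value<?>>>();
>>> >         Span(int start, int end) { this.start = start; this.end = end; }
>>> >
>>> >         <V> void addAnnotation(Annotation<V> key, Value<V> value) {
>>> >             List<Value<?>> values = annotations.get(key);
>>> >             if (values == null) {
>>> >                 values = new ArrayList<Value<?>>();
>>> >                 annotations.put(key, values);
>>> >             }
>>> >             values.add(value);
>>> >         }
>>> >
>>> >         @SuppressWarnings("unchecked") //safe: addAnnotation enforces it
>>> >         <V> Value<V> getAnnotation(Annotation<V> key) {
>>> >             List<Value<?>> values = annotations.get(key);
>>> >             return values == null ? null : (Value<V>) values.get(0);
>>> >         }
>>> >     }
>>> >
>>> >     static final Annotation<String> POS = new Annotation<String>("pos");
>>> >
>>> >     public static void main(String[] args) {
>>> >         Span token = new Span(0, 5);   //a single word
>>> >         token.addAnnotation(POS, new Value<String>("NN", 0.87));
>>> >         Span phrase = new Span(0, 11); //a span of words
>>> >         phrase.addAnnotation(POS, new Value<String>("NP", 0.75));
>>> >         System.out.println(token.getAnnotation(POS).value);        //NN
>>> >         System.out.println(phrase.getAnnotation(POS).probability); //0.75
>>> >     }
>>> > }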
>>> >
>>> > Current State
>>> > =============
>>> >
>>> > Currently, the unified data model has been implemented by Rupert in
>>> > a first version. He has tested it thoroughly and it is reliable and
>>> > useful for the scenarios we had in mind. The current enhancement
>>> > engines use OpenNLP for the analysis, but the model can in general
>>> > be used by any NLP engine that associates tags with tokens or spans
>>> > of tokens.
>>> >
>>> > I have in the meantime concentrated on implementing modules for
>>> > different NLP tasks. The following modules are already finished:
>>> >
>>> > - POS Tagger: takes text/plain from a content item and stores an
>>> > AnalyzedText content part in the content item where each token is
>>> > assigned its grammatical POS tag
>>> > - Chunker (Noun Phrase Detector): takes a content item with an
>>> > AnalyzedText content part (from the POS tagger) and applies noun
>>> > phrase chunking on the token stream; the results are annotated
>>> > token spans that are stored in the AnalyzedText
>>> > - Sentiment Analyzer (English/German): takes a content item with an
>>> > AnalyzedText content part (from the POS tagger) and assigns a
>>> > sentiment value to each token in the stream; the results are
>>> > annotated tokens that are stored in the AnalyzedText
>>> >
>>> > In progress:
>>> > - Lemmatizer (English/German): takes a token stream (POS tagged
>>> > AnalyzedText) and adds the lemma for each token to the AnalyzedText
>>> > content part
>>> >
>>> > Future work
>>> > ===========
>>> >
>>> > Based on these generic modules, we intend to implement a number of
>>> > "NLP result summarizers" that take the results in an AnalyzedText,
>>> > perform some post-processing on them, and store them as RDF in the
>>> > metadata associated with the content item. Some ideas:
>>> > - Average Sentiment: compute the average sentiment value for the
>>> > text by summing all sentiment values and dividing by the number of
>>> > annotated tokens (see the sketch after this list)
>>> > - Improved Sentiment: take into account negations in a sentence
>>> > before a sentiment value and invert the value in this case;
>>> > otherwise like Average Sentiment
>>> > - Per-Noun Sentiment: associate sentiment values with each noun
>>> > occurring in the text by taking into account the sentiment values
>>> > of adjectives associated with the noun in a noun phrase and
>>> > negations before them; the results are text annotations where each
>>> > noun is associated with a sentiment value, so you could say
>>> > "Product XYZ is typically mentioned with an average sentiment of
>>> > 0.N"
>>> > - Noun Adjectives: collect the adjectives that are commonly used in
>>> > association with a noun by using the noun phrases and taking the
>>> > adjectives
>>> > - Simple Tag Cloud: take the nouns, build the lemmatized form,
>>> > generate a tag cloud in the metadata
>>> > - Noun Phrase Cloud: take the noun phrases, build the lemmatized
>>> > form, build the nominative singular form, generate a tag cloud;
>>> > this is useful when you want to provide more context for the tags,
>>> > e.g. in faceted search ("red car", "blue car")
>>> >
>>> > The possibilities are literally endless… feel free to think about
>>> > other options :)
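>>> >
>>> > A minimal sketch of the Average/Improved Sentiment idea (assuming
>>> > per-token sentiment values as plain doubles, with NaN marking
>>> > tokens that carry no sentiment annotation, and a precomputed
>>> > negation flag per token):
>>> >
>>> > //average of all annotated sentiment values; a token preceded by a
>>> > //negation contributes its inverted value ("Improved Sentiment")
>>> > public static double averageSentiment(double[] values, boolean[] negated) {
>>> >     double sum = 0;
>>> >     int annotated = 0;
>>> >     for (int i = 0; i < values.length; i++) {
>>> >         if (Double.isNaN(values[i])) continue; //no sentiment annotation
>>> >         sum += negated[i] ? -values[i] : values[i];
>>> >         annotated++;
>>> >     }
>>> >     return annotated == 0 ? 0.0 : sum / annotated;
>>> > }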
>>> >
>>> > Availability
>>> > ============
>>> >
>>> > Since this is still experimental code, we have for the time being
>>> > set up a separate (public) repository:
>>> >
>>> > https://bitbucket.org/srfgkmt/stanbol-nlp
>>> >
>>> > When it is more-or-less finished, we would however like to include
>>> > it in the main Stanbol code base so others can more easily benefit
>>> > from it. Feel free to look at what we have implemented there!
>>> >
>>> > ;-)
>>> >
>>> > Sebastian
>>> > --
>>> > | Dr. Sebastian Schaffert            [email protected]
>>> > | Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
>>> > | Head of Knowledge and Media Technologies Group    +43 662 2288 423
>>> > | Jakob-Haringer Strasse 5/II
>>> > | A-5020 Salzburg
>>> >
>>>
>>>
>>> --
>>> | Rupert Westenthaler             [email protected]
>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>> | A-5500 Bischofshofen
>>
>
>
> --
> Fabian
> http://twitter.com/fctwitt

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
