On Mon, Jul 30, 2012 at 8:04 AM, Walter Kasper <[email protected]> wrote:
> Hi Harish,
>
> I can provide a Stanbol wrapper for the
> http://code.google.com/p/language-detection library as an additional
> enhancement engine in the next days. I would be interested in
> evaluating it anyway.
>
cool thx!
best
Rupert

> Best regards,
>
> Walter
>
>
> harish suvarna wrote:
>>
>> Rupert,
>> My initial debugging for Chinese text told me that the language
>> identification done by the langid enhancer using Apache Tika does not
>> recognize Chinese. Tika's language detection does not seem to support
>> the CJK languages. As a result, Chinese text is identified as
>> Lithuanian ('lt'). The Apache Tika project has enhancement issue
>> TIKA-856, registered in Feb 2012, for detecting CJK languages:
>> https://issues.apache.org/jira/browse/TIKA-856
>> I am not sure about the use of language identification in Stanbol yet.
>> Is the language id used to select the DBpedia index (the appropriate
>> DBpedia language dump) for entity lookups?
>>
>> I am just thinking that, for my purpose, I could pick option 3, make
>> sure the text is in the language of my interest, and then call the
>> paoding segmenter. Then iterate over the n-grams and do an Entityhub
>> lookup. I still need to understand the code around how the whole
>> entity lookup for DBpedia works.
>>
>> I find that the language detection library
>> http://code.google.com/p/language-detection/ is very good at language
>> detection. It supports 53 languages out of the box and the quality
>> seems good. It is under the Apache 2.0 license. I could volunteer to
>> create a new langid engine based on this, with the Stanbol community's
>> approval. So if anyone sheds some light on how to add a new Java
>> library into Stanbol, that would be great. I am a Maven beginner now.
>>
>> Thanks,
>> harish
>>
>>
>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
>> [email protected]> wrote:
>>
>>> Hi harish,
>>>
>>> Note: Sorry, I forgot to include the stanbol-dev mailing list in my
>>> last answer.
>>>
>>>
>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <[email protected]>
>>> wrote:
>>>>
>>>> Thanks a lot Rupert.
>>>>
>>>> I am weighing between options 2 and 3. What is the difference?
>>>> Option 2
>>>> sounds like enhancing the KeywordLinkingEngine to deal with Chinese
>>>> text. It may be like paoding is hardcoded into the
>>>> KeywordLinkingEngine. Option 3 is like a separate engine.
>>>
>>> Option (2) will require some improvements on the Stanbol side.
>>> However, there have already been discussions on how to create a "text
>>> processing chain" that allows splitting up things like tokenizing,
>>> POS tagging, lemmatizing ... into different Enhancement Engines
>>> without suffering from the disadvantages of creating high amounts of
>>> RDF triples. One idea was to base this on the Apache Lucene
>>> TokenStream [1] API and share the data as a ContentPart [2] of the
>>> ContentItem.
>>>
>>> Option (3) indeed means that you will create your own
>>> EnhancementEngine - a similar one to the KeywordLinkingEngine.
>>>
>>>> But will I be able to use the stanbol dbpedia lookup using option 3?
>>>
>>> Yes. You only need to obtain an Entityhub "ReferencedSite" and use
>>> the "FieldQuery" interface to search for Entities (see [3] for an
>>> example).
>>>
>>> best
>>> Rupert
>>>
>>> [1]
>>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>> [2]
>>> http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>> [3]
>>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>
>>>
>>>> Btw, I created my own enhancement engine chains and I could see them
>>>> yesterday in localhost:8080. But today all of them have vanished and
>>>> only the default chain shows up. Can I dig them up somewhere in the
>>>> stanbol directory?
>>>>
>>>> -harish
>>>>
>>>> I just created the eclipse project
>>>>
>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> There are no NER (Named Entity Recognition) models for Chinese text
>>>>> available via OpenNLP. So the default configuration of Stanbol will
>>>>> not process Chinese text. What you can do is configure a
>>>>> KeywordLinking Engine for Chinese text, as this engine can also
>>>>> process texts in unknown languages (see [1] for details).
>>>>>
>>>>> However, the KeywordLinking Engine also requires at least a
>>>>> tokenizer for looking up words. As there is no Chinese-specific
>>>>> tokenizer available for OpenNLP, it will use the default one, which
>>>>> uses a fixed set of chars to split words (white spaces, hyphens
>>>>> ...). You may know better how well this would work with Chinese
>>>>> texts. My assumption would be that it is not sufficient - so
>>>>> results will be sub-optimal.
>>>>>
>>>>> To apply Chinese optimization I see three possibilities:
>>>>>
>>>>> 1. add support for Chinese to OpenNLP (tokenizer, sentence
>>>>> detection, POS tagging, named entity detection)
>>>>> 2. allow the KeywordLinkingEngine to use other already available
>>>>> tools for text processing (e.g. what is already available for
>>>>> Solr/Lucene [2], or the paoding Chinese segmenter referenced in
>>>>> your mail). Currently the KeywordLinkingEngine is hardwired to
>>>>> OpenNLP, because representing tokens, POS ... as RDF would be too
>>>>> much of an overhead.
>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>
>>>>> Hope this helps to get you started.
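The tokenizer problem described above can be made concrete: splitting on a fixed set of characters (white space, hyphens) yields a single unusable token for Chinese text, while character n-grams at least produce candidate keys for a label lookup. The sketch below is an illustrative toy, not Stanbol or OpenNLP code; the class and method names are invented, and the n-gram fallback merely stands in for a real segmenter such as paoding.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy illustration of why the default (whitespace-based) tokenizer is
 * insufficient for Chinese, and how character n-grams can still produce
 * candidate lookup keys. Editorial sketch only - not Stanbol/OpenNLP code.
 */
public class CjkTokenSketch {

    /** Mimics the default tokenizer: split on a fixed set of chars. */
    public static List<String> defaultTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[\\s\\-]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Character n-grams as lookup keys, a stand-in for a real segmenter. */
    public static List<String> charNgrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        String zh = "北京是中国的首都"; // "Beijing is the capital of China"
        // Whitespace tokenization yields one giant, unusable token:
        System.out.println(defaultTokenize(zh)); // [北京是中国的首都]
        // Bigrams at least contain 北京 ("Beijing") as a candidate key:
        System.out.println(charNgrams(zh, 2));
    }
}
```

Each n-gram would then be looked up against entity labels (e.g. via the Entityhub), which is why segmentation quality directly drives linking quality.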
>>>>>
>>>>> best
>>>>> Rupert
>>>>>
>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>> [2]
>>>>> http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>>>>
>>>>> On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi Rupert,
>>>>>> Finally I am getting some time to work on Stanbol. My job is to
>>>>>> demonstrate Stanbol annotations for Chinese text.
>>>>>> I am just starting on it. I am following the instructions to build
>>>>>> an enhancement engine from Anuj's blog. DBpedia has some Chinese
>>>>>> data dumps too.
>>>>>> We may have to depend on the n-grams as keys and look them up in
>>>>>> the DBpedia labels.
>>>>>>
>>>>>> I am planning to use the paoding Chinese segmenter
>>>>>> (http://code.google.com/p/paoding/) for word breaking.
>>>>>>
>>>>>> Just curious: I pasted some Chinese text into the default engine
>>>>>> of Stanbol. It finished the processing in no time at all. This
>>>>>> gave me the suspicion that maybe, if the language is Chinese, no
>>>>>> further processing is done. Is that right? Any more tips for
>>>>>> making all this work in Stanbol?
>>>>>>
>>>>>> -harish
>>>>>
>>>>>
>>>>> --
>>>>> | Rupert Westenthaler             [email protected]
>>>>> | Bodenlehenstraße 11             ++43-699-11108907
>>>>> | A-5500 Bischofshofen
>>>>
>>>>
>>>
>>> --
>>> | Rupert Westenthaler             [email protected]
>>> | Bodenlehenstraße 11             ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>
>
> --
> Dr. Walter Kasper
> DFKI GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Tel.: +49-681-85775-5300
> Fax: +49-681-85775-5338
> Email: [email protected]
> -------------------------------------------------------------
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
> -------------------------------------------------------------
>

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
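A closing note on the Tika misdetection discussed at the top of the thread (Chinese text coming back as 'lt'): profile-based detectors fail when no CJK profile is loaded, but a cheap Unicode-script check can act as a guard before trusting such a detector. The helper below is hypothetical - it is not part of Tika, Stanbol, or the language-detection library, and the names are invented for illustration.

```java
/**
 * Hypothetical guard against profile-based language misdetection: if the
 * text is dominated by Han characters, treat it as Chinese rather than
 * trusting an n-gram profile detector that has no CJK profiles loaded.
 * Not Tika/Stanbol API - an editorial sketch only.
 */
public class CjkGuard {

    /** Fraction of code points in the text that belong to the Han script. */
    public static double cjkRatio(String text) {
        if (text.isEmpty()) return 0.0;
        long han = text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp)
                        == Character.UnicodeScript.HAN)
                .count();
        return (double) han / text.codePointCount(0, text.length());
    }

    /** Returns "zh" when the text is dominated by Han characters, else null. */
    public static String guessCjk(String text) {
        return cjkRatio(text) > 0.5 ? "zh" : null;
    }

    public static void main(String[] args) {
        System.out.println(guessCjk("北京是中国的首都"));       // zh
        System.out.println(guessCjk("Vilnius is in Lithuania")); // null
    }
}
```

A real engine would fall through to the profile-based detector when the guard returns null; note that a Han-only check cannot distinguish Chinese from Japanese kanji, so it is only a first-pass filter.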
