Dr. Walter,

No problem at all. Thanks. I was trying to use this as a learning experience for myself. I look forward to it.

-harish
On Mon, Jul 30, 2012 at 12:18 AM, Rupert Westenthaler <[email protected]> wrote:
> On Mon, Jul 30, 2012 at 8:04 AM, Walter Kasper <[email protected]> wrote:
>> Hi Harish,
>>
>> I can provide a Stanbol wrapper for the
>> http://code.google.com/p/language-detection library as an additional
>> enhancement engine in the next days. I would be interested in
>> evaluating it anyway.
>
> cool, thx!
>
> best
> Rupert
>
>> Best regards,
>>
>> Walter
>>
>> harish suvarna wrote:
>>> Rupert,
>>> My initial debugging for Chinese text told me that the language
>>> identification done by the langid enhancer using Apache Tika does not
>>> recognize Chinese. Tika's language detection does not seem to support
>>> the CJK languages. As a result, Chinese text is identified as
>>> Lithuanian ('lt'). The Apache Tika project has had an enhancement
>>> issue for detecting CJK languages
>>> (https://issues.apache.org/jira/browse/TIKA-856) open since Feb 2012.
>>> I am not sure about the use of language identification in Stanbol
>>> yet. Is the language id used to select the dbpedia index (the
>>> appropriate dbpedia language dump) for entity lookups?
>>>
>>> I am just thinking that, for my purpose, I could pick option 3, make
>>> sure the text is in the language of my interest, and then call the
>>> paoding segmenter. Then iterate over the ngrams and do an Entityhub
>>> lookup. I still need to understand the code around how the whole
>>> entity lookup for dbpedia works.
>>>
>>> I find that the language detection library
>>> http://code.google.com/p/language-detection/ is very good at language
>>> detection. It supports 53 languages out of the box and the quality
>>> seems good. It is Apache 2.0 licensed. I could volunteer to create a
>>> new langid engine based on this, with the Stanbol community's
>>> approval. So if anyone sheds some light on how to add a new Java
>>> library into Stanbol, that would be great.
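As the thread notes, Tika's detector at the time shipped no CJK n-gram profiles, so Chinese text fell through to the closest European profile. A script-based pre-check is often enough to route CJK text before n-gram detection even runs. The following is a minimal JDK-only sketch; it is not Tika's or langdetect's actual algorithm, just an illustration of the idea:

```java
// Standalone sketch (JDK only): check whether a string contains CJK
// ideographs. This is NOT how Tika or the langdetect library work
// internally; it only illustrates a cheap script-based pre-check that
// could route CJK text away from an n-gram detector with no CJK profiles.
public class CjkCheck {

    static boolean containsCjk(String text) {
        // Walk code points (not chars) so supplementary-plane
        // ideographs are handled correctly as well.
        return text.codePoints().anyMatch(cp ->
                Character.UnicodeBlock.of(cp)
                        == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS);
    }

    public static void main(String[] args) {
        System.out.println(containsCjk("这是中文文本"));    // true
        System.out.println(containsCjk("This is English")); // false
    }
}
```

A real engine would of course still need to distinguish Chinese from Japanese and Korean, which is exactly what a profile-based detector such as langdetect provides.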
>>> I am a maven beginner now.
>>>
>>> Thanks,
>>> harish
>>>
>>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler
>>> <[email protected]> wrote:
>>>> Hi harish,
>>>>
>>>> Note: Sorry, I forgot to include the stanbol-dev mailing list in my
>>>> last answer.
>>>>
>>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <[email protected]> wrote:
>>>>> Thanks a lot Rupert.
>>>>>
>>>>> I am weighing between options 2 and 3. What is the difference?
>>>>> Option 2 sounds like enhancing the KeywordLinkingEngine to deal
>>>>> with Chinese text, as if paoding were hardcoded into the
>>>>> KeywordLinkingEngine. Option 3 is like a separate engine.
>>>>
>>>> Option (2) will require some improvements on the Stanbol side.
>>>> However, there were already discussions on how to create a "text
>>>> processing chain" that allows splitting up things like tokenizing,
>>>> POS tagging, lemmatizing ... into different Enhancement Engines
>>>> without suffering from the disadvantage of creating high amounts of
>>>> RDF triples. One idea was to base this on the Apache Lucene
>>>> TokenStream [1] API and share the data as a ContentPart [2] of the
>>>> ContentItem.
>>>>
>>>> Option (3) indeed means that you will create your own
>>>> EnhancementEngine - one similar to the KeywordLinkingEngine.
>>>>
>>>>> But will I be able to use the Stanbol dbpedia lookup using option 3?
>>>>
>>>> Yes.
>>>> You only need to obtain an Entityhub "ReferencedSite" and use the
>>>> "FieldQuery" interface to search for entities (see [3] for an
>>>> example).
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> [1] http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>>> [2] http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>>> [3] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>>
>>>>> Btw, I created my own enhancement engine chains and I could see
>>>>> them yesterday in localhost:8080. But today all of them have
>>>>> vanished and only the default chain shows up. Can I dig them up
>>>>> somewhere in the Stanbol directory?
>>>>>
>>>>> -harish
>>>>>
>>>>> I just created the eclipse project
>>>>>
>>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>>> <[email protected]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> There are no NER (Named Entity Recognition) models for Chinese
>>>>>> text available via OpenNLP, so the default configuration of
>>>>>> Stanbol will not process Chinese text. What you can do is
>>>>>> configure a KeywordLinkingEngine for Chinese text, as this engine
>>>>>> can also process text in unknown languages (see [1] for details).
>>>>>>
>>>>>> However, the KeywordLinkingEngine also requires at least a
>>>>>> tokenizer for looking up words. As there is no specific OpenNLP
>>>>>> tokenizer for Chinese text, it will use the default one, which
>>>>>> splits words on a fixed set of chars (white spaces, hyphens ...).
>>>>>> You may know better how well this would work with Chinese texts.
>>>>>> My assumption would be that it is not sufficient - so results
>>>>>> will be sub-optimal.
>>>>>>
>>>>>> To apply Chinese optimization I see three possibilities:
>>>>>>
>>>>>> 1.
>>>>>> add support for Chinese to OpenNLP (tokenizer, sentence
>>>>>> detection, POS tagging, named entity detection)
>>>>>> 2. allow the KeywordLinkingEngine to use other already available
>>>>>> tools for text processing (e.g. what is already available for
>>>>>> Solr/Lucene [2], or the paoding Chinese segmenter referenced in
>>>>>> your mail). Currently the KeywordLinkingEngine is hardwired to
>>>>>> OpenNLP, because representing tokens, POS tags ... as RDF would
>>>>>> be too much of an overhead.
>>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>>
>>>>>> Hope this helps to get you started.
>>>>>>
>>>>>> best
>>>>>> Rupert
>>>>>>
>>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>>> [2] http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>>>>>
>>>>>> On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <[email protected]> wrote:
>>>>>>> Hi Rupert,
>>>>>>> Finally I am getting some time to work on Stanbol. My job is to
>>>>>>> demonstrate Stanbol annotations for Chinese text. I am just
>>>>>>> starting on it. I am following the instructions to build an
>>>>>>> enhancement engine from Anuj's blog. dbpedia has some Chinese
>>>>>>> data dump too. We may have to depend on the ngrams as keys and
>>>>>>> look them up in the dbpedia labels.
>>>>>>>
>>>>>>> I am planning to use the paoding Chinese segmenter
>>>>>>> (http://code.google.com/p/paoding/) for word breaking.
>>>>>>>
>>>>>>> Just curious: I pasted some Chinese text into the default engine
>>>>>>> of Stanbol, and it finished processing in no time at all. This
>>>>>>> made me suspect that if the language is Chinese, no further
>>>>>>> processing is done. Is that right? Any more tips for making all
>>>>>>> this work in Stanbol?
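Rupert's point about the default tokenizer can be seen with a one-line experiment: written Chinese puts no spaces between words, so any splitter based on whitespace and punctuation returns the whole sentence as a single "token", and per-word entity lookups have nothing to work with. A JDK-only sketch (illustrative; not Stanbol's actual tokenizer code):

```java
// Standalone sketch (JDK only): why a whitespace/punctuation tokenizer
// fails on Chinese. English splits into words; the Chinese sentence
// contains no whitespace or ASCII punctuation, so it comes back as a
// single token. This mimics the failure mode, not Stanbol's real code.
public class TokenizeDemo {

    static String[] naiveTokens(String text) {
        // Split on runs of whitespace and ASCII punctuation, the kind
        // of fixed char set a default tokenizer might use.
        return text.trim().split("[\\s\\p{Punct}]+");
    }

    public static void main(String[] args) {
        System.out.println(naiveTokens("Stanbol enhances content").length); // 3
        System.out.println(naiveTokens("上海自来水来自海上").length);           // 1
    }
}
```

This is exactly the gap a segmenter such as paoding (or the Lucene CJK analyzers) fills: it inserts the word boundaries that the writing system omits.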
>>>>>>> -harish
>>>>>>
>>>>>> --
>>>>>> | Rupert Westenthaler [email protected]
>>>>>> | Bodenlehenstraße 11 ++43-699-11108907
>>>>>> | A-5500 Bischofshofen
>>
>> --
>> Dr. Walter Kasper
>> DFKI GmbH
>> Stuhlsatzenhausweg 3
>> D-66123 Saarbrücken
>> Tel.: +49-681-85775-5300
>> Fax: +49-681-85775-5338
>> Email: [email protected]
