On Mon, Jul 30, 2012 at 8:04 AM, Walter Kasper <[email protected]> wrote:
> Hi Harish,
>
> I can provide a Stanbol wrapper for the
> http://code.google.com/p/language-detection library as an additional
> enhancement engine in the next days. I would be interested in
> evaluating it anyway.
>
cool thx!
best
Rupert

> Best regards,
>
> Walter
>
>
> harish suvarna wrote:
>>
>> Rupert,
>> My initial debugging for Chinese text told me that the language
>> identification done by the langid enhancer using Apache Tika does not
>> recognize Chinese. Tika's language detection does not seem to support
>> the CJK languages. As a result, Chinese text is identified as
>> Lithuanian ('lt'). The Apache Tika project has enhancement issue
>> TIKA-856, registered in Feb 2012, for detecting CJK languages:
>> https://issues.apache.org/jira/browse/TIKA-856
>> I am not sure about the use of language identification in Stanbol yet.
>> Is the language id used to select the DBpedia index (the appropriate
>> DBpedia language dump) for entity lookups?
>>
>> I am just thinking that, for my purpose, I could pick option 3, make
>> sure the text is in the language of my interest, and then call the
>> paoding segmenter. Then iterate over the n-grams and do an Entityhub
>> lookup. I still need to understand the code around how the whole
>> entity lookup for DBpedia works.
>>
>> I find that the language detection library
>> http://code.google.com/p/language-detection/ is very good at language
>> detection. It supports 53 languages out of the box and the quality
>> seems good. It is under the Apache 2.0 license. I could volunteer to
>> create a new langid engine based on this, with the Stanbol community's
>> approval. So if anyone sheds some light on how to add a new Java
>> library into Stanbol, that would be great. I am a Maven beginner now.
>>
>> Thanks,
>> harish
>>
>>
>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
>> [email protected]> wrote:
>>
>>> Hi harish,
>>>
>>> Note: Sorry, I forgot to include the stanbol-dev mailing list in my
>>> last answer.
>>>
>>>
>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <[email protected]>
>>> wrote:
>>>>
>>>> Thanks a lot Rupert.
>>>>
>>>> I am weighing between options 2 and 3. What is the difference?
>>>> Option 2
>>>> sounds like enhancing the KeywordLinkingEngine to deal with Chinese
>>>> text. It may be like paoding is hardcoded into the
>>>> KeywordLinkingEngine. Option 3 is like a separate engine.
>>>
>>> Option (2) will require some improvements on the Stanbol side.
>>> However, there have already been discussions on how to create a "text
>>> processing chain" that allows splitting up things like tokenizing,
>>> POS tagging, lemmatizing ... into different Enhancement Engines
>>> without suffering from the disadvantages of creating high amounts of
>>> RDF triples. One idea was to base this on the Apache Lucene
>>> TokenStream [1] API and share the data as a ContentPart [2] of the
>>> ContentItem.
>>>
>>> Option (3) indeed means that you will create your own
>>> EnhancementEngine - a similar one to the KeywordLinkingEngine.
>>>
>>>> But will I be able to use the stanbol dbpedia lookup using option 3?
>>>
>>> Yes. You only need to obtain an Entityhub "ReferencedSite" and use
>>> the "FieldQuery" interface to search for Entities (see [3] for an
>>> example).
>>>
>>> best
>>> Rupert
>>>
>>> [1]
>>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>> [2]
>>> http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>> [3]
>>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>
>>>
>>>> Btw, I created my own enhancement engine chains and I could see them
>>>> yesterday in localhost:8080. But today all of them have vanished and
>>>> only the default chain shows up. Can I dig them up somewhere in the
>>>> stanbol directory?
>>>>
>>>> -harish
>>>>
>>>> I just created the eclipse project
>>>>
>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> There are no NER (Named Entity Recognition) models for Chinese text
>>>>> available via OpenNLP. So the default configuration of Stanbol will
>>>>> not process Chinese text. What you can do is configure a
>>>>> KeywordLinking Engine for Chinese text, as this engine can also
>>>>> process texts in unknown languages (see [1] for details).
>>>>>
>>>>> However, the KeywordLinking Engine also requires at least a
>>>>> tokenizer for looking up words. As there is no Chinese-specific
>>>>> tokenizer available for OpenNLP, it will use the default one, which
>>>>> uses a fixed set of chars to split words (white spaces, hyphens
>>>>> ...). You may know better how well this would work with Chinese
>>>>> texts. My assumption would be that it is not sufficient - so
>>>>> results will be sub-optimal.
>>>>>
>>>>> To apply Chinese optimization I see three possibilities:
>>>>>
>>>>> 1. add support for Chinese to OpenNLP (tokenizer, sentence
>>>>> detection, POS tagging, named entity detection)
>>>>> 2. allow the KeywordLinkingEngine to use other already available
>>>>> tools for text processing (e.g. what is already available for
>>>>> Solr/Lucene [2], or the paoding Chinese segmenter referenced in
>>>>> your mail). Currently the KeywordLinkingEngine is hardwired to
>>>>> OpenNLP, because representing tokens, POS ... as RDF would be too
>>>>> much of an overhead.
>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>
>>>>> Hope this helps to get you started.
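The tokenizer problem described above can be made concrete: splitting on a fixed set of characters (white space, hyphens) yields a single unusable token for Chinese text, while character n-grams at least produce candidate keys for a label lookup. The sketch below is an illustrative toy, not Stanbol or OpenNLP code; the class and method names are invented, and the n-gram fallback merely stands in for a real segmenter such as paoding.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy illustration of why the default (whitespace-based) tokenizer is
 * insufficient for Chinese, and how character n-grams can still produce
 * candidate lookup keys. Editorial sketch only - not Stanbol/OpenNLP code.
 */
public class CjkTokenSketch {

    /** Mimics the default tokenizer: split on a fixed set of chars. */
    public static List<String> defaultTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[\\s\\-]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Character n-grams as lookup keys, a stand-in for a real segmenter. */
    public static List<String> charNgrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        String zh = "北京是中国的首都"; // "Beijing is the capital of China"
        // Whitespace tokenization yields one giant, unusable token:
        System.out.println(defaultTokenize(zh)); // [北京是中国的首都]
        // Bigrams at least contain 北京 ("Beijing") as a candidate key:
        System.out.println(charNgrams(zh, 2));
    }
}
```

Each n-gram would then be looked up against entity labels (e.g. via the Entityhub), which is why segmentation quality directly drives linking quality.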
>>>>>
>>>>> best
>>>>> Rupert
>>>>>
>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>> [2]
>>>>> http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>>>>
>>>>> On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi Rupert,
>>>>>> Finally I am getting some time to work on Stanbol. My job is to
>>>>>> demonstrate Stanbol annotations for Chinese text.
>>>>>> I am just starting on it. I am following the instructions to build
>>>>>> an enhancement engine from Anuj's blog. DBpedia has some Chinese
>>>>>> data dumps too.
>>>>>> We may have to depend on the n-grams as keys and look them up in
>>>>>> the DBpedia labels.
>>>>>>
>>>>>> I am planning to use the paoding Chinese segmenter
>>>>>> (http://code.google.com/p/paoding/) for word breaking.
>>>>>>
>>>>>> Just curious: I pasted some Chinese text into the default engine
>>>>>> of Stanbol. It finished the processing in no time at all. This
>>>>>> gave me the suspicion that maybe, if the language is Chinese, no
>>>>>> further processing is done. Is that right? Any more tips for
>>>>>> making all this work in Stanbol?
>>>>>>
>>>>>> -harish
>>>>>
>>>>>
>>>>> --
>>>>> | Rupert Westenthaler             [email protected]
>>>>> | Bodenlehenstraße 11             ++43-699-11108907
>>>>> | A-5500 Bischofshofen
>>>>
>>>>
>>>
>>> --
>>> | Rupert Westenthaler             [email protected]
>>> | Bodenlehenstraße 11             ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>
>
> --
> Dr. Walter Kasper
> DFKI GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Tel.: +49-681-85775-5300
> Fax: +49-681-85775-5338
> Email: [email protected]
> -------------------------------------------------------------
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
> -------------------------------------------------------------
>

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
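A closing note on the Tika misdetection discussed at the top of the thread (Chinese text coming back as 'lt'): profile-based detectors fail when no CJK profile is loaded, but a cheap Unicode-script check can act as a guard before trusting such a detector. The helper below is hypothetical - it is not part of Tika, Stanbol, or the language-detection library, and the names are invented for illustration.

```java
/**
 * Hypothetical guard against profile-based language misdetection: if the
 * text is dominated by Han characters, treat it as Chinese rather than
 * trusting an n-gram profile detector that has no CJK profiles loaded.
 * Not Tika/Stanbol API - an editorial sketch only.
 */
public class CjkGuard {

    /** Fraction of code points in the text that belong to the Han script. */
    public static double cjkRatio(String text) {
        if (text.isEmpty()) return 0.0;
        long han = text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp)
                        == Character.UnicodeScript.HAN)
                .count();
        return (double) han / text.codePointCount(0, text.length());
    }

    /** Returns "zh" when the text is dominated by Han characters, else null. */
    public static String guessCjk(String text) {
        return cjkRatio(text) > 0.5 ? "zh" : null;
    }

    public static void main(String[] args) {
        System.out.println(guessCjk("北京是中国的首都"));       // zh
        System.out.println(guessCjk("Vilnius is in Lithuania")); // null
    }
}
```

A real engine would fall through to the profile-based detector when the guard returns null; note that a Han-only check cannot distinguish Chinese from Japanese kanji, so it is only a first-pass filter.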
