Dr. Walter,

No problem at all. Thanks. I was trying to use this as a learning experience for myself. I look forward to it.

-harish
On Mon, Jul 30, 2012 at 12:18 AM, Rupert Westenthaler <[email protected]> wrote:
> On Mon, Jul 30, 2012 at 8:04 AM, Walter Kasper <[email protected]> wrote:
>> Hi Harish,
>>
>> I can provide a Stanbol wrapper for the
>> http://code.google.com/p/language-detection library as an additional
>> enhancement engine in the next days. I would be interested in
>> evaluating it anyway.
>
> cool, thx!
>
> best
> Rupert
>
>> Best regards,
>>
>> Walter
>>
>> harish suvarna wrote:
>>> Rupert,
>>> My initial debugging for Chinese text told me that the language
>>> identification done by the langid enhancer using Apache Tika does not
>>> recognize Chinese. Tika's language detection does not seem to support
>>> the CJK languages. As a result, Chinese text is identified as
>>> Lithuanian ('lt'). The Apache Tika project has had an enhancement
>>> issue for detecting CJK languages
>>> (https://issues.apache.org/jira/browse/TIKA-856) open since Feb 2012.
>>> I am not sure about the use of language identification in Stanbol
>>> yet. Is the language id used to select the dbpedia index (the
>>> appropriate dbpedia language dump) for entity lookups?
>>>
>>> I am just thinking that, for my purpose, I could pick option 3, make
>>> sure the text is in the language of my interest, and then call the
>>> paoding segmenter. Then iterate over the ngrams and do an Entityhub
>>> lookup. I still need to understand the code around how the whole
>>> entity lookup for dbpedia works.
>>>
>>> I find that the language detection library
>>> http://code.google.com/p/language-detection/ is very good at language
>>> detection. It supports 53 languages out of the box and the quality
>>> seems good. It is Apache 2.0 licensed. I could volunteer to create a
>>> new langid engine based on this, with the Stanbol community's
>>> approval. So if anyone sheds some light on how to add a new Java
>>> library into Stanbol, that would be great.
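As the thread notes, Tika's detector at the time shipped no CJK n-gram profiles, so Chinese text fell through to the closest European profile. A script-based pre-check is often enough to route CJK text before n-gram detection even runs. The following is a minimal JDK-only sketch; it is not Tika's or langdetect's actual algorithm, just an illustration of the idea:

```java
// Standalone sketch (JDK only): check whether a string contains CJK
// ideographs. This is NOT how Tika or the langdetect library work
// internally; it only illustrates a cheap script-based pre-check that
// could route CJK text away from an n-gram detector with no CJK profiles.
public class CjkCheck {

    static boolean containsCjk(String text) {
        // Walk code points (not chars) so supplementary-plane
        // ideographs are handled correctly as well.
        return text.codePoints().anyMatch(cp ->
                Character.UnicodeBlock.of(cp)
                        == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS);
    }

    public static void main(String[] args) {
        System.out.println(containsCjk("这是中文文本"));    // true
        System.out.println(containsCjk("This is English")); // false
    }
}
```

A real engine would of course still need to distinguish Chinese from Japanese and Korean, which is exactly what a profile-based detector such as langdetect provides.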
>>> I am a maven beginner now.
>>>
>>> Thanks,
>>> harish
>>>
>>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler
>>> <[email protected]> wrote:
>>>> Hi harish,
>>>>
>>>> Note: Sorry, I forgot to include the stanbol-dev mailing list in my
>>>> last answer.
>>>>
>>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <[email protected]> wrote:
>>>>> Thanks a lot Rupert.
>>>>>
>>>>> I am weighing between options 2 and 3. What is the difference?
>>>>> Option 2 sounds like enhancing the KeywordLinkingEngine to deal
>>>>> with Chinese text, as if paoding were hardcoded into the
>>>>> KeywordLinkingEngine. Option 3 is like a separate engine.
>>>>
>>>> Option (2) will require some improvements on the Stanbol side.
>>>> However, there were already discussions on how to create a "text
>>>> processing chain" that allows splitting up things like tokenizing,
>>>> POS tagging, lemmatizing ... into different Enhancement Engines
>>>> without suffering from the disadvantage of creating high amounts of
>>>> RDF triples. One idea was to base this on the Apache Lucene
>>>> TokenStream [1] API and share the data as a ContentPart [2] of the
>>>> ContentItem.
>>>>
>>>> Option (3) indeed means that you will create your own
>>>> EnhancementEngine - one similar to the KeywordLinkingEngine.
>>>>
>>>>> But will I be able to use the Stanbol dbpedia lookup using option 3?
>>>>
>>>> Yes.
>>>> You only need to obtain an Entityhub "ReferencedSite" and use the
>>>> "FieldQuery" interface to search for entities (see [3] for an
>>>> example).
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> [1] http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>>> [2] http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>>> [3] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>>
>>>>> Btw, I created my own enhancement engine chains and I could see
>>>>> them yesterday in localhost:8080. But today all of them have
>>>>> vanished and only the default chain shows up. Can I dig them up
>>>>> somewhere in the Stanbol directory?
>>>>>
>>>>> -harish
>>>>>
>>>>> I just created the eclipse project
>>>>>
>>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>>> <[email protected]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> There are no NER (Named Entity Recognition) models for Chinese
>>>>>> text available via OpenNLP, so the default configuration of
>>>>>> Stanbol will not process Chinese text. What you can do is
>>>>>> configure a KeywordLinkingEngine for Chinese text, as this engine
>>>>>> can also process text in unknown languages (see [1] for details).
>>>>>>
>>>>>> However, the KeywordLinkingEngine also requires at least a
>>>>>> tokenizer for looking up words. As there is no specific OpenNLP
>>>>>> tokenizer for Chinese text, it will use the default one, which
>>>>>> splits words on a fixed set of chars (white spaces, hyphens ...).
>>>>>> You may know better how well this would work with Chinese texts.
>>>>>> My assumption would be that it is not sufficient - so results
>>>>>> will be sub-optimal.
>>>>>>
>>>>>> To apply Chinese optimization I see three possibilities:
>>>>>>
>>>>>> 1.
>>>>>> add support for Chinese to OpenNLP (tokenizer, sentence
>>>>>> detection, POS tagging, named entity detection)
>>>>>> 2. allow the KeywordLinkingEngine to use other already available
>>>>>> tools for text processing (e.g. what is already available for
>>>>>> Solr/Lucene [2], or the paoding Chinese segmenter referenced in
>>>>>> your mail). Currently the KeywordLinkingEngine is hardwired to
>>>>>> OpenNLP, because representing tokens, POS tags ... as RDF would
>>>>>> be too much of an overhead.
>>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>>
>>>>>> Hope this helps to get you started.
>>>>>>
>>>>>> best
>>>>>> Rupert
>>>>>>
>>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>>> [2] http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>>>>>
>>>>>> On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <[email protected]> wrote:
>>>>>>> Hi Rupert,
>>>>>>> Finally I am getting some time to work on Stanbol. My job is to
>>>>>>> demonstrate Stanbol annotations for Chinese text. I am just
>>>>>>> starting on it. I am following the instructions to build an
>>>>>>> enhancement engine from Anuj's blog. dbpedia has some Chinese
>>>>>>> data dump too. We may have to depend on the ngrams as keys and
>>>>>>> look them up in the dbpedia labels.
>>>>>>>
>>>>>>> I am planning to use the paoding Chinese segmenter
>>>>>>> (http://code.google.com/p/paoding/) for word breaking.
>>>>>>>
>>>>>>> Just curious: I pasted some Chinese text into the default engine
>>>>>>> of Stanbol, and it finished processing in no time at all. This
>>>>>>> made me suspect that if the language is Chinese, no further
>>>>>>> processing is done. Is that right? Any more tips for making all
>>>>>>> this work in Stanbol?
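Rupert's point about the default tokenizer can be seen with a one-line experiment: written Chinese puts no spaces between words, so any splitter based on whitespace and punctuation returns the whole sentence as a single "token", and per-word entity lookups have nothing to work with. A JDK-only sketch (illustrative; not Stanbol's actual tokenizer code):

```java
// Standalone sketch (JDK only): why a whitespace/punctuation tokenizer
// fails on Chinese. English splits into words; the Chinese sentence
// contains no whitespace or ASCII punctuation, so it comes back as a
// single token. This mimics the failure mode, not Stanbol's real code.
public class TokenizeDemo {

    static String[] naiveTokens(String text) {
        // Split on runs of whitespace and ASCII punctuation, the kind
        // of fixed char set a default tokenizer might use.
        return text.trim().split("[\\s\\p{Punct}]+");
    }

    public static void main(String[] args) {
        System.out.println(naiveTokens("Stanbol enhances content").length); // 3
        System.out.println(naiveTokens("上海自来水来自海上").length);           // 1
    }
}
```

This is exactly the gap a segmenter such as paoding (or the Lucene CJK analyzers) fills: it inserts the word boundaries that the writing system omits.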
>>>>>>> -harish
>>>>>>
>>>>>> --
>>>>>> | Rupert Westenthaler [email protected]
>>>>>> | Bodenlehenstraße 11 ++43-699-11108907
>>>>>> | A-5500 Bischofshofen
>>
>> --
>> Dr. Walter Kasper
>> DFKI GmbH
>> Stuhlsatzenhausweg 3
>> D-66123 Saarbrücken
>> Tel.: +49-681-85775-5300
>> Fax: +49-681-85775-5338
>> Email: [email protected]
