Rupert,
My initial debugging for Chinese text showed that the language
identification done by the langid enhancement engine, which uses Apache
Tika, does not recognize Chinese. Tika's language detection does not seem
to support the CJK languages; as a result, Chinese text gets identified as
Lithuanian ('lt'). The Apache Tika project has an enhancement issue for
detecting CJK languages, TIKA-856
(https://issues.apache.org/jira/browse/TIKA-856), open since February
2012. I am not yet sure how language identification is used in Stanbol.
Is the detected language used to select the DBpedia index (the
appropriate DBpedia language dump) for entity lookups?


For my purposes I am thinking of picking option 3: check that the text is
in the language of interest, call the paoding segmenter, then iterate
over the resulting n-grams and do an Entityhub lookup for each. I still
need to understand how the whole entity lookup against DBpedia works in
the code.
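
The n-gram iteration can be sketched in plain Java. This is only an
illustration: the token list stands in for paoding's output, and the
actual Entityhub lookup is reduced to a println, since the real call
would go through the FieldQuery API.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {

    // Build all n-grams up to length maxN from segmented tokens.
    // Each n-gram is a candidate label for an Entityhub lookup.
    static List<String> ngrams(List<String> tokens, int maxN) {
        List<String> result = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                // Chinese labels are written without spaces, so the
                // tokens are joined with an empty separator.
                result.add(String.join("", tokens.subList(i, i + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tokens as a segmenter such as paoding might return them for
        // "北京大学图书馆" (Peking University Library) - illustrative only.
        List<String> tokens = List.of("北京", "大学", "图书馆");
        for (String candidate : ngrams(tokens, 2)) {
            // A real engine would query the DBpedia ReferencedSite here.
            System.out.println(candidate);
        }
    }
}
```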

I find that the language-detection library
(http://code.google.com/p/language-detection/) works very well: it
supports 53 languages out of the box, the quality seems good, and it is
under the Apache 2.0 license. With the Stanbol community's approval, I
could volunteer to create a new langid engine based on it. If anyone
could shed some light on how to add a new Java library to Stanbol, that
would be great; I am still a Maven beginner.
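
For reference, pulling a third-party library into a Stanbol module is
mostly a Maven dependency declaration in that module's pom.xml. A
minimal sketch (the coordinates below are my guess for the
language-detection jar, so please verify against what the project
actually publishes):

```xml
<!-- in the enhancement engine module's pom.xml -->
<dependency>
  <!-- illustrative coordinates; check the project's actual release -->
  <groupId>com.cybozu.labs</groupId>
  <artifactId>langdetect</artifactId>
  <version>1.1-20120112</version>
</dependency>
```

Since Stanbol engines are deployed as OSGi bundles, the jar may also
need to be embedded via the maven-bundle-plugin (Embed-Dependency) or
installed as its own bundle if it is not OSGi-ready.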

Thanks,
harish




On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
[email protected]> wrote:

> Hi harish,
>
> Note: Sorry I forgot to include the stanbol-dev mailing list in my last
> answer.
>
>
> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <[email protected]>
> wrote:
> > Thanks a lot Rupert.
> >
> > I am weighing options 2 and 3. What is the difference? Option 2
> > sounds like enhancing the KeywordLinkingEngine to deal with Chinese
> > text, perhaps with paoding hardcoded into it, while option 3 is a
> > separate engine.
>
> Option (2) will require some improvements on the Stanbol side.
> However, there have already been discussions on how to create a "text
> processing chain" that splits up steps like tokenizing, POS tagging,
> and lemmatizing into different Enhancement Engines without suffering
> the disadvantage of creating large amounts of RDF triples. One idea
> was to base this on the Apache Lucene TokenStream [1] API and share
> the data as a ContentPart [2] of the ContentItem.
>
> Option (3) indeed means that you will create your own
> EnhancementEngine - a similar one to the KeywordLinkingEngine.
>
> >  But will I be able to use the Stanbol DBpedia lookup with option 3?
>
> Yes. You only need to obtain an Entityhub "ReferencedSite" and use the
> "FieldQuery" interface to search for entities (see [3] for an example).
>
> best
> Rupert
>
> [1]
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> [2]
> http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
> [3]
> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>
>
> >
> > Btw, I created my own enhancement engine chains and could see them
> > yesterday at localhost:8080, but today they have all vanished and
> > only the default chain shows up. Can I dig them up somewhere in the
> > Stanbol directory?
> >
> > -harish
> >
> > I just created the Eclipse project.
> > On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
> > <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> There are no NER (Named Entity Recognition) models for Chinese text
> >> available via OpenNLP, so the default configuration of Stanbol will
> >> not process Chinese text. What you can do is configure the
> >> KeywordLinkingEngine for Chinese text, as this engine can also
> >> process text in unknown languages (see [1] for details).
> >>
> >> However, the KeywordLinkingEngine also requires at least a
> >> tokenizer for looking up words. As there is no OpenNLP tokenizer
> >> specific to Chinese, it will use the default one, which splits words
> >> on a fixed set of characters (whitespace, hyphens ...). You may know
> >> better how well that works for Chinese text; my assumption is that
> >> it is not sufficient, so results will be sub-optimal.
> >>
> >> To support Chinese I see three possibilities:
> >>
> >> 1. add support for Chinese to OpenNLP (tokenizer, sentence
> >> detection, POS tagging, named entity detection)
> >> 2. allow the KeywordLinkingEngine to use other already available
> >> text processing tools (e.g. those available for Solr/Lucene [2], or
> >> the paoding Chinese segmenter referenced in your mail). Currently
> >> the KeywordLinkingEngine is hardwired to OpenNLP, because
> >> representing tokens, POS tags ... as RDF would be too much overhead.
> >> 3. implement a new EnhancementEngine for processing Chinese text.
> >>
> >> Hope this helps to get you started.
> >>
> >> best
> >> Rupert
> >>
> >> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
> >> [2]
> >>
> http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
> >>
> >> On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <[email protected]>
> >> wrote:
> >> > Hi Rupert,
> >> > Finally I am getting some time to work on Stanbol. My job is to
> >> > demonstrate Stanbol annotations for Chinese text, and I am just
> >> > starting on it. I am following the instructions from Anuj's blog
> >> > on building an enhancement engine. DBpedia has a Chinese data dump
> >> > too; we may have to use n-grams as keys and look them up in the
> >> > DBpedia labels.
> >> >
> >> > I am planning to use the paoding Chinese segmenter
> >> > (http://code.google.com/p/paoding/) for word breaking.
> >> >
> >> > Just curious: I pasted some Chinese text into the default engine
> >> > of Stanbol, and it finished processing in no time at all. This
> >> > made me suspect that if the language is Chinese, no further
> >> > processing is done. Is that right? Any more tips for making all
> >> > this work in Stanbol?
> >> >
> >> > -harish
> >>
> >>
> >>
> >> --
> >> | Rupert Westenthaler             [email protected]
> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> | A-5500 Bischofshofen
> >
> >
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>
