Thanks Dr. Walter. langdetect is very useful. I could successfully compile
it, but I am unable to load it into Stanbol as I get the error
======
ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Error
starting/stopping bundle. (org.osgi.framework.BundleException: Unresolved
constraint in bundle org.apache.stanbol.enhancer.engines.langdetect [177]:
Unable to resolve 177.0: missing requirement [177.0] package;
(package=com.google.inject))
org.osgi.framework.BundleException: Unresolved constraint in bundle
org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve
177.0: missing requirement [177.0] package; (package=com.google.inject)
at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
at
org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
at java.lang.Thread.run(Thread.java:680)
==============
I added the dependency
<dependency>
<groupId>com.google.inject</groupId>
<artifactId>guice</artifactId>
<version>3.0</version>
</dependency>
but it looks like it is looking for version 1.3.0, which I can't find on
repo1.maven.org. I am not sure what needs the inject library: the entire
source of the langdetect plugin does not contain the word inject. Only
the manifest file in target/classes has it listed.
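In case it helps anyone hitting the same error: since nothing in the source
actually uses com.google.inject, one common workaround for such a phantom
import is to mark the package optional in the maven-bundle-plugin
instructions. A sketch, untested for this module (assuming the bundle is
built with maven-bundle-plugin, as the Stanbol engines are):

```xml
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <Import-Package>
        <!-- do not fail resolution if no bundle exports com.google.inject -->
        com.google.inject;resolution:=optional,
        *
      </Import-Package>
    </instructions>
  </configuration>
</plugin>
```

With the import marked optional, Felix should resolve the bundle even when no
Guice bundle is deployed.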
-harish
On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <[email protected]> wrote:
> Hi Harish,
>
> I checked in a new language identifier for Stanbol based on
> http://code.google.com/p/language-detection/.
> Just check out from Stanbol trunk, install and try out.
>
>
> Best regards,
>
> Walter
>
> harish suvarna wrote:
>
>> Rupert,
>> My initial debugging of Chinese text told me that the language
>> identification done by the langid enhancer using Apache Tika does not
>> recognize Chinese. Tika's language detection does not seem to support the
>> CJK languages. As a result, Chinese text is identified as Lithuanian
>> ('lt'). The Apache Tika group has had an enhancement item registered for
>> detecting CJK languages since Feb 2012:
>>
>> https://issues.apache.org/jira/browse/TIKA-856
>>
>> I am not sure about the use of language identification in Stanbol yet. Is
>> the language id used to select the dbpedia index (the appropriate dbpedia
>> language dump) for entity lookups?
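>> As a stopgap while Tika lacks CJK detection, a simple Unicode-block check
>> can at least flag CJK text before a bogus result like 'lt' is trusted. A
>> minimal sketch (the class name and the majority threshold are my own, not
>> part of Tika or Stanbol):

```java
// Heuristic CJK check: counts code points in the CJK Unified Ideographs
// blocks. This is not a language detector - it cannot tell Chinese from
// Japanese Kanji - but it is enough to veto an implausible langid result.
public class CjkHeuristic {

    public static boolean looksLikeCjk(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        int cjk = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A) {
                cjk++;
            }
            i += Character.charCount(cp);
        }
        // treat the text as CJK if ideographs are the majority of code points
        return cjk * 2 > text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCjk("白日依山尽，黄河入海流")); // true
        System.out.println(looksLikeCjk("Hello world"));            // false
    }
}
```

>> A chain could run such a check before the langid engine and skip or
>> override the Tika result for texts that are clearly ideographic.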
>>
>>
>> I am just thinking that, for my purpose, I could pick option 3, make sure
>> the text is in the language of interest, and then call the paoding
>> segmenter, then iterate over the ngrams and do an entityhub lookup. I
>> still need to understand how the whole entity lookup for dbpedia works.
>>
>> I find that the language detection library
>> http://code.google.com/p/language-detection/ is very good at language
>> detection. It supports 53 languages out of the box and the quality seems
>> good. It is under the Apache 2.0 license. I could volunteer to create a
>> new langid engine based on it, with the Stanbol community's approval. So
>> if anyone sheds some light on how to add a new Java library into Stanbol,
>> that would be great. I am a Maven beginner for now.
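>> For anyone who wants to try it, using the library only takes a few lines.
>> A sketch based on its project documentation (it needs the library jar and
>> its directory of language profiles, here assumed to be unpacked to a local
>> "profiles" directory, so it will not run standalone):

```java
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;

public class LangIdSketch {
    public static void main(String[] args) throws Exception {
        // load the language profiles once per JVM
        DetectorFactory.loadProfile("profiles");
        Detector detector = DetectorFactory.create();
        detector.append("这是一段中文文本");
        // returns a language code, e.g. "zh-cn" for simplified Chinese
        System.out.println(detector.detect());
    }
}
```

>> An engine wrapper would mainly have to map the returned code onto the
>> language values Stanbol expects in its enhancement metadata.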
>>
>> Thanks,
>> harish
>>
>>
>>
>>
>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
>> [email protected]> wrote:
>>
>> Hi harish,
>>>
>>> Note: Sorry I forgot to include the stanbol-dev mailing list in my last
>>> answer.
>>>
>>>
>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <[email protected]>
>>> wrote:
>>>
>>>> Thanks a lot Rupert.
>>>>
>>>> I am weighing between options 2 and 3. What is the difference? Option 2
>>>> sounds like enhancing the KeywordLinkingEngine to deal with Chinese
>>>> text. It may be like paoding is hardcoded into the KeywordLinkingEngine.
>>>> Option 3 is like a separate engine.
>>>
>>> Option (2) will require some improvements on the Stanbol side. However,
>>> there have already been discussions on how to create a "text processing
>>> chain" that allows splitting up things like tokenizing, POS tagging,
>>> lemmatizing ... into different Enhancement Engines without suffering
>>> from the disadvantage of creating high amounts of RDF triples. One idea
>>> was to base this on the Apache Lucene TokenStream [1] API and share the
>>> data as a ContentPart [2] of the ContentItem.
>>>
>>> Option (3) indeed means that you will create your own
>>> EnhancementEngine - one similar to the KeywordLinkingEngine.
>>>
>>>> But will I be able to use the Stanbol dbpedia lookup using option 3?
>>>
>>> Yes. You only need to obtain an Entityhub "ReferencedSite" and use the
>>> "FieldQuery" interface to search for entities (see [3] for an example).
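>>> To make that concrete, a lookup along those lines might look like the
>>> following (a sketch from memory of the Entityhub servicesapi - check
>>> EntitySearcherUtils.java in [3] for the exact calls; "dbpedia" is the
>>> id of the referenced site):

```java
// Query a ReferencedSite for entities whose rdfs:label matches a token.
// siteManager would typically be injected as an OSGi service reference.
ReferencedSite site = siteManager.getReferencedSite("dbpedia");
FieldQuery query = site.getQueryFactory().createFieldQuery();
query.setConstraint(NamespaceEnum.rdfs + "label",
        new TextConstraint("北京", "zh")); // label text plus language
query.setLimit(10);
QueryResultList<Representation> results = site.find(query);
for (Representation r : results) {
    System.out.println(r.getId()); // matching entity URIs
}
```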
>>>
>>> best
>>> Rupert
>>>
>>> [1]
>>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>> [2]
>>> http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>> [3]
>>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>
>>>
>>>> Btw, I created my own enhancement engine chains and I could see them
>>>> yesterday on localhost:8080. But today all of them have vanished and
>>>> only the default chain shows up. Can I dig them up somewhere in the
>>>> Stanbol directory?
>>>>
>>>> -harish
>>>>
>>>> I just created the eclipse project
>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> There are no NER (Named Entity Recognition) models for Chinese text
>>>>> available via OpenNLP. So the default configuration of Stanbol will
>>>>> not process Chinese text. What you can do is configure a
>>>>> KeywordLinking Engine for Chinese text, as this engine can also
>>>>> process texts in unknown languages (see [1] for details).
>>>>>
>>>>> However, the KeywordLinking Engine also requires at least a tokenizer
>>>>> for looking up words. As there is no Chinese-specific tokenizer in
>>>>> OpenNLP, it will use the default one, which uses a fixed set of
>>>>> characters to split words (white spaces, hyphens ...). You may know
>>>>> better how well this would work with Chinese texts. My assumption
>>>>> would be that it is not sufficient - so results will be sub-optimal.
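>>>>> The effect is easy to demonstrate: Chinese text contains no spaces,
>>>>> so a whitespace-based tokenizer returns the whole sentence as one
>>>>> "word". A minimal illustration (plain String.split, not Stanbol's
>>>>> actual tokenizer code):

```java
public class TokenizerDemo {
    public static void main(String[] args) {
        String english = "Berlin is the capital of Germany";
        String chinese = "北京是中国的首都"; // "Beijing is the capital of China"

        // whitespace splitting works for English ...
        System.out.println(english.split("\\s+").length); // 6 tokens
        // ... but yields a single token for the whole Chinese sentence,
        // so no label could ever match in an entity lookup
        System.out.println(chinese.split("\\s+").length); // 1 token
    }
}
```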
>>>>>
>>>>> To apply Chinese optimization I see three possibilities:
>>>>>
>>>>> 1. add support for Chinese to OpenNLP (Tokenizer, Sentence detection,
>>>>> POS tagging, Named Entity Detection)
>>>>> 2. allow the KeywordLinkingEngine to use other already available tools
>>>>> for text processing (e.g. tools that are already available for
>>>>> Solr/Lucene [2], or the paoding Chinese segmenter referenced in your
>>>>> mail). Currently the KeywordLinkingEngine is hardwired to OpenNLP,
>>>>> because representing tokens, POS tags ... as RDF would be too much of
>>>>> an overhead.
>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>
>>>>> Hope this helps to get you started.
>>>>>
>>>>> best
>>>>> Rupert
>>>>>
>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>> [2]
>>>>> http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>>
>>>> On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Rupert,
>>>>>> Finally I am getting some time to work on Stanbol. My job is to
>>>>>> demonstrate Stanbol annotations for Chinese text.
>>>>>> I am just starting on it. I am following the instructions to build an
>>>>>> enhancement engine from Anuj's blog. dbpedia has some Chinese data
>>>>>> dumps too. We may have to depend on the ngrams as keys and look them
>>>>>> up in the dbpedia labels.
>>>>>>
>>>>>> I am planning to use the paoding Chinese segmenter
>>>>>> (http://code.google.com/p/paoding/) for word breaking.
>>>>>>
>>>>>> Just curious: I pasted some Chinese text into the default engine of
>>>>>> Stanbol. It finished the processing in no time at all. This gave me
>>>>>> the suspicion that maybe, if the language is Chinese, no further
>>>>>> processing is done. Is that right? Any more tips for making all this
>>>>>> work in Stanbol?
>>>>>>
>>>>>> -harish
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> | Rupert Westenthaler [email protected]
>>>>> | Bodenlehenstraße 11 ++43-699-11108907
>>>>> | A-5500 Bischofshofen
>>>>>
>>>>
>>>>
>>>
>>> --
>>> | Rupert Westenthaler [email protected]
>>> | Bodenlehenstraße 11 ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>>>
>
> --
> Dr. Walter Kasper
> DFKI GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Tel.: +49-681-85775-5300
> Fax: +49-681-85775-5338
> Email: [email protected]
> ---------------------------------------------------------------
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
> ---------------------------------------------------------------
>
>