I did 'mvn clean install'. Which stanbol folder is this? $HOME/stanbol where it stores some user/config prefs, or trunk/stanbol? You mean remove the entire folder?
I restarted the machine and am doing another mvn clean install now. I will post an update in another 30 mins.

-harish

On Wed, Aug 1, 2012 at 10:36 AM, Walter Kasper <[email protected]> wrote:
> Hi again,
>
> It came to my mind that you should also clear the 'stanbol' folder of the
> Stanbol runtime system and restart the system. The folder might contain old
> bundle configuration data that doesn't get updated automatically.
>
> Best regards,
>
> Walter
>
> harish suvarna wrote:
>> Did a fresh build, and inside Stanbol on localhost:8080 it is installed
>> but is not activated. I still see the com.google.inject errors.
>> I do see the pom.xml update from you.
>>
>> -harish
>>
>> On Wed, Aug 1, 2012 at 12:55 AM, Walter Kasper <[email protected]> wrote:
>>> Hi,
>>>
>>> The OSGi bundle declared some package imports that indeed are usually
>>> neither available nor required. I fixed that. Just check out the
>>> corrected pom.xml. On a fresh clean Stanbol installation langdetect
>>> worked fine for me.
>>>
>>> Best regards,
>>>
>>> Walter
>>>
>>> harish suvarna wrote:
>>>> Thanks Dr Walter. langdetect is very useful. I could successfully
>>>> compile it, but I am unable to load it into Stanbol as I get the error:
>>>> ======
>>>> ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]:
>>>> Error starting/stopping bundle.
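Walter's "clear the stanbol folder" step amounts to stopping the launcher and deleting its runtime working directory. A minimal sketch, assuming the default layout where the launcher creates a `stanbol/` folder next to the launcher jar (adjust the path if your setup differs):

```shell
# Stop the running Stanbol launcher first, then delete the runtime
# working directory so stale bundle configuration data is rebuilt
# from scratch on the next start. "stanbol/" is the assumed default
# location created next to the launcher jar.
rm -rf stanbol/
```

On the next start of the launcher the folder is re-created with fresh bundle configuration.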
>>>> (org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve
>>>> 177.0: missing requirement [177.0] package; (package=com.google.inject))
>>>> org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve
>>>> 177.0: missing requirement [177.0] package; (package=com.google.inject)
>>>> at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>>>> at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>>>> at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
>>>> at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
>>>> at java.lang.Thread.run(Thread.java:680)
>>>> ==============
>>>>
>>>> I added the dependency
>>>>
>>>> <dependency>
>>>>   <groupId>com.google.inject</groupId>
>>>>   <artifactId>guice</artifactId>
>>>>   <version>3.0</version>
>>>> </dependency>
>>>>
>>>> but it looks like it is looking for version 1.3.0, which I can't find
>>>> in repo1.maven.org. I am not sure what needs the inject library. The
>>>> entire source of the langdetect plugin does not contain the word
>>>> inject. Only the manifest file in target/classes has it listed.
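The unwanted `com.google.inject` import described above is generated into the bundle manifest by the maven-bundle-plugin, and the usual way to suppress it is an explicit Import-Package instruction in the engine's pom.xml. A hedged sketch of such a plugin section (this illustrates the general mechanism, not necessarily the exact fix Walter committed):

```xml
<!-- Sketch: exclude com.google.inject from the generated OSGi manifest.
     The '!' prefix removes the matching packages; the trailing '*'
     keeps the default import behaviour for everything else. -->
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <Import-Package>
        !com.google.inject.*,
        *
      </Import-Package>
    </instructions>
  </configuration>
</plugin>
```

After a rebuild, the `Import-Package` header in target/classes/META-INF/MANIFEST.MF should no longer list com.google.inject, so Felix will not try to resolve that package when starting the bundle.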
>>>>
>>>> -harish
>>>>
>>>> On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <[email protected]> wrote:
>>>>> Hi Harish,
>>>>>
>>>>> I checked in a new language identifier for Stanbol based on
>>>>> http://code.google.com/p/language-detection/ .
>>>>> Just check out from Stanbol trunk, install and try it out.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Walter
>>>>>
>>>>> harish suvarna wrote:
>>>>>> Rupert,
>>>>>>
>>>>>> My initial debugging for Chinese text told me that the language
>>>>>> identification done by the langid enhancer using Apache Tika does not
>>>>>> recognize Chinese. Tika's language detection does not seem to support
>>>>>> the CJK languages. As a result, Chinese text is identified as the
>>>>>> Lithuanian language 'lt'. The Apache Tika group has had an
>>>>>> enhancement item registered for detecting CJK languages since Feb
>>>>>> 2012:
>>>>>> https://issues.apache.org/jira/browse/TIKA-856
>>>>>> I am not sure about the use of language identification in Stanbol
>>>>>> yet. Is the language id used to select the dbpedia index (the
>>>>>> appropriate dbpedia language dump) for entity lookups?
>>>>>>
>>>>>> For my purpose, I am thinking of picking option 3: make sure the
>>>>>> text is in the language of interest, then call the paoding segmenter,
>>>>>> then iterate over the ngrams and do an entityhub lookup. I still need
>>>>>> to understand how the whole entity lookup for dbpedia works.
>>>>>>
>>>>>> I find that the language detection library
>>>>>> http://code.google.com/p/language-detection/ is very good at language
>>>>>> detection. It supports 53 languages out of the box and the quality
>>>>>> seems good. It is Apache 2.0 licensed. I could volunteer to create a
>>>>>> new langid engine based on this, with the Stanbol community's
>>>>>> approval. So if anyone sheds some light on how to add a new Java
>>>>>> library into Stanbol, that would be great. I am a Maven beginner now.
>>>>>>
>>>>>> Thanks,
>>>>>> harish
>>>>>>
>>>>>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
>>>>>> [email protected]> wrote:
>>>>>>> Hi harish,
>>>>>>>
>>>>>>> Note: Sorry I forgot to include the stanbol-dev mailing list in my
>>>>>>> last answer.
>>>>>>>
>>>>>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <[email protected]>
>>>>>>> wrote:
>>>>>>>> Thanks a lot Rupert.
>>>>>>>> I am weighing between options 2 and 3. What is the difference?
>>>>>>>> Option 2 sounds like enhancing the KeywordLinkingEngine to deal
>>>>>>>> with chinese text.
>>>>>>>> It may be that paoding gets hardcoded into the KeywordLinkingEngine.
>>>>>>>> Option 3 is like a separate engine.
>>>>>>>
>>>>>>> Option (2) will require some improvements on the Stanbol side.
>>>>>>> However, there were already discussions on how to create a "text
>>>>>>> processing chain" that allows splitting up things like tokenizing,
>>>>>>> POS tagging, lemmatizing ... into different Enhancement Engines
>>>>>>> without suffering from the disadvantages of creating high amounts of
>>>>>>> RDF triples. One idea was to base this on the Apache Lucene
>>>>>>> TokenStream [1] API and share the data as a ContentPart [2] of the
>>>>>>> ContentItem.
>>>>>>>
>>>>>>> Option (3) indeed means that you will create your own
>>>>>>> EnhancementEngine - a similar one to the KeywordLinkingEngine.
>>>>>>>
>>>>>>>> But will I be able to use the stanbol dbpedia lookup using option 3?
>>>>>>> Yes.
>>>>>>> You only need to obtain an Entityhub "ReferencedSite" and use the
>>>>>>> "FieldQuery" interface to search for entities (see [3] for an
>>>>>>> example).
>>>>>>>
>>>>>>> best
>>>>>>> Rupert
>>>>>>>
>>>>>>> [1] http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>>>>>> [2] http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>>>>>> [3] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>>>>>
>>>>>>>> Btw, I created my own enhancement engine chains and I could see
>>>>>>>> them yesterday in localhost:8080. But today all of them have
>>>>>>>> vanished and only the default chain shows up. Can I dig them up
>>>>>>>> somewhere in the stanbol directory?
>>>>>>>>
>>>>>>>> -harish
>>>>>>>>
>>>>>>>> I just created the eclipse project.
>>>>>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> There are no NER (Named Entity Recognition) models for Chinese
>>>>>>>>> text available via OpenNLP. So the default configuration of
>>>>>>>>> Stanbol will not process Chinese text. What you can do is
>>>>>>>>> configure a KeywordLinking Engine for Chinese text, as this engine
>>>>>>>>> can also process texts in unknown languages (see [1] for details).
>>>>>>>>>
>>>>>>>>> However, the KeywordLinking Engine also requires at least a
>>>>>>>>> tokenizer for looking up words. As there is no specific OpenNLP
>>>>>>>>> tokenizer for Chinese text, it will use the default one that uses
>>>>>>>>> a fixed set of chars to split words (white spaces, hyphens ...).
>>>>>>>>> You may know better how well this would work with Chinese texts.
>>>>>>>>> My assumption would be that it is not sufficient - so results will
>>>>>>>>> be sub-optimal.
>>>>>>>>>
>>>>>>>>> To apply Chinese optimization I see three possibilities:
>>>>>>>>>
>>>>>>>>> 1. add support for Chinese to OpenNLP (tokenizer, sentence
>>>>>>>>> detection, POS tagging, named entity detection)
>>>>>>>>> 2. allow the KeywordLinkingEngine to use other already available
>>>>>>>>> tools for text processing (e.g. stuff that is already available
>>>>>>>>> for Solr/Lucene [2], or the paoding Chinese segmenter referenced
>>>>>>>>> in your mail). Currently the KeywordLinkingEngine is hardwired to
>>>>>>>>> OpenNLP, because representing tokens, POS ... as RDF would be too
>>>>>>>>> much of an overhead.
>>>>>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>>>>>
>>>>>>>>> Hope this helps to get you started.
>>>>>>>>>
>>>>>>>>> best
>>>>>>>>> Rupert
>>>>>>>>>
>>>>>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>>>>>> [2] http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>>>>>>>>
>>>>>>>> On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <[email protected]>
>>>>>>>> wrote:
>>>>>>>>>> Hi Rupert,
>>>>>>>>>>
>>>>>>>>>> Finally I am getting some time to work on Stanbol. My job is to
>>>>>>>>>> demonstrate Stanbol annotations for Chinese text.
>>>>>>>>>> I am just starting on it. I am following the instructions to
>>>>>>>>>> build an enhancement engine from Anuj's blog. dbpedia has some
>>>>>>>>>> chinese data dump too.
>>>>>>>>>> We may have to depend on the ngrams as keys and look them up in
>>>>>>>>>> the dbpedia labels.
>>>>>>>>>>
>>>>>>>>>> I am planning to use the paoding chinese segmenter
>>>>>>>>>> (http://code.google.com/p/paoding/) for word breaking.
>>>>>>>>>>
>>>>>>>>>> Just curious: I pasted some chinese text into the default engine
>>>>>>>>>> of stanbol. It kind of finished the processing in no time at all.
>>>>>>>>>> This gave me the suspicion that maybe if the language is chinese,
>>>>>>>>>> no further processing is done. Is that right? Any more tips for
>>>>>>>>>> making all this work in Stanbol?
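Rupert's tokenizer concern and harish's "ngrams as keys" idea can both be illustrated with a few lines of plain Java. This is a standalone sketch: the `ngrams` helper is purely illustrative and is not part of Stanbol, OpenNLP, or paoding. Whitespace splitting leaves Chinese text as one giant token that can never match a dbpedia label, while character bigrams at least produce candidate lookup keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CjkTokens {

    // Hypothetical helper: all character n-grams of the text, usable
    // as candidate keys for a label lookup. For Chinese, bigrams (n=2)
    // are a common segmentation-free fallback.
    static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            result.add(text.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        String english = "Beijing University is in Beijing";
        String chinese = "北京大学在北京"; // same content, no spaces

        // Whitespace tokenization: fine for English, useless for Chinese.
        System.out.println(Arrays.asList(english.split("\\s+"))); // 5 tokens
        System.out.println(Arrays.asList(chinese.split("\\s+"))); // 1 "token"

        // Segmentation-free fallback: character bigrams as lookup keys.
        System.out.println(ngrams(chinese, 2));
    }
}
```

A real engine would of course prefer a proper segmenter such as paoding; bigrams are only the minimal fallback when no segmenter is available.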
>>>>>>>>>>
>>>>>>>>>> -harish
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> | Rupert Westenthaler [email protected]
>>>>>>>>> | Bodenlehenstraße 11 ++43-699-11108907
>>>>>>>>> | A-5500 Bischofshofen
>>>>>
>>>>> --
>>>>> Dr. Walter Kasper
>>>>> DFKI GmbH
>>>>> Stuhlsatzenhausweg 3
>>>>> D-66123 Saarbrücken
>>>>> Tel.: +49-681-85775-5300
>>>>> Fax: +49-681-85775-5338
>>>>> Email: [email protected]
>>>>> -------------------------------------------------------------
>>>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>>>
>>>>> Geschaeftsfuehrung:
>>>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>>> Dr. Walter Olthoff
>>>>>
>>>>> Vorsitzender des Aufsichtsrats:
>>>>> Prof. Dr. h.c. Hans A. Aukes
>>>>>
>>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>>> -------------------------------------------------------------
>
> --
> Dr. Walter Kasper, DFKI GmbH
