I removed the ~/stanbol folder. It is not helping. Let me clear the trunk/stanbol folder and see what happens. I suspect some cache clearance problem.
-harish

On Wed, Aug 1, 2012 at 10:48 AM, Walter Kasper <[email protected]> wrote:
> harish suvarna wrote:
>> I did 'mvn clean install'.
>> Which stanbol folder is this?
>>
>> $HOME/stanbol, where it stores some user/config prefs, or trunk/stanbol?
>> You mean remove the entire folder?
>
> I guess it is $HOME/stanbol where the runtime config data are stored. I
> usually clear the complete folder for a clean restart.
>
>> I restarted the machine and am doing another mvn clean install now. I
>> will post you in another 30 mins.
>>
>> -harish
>>
>> On Wed, Aug 1, 2012 at 10:36 AM, Walter Kasper <[email protected]> wrote:
>>
>>> Hi again,
>>>
>>> It came to my mind that you should also clear the 'stanbol' folder of
>>> the Stanbol runtime system and restart the system. The folder might
>>> contain old bundle configuration data that don't get updated
>>> automatically.
>>>
>>> Best regards,
>>>
>>> Walter
>>>
>>> harish suvarna wrote:
>>>
>>>> Did a fresh build, and inside Stanbol at localhost:8080 it is
>>>> installed but not activated. I still see the com.google.inject errors.
>>>> I do see the pom.xml update from you.
>>>>
>>>> -harish
>>>>
>>>> On Wed, Aug 1, 2012 at 12:55 AM, Walter Kasper <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The OSGi bundle declared some package imports that indeed are usually
>>>>> neither available nor required. I fixed that. Just check out the
>>>>> corrected pom.xml. On a fresh, clean Stanbol installation langdetect
>>>>> worked fine for me.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Walter
>>>>>
>>>>> harish suvarna wrote:
>>>>>
>>>>>> Thanks Dr Walter. langdetect is very useful. I could successfully
>>>>>> compile it, but I am unable to load it into Stanbol as I get the error
>>>>>> ======
>>>>>> ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]:
>>>>>> Error starting/stopping bundle.
>>>>>> (org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to
>>>>>> resolve 177.0: missing requirement [177.0] package;
>>>>>> (package=com.google.inject))
>>>>>> org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to
>>>>>> resolve 177.0: missing requirement [177.0] package;
>>>>>> (package=com.google.inject)
>>>>>>     at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>>>>>>     at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>>>>>>     at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
>>>>>>     at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
>>>>>>     at java.lang.Thread.run(Thread.java:680)
>>>>>>
>>>>>> ==============
>>>>>>
>>>>>> I added the dependency
>>>>>> <dependency>
>>>>>>   <groupId>com.google.inject</groupId>
>>>>>>   <artifactId>guice</artifactId>
>>>>>>   <version>3.0</version>
>>>>>> </dependency>
>>>>>>
>>>>>> but it looks like it is looking for version 1.3.0, which I can't find
>>>>>> in repo1.maven.org. I am not sure what is needing the inject library.
>>>>>> The entire source of the langdetect plugin does not contain the word
>>>>>> inject. Only the manifest file in target/classes has it listed.
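[An unresolved constraint like the one above is typically caused by an overly broad Import-Package calculation at build time rather than by code actually using the package. A minimal sketch of how such a phantom import could be suppressed with the Apache Felix maven-bundle-plugin; the plugin version and exact instruction are assumptions for illustration, not the actual fix Walter committed:]

```xml
<!-- Sketch: tell the bundle plugin NOT to import com.google.inject,
     so the OSGi resolver no longer requires it at install time.
     Plugin version and placement in the engine pom.xml are assumed. -->
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <Import-Package>
        !com.google.inject.*,
        *
      </Import-Package>
    </instructions>
  </configuration>
</plugin>
```

[The leading `!` excludes the package; the trailing `*` keeps the plugin's normal automatic import calculation for everything else.]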
>>>>>> -harish
>>>>>>
>>>>>> On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Harish,
>>>>>>>
>>>>>>> I checked in a new language identifier for Stanbol based on
>>>>>>> http://code.google.com/p/language-detection/ .
>>>>>>> Just check it out from Stanbol trunk, install, and try it out.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Walter
>>>>>>>
>>>>>>> harish suvarna wrote:
>>>>>>>
>>>>>>>> Rupert,
>>>>>>>>
>>>>>>>> My initial debugging for Chinese text told me that the language
>>>>>>>> identification done by the langid enhancer using Apache Tika does
>>>>>>>> not recognize Chinese. The Tika language detection does not seem to
>>>>>>>> support the CJK languages. As a result, the Chinese language is
>>>>>>>> identified as the Lithuanian language 'lt'.
>>>>>>>> The Apache Tika group has an enhancement item, 856, registered for
>>>>>>>> detecting CJK languages
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/TIKA-856
>>>>>>>>
>>>>>>>> in Feb 2012. I am not sure about the use of language identification
>>>>>>>> in Stanbol yet. Is the language id used to select the dbpedia index
>>>>>>>> (appropriate dbpedia language dump) for entity lookups?
>>>>>>>>
>>>>>>>> I am just thinking that, for my purpose, I could pick option 3,
>>>>>>>> make sure the text is of the language of my interest, and then call
>>>>>>>> the paoding segmenter. Then iterate over the ngrams and do an
>>>>>>>> entityhub lookup. I still need to understand the code around how
>>>>>>>> the whole entity lookup for dbpedia works.
>>>>>>>> I find that the language detection library
>>>>>>>> http://code.google.com/p/language-detection/
>>>>>>>> is very good at language detection. It supports 53 languages out of
>>>>>>>> the box and the quality seems good. It is Apache 2.0 licensed. I
>>>>>>>> could volunteer to create a new langid engine based on this, with
>>>>>>>> the Stanbol community's approval. So if anyone sheds some light on
>>>>>>>> how to add a new Java library into Stanbol, that would be great. I
>>>>>>>> am a maven beginner now.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> harish
>>>>>>>>
>>>>>>>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi harish,
>>>>>>>>>
>>>>>>>>> Note: Sorry I forgot to include the stanbol-dev mailing list in my
>>>>>>>>> last answer.
>>>>>>>>>
>>>>>>>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks a lot Rupert.
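[On the question of adding a new Java library to Stanbol: in an OSGi build a common pattern is to declare the jar as a Maven dependency and embed it into the engine bundle with the maven-bundle-plugin. A hedged sketch only; the Maven coordinates and version shown for language-detection are assumptions, and the real engine pom may be organized differently:]

```xml
<!-- Sketch: depend on the language-detection jar (coordinates assumed)
     and embed it into the engine bundle, so OSGi can resolve it without
     a separate bundle for the library. -->
<dependency>
  <groupId>com.cybozu.labs</groupId>
  <artifactId>langdetect</artifactId>
  <version>1.1-20120112</version>
</dependency>

<!-- ...and in the build section: -->
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <configuration>
    <instructions>
      <Embed-Dependency>langdetect</Embed-Dependency>
    </instructions>
  </configuration>
</plugin>
```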
>>>>>>>>>> I am weighing between options 2 and 3. What is the difference?
>>>>>>>>>> Option 2 sounds like enhancing the KeywordLinkingEngine to deal
>>>>>>>>>> with Chinese text. It may be like paoding is hardcoded into the
>>>>>>>>>> KeywordLinkingEngine. Option 3 is like a separate engine.
>>>>>>>>>
>>>>>>>>> Option (2) will require some improvements on the Stanbol side.
>>>>>>>>> However, there were already discussions on how to create a "text
>>>>>>>>> processing chain" that allows splitting up things like tokenizing,
>>>>>>>>> POS tagging, lemmatizing ... into different Enhancement Engines
>>>>>>>>> without suffering from the disadvantages of creating high amounts
>>>>>>>>> of RDF triples. One idea was to base this on the Apache Lucene
>>>>>>>>> TokenStream [1] API and share the data as a ContentPart [2] of the
>>>>>>>>> ContentItem.
>>>>>>>>>
>>>>>>>>> Option (3) indeed means that you will create your own
>>>>>>>>> EnhancementEngine - a similar one to the KeywordLinkingEngine.
>>>>>>>>>
>>>>>>>>>> But will I be able to use the stanbol dbpedia lookup using
>>>>>>>>>> option 3?
>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>> You only need to obtain an Entityhub "ReferencedSite" and use the
>>>>>>>>> "FieldQuery" interface to search for Entities (see [3] for an
>>>>>>>>> example).
>>>>>>>>>
>>>>>>>>> best
>>>>>>>>> Rupert
>>>>>>>>>
>>>>>>>>> [1] http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>>>>>>>> [2] http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>>>>>>>> [3] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>>>>>>>
>>>>>>>>>> Btw, I created my own enhancement engine chains and I could see
>>>>>>>>>> them yesterday in localhost:8080. But today all of them have
>>>>>>>>>> vanished and only the default chain shows up. Can I dig them up
>>>>>>>>>> somewhere in the stanbol directory?
>>>>>>>>>>
>>>>>>>>>> -harish
>>>>>>>>>>
>>>>>>>>>> I just created the eclipse project.
>>>>>>>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> There are no NER (Named Entity Recognition) models for Chinese
>>>>>>>>>>> text available via OpenNLP. So the default configuration of
>>>>>>>>>>> Stanbol will not process Chinese text. What you can do is
>>>>>>>>>>> configure a KeywordLinking Engine for Chinese text, as this
>>>>>>>>>>> engine can also process texts in unknown languages (see [1] for
>>>>>>>>>>> details).
>>>>>>>>>>>
>>>>>>>>>>> However, the KeywordLinking Engine also requires at least a
>>>>>>>>>>> tokenizer for looking up words.
>>>>>>>>>>> As there is no OpenNLP Tokenizer specific to Chinese text, it
>>>>>>>>>>> will use the default one, which uses a fixed set of chars to
>>>>>>>>>>> split words (white spaces, hyphens ...). You may know better how
>>>>>>>>>>> well this would work with Chinese texts. My assumption would be
>>>>>>>>>>> that it is not sufficient - so results will be sub-optimal.
>>>>>>>>>>>
>>>>>>>>>>> To apply Chinese optimization I see three possibilities:
>>>>>>>>>>>
>>>>>>>>>>> 1. add support for Chinese to OpenNLP (Tokenizer, Sentence
>>>>>>>>>>> detection, POS tagging, Named Entity Detection)
>>>>>>>>>>> 2. allow the KeywordLinkingEngine to use other already available
>>>>>>>>>>> tools for text processing (e.g. stuff that is already available
>>>>>>>>>>> for Solr/Lucene [2], or the paoding Chinese segmenter referenced
>>>>>>>>>>> in your mail). Currently the KeywordLinkingEngine is hardwired
>>>>>>>>>>> to OpenNLP, because representing Tokens, POS ... as RDF would be
>>>>>>>>>>> too much of an overhead.
>>>>>>>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>>>>>>>
>>>>>>>>>>> Hope this helps to get you started.
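[The limitation of a tokenizer that splits on a fixed character set is easy to see with a minimal, standalone Java sketch (not Stanbol code): splitting on whitespace yields word tokens for English but returns unsegmented Chinese text as a single token, leaving nothing useful to look up against dbpedia labels.]

```java
public class WhitespaceSplitDemo {
    public static void main(String[] args) {
        // English text separates words with spaces, so a whitespace
        // split yields one token per word.
        String en = "Beijing is the capital of China";
        System.out.println(en.split("\\s+").length); // prints 6

        // Chinese text carries no spaces between words, so the same
        // split returns the whole sentence as one "token" - useless
        // for dictionary or entity lookups.
        String zh = "北京是中国的首都";
        System.out.println(zh.split("\\s+").length); // prints 1
    }
}
```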
>>>>>>>>>>> best
>>>>>>>>>>> Rupert
>>>>>>>>>>>
>>>>>>>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>>>>>>>> [2] http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Rupert,
>>>>>>>>>>>>
>>>>>>>>>>>> Finally I am getting some time to work on Stanbol. My job is to
>>>>>>>>>>>> demonstrate Stanbol annotations for Chinese text.
>>>>>>>>>>>> I am just starting on it. I am following the instructions to
>>>>>>>>>>>> build an enhancement engine from Anuj's blog. dbpedia has some
>>>>>>>>>>>> Chinese data dump too.
>>>>>>>>>>>> We may have to depend on the ngrams as keys and look them up in
>>>>>>>>>>>> the dbpedia labels.
>>>>>>>>>>>>
>>>>>>>>>>>> I am planning to use the paoding Chinese segmenter
>>>>>>>>>>>> (http://code.google.com/p/paoding/) for word breaking.
>>>>>>>>>>>>
>>>>>>>>>>>> Just curious.
>>>>>>>>>>>> I pasted some Chinese text into the default engine of Stanbol.
>>>>>>>>>>>> It kind of finished the processing in no time at all. This gave
>>>>>>>>>>>> me the suspicion that maybe, if the language is Chinese, no
>>>>>>>>>>>> further processing is done. Is that right? Any more tips for
>>>>>>>>>>>> making all this work in Stanbol?
>>>>>>>>>>>>
>>>>>>>>>>>> -harish
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> | Rupert Westenthaler [email protected]
>>>>>>>>>>> | Bodenlehenstraße 11    ++43-699-11108907
>>>>>>>>>>> | A-5500 Bischofshofen
>>>>>>>
>>>>>>> --
>>>>>>> Dr. Walter Kasper
>>>>>>> DFKI GmbH
>>>>>>> Stuhlsatzenhausweg 3
>>>>>>> D-66123 Saarbrücken
>>>>>>> Tel.: +49-681-85775-5300
>>>>>>> Fax: +49-681-85775-5338
>>>>>>> Email: [email protected]
>>>>>>> -------------------------------------------------------------
>>>>>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>>>>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>>>>>
>>>>>>> Geschaeftsfuehrung:
>>>>>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>>>>> Dr. Walter Olthoff
>>>>>>>
>>>>>>> Vorsitzender des Aufsichtsrats:
>>>>>>> Prof. Dr. h.c. Hans A. Aukes
>>>>>>>
>>>>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>>>>> -------------------------------------------------------------
