Hi,
I am trying to add Chinese language processing using some open-source
segmenters. I had some communication with Rupert, and I am attaching his
suggestions below. This way I may get more suggestions and help, and
Rupert's ideas get distributed to everyone.

I am also following Anuj's blog to learn about Stanbol content enhancement
engine development.

I can successfully build Stanbol and play with the default chain.

I am now trying to create the Eclipse project. mvn eclipse:eclipse was
successful too. Then I imported the stanbol directory into the Eclipse
workspace.
In Eclipse, certain Stanbol projects are marked in red:

Description    Resource    Path    Location    Type

The project cannot be built until its prerequisite org.apache.stanbol.enhancer.servicesapi is built. Cleaning and building all projects is recommended
    org.apache.stanbol.enhancer.ldpath    Unknown    Java Problem

The project cannot be built until its prerequisite org.apache.stanbol.entityhub.indexing.core is built. Cleaning and building all projects is recommended
    org.apache.stanbol.entityhub.indexing.destination.solryard    Unknown    Java Problem

The project cannot be built until its prerequisite org.apache.stanbol.entityhub.core is built. Cleaning and building all projects is recommended
    org.apache.stanbol.entityhub.query.clerezza    Unknown    Java Problem

The project cannot be built until its prerequisite org.apache.stanbol.entityhub.core is built. Cleaning and building all projects is recommended
    org.apache.stanbol.entityhub.ldpath    Unknown    Java Problem

The project cannot be built until its prerequisite org.apache.stanbol.enhancer.servicesapi is built. Cleaning and building all projects is recommended
    org.apache.stanbol.enhancer.rdfentities    Unknown    Java Problem

The project cannot be built until its prerequisite org.apache.stanbol.enhancer.servicesapi is built. Cleaning and building all projects is recommended
    org.apache.stanbol.enhancer.test    Unknown    Java Problem

The project cannot be built until its prerequisite org.apache.stanbol.entityhub.core is built. Cleaning and building all projects is recommended
    org.apache.stanbol.entityhub.site.managed    Unknown    Java Problem
....
...

Are any extra steps needed?
Should I try to build and debug inside Eclipse, or build using mvn and debug
in Eclipse? What do developers commonly do?

-harish



================ Previous communication ================
Hi,

There are no NER (Named Entity Recognition) models for Chinese text
available via OpenNLP, so the default configuration of Stanbol will
not process Chinese text. What you can do is configure a
KeywordLinking Engine for Chinese text, as this engine can also process
text in unknown languages (see [1] for details).

However, the KeywordLinking Engine also requires at least a tokenizer
for looking up words. As there is no specific OpenNLP tokenizer for
Chinese text, it will use the default one, which uses a fixed set of
chars to split words (white spaces, hyphens ...). You may know better how
well this would work with Chinese texts. My assumption would be that
it is not sufficient, so results will be sub-optimal.
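
To illustrate the concern above, here is a minimal Python sketch. It is
not the OpenNLP tokenizer: the function split_on_delimiters and its
delimiter set (whitespace and hyphens) are assumptions standing in for
the "fixed set of chars" mentioned above. It only shows that splitting
on such characters leaves an unsegmented Chinese sentence as a single
token, while English text breaks into words.

```python
import re

# Assumed delimiter set (whitespace and hyphens); the real default
# tokenizer may split on a different set of characters.
DELIMITERS = re.compile(r"[\s\-]+")


def split_on_delimiters(text):
    """Split text on the assumed delimiters, dropping empty tokens."""
    return [tok for tok in DELIMITERS.split(text) if tok]


english = "Apache Stanbol is open-source"
chinese = "我爱北京天安门"  # written without spaces between words

print(split_on_delimiters(english))
print(split_on_delimiters(chinese))
```

With one token per sentence, a word-level lookup has nothing useful to
match against, which is why a Chinese-aware segmenter is needed.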

To apply Chinese optimization I see three possibilities:

1. add support for Chinese to OpenNLP (Tokenizer, Sentence detection,
POS tagging, Named Entity Detection)
2. allow the KeywordLinkingEngine to use other already available tools
for text processing (e.g. stuff that is already available for
Solr/Lucene [2], or the paoding Chinese segmenter referenced in your
mail). Currently the KeywordLinkingEngine is hardwired to OpenNLP,
because representing Tokens, POS ... as RDF would be too much of an
overhead.
3. implement a new EnhancementEngine for processing Chinese text.
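
For options 2 and 3, the missing piece is a segmenter that turns
Chinese text into word tokens. As a rough illustration only, the
sketch below uses forward maximum matching, one common and simple
dictionary-based approach. It is not the algorithm of paoding or of
any Lucene analyzer, and the function segment and its toy DICTIONARY
are hypothetical; a real segmenter ships a large lexicon.

```python
# Hypothetical toy dictionary, for illustration only.
DICTIONARY = {"我", "爱", "北京", "天安门"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def segment(text, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Forward maximum matching.

    At each position take the longest dictionary word that matches;
    if none matches, emit a single character and move on.
    """
    tokens = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(segment("我爱北京天安门"))
```

Tokens produced this way could then be looked up the same way
whitespace-separated words are for other languages.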

Hope this helps to get you started.

best
Rupert

[1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
[2] http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
harish suvarna
6:33 PM (22 minutes ago)

to Rupert
Thanks a lot Rupert.

I am weighing options 2 and 3. What is the difference? Option 2
sounds like enhancing the KeywordLinkingEngine to deal with Chinese text,
perhaps with paoding hardcoded into the KeywordLinkingEngine. Option 3 is
a separate engine. But will I be able to use the Stanbol DBpedia
lookup with option 3?

Btw, I created my own enhancement engine chains and I could see them
yesterday at localhost:8080. But today all of them have vanished and only
the default chain shows up. Can I dig them up somewhere in the Stanbol
directory?

-harish

I just created the eclipse project
