Hi,

The OSGi bundle declared some package imports that are actually neither available nor required. I fixed that; just check out the corrected pom.xml. On a fresh, clean Stanbol installation langdetect worked fine for me.
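For anyone hitting the same issue: the usual fix (a sketch of the common maven-bundle-plugin approach; the exact instruction in the corrected pom.xml may differ) is to exclude the spurious package from the generated Import-Package header:

```xml
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <configuration>
    <instructions>
      <!-- drop the spurious Guice import; import everything else as usual -->
      <Import-Package>!com.google.inject*,*</Import-Package>
    </instructions>
  </configuration>
</plugin>
```

The leading `!com.google.inject*` entry tells bnd not to emit an OSGi import for that package, so Felix no longer tries to resolve it at start time.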

Best regards,

Walter

harish suvarna wrote:
Thanks Dr. Walter. langdetect is very useful. I could successfully compile
it, but I am unable to load it into Stanbol as I get the error
======
ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Error
starting/stopping bundle. (org.osgi.framework.BundleException: Unresolved
constraint in bundle org.apache.stanbol.enhancer.engines.langdetect [177]:
Unable to resolve 177.0: missing requirement [177.0] package;
(package=com.google.inject))
org.osgi.framework.BundleException: Unresolved constraint in bundle
org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve
177.0: missing requirement [177.0] package; (package=com.google.inject)
     at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
     at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
     at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
     at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
     at java.lang.Thread.run(Thread.java:680)
==============

I added the dependency
<dependency>
  <groupId>com.google.inject</groupId>
  <artifactId>guice</artifactId>
  <version>3.0</version>
</dependency>

but it looks like it is looking for version 1.3.0, which I can't find on
repo1.maven.org. I am not sure what needs the inject library; the
entire source of the langdetect plugin does not contain the word
"inject". Only the manifest file in target/classes lists it.


-harish

On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <[email protected]> wrote:

Hi Harish,

I checked in a new language identifier for Stanbol based on
http://code.google.com/p/language-detection/. Just check it out from
Stanbol trunk, install it, and try it out.
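For context, the engine wraps the language-detection library; its API is used roughly like this (a sketch, not runnable without the langdetect jar and its language-profile files; the profile path below is a placeholder):

```java
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

public class LangDetectSketch {
    public static void main(String[] args) throws LangDetectException {
        // load the language profiles shipped with the library (placeholder path)
        DetectorFactory.loadProfile("/path/to/profiles");
        Detector detector = DetectorFactory.create();
        detector.append("这是一段中文文本");
        System.out.println(detector.detect()); // a language code such as "zh-cn"
    }
}
```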


Best regards,

Walter

harish suvarna wrote:

Rupert,
My initial debugging with Chinese text told me that the language
identification done by the langid enhancer using Apache Tika does not
recognize Chinese. Tika's language detection does not seem to support
the CJK languages; as a result, Chinese text gets identified as
Lithuanian ('lt'). The Apache Tika project has an enhancement issue
registered for detecting CJK languages,
https://issues.apache.org/jira/browse/TIKA-856, filed in Feb 2012. I
am not sure about the use of language identification in Stanbol yet.
Is the language id used to select the DBpedia index (the appropriate
DBpedia language dump) for entity lookups?


I am thinking that, for my purpose, I will pick option 3: make sure
the text is in the language of interest, call the paoding segmenter,
then iterate over the n-grams and do an Entityhub lookup for each. I
still need to understand the code around how the whole entity lookup
for DBpedia works.
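The n-gram part of that plan can be sketched in plain Java (segmenter and Entityhub lookup left out; this only shows generating candidate character n-grams to use as lookup keys):

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // all substrings of length n from the text, as label-lookup candidates
    static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("中华人民共和国", 2));
        // [中华, 华人, 人民, 民共, 共和, 和国]
    }
}
```

Each bigram would then be matched against the DBpedia labels; in practice the segmenter output, not raw characters, would feed this.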

I find that the language detection library
http://code.google.com/p/language-detection/ is very good at language
detection. It supports 53 languages out of the box and the quality
seems good. It is Apache 2.0 licensed. I could volunteer to create a
new langid engine based on this, with the Stanbol community's
approval. If anyone could shed some light on how to add a new Java
library into Stanbol, that would be great. I am a Maven beginner.

Thanks,
harish




On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
[email protected]> wrote:

Hi Harish,
Note: Sorry I forgot to include the stanbol-dev mailing list in my last
answer.


On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <[email protected]>
wrote:

Thanks a lot Rupert.

I am weighing options 2 and 3. What is the difference? Option 2
sounds like enhancing the KeywordLinkingEngine to deal with Chinese
text; it may be like paoding is hardcoded into the
KeywordLinkingEngine. Option 3 is like a separate engine.

Option (2) will require some improvements on the Stanbol side.
However, there have already been discussions on how to create a "text
processing chain" that allows splitting up things like tokenizing, POS
tagging, lemmatizing ... into different enhancement engines without
suffering from the disadvantage of creating high amounts of RDF
triples. One idea was to base this on the Apache Lucene TokenStream
[1] API and share the data as a ContentPart [2] of the ContentItem.

Option (3) indeed means that you will create your own
EnhancementEngine, similar to the KeywordLinkingEngine.

But will I be able to use the Stanbol DBpedia lookup using option 3?

Yes. You only need to obtain an Entityhub "ReferencedSite" and use the
"FieldQuery" interface to search for entities (see [3] for an example).
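Roughly, such a lookup looks like this (a from-memory sketch of the Entityhub API, not compiled here; class and method names should be checked against EntitySearcherUtils.java):

```java
// obtain the ReferencedSite (e.g. the "dbpedia" site) via OSGi/Declarative Services
ReferencedSite site = ...; // injected

// build a FieldQuery that searches entities by rdfs:label
FieldQuery query = site.getQueryFactory().createFieldQuery();
query.setConstraint("http://www.w3.org/2000/01/rdf-schema#label",
        new TextConstraint("白马", "zh")); // text plus language constraint
query.setLimit(10);

// execute and iterate over the matching entity representations
QueryResultList<Representation> results = site.find(query);
for (Representation r : results) {
    System.out.println(r.getId());
}
```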

best
Rupert

[1]
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
[2]
http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
[3]
http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java


By the way, I created my own enhancement engine chains and I could see
them yesterday at localhost:8080. But today all of them have vanished
and only the default chain shows up. Can I dig them up somewhere in
the Stanbol directory?

-harish

I just created the Eclipse project.
On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
<[email protected]> wrote:

Hi,

There are no NER (Named Entity Recognition) models for Chinese text
available via OpenNLP, so the default configuration of Stanbol will
not process Chinese text. What you can do is configure a
KeywordLinking engine for Chinese text, as this engine can also
process text in unknown languages (see [1] for details).

However, the KeywordLinking engine also requires at least a tokenizer
for looking up words. As there is no specific OpenNLP tokenizer for
Chinese text, it will use the default one, which uses a fixed set of
characters to split words (white spaces, hyphens ...). You may know
better how well this would work with Chinese texts. My assumption
would be that it is not sufficient - so results will be sub-optimal.
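The problem can be seen with a two-line sketch (hypothetical sample strings): splitting on the default word-boundary characters works for English but leaves Chinese text as one single token, since Chinese uses no spaces.

```java
public class SplitDemo {
    public static void main(String[] args) {
        // English splits into words; Chinese has no delimiters, so it stays whole
        String[] en = "the white horse".split("[\\s\\-]+");
        String[] zh = "白马是马".split("[\\s\\-]+");
        System.out.println(en.length); // 3
        System.out.println(zh.length); // 1
    }
}
```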

To apply Chinese optimization I see three possibilities:

1. add support for Chinese to OpenNLP (Tokenizer, Sentence detection,
POS tagging, Named Entity Detection)
2. allow the KeywordLinkingEngine to use other already available tools
for text processing (e.g. stuff that is already available for
Solr/Lucene [2], or the paoding Chinese segmenter referenced in your
mail). Currently the KeywordLinkingEngine is hardwired to OpenNLP,
because representing tokens, POS tags ... as RDF would be too much of
an overhead.
3. implement a new EnhancementEngine for processing Chinese text.

Hope this helps to get you started.

best
Rupert

[1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
[2]
http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean

On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <[email protected]>
wrote:

Hi Rupert,
Finally I am getting some time to work on Stanbol. My job is to
demonstrate Stanbol annotations for Chinese text. I am just starting
on it. I am following the instructions to build an enhancement engine
from Anuj's blog. DBpedia has a Chinese data dump too. We may have to
depend on the n-grams as keys and look them up in the DBpedia labels.

I am planning to use the paoding Chinese segmenter
(http://code.google.com/p/paoding/) for word breaking.

Just curious: I pasted some Chinese text into the default engine of
Stanbol. It finished processing in no time at all. This made me
suspect that if the language is Chinese, no further processing is
done. Is that right? Any more tips for making all this work in
Stanbol?

-harish


--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen




--
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: [email protected]
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------




