harish suvarna wrote:
I did ' mvn clean install'.
Which stanbol folder is this? $HOME/stanbol, where it stores some user/config prefs, or trunk/stanbol? You mean remove the entire folder?
I guess it is $HOME/stanbol where the runtime config data are stored. I
usually clear the complete folder for a clean restart.
I restarted the machine and am doing another mvn clean install now. I will post an update in about 30 minutes.
-harish
On Wed, Aug 1, 2012 at 10:36 AM, Walter Kasper <[email protected]> wrote:
Hi again,
It came to my mind that you should also clear the 'stanbol' folder of the Stanbol runtime system and restart the system. The folder might contain old bundle configuration data that don't get updated automatically.
Best regards,
Walter
harish suvarna wrote:
Did a fresh build; inside Stanbol at localhost:8080 the engine is installed but not activated. I still see the com.google.inject errors.
I do see the pom.xml update from you.
-harish
On Wed, Aug 1, 2012 at 12:55 AM, Walter Kasper <[email protected]> wrote:
Hi,
The OSGi bundle declared some package imports that are in fact neither available nor required. I fixed that. Just check out the corrected pom.xml.
On a fresh clean Stanbol installation langdetect worked fine for me.
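For readers following the thread: a fix like this usually lives in the maven-bundle-plugin configuration of the engine's pom.xml. A sketch of what such an exclusion can look like (the exact instruction in the corrected pom.xml may differ; `!com.google.inject` drops the spurious import before the wildcard):

```xml
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <!-- Exclude the spurious import; alternatively mark it as
           com.google.inject;resolution:=optional -->
      <Import-Package>
        !com.google.inject,
        *
      </Import-Package>
    </instructions>
  </configuration>
</plugin>
```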
Best regards,
Walter
harish suvarna wrote:
Thanks Dr. Walter. langdetect is very useful. I could compile it successfully but am unable to load it into Stanbol, as I get the error
======
ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Error starting/stopping bundle. (org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve 177.0: missing requirement [177.0] package; (package=com.google.inject))
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve 177.0: missing requirement [177.0] package; (package=com.google.inject)
        at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
        at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
        at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
        at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
        at java.lang.Thread.run(Thread.java:680)
==============
I added the dependency
<dependency>
  <groupId>com.google.inject</groupId>
  <artifactId>guice</artifactId>
  <version>3.0</version>
</dependency>
but it looks like it is looking for version 1.3.0, which I can't find in repo1.maven.org. I am not sure what is needing the inject library. The entire source of the langdetect plugin does not contain the word inject; only the manifest file in target/classes has it listed.
-harish
On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <[email protected]> wrote:
Hi Harish,
I checked in a new language identifier for Stanbol based on
http://code.google.com/p/language-detection/.
Just check out from Stanbol trunk, install and try out.
Best regards,
Walter
harish suvarna wrote:
Rupert,
My initial debugging for Chinese text told me that the language identification done by the langid enhancer using Apache Tika does not recognize Chinese. Tika's language detection does not seem to support the CJK languages; as a result, Chinese text is identified as Lithuanian ('lt'). The Apache Tika project has had an enhancement issue registered for detecting CJK languages,
https://issues.apache.org/jira/browse/TIKA-856
open since Feb 2012. I am not sure about the use of language identification in Stanbol yet. Is the language id used to select the dbpedia index (the appropriate dbpedia language dump) for entity lookups?
I am just thinking that, for my purpose, I could pick option 3, make sure the text is in the language of interest, and then call the paoding segmenter. Then iterate over the ngrams and do an entityhub lookup. I still need to understand how the whole entity lookup for dbpedia works.
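The ngram iteration described above can be sketched roughly like this (a minimal stdlib sketch; the token list stands in for the output of a segmenter such as paoding, and the entityhub lookup itself is out of scope):

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Build word n-grams (1..maxN) from a token list, e.g. the output
    // of a Chinese segmenter, to use as keys for label lookups.
    // Chinese writes no spaces between words, so grams are joined with "".
    static List<String> ngrams(List<String> tokens, int maxN) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join("", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical segmenter output for a short Chinese phrase
        List<String> tokens = List.of("我", "是", "中国", "人");
        // Each n-gram would then be looked up against the dbpedia labels
        System.out.println(ngrams(tokens, 2));
    }
}
```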
I find that the language detection library
http://code.google.com/p/language-detection/
is very good at language detection. It supports 53 languages out of the box and the quality seems good. It is Apache 2.0 licensed. I could volunteer to create a new langid engine based on it, with the Stanbol community's approval. If anyone could shed some light on how to add a new Java library into Stanbol, that would be great. I am a Maven beginner.
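For intuition, the general approach behind such detectors can be illustrated with a toy character-trigram profile matcher (a simplified stdlib sketch, not the library's actual API; real profiles are trained on large per-language corpora and use probabilistic scoring):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LangGuess {
    // Collect the character trigrams of a string.
    static Set<String> trigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Score each language profile by trigram overlap with the input
    // text and return the language with the largest overlap.
    static String detect(String text, Map<String, Set<String>> profiles) {
        String best = "unknown";
        int bestScore = 0;
        Set<String> input = trigrams(text);
        for (Map.Entry<String, Set<String>> e : profiles.entrySet()) {
            Set<String> overlap = new HashSet<>(input);
            overlap.retainAll(e.getValue());
            if (overlap.size() > bestScore) {
                bestScore = overlap.size();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy profiles built from tiny sample sentences
        Map<String, Set<String>> profiles = new HashMap<>();
        profiles.put("en", trigrams("the quick brown fox jumps over the lazy dog"));
        profiles.put("de", trigrams("der schnelle braune fuchs springt über den faulen hund"));
        System.out.println(detect("the fox is lazy", profiles));
    }
}
```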
Thanks,
harish
On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
[email protected]> wrote:
Hi harish,
Note: Sorry I forgot to include the stanbol-dev mailing list in my
last
answer.
On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <[email protected]>
wrote:
Thanks a lot Rupert.
I am weighing options 2 and 3. What is the difference? Option 2 sounds like enhancing the KeywordLinkingEngine to deal with Chinese text; it may be like paoding being hardcoded into the KeywordLinkingEngine. Option 3 is like a separate engine.
Option (2) will require some improvements on the Stanbol side. However, there have already been discussions on how to create a "text processing chain" that allows splitting up things like tokenizing, POS tagging, lemmatizing ... into different Enhancement Engines without suffering from the disadvantages of creating large amounts of RDF triples. One idea was to base this on the Apache Lucene TokenStream [1] API and share the data as a ContentPart [2] of the ContentItem.
Option (3) indeed means that you will create your own EnhancementEngine - one similar to the KeywordLinkingEngine.
But will I be able to use the stanbol dbpedia lookup using option 3?
Yes. You only need to obtain an Entityhub "ReferencedSite" and use the "FieldQuery" interface to search for entities (see [3] for an example).
best
Rupert
[1]
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
[2]
http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
[3]
http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
Btw, I created my own enhancement engine chains and I could see them yesterday in localhost:8080. But today all of them have vanished and only the default chain shows up. Can I dig them up somewhere in the stanbol directory?
-harish
I just created the eclipse project
On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
<[email protected]******> wrote:
Hi,
There are no NER (Named Entity Recognition) models for Chinese text available via OpenNLP. So the default configuration of Stanbol will not process Chinese text. What you can do is configure a KeywordLinking Engine for Chinese text, as this engine can also process text in unknown languages (see [1] for details).
However, the KeywordLinking Engine also requires at least a tokenizer for looking up words. As there is no Chinese-specific tokenizer for OpenNLP, it will use the default one, which uses a fixed set of chars to split words (white spaces, hyphens ...). You may know better how well this would work with Chinese texts. My assumption would be that it is not sufficient - so results will be sub-optimal.
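A quick illustration of why a whitespace-based default tokenizer falls short here (a stdlib sketch; Chinese writes no spaces between words, so the whole sentence comes back as one token):

```java
public class TokenizeDemo {
    public static void main(String[] args) {
        // Splitting on whitespace works for English...
        String english = "Stanbol processes text";
        System.out.println(english.split("\\s+").length); // 3 tokens

        // ...but leaves a Chinese sentence as a single token,
        // so keyword lookups get no usable words.
        String chinese = "我是中国人"; // "I am Chinese"
        System.out.println(chinese.split("\\s+").length); // 1 token
    }
}
```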
To apply Chinese optimization I see three possibilities:
1. add support for Chinese to OpenNLP (tokenizer, sentence detection, POS tagging, named entity detection)
2. allow the KeywordLinkingEngine to use other already available tools for text processing (e.g. stuff that is already available for Solr/Lucene [2], or the paoding chinese segmenter referenced in your mail). Currently the KeywordLinkingEngine is hardwired to OpenNLP, because representing tokens, POS ... as RDF would be too much of an overhead.
3. implement a new EnhancementEngine for processing Chinese text.
Hope this helps to get you started.
best
Rupert
[1]
http://incubator.apache.org/******stanbol/docs/trunk/**<http://incubator.apache.org/****stanbol/docs/trunk/**>
<http:/**/incubator.apache.org/****stanbol/docs/trunk/**<http://incubator.apache.org/**stanbol/docs/trunk/**>
multilingual.html<http://**inc**ubator.apache.org/stanbol/**<http://incubator.apache.org/stanbol/**>
docs/trunk/multilingual.html<h**ttp://incubator.apache.org/**
stanbol/docs/trunk/**multilingual.html<http://incubator.apache.org/stanbol/docs/trunk/multilingual.html>
[2]
http://wiki.apache.org/solr/******LanguageAnalysis#Chinese.2C_*
***<http://wiki.apache.org/solr/****LanguageAnalysis#Chinese.2C_**>
<http://wiki.apache.org/**solr/**LanguageAnalysis#**Chinese.2C_**<http://wiki.apache.org/solr/**LanguageAnalysis#Chinese.2C_**>
Japanese.2C_Korean<http://**wi**ki.apache.org/solr/**<http://wiki.apache.org/solr/**>
LanguageAnalysis#Chinese.2C_****Japanese.2C_Korean<http://**
wiki.apache.org/solr/**LanguageAnalysis#Chinese.2C_**
Japanese.2C_Korean<http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean>
On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <
[email protected]>
wrote:
Hi Rupert,
Finally I am getting some time to work on Stanbol. My job is to demonstrate Stanbol annotations for Chinese text.
I am just starting on it. I am following the instructions for building an enhancement engine from Anuj's blog. dbpedia has some chinese data dump too. We may have to depend on the ngrams as keys and look them up in the dbpedia labels.
I am planning to use the paoding chinese segmenter
(http://code.google.com/p/paoding/)
for word breaking.
Just curious: I pasted some chinese text into the default engine of stanbol. It finished the processing in no time at all. This gave me the suspicion that maybe, if the language is chinese, no further processing is done. Is that right? Any more tips for making all this work in Stanbol?
-harish
--
| Rupert Westenthaler [email protected]
| Bodenlehenstraße 11
++43-699-11108907
| A-5500 Bischofshofen
--
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.: +49-681-85775-5300
Fax: +49-681-85775-5338
Email: [email protected]
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------