I did 'mvn clean install'. Which stanbol folder is this? $HOME/stanbol where it stores some user/config prefs, or trunk/stanbol? You mean remove the entire folder?
I restarted the machine and am doing another mvn clean install now. I will post an update in another 30 mins.

-harish

On Wed, Aug 1, 2012 at 10:36 AM, Walter Kasper <[email protected]> wrote:
> Hi again,
>
> It came to my mind that you should also clear the 'stanbol' folder of the
> Stanbol runtime system and restart the system. The folder might contain old
> bundle configuration data that doesn't get updated automatically.
>
> Best regards,
>
> Walter
>
> harish suvarna wrote:
>> Did a fresh build, and inside Stanbol on localhost:8080 it is installed
>> but is not activated. I still see the com.google.inject errors.
>> I do see the pom.xml update from you.
>>
>> -harish
>>
>> On Wed, Aug 1, 2012 at 12:55 AM, Walter Kasper <[email protected]> wrote:
>>> Hi,
>>>
>>> The OSGi bundle declared some package imports that indeed are usually
>>> neither available nor required. I fixed that. Just check out the
>>> corrected pom.xml. On a fresh clean Stanbol installation langdetect
>>> worked fine for me.
>>>
>>> Best regards,
>>>
>>> Walter
>>>
>>> harish suvarna wrote:
>>>> Thanks Dr Walter. langdetect is very useful. I could successfully
>>>> compile it, but I am unable to load it into Stanbol as I get the error:
>>>> ======
>>>> ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]:
>>>> Error starting/stopping bundle.
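Walter's "clear the stanbol folder" step amounts to stopping the launcher and deleting its runtime working directory. A minimal sketch, assuming the default layout where the launcher creates a `stanbol/` folder next to the launcher jar (adjust the path if your setup differs):

```shell
# Stop the running Stanbol launcher first, then delete the runtime
# working directory so stale bundle configuration data is rebuilt
# from scratch on the next start. "stanbol/" is the assumed default
# location created next to the launcher jar.
rm -rf stanbol/
```

On the next start of the launcher the folder is re-created with fresh bundle configuration.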
>>>> (org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve
>>>> 177.0: missing requirement [177.0] package; (package=com.google.inject))
>>>> org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve
>>>> 177.0: missing requirement [177.0] package; (package=com.google.inject)
>>>> at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>>>> at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>>>> at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
>>>> at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
>>>> at java.lang.Thread.run(Thread.java:680)
>>>> ==============
>>>>
>>>> I added the dependency
>>>>
>>>> <dependency>
>>>>   <groupId>com.google.inject</groupId>
>>>>   <artifactId>guice</artifactId>
>>>>   <version>3.0</version>
>>>> </dependency>
>>>>
>>>> but it looks like it is looking for version 1.3.0, which I can't find
>>>> in repo1.maven.org. I am not sure what needs the inject library. The
>>>> entire source of the langdetect plugin does not contain the word
>>>> inject. Only the manifest file in target/classes has it listed.
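The unwanted `com.google.inject` import described above is generated into the bundle manifest by the maven-bundle-plugin, and the usual way to suppress it is an explicit Import-Package instruction in the engine's pom.xml. A hedged sketch of such a plugin section (this illustrates the general mechanism, not necessarily the exact fix Walter committed):

```xml
<!-- Sketch: exclude com.google.inject from the generated OSGi manifest.
     The '!' prefix removes the matching packages; the trailing '*'
     keeps the default import behaviour for everything else. -->
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <Import-Package>
        !com.google.inject.*,
        *
      </Import-Package>
    </instructions>
  </configuration>
</plugin>
```

After a rebuild, the `Import-Package` header in target/classes/META-INF/MANIFEST.MF should no longer list com.google.inject, so Felix will not try to resolve that package when starting the bundle.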
>>>>
>>>> -harish
>>>>
>>>> On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <[email protected]> wrote:
>>>>> Hi Harish,
>>>>>
>>>>> I checked in a new language identifier for Stanbol based on
>>>>> http://code.google.com/p/language-detection/ .
>>>>> Just check out from Stanbol trunk, install and try it out.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Walter
>>>>>
>>>>> harish suvarna wrote:
>>>>>> Rupert,
>>>>>>
>>>>>> My initial debugging for Chinese text told me that the language
>>>>>> identification done by the langid enhancer using Apache Tika does not
>>>>>> recognize Chinese. Tika's language detection does not seem to support
>>>>>> the CJK languages. As a result, Chinese text is identified as the
>>>>>> Lithuanian language 'lt'. The Apache Tika group has had an
>>>>>> enhancement item registered for detecting CJK languages since Feb
>>>>>> 2012:
>>>>>> https://issues.apache.org/jira/browse/TIKA-856
>>>>>> I am not sure about the use of language identification in Stanbol
>>>>>> yet. Is the language id used to select the dbpedia index (the
>>>>>> appropriate dbpedia language dump) for entity lookups?
>>>>>>
>>>>>> For my purpose, I am thinking of picking option 3: make sure the
>>>>>> text is in the language of interest, then call the paoding segmenter,
>>>>>> then iterate over the ngrams and do an entityhub lookup. I still need
>>>>>> to understand how the whole entity lookup for dbpedia works.
>>>>>>
>>>>>> I find that the language detection library
>>>>>> http://code.google.com/p/language-detection/ is very good at language
>>>>>> detection. It supports 53 languages out of the box and the quality
>>>>>> seems good. It is Apache 2.0 licensed. I could volunteer to create a
>>>>>> new langid engine based on this, with the Stanbol community's
>>>>>> approval. So if anyone sheds some light on how to add a new Java
>>>>>> library into Stanbol, that would be great. I am a Maven beginner now.
>>>>>>
>>>>>> Thanks,
>>>>>> harish
>>>>>>
>>>>>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
>>>>>> [email protected]> wrote:
>>>>>>> Hi harish,
>>>>>>>
>>>>>>> Note: Sorry I forgot to include the stanbol-dev mailing list in my
>>>>>>> last answer.
>>>>>>>
>>>>>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <[email protected]>
>>>>>>> wrote:
>>>>>>>> Thanks a lot Rupert.
>>>>>>>> I am weighing between options 2 and 3. What is the difference?
>>>>>>>> Option 2 sounds like enhancing the KeywordLinkingEngine to deal
>>>>>>>> with chinese text.
>>>>>>>> It may be that paoding gets hardcoded into the KeywordLinkingEngine.
>>>>>>>> Option 3 is like a separate engine.
>>>>>>>
>>>>>>> Option (2) will require some improvements on the Stanbol side.
>>>>>>> However, there were already discussions on how to create a "text
>>>>>>> processing chain" that allows splitting up things like tokenizing,
>>>>>>> POS tagging, lemmatizing ... into different Enhancement Engines
>>>>>>> without suffering from the disadvantages of creating high amounts of
>>>>>>> RDF triples. One idea was to base this on the Apache Lucene
>>>>>>> TokenStream [1] API and share the data as a ContentPart [2] of the
>>>>>>> ContentItem.
>>>>>>>
>>>>>>> Option (3) indeed means that you will create your own
>>>>>>> EnhancementEngine - a similar one to the KeywordLinkingEngine.
>>>>>>>
>>>>>>>> But will I be able to use the stanbol dbpedia lookup using option 3?
>>>>>>> Yes.
>>>>>>> You only need to obtain an Entityhub "ReferencedSite" and use the
>>>>>>> "FieldQuery" interface to search for entities (see [3] for an
>>>>>>> example).
>>>>>>>
>>>>>>> best
>>>>>>> Rupert
>>>>>>>
>>>>>>> [1] http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>>>>>> [2] http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>>>>>> [3] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>>>>>
>>>>>>>> Btw, I created my own enhancement engine chains and I could see
>>>>>>>> them yesterday in localhost:8080. But today all of them have
>>>>>>>> vanished and only the default chain shows up. Can I dig them up
>>>>>>>> somewhere in the stanbol directory?
>>>>>>>>
>>>>>>>> -harish
>>>>>>>>
>>>>>>>> I just created the eclipse project.
>>>>>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> There are no NER (Named Entity Recognition) models for Chinese
>>>>>>>>> text available via OpenNLP. So the default configuration of
>>>>>>>>> Stanbol will not process Chinese text. What you can do is
>>>>>>>>> configure a KeywordLinking Engine for Chinese text, as this engine
>>>>>>>>> can also process texts in unknown languages (see [1] for details).
>>>>>>>>>
>>>>>>>>> However, the KeywordLinking Engine also requires at least a
>>>>>>>>> tokenizer for looking up words. As there is no specific OpenNLP
>>>>>>>>> tokenizer for Chinese text, it will use the default one that uses
>>>>>>>>> a fixed set of chars to split words (white spaces, hyphens ...).
>>>>>>>>> You may know better how well this would work with Chinese texts.
>>>>>>>>> My assumption would be that it is not sufficient - so results will
>>>>>>>>> be sub-optimal.
>>>>>>>>>
>>>>>>>>> To apply Chinese optimization I see three possibilities:
>>>>>>>>>
>>>>>>>>> 1. add support for Chinese to OpenNLP (tokenizer, sentence
>>>>>>>>> detection, POS tagging, named entity detection)
>>>>>>>>> 2. allow the KeywordLinkingEngine to use other already available
>>>>>>>>> tools for text processing (e.g. stuff that is already available
>>>>>>>>> for Solr/Lucene [2], or the paoding Chinese segmenter referenced
>>>>>>>>> in your mail). Currently the KeywordLinkingEngine is hardwired to
>>>>>>>>> OpenNLP, because representing tokens, POS ... as RDF would be too
>>>>>>>>> much of an overhead.
>>>>>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>>>>>
>>>>>>>>> Hope this helps to get you started.
>>>>>>>>>
>>>>>>>>> best
>>>>>>>>> Rupert
>>>>>>>>>
>>>>>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>>>>>> [2] http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>>>>>>>>
>>>>>>>> On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <[email protected]>
>>>>>>>> wrote:
>>>>>>>>>> Hi Rupert,
>>>>>>>>>>
>>>>>>>>>> Finally I am getting some time to work on Stanbol. My job is to
>>>>>>>>>> demonstrate Stanbol annotations for Chinese text.
>>>>>>>>>> I am just starting on it. I am following the instructions to
>>>>>>>>>> build an enhancement engine from Anuj's blog. dbpedia has some
>>>>>>>>>> chinese data dump too.
>>>>>>>>>> We may have to depend on the ngrams as keys and look them up in
>>>>>>>>>> the dbpedia labels.
>>>>>>>>>>
>>>>>>>>>> I am planning to use the paoding chinese segmenter
>>>>>>>>>> (http://code.google.com/p/paoding/) for word breaking.
>>>>>>>>>>
>>>>>>>>>> Just curious: I pasted some chinese text into the default engine
>>>>>>>>>> of stanbol. It kind of finished the processing in no time at all.
>>>>>>>>>> This gave me the suspicion that maybe if the language is chinese,
>>>>>>>>>> no further processing is done. Is that right? Any more tips for
>>>>>>>>>> making all this work in Stanbol?
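Rupert's tokenizer concern and harish's "ngrams as keys" idea can both be illustrated with a few lines of plain Java. This is a standalone sketch: the `ngrams` helper is purely illustrative and is not part of Stanbol, OpenNLP, or paoding. Whitespace splitting leaves Chinese text as one giant token that can never match a dbpedia label, while character bigrams at least produce candidate lookup keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CjkTokens {

    // Hypothetical helper: all character n-grams of the text, usable
    // as candidate keys for a label lookup. For Chinese, bigrams (n=2)
    // are a common segmentation-free fallback.
    static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            result.add(text.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        String english = "Beijing University is in Beijing";
        String chinese = "北京大学在北京"; // same content, no spaces

        // Whitespace tokenization: fine for English, useless for Chinese.
        System.out.println(Arrays.asList(english.split("\\s+"))); // 5 tokens
        System.out.println(Arrays.asList(chinese.split("\\s+"))); // 1 "token"

        // Segmentation-free fallback: character bigrams as lookup keys.
        System.out.println(ngrams(chinese, 2));
    }
}
```

A real engine would of course prefer a proper segmenter such as paoding; bigrams are only the minimal fallback when no segmenter is available.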
>>>>>>>>>>
>>>>>>>>>> -harish
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> | Rupert Westenthaler [email protected]
>>>>>>>>> | Bodenlehenstraße 11 ++43-699-11108907
>>>>>>>>> | A-5500 Bischofshofen
>>>>>
>>>>> --
>>>>> Dr. Walter Kasper
>>>>> DFKI GmbH
>>>>> Stuhlsatzenhausweg 3
>>>>> D-66123 Saarbrücken
>>>>> Tel.: +49-681-85775-5300
>>>>> Fax: +49-681-85775-5338
>>>>> Email: [email protected]
>>>>> -------------------------------------------------------------
>>>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>>>
>>>>> Geschaeftsfuehrung:
>>>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>>> Dr. Walter Olthoff
>>>>>
>>>>> Vorsitzender des Aufsichtsrats:
>>>>> Prof. Dr. h.c. Hans A. Aukes
>>>>>
>>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>>> -------------------------------------------------------------
>
> --
> Dr. Walter Kasper, DFKI GmbH
