I read the readme.md in entityhub/indexing/dbpedia and started indexing the
Chinese DBpedia 3.8. The Chinese DBpedia 3.8 dump does not seem to include two
of the files the readme calls for: instance_types and person_data. I went
ahead and tried to run the index generation anyway, and I get a Java
exception. Some other observations:

1. The curl step to generate incoming_links.txt took more than 3 hours and
produced 2.5 GB of data.
2. DBpedia 3.8 seems to use "Category:" and not "Cat:", so the step that
substitutes "Cat:" with "Category:" is no longer required.
3. After the Java exception, the index generation program neither does
anything nor terminates. I waited overnight and then killed it.

Previously I did not understand your question
Are this the data for the Entities with the URIs
"http://zh.dboedua.org/resource/{name}"?

Now I understand. There is no Chinese DBpedia server running.
http://wiki.dbpedia.org/Internationalization has the list of language chapters
currently supported by DBpedia; Chinese is yet to come. My intention is to
build a Stanbol Solr index of the Chinese DBpedia dump so that I can 'spot'
keywords in DBpedia better than with the English dump.

-harish

=======================================================

08:23:37,914 [Thread-5] ERROR source.ResourceLoader - Unable to load resource /Users/harishs/Linguistics2/dbpedia/indexing/resources/rdfdata/labels_zh.nt.bz2
org.openjena.riot.RiotException: [line: 6972, col: 46] Broken token: http://www.w3.org/2000/01/rdf-sche
    at org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
    at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
    at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
    at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:38)
    at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
    at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
    at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
    at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
    at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
    at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
    at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
    at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:72)
    at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
    at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
    at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:245)
    at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
    at java.lang.Thread.run(Thread.java:680)

08:23:37,917 [Thread-5] ERROR source.ResourceLoader - Exception while loading file /Users/harishs/Linguistics2/dbpedia/indexing/resources/rdfdata/labels_zh.nt.bz2
org.openjena.riot.RiotException: [line: 6972, col: 46] Broken token: http://www.w3.org/2000/01/rdf-sche
    (same stack trace as above)

Exception in thread "Thread-5" java.lang.IllegalStateException: Error while loading Resource /Users/harishs/Linguistics2/dbpedia/indexing/resources/rdfdata/labels_zh.nt.bz2
    at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.setResourceState(ResourceLoader.java:273)
    at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:215)
    at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
    at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:245)
    at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
    at java.lang.Thread.run(Thread.java:680)
Caused by: org.openjena.riot.RiotException: [line: 6972, col: 46] Broken token: http://www.w3.org/2000/01/rdf-sche
    (same stack trace as the first error, down to ResourceLoader.loadResource(ResourceLoader.java:201))
    ... 4 more

=========================================================

On Wed, Aug 15, 2012 at 6:01 PM, harish suvarna <[email protected]> wrote:
> Thanks Rupert. I am making some progress here. I am finding that paoding
> breaks words into small segments, especially foreign names. For example,
> "motorola" is broken into two parts (mot, rola), and similarly "michael"
> is broken into (mik, kael). The ngram-based DBpedia lookup then searches
> for these in the DBpedia index and cannot find them.
> My segmentation process and the DBpedia Solr index must both use the same
> segmenter. There is a paoding analyzer for Solr too; I just need to create
> the Solr index for DBpedia using it.
> Actually, right now I get more DBpedia hits for Chinese text with the
> character-ngram-based DBpedia lookup than with paoding.
> We don't know which language analyzers ogrisel used when creating the
> 1.19 GB Solr DBpedia dump.
>
> I also experimented with contenthub search for Chinese. Right now it does
> not work; I need to debug that part as well. Even the UI in the contenthub
> does not display the Chinese characters, while the enhancer UI does
> display them well.
>
> Also, for English Stanbol I did play with contenthub. I took a small text
> as follows:
> ==============
> United States produced an Olympic-record time to win gold in the women's
> 200m freestyle relay final. A brilliant final leg from Allison Schmitt led
> the Americans home, ahead of Australia, in a time of seven minutes 42.92
> seconds. Missy Franklin gave them a great start, while Dana Vollmer and
> Shannon Vreeland also produced fast times.
> =====================================================================
>
> The above text is properly processed and I get the DBpedia links for all
> persons and countries in the above.
> However, the above piece is related to 'swimming', and this word does not
> appear at all in the text. In the DBpedia link for Allison Schmitt, the
> DBpedia categories do tell us that she is in a swimming category. Has
> anyone tried to process the categories behind the link and add them as
> metadata for the content? If we add this, we add more value than a simple
> Solr-based search in the content store. Someone at the IKS conference
> demoed this as a semantic search. Any hints/clues on this work?
>
>
> On Wed, Aug 15, 2012 at 1:25 PM, Rupert Westenthaler <
> [email protected]> wrote:
>
>> On Wed, Aug 15, 2012 at 3:06 AM, harish suvarna <[email protected]>
>> wrote:
>> > Is {stanbol-trunk}/entityhub/indexing/dbpedia different from the
>> > custom ontology file tool that is mentioned in
>> > http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html ?
>>
>> The custom DBpedia indexing tool comes with a different default
>> configuration and also with a customised Solr schema (schema.xml file)
>> for DBpedia. Otherwise it is the same software as the generic RDF
>> indexing tool. Most of the things mentioned in "customvocabulary.html"
>> are also valid for the DBpedia indexing tool. Please also note the
>> readme and the comments in the configuration of the DBpedia indexing
>> tool.
>>
>> > Is it the same as the entityhub page in Stanbol at localhost:8080?
>>
>> This tool was used to create all available DBpedia indexes for Apache
>> Stanbol, including the DBpedia default data (shipped with the
>> launcher).
>>
>> best
>> Rupert
>>
>> >
>> > -harish
>> >
>> >
>> > On Thu, Aug 9, 2012 at 10:58 PM, Rupert Westenthaler <
>> > [email protected]> wrote:
>> >
>> >> Hi
>> >>
>> >> On Fri, Aug 10, 2012 at 1:28 AM, harish suvarna <[email protected]>
>> >> wrote:
>> >> > Thanks Rupert for the update.
>> >> > Meanwhile I am looking at the page on generating a custom
>> >> > vocabulary index,
>> >> > http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html,
>> >> > and trying to work out which of the files in the DBpedia Chinese
>> >> > download at http://downloads.dbpedia.org/3.8/zh/ I have to use.
>> >>
>> >> Are this the data for the Entities with the URIs
>> >> "http://zh.dboedua.org/resource/{name}"?
>> >>
>> >> Anyway, cool that DBpedia 3.8 finally got released!
>> >>
>> >> >
>> >> > The DBpedia download for Chinese has article categories, labels,
>> >> > short/long abstracts, and inter-language links. I don't know which
>> >> > ones to use for the Stanbol entityhub custom vocabulary index tool.
>> >>
>> >> For linking concepts you need only the labels. If you also include
>> >> the short abstracts you will also have the mouse-over text in the
>> >> Stanbol Enhancer UI. Geo coordinates are needed for the map in the
>> >> Enhancer UI.
>> >>
>> >> You should also include the data providing the rdf:types of the
>> >> Entities. However, I do not know which of the files includes those.
>> >>
>> >> Categories are currently not used by Stanbol. If you want to include
>> >> them you should add (1) the categories, (2) the category labels and
>> >> (3) the article categories.
>> >>
>> >> Note that there is a dedicated Entityhub Indexing Tool for DBpedia at
>> >> {stanbol-trunk}/entityhub/indexing/dbpedia.
>> >>
>> >> best
>> >> Rupert
>> >>
>> >> >
>> >> > -harish
>> >> >
>> >> >
>> >> > On Thu, Aug 9, 2012 at 11:08 AM, Rupert Westenthaler <
>> >> > [email protected]> wrote:
>> >> >
>> >> >> Hi
>> >> >>
>> >> >> the dbpedia 3.7 index was built by ogrisel, so I do not know the
>> >> >> details.
>> >> >>
>> >> >> I think Chinese (zh) labels are included, but the index only
>> >> >> contains Entities for Wikipedia pages with 5 or more incoming
>> >> >> links.
>> >> >>
>> >> >> In addition, while the English DBpedia contains zh labels, it will
>> >> >> not contain Entities that do not have a counterpart in the English
>> >> >> Wikipedia.
>> >> >>
>> >> >> best
>> >> >> Rupert
>> >> >>
>> >> >> On Thu, Aug 9, 2012 at 1:00 AM, harish suvarna <[email protected]>
>> >> >> wrote:
>> >> >> > I received a USB stick at the IKS conference which contained the
>> >> >> > 1.19 GB full DBpedia Solr index. Does it contain the data from
>> >> >> > the Chinese dump (available on the dbpedia.org download server
>> >> >> > under the zh folder)?
>> >> >> >
>> >> >> > I do get some DBpedia entries for Chinese text in Stanbol
>> >> >> > enhancements. I am using the 1.19 GB dump. I am expecting some
>> >> >> > more enhancements which are present in the Chinese Wikipedia.
>> >> >> > Just wondering if the Chinese dump is not utilized.
>> >> >> >
>> >> >> > -harish
>> >> >>
>> >> >>
>> >> >> --
>> >> >> | Rupert Westenthaler    [email protected]
>> >> >> | Bodenlehenstraße 11    ++43-699-11108907
>> >> >> | A-5500 Bischofshofen
>> >>
>> >>
>> >> --
>> >> | Rupert Westenthaler    [email protected]
>> >> | Bodenlehenstraße 11    ++43-699-11108907
>> >> | A-5500 Bischofshofen
>>
>>
>> --
>> | Rupert Westenthaler    [email protected]
>> | Bodenlehenstraße 11    ++43-699-11108907
>> | A-5500 Bischofshofen
>
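P.S. Before re-running the indexer I want to find and drop the line with the
truncated IRI that RIOT complains about. Below is a minimal Python sketch of
the idea; the function name and the sample triples are my own invention, the
unbalanced-angle-bracket check is only a heuristic (a literal containing '<'
or '>' would be a false positive), and I have not run it against the real
labels_zh.nt.bz2.

```python
import bz2

def find_broken_lines(lines):
    """Return (line_number, line) pairs whose IRI tokens look truncated.

    A well-formed N-Triples line wraps every IRI in <...>, so a line cut
    off in the middle of an IRI (e.g. "<http://www.w3.org/2000/01/rdf-sche")
    leaves '<' and '>' unbalanced.  This is a heuristic, not a full parser.
    """
    broken = []
    for n, line in enumerate(lines, start=1):
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        if line.count('<') != line.count('>'):
            broken.append((n, line))
    return broken

# Tiny demo with made-up zh-DBpedia-style triples:
sample = [
    '<http://zh.dbpedia.org/resource/A> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "A"@zh .',
    '<http://zh.dbpedia.org/resource/B> <http://www.w3.org/2000/01/rdf-sche',
]
print(find_broken_lines(sample))  # reports the truncated second line

# Against the real dump (path taken from the error message):
# with bz2.open('resources/rdfdata/labels_zh.nt.bz2', 'rt',
#               encoding='utf-8', errors='replace') as f:
#     for n, line in find_broken_lines(f):
#         print(n, line)
```

Once the offending line numbers are known, those lines can be filtered out and
the file re-compressed before running the indexing tool again.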
