I read the readme.md in entityhub/indexing/dbpedia and started indexing the
Chinese DBpedia 3.8. The Chinese DBpedia 3.8 dump does not seem to include two
of the files the readme calls for: instance_types and person_data. I went
ahead and tried to run the index generation anyway, and I get a Java
exception. Some other observations:

1. The curl step to generate incoming_links.txt took more than 3 hours and
produced 2.5 GB of data.
2. DBpedia 3.8 seems to use "Category:" and not "Cat:", so the step that
substitutes "Cat:" with "Category:" is no longer required.
3. After the Java exception, the index generation program neither does
anything nor terminates. I waited overnight and then killed it.

Previously I did not understand your question
Are this the data for the Entities with the URIs
"http://zh.dboedua.org/resource/{name}"?

Now I understand. There is no Chinese DBpedia server running.
http://wiki.dbpedia.org/Internationalization has the list of language chapters
currently supported by DBpedia; Chinese is yet to come. My intention is to
build a Stanbol Solr index of the Chinese DBpedia dump so that I can 'spot'
keywords in DBpedia better than with the English dump.

-harish

=======================================================

08:23:37,914 [Thread-5] ERROR source.ResourceLoader - Unable to load resource /Users/harishs/Linguistics2/dbpedia/indexing/resources/rdfdata/labels_zh.nt.bz2
org.openjena.riot.RiotException: [line: 6972, col: 46] Broken token: http://www.w3.org/2000/01/rdf-sche
    at org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
    at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
    at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
    at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:38)
    at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
    at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
    at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
    at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
    at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
    at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
    at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
    at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:72)
    at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
    at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
    at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:245)
    at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
    at java.lang.Thread.run(Thread.java:680)

08:23:37,917 [Thread-5] ERROR source.ResourceLoader - Exception while loading file /Users/harishs/Linguistics2/dbpedia/indexing/resources/rdfdata/labels_zh.nt.bz2
org.openjena.riot.RiotException: [line: 6972, col: 46] Broken token: http://www.w3.org/2000/01/rdf-sche
    (same stack trace as above)

Exception in thread "Thread-5" java.lang.IllegalStateException: Error while loading Resource /Users/harishs/Linguistics2/dbpedia/indexing/resources/rdfdata/labels_zh.nt.bz2
    at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.setResourceState(ResourceLoader.java:273)
    at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:215)
    at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
    at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:245)
    at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
    at java.lang.Thread.run(Thread.java:680)
Caused by: org.openjena.riot.RiotException: [line: 6972, col: 46] Broken token: http://www.w3.org/2000/01/rdf-sche
    (same stack trace as the first error, down to ResourceLoader.loadResource(ResourceLoader.java:201))
    ... 4 more

=========================================================

On Wed, Aug 15, 2012 at 6:01 PM, harish suvarna <[email protected]> wrote:
> Thanks Rupert. I am making some progress here. I am finding that paoding
> breaks words into small segments, especially foreign names. For example,
> "motorola" is broken into two parts (mot, rola), and similarly "michael"
> is broken into (mik, kael). The ngram-based DBpedia lookup then searches
> for these in the DBpedia index and cannot find them.
> My segmentation process and the DBpedia Solr index must both use the same
> segmenter. There is a paoding analyzer for Solr too; I just need to create
> the Solr index for DBpedia using it.
> Actually, right now I get more DBpedia hits for Chinese text with the
> character-ngram-based DBpedia lookup than with paoding.
> We don't know which language analyzers ogrisel used when creating the
> 1.19 GB Solr DBpedia dump.
>
> I also experimented with contenthub search for Chinese. Right now it does
> not work; I need to debug that part as well. Even the UI in the contenthub
> does not display the Chinese characters, while the enhancer UI does
> display them well.
>
> Also, for English Stanbol I did play with contenthub. I took a small text
> as follows:
> ==============
> United States produced an Olympic-record time to win gold in the women's
> 200m freestyle relay final. A brilliant final leg from Allison Schmitt led
> the Americans home, ahead of Australia, in a time of seven minutes 42.92
> seconds. Missy Franklin gave them a great start, while Dana Vollmer and
> Shannon Vreeland also produced fast times.
> =====================================================================
>
> The above text is properly processed and I get the DBpedia links for all
> persons and countries in the above.
> However, the above piece is related to 'swimming', and this word does not
> appear at all in the text. In the DBpedia link for Allison Schmitt, the
> DBpedia categories do tell us that she is in a swimming category. Has
> anyone tried to process the categories behind the link and add them as
> metadata for the content? If we add this, we add more value than a simple
> Solr-based search in the content store. Someone at the IKS conference
> demoed this as a semantic search. Any hints/clues on this work?
>
>
> On Wed, Aug 15, 2012 at 1:25 PM, Rupert Westenthaler <
> [email protected]> wrote:
>
>> On Wed, Aug 15, 2012 at 3:06 AM, harish suvarna <[email protected]>
>> wrote:
>> > Is {stanbol-trunk}/entityhub/indexing/dbpedia different from the
>> > custom ontology file tool that is mentioned in
>> > http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html ?
>>
>> The custom DBpedia indexing tool comes with a different default
>> configuration and also with a customised Solr schema (schema.xml file)
>> for DBpedia. Otherwise it is the same software as the generic RDF
>> indexing tool. Most of the things mentioned in "customvocabulary.html"
>> are also valid for the DBpedia indexing tool. Please also note the
>> readme and the comments in the configuration of the DBpedia indexing
>> tool.
>>
>> > Is it the same as the entityhub page in Stanbol at localhost:8080?
>>
>> This tool was used to create all available DBpedia indexes for Apache
>> Stanbol, including the DBpedia default data (shipped with the
>> launcher).
>>
>> best
>> Rupert
>>
>> >
>> > -harish
>> >
>> >
>> > On Thu, Aug 9, 2012 at 10:58 PM, Rupert Westenthaler <
>> > [email protected]> wrote:
>> >
>> >> Hi
>> >>
>> >> On Fri, Aug 10, 2012 at 1:28 AM, harish suvarna <[email protected]>
>> >> wrote:
>> >> > Thanks Rupert for the update.
>> >> > Meanwhile I am looking at the page on generating a custom
>> >> > vocabulary index,
>> >> > http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html,
>> >> > and trying to work out which of the files in the DBpedia Chinese
>> >> > download at http://downloads.dbpedia.org/3.8/zh/ I have to use.
>> >>
>> >> Are this the data for the Entities with the URIs
>> >> "http://zh.dboedua.org/resource/{name}"?
>> >>
>> >> Anyway, cool that DBpedia 3.8 finally got released!
>> >>
>> >> >
>> >> > The DBpedia download for Chinese has article categories, labels,
>> >> > short/long abstracts, and inter-language links. I don't know which
>> >> > ones to use for the Stanbol entityhub custom vocabulary index tool.
>> >>
>> >> For linking concepts you need only the labels. If you also include
>> >> the short abstracts you will also have the mouse-over text in the
>> >> Stanbol Enhancer UI. Geo coordinates are needed for the map in the
>> >> Enhancer UI.
>> >>
>> >> You should also include the data providing the rdf:types of the
>> >> Entities. However, I do not know which of the files includes those.
>> >>
>> >> Categories are currently not used by Stanbol. If you want to include
>> >> them you should add (1) the categories, (2) the category labels and
>> >> (3) the article categories.
>> >>
>> >> Note that there is a dedicated Entityhub Indexing Tool for DBpedia at
>> >> {stanbol-trunk}/entityhub/indexing/dbpedia.
>> >>
>> >> best
>> >> Rupert
>> >>
>> >> >
>> >> > -harish
>> >> >
>> >> >
>> >> > On Thu, Aug 9, 2012 at 11:08 AM, Rupert Westenthaler <
>> >> > [email protected]> wrote:
>> >> >
>> >> >> Hi
>> >> >>
>> >> >> the dbpedia 3.7 index was built by ogrisel, so I do not know the
>> >> >> details.
>> >> >>
>> >> >> I think Chinese (zh) labels are included, but the index only
>> >> >> contains Entities for Wikipedia pages with 5 or more incoming
>> >> >> links.
>> >> >>
>> >> >> In addition, while the English DBpedia contains zh labels, it will
>> >> >> not contain Entities that do not have a counterpart in the English
>> >> >> Wikipedia.
>> >> >>
>> >> >> best
>> >> >> Rupert
>> >> >>
>> >> >> On Thu, Aug 9, 2012 at 1:00 AM, harish suvarna <[email protected]>
>> >> >> wrote:
>> >> >> > I received a USB stick at the IKS conference which contained the
>> >> >> > 1.19 GB full DBpedia Solr index. Does it contain the data from
>> >> >> > the Chinese dump (available on the dbpedia.org download server
>> >> >> > under the zh folder)?
>> >> >> >
>> >> >> > I do get some DBpedia entries for Chinese text in Stanbol
>> >> >> > enhancements. I am using the 1.19 GB dump. I am expecting some
>> >> >> > more enhancements which are present in the Chinese Wikipedia.
>> >> >> > Just wondering if the Chinese dump is not utilized.
>> >> >> >
>> >> >> > -harish
>> >> >>
>> >> >>
>> >> >> --
>> >> >> | Rupert Westenthaler    [email protected]
>> >> >> | Bodenlehenstraße 11    ++43-699-11108907
>> >> >> | A-5500 Bischofshofen
>> >>
>> >>
>> >> --
>> >> | Rupert Westenthaler    [email protected]
>> >> | Bodenlehenstraße 11    ++43-699-11108907
>> >> | A-5500 Bischofshofen
>>
>>
>> --
>> | Rupert Westenthaler    [email protected]
>> | Bodenlehenstraße 11    ++43-699-11108907
>> | A-5500 Bischofshofen
>
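P.S. Before re-running the indexer I want to find and drop the line with the
truncated IRI that RIOT complains about. Below is a minimal Python sketch of
the idea; the function name and the sample triples are my own invention, the
unbalanced-angle-bracket check is only a heuristic (a literal containing '<'
or '>' would be a false positive), and I have not run it against the real
labels_zh.nt.bz2.

```python
import bz2

def find_broken_lines(lines):
    """Return (line_number, line) pairs whose IRI tokens look truncated.

    A well-formed N-Triples line wraps every IRI in <...>, so a line cut
    off in the middle of an IRI (e.g. "<http://www.w3.org/2000/01/rdf-sche")
    leaves '<' and '>' unbalanced.  This is a heuristic, not a full parser.
    """
    broken = []
    for n, line in enumerate(lines, start=1):
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        if line.count('<') != line.count('>'):
            broken.append((n, line))
    return broken

# Tiny demo with made-up zh-DBpedia-style triples:
sample = [
    '<http://zh.dbpedia.org/resource/A> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "A"@zh .',
    '<http://zh.dbpedia.org/resource/B> <http://www.w3.org/2000/01/rdf-sche',
]
print(find_broken_lines(sample))  # reports the truncated second line

# Against the real dump (path taken from the error message):
# with bz2.open('resources/rdfdata/labels_zh.nt.bz2', 'rt',
#               encoding='utf-8', errors='replace') as f:
#     for n, line in find_broken_lines(f):
#         print(n, line)
```

Once the offending line numbers are known, those lines can be filtered out and
the file re-compressed before running the indexing tool again.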
