Rupert,

Thank you. More progress now: I see some indexes being added, and the resulting Chinese dbpedia.solr.zip is 17 MB. However, when I drop it into Stanbol, I do not get any results back. I am using the English persondata and instance_types files, which still use the dbpedia.org namespace. Perhaps I should convert them to zh.dbpedia.org. Do you think that is needed?
Also, the person data has strings like "Aristotle"@en . Do I need "Aristotle"@zh as well? Some insight into what exactly Oliver did would help.

Some statistics on generating the index on my 8 GB MacBook Pro: the Chinese index is taking 3 hrs, and on the Mac Pro desktop it is taking <15 mins. I am starting the English index to make sure that it works.

-harish

================ instance types
<http://dbpedia.org/resource/Autism> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Disease> .
<http://dbpedia.org/resource/Autism> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/Animal_Farm> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Book> .
<http://dbpedia.org/resource/Animal_Farm> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Book> .

============ Person data ====================
<http://dbpedia.org/resource/Aristotle> <http://xmlns.com/foaf/0.1/name> "Aristotle"@en .
<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://dbpedia.org/resource/Aristotle> <http://purl.org/dc/elements/1.1/description> "Greek philosopher"@en .

On Mon, Aug 27, 2012 at 11:59 PM, Rupert Westenthaler <[email protected]> wrote:
> Hi,
>
> oh sorry, I completely forgot to answer your question about the problem
> with the indexing configuration. But it looks like you were on the right
> track anyway, as the problem is indeed with the format of
> "incoming_links.txt", which is caused by the different namespace of the
> Chinese dump.
>
> Here are the details
>
> The expected format of the "incoming_links.txt" (based on the
> configuration in "iditerator.properties") is
>
>     {score} {local-name}
>
> Note also the 'id-namespace' property that is set to
> "http://dbpedia.org/resource/" in the "iditerator.properties" file.
>
> This configuration corresponds to the 'sed' command in your shell script
>
> > curl http://downloads.dbpedia.org/3.6/zh/page_links_zh.nt.bz2 \
> > | bzcat \
> > | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
> > | sort \
> > | uniq -c \
> > | sort -nr > incoming_links.txt
>
> Because the Chinese dump uses a different namespace, the regex (of
> the -e parameter) does not match, and because of that the URIs of the
> Entities are not correctly extracted from the "page_links_zh.nt.bz2"
> file. As a result, the output of the script is not the expected
> one.
>
> To fix this you need to make the following two changes:
>
> 1) change the sed command so that it uses the correct namespace
>
>     sed -e 's/.*<http\:\/\/zh\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
>
> 2) change the value of 'id-namespace' in
> {indexing-working-dir}/indexing/config/iditerator.properties to the
> namespace used by the Chinese dump, "http://zh.dbpedia.org/resource/"
>
>
> NOTES:
>
> * I noticed that the curl part of the included shell script still
> refers to version 3.6. You probably want to download the data
> from "http://downloads.dbpedia.org/3.8/zh/page_links_zh.nt.bz2"
> instead.
>
> * for testing it is nice to add a '| head -n 1000 \' between '| bzcat
> \' and the 'sed' command. This causes only the first 'n' lines of the
> dump to be processed. This will execute in <1sec and allows you to
> review the results of the command. You can even use the resulting
> "incoming_links.txt" file for indexing! While this will only index a
> small fraction of the entities, it might still be useful for testing.
>
> I made some tests and the following script looked fine to me (NOTE: it
> contains the '| head -n 1000 \' - you might want to remove this line
> after checking the results)
>
> curl http://downloads.dbpedia.org/3.8/zh/page_links_zh.nt.bz2 \
> | bzcat \
> | head -n 1000 \
> | sed -e 's/.*<http\:\/\/zh\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
> | sort \
> | uniq -c \
> | sort -nr > incoming_links.txt
>
>
> again, sorry for the late response
>
> best
> Rupert
>
> On Mon, Aug 27, 2012 at 5:09 PM, harish suvarna <[email protected]> wrote:
> > Rupert, any clues on this problem?
> >
> > The resources below have http://zh.dbpedia.org. That does not exist. Does
> > it cause any problems? I did
> >
> > curl http://downloads.dbpedia.org/3.6/zh/page_links_zh.nt.bz2 \
> > | bzcat \
> > | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
> > | sort \
> > | uniq -c \
> > | sort -nr > incoming_links.txt
> >
> > to generate the Chinese incoming_links.txt.
> >
> > -harish
> >
> > On Thu, Aug 23, 2012 at 2:15 PM, harish suvarna <[email protected]> wrote:
> >
> >> OK. Great. It may be easy to fix then. Here are a few lines.
> >>
> >> 1192 <http://zh.dbpedia.org/resource/\u7121\u7DAB\u96FB\u8996\u5916\u8CFC\u7F8E\u570B\u96FB\u5F71\u5217\u8868> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u660E\u73E0\u53F0> .
> >> 876 <http://zh.dbpedia.org/resource/NGC\u5929\u4F53\u5217\u8868_(1000-1999)> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u661F\u7CFB> .
> >> 781 <http://zh.dbpedia.org/resource/\u7121\u7DAB\u96FB\u8996\u7BC0\u76EE\u5217\u8868> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u7FE1\u7FE0\u53F0> .
> >> 611 <http://zh.dbpedia.org/resource/\u7121\u7DAB\u96FB\u8996\u5916\u8CFC\u52D5\u756B\u5217\u8868> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u7FE1\u7FE0\u53F0> .
> >> 573 <http://zh.dbpedia.org/resource/NGC\u5929\u4F53\u5217\u8868_(1-999)> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u661F\u7CFB> .
> >> 519 <http://zh.dbpedia.org/resource/\u540D\u5075\u63A2\u67EF\u5357\u52D5\u756B\u96C6\u6578\u5217\u8868> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u540D\u5075\u63A2\u67EF\u5357\u6F2B\u756B\u5217\u8868> .
> >> 384 <http://zh.dbpedia.org/resource/2006\u5E74\u9999\u6E2F\u9078\u8209\u59D4\u54E1\u6703\u754C\u5225\u5206\u7D44\u9078\u8209> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/File:Black_check.svg> .
> >> 366 <http://zh.dbpedia.org/resource/\u5A1B\u6A02\u767E\u5206\u767E\u7BC0\u76EE\u5217\u8868_(2007\u5E74)> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u5C0F\u9B3C> .
> >> 365 <http://zh.dbpedia.org/resource/\u7C21\u7E41\u8F49\u63DB\u4E00\u5C0D\u591A\u5217\u8868> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/File:Cmbox_move.png> .
> >> 355 <http://zh.dbpedia.org/resource/\u5A1B\u6A02\u767E\u5206\u767E\u7BC0\u76EE\u5217\u8868_(2007\u5E74)> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u5C0F\u8C6C> .
> >> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u90B5\u9633\u4EBA> .
> >> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u8523\u59D3> .
> >> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u806F\u5408\u570B\u5B89\u5168\u7406\u4E8B\u6703\u4E3B\u5E2D> .
> >> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u570B\u7ACB\u6E05\u83EF\u5927\u5B78\u6559\u6388> .
> >> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u54E5\u502B\u6BD4\u4E9E\u5927\u5B78\u6821\u53CB> .
> >> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u53F0\u7063\u5916\u7701\u4EBA> .
> >> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u5357\u958B\u5927\u5B78\u6559\u6388> .
> >> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u4E2D\u83EF\u6C11\u570B\u99D0\u8607\u806F\u5927\u4F7F> .
> >> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u4E2D\u83EF\u6C11\u570B\u99D0\u7F8E\u570B\u5927\u4F7F> .
> >>
> >>
> >>
> >> On Thu, Aug 23, 2012 at 1:37 PM, Rupert Westenthaler <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> one more thing.
> >>> Can you please post the first few lines of
> >>>
> >>> {indexing-source}/indexing/resource/incoming_links.txt
> >>>
> >>> so that I can check the data against the configuration in the
> >>> iditerator.properties file
> >>>
> >>> best
> >>> Rupert
> >>>
> >>> On Thu, Aug 23, 2012 at 10:31 PM, Rupert Westenthaler
> >>> <[email protected]> wrote:
> >>> > Hi
> >>> >
> >>> > The log clearly shows that you only import the triples from the dumps
> >>> > into the Jena TDB triple store used as the Source for the indexing.
> >>> >
> >>> > See all the lines such as
> >>> >
> >>> > 8:14:08,196 [Thread-5] INFO tdb.loader - Add: 50,000 triples (Batch: 3,256 / Avg: 3,256)
> >>> > 08:14:12,802 [Thread-5] INFO tdb.loader - Add: 100,000 triples (Batch: 10,855 / Avg: 5,009)
> >>> >
> >>> > BTW: this only needs to be done once. After this initialization step
> >>> > completes you can remove the RDF files from
> >>> > "{indexing-root}/indexing/resources/rdfdata/" (I usually just rename
> >>> > the rdfdata folder to imported-rdfdata).
> >>> >
> >>> > The ~1.5hrs are just the time needed to import the data from the RDF
> >>> > dumps into the Jena TDB store.
> >>> >
> >>> > With
> >>> >
> >>> > 08:18:04,242 [main] INFO impl.IndexerImpl - Indexing started ...
> >>> >
> >>> > the indexing starts and
> >>> >
> >>> > 08:21:03,176 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Indexed 0 items in 1410320sec (Infinityms/item): processing: -1.000ms/item | queue: -1.000ms
> >>> >
> >>> > states clearly that not a single Entity was indexed.
> >>> >
> >>> > I guess this has to do with the configuration. I will have a look at
> >>> > it tomorrow morning.
> >>> >
> >>> > best
> >>> > Rupert
> >>> >
> >>> > On Thu, Aug 23, 2012 at 9:53 PM, harish suvarna <[email protected]> wrote:
> >>> >> I am attaching the zip of the config folder. The indexing takes quite
> >>> >> some time (~1.5hrs). The number of triples it generates is high.
> >>> >> I am also attaching the English indexing output. I used 10 files
> >>> >> (all except long_abstracts_en.nt, which is 2.5 GB and which I could
> >>> >> not save as UTF-8 on my Mac). For Chinese I had all the files.
> >>> >> -harish
> >>> >>
> >>> >>
> >>> >> On Thu, Aug 23, 2012 at 12:27 PM, Rupert Westenthaler
> >>> >> <[email protected]> wrote:
> >>> >>>
> >>> >>> I would expect the dbpedia.solrindex.zip file to be several hundred
> >>> >>> MByte in size (if not gigabytes).
> >>> >>>
> >>> >>> The only explanation for this file being so small is that something
> >>> >>> is going wrong during indexing.
> >>> >>>
> >>> >>> Can you maybe provide the {indexing-root}/indexing/config folder so
> >>> >>> that I can have a look at your configuration
> >>> >>>
> >>> >>> best
> >>> >>> Rupert
> >>> >>>
> >>> >>> On Thu, Aug 23, 2012 at 5:49 PM, harish suvarna <[email protected]>
> >>> >>> wrote:
> >>> >>> >
> >>> >>> > Rupert,
> >>> >>> > I generated the index for the dbpedia 3.8 English files only.
> >>> >>> > One thing that intrigues me is that the dbpedia.solrindex.zip file
> >>> >>> > size is 53 KB, the same as when I generated it for Chinese. The
> >>> >>> > English files are much bigger.
> >>> >>> > In the English zip, too, I can't find Paris.
> >>> >>> > I am attaching the English dbpedia.solrindex.zip for any clues.
> >>> >>> > Do I need to load the bundle jar file created by the dbpedia
> >>> >>> > indexing?
> >>> >>> >
> >>> >>> > -harish
> >>> >>>
> >>> >>> --
> >>> >>> | Rupert Westenthaler [email protected]
> >>> >>> | Bodenlehenstraße 11 ++43-699-11108907
> >>> >>> | A-5500 Bischofshofen
> >>> >>
> >>> >> --
> >>> >> Thanks
> >>> >> Harish
>

--
Thanks
Harish
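
To check the namespace fix without downloading the dump, the effect of the two 'sed' expressions discussed in this thread can be reproduced on a few hand-made triples. This is only a sketch under stated assumptions: the helper name count_links and the subjects A, B and C are made up for illustration, while the object URIs and the pipeline stages (sed, sort, uniq -c, sort -nr) are taken from the messages in this thread. It uses '|' as the sed delimiter instead of '/' to avoid the escaped slashes, and anchors the trailing " ." at the end of the line.

```sh
#!/bin/sh
# Three sample triples shaped like page_links_zh.nt.
# Subjects A, B, C are hypothetical; the object URIs appear in the thread.
sample='<http://zh.dbpedia.org/resource/A> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u661F\u7CFB> .
<http://zh.dbpedia.org/resource/B> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u661F\u7CFB> .
<http://zh.dbpedia.org/resource/C> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u7FE1\u7FE0\u53F0> .'

# count_links NAMESPACE_REGEX (hypothetical helper, not part of Stanbol):
# reads N-Triples on stdin, strips everything except the local name of the
# object URI, then counts how often each local name occurs.
count_links() {
    sed -e "s|.*<$1\([^>]*\)> \.\$|\1|" \
        | sort \
        | uniq -c \
        | sort -nr
}

echo "== expression for http://dbpedia.org/resource/ =="
printf '%s\n' "$sample" | count_links 'http://dbpedia\.org/resource/'

echo "== expression for http://zh.dbpedia.org/resource/ =="
printf '%s\n' "$sample" | count_links 'http://zh\.dbpedia\.org/resource/'
```

With the dbpedia.org expression nothing matches, so the three triples should come through unchanged, each with a count of 1; that is the same shape as the incoming_links.txt lines posted earlier in the thread. With the zh.dbpedia.org expression the output should contain local names only, with \u661F\u7CFB counted twice, which is the "{score} {local-name}" format that iditerator.properties is described as expecting.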
