Rupert,
Thank you.
More progress now. I see some indexes being added, and the resulting Chinese
dbpedia.solr.zip is 17MB. But when I drop this into Stanbol, I do not get any
results back.
I am using the English persondata and instance_types. These files still have
the dbpedia.org namespace. Perhaps I should convert them to zh.dbpedia.org.
Do you think it is needed?
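
If the conversion is needed, a minimal sketch of what I have in mind is below.
It assumes the triples sit one per line in the .nt files and that only the
resource URIs should change, while the ontology URIs stay as they are. The
function name rewrite_namespace is just my own. I am also not sure the English
local names would match the Chinese resource names at all, so this alone may
not be enough.

```shell
# Rewrite only resource URIs to the zh namespace.
# Ontology URIs (http://dbpedia.org/ontology/...) are left untouched.
rewrite_namespace() {
  sed -e 's|<http://dbpedia\.org/resource/|<http://zh.dbpedia.org/resource/|g'
}

# Example on one instance_types triple, joined onto a single line:
printf '%s\n' '<http://dbpedia.org/resource/Autism> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Disease> .' \
  | rewrite_namespace
```

On a real file this would be used as a filter, reading the .nt file on stdin
and writing the converted copy to stdout.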

Also, the person data has strings like "Aristotle"@en . Do I need
"Aristotle"@zh instead?

Some insight into what exactly Oliver did would help. Some statistics on
generating the index: on my 8GB MacBook Pro the Chinese index is taking 3hrs,
and on the Mac Pro desktop it is taking <15mins.

I am starting the English index to make sure that it works.

-harish

================instance types
<http://dbpedia.org/resource/Autism> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Disease> .
<http://dbpedia.org/resource/Autism> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/Animal_Farm> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Book> .
<http://dbpedia.org/resource/Animal_Farm> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Book> .

============Person data====================
<http://dbpedia.org/resource/Aristotle> <http://xmlns.com/foaf/0.1/name> "Aristotle"@en .
<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://dbpedia.org/resource/Aristotle> <http://purl.org/dc/elements/1.1/description> "Greek philosopher"@en .





On Mon, Aug 27, 2012 at 11:59 PM, Rupert Westenthaler <
[email protected]> wrote:

> Hi,
>
> Oh sorry, I completely forgot to answer your question about the problem
> with the indexing configuration. But it looks like you were on the right
> track anyway, as the problem is indeed with the format of the
> "incoming_links.txt", which is caused by the different namespace of the
> Chinese dump.
>
> Here are the details
>
> The expected format of the "incoming_links.txt" (based on the
> configuration in "iditerator.properties") is
>
>     {score} {local-name}
>
> Note also the 'id-namespace' property that is set to
> "http://dbpedia.org/resource/" in the "iditerator.properties" file.
>
> This configuration corresponds to the 'sed' command in your shell script
>
> > curl http://downloads.dbpedia.org/3.6/zh/page_links_zh.nt.bz2 \
> >         | bzcat \
> >         | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
> >         | sort \
> >         | uniq -c  \
> >         | sort -nr > incoming_links.txt
>
> Because the Chinese dump uses a different namespace, the regex (of
> the -e parameter) does not match, so the URIs of the
> Entities are not correctly extracted from the "page_links_zh.nt.bz2"
> file. As a result, the script does not produce the expected
> output.
>
> To fix this you need to make the following two changes:
>
> 1) change the sed command so that it uses the correct namespace
>
>     sed -e 's/.*<http\:\/\/zh\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
>
> 2) change the value for the 'id-namespace' in
> {indexing-working-dir}/indexing/config/iditerator.properties to the
> namespace used by the Chinese dump "http://zh.dbpedia.org/resource/"
>
>
> NOTES:
>
> * I noticed that the curl part of the included shell script still
> refers to version 3.6. You probably want to download the data
> from "http://downloads.dbpedia.org/3.8/zh/page_links_zh.nt.bz2"
> instead.
>
> * For testing it is nice to add a '| head -n 1000 \' between ' | bzcat
> \' and the 'sed' command. This causes only the first 'n' lines of the
> dump to be processed. This will execute in <1sec and allows you to
> review the results of the command. You can even use the resulting
> "incoming_links.txt" file for indexing! While this will only index a
> small fraction of the entities, it might still be useful for testing.
>
> I ran some tests and the following script looked fine to me (NOTE: it
> contains the '| head -n 1000 \' - you might want to remove this line
> after checking the results)
>
> curl http://downloads.dbpedia.org/3.8/zh/page_links_zh.nt.bz2 \
>          | bzcat \
>          | head -n 1000 \
>          | sed -e 's/.*<http\:\/\/zh\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
>          | sort \
>          | uniq -c  \
>          | sort -nr > incoming_links.txt
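
To check that the regenerated file really has the {score} {local-name} form
before indexing again, something like the sketch below might help. It is only
my own check, not part of the indexing tool: the function name count_bad_lines
and the sample lines are made up for illustration. It counts lines that do not
consist of exactly a number followed by one name, which is what an unmatched
full triple (like the lines I posted earlier) would look like.

```shell
# Count lines that are NOT of the form "{score} {local-name}".
# uniq -c pads the count with leading blanks; awk's default field
# splitting ignores them, so a good line has exactly two fields.
count_bad_lines() {
  awk 'NF != 2 || $1 !~ /^[0-9]+$/ { bad++ } END { print bad + 0 }' "$1"
}

# Hypothetical sample: one good line and one unmatched triple.
sample=$(mktemp)
printf '%s\n' \
  '   7 \u661F\u7CFB' \
  '   7 <http://zh.dbpedia.org/resource/A> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/B> .' \
  > "$sample"
count_bad_lines "$sample"
rm -f "$sample"
```

A result of 0 on the real incoming_links.txt would suggest the sed expression
matched every line.
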
>
>
> Again, sorry for the late response
>
> best
> Rupert
>
> On Mon, Aug 27, 2012 at 5:09 PM, harish suvarna <[email protected]>
> wrote:
> > Rupert, any clues on this problem?
> >
> > The resources below have http://zh.dbpedia.org. That does not exist. Does
> > it cause any problems? I did
> >
> > curl http://downloads.dbpedia.org/3.6/zh/page_links_zh.nt.bz2 \
> >         | bzcat \
> >         | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
> >         | sort \
> >         | uniq -c  \
> >         | sort -nr > incoming_links.txt
> >
> > to generate the Chinese incoming_links.txt.
> >
> > -harish
> >
> > On Thu, Aug 23, 2012 at 2:15 PM, harish suvarna <[email protected]>
> wrote:
> >
> >> OK. Great. It may be easy to fix then. Here are a few lines.
> >>
> >> 1192 <http://zh.dbpedia.org/resource/\u7121\u7DAB\u96FB\u8996\u5916\u8CFC\u7F8E\u570B\u96FB\u5F71\u5217\u8868> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u660E\u73E0\u53F0> .
> >>  876 <http://zh.dbpedia.org/resource/NGC\u5929\u4F53\u5217\u8868_(1000-1999)> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u661F\u7CFB> .
> >>  781 <http://zh.dbpedia.org/resource/\u7121\u7DAB\u96FB\u8996\u7BC0\u76EE\u5217\u8868> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u7FE1\u7FE0\u53F0> .
> >>  611 <http://zh.dbpedia.org/resource/\u7121\u7DAB\u96FB\u8996\u5916\u8CFC\u52D5\u756B\u5217\u8868> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u7FE1\u7FE0\u53F0> .
> >>  573 <http://zh.dbpedia.org/resource/NGC\u5929\u4F53\u5217\u8868_(1-999)> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u661F\u7CFB> .
> >>  519 <http://zh.dbpedia.org/resource/\u540D\u5075\u63A2\u67EF\u5357\u52D5\u756B\u96C6\u6578\u5217\u8868> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u540D\u5075\u63A2\u67EF\u5357\u6F2B\u756B\u5217\u8868> .
> >>  384 <http://zh.dbpedia.org/resource/2006\u5E74\u9999\u6E2F\u9078\u8209\u59D4\u54E1\u6703\u754C\u5225\u5206\u7D44\u9078\u8209> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/File:Black_check.svg> .
> >>  366 <http://zh.dbpedia.org/resource/\u5A1B\u6A02\u767E\u5206\u767E\u7BC0\u76EE\u5217\u8868_(2007\u5E74)> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u5C0F\u9B3C> .
> >>  365 <http://zh.dbpedia.org/resource/\u7C21\u7E41\u8F49\u63DB\u4E00\u5C0D\u591A\u5217\u8868> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/File:Cmbox_move.png> .
> >>  355 <http://zh.dbpedia.org/resource/\u5A1B\u6A02\u767E\u5206\u767E\u7BC0\u76EE\u5217\u8868_(2007\u5E74)> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/\u5C0F\u8C6C> .
> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u90B5\u9633\u4EBA> .
> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u8523\u59D3> .
> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u806F\u5408\u570B\u5B89\u5168\u7406\u4E8B\u6703\u4E3B\u5E2D> .
> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u570B\u7ACB\u6E05\u83EF\u5927\u5B78\u6559\u6388> .
> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u54E5\u502B\u6BD4\u4E9E\u5927\u5B78\u6821\u53CB> .
> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u53F0\u7063\u5916\u7701\u4EBA> .
> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u5357\u958B\u5927\u5B78\u6559\u6388> .
> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u4E2D\u83EF\u6C11\u570B\u99D0\u8607\u806F\u5927\u4F7F> .
> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/Category:\u4E2D\u83EF\u6C11\u570B\u99D0\u7F8E\u570B\u5927\u4F7F> .
> >>
> >>
> >>
> >> On Thu, Aug 23, 2012 at 1:37 PM, Rupert Westenthaler <
> >> [email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> one more thing. Can you please post me the first few lines of
> >>>
> >>>  {indexing-source}/indexing/resource/incoming_links.txt
> >>>
> >>> so that I can check the data against the configuration of the
> >>> iditerator.properties file
> >>>
> >>> best
> >>> Rupert
> >>>
> >>> On Thu, Aug 23, 2012 at 10:31 PM, Rupert Westenthaler
> >>> <[email protected]> wrote:
> >>> > Hi
> >>> >
> >>> > The log shows clearly that you only import the triples from the dumps
> >>> > to the Jena TDB triple store used as Source for the indexing.
> >>> >
> >>> > See all the lines such as
> >>> >
> >>> >     8:14:08,196 [Thread-5] INFO  tdb.loader - Add: 50,000 triples (Batch: 3,256 / Avg: 3,256)
> >>> >     08:14:12,802 [Thread-5] INFO  tdb.loader - Add: 100,000 triples (Batch: 10,855 / Avg: 5,009)
> >>> >
> >>> > BTW: this needs only to be done once. After this initialization step
> >>> > completes you can remove the RDF files from
> >>> > "{indexing-root}/indexing/resources/rdfdata/" (I usually just rename
> >>> > the rdfdata folder to imported-rdfdata).
> >>> >
> >>> > The ~1.5hrs are just the time needed to import the data from the RDF
> >>> > dumps to the Jena TDB store.
> >>> >
> >>> > With
> >>> >
> >>> >     08:18:04,242 [main] INFO  impl.IndexerImpl - Indexing started ...
> >>> >
> >>> > the indexing starts and
> >>> >
> >>> >     08:21:03,176 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Indexed 0 items in 1410320sec (Infinityms/item): processing:  -1.000ms/item | queue:  -1.000ms
> >>> >
> >>> > states clearly that no single Entity was indexed.
> >>> >
> >>> > I guess this has to do with the configuration. I will have a look at
> >>> > it tomorrow morning.
> >>> >
> >>> > best
> >>> > Rupert
> >>> >
> >>> > On Thu, Aug 23, 2012 at 9:53 PM, harish suvarna <[email protected]>
> >>> wrote:
> >>> >> I am attaching the zip of the config folder. The indexing takes quite some
> >>> >> time (~1.5hrs). The number of triples it generates is high.
> >>> >> I am attaching the English indexing output also. I used 10 files (except
> >>> >> long_abstracts_en.nt; it is 2.5 GB and I could not save it in UTF-8 on my
> >>> >> Mac). But for Chinese I had all files.
> >>> >> -harish
> >>> >>
> >>> >>
> >>> >> On Thu, Aug 23, 2012 at 12:27 PM, Rupert Westenthaler
> >>> >> <[email protected]> wrote:
> >>> >>>
> >>> >>> I would expect the dbpedia.solrindex.zip file to be several hundred
> >>> >>> MBytes in size (if not gigabytes).
> >>> >>>
> >>> >>> The only explanation for this file to be so small is that something is
> >>> >>> going wrong during indexing.
> >>> >>>
> >>> >>> Can you maybe provide the {indexing-root}/indexing/config folder so
> >>> >>> that I can have a look at your configuration
> >>> >>>
> >>> >>> best
> >>> >>> Rupert
> >>> >>>
> >>> >>> On Thu, Aug 23, 2012 at 5:49 PM, harish suvarna <
> [email protected]>
> >>> >>> wrote:
> >>> >>> >
> >>> >>> > Rupert,
> >>> >>> > I generated the index for dbpedia3.8 English files only.
> >>> >>> > One thing that intrigues me is that the dbpedia.solrindex.zip file size
> >>> >>> > is 53kb, the same as when I generated it for Chinese. The English files
> >>> >>> > are much bigger.
> >>> >>> > In the English zip also, I can't find paris.
> >>> >>> > I am attaching the English dbpedia.solrindex.zip for any clues.
> >>> >>> > Do I need to load the bundle jar file created by the dbpedia indexing?
> >>> >>> >
> >>> >>> > -harish
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> --
> >>> >>> | Rupert Westenthaler             [email protected]
> >>> >>> | Bodenlehenstraße 11                             ++43-699-11108907
> >>> >>> | A-5500 Bischofshofen
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Thanks
> >>> >> Harish
> >>> >>
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > | Rupert Westenthaler             [email protected]
> >>> > | Bodenlehenstraße 11                             ++43-699-11108907
> >>> > | A-5500 Bischofshofen
> >>>
> >>>
> >>>
> >>> --
> >>> | Rupert Westenthaler             [email protected]
> >>> | Bodenlehenstraße 11                             ++43-699-11108907
> >>> | A-5500 Bischofshofen
> >>>
> >>
> >>
> >>
> >> --
> >> Thanks
> >> Harish
> >>
> >>
> >
> >
> > --
> > Thanks
> > Harish
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>



-- 
Thanks
Harish
