Hi,
sorry, I completely forgot to answer your question about the problem
with the indexing configuration. But it looks like you were on the
right track anyway: the problem is indeed the format of
"incoming_links.txt", which is caused by the different namespace of the
Chinese dump.
Here are the details:
The expected format of the "incoming_links.txt" (based on the
configuration in "iditerator.properties") is
{score} {local-name}
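As a small illustration of that format: the 'sort | uniq -c | sort -nr'
steps of your script count how often each local name occurs, so the
score is simply the number of incoming links. The names A, B and C and
the helper 'count_links' are made up for this sketch.

```shell
# Same three steps as at the end of the script: group, count, order by count.
count_links() { sort | uniq -c | sort -nr; }

# Six made-up link targets; each line stands for one incoming link.
printf '%s\n' B A B C B A | count_links
```

This prints B with a count of 3, then A with 2, then C with 1 (uniq
pads the counts with leading blanks).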
Note also the 'id-namespace' property, which is set to
"http://dbpedia.org/resource/" in the "iditerator.properties" file.
This configuration corresponds to the 'sed' command in your shell script:
> curl http://downloads.dbpedia.org/3.6/zh/page_links_zh.nt.bz2 \
> | bzcat \
> | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
> | sort \
> | uniq -c \
> | sort -nr > incoming_links.txt
Because the Chinese dump uses a different namespace
("http://zh.dbpedia.org/resource/"), the regex (the -e parameter) does
not match. sed then passes each line through unchanged, so the local
names of the Entities are not extracted from the "page_links_zh.nt.bz2"
file. That is why the lines you posted still contain complete triples
instead of the expected format.
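You can see the difference on a single line. This is only a sketch: the
sample triple uses shortened local names (A and B), and 'extract_old' /
'extract_new' are helper names I made up for the illustration.

```shell
# Sample triple in the shape of the Chinese dump (local names shortened).
line='<http://zh.dbpedia.org/resource/A> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://zh.dbpedia.org/resource/B> .'

# The expression from the original script (dbpedia.org namespace).
extract_old() { sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/'; }
# The same expression with the zh.dbpedia.org namespace.
extract_new() { sed -e 's/.*<http\:\/\/zh\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/'; }

echo "$line" | extract_old   # no match: the line is printed unchanged
echo "$line" | extract_new   # match: only the local name of the link target
```

The first call prints the whole triple again, the second one prints
just "B".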
To fix this you need to make the following two changes:
1) change the sed command so that it uses the correct namespace
sed -e 's/.*<http\:\/\/zh\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
2) change the value for the 'id-namespace' in
{indexing-working-dir}/indexing/config/iditerator.properties to the
namespace used by the Chinese dump "http://zh.dbpedia.org/resource/"
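To double check the second change, something like the following could
be used. It is only a sketch: it assumes the property is written as a
plain 'id-namespace=...' line (the usual Java properties syntax), and
'check_namespace' is a helper name made up for this example.

```shell
# Hypothetical helper: succeeds if the given properties file contains the
# exact line 'id-namespace=<namespace>' (assumes plain key=value syntax).
check_namespace() {
  grep -qxF "id-namespace=$2" "$1"
}

# Example call; replace {indexing-working-dir} with your actual directory:
# check_namespace "{indexing-working-dir}/indexing/config/iditerator.properties" \
#   "http://zh.dbpedia.org/resource/" && echo "namespace ok"
```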
NOTES:
* I noticed that the curl part of the included shell script still
refers to version 3.6. You probably want to download the data
from "http://downloads.dbpedia.org/3.8/zh/page_links_zh.nt.bz2"
instead.
* For testing it is useful to add a '| head -n 1000 \' between '| bzcat
\' and the 'sed' command. This causes only the first 1000 lines of the
dump to be processed. It runs in less than a second and allows you to
review the results of the command. You can even use the resulting
"incoming_links.txt" file for indexing! While this will only index a
small fraction of the entities, it might still be useful for testing.
I ran some tests and the following script looked fine to me (NOTE: it
contains the '| head -n 1000 \' line; you might want to remove it
after checking the results):
curl http://downloads.dbpedia.org/3.8/zh/page_links_zh.nt.bz2 \
| bzcat \
| head -n 1000 \
| sed -e 's/.*<http\:\/\/zh\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
| sort \
| uniq -c \
| sort -nr > incoming_links.txt
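For reviewing the result, a rough check is to count the lines that
still contain a full URI: if the regex matched, lines should consist of
a count and a local name only. The helper name 'count_unmatched' is
made up for this sketch.

```shell
# Hypothetical helper: prints the number of lines that still contain
# '<http', i.e. lines the sed expression did not rewrite.
count_unmatched() {
  grep -c '<http' "$1" || true   # grep exits non-zero when the count is 0
}

# Example call after running the script:
# count_unmatched incoming_links.txt
```

A result of 0 (or close to it) suggests that the namespace in the sed
command fits the dump.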
Again, sorry for the late response.
best
Rupert
On Mon, Aug 27, 2012 at 5:09 PM, harish suvarna <[email protected]> wrote:
> Rupert, any clues on this problem?
>
> The resources below have http://zh.dbpedia.org. That does not exist. Does
> it cause any problems? I did
>
> curl http://downloads.dbpedia.org/3.6/zh/page_links_zh.nt.bz2 \
> | bzcat \
> | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
> | sort \
> | uniq -c \
> | sort -nr > incoming_links.txt
>
> to generate chinese incoming_links.txt.
>
> -harish
>
> On Thu, Aug 23, 2012 at 2:15 PM, harish suvarna <[email protected]> wrote:
>
>> OK. Great. It may be easy to fix then. here are few lines.
>>
>> 1192 <
>> http://zh.dbpedia.org/resource/\u7121\u7DAB\u96FB\u8996\u5916\u8CFC\u7F8E\u570B\u96FB\u5F71\u5217\u8868>
>> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/\u660E\u73E0\u53F0> .
>> 876 <
>> http://zh.dbpedia.org/resource/NGC\u5929\u4F53\u5217\u8868_(1000-1999)> <
>> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/\u661F\u7CFB> .
>> 781 <
>> http://zh.dbpedia.org/resource/\u7121\u7DAB\u96FB\u8996\u7BC0\u76EE\u5217\u8868>
>> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/\u7FE1\u7FE0\u53F0> .
>> 611 <
>> http://zh.dbpedia.org/resource/\u7121\u7DAB\u96FB\u8996\u5916\u8CFC\u52D5\u756B\u5217\u8868>
>> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/\u7FE1\u7FE0\u53F0> .
>> 573 <http://zh.dbpedia.org/resource/NGC\u5929\u4F53\u5217\u8868_(1-999)>
>> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/\u661F\u7CFB> .
>> 519 <
>> http://zh.dbpedia.org/resource/\u540D\u5075\u63A2\u67EF\u5357\u52D5\u756B\u96C6\u6578\u5217\u8868>
>> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/\u540D\u5075\u63A2\u67EF\u5357\u6F2B\u756B\u5217\u8868>
>> .
>> 384 <
>> http://zh.dbpedia.org/resource/2006\u5E74\u9999\u6E2F\u9078\u8209\u59D4\u54E1\u6703\u754C\u5225\u5206\u7D44\u9078\u8209>
>> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/File:Black_check.svg> .
>> 366 <
>> http://zh.dbpedia.org/resource/\u5A1B\u6A02\u767E\u5206\u767E\u7BC0\u76EE\u5217\u8868_(2007\u5E74)>
>> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/\u5C0F\u9B3C> .
>> 365 <
>> http://zh.dbpedia.org/resource/\u7C21\u7E41\u8F49\u63DB\u4E00\u5C0D\u591A\u5217\u8868>
>> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/File:Cmbox_move.png> .
>> 355 <
>> http://zh.dbpedia.org/resource/\u5A1B\u6A02\u767E\u5206\u767E\u7BC0\u76EE\u5217\u8868_(2007\u5E74)>
>> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/\u5C0F\u8C6C> .
>> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/Category:\u90B5\u9633\u4EBA> .
>> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/Category:\u8523\u59D3> .
>> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/Category:\u806F\u5408\u570B\u5B89\u5168\u7406\u4E8B\u6703\u4E3B\u5E2D>
>> .
>> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/Category:\u570B\u7ACB\u6E05\u83EF\u5927\u5B78\u6559\u6388>
>> .
>> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/Category:\u54E5\u502B\u6BD4\u4E9E\u5927\u5B78\u6821\u53CB>
>> .
>> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/Category:\u53F0\u7063\u5916\u7701\u4EBA> .
>> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/Category:\u5357\u958B\u5927\u5B78\u6559\u6388>
>> .
>> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/Category:\u4E2D\u83EF\u6C11\u570B\u99D0\u8607\u806F\u5927\u4F7F>
>> .
>> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> http://zh.dbpedia.org/resource/Category:\u4E2D\u83EF\u6C11\u570B\u99D0\u7F8E\u570B\u5927\u4F7F>
>> .
>>
>>
>>
>> On Thu, Aug 23, 2012 at 1:37 PM, Rupert Westenthaler <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> one more thing. Can you please post me the first few lines of
>>>
>>> {indexing-source}/indexing/resource/incoming_links.txt
>>>
>>> so that I can check the data against the configuration of the
>>> iditerator.properties file
>>>
>>> best
>>> Rupert
>>>
>>> On Thu, Aug 23, 2012 at 10:31 PM, Rupert Westenthaler
>>> <[email protected]> wrote:
>>> > Hi
>>> >
>>> > The log shows clearly that you only import the triples from the dumps
>>> > to the Jena TDB triple store used as Source for the indexing.
>>> >
>>> > See all the lines such as
>>> >
>>> > 8:14:08,196 [Thread-5] INFO tdb.loader - Add: 50,000 triples
>>> > (Batch: 3,256 / Avg: 3,256)
>>> > 08:14:12,802 [Thread-5] INFO tdb.loader - Add: 100,000 triples
>>> > (Batch: 10,855 / Avg: 5,009)
>>> >
>>> > BTW: this needs only to be done once. After this initialization step
>>> > completes you can remove the RDF files from
>>> > "{indexing-root}/indexing/resources/rdfdata/" (I usually just rename
>>> > the rdfdata folder to imported-rdfdata).
>>> >
>>> > The ~1.5hrs are just the time needed to import the data from the RDF
>>> > dumps to the Jena TDB store.
>>> >
>>> > With
>>> >
>>> > 08:18:04,242 [main] INFO impl.IndexerImpl - Indexing started ...
>>> >
>>> > the indexing starts and
>>> >
>>> > 08:21:03,176 [Indexing: Finished Entity Logger Deamon] INFO
>>> > impl.IndexerImpl - Indexed 0 items in 1410320sec (Infinityms/item):
>>> > processing: -1.000ms/item | queue: -1.000ms
>>> >
>>> > states clearly that no single Entity was indexed.
>>> >
>>> > I guess this has to do with the configuration. I will have a look at
>>> > it tomorrow morning.
>>> >
>>> > best
>>> > Rupert
>>> >
>>> > On Thu, Aug 23, 2012 at 9:53 PM, harish suvarna <[email protected]>
>>> wrote:
>>> >> I am attaching the zip of config folder. The indexing takes quiet some
>>> time
>>> >> (~1.5hrs). The number of triples it generates is high.
>>> >> I am attaching the english indexing output also. I used 10 files
>>> (except
>>> >> long_abstarcts_en.nt, it is 2.5 GB and I could not save it in utf8 on
>>> my
>>> >> mac.). But for Chinese I had all files.
>>> >> -harish
>>> >>
>>> >>
>>> >> On Thu, Aug 23, 2012 at 12:27 PM, Rupert Westenthaler
>>> >> <[email protected]> wrote:
>>> >>>
>>> >>> I would expect the dbpedia.solrindex.zip file to be several hundreds
>>> >>> MByte in size (if not gigabytes).
>>> >>>
>>> >>> The only explanation for this file to be so small is that something is
>>> >>> going wrong during indexing.
>>> >>>
>>> >>> Can you maybe provide the {indexing-root}/indexing/config folder so
>>> >>> that I can have a look at your configuration
>>> >>>
>>> >>> best
>>> >>> Rupert
>>> >>>
>>> >>> On Thu, Aug 23, 2012 at 5:49 PM, harish suvarna <[email protected]>
>>> >>> wrote:
>>> >>> >
>>> >>> > Rupert,
>>> >>> > I generated the index for dbpedia3.8 English files only.
>>> >>> > One thing that intrigues me is that the dbpedia.solrindex.zip
>>> filesize
>>> >>> > is
>>> >>> > 53kb, same when I generated for chinese. The english files are much
>>> >>> > bigger.
>>> >>> > In the english zip also, I can't find paris.
>>> >>> > I am attaching English dbpedia.solrindex.zip for any clues.
>>> >>> > Do I need to load the bundle jar file created by the dbpedia
>>> indexing?
>>> >>> >
>>> >>> > -harish
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> | Rupert Westenthaler [email protected]
>>> >>> | Bodenlehenstraße 11 ++43-699-11108907
>>> >>> | A-5500 Bischofshofen
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Thanks
>>> >> Harish
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > | Rupert Westenthaler [email protected]
>>> > | Bodenlehenstraße 11 ++43-699-11108907
>>> > | A-5500 Bischofshofen
>>>
>>>
>>>
>>> --
>>> | Rupert Westenthaler [email protected]
>>> | Bodenlehenstraße 11 ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>>
>>
>>
>> --
>> Thanks
>> Harish
>>
>>
>
>
> --
> Thanks
> Harish
--
| Rupert Westenthaler [email protected]
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen