Hi,
I solved the problem by changing the property
db.ignore.internal.links in the nutch-site.xml file. The
description of this property in nutch-default.xml reads: "If true,
when adding new links to a page, links from the same host are ignored.
This is an effective way to limit the size of the link database,
keeping only the highest quality links."

The linkdb is now complete, so the LinkDbReader class now produces a
non-empty dump file.
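For reference, this is a sketch of the override in nutch-site.xml (the
<configuration> wrapper is the standard Nutch/Hadoop config format; the
default value in nutch-default.xml is true, so setting it to false keeps
same-host links and lets the linkdb fill up):

```xml
<configuration>
  <!-- Keep internal (same-host) links so they end up in the linkdb.
       Default in nutch-default.xml is true, which drops them all
       for a single-host crawl and leaves the linkdb empty. -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
</configuration>
```

Note that the linkdb has to be rebuilt after the change, e.g. with
bin/nutch invertlinks crawl/linkdb -dir crawl/segments, before
readlinkdb will show the new links.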

Greets,
Davide

2011/1/11 Norman Birke <[email protected]>:
> Hi Davide,
>
> sorry, I never worked that one out and never received a reply.
>
> Greets,
> Norman
>
> -----Original Message-----
> From: Davide Cavalaglio [mailto:[email protected]]
> Sent: Tuesday, 11 January 2011 13:07
> To: [email protected]; [email protected]
> Subject: Re: readlinkdb does not work on nutch 1.0 installation
>
> Hi,
> have you solved this problem? I have the same issue: I cannot read
> the linkdb, which I think is broken, because the generated files (in
> the linkdb directory) are all 1 KB.
> I am using Nutch 1.0.
>
> Thanks
>
> 2010/4/14 Norman Birke <[email protected]>
>>
>> Hi,
>>
>> I am trying to dump my linkdb content for analysis using the following
>> command:
>> bin/nutch readlinkdb crawl/linkdb -dump readlinkdb_dump
>>
>> I receive the following output in my shell:
>> LinkDb dump: starting
>> LinkDb db: crawl/linkdb/
>>
>> After that the readlinkdb_dump folder exists and in it the 2 files:
>> .part-00000.crc (which has a size of 8 bytes)
>> part-00000 (which has a size of 0 bytes)
>>
>> As I have 686 URLs in my crawldb, the file size seems a bit small to me.
>> Everything else works fine - I can read and dump my crawldb, read and
>> dump my segments, and so on.
>>
>> Any idea what might be messed up?
>>
>> Many thanks in advance,
>> Norman
>>
>
>