Hello,

If I understand this correctly, I need to set link.ignore.limit.page and link.ignore.limit.domain to false and the link.ignore.internal.xxx can be set to true? Or should I just set all of the link.ignore.xxx.xxx values to false?

On 5/29/2012 4:43 PM, Markus Jelsma wrote:
Hi,

That's a patch for the fetcher. The error you are seeing is quite simple 
actually. Because you set those two link.ignore parameters to true, no links 
between the same domain and host are aggregated, only links from/to external 
hosts and domains. This is a good setting for wide web crawls. If you restrict 
crawling to a few domains and they don't share links between them, then with 
these settings you will have no links to process.
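
To make the effect concrete, here is a minimal sketch of the filtering described above. It is a simplified model and not Nutch's actual WebGraph code: the helper names (host, domain, keep_link), the example URLs, and the "last two labels" domain rule are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def host(url):
    """Return the hostname of a URL (empty string if it has none)."""
    return urlparse(url).hostname or ""


def domain(url):
    """Crude assumption: the domain is the last two labels of the hostname."""
    return ".".join(host(url).split(".")[-2:])


def keep_link(src, dst, ignore_internal_host=True, ignore_internal_domain=True):
    """Hypothetical model of the link.ignore.internal.* checks."""
    if ignore_internal_host and host(src) == host(dst):
        return False  # same host: dropped when the host flag is on
    if ignore_internal_domain and domain(src) == domain(dst):
        return False  # same domain: dropped when the domain flag is on
    return True


# A crawl restricted to one domain: every link stays inside example.com.
links = [
    ("http://www.example.com/a", "http://www.example.com/b"),
    ("http://www.example.com/a", "http://blog.example.com/c"),
]

# Both flags "true": nothing survives, so there are no links to process.
strict = [link for link in links if keep_link(*link)]

# Both flags "false": internal links are kept.
relaxed = [
    link
    for link in links
    if keep_link(*link, ignore_internal_host=False, ignore_internal_domain=False)
]
```

In this model, a link from www.example.com to a host under a different domain would pass both checks, which is why the same settings work for a wide web crawl.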

Markus


-----Original message-----
From: Dustine Rene Bernasor <[email protected]>
Sent: Tue 29-May-2012 10:40
To: [email protected]
Subject: Re: No links to process, is the webgraph empty?

Hello,

I tried to read the segment containing the site which I am sure has a
link towards another site and I was surprised to find out that the outlinks
stored all belong to the same domain. I came across this

https://issues.apache.org/jira/browse/NUTCH-1346

It seems a patch is available for 1.6. I am currently using 1.2. The
latest release for Nutch is 1.4. Would it be safe to switch directly to
1.6?



On 5/29/2012 10:19 AM, Dustine Rene Bernasor wrote:
Hello,

Whenever I set link.ignore.internal.host and link.ignore.internal.domain
in nutch-site.xml to "true", I get the "No links to process, is the
webgraph empty?" error when performing LinkRank. However, if I set it to
"false", LinkRank works just fine. I have been searching about this
error but I haven't found anything conclusive so far. By the way, I have also
set both link.ignore.limit.page and link.ignore.limit.domain to
"true".

Furthermore, if I run NodeReader on a certain page A, it says that
the page has 0 inlinks and 0 outlinks, but I know that there's
another page B that links to A. But if I run NodeReader on B, it says
there's 1 inlink and 1 outlink, although B links to many other sites.

I hope someone can shed light on this matter.

Thanks.

Dustine


