Hi,

That's the patch for the fetcher. The error you are seeing is actually quite simple: because you set those two link.ignore parameters to true, no links between the same domain and host are aggregated; only links from/to external hosts and domains are kept. This is a good setting for wide web crawls, but if you restrict crawling to a few domains and they don't link to each other, then with these settings you will have no links to process.
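For a restricted crawl like yours, something along these lines in nutch-site.xml should keep the internal links and give LinkRank a non-empty webgraph (a sketch only; the descriptions are mine, so check conf/nutch-default.xml in your version for the exact wording and defaults):

```xml
<!-- nutch-site.xml: keep intra-host/intra-domain links so the webgraph
     is not empty when crawling a few domains that don't interlink -->
<property>
  <name>link.ignore.internal.host</name>
  <value>false</value>
  <description>Do not discard links between pages on the same host.</description>
</property>
<property>
  <name>link.ignore.internal.domain</name>
  <value>false</value>
  <description>Do not discard links between pages in the same domain.</description>
</property>
```
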

Markus
 
 
-----Original message-----
> From:Dustine Rene Bernasor <[email protected]>
> Sent: Tue 29-May-2012 10:40
> To: [email protected]
> Subject: Re: No links to process, is the webgraph empty?
> 
> Hello,
> 
> I tried to read the segment containing a site that I am sure links to 
> another site, and I was surprised to find that the stored outlinks
> all belong to the same domain. I came across this
> 
> https://issues.apache.org/jira/browse/NUTCH-1346
> 
> It seems a patch is available for 1.6. I am currently using 1.2. The 
> latest release for Nutch is 1.4. Would it be safe to switch directly to 
> 1.6?
> 
> 
> 
> On 5/29/2012 10:19 AM, Dustine Rene Bernasor wrote:
> > Hello,
> >
> > Whenever I set link.ignore.internal.host and link.ignore.internal.domain
> > in nutch-site.xml to "true", I get the "No links to process, is the
> > webgraph empty?" error when performing LinkRank. However, if I set it to
> > "false", LinkRank works just fine. I have been searching about this
> > error but I haven't found anything conclusive so far.  Btw, I have also
> > set both the link.ignore.limit.page and the link.ignore.limit.domain to
> > "true".
> >
> > Furthermore, if I run NodeReader on a certain page A, it says that
> > the page has 0 inlinks and 0 outlinks, but I know that there's
> > another page B that links to A. And if I run NodeReader on B, it says
> > there's 1 inlink and 1 outlink, although B links to many other sites.
> >
> > I hope someone can shed light on this matter.
> >
> > Thanks.
> >
> > Dustine
> >
> 
> 
