Hi,

That depends on what you crawl: many connected/linked sites, or isolated sites. If you crawl isolated sites, do not ignore internal links or you won't be able to build the webgraph. Keep in mind that without ignoring internal links the webgraph will become very dense.
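For isolated sites, that advice can be sketched as a nutch-site.xml fragment (a minimal sketch only; the property names come from this thread, and their defaults may differ per Nutch version):

```xml
<!-- nutch-site.xml: keep internal links so the webgraph is not empty
     when the crawled sites do not link to each other -->
<property>
  <name>link.ignore.internal.host</name>
  <value>false</value>
  <description>Keep links between pages on the same host.</description>
</property>
<property>
  <name>link.ignore.internal.domain</name>
  <value>false</value>
  <description>Keep links between pages in the same domain.</description>
</property>
```

For a wide web crawl, setting both back to true keeps the graph sparse by aggregating only links between external hosts and domains.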
Cheers

-----Original message-----
> From: Dustine Rene Bernasor <[email protected]>
> Sent: Tue 29-May-2012 10:51
> To: [email protected]
> Subject: Re: No links to process, is the webgraph empty?
>
> Hello,
>
> If I understand this correctly, I need to set link.ignore.limit.page and
> link.ignore.limit.domain to false, and the link.ignore.internal.xxx values
> can be set to true? Or should I just set all of the link.ignore.xxx.xxx
> values to false?
>
> On 5/29/2012 4:43 PM, Markus Jelsma wrote:
> > Hi,
> >
> > That's a patch for the fetcher. The error you are seeing is actually
> > quite simple. Because you set those two link.ignore parameters to true,
> > no links between the same domain and host are aggregated; only links
> > from/to external hosts and domains are. This is a good setting for wide
> > web crawls. If you restrict crawling to a few domains and they don't
> > share links between them, then with these settings you will have no
> > links to process.
> >
> > Markus
> >
> >
> > -----Original message-----
> >> From: Dustine Rene Bernasor <[email protected]>
> >> Sent: Tue 29-May-2012 10:40
> >> To: [email protected]
> >> Subject: Re: No links to process, is the webgraph empty?
> >>
> >> Hello,
> >>
> >> I tried to read the segment containing the site which I am sure has a
> >> link towards another site, and I was surprised to find that the stored
> >> outlinks all belong to the same domain. I came across this:
> >>
> >> https://issues.apache.org/jira/browse/NUTCH-1346
> >>
> >> It seems a patch is available for 1.6. I am currently using 1.2. The
> >> latest release for Nutch is 1.4. Would it be safe to switch directly
> >> to 1.6?
> >>
> >>
> >>
> >> On 5/29/2012 10:19 AM, Dustine Rene Bernasor wrote:
> >>> Hello,
> >>>
> >>> Whenever I set link.ignore.internal.host and link.ignore.internal.domain
> >>> in nutch-site.xml to "true", I get the "No links to process, is the
> >>> webgraph empty?" error when performing LinkRank.
> >>> However, if I set them to "false", LinkRank works just fine. I have
> >>> been searching about this error but I haven't found anything
> >>> conclusive so far. By the way, I have also set both
> >>> link.ignore.limit.page and link.ignore.limit.domain to "true".
> >>>
> >>> Furthermore, if I run NodeReader on a certain page A, it says that
> >>> the page has 0 inlinks and 0 outlinks, but I know that there is
> >>> another page B that links to A. Yet if I run NodeReader on B, it says
> >>> there is 1 inlink and 1 outlink, although B links to many other sites.
> >>>
> >>> I hope someone can shed light on this matter.
> >>>
> >>> Thanks.
> >>>
> >>> Dustine
> >>>
> >>
> >

