RE: No links to process, is the webgraph empty?

Markus Jelsma Wed, 30 May 2012 00:08:21 -0700

-----Original message-----
> From:Dustine Rene Bernasor <[email protected]>
> Sent: Wed 30-May-2012 05:45
> To: [email protected]
> Subject: Re: No links to process, is the webgraph empty?
> 
> Hello,
> 
> I tried your suggestion and set the link.ignore.xxx.xxx values to
> false, but it does not work. I tried to crawl a very small list of sites.
> Without running the webgraph step, I dumped the segment using this command:
> 
> bin/nutch readseg -dump \
>   /user/fetchdb/crawled/test/segments/20120530112254 /user/dump \
>   -nocontent -nofetch -nogenerate -noparse -noparsetext
> 
> Here's a sample entry from the dump:
> 
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: TinyMCE - Home
> Outlinks: 35
>    outlink: toUrl: http://www.tinymce.com/index.php anchor: Home
>    outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
>    outlink: toUrl: http://www.tinymce.com/download/download.php anchor: Download
>    outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
>    outlink: toUrl: http://www.tinymce.com/enterprise/enterprise.php anchor: Enterprise
>    outlink: toUrl: http://www.tinymce.com/develop/develop.php anchor: Develop
>    outlink: toUrl: http://www.tinymce.com/forum/index.php anchor: Forum
>    outlink: toUrl: http://www.tinymce.com/# anchor: Login
>    outlink: toUrl: http://www.tinymce.com/forum/register.php anchor: Register
>    outlink: toUrl: http://www.tinymce.com/index.php anchor: Version: 3.5.1.1
>    outlink: toUrl: http://www.tinymce.com/# anchor: always the same.
>    outlink: toUrl: http://www.tinymce.com/# anchor: Easy to integrate
>    outlink: toUrl: http://www.tinymce.com/# anchor: Customizable
>    outlink: toUrl: http://www.tinymce.com/# anchor: Browserfriendly
>    outlink: toUrl: http://www.tinymce.com/# anchor: Lightweight
>    outlink: toUrl: http://www.tinymce.com/# anchor: AJAX Compatible
>    outlink: toUrl: http://www.tinymce.com/# anchor: International
>    outlink: toUrl: http://www.tinymce.com/# anchor: Open Source
>    outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor:
>    outlink: toUrl: http://www.tinymce.com/download/download.php anchor: Download
>    outlink: toUrl: http://www.tinymce.com/js/tinymce/jscripts/tiny_mce/license.txt anchor: License
>    outlink: toUrl: http://www.tinymce.com/enterprise/mcimagemanager.php anchor: Learn more
>    outlink: toUrl: http://www.tinymce.com/enterprise/mcimagemanager_buy.php anchor: Buy
>    outlink: toUrl: http://www.tinymce.com/enterprise/mcfilemanager.php anchor: Learn more
>    outlink: toUrl: http://www.tinymce.com/enterprise/mcfilemanager_buy.php anchor: Buy
>    outlink: toUrl: http://www.tinymce.com/enterprise/support.php anchor: ask question
>    outlink: toUrl: http://www.tinymce.com/develop/bugtracker.php anchor: submit bug
>    outlink: toUrl: http://www.tinymce.com/enterprise/using.php anchor: More TinyMCE Users
>    outlink: toUrl: http://www.tinymce.com/# anchor: Back to site top
>    outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
>    outlink: toUrl: http://www.tinymce.com/download/download.php anchor: Download
>    outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
>    outlink: toUrl: http://www.tinymce.com/enterprise/enterprise.php anchor: Enterprise
>    outlink: toUrl: http://www.tinymce.com/develop/develop.php anchor: Develop
>    outlink: toUrl: http://www.tinymce.com/forum/index.php anchor: Forum
> Content Metadata: nutch.content.digest=647e4d7705884d232ce5456145f7cb99 Date=Wed, 30 May 2012 03:22:43 GMT Vary=Accept-Encoding Content-Length=3895 Content-Encoding=gzip nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20120530112254 Content-Type=text/html; charset=UTF-8 Connection=close Server=Apache _ftk_=1338348124863
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> 
> As you can see, even in the parse data there are no outlinks to
> external sites. (If you check the TinyMCE site, it has links to
> Microsoft, Facebook, etc.) So I think my problem is more or less
> related to the issue described here:
> 
> https://issues.apache.org/jira/browse/NUTCH-1346


No, that is a fix for an entirely different feature that is not yet released.
If external outlinks are not present, check your URL filters and the
db.ignore.external.links setting.
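
One quick way to confirm this from a segment dump is to count the outlinks per
host. The following is only a rough sketch in Python, not part of Nutch: the
function name split_outlinks, the sample text and the www.example.org link are
made up for illustration, and it assumes the dump file keeps each outlink on a
single line in the "outlink: toUrl: ... anchor: ..." form.

```python
import re
from urllib.parse import urlparse

# Matches one outlink line of a ParseData record in a `readseg -dump` file.
# Assumption: every outlink sits on its own line, as in an unwrapped dump.
OUTLINK_RE = re.compile(r"^\s*outlink: toUrl: (\S+)")


def split_outlinks(dump_text, source_host):
    """Return (internal, external) lists of outlink URLs found in dump_text."""
    internal, external = [], []
    for line in dump_text.splitlines():
        match = OUTLINK_RE.match(line)
        if not match:
            continue  # not an outlink line (title, metadata, ...)
        url = match.group(1)
        host = urlparse(url).hostname or ""
        if host == source_host:
            internal.append(url)
        else:
            external.append(url)
    return internal, external


# Two lines in the style of the dump plus one hypothetical external link.
SAMPLE = """\
Outlinks: 3
   outlink: toUrl: http://www.tinymce.com/index.php anchor: Home
   outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
   outlink: toUrl: http://www.example.org/ anchor: Elsewhere
"""

if __name__ == "__main__":
    internal, external = split_outlinks(SAMPLE, "www.tinymce.com")
    print(len(internal), "internal,", len(external), "external")
```

If the external list stays empty for a page that is known to link elsewhere,
that would suggest the links were dropped before the segment was written, which
points at the URL filters or db.ignore.external.links rather than at the
webgraph settings.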

> 
> 
> On 5/29/2012 4:55 PM, Markus Jelsma wrote:
> > Hi,
> >
> > That depends on what you crawl: many connected/linked sites or isolated
> > sites. If you crawl isolated sites, then do not ignore internal links or you
> > won't be able to build the webgraph. Keep in mind that without ignoring
> > internal links the webgraph will become very dense.
> >
> > Cheers
> >
> >
> > -----Original message-----
> >> From:Dustine Rene Bernasor<[email protected]>
> >> Sent: Tue 29-May-2012 10:51
> >> To: [email protected]
> >> Subject: Re: No links to process, is the webgraph empty?
> >>
> >> Hello,
> >>
> >> If I understand this correctly, I need to set link.ignore.limit.page and
> >> link.ignore.limit.domain to false and the link.ignore.internal.xxx can be
> >> set to true? Or should I just set all of the link.ignore.xxx.xxx values
> >> to false?
> >>
> >> On 5/29/2012 4:43 PM, Markus Jelsma wrote:
> >>> Hi,
> >>>
> >>> That's a patch for the fetcher. The error you are seeing is quite simple
> >>> actually. Because you set those two link.ignore parameters to true, no
> >>> links between the same domain and host are aggregated, only links from/to
> >>> external hosts and domains. This is a good setting for wide web crawls.
> >>> If you restrict crawling to a few domains and they don't share links
> >>> between them, then with these settings you will have no links to process.
> >>>
> >>> Markus
> >>>
> >>>
> >>> -----Original message-----
> >>>> From:Dustine Rene Bernasor<[email protected]>
> >>>> Sent: Tue 29-May-2012 10:40
> >>>> To: [email protected]
> >>>> Subject: Re: No links to process, is the webgraph empty?
> >>>>
> >>>> Hello,
> >>>>
> >>>> I tried to read the segment containing the site, which I am sure has a
> >>>> link to another site, and I was surprised to find that the outlinks
> >>>> stored all belong to the same domain. I came across this:
> >>>>
> >>>> https://issues.apache.org/jira/browse/NUTCH-1346
> >>>>
> >>>> It seems a patch is available for 1.6. I am currently using 1.2. The
> >>>> latest release for Nutch is 1.4. Would it be safe to switch directly to
> >>>> 1.6?
> >>>>
> >>>>
> >>>>
> >>>> On 5/29/2012 10:19 AM, Dustine Rene Bernasor wrote:
> >>>>> Hello,
> >>>>>
> >>>>> Whenever I set link.ignore.internal.host and link.ignore.internal.domain
> >>>>> in nutch-site.xml to "true", I get the "No links to process, is the
> >>>>> webgraph empty?" error when performing LinkRank. However, if I set it to
> >>>>> "false", LinkRank works just fine. I have been searching about this
> >>>>> error but I haven't found anything conclusive so far.  Btw, I have also
> >>>>> set both the link.ignore.limit.page and the link.ignore.limit.domain to
> >>>>> "true".
> >>>>>
> >>>>> Furthermore, if I run NodeReader on a certain page A, it says that
> >>>>> the page has 0 inlinks and outlinks, but I know that there's
> >>>>> another page B that links to A. But if I run NodeReader on B, it says
> >>>>> there's 1 inlink and 1 outlink, although B has links to many other sites.
> >>>>>
> >>>>> I hope someone can shed light on this matter.
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>> Dustine
> >>>>>
> >>
> 
>
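
To make the explanation quoted above concrete: the toy model below (Python, not
Nutch's actual WebGraph code) drops links whose source and target share a host
or a domain, which is roughly what the two link.ignore.internal settings are
described as doing. The helper names, the crude "last two labels" domain rule
and the second site are assumptions for illustration only.

```python
from urllib.parse import urlparse


def host_of(url):
    """Host part of a URL, e.g. 'www.tinymce.com'."""
    return urlparse(url).hostname or ""


def domain_of(url):
    """Crude assumption: the domain is the last two labels of the host."""
    return ".".join(host_of(url).split(".")[-2:])


def kept_links(links, ignore_internal_host, ignore_internal_domain):
    """Return the (source, target) pairs that survive the two ignore flags."""
    kept = []
    for src, dst in links:
        if ignore_internal_host and host_of(src) == host_of(dst):
            continue  # same host: dropped when internal host links are ignored
        if ignore_internal_domain and domain_of(src) == domain_of(dst):
            continue  # same domain: dropped when internal domain links are ignored
        kept.append((src, dst))
    return kept


# Two isolated sites that only link to themselves; the second is hypothetical.
LINKS = [
    ("http://www.tinymce.com/index.php", "http://www.tinymce.com/wiki.php"),
    ("http://www.tinymce.com/index.php", "http://www.tinymce.com/forum/index.php"),
    ("http://www.example.org/", "http://www.example.org/about"),
]

if __name__ == "__main__":
    print("both true: ", len(kept_links(LINKS, True, True)), "links kept")
    print("both false:", len(kept_links(LINKS, False, False)), "links kept")
```

With both flags set to true, nothing survives for isolated sites like these, so
there is nothing left for LinkRank to process. With both set to false, every
internal link is kept, which is also why the graph becomes dense.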
