Re: No links to process, is the webgraph empty?

Dustine Rene Bernasor Tue, 29 May 2012 20:45:38 -0700

Hello,

I tried your suggestion by setting the link.ignore.xxx.xxx values to false, but it does not work. I tried to crawl a very small list of sites. Without running webgraph, I dumped the segment using this command:

bin/nutch readseg -dump /user/fetchdb/crawled/test/segments/20120530112254 /user/dump -nocontent -nofetch -nogenerate -noparse -noparsetext


Here's a sample entry from the dump:

ParseData::
Version: 5
Status: success(1,0)
Title: TinyMCE - Home
Outlinks: 35
  outlink: toUrl: http://www.tinymce.com/index.php anchor: Home
  outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
  outlink: toUrl: http://www.tinymce.com/download/download.php anchor: Download
  outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
  outlink: toUrl: http://www.tinymce.com/enterprise/enterprise.php anchor: Enterprise
  outlink: toUrl: http://www.tinymce.com/develop/develop.php anchor: Develop
  outlink: toUrl: http://www.tinymce.com/forum/index.php anchor: Forum
  outlink: toUrl: http://www.tinymce.com/# anchor: Login
  outlink: toUrl: http://www.tinymce.com/forum/register.php anchor: Register
  outlink: toUrl: http://www.tinymce.com/index.php anchor: Version: 3.5.1.1
  outlink: toUrl: http://www.tinymce.com/# anchor: always the same.
  outlink: toUrl: http://www.tinymce.com/# anchor: Easy to integrate
  outlink: toUrl: http://www.tinymce.com/# anchor: Customizable
  outlink: toUrl: http://www.tinymce.com/# anchor: Browserfriendly
  outlink: toUrl: http://www.tinymce.com/# anchor: Lightweight
  outlink: toUrl: http://www.tinymce.com/# anchor: AJAX Compatible
  outlink: toUrl: http://www.tinymce.com/# anchor: International
  outlink: toUrl: http://www.tinymce.com/# anchor: Open Source
  outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor:
  outlink: toUrl: http://www.tinymce.com/download/download.php anchor: Download
  outlink: toUrl: http://www.tinymce.com/js/tinymce/jscripts/tiny_mce/license.txt anchor: License
  outlink: toUrl: http://www.tinymce.com/enterprise/mcimagemanager.php anchor: Learn more
  outlink: toUrl: http://www.tinymce.com/enterprise/mcimagemanager_buy.php anchor: Buy
  outlink: toUrl: http://www.tinymce.com/enterprise/mcfilemanager.php anchor: Learn more
  outlink: toUrl: http://www.tinymce.com/enterprise/mcfilemanager_buy.php anchor: Buy
  outlink: toUrl: http://www.tinymce.com/enterprise/support.php anchor: ask question
  outlink: toUrl: http://www.tinymce.com/develop/bugtracker.php anchor: submit bug
  outlink: toUrl: http://www.tinymce.com/enterprise/using.php anchor: More TinyMCE Users
  outlink: toUrl: http://www.tinymce.com/# anchor: Back to site top
  outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
  outlink: toUrl: http://www.tinymce.com/download/download.php anchor: Download
  outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
  outlink: toUrl: http://www.tinymce.com/enterprise/enterprise.php anchor: Enterprise
  outlink: toUrl: http://www.tinymce.com/develop/develop.php anchor: Develop
  outlink: toUrl: http://www.tinymce.com/forum/index.php anchor: Forum

Content Metadata: nutch.content.digest=647e4d7705884d232ce5456145f7cb99 Date=Wed, 30 May 2012 03:22:43 GMT Vary=Accept-Encoding Content-Length=3895 Content-Encoding=gzip nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20120530112254 Content-Type=text/html; charset=UTF-8 Connection=close Server=Apache _ftk_=1338348124863

Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

As you can see, even in the parse data, there are no outlinks to external sites. (If you check the tinymce site, it has links to microsoft, facebook, etc.) So I am thinking my problem is more or less related to the issue described here:

https://issues.apache.org/jira/browse/NUTCH-1346
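
To check a whole dump rather than eyeballing one entry, something like the following rough Python sketch could tally internal versus external outlinks. It assumes the "outlink: toUrl: ... anchor: ..." line format shown in the sample entry; the function name count_outlink_hosts is made up for illustration, and it compares exact host names only, which is simpler than whatever domain matching Nutch itself applies.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches the "outlink: toUrl: <url> anchor: <text>" lines of a readseg dump.
OUTLINK_RE = re.compile(r"outlink: toUrl:\s*(\S+)\s+anchor:")

def count_outlink_hosts(dump_text, source_host):
    """Count outlinks in dump text, split into same-host and other-host links."""
    counts = Counter()
    for match in OUTLINK_RE.finditer(dump_text):
        host = urlparse(match.group(1)).hostname or ""
        kind = "internal" if host == source_host else "external"
        counts[kind] += 1
    return counts

# A few lines copied from the sample entry above.
sample = """\
  outlink: toUrl: http://www.tinymce.com/index.php anchor: Home
  outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
  outlink: toUrl: http://www.tinymce.com/# anchor: Login
"""

counts = count_outlink_hosts(sample, "www.tinymce.com")
print("internal:", counts["internal"], "external:", counts["external"])
```

If the external count stays at zero for a page that is known to link to other sites, the outlinks are being dropped before the webgraph step ever sees them.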


On 5/29/2012 4:55 PM, Markus Jelsma wrote:

Hi,

That depends on what you crawl, many connected/linked sites or isolated sites. 
If you crawl isolated sites then do not ignore internal links or you won't be 
able to build the webgraph. Keep in mind that without ignoring internal links 
the webgraph will become very dense.

Cheers


-----Original message-----

From: Dustine Rene Bernasor <[email protected]>
Sent: Tue 29-May-2012 10:51
To: [email protected]
Subject: Re: No links to process, is the webgraph empty?

Hello,

If I understand this correctly, I need to set link.ignore.limit.page and
link.ignore.limit.domain to false and the link.ignore.internal.xxx can be
set to true? Or should I just set all of the link.ignore.xxx.xxx values
to false?

On 5/29/2012 4:43 PM, Markus Jelsma wrote:

Hi,

That's a patch for the fetcher. The error you are seeing is quite simple 
actually. Because you set those two link.ignore parameters to true, no links 
between the same domain and host are aggregated, only links from/to external 
hosts and domains. This is a good setting for wide web crawls. If you restrict 
crawling to a few domains and they don't share links between them, then with 
these settings you will have no links to process.
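
As a rough illustration of that effect (this is not Nutch's actual code, and the function name filter_links and its parameter are hypothetical), dropping same-host links from a crawl of a single site leaves nothing to process:

```python
from urllib.parse import urlparse

def filter_links(links, ignore_internal_host=True):
    """Return (source, target) pairs, optionally dropping same-host links."""
    kept = []
    for source, target in links:
        same_host = urlparse(source).hostname == urlparse(target).hostname
        if ignore_internal_host and same_host:
            continue  # internal link: skipped when the ignore flag is on
        kept.append((source, target))
    return kept

# A crawl restricted to one site only yields internal links.
links = [
    ("http://www.tinymce.com/", "http://www.tinymce.com/index.php"),
    ("http://www.tinymce.com/", "http://www.tinymce.com/wiki.php"),
]

print(len(filter_links(links, ignore_internal_host=True)))
print(len(filter_links(links, ignore_internal_host=False)))
```

With the flag on, every link in the sample is filtered out; with it off, both survive, which matches LinkRank working only when the internal settings are "false".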

Markus


-----Original message-----

From: Dustine Rene Bernasor <[email protected]>
Sent: Tue 29-May-2012 10:40
To: [email protected]
Subject: Re: No links to process, is the webgraph empty?

Hello,

I tried to read the segment containing the site which I am sure has a
link towards another site and I was surprised to find out that the outlinks
stored all belong to the same domain. I came across this

https://issues.apache.org/jira/browse/NUTCH-1346

It seems a patch is available for 1.6. I am currently using 1.2. The
latest release for Nutch is 1.4. Would it be safe to switch directly to
1.6?

On 5/29/2012 10:19 AM, Dustine Rene Bernasor wrote:

Hello,

Whenever I set link.ignore.internal.host and link.ignore.internal.domain
in nutch-site.xml to "true", I get the "No links to process, is the
webgraph empty?" error when performing LinkRank. However, if I set it to
"false", LinkRank works just fine. I have been searching about this
error but I haven't found anything conclusive so far.  Btw, I have also
set both the link.ignore.limit.page and the link.ignore.limit.domain to
"true".

Furthermore, if I perform NodeReader on a certain page A, it says that
that page has 0 inlinks and 0 outlinks, but I know that there's
another page B that links to A. But if I do the NodeReader on B it says
there's 1 inlink and 1 outlink although B has links to many other sites.

I hope someone can shed light on this matter.

Thanks.

Dustine
