Hello,

I tried your suggestion and set all of the link.ignore.xxx.xxx values to false, but it still does not work. I crawled a very small list of sites and, without building the webgraph, dumped the segment using this command:

bin/nutch readseg -dump /user/fetchdb/crawled/test/segments/20120530112254 /user/dump -nocontent -nofetch -nogenerate -noparse -noparsetext

Here's a sample entry from the dump:

ParseData::
Version: 5
Status: success(1,0)
Title: TinyMCE - Home
Outlinks: 35
  outlink: toUrl: http://www.tinymce.com/index.php anchor: Home
  outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
  outlink: toUrl: http://www.tinymce.com/download/download.php anchor: Download
  outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
  outlink: toUrl: http://www.tinymce.com/enterprise/enterprise.php anchor: Enterprise
  outlink: toUrl: http://www.tinymce.com/develop/develop.php anchor: Develop
  outlink: toUrl: http://www.tinymce.com/forum/index.php anchor: Forum
  outlink: toUrl: http://www.tinymce.com/# anchor: Login
  outlink: toUrl: http://www.tinymce.com/forum/register.php anchor: Register
  outlink: toUrl: http://www.tinymce.com/index.php anchor: Version: 3.5.1.1
  outlink: toUrl: http://www.tinymce.com/# anchor: always the same.
  outlink: toUrl: http://www.tinymce.com/# anchor: Easy to integrate
  outlink: toUrl: http://www.tinymce.com/# anchor: Customizable
  outlink: toUrl: http://www.tinymce.com/# anchor: Browserfriendly
  outlink: toUrl: http://www.tinymce.com/# anchor: Lightweight
  outlink: toUrl: http://www.tinymce.com/# anchor: AJAX Compatible
  outlink: toUrl: http://www.tinymce.com/# anchor: International
  outlink: toUrl: http://www.tinymce.com/# anchor: Open Source
  outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor:
  outlink: toUrl: http://www.tinymce.com/download/download.php anchor: Download
  outlink: toUrl: http://www.tinymce.com/js/tinymce/jscripts/tiny_mce/license.txt anchor: License
  outlink: toUrl: http://www.tinymce.com/enterprise/mcimagemanager.php anchor: Learn more
  outlink: toUrl: http://www.tinymce.com/enterprise/mcimagemanager_buy.php anchor: Buy
  outlink: toUrl: http://www.tinymce.com/enterprise/mcfilemanager.php anchor: Learn more
  outlink: toUrl: http://www.tinymce.com/enterprise/mcfilemanager_buy.php anchor: Buy
  outlink: toUrl: http://www.tinymce.com/enterprise/support.php anchor: ask question
  outlink: toUrl: http://www.tinymce.com/develop/bugtracker.php anchor: submit bug
  outlink: toUrl: http://www.tinymce.com/enterprise/using.php anchor: More TinyMCE Users
  outlink: toUrl: http://www.tinymce.com/# anchor: Back to site top
  outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
  outlink: toUrl: http://www.tinymce.com/download/download.php anchor: Download
  outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
  outlink: toUrl: http://www.tinymce.com/enterprise/enterprise.php anchor: Enterprise
  outlink: toUrl: http://www.tinymce.com/develop/develop.php anchor: Develop
  outlink: toUrl: http://www.tinymce.com/forum/index.php anchor: Forum
Content Metadata: nutch.content.digest=647e4d7705884d232ce5456145f7cb99 Date=Wed, 30 May 2012 03:22:43 GMT Vary=Accept-Encoding Content-Length=3895 Content-Encoding=gzip nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20120530112254 Content-Type=text/html; charset=UTF-8 Connection=close Server=Apache _ftk_=1338348124863
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

As you can see, even the parse data contains no outlinks to external sites. (If you check the TinyMCE site, it has links to Microsoft, Facebook, etc.) So I think my problem is more or less related to the issue described here:

https://issues.apache.org/jira/browse/NUTCH-1346
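
For anyone who wants to repeat this check on a larger dump, here is a minimal sketch that scans the dump text for outlinks pointing at a different host. It assumes the "outlink: toUrl: ... anchor: ..." layout shown above; the function name and the regular expression are illustrative and not part of Nutch.

```python
import re
from urllib.parse import urlsplit

# Matches the "outlink: toUrl: <url> anchor:" layout seen in the dump above.
OUTLINK_RE = re.compile(r"outlink: toUrl: (\S+) anchor:")


def external_outlinks(dump_text, page_host):
    """Return outlink URLs in dump_text whose host differs from page_host."""
    urls = OUTLINK_RE.findall(dump_text)
    return [u for u in urls if urlsplit(u).hostname != page_host]


sample = """\
  outlink: toUrl: http://www.tinymce.com/index.php anchor: Home
  outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
  outlink: toUrl: http://www.tinymce.com/# anchor: Login
"""

# Every sample outlink stays on www.tinymce.com, so nothing is reported.
print(external_outlinks(sample, "www.tinymce.com"))  # prints []
```

An empty result for a page that is known to link elsewhere would suggest the external links are already missing from the parse data, before the webgraph is built.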


On 5/29/2012 4:55 PM, Markus Jelsma wrote:
Hi,

That depends on what you crawl, many connected/linked sites or isolated sites. 
If you crawl isolated sites then do not ignore internal links or you won't be 
able to build the webgraph. Keep in mind that without ignoring internal links 
the webgraph will become very dense.

Cheers


-----Original message-----
From:Dustine Rene Bernasor<[email protected]>
Sent: Tue 29-May-2012 10:51
To: [email protected]
Subject: Re: No links to process, is the webgraph empty?

Hello,

If I understand this correctly, I need to set link.ignore.limit.page and
link.ignore.limit.domain to false and the link.ignore.internal.xxx can be
set to true? Or should I just set all of the link.ignore.xxx.xxx values
to false?

On 5/29/2012 4:43 PM, Markus Jelsma wrote:
Hi,

That's a patch for the fetcher. The error you are seeing is quite simple 
actually. Because you set those two link.ignore parameters to true, no links 
between the same domain and host are aggregated, only links from/to external 
hosts and domains. This is a good setting for wide web crawls. If you restrict 
crawling to a few domains and they don't share links between them, then with 
these settings you will have no links to process.
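
The effect described above can be illustrated with a small sketch. This is not Nutch's actual code: it only compares hosts, ignores the domain and limit settings, and the function name and example URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def keep_links(links, ignore_internal_host):
    """Drop links whose source and target share a host when the flag is set."""
    kept = []
    for src, dst in links:
        same_host = urlsplit(src).hostname == urlsplit(dst).hostname
        if ignore_internal_host and same_host:
            continue  # internal link, skipped as with the flag set to true
        kept.append((src, dst))
    return kept


# Hypothetical crawl of two isolated sites that never link to each other.
links = [
    ("http://a.example/", "http://a.example/about"),
    ("http://b.example/", "http://b.example/contact"),
]

print(len(keep_links(links, ignore_internal_host=True)))   # prints 0
print(len(keep_links(links, ignore_internal_host=False)))  # prints 2
```

With the flag set and only isolated sites crawled, every link is filtered out, which matches the "no links to process" situation.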

Markus


-----Original message-----
From:Dustine Rene Bernasor<[email protected]>
Sent: Tue 29-May-2012 10:40
To: [email protected]
Subject: Re: No links to process, is the webgraph empty?

Hello,

I tried to read the segment containing the site which I am sure has a
link towards another site and I was surprised to find out that the outlinks
stored all belong to the same domain. I came across this

https://issues.apache.org/jira/browse/NUTCH-1346

It seems a patch is available for 1.6. I am currently using 1.2. The
latest release for Nutch is 1.4. Would it be safe to switch directly to
1.6?



On 5/29/2012 10:19 AM, Dustine Rene Bernasor wrote:
Hello,

Whenever I set link.ignore.internal.host and link.ignore.internal.domain
in nutch-site.xml to "true", I get the "No links to process, is the
webgraph empty?" error when performing LinkRank. However, if I set it to
"false", LinkRank works just fine. I have been searching about this
error but I haven't found anything conclusive so far.  Btw, I have also
set both the link.ignore.limit.page and the link.ignore.limit.domain to
"true".

Furthermore, if I perform NodeReader on a certain page A, it says that
that page has 0 inlinks and outlinks but I know that there's
another page B that links to A. But if I do the NodeReader on B it says
there's 1 inlink and 1 outlink although B has links to many other sites.

I hope someone can shed light on this matter.

Thanks.

Dustine


