-----Original message-----
> From:Dustine Rene Bernasor <[email protected]>
> Sent: Wed 30-May-2012 05:45
> To: [email protected]
> Subject: Re: No links to process, is the webgraph empty?
>
> Hello,
>
> I tried your suggestion by setting the link.ignore.xxx.xxx values to
> false but it does not work. I tried to crawl a very small list of sites.
> Without performing webgraph, I dumped the segment using this command:
>
> /bin/nutch readseg -dump
> /user/fetchdb/crawled/test/segments/20120530112254 /user/dump -nocontent
> -nofetch -nogenerate -noparse -noparsetext/
>
> Here's a sample entry from the dump:
>
> /ParseData::
> Version: 5
> Status: success(1,0)
> Title: TinyMCE - Home
> Outlinks: 35
> outlink: toUrl: http://www.tinymce.com/index.php anchor: Home
> outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
> outlink: toUrl: http://www.tinymce.com/download/download.php anchor: Download
> outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
> outlink: toUrl: http://www.tinymce.com/enterprise/enterprise.php anchor: Enterprise
> outlink: toUrl: http://www.tinymce.com/develop/develop.php anchor: Develop
> outlink: toUrl: http://www.tinymce.com/forum/index.php anchor: Forum
> outlink: toUrl: http://www.tinymce.com/# anchor: Login
> outlink: toUrl: http://www.tinymce.com/forum/register.php anchor: Register
> outlink: toUrl: http://www.tinymce.com/index.php anchor: Version: 3.5.1.1
> outlink: toUrl: http://www.tinymce.com/# anchor: always the same.
> outlink: toUrl: http://www.tinymce.com/# anchor: Easy to integrate
> outlink: toUrl: http://www.tinymce.com/# anchor: Customizable
> outlink: toUrl: http://www.tinymce.com/# anchor: Browserfriendly
> outlink: toUrl: http://www.tinymce.com/# anchor: Lightweight
> outlink: toUrl: http://www.tinymce.com/# anchor: AJAX Compatible
> outlink: toUrl: http://www.tinymce.com/# anchor: International
> outlink: toUrl: http://www.tinymce.com/# anchor: Open Source
> outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor:
> outlink: toUrl: http://www.tinymce.com/download/download.php anchor: Download
> outlink: toUrl: http://www.tinymce.com/js/tinymce/jscripts/tiny_mce/license.txt anchor: License
> outlink: toUrl: http://www.tinymce.com/enterprise/mcimagemanager.php anchor: Learn more
> outlink: toUrl: http://www.tinymce.com/enterprise/mcimagemanager_buy.php anchor: Buy
> outlink: toUrl: http://www.tinymce.com/enterprise/mcfilemanager.php anchor: Learn more
> outlink: toUrl: http://www.tinymce.com/enterprise/mcfilemanager_buy.php anchor: Buy
> outlink: toUrl: http://www.tinymce.com/enterprise/support.php anchor: ask question
> outlink: toUrl: http://www.tinymce.com/develop/bugtracker.php anchor: submit bug
> outlink: toUrl: http://www.tinymce.com/enterprise/using.php anchor: More TinyMCE Users
> outlink: toUrl: http://www.tinymce.com/# anchor: Back to site top
> outlink: toUrl: http://www.tinymce.com/tryit/full.php anchor: Try it
> outlink: toUrl: http://www.tinymce.com/download/download.php anchor: Download
> outlink: toUrl: http://www.tinymce.com/wiki.php anchor: Documentation
> outlink: toUrl: http://www.tinymce.com/enterprise/enterprise.php anchor: Enterprise
> outlink: toUrl: http://www.tinymce.com/develop/develop.php anchor: Develop
> outlink: toUrl: http://www.tinymce.com/forum/index.php anchor: Forum
> Content Metadata: nutch.content.digest=647e4d7705884d232ce5456145f7cb99
> Date=Wed, 30 May 2012 03:22:43 GMT Vary=Accept-Encoding
> Content-Length=3895 Content-Encoding=gzip nutch.crawl.score=1.0 _fst_=33
> nutch.segment.name=20120530112254 Content-Type=text/html; charset=UTF-8
> Connection=close Server=Apache _ftk_=1338348124863
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 /
>
> As you can see, even in the parse data, there are no outlinks to
> external sites. (If you check the tinymce site, it has links to
> microsoft, facebook, etc) So I am thinking my problem is more or less
> related to the issue described here
>
> https://issues.apache.org/jira/browse/NUTCH-1346
No, that is a fix for an entirely different feature that is not yet released.
If external outlinks are not present, then check your URL filters and
db.ignore.external.

>
>
> On 5/29/2012 4:55 PM, Markus Jelsma wrote:
> > Hi,
> >
> > That depends on what you crawl, many connected/linked sites or isolated
> > sites. If you crawl isolated sites then do not ignore internal links or you
> > won't be able to build the webgraph. Keep in mind that without ignoring
> > internal links the webgraph will become very dense.
> >
> > Cheers
> >
> >
> > -----Original message-----
> >> From:Dustine Rene Bernasor<[email protected]>
> >> Sent: Tue 29-May-2012 10:51
> >> To: [email protected]
> >> Subject: Re: No links to process, is the webgraph empty?
> >>
> >> Hello,
> >>
> >> If I understand this correctly, I need to set link.ignore.limit.page and
> >> link.ignore.limit.domain to false and the link.ignore.internal.xxx can be
> >> set to true? Or should I just set all of the link.ignore.xxx.xxx values
> >> to false?
> >>
> >> On 5/29/2012 4:43 PM, Markus Jelsma wrote:
> >>> Hi,
> >>>
> >>> That's a patch for the fetcher. The error you are seeing is quite simple
> >>> actually. Because you set those two link.ignore parameters to true, no
> >>> links between the same domain and host are aggregated, only links from/to
> >>> external hosts and domains. This is a good setting for wide web crawls.
> >>> If you restrict crawling to a few domains and they don't share links
> >>> between them, then with these settings you will have no links to process.
> >>>
> >>> Markus
> >>>
> >>>
> >>> -----Original message-----
> >>>> From:Dustine Rene Bernasor<[email protected]>
> >>>> Sent: Tue 29-May-2012 10:40
> >>>> To: [email protected]
> >>>> Subject: Re: No links to process, is the webgraph empty?
> >>>>
> >>>> Hello,
> >>>>
> >>>> I tried to read the segment containing the site which I am sure has a
> >>>> link towards another site and I was surprised to find out that the
> >>>> outlinks stored all belong to the same domain. I came across this
> >>>>
> >>>> https://issues.apache.org/jira/browse/NUTCH-1346
> >>>>
> >>>> It seems a patch is available for 1.6. I am currently using 1.2. The
> >>>> latest release for Nutch is 1.4. Would it be safe to switch directly to
> >>>> 1.6?
> >>>>
> >>>>
> >>>>
> >>>> On 5/29/2012 10:19 AM, Dustine Rene Bernasor wrote:
> >>>>> Hello,
> >>>>>
> >>>>> Whenever I set link.ignore.internal.host and link.ignore.internal.domain
> >>>>> in nutch-site.xml to "true", I get the "No links to process, is the
> >>>>> webgraph empty?" error when performing LinkRank. However, if I set it to
> >>>>> "false", LinkRank works just fine. I have been searching about this
> >>>>> error but I haven't found anything conclusive so far. Btw, I have also
> >>>>> set both the link.ignore.limit.page and the link.ignore.limit.domain to
> >>>>> "true".
> >>>>>
> >>>>> Furthermore, if I perform NodeReader on a certain page A, it says that
> >>>>> that page has 0 inlinks and outlinks but I know that there's
> >>>>> another page B that links to A. But if I do the NodeReader on B it says
> >>>>> there's 1 inlink and 1 outlink although B has links to many other sites.
> >>>>>
> >>>>> I hope someone can shed light on this matter.
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>> Dustine
> >>>>>
> >>
> >
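
For anyone reading this thread later: the effect Markus describes for link.ignore.internal.host and link.ignore.internal.domain can be sketched roughly in Python as below. This is only an illustration under assumptions, not Nutch's actual code. The function names, the ignore_host/ignore_domain flags (standing in for the two properties), and the "last two labels" domain guess are all made up for the example, and the source page URL is assumed since the dump does not show it. The outlink URLs are taken from the dump in the thread.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased host part of a URL ("" if there is none)."""
    return (urlparse(url).hostname or "").lower()


def domain_of(url):
    """Naive domain guess: the last two labels of the host.

    Assumption for illustration only; it is wrong for suffixes such as co.uk.
    """
    labels = host_of(url).split(".")
    return ".".join(labels[-2:])


def filter_outlinks(from_url, to_urls, ignore_host=True, ignore_domain=True):
    """Drop links that stay on the same host or domain when the flags are set."""
    kept = []
    for to_url in to_urls:
        if ignore_host and host_of(to_url) == host_of(from_url):
            continue
        if ignore_domain and domain_of(to_url) == domain_of(from_url):
            continue
        kept.append(to_url)
    return kept


# Hypothetical source page; the outlinks are a few entries from the dump.
page = "http://www.tinymce.com/index.php"
outlinks = [
    "http://www.tinymce.com/tryit/full.php",
    "http://www.tinymce.com/download/download.php",
    "http://www.tinymce.com/wiki.php",
]

# Both flags true: every link is internal, so nothing is left to process.
print(filter_outlinks(page, outlinks))  # prints []

# Both flags false: the internal links are kept.
print(len(filter_outlinks(page, outlinks, ignore_host=False, ignore_domain=False)))  # prints 3
```

With only same-domain outlinks in the segment, as in the dump above, a filter of this kind leaves an empty link set, which would be consistent with the "No links to process" message when both properties are "true".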

