Hi Danicela, I can confirm that I can recreate this behaviour.
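
For reference, my nutch-site.xml overrides for this test amount to roughly the following (a sketch showing only the two properties mentioned below; http.agent.name and the rest of my usual settings are omitted):

<?xml version="1.0"?>
<!-- Sketch only: the two fetcher properties discussed in this thread. -->
<configuration>
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
  </property>
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>-1</value>
    <!-- per the property description, -1 means never skip a page because of a
         long robots.txt Crawl-Delay; wait however long it asks instead -->
  </property>
</configuration>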
My example:

The following page's robots.txt, http://www.heraldscotland.com/robots.txt, has a Crawl-delay of 10.

With fetcher.verbose set to true and fetcher.max.crawl.delay set to -1 in nutch-site.xml, my logs read

2012-02-17 11:44:58,079 INFO fetcher.Fetcher - fetching http://www.heraldscotland.com/

So after fetching of the segment finished, I dump the segment's fetch data:

lewis@lewis-01:~/ASF/trunk-test/runtime/local$ bin/nutch readseg -dump segments/20120217115205 output -nocontent -nogenerate -noparse -noparsedata -noparsetext
SegmentReader: dump segment: segments/20120217115205
SegmentReader: done

which looks like the following (I also added the Nutch site for clarity, just to see that something is getting fetched):

Recno:: 0
URL:: http://nutch.apache.org/

CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Fri Feb 17 11:52:22 GMT 2012
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1329479517277Content-Type: text/html_pst_: success(1), lastModified=0

Recno:: 1
URL:: http://www.heraldscotland.com/

CrawlDatum::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Fri Feb 17 11:52:21 GMT 2012
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1329479517277_pst_: robots_denied(18), lastModified=0

So I delete my crawldb and segments dir and try again with fetcher.max.crawl.delay set to 100; the page is fetched, I go for a cup of tea, and everything is fine.

Off the top of my head, I think it would be really neat if we could grab the crawl-delay value and display it next to the fetcher log output, something like

2012-02-17 11:44:58,079 INFO fetcher.Fetcher - fetching http://www.heraldscotland.com/ (crawl.delay=10ms)

or something... wdyt?
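To make that suggestion a bit more concrete, the gist is something like the snippet below. This is only a rough sketch with made-up names (FetchLogSketch, fetchMessage and crawlDelayMs are all hypothetical, not actual Fetcher code); in the real FetcherThread we would presumably read the value from the parsed robots rules (RobotRules.getCrawlDelay(), if I remember the interface right) next to the existing "fetching ..." log call.

// Hypothetical helper, not actual Fetcher code: just the shape of the log
// message I have in mind. crawlDelayMs stands in for whatever the parsed
// robots rules report for the host, with <= 0 meaning no Crawl-delay is set.
public class FetchLogSketch {

  static String fetchMessage(String url, long crawlDelayMs) {
    if (crawlDelayMs > 0) {
      return "fetching " + url + " (crawl.delay=" + crawlDelayMs + "ms)";
    }
    return "fetching " + url;
  }

  public static void main(String[] args) {
    // Illustrative values only.
    System.out.println(fetchMessage("http://www.heraldscotland.com/", 10000));
    System.out.println(fetchMessage("http://nutch.apache.org/", 0));
  }
}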
But unless there is something else at play here, it appears there is a small problem with this property.

Lewis

On Thu, Feb 16, 2012 at 9:38 AM, Danicela nutch <[email protected]> wrote:

> I have sites which have a crawl delay at 20, others at 720, but in both
> cases, it should fetch some pages, yet it couldn't.
>
> ----- Original Message -----
> From: Lewis John Mcgibbney
> Sent: 15.02.12 23:11
> To: [email protected]
> Subject: Re: Re : Re: fetcher.max.crawl.delay = -1 doesn't work?
>
> Another question I should have asked is how long is the crawl delay in
> robots.txt? If you read the fetcher.max.crawl.delay property description, it
> explicitly notes that the fetcher will wait however long is required by
> robots.txt until it fetches the page. Do you have this information? Thanks
>
> On Wed, Feb 15, 2012 at 9:08 AM, Danicela nutch <[email protected]> wrote:
>
> > I don't think I configured such things, how can I be sure?
> >
> > ----- Original Message -----
> > From: Lewis John Mcgibbney
> > Sent: 14.02.12 19:18
> > To: [email protected]
> > Subject: Re: fetcher.max.crawl.delay = -1 doesn't work?
> >
> > Hi Danicela, Before I try this, have you configured any other overrides
> > for generating or fetching in nutch-site.xml? Thanks
> >
> > On Tue, Feb 14, 2012 at 3:10 PM, Danicela nutch <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I have in my nutch-site.xml the value fetcher.max.crawl.delay = -1.
> > >
> > > When I try to fetch a site with a robots.txt with a Crawl-Delay, it
> > > doesn't work.
> > >
> > > If I put fetcher.max.crawl.delay = 10000, it works.
> > >
> > > I use Nutch 1.2, but according to the changelog, nothing has been
> > > changed about that since then.
> > >
> > > Is this a Nutch bug or did I misuse something?
> > >
> > > Another thing: in hadoop.log, the pages which couldn't be fetched are
> > > still marked as "fetching", is this normal? Shouldn't they be marked as
> > > "dropped" or something?
> > >
> > > Thanks.
> >
> > -- *Lewis*
>
> -- *Lewis*

-- *Lewis*

