Andrzej Bialecki wrote:
> On 2010-06-29 11:37, Alex McLintock wrote:
>> One thing I don't really understand about running Nutch.
>>
>> If I am doing several topical crawls - or perhaps crawls constrained
>> to a number of sites - I will be fetching the same page several times.
>
> Do you mean crawls that use disjoint CrawlDb-s? Then yes, there is
> no mechanism in Nutch to prevent this.
>
>> It would obviously be polite not to fetch the same page twice.
>
> If you use a single CrawlDb, the same page won't appear on fetchlists
> multiple times, because the first time you run a CrawlDb update it
> already records that the page was fetched.

I reported a bug about this some time ago:

https://issues.apache.org/jira/browse/NUTCH-774

The provided patch has not been integrated into the Nutch 1.1 release.
It can happen that the retry interval is set to 0, so the same page is
fetched again and again.
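To make the effect of that bug concrete, here is a rough sketch of the
"is this page due for fetching?" check (my own code and names, not
Nutch's actual implementation): a page is due whenever the current time
has passed lastFetch + interval, so with the interval stuck at 0 the
page is due on every single generate run.

    // Rough sketch (not Nutch's code) of why a zero fetch interval
    // causes endless re-fetching: a page is "due" whenever
    // now >= lastFetch + interval.
    import java.util.concurrent.TimeUnit;

    public class FetchDueCheck {

        /** Returns true if the URL should go on the next fetchlist. */
        static boolean isDue(long lastFetchMillis, long fetchIntervalSeconds, long nowMillis) {
            long nextFetch = lastFetchMillis + TimeUnit.SECONDS.toMillis(fetchIntervalSeconds);
            return nowMillis >= nextFetch;
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            long lastFetch = now - TimeUnit.HOURS.toMillis(1); // fetched an hour ago

            // With a sane interval (30 days) the page is not due yet.
            System.out.println(isDue(lastFetch, TimeUnit.DAYS.toSeconds(30), now)); // false

            // With the interval stuck at 0 (the NUTCH-774 symptom) the page
            // is due again on every generate run.
            System.out.println(isDue(lastFetch, 0, now)); // true
        }
    }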
>> Now one way I have seen is to use some kind of HTTP caching proxy
>> between your Nutch/Hadoop crawl and the outside world. But that kind
>> of defeats the point of using Nutch if the proxy is all on one big
>> box.
>>
>> Does Nutch do anything like that itself? As far as I can see it only
>> really stores the processed documents - not the originally fetched
>> ones. Each new crawl is effectively a new crawl - ignoring whether or
>> not any pages were fetched before.
>
> By default Nutch stores everything in segments - both raw pages, parsed
> text, metadata, outlinks, etc. Page status is maintained in the CrawlDb.
> If you use the same CrawlDb to generate/fetch/parse/update, then the
> CrawlDb is the place that remembers which pages have been fetched and
> which ones to schedule for re-fetching.
>
>> PS: I found this Jira issue to refetch only new pages. Is this
>> available in the release?
>>
>> https://issues.apache.org/jira/browse/NUTCH-49
>
> Yes. See the AdaptiveFetchSchedule class for details.
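For anyone wondering what the adaptive schedule does in practice, here
is a small illustrative sketch of the idea behind AdaptiveFetchSchedule
(the constants and method names below are my own, not Nutch's): the
fetch interval shrinks when a page is seen to have changed and grows
when it has not, clamped between a minimum and a maximum.

    // Illustrative sketch of the adaptive re-fetch idea (NUTCH-49);
    // the constants and names are mine, not Nutch's.
    public class AdaptiveIntervalSketch {

        static final float INC_RATE = 0.4f;                   // grow interval when page is unchanged
        static final float DEC_RATE = 0.2f;                   // shrink interval when page has changed
        static final long MIN_INTERVAL = 60L;                  // one minute, in seconds
        static final long MAX_INTERVAL = 365L * 24 * 60 * 60;  // one year, in seconds

        /** Returns the new fetch interval (seconds) after one fetch. */
        static long adjust(long intervalSeconds, boolean pageChanged) {
            float interval = intervalSeconds;
            if (pageChanged) {
                interval -= interval * DEC_RATE;   // changed: check again sooner
            } else {
                interval += interval * INC_RATE;   // unchanged: back off
            }
            return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, (long) interval));
        }

        public static void main(String[] args) {
            long interval = 30L * 24 * 60 * 60;    // start at 30 days
            interval = adjust(interval, false);    // page did not change -> about 42 days
            interval = adjust(interval, true);     // page changed        -> about 34 days
            System.out.println(interval);
        }
    }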

