On 2010-06-29 11:37, Alex McLintock wrote:
> One thing I don't really understand about running Nutch.
>
> If I am doing several topical crawls - or perhaps crawls constrained
> to a number of sites - I will be fetching the same page several times.

Do you mean crawls that use disjoint CrawlDb-s? Then yes, there is no
mechanism in Nutch to prevent this.

> It would obviously be polite to not fetch the same page twice.

If you use a single CrawlDb, the same page won't appear on fetchlists
more than once, because the first CrawlDb update after the fetch already
records that the page was fetched.
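For reference, the usual single-CrawlDb cycle with the 1.x command line
tools looks roughly like this (the urls dir and the crawl/* paths are
just example names):

  # Inject seed URLs once, then repeat generate/fetch/parse/updatedb.
  bin/nutch inject crawl/crawldb urls

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  # pick up the segment that generate just created
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch parse $s
  # updatedb writes the fetch status back into the CrawlDb, so pages
  # fetched here won't be put on later fetchlists until they are due
  # for re-fetch
  bin/nutch updatedb crawl/crawldb $s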
> Now one way I have seen is to use some kind of http caching proxy
> between your nutch/hadoop crawl and the outside world. But that kind
> of defeats the point of using Nutch if the proxy is all on one big
> box.
>
> Does nutch do anything like that itself? As far as I can see it only
> really stores the processed documents - not the originally fetched
> ones. Each new crawl is effectively a new crawl - ignoring whether or
> not any pages were fetched before.

By default Nutch stores everything in segments - the raw fetched pages
as well as the parsed text, metadata, outlinks, etc. Page status is kept
in the CrawlDb. If you run generate/fetch/parse/update against the same
CrawlDb, then the CrawlDb is the place that remembers which pages have
already been fetched and which ones to schedule for re-fetching.
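If you want to see this for yourself, the readseg/readdb tools dump both
stores (the segment name below is made up):

  # segment dump - the content/ part holds the raw fetched pages
  bin/nutch readseg -dump crawl/segments/20100629123456 segdump

  # CrawlDb dump - per-URL fetch status, fetch time, retry interval
  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readdb crawl/crawldb -url http://www.example.com/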
> PS I found this Jira issue to refetch only new pages. Is this
> available in the release?
>
> https://issues.apache.org/jira/browse/NUTCH-49

Yes. See the AdaptiveFetchSchedule class for details.
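To enable it, override the fetch schedule class in conf/nutch-site.xml,
along these lines (the inc_rate/dec_rate properties are optional - the
values shown are the defaults from nutch-default.xml, adjust to taste):

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <!-- shrink the re-fetch interval when a page changed since the last
       fetch, grow it when it did not -->
  <property>
    <name>db.fetch.schedule.adaptive.dec_rate</name>
    <value>0.2</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.inc_rate</name>
    <value>0.4</value>
  </property>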
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web / Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com