On 2010-06-29 11:37, Alex McLintock wrote:
> One thing I don't really understand about running Nutch.
> 
> If I am doing several topical crawls - or perhaps crawls constrained
> to a number of sites - I will be fetching the same page several times.

Do you mean crawls that use disjoint CrawlDb-s? If so, then yes - there
is no mechanism in Nutch to prevent this.

> It would obviously be polite to not fetch the same page twice.

If you use a single CrawlDb, the same page won't appear on fetchlists
multiple times, because the first CrawlDb update after a fetch records
that the page has already been fetched.
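
For reference, a single-CrawlDb cycle looks roughly like this (the
"crawl" directory layout and the segment timestamp are only example
names - substitute your own paths):

  # inject seed urls into the crawldb (only needed to bootstrap)
  bin/nutch inject crawl/crawldb seeds/
  # generate a fetchlist as a new segment
  bin/nutch generate crawl/crawldb crawl/segments
  # fetch the segment that generate just created, then parse it
  # (the separate parse step applies when fetcher.parse is false)
  bin/nutch fetch crawl/segments/20100629123456
  bin/nutch parse crawl/segments/20100629123456
  # fold the results back into the crawldb, so the next generate
  # won't pick the same pages again until they are due for re-fetch
  bin/nutch updatedb crawl/crawldb crawl/segments/20100629123456

As long as all your topical crawls feed the same CrawlDb, a page
fetched by one of them won't be put on a fetchlist for the others
before its re-fetch interval expires.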

> 
> Now one way I have seen is to use some kind of HTTP caching proxy
> between your Nutch/Hadoop crawl and the outside world. But that kind
> of defeats the point of using Nutch if the proxy is all on one big
> box.
> 
> Does Nutch do anything like that itself? As far as I can see it only
> really stores the processed documents - not the originally fetched
> ones. Each new crawl effectively starts from scratch, ignoring
> whether or not any pages were fetched before.

By default Nutch stores everything in segments - raw page content,
parsed text, metadata, outlinks, etc. Page status is maintained in the
CrawlDb. If you use the same CrawlDb throughout the
generate/fetch/parse/update cycle, then the CrawlDb is the place that
remembers which pages have already been fetched and which ones to
schedule for re-fetching.
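
If you want to see this for yourself, both structures can be inspected
from the command line (the dump output paths are arbitrary examples):

  # per-URL status, fetch time and interval kept in the crawldb
  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readdb crawl/crawldb -dump crawldb-dump

  # raw content plus parse text/data stored in a segment
  bin/nutch readseg -dump crawl/segments/20100629123456 segment-dump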

> PS: I found this Jira issue about re-fetching only new pages. Is this
> available in the release?
> 
> https://issues.apache.org/jira/browse/NUTCH-49

Yes. See the AdaptiveFetchSchedule class for details.
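
The switch to enable it is the fetch schedule implementation in
nutch-site.xml - something along these lines (the adaptive tuning
properties and their defaults are documented in nutch-default.xml; the
min_interval value below is only an illustration):

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <!-- example: never schedule a re-fetch sooner than one day -->
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>86400</value>
  </property>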

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
