Andrzej Bialecki wrote:
> On 2010-06-29 11:37, Alex McLintock wrote:
>   
>> One thing I don't really understand about running Nutch.
>>
>> If I am doing several topical crawls - or perhaps crawls constrained
>> to a number of sites - I will be fetching the same page several times.
>>     
>
> Do you mean the crawls that use disjoint CrawlDb-s? Then yes, there is
> no mechanism in Nutch to prevent this.
>
>   
>> It would obviously be polite to not fetch the same page twice.
>>     
>
> If you use a single CrawlDb the same page won't appear on fetchlists
> multiple times, because the first time that you run CrawlDb update it
> already records that it was fetched.
>   
I reported a bug about this some time ago:
https://issues.apache.org/jira/browse/NUTCH-774
The provided patch has not been integrated into the Nutch 1.1 release.
It may happen that the retry interval is set to 0, so the same page is
fetched again and again.
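
If you want to check whether your CrawlDb is affected, one quick way
(a sketch, assuming a 1.x layout with the CrawlDb under crawl/crawldb;
adjust the path and URL to your setup) is to read individual records
with the readdb tool and look at the retry/fetch interval it prints:

  bin/nutch readdb crawl/crawldb -url http://www.example.com/some/page
  bin/nutch readdb crawl/crawldb -dump crawldb-dump

An interval of 0 in those records is the symptom described in NUTCH-774.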

>   
>> Now one way I have seen is to use some kind of HTTP caching proxy
>> between your Nutch/Hadoop crawl and the outside world. But that kind
>> of defeats the point of using Nutch if the proxy is all on one big
>> box.
>>
>> Does Nutch do anything like that itself? As far as I can see it only
>> really stores the processed documents - not the originally fetched
>> ones. Each new crawl effectively starts from scratch - ignoring
>> whether or not any pages were fetched before.
>>     
>
> By default Nutch stores everything in segments - both raw pages, parsed
> text, metadata, outlinks, etc. Page status is maintained in crawldb. If
> you use the same crawldb to generate/fetch/parse/update then CrawlDb is
> the place that remembers what pages have been fetched and which ones to
> schedule for re-fetching.
>
>   
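For reference, the single-CrawlDb cycle described above looks roughly
like this with the 1.x command line tools; the paths (crawl/crawldb,
crawl/segments, urls) are only placeholders for this sketch, and
<segment> stands for the segment directory that generate creates:

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  bin/nutch fetch crawl/segments/<segment>
  bin/nutch parse crawl/segments/<segment>
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment>

Because updatedb writes the fetch status back into the CrawlDb, the next
generate run leaves already-fetched pages off the fetchlist until they
are due for re-fetching.
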
>> PS I found this Jira issue to refetch only new pages. Is this
>> available in the release?
>>
>> https://issues.apache.org/jira/browse/NUTCH-49
>>     
>
> Yes. See AdaptiveFetchSchedule class for details.
>
>   
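
In case it is useful: as far as I know, switching to it is just a matter
of setting the fetch schedule class in nutch-site.xml. The property
names below come from nutch-default.xml (please double-check them
against your version), and the interval values are only examples:

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>86400</value>      <!-- never re-fetch more often than daily -->
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>2592000</value>    <!-- but at least every 30 days -->
  </property>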
