> > Well, Nutch (resp. Hadoop) is designed to process large amounts of data.
> > Job management has some overhead (and some artificial sleeps):
> > 5 cycles * 4 jobs (generate/fetch/parse/update) = 20 jobs.
> > 6s per job seems roughly ok, though it could be slightly faster.
>
> Yes, this test is not well designed for Nutch, but I thought, as Stefan
> said, there might be a config setting or hardcoded delay somewhere in the
> Nutch files I could try to reduce, since I will use it on a single machine.
If your crawl is small and does not require more than one machine, then you could use the local mode instead of the distributed one.

Nutch is designed for large-scale crawling: if you are after a low-latency crawler, then it is not the right tool. Many people do use it on a small scale, though, and do not find the slight overhead of the distributed machinery to be a big issue. A few seconds is quite insignificant considering that a round of fetching can take hours.

J.

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
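For what it's worth, switching to local mode is usually just a matter of telling the Hadoop layer not to submit jobs to a cluster. A minimal sketch, assuming a Nutch 1.x setup (the exact property name depends on the bundled Hadoop version, so check the conf/ directory of your release):

```xml
<!-- conf/mapred-site.xml (or conf/nutch-site.xml):
     run all MapReduce jobs in-process on a single machine
     instead of submitting them to a cluster.
     Older Hadoop releases use "mapred.job.tracker" with value "local";
     newer ones use "mapreduce.framework.name" with value "local". -->
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>
```

With that in place you can run the usual generate/fetch/parse/updatedb commands from bin/nutch directly, without starting any Hadoop daemons.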

