Re: performance question: fetcher and parser in separate map/reduce jobs?

Ken Krugler Wed, 06 Feb 2013 20:21:44 -0800

Hi Lewis,

On Feb 6, 2013, at 6:50pm, Lewis John Mcgibbney wrote:

> I've finally added this to our FAQs
> 
> http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F

I'm way out of date on the Nutch code base, but I thought that fetching 
happened during the reduce phase (to enable queue processing by domain or IP 
address).

I also thought that multiple threads were spun up to fetch at a higher level 
of parallelism than what you'd get from configuring the number of Hadoop 
reducers per slave.
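
To illustrate what I mean by queue processing by domain, here is a minimal 
sketch. It is not Nutch code; the function name group_by_host and the sample 
URLs are made up for illustration. The idea is only that URLs for the same 
host end up in the same queue, so one place can enforce per-host politeness.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Bucket URLs into one queue per host (hypothetical helper, not Nutch API)."""
    queues = defaultdict(list)
    for url in urls:
        # hostname is None for malformed URLs; fall back to an empty key
        host = urlsplit(url).hostname or ""
        queues[host].append(url)
    return dict(queues)


if __name__ == "__main__":
    sample = ["http://a.com/1", "http://b.org/x", "http://a.com/2"]
    for host, queue in group_by_host(sample).items():
        print(host, queue)
```

Each queue could then be drained by its own fetch thread, which is how you'd 
get more parallelism than the number of reducers alone would give you.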

In which case, if you parse at the same time that you fetch, you'd need 
(# of threads) * (memory & CPU requirements for parsing one document) in 
addition to the (mostly I/O-bound) resources used for fetching.
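
As a back-of-the-envelope sketch of that arithmetic: the function names and 
all of the numbers below are assumptions picked for illustration, not 
measurements from Nutch.

```python
def fetch_only_memory_mb(fetch_threads, fetch_mb_per_thread):
    """Rough memory for a task that only fetches (illustrative model)."""
    return fetch_threads * fetch_mb_per_thread


def parsing_fetcher_memory_mb(fetch_threads, fetch_mb_per_thread, parse_mb_per_doc):
    """Rough peak memory when every fetch thread may also be parsing a document."""
    return fetch_threads * (fetch_mb_per_thread + parse_mb_per_doc)


if __name__ == "__main__":
    threads = 50       # assumed number of fetch threads in one task
    fetch_mb = 1       # assumed buffer size per fetch thread
    parse_mb = 20      # assumed working memory to parse one document
    print(fetch_only_memory_mb(threads, fetch_mb))
    print(parsing_fetcher_memory_mb(threads, fetch_mb, parse_mb))
```

With those made-up inputs the fetch-only task needs about 50 MB, while the 
parsing fetcher needs about 1050 MB, which is why the thread count matters so 
much once parsing is folded in.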

But from the note on the wiki ("In a parsing fetcher, outlinks are processed in 
the mapper") it sounds like when using a parsing fetcher this is happening in a 
map task.

So I'm curious about the current architecture of Nutch.

Thanks,

-- Ken

> 
> This should explain it for you.
> Lewis
> 
> On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang wrote:
> 
>> Hi
>> I have a performance question:
>> Why are the fetcher and parser staged as two separate jobs instead of one?
>> Intuitively, the parser could be included as part of the fetcher reducer,
>> couldn't it? That seems like it would be more efficient.
>> Thanks
>> --
>> Best Regards
>> -Weilei
>> 
> 
> 
> 
> -- 
> *Lewis*

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
