Re: performance question: fetcher and parser in separate map/reduce jobs?

Lewis John Mcgibbney Wed, 06 Feb 2013 20:47:02 -0800

Hi Ken,

So the question is whether the "fetching and parsing" of a parsing
fetcher happens in the map or the reduce phase of the fetch job ;)


Looking at the code, it now appears that the new wiki entry is incorrect in
this regard (I will change this ASAP): outlink processing, and therefore
parsing and fetching, are all executed in the reduce phase. To be honest,
this is easier to identify in the 2.x code, as it gets somewhat hidden in
amongst the 1.x Fetcher code.
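
To make the reduce-side picture a bit more concrete, here is a minimal
sketch in plain Python. It is not Nutch code: map_phase, shuffle,
reduce_phase, fake_fetch and fake_parse are all hypothetical names made up
for illustration, and the "fetch" just returns canned HTML. It only shows
the shape of the idea, assuming URLs are keyed by host so that each host's
queue is fetched (and optionally parsed) in one place.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_phase(urls):
    """Emit (host, url) pairs so the shuffle groups URLs by host."""
    for url in urls:
        yield urlparse(url).netloc, url


def shuffle(pairs):
    """Group mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for host, url in pairs:
        groups[host].append(url)
    return groups


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP fetch; returns canned HTML."""
    return '<a href="%s/next">next</a>' % url


def fake_parse(html):
    """Hypothetical stand-in for a parser; pulls out href targets."""
    return [part.split('"')[0] for part in html.split('href="')[1:]]


def reduce_phase(host, urls, parse=True):
    """Fetch one host's queue; when parse is True, extract outlinks inline."""
    results = []
    for url in urls:
        content = fake_fetch(url)
        outlinks = fake_parse(content) if parse else None
        results.append((url, outlinks))
    return results


urls = ["http://a.example/1", "http://b.example/1", "http://a.example/2"]
groups = shuffle(map_phase(urls))
fetched = {host: reduce_phase(host, queue) for host, queue in groups.items()}
```

With parse=False the reduce step only fetches, and parsing would then be
left to a separate job, which is the split the original question asked
about.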

BTW please feel free to add your comments below to the wiki entry as I
think they are valuable to the discussion. Thanks for the input Ken.
Lewis

On Wed, Feb 6, 2013 at 8:21 PM, Ken Krugler <[email protected]> wrote:

> Hi Lewis,
>
> On Feb 6, 2013, at 6:50pm, Lewis John Mcgibbney wrote:
>
> > I've finally added this to our FAQs
> >
> >
> http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F
>
> I'm way out of date on the Nutch code base, but I thought that fetching
> happened during the reduce phase (to enable queue processing by domain or
> IP address).
>
> And that multiple threads were spun up to fetch at a higher level of
> parallelization than what you'd get out of configuring Hadoop's # of
> reducers per slave.
>
> In which case if you parse at the same time that you fetch, you'd need #
> threads * (memory & CPU parsing requirements) in addition to the (mostly
> I/O-bound) resources from fetching.
>
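
The estimate quoted above is simple arithmetic, so here is a rough sketch
of it in Python. The function names and every number are hypothetical and
purely illustrative (they are not measurements or Nutch defaults); the
only assumption is the worst case where every fetcher thread is parsing a
document at the same moment.

```python
def peak_parse_memory_mb(fetch_threads, parse_mb_per_thread):
    """Worst case: every fetcher thread is parsing a document at once."""
    return fetch_threads * parse_mb_per_thread


def per_node_budget_mb(reducers_per_node, fetch_threads, parse_mb_per_thread):
    """Scale the per-task figure by the number of reduce tasks on one node."""
    per_task = peak_parse_memory_mb(fetch_threads, parse_mb_per_thread)
    return reducers_per_node * per_task


# Illustrative numbers only: 2 reducers per node, 10 threads, 50 MB per parse.
print(per_node_budget_mb(2, 10, 50))  # 2 * 10 * 50 = 1000
```

This memory comes on top of whatever the (mostly I/O-bound) fetching
itself needs.
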
> But from the note on the wiki ("In a parsing fetcher, outlinks are
> processed in the mapper") it sounds like when using a parsing fetcher this
> is happening in a map task.
>
> So I'm curious about the current architecture of Nutch.
>
> Thanks,
>
> -- Ken
>
>
> >
> > This should explain for you.
> > Lewis
> >
> > On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang <[email protected]> wrote:
> >
> >> Hi
> >> I have a performance question:
> >> why are the fetcher and parser staged as two separate jobs instead of
> >> one? Intuitively, the parser could be included as part of the fetcher
> >> reducer, couldn't it? That seems more efficient.
> >> Thanks
> >> --
> >> Best Regards
> >> -Weilei
> >>
> >
> >
> >
> > --
> > *Lewis*
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>
>


-- 
*Lewis*
