Hi Ken,

So the question is whether the "fetching and parsing" done by a parsing fetcher happens in the map or the reduce phase of the fetch job ;)
Looking at the code, it now appears that the new wiki entry is incorrect in this regard (I will change this ASAP): outlink processing, and therefore parsing and fetching, all happen in the reduce phase. To be honest, this is easier to identify in the 2.x code, as it is somewhat hidden in amongst the 1.x Fetcher code.

BTW please feel free to add your comments below to the wiki entry, as I think they are valuable to the discussion.

Thanks for the input Ken.

Lewis

On Wed, Feb 6, 2013 at 8:21 PM, Ken Krugler <[email protected]> wrote:

> Hi Lewis,
>
> On Feb 6, 2013, at 6:50pm, Lewis John Mcgibbney wrote:
>
> > I've eventually added this to our FAQ's
> >
> > http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F
>
> I'm way out of date on the Nutch code base, but I thought that fetching
> happened during the reduce phase (to enable queue processing by domain or
> IP address).
>
> And that multiple threads were spun up to fetch at a higher level of
> parallelization than what you'd get out of configuring Hadoop's # of
> reducers per slave.
>
> In which case if you parse at the same time that you fetch, you'd need #
> threads * (memory & CPU parsing requirements) in addition to the (mostly
> I/O-bound) resources from fetching.
>
> But from the note on the wiki ("In a parsing fetcher, outlinks are
> processed in the mapper") it sounds like when using a parsing fetcher this
> is happening in a map task.
>
> So I'm curious about the current architecture of Nutch.
>
> Thanks,
>
> -- Ken
>
> > This should explain for you.
> > Lewis
> >
> > On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang <[email protected]> wrote:
> >
> >> Hi
> >> I have a performance question:
> >> why fetcher and parser is staged in two separate jobs instead of one?
> >> Intuitively, parser can be included as a part of fetcher reducer, is
> >> it? This seems to be more efficient.
> >> Thanks
> >> --
> >> Best Regards
> >> -Weilei
> >
> > --
> > *Lewis*
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr

--
*Lewis*
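
For anyone reading this thread later, Ken's estimate in the quoted message (# threads * parsing requirements, on top of the fetching resources) can be written out as a small back-of-the-envelope calculation. This is only a sketch: the function names and every number below are hypothetical, not measurements and not anything taken from the Nutch code.

```python
def extra_parse_memory_mb(fetch_threads: int, parse_mb_per_thread: float) -> float:
    """Additional memory needed when every fetcher thread also parses.

    Mirrors the estimate from the thread: threads * per-parse requirement.
    """
    return fetch_threads * parse_mb_per_thread


def total_task_memory_mb(fetch_mb: float,
                         fetch_threads: int,
                         parse_mb_per_thread: float) -> float:
    """Fetching resources plus the extra parsing cost, for one fetch task."""
    return fetch_mb + extra_parse_memory_mb(fetch_threads, parse_mb_per_thread)


if __name__ == "__main__":
    # All values are made-up placeholders for illustration only.
    threads = 10          # fetcher threads in one task (assumed)
    fetch_mb = 200.0      # memory used by fetching alone (assumed)
    parse_mb = 50.0       # memory one in-flight parse may need (assumed)

    print("extra for parsing:", extra_parse_memory_mb(threads, parse_mb), "MB")
    print("total per task:   ", total_task_memory_mb(fetch_mb, threads, parse_mb), "MB")
```

The point of the sketch is just that the parsing cost scales with the number of fetcher threads, so a thread count that is comfortable for mostly I/O-bound fetching may be too high once parsing runs in the same task.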

