Hi Lewis,

On Feb 6, 2013, at 6:50pm, Lewis John Mcgibbney wrote:

> I've eventually added this to our FAQ's
>
> http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F

I'm way out of date on the Nutch code base, but I thought that fetching happened during the reduce phase (to enable queue processing by domain or IP address).

And that multiple threads were spun up to fetch at a higher level of parallelization than what you'd get out of configuring Hadoop's # of reducers per slave.

In which case if you parse at the same time that you fetch, you'd need # threads * (memory & CPU parsing requirements) in addition to the (mostly I/O-bound) resources from fetching.

But from the note on the wiki ("In a parsing fetcher, outlinks are processed in the mapper") it sounds like when using a parsing fetcher this is happening in a map task.

So I'm curious about the current architecture of Nutch.

Thanks,

-- Ken

>
> This should explain for you.
> Lewis
>
> On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang <[email protected]> wrote:
>
>> Hi
>> I have a performance question:
>> why fetcher and parser is staged in two separate jobs instead of one?
>> Intuitively, parser can be included as a part of fetcher reducer, is
>> it? This seems to be more efficient.
>> Thanks
>> --
>> Best Regards
>> -Weilei
>
> --
> *Lewis*

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

