In a parsing fetcher, IIRC, outlinks are processed in the mapper (at least when 
outlinks are followed). If a fetcher's reducer stalls, you may run out of memory 
or disk space.
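
To narrow down where a stalled task spends its time, in the spirit of the jstack suggestion quoted below, one option is to take a few thread dumps of the task JVM some seconds apart and count which topmost stack frame keeps recurring. The following is only a rough sketch: it assumes the usual jstack text layout (a quoted thread name at the start of a line, followed by indented "at ..." frames), and the helper names `top_frames` and `hot_frames` and the `name_filter` parameter are made up for illustration, not part of Nutch or Hadoop.

```python
import re
from collections import Counter

# Assumed jstack layout: a thread section starts with a quoted name at column 0.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# Assumed frame layout: indented "at package.Class.method(File.java:123)".
FRAME = re.compile(r'^\s+at\s+(?P<frame>[^\s(]+)\(')


def top_frames(dump_text):
    """Map each thread name in one dump to its topmost stack frame."""
    result = {}
    current = None
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        frame = FRAME.match(line)
        # Only the first frame after a header is the top of that thread's stack.
        if frame and current is not None and current not in result:
            result[current] = frame.group("frame")
    return result


def hot_frames(dumps, name_filter=""):
    """Count topmost frames across several dumps, most frequent first.

    name_filter (hypothetical convenience) keeps only threads whose name
    contains the given substring; the empty string keeps every thread.
    """
    counts = Counter()
    for dump in dumps:
        for name, frame in top_frames(dump).items():
            if name_filter in name:
                counts[frame] += 1
    return counts.most_common()
```

Feeding it the text of several dumps (for example, the saved output of `jstack <pid>` runs against the slow task) gives a list of (frame, count) pairs. A frame that tops the same thread in most of the dumps is a reasonable first suspect, whether it turns out to be parser code or URL normalization.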
 
 
-----Original message-----
> From:kaveh minooie <[email protected]>
> Sent: Wed 13-Jun-2012 19:28
> To: [email protected]
> Subject: Re: very long fetch reduce task
> 
> Thanks for the responses, and yes, in my case, parsing IS enabled and 
> happens during the fetch job.
> 
> On 06/13/2012 07:43 AM, Ferdy Galema wrote:
> > You're right. I was already assuming parsing was enabled. If it's not,
> > normalizing and filtering are the next most likely cause of stalling
> > tasks.
> >
> > On Wed, Jun 13, 2012 at 4:36 PM, Julien Nioche <
> > [email protected]> wrote:
> >
> >> Unless parsing is activated in the fetch step, this is likely to be a
> >> different issue, e.g. URL normalization taking forever or something like
> >> that. Use jstack to see what the problem is.
> >>
> >> On 13 June 2012 12:36, Ferdy Galema <[email protected]> wrote:
> >>
> >>> I'd like to add that I've recently opened an issue that describes one of
> >>> the causes of this problem. Look for the lazy man's profiler trick to see
> >>> stacktraces of the slow parser task. It will give an indication which
> >>> parser code is stalling:
> >>> https://issues.apache.org/jira/browse/NUTCH-1387
> >>>
> >>> On Wed, Jun 13, 2012 at 12:40 PM, Lewis John Mcgibbney <
> >>> [email protected]> wrote:
> >>>
> >>>> Hi kaveh,
> >>>>
> >>>> We have recently been informed about parsing taking forever and a day
> >>>> in the reduce phase. This is currently being investigated. FYI the
> >>>> thread can be found below
> >>>>
> >>>> http://www.mail-archive.com/user%40nutch.apache.org/msg06560.html
> >>>>
> >>>> I wonder if you have looked into this and if there is a more general
> >>>> link between such issues?
> >>>>
> >>>> Lewis
> >>>>
> >>>> On Wed, Jun 13, 2012 at 1:31 AM, kaveh minooie <[email protected]> wrote:
> >>>>> Hi everybody,
> >>>>>
> >>>>> I have an unusual issue. When I run Nutch on top of Hadoop, after the
> >>>>> map tasks finish, the reduce tasks start to finish very fast. Almost
> >>>>> all of them finish in less than 2 hours, but there are always one or
> >>>>> two that take a lot longer. This is a link to the list of completed
> >>>>> reduce tasks (that is, all of them for that fetch job), and you can
> >>>>> see on the list that the last one took more than 18 hours to finish
> >>>>> and there is another one that took more than 6 hours. Does anybody
> >>>>> have any idea why this is happening?
> >>>>>
> >>>>> http://plutooz.com/hadoop.html
> >>>>>
> >>>>> p.s. this fetch job had about 1.5 million pages in it.
> >>>>>
> >>>>> thanks,
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Lewis
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Open Source Solutions for Text Engineering
> >>
> >> http://digitalpebble.blogspot.com/
> >> http://www.digitalpebble.com
> >> http://twitter.com/digitalpebble
> >>
> >
> 
> -- 
> Kaveh Minooie
> 
> www.plutoz.com
> 
> 
> 
