It seems I left out a few important details in my original email. I apologize for that; let me clarify. In the threads and issues linked in the previous messages, people are mostly talking about encountering errors, which is not my case. Here is my situation.

I am running the parser inside the fetch job (parsing is enabled during fetch in the config file), and as Markus has pointed out, the parsing seems to happen during the map tasks, not the reduce tasks. I do not see any timeout messages during that very long reduce task, nor any exceptions or anything else. I don't run out of memory or disk space either. The reduce task is not idle: when I log in to the machine that is running it and run "top", I can see that the process is using 100% of the CPU (or of the core it is running on). In a nutshell, everything seems to be fine, except that the task is doing lots and lots of work that I can't account for at this point.

thanks,

On 06/13/2012 10:33 AM, Markus Jelsma wrote:
In a parsing fetcher iirc outlinks are processed in the mapper (at least when 
outlinks are followed). If a fetcher's reducer stalls you may run out of memory 
or disk space.


-----Original message-----
From:kaveh minooie <[email protected]>
Sent: Wed 13-Jun-2012 19:28
To: [email protected]
Subject: Re: very long fetch reduce task

Thanks for the responses, and yes, in my case, parsing IS enabled and
happens during the fetch job.

On 06/13/2012 07:43 AM, Ferdy Galema wrote:
You're right. I was already assuming parsing was enabled. If it's not,
normalizing and filtering are the next most likely cause of stalling
tasks.

On Wed, Jun 13, 2012 at 4:36 PM, Julien Nioche <
[email protected]> wrote:

Unless parsing is activated in the fetch step, this is likely to be a
different issue, e.g. URL normalization taking forever or something like
that. Use jstack to see what the problem is.

On 13 June 2012 12:36, Ferdy Galema <[email protected]> wrote:

I'd like to add that I've recently opened an issue that describes one of
the causes of this problem. Look for the lazy man's profiler trick to see
stacktraces of the slow parser task. It will give an indication which
parser code is stalling:
https://issues.apache.org/jira/browse/NUTCH-1387
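
The stack-sampling idea mentioned above can be sketched roughly as follows.
This is only an illustration, not the procedure described in the issue: it
assumes the JDK's jstack tool is on the PATH, that it is run as the same user
as the task JVM, and that the pid of the busy task has been found separately
(for example with top). The function names are made up for this sketch.

```python
import collections
import re
import subprocess
import sys
import time

# A stack frame line in a jstack dump looks like:
#   "\tat some.package.Class.method(File.java:123)"
FRAME_RE = re.compile(r"^\s+at\s+([^\s(]+)\(")


def runnable_top_frames(dump):
    """Return the topmost frame of each RUNNABLE thread in one jstack dump."""
    frames = []
    runnable = False
    for line in dump.splitlines():
        if line.startswith('"'):
            # A new thread header (quoted thread name); state not known yet.
            runnable = False
        elif "java.lang.Thread.State:" in line:
            runnable = "RUNNABLE" in line
        else:
            match = FRAME_RE.match(line)
            if match and runnable:
                frames.append(match.group(1))
                runnable = False  # keep only the top frame of this thread
    return frames


def sample(pid, samples=10, interval=1.0):
    """Run jstack several times and count which top frames keep showing up."""
    counts = collections.Counter()
    for _ in range(samples):
        result = subprocess.run(
            ["jstack", str(pid)], capture_output=True, text=True, check=True
        )
        counts.update(runnable_top_frames(result.stdout))
        time.sleep(interval)
    return counts


if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1].isdigit():
    for frame, hits in sample(int(sys.argv[1])).most_common(10):
        print(hits, frame)
```

If the same method is at the top of most samples, that is a reasonable hint
about where the task is spending its time. Note that a RUNNABLE thread may
also just be waiting in native I/O, so the counts are only a rough guide.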

On Wed, Jun 13, 2012 at 12:40 PM, Lewis John Mcgibbney <
[email protected]> wrote:

Hi kaveh,

We have recently been informed about parsing taking forever and a day
in the reduce phase. This is currently being investigated. FYI the
thread can be found below

http://www.mail-archive.com/user%40nutch.apache.org/msg06560.html

I wonder if you have looked into this and if there is a more general
link between such issues?

Lewis

On Wed, Jun 13, 2012 at 1:31 AM, kaveh minooie <[email protected]>
wrote:
Hi everybody,

I have an unusual issue. When I run Nutch on top of Hadoop, after the
map tasks finish, the reduce tasks start to finish very fast; almost all
of them finish in less than 2 hours, but there are always one or two that
take a lot longer. This is a link to the list of completed reduce tasks
(that is all of them for that fetch job), and you can see in the list
that the last one took more than 18 hours to finish and another one took
more than 6 hours. Does anybody have any idea why this is happening?

http://plutooz.com/hadoop.html

p.s. this fetch job had about 1.5 million pages in it.

thanks,



--
Lewis





--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble



--
Kaveh Minooie

www.plutoz.com



