You're right. I was already assuming parsing was enabled. If it's not, URL normalizing and filtering are the next most likely cause of the stalling tasks.

On Wed, Jun 13, 2012 at 4:36 PM, Julien Nioche <[email protected]> wrote:

> unless the parsing is activated in the fetch step - this is likely to be a
> different issue e.g. normalization of URL taking forever or something like
> this. Use jstack to see what the problem is
>
> On 13 June 2012 12:36, Ferdy Galema <[email protected]> wrote:
>
> > I'd like to add that I've recently opened an issue that describes one of
> > the causes of this problem. Look for the lazy man's profiler trick to see
> > stacktraces of the slow parser task. It will give an indication which
> > parser code is stalling:
> > https://issues.apache.org/jira/browse/NUTCH-1387
> >
> > On Wed, Jun 13, 2012 at 12:40 PM, Lewis John Mcgibbney
> > <[email protected]> wrote:
> >
> > > Hi kaveh,
> > >
> > > We have recently been informed about parsing taking forever and a day
> > > in the reduce phase. This is currently being investigated. FYI the
> > > thread can be found below
> > >
> > > http://www.mail-archive.com/user%40nutch.apache.org/msg06560.html
> > >
> > > I wonder if you have looked into this and if there is a more general
> > > link between such issues?
> > >
> > > Lewis
> > >
> > > On Wed, Jun 13, 2012 at 1:31 AM, kaveh minooie <[email protected]> wrote:
> > > > Hi everybody
> > > >
> > > > I have an unusual issue. when i run nutch on top off hadoop, after the
> > > > map tasks finish, the reduce task start to finish very fast almost all
> > > > of them finish in less than 2 hours but there is alway one or two that
> > > > take a lot longer. this is a link to the list of a completed reduce
> > > > tasks ( that is all of them for that fetch job) and you can see on the
> > > > list that the last one took more than 18 hours to finish and there is
> > > > another one that took more than 6 hours. does any body have any idea
> > > > why this is happening?
> > > >
> > > > http://plutooz.com/hadoop.html
> > > >
> > > > p.s. this fetch job had about 1.5 million pages in it.
> > > >
> > > > thanks,
> > >
> > > --
> > > Lewis
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

