I'd like to add that I recently opened an issue describing one of the causes of this problem. Use the lazy man's profiler trick to capture stack traces of the slow parser task; they will indicate which parser code is stalling: https://issues.apache.org/jira/browse/NUTCH-1387
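In case it helps, here is a rough sketch of how that trick could be automated. It assumes the usual meaning of the trick: take several thread dumps of the slow task's JVM a few seconds apart and see which frames keep turning up on top. It also assumes the JDK's `jstack` tool is on the PATH of the node running the task. The function names `top_frames` and `sample` are my own and not part of Nutch or Hadoop.

```python
import collections
import re
import subprocess
import time

# Matches a stack frame line such as "\tat org.example.Foo.bar(Foo.java:42)"
# and captures the fully qualified method name.
FRAME = re.compile(r"^\s+at\s+([^\s(]+)\(")


def top_frames(dump):
    """Return the top-most frame of every thread in one jstack dump."""
    frames = []
    waiting_for_frame = False
    for line in dump.splitlines():
        if line.startswith('"'):
            # A thread header line, e.g. '"main" #1 prio=5 ...'
            waiting_for_frame = True
        elif waiting_for_frame:
            match = FRAME.match(line)
            if match:
                frames.append(match.group(1))
                waiting_for_frame = False
            elif not line.strip():
                # Blank line: this thread had no Java frames.
                waiting_for_frame = False
    return frames


def sample(pid, samples=10, interval=2.0):
    """Run jstack several times and count how often each top frame appears."""
    counts = collections.Counter()
    for _ in range(samples):
        result = subprocess.run(
            ["jstack", str(pid)],
            capture_output=True, text=True, check=True,
        )
        counts.update(top_frames(result.stdout))
        time.sleep(interval)
    return counts
```

Calling `sample(pid).most_common(5)` with the PID of the stalled task's JVM prints the frames seen most often. Idle threads will dominate the counts, so the interesting entries are the ones under the parser packages.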
On Wed, Jun 13, 2012 at 12:40 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi kaveh,
>
> We have recently been informed about parsing taking forever and a day
> in the reduce phase. This is currently being investigated. FYI the
> thread can be found below
>
> http://www.mail-archive.com/user%40nutch.apache.org/msg06560.html
>
> I wonder if you have looked into this and if there is a more general
> link between such issues?
>
> Lewis
>
> On Wed, Jun 13, 2012 at 1:31 AM, kaveh minooie <[email protected]> wrote:
> > Hi everybody
> >
> > I have an unusual issue. When I run Nutch on top of Hadoop, after the map
> > tasks finish, the reduce tasks start to finish very fast. Almost all of them
> > finish in less than 2 hours, but there are always one or two that take a lot
> > longer. This is a link to the list of completed reduce tasks (that is all
> > of them for that fetch job), and you can see on the list that the last one
> > took more than 18 hours to finish and there is another one that took more
> > than 6 hours. Does anybody have any idea why this is happening?
> >
> > http://plutooz.com/hadoop.html
> >
> > p.s. This fetch job had about 1.5 million pages in it.
> >
> > thanks,
>
> --
> Lewis

