We set part of the failure reason as the diagnostic message of a failed task, which you can retrieve through the JobClient API: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/RunningJob.html#getTaskDiagnostics(org.apache.hadoop.mapred.TaskAttemptID). This is often 'useless', since the top part of the stack trace doesn't always carry the most relevant information, so perhaps HADOOP-9861 may help here once it is checked in.
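Until then, one workaround is to post-process the diagnostic strings yourself. Below is a minimal sketch, assuming the strings returned by getTaskDiagnostics() contain ordinary Java stack traces. DiagnosticSummarizer and its methods are hypothetical names, not part of Hadoop. It picks the last "Caused by:" line of each trace, which is usually more telling than the top frame, and falls back to the first non-blank line when there is no cause chain.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop. It assumes each diagnostic string
// (e.g. one element of the String[] returned by
// RunningJob.getTaskDiagnostics(TaskAttemptID)) is a plain Java stack trace.
final class DiagnosticSummarizer {

    private static final String CAUSED_BY = "Caused by: ";

    private DiagnosticSummarizer() {}

    // Returns the text of the last "Caused by:" line if there is one,
    // otherwise the first non-blank line, otherwise the empty string.
    static String rootCause(String diagnostic) {
        if (diagnostic == null) {
            return "";
        }
        String first = "";
        String lastCause = null;
        for (String raw : diagnostic.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (first.isEmpty()) {
                first = line;
            }
            if (line.startsWith(CAUSED_BY)) {
                // Later "Caused by:" lines are deeper in the cause chain.
                lastCause = line.substring(CAUSED_BY.length());
            }
        }
        return lastCause != null ? lastCause : first;
    }

    // Summarizes every diagnostic of one task attempt, skipping empty ones.
    static List<String> rootCauses(String[] diagnostics) {
        List<String> out = new ArrayList<>();
        if (diagnostics == null) {
            return out;
        }
        for (String d : diagnostics) {
            String cause = rootCause(d);
            if (!cause.isEmpty()) {
                out.add(cause);
            }
        }
        return out;
    }
}
```

You would call DiagnosticSummarizer.rootCauses(job.getTaskDiagnostics(attemptId)) for each failed attempt, where job is the RunningJob obtained from the JobClient. This only trims what the API already gives you; it cannot recover information that was never included in the diagnostic message.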
On Tue, Aug 27, 2013 at 10:34 AM, Gopi Krishna M <[email protected]> wrote:
> Hi
>
> We are seeing our map-reduce jobs crashing once in a while and have to go
> through the logs on all the nodes to figure out what went wrong. Sometimes
> it is low resources and sometimes it is a programming error which is
> triggered on specific inputs.. Same is true for some of our hive queries.
>
> Are there any tools (free/paid) which help us to do this debugging quickly?
> I am planning to write a debugging tool for sifting through the distributed
> logs of hadoop but wanted to check if there are already any useful tools for
> this.
>
> Thx
> Gopi | www.wignite.com

--
Harsh J
