It is not clear from your post: is your job always failing during the same step? Only sometimes? Or only once? Since it's a Hive query, I would modify it to find the root cause.

First, create temporary "files", i.e. tables holding the results of the first three M/R stages. Then run the fourth M/R stage on them and try to filter the data, in order to see whether the problem is related to the volume or to the format of the data. Something along the lines of the sketch below.
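For instance (just a sketch: tmp_stage3, big_t, dim_a, dim_b and the join keys are made-up names, since I don't know your actual schema):

-- materialize the intermediate result that feeds the failing stage
CREATE TABLE tmp_stage3 AS
SELECT t.id, t.col1, a.col2
FROM big_t t
LEFT OUTER JOIN dim_a a ON (t.a_id = a.id);

-- re-run only the failing join, against the materialized result
SELECT s.id, b.col3
FROM tmp_stage3 s
LEFT OUTER JOIN dim_b b ON (s.b_id = b.id);

-- same join on a small sample (or a single partition): if ~1% of the
-- rows already hangs, suspect the data/format; if only the full input
-- hangs, suspect the volume
SELECT s.id, b.col3
FROM tmp_stage3 s
LEFT OUTER JOIN dim_b b ON (s.b_id = b.id)
WHERE rand() < 0.01;

Given that your stuck attempt shows 222 million rows going through a single reducer, a skewed join key (NULLs are a common culprit with left outer joins) would be my first guess.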
Regards,

Bertrand

On Fri, Aug 24, 2012 at 7:44 PM, Igor Tatarinov <i...@decide.com> wrote:

> Why don't you try splitting the big query into smaller ones?
>
>
> On Fri, Aug 24, 2012 at 10:20 AM, Tim Havens <timhav...@gmail.com> wrote:
>
>> Just curious if you've tried using Hive's EXPLAIN to see what IT thinks
>> of your query.
>>
>>
>> On Fri, Aug 24, 2012 at 9:36 AM, Himanish Kushary <himan...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We have a complex query that involves several left outer joins, resulting
>>> in 8 M/R jobs in Hive. During execution of one of the stages (after three
>>> M/R jobs have run), the job fails because a few reduce tasks fail due to
>>> inactivity.
>>>
>>> Most of the reduce tasks go through fine (within 3 minutes), but the last
>>> one gets stuck for a long time (> 1 hour) and, after several attempts, is
>>> finally killed with "failed to report status for 600 seconds. Killing!"
>>>
>>> What may be causing this issue? Would hive.script.auto.progress help in
>>> this case? We are not able to get much information from the log files, so
>>> how should we approach resolving this? Will tweaking any specific M/R
>>> parameters help?
>>>
>>> The task attempt log shows several lines like this before exiting:
>>>
>>> 2012-08-23 19:17:23,848 INFO ExecReducer: ExecReducer: processing 219000000 rows: used memory = 408582240
>>> 2012-08-23 19:17:30,189 INFO ExecReducer: ExecReducer: processing 220000000 rows: used memory = 346110400
>>> 2012-08-23 19:17:37,510 INFO ExecReducer: ExecReducer: processing 221000000 rows: used memory = 583913576
>>> 2012-08-23 19:17:44,829 INFO ExecReducer: ExecReducer: processing 222000000 rows: used memory = 513071504
>>> 2012-08-23 19:17:47,923 INFO org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1
>>>
>>> Here are the reduce task counters:
>>>
>>> Map-Reduce Framework
>>>   Combine input records: 0
>>>   Combine output records: 0
>>>   Reduce input groups: 222,480,335
>>>   Reduce shuffle bytes: 7,726,141,897
>>>   Reduce input records: 222,480,335
>>>   Reduce output records: 0
>>>   Spilled Records: 355,827,191
>>>   CPU time spent (ms): 2,152,160
>>>   Physical memory (bytes) snapshot: 1,182,490,624
>>>   Virtual memory (bytes) snapshot: 1,694,531,584
>>>   Total committed heap usage (bytes): 990,052,352
>>>
>>> The tasktracker log gives a thread dump at that time but no exception:
>>>
>>> 2012-08-23 20:05:49,319 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task
>>> 69 active threads
>>>
>>> ---------------------------
>>> Thanks & Regards
>>> Himanish
>>
>>
>>
>> --
>> "The whole world is you. Yet you keep thinking there is something else."
>> - Xuefeng Yicun 822-902 A.D.
>>
>> Tim R. Havens
>> Google Phone: 573.454.1232
>> ICQ: 495992798
>> ICBM: 37°51'34.79"N 90°35'24.35"W
>> ham radio callsign: NW0W

--
Bertrand Dechoux