[jira] [Created] (MAPREDUCE-6513) MR job hangs forever when one NM is unstable for some time
Bob created MAPREDUCE-6513:
-------------------------------

             Summary: MR job hangs forever when one NM is unstable for some time
                 Key: MAPREDUCE-6513
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster, resourcemanager
    Affects Versions: 3.0.0
            Reporter: Bob

While a job with many tasks was in progress, one node became unstable due to an OS issue. After the node became unstable, the maps on this node transitioned to the KILLED state. The maps that had been running on the unstable node were rescheduled; they all sat in the scheduled state waiting for the RM to assign containers. Ask requests for these maps were seen only until the node became good again (all of those failed); there were no ask requests after that. But the AM kept preempting the reducers (it was recycling them). In the end the reducers were waiting for the mappers to complete, and the mappers never got a container.

My question is: why were map requests not sent by the AM once the node recovered?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
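For readers unfamiliar with the term: the "ask requests" above are the per-priority ResourceRequest counts the AM sends to the RM on each allocate heartbeat. The sketch below is hypothetical illustration code (the names are not from the Hadoop source); it only shows the bookkeeping invariant this report says is broken after the node recovers: every map attempt still waiting unassigned must be backed by a pending ask, or the RM will never hand the job a container for it.

{code}
// Hypothetical sketch, not Hadoop source: AM-side ask bookkeeping. If the
// rescheduled maps sit UNASSIGNED but the pending-ask count for their
// priority is zero, the RM is never asked again and the job hangs while
// reducers are preempted in a cycle, as MAPREDUCE-6513 describes.
import java.util.HashMap;
import java.util.Map;

public class AskTableSketch {
    private final Map<Integer, Integer> pendingAsks = new HashMap<>();
    private int unassignedMaps = 0;

    // A map attempt enters the scheduled/UNASSIGNED state: record an ask.
    void schedule(int priority) {
        unassignedMaps++;
        pendingAsks.merge(priority, 1, Integer::sum);
    }

    // The RM granted a container at this priority: consume one ask.
    void assigned(int priority) {
        unassignedMaps--;
        pendingAsks.merge(priority, -1, Integer::sum);
    }

    // The invariant the report says no longer holds after node recovery:
    // at least one pending ask must exist per unassigned attempt.
    boolean consistent() {
        int totalAsks = pendingAsks.values().stream()
                .mapToInt(Integer::intValue).sum();
        return totalAsks >= unassignedMaps;
    }
}
{code}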
[jira] [Updated] (MAPREDUCE-6485) MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run
[ https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob updated MAPREDUCE-6485:
---------------------------
    Description: 
The scenario is like this:

With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reduces take resources and start to run before all of the maps have finished. It can then happen that while all of the resources are taken up by running reduces, there is still one unfinished map.

Under this condition, the last map has two task attempts.

The first attempt was killed due to timeout (mapreduce.task.timeout), and its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP and then to FAILED, but the failed map attempt was not restarted because there was still one speculative map attempt in progress.

The second attempt, which was started because map task speculation is enabled, is pending in the UNASSIGNED state because no resources are available. But the second map attempt's request has a lower priority than the reduces, so preemption does not happen.

As a result, none of the reduces can finish because one map is left, and the last map hangs because no resources are available. So the job never finishes.

  was:
The scenario is like this:

With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reduces take resources and start to run before all of the maps have finished. It can then happen that while all of the resources are taken up by running reduces, there is still one unfinished map.

Under this condition, the last map has two task attempts.

The first attempt was killed due to timeout (mapreduce.task.timeout), and its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map attempt would not be restarted.

The second attempt, which was started because map task speculation is enabled, is pending in the UNASSIGNED state because no resources are available. But the second map attempt's request has a lower priority than the reduces, so preemption does not happen.

As a result, none of the reduces can finish because one map is left, and the last map hangs because no resources are available. So the job never finishes.

> MR job hangs forever because all resources are taken up by reducers and the
> last map attempt never gets resources to run
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>            Reporter: Bob
>            Assignee: Xianyin Xin
>            Priority: Critical
>         Attachments: MAPREDUCE-6485.001.patch
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
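Context for the priority remark in the description: in the affected branches the MR AM's RMContainerAllocator requests containers at three fixed priorities, roughly fast-fail map = 5, reduce = 10, normal map = 20 (in YARN a lower number is served first). The snippet below is a self-contained illustration of that ordering, not the Hadoop source: a speculative attempt is a new attempt rather than a fast-fail retry, so it asks at the normal map priority and never outranks the running reduces.

{code}
// Self-contained illustration (not Hadoop source) of the priority ordering
// behind MAPREDUCE-6485; the constants mirror the allocator's usual values.
public class MapReducePriorityDemo {
    static final int PRIORITY_FAST_FAIL_MAP = 5;  // rerun of a FAILED map
    static final int PRIORITY_REDUCE        = 10; // all reduce requests
    static final int PRIORITY_MAP           = 20; // normal + speculative maps

    public static void main(String[] args) {
        // The speculative attempt of the timed-out map asks at PRIORITY_MAP.
        int speculativeAsk = PRIORITY_MAP;
        // Lower value wins, so the reduces outrank the speculative map and
        // nothing pushes the AM to preempt a reducer for it: the job hangs.
        System.out.println("reduce outranks speculative map: "
                + (PRIORITY_REDUCE < speculativeAsk));
        // A fast-fail retry would outrank the reduces; the bug is that the
        // FAILED first attempt is not retried while a speculative one exists.
        System.out.println("fast-fail map outranks reduce: "
                + (PRIORITY_FAST_FAIL_MAP < PRIORITY_REDUCE));
    }
}
{code}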
[jira] [Commented] (MAPREDUCE-6485) MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run
[ https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903811#comment-14903811 ]

Bob commented on MAPREDUCE-6485:
--------------------------------

[~xinxianyin], thanks for your deep analysis. Now that we have found the root cause of this issue, could you provide a patch for it?

> MR job hangs forever because all resources are taken up by reducers and the
> last map attempt never gets resources to run
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>            Reporter: Bob
>            Priority: Critical
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6485) MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run
[ https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob updated MAPREDUCE-6485:
---------------------------
    Description: 
The scenario is like this:

With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reduces take resources and start to run before all of the maps have finished. It can then happen that while all of the resources are taken up by running reduces, there is still one unfinished map.

Under this condition, the last map has two task attempts.

The first attempt was killed due to timeout (mapreduce.task.timeout), and its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map attempt would not be restarted.

The second attempt, which was started because map task speculation is enabled, is pending in the UNASSIGNED state because no resources are available. But the second map attempt's request has a lower priority than the reduces, so preemption does not happen.

As a result, none of the reduces can finish because one map is left, and the last map hangs because no resources are available. So the job would never finish.

  was:
The scenario is like this:

With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reduces take resources and start to run before all of the maps have finished. It can then happen that while all of the resources are taken up by running reduces, there is still one unfinished map.

Under this condition, the last map has two task attempts.

The first attempt was killed due to timeout (mapreduce.task.timeout), and its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map attempt would not be restarted.

The second attempt, which was started because map task speculation is enabled, is pending in the UNASSIGNED state because no resources are available. But the second map attempt's request has a lower priority than the reduces, so preemption does not happen.

As a result, none of the reduces can finish because one map is left, and the last map hangs because no resources are available. So the job would never finished.

> MR job hangs forever because all resources are taken up by reducers and the
> last map attempt never gets resources to run
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>            Reporter: Bob
>            Priority: Critical
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6485) MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run
[ https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob updated MAPREDUCE-6485:
---------------------------
    Affects Version/s: 2.4.1
                       2.6.0
                       2.7.1

> MR job hangs forever because all resources are taken up by reducers and the
> last map attempt never gets resources to run
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>            Reporter: Bob
>            Priority: Critical
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (MAPREDUCE-6485) MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run
Bob created MAPREDUCE-6485:
-------------------------------

             Summary: MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run
                 Key: MAPREDUCE-6485
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster
    Affects Versions: 3.0.0
            Reporter: Bob
            Priority: Critical

The scenario is like this:

With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reduces take resources and start to run before all of the maps have finished. It can then happen that while all of the resources are taken up by running reduces, there is still one unfinished map.

Under this condition, the last map has two task attempts.

The first attempt was killed due to timeout (mapreduce.task.timeout), and its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map attempt would not be restarted.

The second attempt, which was started because map task speculation is enabled, is pending in the UNASSIGNED state because no resources are available. But the second map attempt's request has a lower priority than the reduces, so preemption does not happen.

As a result, none of the reduces can finish because one map is left, and the last map hangs because no resources are available. So the job would never finished.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
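For reproducing or mitigating the scenario, these are the three job settings that interact here; below is a minimal sketch using the standard Hadoop Configuration API (the values mirror the report: 0.8 slow-start, the 600-second task timeout, and map speculation on).

{code}
import org.apache.hadoop.conf.Configuration;

// The three knobs whose combination produces the hang in MAPREDUCE-6485.
public class HangScenarioConf {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Reducers may be scheduled once 80% of maps are done (the report's value).
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.8f);
        // An attempt reporting no progress for this long is killed; 600 s
        // matches the "Timed out after 600 secs" diagnostics in the AM logs.
        conf.setLong("mapreduce.task.timeout", 600000L);
        // Map speculation creates the second, lower-priority attempt.
        conf.setBoolean("mapreduce.map.speculative", true);
        return conf;
    }
}
{code}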
[jira] [Commented] (MAPREDUCE-6485) MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run
[ https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14877128#comment-14877128 ]

Bob commented on MAPREDUCE-6485:
--------------------------------

[~varun_saxena], thanks for checking this issue. Below are some related logs you can refer to.

*1. All logs in the AM about the first attempt:*
{code}
03:30:32,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from NEW to UNASSIGNED
03:33:22,037 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_003425 to attempt_1439640536651_4710_m_002807_0
03:33:22,038 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from UNASSIGNED to ASSIGNED
03:33:22,044 INFO [ContainerLauncher #27] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_LAUNCH for container container_1439640536651_4710_01_003425 taskAttempt attempt_1439640536651_4710_m_002807_0
03:33:22,044 INFO [ContainerLauncher #27] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching attempt_1439640536651_4710_m_002807_0
03:33:22,071 INFO [ContainerLauncher #27] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Shuffle port returned by ContainerManager for attempt_1439640536651_4710_m_002807_0 : 26008
03:33:22,071 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt: [attempt_1439640536651_4710_m_002807_0] using containerId: [container_1439640536651_4710_01_003425 on NM: [SCCHDPHIV02129:26009]
03:33:22,071 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from ASSIGNED to RUNNING
03:33:31,481 INFO [IPC Server handler 24 on 27102] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1439640536651_4710_m_003425 given task: attempt_1439640536651_4710_m_002807_0
03:43:31,473 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439640536651_4710_m_002807_0: AttemptID:attempt_1439640536651_4710_m_002807_0 Timed out after 600 secs
03:43:31,473 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439640536651_4710_m_002807_0: AttemptID:attempt_1439640536651_4710_m_002807_0 Timed out after 600 secs
03:43:31,474 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
03:43:31,474 INFO [ContainerLauncher #23] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1439640536651_4710_01_003425 taskAttempt attempt_1439640536651_4710_m_002807_0
03:43:31,474 INFO [ContainerLauncher #23] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1439640536651_4710_m_002807_0
03:43:31,478 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from FAIL_CONTAINER_CLEANUP to FAIL_TASK_CLEANUP
03:43:31,478 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from FAIL_TASK_CLEANUP to FAILED
03:43:32,701 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439640536651_4710_m_002807_0: Container killed by the ApplicationMaster.
{code}

*2. All logs in the AM about the second attempt (only one log entry):*
{code}
03:39:55,339 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_1 TaskAttempt Transitioned from NEW to UNASSIGNED
{code}

*3. Checking the logs by timestamp, we can see that after the second attempt started, all subsequently available resources were allocated to reduces, not to the last map attempt:*
{code}
03:39:55,978 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015149 to attempt_1439640536651_4710_r_000669_0
03:39:55,978 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:330 AssignedMaps:1 AssignedReds:669 CompletedMaps:14257 CompletedReds:0
{code}
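The "After Scheduling:" line is the quickest signature of this hang: ScheduledMaps stays at 1 while the reducers hold every assigned container and CompletedReds stays at 0. Below is a small throwaway helper (hypothetical, for log analysis only, not part of Hadoop) that pulls those counters out of such a line.

{code}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log-analysis helper: extracts the counters from an
// RMContainerAllocator "After Scheduling:" log line.
public class AfterSchedulingParser {
    private static final Pattern FIELD = Pattern.compile("([A-Za-z]+):(\\d+)");

    public static Map<String, Integer> parse(String logLine) {
        Map<String, Integer> counters = new LinkedHashMap<>();
        // Only look at the part after the marker, skipping the timestamp etc.
        int start = logLine.indexOf("After Scheduling:");
        if (start < 0) {
            return counters;
        }
        Matcher m = FIELD.matcher(
                logLine.substring(start + "After Scheduling:".length()));
        while (m.find()) {
            counters.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return counters;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = parse("... After Scheduling: PendingReds:0 "
                + "ScheduledMaps:1 ScheduledReds:330 AssignedMaps:1 "
                + "AssignedReds:669 CompletedMaps:14257 CompletedReds:0");
        // The hang signature from the comment above: a map is still waiting
        // for a container while reducers hold the cluster.
        System.out.println("hang signature = " + (c.get("ScheduledMaps") > 0
                && c.get("AssignedReds") > 0 && c.get("CompletedReds") == 0));
    }
}
{code}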
[jira] [Created] (MAPREDUCE-6381) Some MapReduce command operations should have audit logs printed
Bob created MAPREDUCE-6381:
-------------------------------

             Summary: Some MapReduce command operations should have audit logs printed
                 Key: MAPREDUCE-6381
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6381
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: client
    Affects Versions: 2.7.0
            Reporter: Bob

The mapred commands below are important operations and should have audit logs recorded for them, as the 'yarn' commands do.

Mapred commands:
mapred job -set-priority job-id priority
mapred job -kill-task task-attempt-id
mapred job -fail-task task-attempt-id
mapred job -kill job-id
mapred pipes
mapred job -submit job-file

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6381) Some MapReduce command operations should have audit logs printed
[ https://issues.apache.org/jira/browse/MAPREDUCE-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob updated MAPREDUCE-6381:
---------------------------
    Description: 
The mapred commands below are important operations and should have audit logs recorded for them, as the 'yarn' commands do.

*Mapred commands:*
{noformat}
mapred job -set-priority job-id priority
mapred job -kill-task task-attempt-id
mapred job -fail-task task-attempt-id
mapred job -kill job-id
mapred pipes
mapred job -submit job-file
{noformat}

  was:
The mapred commands below are important operations and should have audit logs recorded for them, as the 'yarn' commands do.

Mapred commands:
mapred job -set-priority job-id priority
mapred job -kill-task task-attempt-id
mapred job -fail-task task-attempt-id
mapred job -kill job-id
mapred pipes
mapred job -submit job-file

> Some MapReduce command operations should have audit logs printed
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-6381
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6381
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 2.7.0
>            Reporter: Bob
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
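As a sketch of what the requested improvement could look like: YARN's daemons write key=value audit lines through dedicated audit-logger classes. A hypothetical equivalent for the mapred client is shown below; all names here are illustrative, not existing Hadoop classes.

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch modeled on the key=value style of YARN's audit logs;
// MRAuditLogger does not exist in Hadoop, the name is illustrative.
public final class MRAuditLogger {
    private static final Logger AUDIT = LoggerFactory.getLogger("MRAudit");

    private MRAuditLogger() {
    }

    // e.g. logSuccess("bob", "KILL_JOB", "job_1439640536651_4710")
    public static void logSuccess(String user, String operation, String target) {
        AUDIT.info("USER={}\tOPERATION={}\tTARGET={}\tRESULT=SUCCESS",
                user, operation, target);
    }

    public static void logFailure(String user, String operation, String target,
            String description) {
        AUDIT.warn("USER={}\tOPERATION={}\tTARGET={}\tRESULT=FAILURE\t"
                + "DESCRIPTION={}", user, operation, target, description);
    }
}
{code}

Each command handler listed above (kill, kill-task, fail-task, set-priority, submit, pipes) would call logSuccess or logFailure after performing its operation.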
[jira] [Commented] (MAPREDUCE-5255) Reduce task preemption results in task never completing, incomplete fix to MAPREDUCE-3858?
[ https://issues.apache.org/jira/browse/MAPREDUCE-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560272#comment-14560272 ]

Bob commented on MAPREDUCE-5255:
--------------------------------

[~rjain7] Have you finished verifying this issue based on MAPREDUCE-5009? Are you sure this issue has been solved?

Reduce task preemption results in task never completing, incomplete fix to MAPREDUCE-3858?
-------------------------------------------------------------------------------------------

                Key: MAPREDUCE-5255
                URL: https://issues.apache.org/jira/browse/MAPREDUCE-5255
            Project: Hadoop Map/Reduce
         Issue Type: Bug
         Components: mrv2
   Affects Versions: 2.0.3-alpha
           Reporter: Rahul Jain

The problem was seen with symptoms very similar to MAPREDUCE-3858: the job is hung with continuous reduce task attempts, each attempt getting killed around the commit phase. After a while the single reduce task was the only one remaining in the job, with 50K 'kills' done for the task.

Relevant logs from the application master (the problem task is attempt_1368653326922_0080_r_001278_0):
{code}
2013-05-16 19:27:19,891 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 3
2013-05-16 19:27:19,892 INFO [IPC Server handler 22 on 40095] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from attempt_1368653326922_0080_r_001266_0
2013-05-16 19:27:19,892 INFO [IPC Server handler 22 on 40095] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1368653326922_0080_r_001266_0 is : 0.7212161
2013-05-16 19:27:19,893 INFO [IPC Server handler 13 on 40095] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from attempt_1368653326922_0080_r_001266_0
2013-05-16 19:27:19,893 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1368653326922_0080_r_001266_0 TaskAttempt Transitioned from COMMIT_PENDING to SUCCESS_CONTAINER_CLEANUP
2013-05-16 19:27:19,893 INFO [ContainerLauncher #19] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1368653326922_0080_01_001296 taskAttempt attempt_1368653326922_0080_r_001266_0
2013-05-16 19:27:19,893 INFO [ContainerLauncher #19] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1368653326922_0080_r_001266_0
2013-05-16 19:27:19,893 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Preempting attempt_1368653326922_0080_r_001279_0
2013-05-16 19:27:19,893 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Preempting attempt_1368653326922_0080_r_001278_0
2013-05-16 19:27:19,893 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Preempting attempt_1368653326922_0080_r_001277_0
2013-05-16 19:27:19,893 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1368653326922_0080_r_001279_0 TaskAttempt Transitioned from RUNNING to KILL_CONTAINER_CLEANUP
2013-05-16 19:27:19,893 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:3 ScheduledReds:0 AssignedMaps:0 AssignedReds:63 CompletedMaps:16 CompletedReds:1233 ContAlloc:1324 ContRel:25 HostLocal:2 RackLocal:17
2013-05-16 19:27:19,893 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1368653326922_0080_r_001278_0 TaskAttempt Transitioned from COMMIT_PENDING to KILL_CONTAINER_CLEANUP
2013-05-16 19:27:19,893 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1368653326922_0080_r_001277_0 TaskAttempt Transitioned from RUNNING to KILL_CONTAINER_CLEANUP
2013-05-16 19:27:19,893 INFO [ContainerLauncher #10] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1368653326922_0080_01_001311 taskAttempt attempt_1368653326922_0080_r_001279_0
2013-05-16 19:27:19,893 INFO [ContainerLauncher #10] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1368653326922_0080_r_001279_0
2013-05-16 19:27:19,893 INFO [ContainerLauncher #4] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1368653326922_0080_01_001310 taskAttempt attempt_1368653326922_0080_r_001278_0
2013-05-16 19:27:19,893 INFO [ContainerLauncher #2]