[jira] [Created] (MAPREDUCE-6513) MR job hangs forever when one NM is unstable for some time

2015-10-15 Thread Bob (JIRA)
Bob created MAPREDUCE-6513:
--

 Summary: MR job hangs forever when one NM is unstable for some time
 Key: MAPREDUCE-6513
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: applicationmaster, resourcemanager
Affects Versions: 3.0.0
Reporter: Bob


While a job with a large number of tasks was in progress, one node became 
unstable due to an OS issue. After the node became unstable, the map attempts 
on this node transitioned to the KILLED state.

The maps that had been running on the unstable node were rescheduled, and they 
are all in the SCHEDULED state waiting for the RM to assign containers. Ask 
requests for the maps were seen until the node became good again (all of those 
failed); there are no ask requests after that. But the AM keeps preempting the 
reducers (it keeps recycling them).

In the end the reducers are waiting for the maps to complete, and the maps 
never got a container.

My question is:

Why were no map requests sent by the AM once the node recovered?










--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6485) MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run

2015-09-23 Thread Bob (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated MAPREDUCE-6485:
---
Description: 
The scenario is like this:
With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reducers take 
resources and start to run before all of the maps have finished.
It can then happen that all of the resources are taken up by running reducers 
while there is still one unfinished map.
Under this condition, the last map has two task attempts.
The first attempt was killed due to timeout (mapreduce.task.timeout), and its 
state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP and then to FAILED, 
but the failed map attempt is not restarted because there is still a 
speculative map attempt in progress.
The second attempt, which was started because map task speculation is enabled, 
is stuck in the UNASSIGNED state because no resources are available.
But the second map attempt's request has a lower priority than the reducers, 
so no preemption happens.
As a result none of the reducers can finish because one map is left, and the 
last map hangs there because no resources are available, so the job never 
finishes.
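
For reference, here is a minimal, illustrative driver-side sketch of the 
configuration knobs involved in this scenario. The property names are the 
standard MapReduce ones referred to above; the class name and the values are 
only an example, not part of the reported job:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowstartDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Reducers may be scheduled once 80% of the maps have completed.
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.8f);
    // An attempt that reports no progress for 600 seconds is killed by the AM.
    conf.setLong("mapreduce.task.timeout", 600000L);
    // Map speculation is what creates the second (speculative) attempt above.
    conf.setBoolean("mapreduce.map.speculative", true);

    Job job = Job.getInstance(conf, "slowstart-demo");
    // ... set mapper/reducer classes, input/output paths, and submit as usual ...
  }
}
{code}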

  was:
The scenario is like this:
With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reducers take 
resources and start to run before all of the maps have finished.
It can then happen that all of the resources are taken up by running reducers 
while there is still one unfinished map.
Under this condition, the last map has two task attempts.
The first attempt was killed due to timeout (mapreduce.task.timeout) and its 
state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map 
attempt is not restarted.
The second attempt, which was started because map task speculation is enabled, 
is stuck in the UNASSIGNED state because no resources are available.
But the second map attempt's request has a lower priority than the reducers, 
so no preemption happens.
As a result none of the reducers can finish because one map is left, and the 
last map hangs there because no resources are available, so the job never 
finishes.


> MR job hangs forever because all resources are taken up by reducers and the 
> last map attempt never gets resources to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Assignee: Xianyin Xin
>Priority: Critical
> Attachments: MAPREDUCE-6485.001.patch
>
>
> The scenario is like this:
> With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reducers 
> take resources and start to run before all of the maps have finished. 
> It can then happen that all of the resources are taken up by running 
> reducers while there is still one unfinished map. 
> Under this condition, the last map has two task attempts.
> The first attempt was killed due to timeout (mapreduce.task.timeout), and 
> its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP and then to 
> FAILED, but the failed map attempt is not restarted because there is still 
> a speculative map attempt in progress. 
> The second attempt, which was started because map task speculation is 
> enabled, is stuck in the UNASSIGNED state because no resources are available. 
> But the second map attempt's request has a lower priority than the reducers, 
> so no preemption happens.
> As a result none of the reducers can finish because one map is left, and the 
> last map hangs there because no resources are available, so the job never 
> finishes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6485) MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run

2015-09-22 Thread Bob (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903811#comment-14903811
 ] 

Bob commented on MAPREDUCE-6485:


[~xinxianyin], thanks for your in-depth analysis. Now that we have found the 
root cause of this issue, could you provide a patch for it?

> MR job hangs forever because all resources are taken up by reducers and the 
> last map attempt never gets resources to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Priority: Critical
>
> The scenario is like this:
> With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reducers 
> take resources and start to run before all of the maps have finished. 
> It can then happen that all of the resources are taken up by running 
> reducers while there is still one unfinished map. 
> Under this condition, the last map has two task attempts.
> The first attempt was killed due to timeout (mapreduce.task.timeout) and its 
> state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map 
> attempt is not restarted. 
> The second attempt, which was started because map task speculation is 
> enabled, is stuck in the UNASSIGNED state because no resources are available. 
> But the second map attempt's request has a lower priority than the reducers, 
> so no preemption happens.
> As a result none of the reducers can finish because one map is left, and the 
> last map hangs there because no resources are available, so the job never 
> finishes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6485) MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run

2015-09-19 Thread Bob (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated MAPREDUCE-6485:
---
Description: 
The scenario is like this:
With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reducers take 
resources and start to run before all of the maps have finished.
It can then happen that all of the resources are taken up by running reducers 
while there is still one unfinished map.
Under this condition, the last map has two task attempts.
The first attempt was killed due to timeout (mapreduce.task.timeout) and its 
state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map 
attempt is not restarted.
The second attempt, which was started because map task speculation is enabled, 
is stuck in the UNASSIGNED state because no resources are available.
But the second map attempt's request has a lower priority than the reducers, 
so no preemption happens.
As a result none of the reducers can finish because one map is left, and the 
last map hangs there because no resources are available, so the job never 
finishes.

  was:
The scenario is like this:
With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reducers take 
resources and start to run before all of the maps have finished.
It can then happen that all of the resources are taken up by running reducers 
while there is still one unfinished map.
Under this condition, the last map has two task attempts.
The first attempt was killed due to timeout (mapreduce.task.timeout) and its 
state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map 
attempt is not restarted.
The second attempt, which was started because map task speculation is enabled, 
is stuck in the UNASSIGNED state because no resources are available.
But the second map attempt's request has a lower priority than the reducers, 
so no preemption happens.
As a result none of the reducers can finish because one map is left, and the 
last map hangs there because no resources are available, so the job never 
finishes.


> MR job hangs forever because all resources are taken up by reducers and the 
> last map attempt never gets resources to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Priority: Critical
>
> The scenario is like this:
> With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reducers 
> take resources and start to run before all of the maps have finished. 
> It can then happen that all of the resources are taken up by running 
> reducers while there is still one unfinished map. 
> Under this condition, the last map has two task attempts.
> The first attempt was killed due to timeout (mapreduce.task.timeout) and its 
> state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map 
> attempt is not restarted. 
> The second attempt, which was started because map task speculation is 
> enabled, is stuck in the UNASSIGNED state because no resources are available. 
> But the second map attempt's request has a lower priority than the reducers, 
> so no preemption happens.
> As a result none of the reducers can finish because one map is left, and the 
> last map hangs there because no resources are available, so the job never 
> finishes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6485) MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run

2015-09-19 Thread Bob (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated MAPREDUCE-6485:
---
Affects Version/s: 2.4.1
   2.6.0
   2.7.1

> MR job hangs forever because all resources are taken up by reducers and the 
> last map attempt never gets resources to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Priority: Critical
>
> The scenario is like this:
> With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reducers 
> take resources and start to run before all of the maps have finished. 
> It can then happen that all of the resources are taken up by running 
> reducers while there is still one unfinished map. 
> Under this condition, the last map has two task attempts.
> The first attempt was killed due to timeout (mapreduce.task.timeout) and its 
> state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map 
> attempt is not restarted. 
> The second attempt, which was started because map task speculation is 
> enabled, is stuck in the UNASSIGNED state because no resources are available. 
> But the second map attempt's request has a lower priority than the reducers, 
> so no preemption happens.
> As a result none of the reducers can finish because one map is left, and the 
> last map hangs there because no resources are available, so the job never 
> finishes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAPREDUCE-6485) MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run

2015-09-19 Thread Bob (JIRA)
Bob created MAPREDUCE-6485:
--

 Summary: MR job hangs forever because all resources are taken up 
by reducers and the last map attempt never gets resources to run
 Key: MAPREDUCE-6485
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: applicationmaster
Affects Versions: 3.0.0
Reporter: Bob
Priority: Critical


The scenario is like this:
With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reducers take 
resources and start to run before all of the maps have finished.
It can then happen that all of the resources are taken up by running reducers 
while there is still one unfinished map.
Under this condition, the last map has two task attempts.
The first attempt was killed due to timeout (mapreduce.task.timeout) and its 
state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map 
attempt is not restarted.
The second attempt, which was started because map task speculation is enabled, 
is stuck in the UNASSIGNED state because no resources are available.
But the second map attempt's request has a lower priority than the reducers, 
so no preemption happens.
As a result none of the reducers can finish because one map is left, and the 
last map hangs there because no resources are available, so the job never 
finishes.
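
To make the hang easier to reason about, below is a tiny, self-contained sketch 
of the situation the description implies. It is purely illustrative and is not 
the actual RMContainerAllocator logic: every container is held by a reducer, 
the pending map request has a lower priority so nothing is preempted, and since 
the reducers cannot finish until that map finishes, no state ever changes:
{code}
// Toy model of the deadlock described above (illustration only, not Hadoop code).
public class MapReduceDeadlockSketch {
  public static void main(String[] args) {
    int freeContainers = 0;              // every container is already held by a reducer
    boolean lastMapDone = false;         // the single remaining map task
    boolean reducerPreemption = false;   // map request has lower priority => no preemption

    // Simulate a few scheduling rounds; nothing can ever change state.
    for (int round = 0; round < 5 && !lastMapDone; round++) {
      if (freeContainers > 0) {
        lastMapDone = true;              // the map would run if a container were free
      } else if (reducerPreemption) {
        freeContainers++;                // preempting a reducer would free a container
      }
      // Reducers only release containers after the last map completes,
      // so freeContainers stays 0 and the loop makes no progress.
    }
    System.out.println(lastMapDone ? "job finished" : "job hung: last map never scheduled");
  }
}
{code}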



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6485) MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets resources to run

2015-09-19 Thread Bob (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14877128#comment-14877128
 ] 

Bob commented on MAPREDUCE-6485:


[~varun_saxena], thanks for checking this issue. Below are some related logs 
you can refer to.
*1. All AM logs for the first attempt:*
{code}
03:30:32,457 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from NEW to 
UNASSIGNED
03:33:22,037 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1439640536651_4710_01_003425 to attempt_1439640536651_4710_m_002807_0
03:33:22,038 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from UNASSIGNED 
to ASSIGNED
03:33:22,044 INFO [ContainerLauncher #27] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing 
the event EventType: CONTAINER_REMOTE_LAUNCH for container 
container_1439640536651_4710_01_003425 taskAttempt 
attempt_1439640536651_4710_m_002807_0
03:33:22,044 INFO [ContainerLauncher #27] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching 
attempt_1439640536651_4710_m_002807_0
03:33:22,071 INFO [ContainerLauncher #27] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Shuffle port 
returned by ContainerManager for attempt_1439640536651_4710_m_002807_0 : 26008
03:33:22,071 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt: 
[attempt_1439640536651_4710_m_002807_0] using containerId: 
[container_1439640536651_4710_01_003425 on NM: [SCCHDPHIV02129:26009]
03:33:22,071 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from ASSIGNED to 
RUNNING
03:33:31,481 INFO [IPC Server handler 24 on 27102] 
org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: 
jvm_1439640536651_4710_m_003425 given task: 
attempt_1439640536651_4710_m_002807_0
03:43:31,473 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report 
from attempt_1439640536651_4710_m_002807_0: 
AttemptID:attempt_1439640536651_4710_m_002807_0 Timed out after 600 secs
03:43:31,473 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report 
from attempt_1439640536651_4710_m_002807_0: 
AttemptID:attempt_1439640536651_4710_m_002807_0 Timed out after 600 secs
03:43:31,474 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from RUNNING to 
FAIL_CONTAINER_CLEANUP
03:43:31,474 INFO [ContainerLauncher #23] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing 
the event EventType: CONTAINER_REMOTE_CLEANUP for container 
container_1439640536651_4710_01_003425 taskAttempt 
attempt_1439640536651_4710_m_002807_0
03:43:31,474 INFO [ContainerLauncher #23] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING 
attempt_1439640536651_4710_m_002807_0
03:43:31,478 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from 
FAIL_CONTAINER_CLEANUP to FAIL_TASK_CLEANUP
03:43:31,478 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from 
FAIL_TASK_CLEANUP to FAILED
03:43:32,701 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report 
from attempt_1439640536651_4710_m_002807_0: Container killed by the 
ApplicationMaster.
{code}
*2. All AM logs for the second attempt (only this single entry):*
{code}
03:39:55,339 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_1 TaskAttempt Transitioned from NEW to 
UNASSIGNED
{code}
*3. Checking the logs by timestamp, we can see that after the second attempt 
started, the available resources were all allocated to reducers, not to the 
last map attempt:*
{code}
03:39:55,978 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1439640536651_4710_01_015149 to attempt_1439640536651_4710_r_000669_0
03:39:55,978 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:0 ScheduledMaps:1 ScheduledReds:330 AssignedMaps:1 AssignedReds:669 
CompletedMaps:14257 CompletedReds:0 

[jira] [Created] (MAPREDUCE-6381) Some MapReduce command operations should have audit logs printed

2015-06-01 Thread Bob (JIRA)
Bob created MAPREDUCE-6381:
--

 Summary: Some MapReduce command operations should have audit logs printed
 Key: MAPREDUCE-6381
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6381
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.7.0
Reporter: Bob


The mapred commands below are important operations that should also have audit 
logs recorded, like the 'yarn' commands do (see the sketch after the command list).
Mapred commands:
mapred job -set-priority job-id priority 
mapred job -kill-task task-attempt-id
mapred job -fail-task task-attempt-id
mapred job -kill job-id
mapred pipes
mapred job -submit job-file
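
As an illustration of what is being requested, a client-side audit line for one 
of these commands could look roughly like the sketch below. The helper class, 
logger name, and field layout are hypothetical and are not an existing 
MapReduce API; the point is only that each command would emit one auditable 
record:
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper: emits one key=value audit line per mapred client operation.
public final class MapredClientAuditLogger {
  private static final Logger AUDIT = LoggerFactory.getLogger("mapred.client.audit");

  private MapredClientAuditLogger() {}

  public static void logOperation(String user, String operation,
                                  String target, boolean success) {
    AUDIT.info("USER={}\tOPERATION={}\tTARGET={}\tRESULT={}",
        user, operation, target, success ? "SUCCESS" : "FAILURE");
  }
}

// Example call site, e.g. while handling "mapred job -kill <job-id>":
//   MapredClientAuditLogger.logOperation(currentUserName, "KILL_JOB", jobId, true);
{code}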




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6381) Some MapReduce command operations should have audit logs printed

2015-06-01 Thread Bob (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated MAPREDUCE-6381:
---
Description: 
The mapred commands below are important operations that should also have audit 
logs recorded, like the 'yarn' commands do.
*Mapred commands:*
{noformat}
mapred job -set-priority job-id priority 
mapred job -kill-task task-attempt-id
mapred job -fail-task task-attempt-id
mapred job -kill job-id
mapred pipes
mapred job -submit job-file
{noformat}

  was:
The mapred commands below are important operations that should also have audit 
logs recorded, like the 'yarn' commands do.
Mapred commands:
mapred job -set-priority job-id priority 
mapred job -kill-task task-attempt-id
mapred job -fail-task task-attempt-id
mapred job -kill job-id
mapred pipes
mapred job -submit job-file



 Some MapReduce command operations should have audit logs printed
 ---

 Key: MAPREDUCE-6381
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6381
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.7.0
Reporter: Bob

 The mapred commands below are important operations that should also have 
 audit logs recorded, like the 'yarn' commands do.
 *Mapred commands:*
 {noformat}
 mapred job -set-priority job-id priority 
 mapred job -kill-task task-attempt-id
 mapred job -fail-task task-attempt-id
 mapred job -kill job-id
 mapred pipes
 mapred job -submit job-file
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5255) Reduce task preemption results in task never completing, incomplete fix to MAPREDUCE-3858?

2015-05-26 Thread Bob (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560272#comment-14560272
 ] 

Bob commented on MAPREDUCE-5255:


[~rjain7], have you finished verifying this issue based on MAPREDUCE-5009? Are 
you sure this issue has been solved?

 Reduce task preemption results in task never completing, incomplete fix to 
 MAPREDUCE-3858?
 -

 Key: MAPREDUCE-5255
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5255
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 2.0.3-alpha
Reporter: Rahul Jain

 The problem was seen with symptoms very similar to MAPREDUCE-3858: the job is 
 hung with continuous reduce task attempts, each attempt getting killed around 
 commit phase.
 After a while the single reduce task was the only one remaining in the job, 
 with 50K 'kills' done for the task.
 Relevant logs from application master:
 (the problem task is: attempt_1368653326922_0080_r_001278_0)
 {code}
 2013-05-16 19:27:19,891 INFO [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 3
 2013-05-16 19:27:19,892 INFO [IPC Server handler 22 on 40095] 
 org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from 
 attempt_1368653326922_0080_r_001266_0
 2013-05-16 19:27:19,892 INFO [IPC Server handler 22 on 40095] 
 org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
 attempt_1368653326922_0080_r_001266_0 is : 0.7212161
 2013-05-16 19:27:19,893 INFO [IPC Server handler 13 on 40095] 
 org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from 
 attempt_1368653326922_0080_r_001266_0
 2013-05-16 19:27:19,893 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
 attempt_1368653326922_0080_r_001266_0 TaskAttempt Transitioned from 
 COMMIT_PENDING to SUCCESS_CONTAINER_CLEANUP
 2013-05-16 19:27:19,893 INFO [ContainerLauncher #19] 
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing 
 the event EventType: CONTAINER_REMOTE_CLEANUP for container 
 container_1368653326922_0080_01_001296 taskAttempt 
 attempt_1368653326922_0080_r_001266_0
 2013-05-16 19:27:19,893 INFO [ContainerLauncher #19] 
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING 
 attempt_1368653326922_0080_r_001266_0
 2013-05-16 19:27:19,893 INFO [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Preempting 
 attempt_1368653326922_0080_r_001279_0
 2013-05-16 19:27:19,893 INFO [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Preempting 
 attempt_1368653326922_0080_r_001278_0
 2013-05-16 19:27:19,893 INFO [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Preempting 
 attempt_1368653326922_0080_r_001277_0
 2013-05-16 19:27:19,893 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
 attempt_1368653326922_0080_r_001279_0 TaskAttempt Transitioned from RUNNING 
 to KILL_CONTAINER_CLEANUP
 2013-05-16 19:27:19,893 INFO [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
 PendingReds:0 ScheduledMaps:3 ScheduledReds:0 AssignedMaps:0 AssignedReds:63 
 CompletedMaps:16 CompletedReds:1233 ContAlloc:1324 ContRel:25 HostLocal:2 
 RackLocal:17
 2013-05-16 19:27:19,893 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
 attempt_1368653326922_0080_r_001278_0 TaskAttempt Transitioned from 
 COMMIT_PENDING to KILL_CONTAINER_CLEANUP
 2013-05-16 19:27:19,893 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
 attempt_1368653326922_0080_r_001277_0 TaskAttempt Transitioned from RUNNING 
 to KILL_CONTAINER_CLEANUP
 2013-05-16 19:27:19,893 INFO [ContainerLauncher #10] 
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing 
 the event EventType: CONTAINER_REMOTE_CLEANUP for container 
 container_1368653326922_0080_01_001311 taskAttempt 
 attempt_1368653326922_0080_r_001279_0
 2013-05-16 19:27:19,893 INFO [ContainerLauncher #10] 
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING 
 attempt_1368653326922_0080_r_001279_0
 2013-05-16 19:27:19,893 INFO [ContainerLauncher #4] 
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing 
 the event EventType: CONTAINER_REMOTE_CLEANUP for container 
 container_1368653326922_0080_01_001310 taskAttempt 
 attempt_1368653326922_0080_r_001278_0
 2013-05-16 19:27:19,893 INFO [ContainerLauncher #2] 
 
