[jira] [Updated] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

2015-10-02 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated MAPREDUCE-6485:
---
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 2.8.0
           Status: Resolved  (was: Patch Available)

Committed to branch-2 and trunk. Thanks [~xinxianyin] for the contribution, and 
[~kasha] for the additional review.

> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Assignee: Xianyin Xin
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-6485.001.patch, MAPREDUCE-6485.004.patch, 
> MAPREDUCE-6485.005.patch, MAPREDUCE-6485.006.patch, MAPREDUCE-6845.002.patch, 
> MAPREDUCE-6845.003.patch
>
>
> The scenario is as follows:
> With mapreduce.job.reduce.slowstart.completedmaps=0.8, reduces take 
> resources and start to run before all the maps have finished. 
> It can happen that all the resources are taken up by running reduces while 
> one map is still unfinished. 
> In this situation, the last map has two task attempts.
> The first attempt was killed due to a timeout (mapreduce.task.timeout), and 
> its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP and then to 
> FAILED, but the failed map attempt is not restarted because one speculative 
> map attempt is still in progress. 
> The second attempt, which was started because map task speculation is 
> enabled, is pending in the UNASSIGNED state because no resources are available. 
> However, the second map attempt's request has a lower priority than the 
> reduces, so preemption does not happen.
> As a result, the reduces cannot finish because one map is left, and the last 
> map hangs because no resources are available, so the job never finishes.
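
The hang described above can be captured in a small toy model. This is only a sketch, not Hadoop code: the class `SlowstartHangModel`, its methods, and the container counts are hypothetical and exist purely for illustration. It assumes that the slow-start threshold is the ceiling of the configured fraction times the number of maps, and that the job is stuck once reduces hold every container, the unfinished map has no attempt holding a container, and no reduce will be preempted.

```java
/** Toy model of the hang in MAPREDUCE-6485. All names are hypothetical, not Hadoop APIs. */
final class SlowstartHangModel {

  /** Completed maps required before reduces may be scheduled (assumed: ceil(fraction * total)). */
  static int completedMapsForSlowstart(double slowstartFraction, int totalMaps) {
    return (int) Math.ceil(slowstartFraction * totalMaps);
  }

  /**
   * Returns true when the modeled job cannot make progress: reduces hold every
   * container, at least one map is unfinished, no attempt of that map holds a
   * container, and nothing will preempt a reduce to free one.
   */
  static boolean isStuck(int totalContainers, int runningReduces, int unfinishedMaps,
      int mapAttemptsHoldingContainer, boolean reducePreemptionPossible) {
    boolean reducesHoldEverything = runningReduces >= totalContainers;
    boolean mapStarved = unfinishedMaps > 0 && mapAttemptsHoldingContainer == 0;
    return reducesHoldEverything && mapStarved && !reducePreemptionPossible;
  }

  public static void main(String[] args) {
    // Slow-start threshold for 10 maps with a fraction of 0.8.
    System.out.println(completedMapsForSlowstart(0.8, 10));
    // Reported scenario: 4 containers, 4 running reduces, 1 unfinished map whose
    // only live attempt is still UNASSIGNED, and no preemption.
    System.out.println(isStuck(4, 4, 1, 0, false));
    // Same scenario, but a reduce could be preempted to make room for the map.
    System.out.println(isStuck(4, 4, 1, 0, true));
  }
}
```

The model deliberately ignores memory and vcore sizes; it only shows why none of the remaining parties can make the first move.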



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

2015-09-29 Thread Xianyin Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated MAPREDUCE-6485:
---
Attachment: MAPREDUCE-6485.005.patch

The checkstyle and whitespace warnings were not introduced by the patch. 
Nevertheless, attaching ver-005 to fix the whitespace warning.

> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Assignee: Xianyin Xin
>Priority: Critical
> Attachments: MAPREDUCE-6485.001.patch, MAPREDUCE-6485.004.patch, 
> MAPREDUCE-6485.005.patch, MAPREDUCE-6845.002.patch, MAPREDUCE-6845.003.patch





[jira] [Updated] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

2015-09-29 Thread Xianyin Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated MAPREDUCE-6485:
---
Status: Patch Available  (was: Reopened)

Kicking Jenkins.

> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 2.7.1, 2.6.0, 2.4.1, 3.0.0
>Reporter: Bob
>Assignee: Xianyin Xin
>Priority: Critical
> Attachments: MAPREDUCE-6485.001.patch, MAPREDUCE-6485.004.patch, 
> MAPREDUCE-6845.002.patch, MAPREDUCE-6845.003.patch





[jira] [Updated] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

2015-09-29 Thread Xianyin Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated MAPREDUCE-6485:
---
Attachment: MAPREDUCE-6485.004.patch

After discussing off-line with [~rohithsharma], I changed how we check whether 
a taskAttempt is hanging for resources, following [~rohithsharma]'s 
suggestion. The check now looks like this:
{code}
boolean isContainerAssigned() {
  return container == null ? false : true;
}
{code}
rather than checking the internal states as in patch ver-003.
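
To illustrate how such a check might feed the rescheduling decision, here is a minimal sketch. It is not the patch itself: `AttemptRescheduleSketch`, `Attempt`, and `shouldAddNewAttempt` are hypothetical stand-ins, and the rule that a failed attempt is retried only when no in-progress attempt holds a container is an assumption drawn from the issue description.

```java
import java.util.List;

/** Hypothetical stand-ins for the task attempt bookkeeping; not the Hadoop classes. */
final class AttemptRescheduleSketch {

  /** Minimal attempt that only tracks whether a container has been assigned. */
  static final class Attempt {
    private final Object container; // stays null until a container is assigned

    Attempt(Object container) {
      this.container = container;
    }

    boolean isContainerAssigned() {
      return container != null;
    }
  }

  /**
   * Decides whether a new attempt should be scheduled after one attempt failed.
   * A spare in-progress attempt only counts if it actually holds a container;
   * attempts still waiting for resources do not block the retry.
   */
  static boolean shouldAddNewAttempt(List<Attempt> inProgressAttempts) {
    for (Attempt attempt : inProgressAttempts) {
      if (attempt.isContainerAssigned()) {
        return false; // a spare attempt is really running, no retry needed
      }
    }
    return true; // no spare, or every spare is still waiting for a container
  }
}
```

In the reported scenario the only in-progress attempt is the speculative one without a container, so this sketch would schedule a fresh attempt instead of waiting on it indefinitely.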

> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Assignee: Xianyin Xin
>Priority: Critical
> Attachments: MAPREDUCE-6485.001.patch, MAPREDUCE-6485.004.patch, 
> MAPREDUCE-6845.002.patch, MAPREDUCE-6845.003.patch





[jira] [Updated] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

2015-09-29 Thread Xianyin Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated MAPREDUCE-6485:
---
Attachment: MAPREDUCE-6485.006.patch

Thanks, [~kasha].
{quote}
 isContainerAssigned() - can we do the following, please?
{code}
boolean isContainerAssigned() {
  return container != null;
}
{code}
{quote}
Fixed.

> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Assignee: Xianyin Xin
>Priority: Critical
> Attachments: MAPREDUCE-6485.001.patch, MAPREDUCE-6485.004.patch, 
> MAPREDUCE-6485.005.patch, MAPREDUCE-6485.006.patch, MAPREDUCE-6845.002.patch, 
> MAPREDUCE-6845.003.patch





[jira] [Updated] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

2015-09-25 Thread Xianyin Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated MAPREDUCE-6485:
---
Attachment: MAPREDUCE-6845.003.patch

Thanks [~rohithsharma], that is a better solution! Attached a new patch.

> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Assignee: Xianyin Xin
>Priority: Critical
> Attachments: MAPREDUCE-6485.001.patch, MAPREDUCE-6845.002.patch, 
> MAPREDUCE-6845.003.patch





[jira] [Updated] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

2015-09-24 Thread Xianyin Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated MAPREDUCE-6485:
---
Attachment: MAPREDUCE-6845.002.patch

Thanks for the review, [~rohithsharma].
{quote}
Can you add a test to guard against regression?
{quote}
Added a test case that mocks the scenario.
{quote}
nit: can you check for greater than rather than not equal?
{{task.inProgressAttempts.size() != 0}}
{quote}
Fixed.

> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Assignee: Xianyin Xin
>Priority: Critical
> Attachments: MAPREDUCE-6485.001.patch, MAPREDUCE-6845.002.patch





[jira] [Updated] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

2015-09-23 Thread Bob (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated MAPREDUCE-6485:
---
Description: 
The scenario is as follows:
With mapreduce.job.reduce.slowstart.completedmaps=0.8, reduces take resources 
and start to run before all the maps have finished. 
It can happen that all the resources are taken up by running reduces while one 
map is still unfinished. 
In this situation, the last map has two task attempts.
The first attempt was killed due to a timeout (mapreduce.task.timeout), and its 
state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP and then to FAILED, 
but the failed map attempt is not restarted because one speculative map attempt 
is still in progress. 
The second attempt, which was started because map task speculation is enabled, 
is pending in the UNASSIGNED state because no resources are available. 
However, the second map attempt's request has a lower priority than the 
reduces, so preemption does not happen.
As a result, the reduces cannot finish because one map is left, and the last 
map hangs because no resources are available, so the job never finishes.

  was:
The scenario is as follows:
With mapreduce.job.reduce.slowstart.completedmaps=0.8, reduces take resources 
and start to run before all the maps have finished. 
It can happen that all the resources are taken up by running reduces while one 
map is still unfinished. 
In this situation, the last map has two task attempts.
The first attempt was killed due to a timeout (mapreduce.task.timeout), and its 
state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map 
attempt would not be restarted. 
The second attempt, which was started because map task speculation is enabled, 
is pending in the UNASSIGNED state because no resources are available. 
However, the second map attempt's request has a lower priority than the 
reduces, so preemption does not happen.
As a result, the reduces cannot finish because one map is left, and the last 
map hangs because no resources are available, so the job never finishes.


> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Assignee: Xianyin Xin
>Priority: Critical
> Attachments: MAPREDUCE-6485.001.patch





[jira] [Updated] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

2015-09-22 Thread Xianyin Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated MAPREDUCE-6485:
---
Attachment: MAPREDUCE-6485.001.patch

Uploaded a patch; please review.

> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Priority: Critical
> Attachments: MAPREDUCE-6485.001.patch





[jira] [Updated] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

2015-09-19 Thread Bob (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated MAPREDUCE-6485:
---
Description: 
The scenario is as follows:
With mapreduce.job.reduce.slowstart.completedmaps=0.8, reduces take resources 
and start to run before all the maps have finished. 
It can happen that all the resources are taken up by running reduces while one 
map is still unfinished. 
In this situation, the last map has two task attempts.
The first attempt was killed due to a timeout (mapreduce.task.timeout), and its 
state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so the failed map 
attempt would not be restarted. 
The second attempt, which was started because map task speculation is enabled, 
is pending in the UNASSIGNED state because no resources are available. 
However, the second map attempt's request has a lower priority than the 
reduces, so preemption does not happen.
As a result, the reduces cannot finish because one map is left, and the last 
map hangs because no resources are available, so the job never finishes.

  was:
The scenarios is like this:
With configuring mapreduce.job.reduce.slowstart.completedmaps=0.8, reduces will 
take resource and  start to run when all the map have not finished. 
But It could happened that when all the resources are taken up by running 
reduces, there is still one map not finished. 
Under this condition , the last map have two task attempts .
As for the first attempt was killed due to timeout(mapreduce.task.timeout), and 
its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so failed map 
attempt would not be started. 
As for the second attempt which was started due to having enable map task 
speculative is pending at UNASSINGED state because of no resource available. 
But the second map attempt request have lower priority than reduces, so 
preemption would not happened.
As a result all reduces would not finished because of there is one map left. 
and the last map hanged there because of no resource available. so, the job 
would never finished.


> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Priority: Critical





[jira] [Updated] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

2015-09-19 Thread Bob (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated MAPREDUCE-6485:
---
Affects Version/s: 2.4.1
   2.6.0
   2.7.1

> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> ---
>
> Key: MAPREDUCE-6485
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>Reporter: Bob
>Priority: Critical


