[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2021-12-01 Thread zhangyangyang (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451623#comment-17451623
 ] 

zhangyangyang commented on MAPREDUCE-6944:
--

[~daemon] [~luxianghao] In 2.6 there is no applyRequestLimits() function, and 
when I kill the stuck task attempt manually, the new attempt can be assigned 
and run.
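
For context on the ask-limit mechanism mentioned above: the "Applying ask limit of 1 for priority:5" lines in the quoted log come from the request-limit logic in later branches. Below is a minimal, hypothetical sketch of the clamping idea only; the class, method, and parameter names are assumptions made for illustration and this is not Hadoop's actual RMContainerRequestor code.

{code:java}
// Hypothetical sketch of an "ask limit" clamp, for illustration only.
// This is NOT Hadoop's RMContainerRequestor; names and the limit formula are assumptions.
public class AskLimitSketch {

  /** Clamp how many containers we ask the RM for at one priority/capability. */
  static int applyAskLimit(int pendingRequests, long resourceLimitMb, long perContainerMb) {
    // Ask for no more containers than the requests we still track as pending,
    // and no more than the current resource limit can hold.
    long byHeadroom = perContainerMb > 0 ? resourceLimitMb / perContainerMb : pendingRequests;
    int ask = (int) Math.max(0, Math.min(pendingRequests, byHeadroom));
    System.out.println("Applying ask limit of " + ask + " (pending " + pendingRequests
        + ", headroom-based limit " + byHeadroom + ")");
    return ask;
  }

  public static void main(String[] args) {
    // With only one map believed pending, the ask stays at 1, matching the log above.
    applyAskLimit(1, 8L * 1024, 4L * 1024);
    applyAskLimit(10, 8L * 1024, 4L * 1024);
  }
}
{code}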

> MR job got hanged forever when some NMs unstable for some time
> --
>
> Key: MAPREDUCE-6944
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6944
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster, resourcemanager
>Affects Versions: 2.7.2
>Reporter: YunFan Zhou
>Priority: Critical
> Attachments: screenshot-1.png
>
>
> We encountered several jobs in the production environment where some unstable 
> NMs caused one *MAP* of the job to get stuck, so the job could not finish 
> properly.
> However, the problem we encountered is different from the one mentioned in 
> [https://issues.apache.org/jira/browse/MAPREDUCE-6513], because in our 
> scenario none of the *MR REDUCEs* start executing.
> But when I manually kill the hanged *MAP*, the job finishes normally.
> {noformat}
> 2017-08-17 12:25:06,548 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start 
> threshold not met. completedMapsForReduceSlowstart 15564
> 2017-08-17 12:25:07,555 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received 
> completed container container_e84_1502793246072_73922_01_015700
> 2017-08-17 12:25:07,556 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating 
> schedule, headroom=
> 2017-08-17 12:25:07,556 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start 
> threshold not met. completedMapsForReduceSlowstart 15564
> 2017-08-17 12:25:07,556 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
> PendingReds:1009 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 
> AssignedReds:0 CompletedMaps:15563 CompletedReds:0 ContAlloc:15723 ContRel:26 
> HostLocal:4575 RackLocal:8121
> {noformat}
> {noformat}
> 2017-08-17 14:49:41,793 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before 
> Scheduling: PendingReds:1009 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:1 
> AssignedReds:0 CompletedMaps:15563 CompletedReds:0 ContAlloc:15724 ContRel:26 
> HostLocal:4575 RackLocal:8121
> 2017-08-17 14:49:41,794 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Applying ask 
> limit of 1 for priority:5 and capability:
> 2017-08-17 14:49:41,799 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() 
> for application_1502793246072_73922: ask=1 release= 0 newContainers=0 
> finishedContainers=0 resourcelimit= knownNMs=4236
> 2017-08-17 14:49:41,799 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating 
> schedule, headroom=
> 2017-08-17 14:49:41,799 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start 
> threshold not met. completedMapsForReduceSlowstart 15564
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
> containers 1
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigning 
> container Container: [ContainerId: 
> container_e84_1502793246072_73922_01_015726, NodeId: 
> bigdata-hdp-apache1960.xg01.diditaxi.com:8041, NodeHttpAddress: 
> bigdata-hdp-apache1960.xg01.diditaxi.com:8042, Resource:  vCores:1>, Priority: 5, Token: Token { kind: ContainerToken, service: 
> 10.93.111.36:8041 }, ] to fast fail map
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned from 
> earlierFailedMaps
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_e84_1502793246072_73922_01_015726 to 
> attempt_1502793246072_73922_m_012103_5
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating 
> schedule, headroom=
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start 
> threshold not met. 

[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2021-12-01 Thread zhangyangyang (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451615#comment-17451615
 ] 

zhangyangyang commented on MAPREDUCE-6944:
--

Hello, I have encountered the same problem: the job hangs because the last map 
cannot be assigned even though its container ask was sent to the RM. At the 
same time, none of the *MR REDUCEs* start executing. By the way, the Hadoop 
version is 2.6. Please help!


[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2019-01-06 Thread lqjacklee (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735383#comment-16735383
 ] 

lqjacklee commented on MAPREDUCE-6944:
--

PR updated:

1. remove from the earlier-failed-maps mapping if present
2. remove from the host mapping if present
3. remove from the rack mapping if present
4. remove from the task attempt container request mapping if present

A simplified sketch of this cleanup is shown below.

[~luxianghao] [~daemon] please help review this update. Thanks.
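
For readers following along, here is a small, self-contained sketch of what that four-step cleanup amounts to. All class and field names below are illustrative assumptions; this is not the code in the actual pull request.

{code:java}
// Self-contained sketch of the cleanup described in the four steps above.
// Names are illustrative assumptions, not the actual RMContainerAllocator code.
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

public class ScheduledRequestsSketch {
  final LinkedList<String> earlierFailedMaps = new LinkedList<>();
  final Map<String, LinkedList<String>> mapsHostMapping = new HashMap<>();
  final Map<String, LinkedList<String>> mapsRackMapping = new HashMap<>();
  final Map<String, String> attemptToContainerRequest = new HashMap<>();

  /** Drop every trace of a finished attempt so it cannot stay "scheduled" forever. */
  void forgetAttempt(String attemptId) {
    earlierFailedMaps.remove(attemptId);                        // 1. earlier-failed-maps list
    mapsHostMapping.values().forEach(l -> l.remove(attemptId)); // 2. host mapping
    mapsRackMapping.values().forEach(l -> l.remove(attemptId)); // 3. rack mapping
    attemptToContainerRequest.remove(attemptId);                // 4. attempt -> container request mapping
  }

  public static void main(String[] args) {
    ScheduledRequestsSketch s = new ScheduledRequestsSketch();
    s.earlierFailedMaps.add("attempt_1502793246072_73922_m_012103_5");
    s.mapsHostMapping.computeIfAbsent("host1", k -> new LinkedList<>())
        .add("attempt_1502793246072_73922_m_012103_5");
    s.forgetAttempt("attempt_1502793246072_73922_m_012103_5");
    System.out.println(s.earlierFailedMaps.isEmpty()
        && s.mapsHostMapping.get("host1").isEmpty());   // prints: true
  }
}
{code}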


[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2019-01-04 Thread Xianghao Lu (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734773#comment-16734773
 ] 

Xianghao Lu commented on MAPREDUCE-6944:


[~Jack-Lee] Thanks for your work. As far as I know, your pull request is 
similar to my earlier fix (please see the photo): it only covers the first 
case below, in which a container request or container assignment happens. In 
the second case nothing container-related happens at all, so when that case 
occurs the job will still hang, while my patch covers both cases. Am I wrong? 
What do you think?

# allocating a container with PRIORITY_MAP to a rescheduled failed map (it should be PRIORITY_FAST_FAIL_MAP)
# a rescheduled failed map is killed or fails without an assigned container

!image-2019-01-05-12-03-19-887.png!


[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2019-01-04 Thread lqjacklee (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734748#comment-16734748
 ] 

lqjacklee commented on MAPREDUCE-6944:
--

https://github.com/apache/hadoop/pull/456


[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2018-12-27 Thread lqjacklee (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729638#comment-16729638
 ] 

lqjacklee commented on MAPREDUCE-6944:
--

[~luxianghao] Thanks, I will append the test case you provided.


[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2018-12-27 Thread Xianghao Lu (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729561#comment-16729561
 ] 

Xianghao Lu commented on MAPREDUCE-6944:


My case is as follows:
# create a queue Q1 with some memory and 2 vcores
# submit the MR_APP (3 map tasks and 1 reduce task) to Q1 and wait for a map task, map1, to run
# change the vcores of Q1 from 2 to 1 so that all remaining map tasks stay pending
# find the host that map1 runs on and make that host unusable via the NM health check script
# map1 will be killed, and a new map attempt, map1_fail, will be scheduled as a failed map and added to earlierFailedMaps
# map1_fail stays pending because there is no resource; then kill map1_fail
# change the vcores of Q1 from 1 to 10; the MR_APP will then hang


[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2018-12-25 Thread lqjacklee (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728714#comment-16728714
 ] 

lqjacklee commented on MAPREDUCE-6944:
--

[~luxianghao] Could you append the test case?


[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2018-12-25 Thread Xianghao Lu (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728705#comment-16728705
 ] 

Xianghao Lu commented on MAPREDUCE-6944:


I can not attach the patch file in the Attachments area, so I paste the patch text here.
{quote}
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
index def9872..a00354f 100644
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
@@ -1564,6 +1564,14 @@ public void handle(TaskAttemptEvent event) {
       Task task = job.getTask(event.getTaskAttemptID().getTaskId());
       TaskAttempt attempt = task.getAttempt(event.getTaskAttemptID());
       ((EventHandler<TaskAttemptEvent>) attempt).handle(event);
+
+      // fix bug of app hang because of attemptID not removed from earlierFailedMaps in some cases, such as
+      // 1 allocating a container with PRIORITY_MAP to a rescheduled failed map (should be PRIORITY_FAST_FAIL_MAP)
+      // 2 a rescheduled failed map is killed or failed without assigned container
+      if (attempt.isFinished() && ((RMContainerAllocator) ((ContainerAllocatorRouter) containerAllocator).containerAllocator).scheduledRequests.earlierFailedMaps.size() > 0
+          && ((RMContainerAllocator) ((ContainerAllocatorRouter) containerAllocator).containerAllocator).scheduledRequests.earlierFailedMaps.remove(event.getTaskAttemptID())) {
+        LOG.info("Remove " + event.getTaskAttemptID() + " from earlierFailedMaps");
+      }
     }
   }

diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
index e459cb5..c99a098 100644
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
@@ -153,7 +153,7 @@ added to the pending and are ramped up (added to scheduled) based
   private final AssignedRequests assignedRequests;

   //holds scheduled requests to be fulfilled by RM
-  private final ScheduledRequests scheduledRequests = new ScheduledRequests();
+  public final ScheduledRequests scheduledRequests = new ScheduledRequests();

   private int containersAllocated = 0;
   private int containersReleased = 0;
@@ -1042,11 +1042,10 @@ public Resource getResourceLimit() {
       Resources.add(assignedMapResource, assignedReduceResource));
   }

-  @Private
   @VisibleForTesting
-  class ScheduledRequests {
+  public class ScheduledRequests {

-    private final LinkedList<TaskAttemptId> earlierFailedMaps =
+    public final LinkedList<TaskAttemptId> earlierFailedMaps =
         new LinkedList<TaskAttemptId>();

     /** Maps from a host to a list of Map tasks with data on the host */
{quote}


[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2018-12-25 Thread Xianghao Lu (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728697#comment-16728697
 ] 

Xianghao Lu commented on MAPREDUCE-6944:


I have similar problems, and I find that the app hangs because the attemptID is 
not removed from earlierFailedMaps in some cases, such as:
# allocating a container with PRIORITY_MAP to a rescheduled failed map (it should be PRIORITY_FAST_FAIL_MAP)
# a rescheduled failed map is killed or fails without an assigned container

This leaves the AM unable to request resources because of a wrong request limit 
(refer to applyRequestLimits() in RMContainerRequestor.java).
I have uploaded a patch and tested it in my cluster; it works fine. [~eepayne], 
[~daemon], [~ashwinshankar77] would you like to have a look?

{quote}
Because in our scenario, all of MR REDUCEs does not start executing.
{quote}
When mapreduce.job.reduce.slowstart.completedmaps is 1, the reduce tasks will 
not run until all map tasks have finished.
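
For reference, the threshold in the log above (completedMapsForReduceSlowstart 15564 while CompletedMaps is 15563) is consistent with a slowstart value of 1.0 over 15564 maps. Here is a minimal sketch of that arithmetic; the ceil() rounding detail is an assumption for illustration, not quoted from Hadoop.

{code:java}
// Quick check of the slow-start arithmetic seen in the log above.
// The ceil() rounding is an assumption for illustration, not quoted from Hadoop.
public class SlowstartSketch {
  static int completedMapsForReduceSlowstart(double slowstart, int totalMaps) {
    return (int) Math.ceil(slowstart * totalMaps);
  }

  public static void main(String[] args) {
    // 15564 total maps with slowstart = 1.0 -> reduces wait for all 15564 maps,
    // matching "completedMapsForReduceSlowstart 15564" while CompletedMaps is 15563.
    System.out.println(completedMapsForReduceSlowstart(1.0, 15564));  // 15564
    System.out.println(completedMapsForReduceSlowstart(0.05, 15564)); // 779
  }
}
{code}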



[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2018-03-13 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397543#comment-16397543
 ] 

Ashwin Shankar commented on MAPREDUCE-6944:
---

[~daemon] we are having similar problems. Were you able to root cause this?


[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2017-08-27 Thread YunFan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143134#comment-16143134
 ] 

YunFan Zhou commented on MAPREDUCE-6944:


Thank [~eepayne], we used the version of *hadoop-2.7.2*.


[jira] [Commented] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time

2017-08-25 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141978#comment-16141978
 ] 

Eric Payne commented on MAPREDUCE-6944:
---

[~daemon], what version of Hadoop are you running?
