[
https://issues.apache.org/jira/browse/YARN-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969940#comment-16969940
]
Xianghao Lu edited comment on YARN-9957 at 11/8/19 8:29 AM:
------------------------------------------------------------
IMO, the root cause of the following case in YARN-7382 is
[app.getPendingDemand|#L373], which we rely on to find apps with pending
resource requests.
We know that demand = pending + usage. When the maps get to 100%, usage drops
to 0 once the map container completes, but demand is not updated immediately,
so the scheduler mistakenly thinks there are still pending resource requests.
{quote}
While running an MR job (e.g. sleep) and an RM failover occurs, once the maps
gets to 100%, the now active RM will crash
{quote}
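To make the timing concrete, here is a minimal toy model of the relationship described above. It is only a sketch and not Hadoop code: the class, fields, and methods are all hypothetical, and it assumes that demand is a cached value refreshed by a separate update pass while usage changes as soon as a container completes.

```java
// Toy model only: every name here is hypothetical, none of this is Hadoop code.
class StaleDemandSketch {
    private long demandMb;   // cached value, refreshed only by updateDemand()
    private long usageMb;    // changes immediately on allocate/complete
    private long pendingMb;  // the requests that are really outstanding

    void request(long mb) { pendingMb += mb; }

    void allocate(long mb) { pendingMb -= mb; usageMb += mb; }

    void complete(long mb) { usageMb -= mb; }

    // demand = pending + usage, recomputed only when this pass runs
    void updateDemand() { demandMb = pendingMb + usageMb; }

    // What a caller sees if it derives pending from the cached demand
    long derivedPendingMb() { return demandMb - usageMb; }

    long actualPendingMb() { return pendingMb; }

    public static void main(String[] args) {
        StaleDemandSketch app = new StaleDemandSketch();
        app.request(4096);
        app.allocate(4096);
        app.updateDemand();   // demand 4096, usage 4096
        app.complete(4096);   // usage 0, demand still 4096
        System.out.println(app.derivedPendingMb()); // 4096: looks pending
        System.out.println(app.actualPendingMb());  // 0: nothing is pending
        app.updateDemand();   // the next update pass catches up
        System.out.println(app.derivedPendingMb()); // 0
    }
}
```

Between the container completing and the next update pass, the derived value reports 4096 MB pending although no request is outstanding.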
Here is my test log
{quote}2019-11-08 14:39:58,640 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: lxh debug app application_1573179594570_0001 demand <memory:4096, vCores:1>resouceusage <memory:4096, vCores:1>
2019-11-08 14:39:59,643 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: lxh debug app application_1573179594570_0001{color:#FF0000} demand <memory:4096, vCores:1>resouceusage <memory:4096, vCores:1>{color}
2019-11-08 14:40:00,439 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e2053_1573179594570_0001_01_000003 Container Transitioned {color:#FF0000}from RUNNING to COMPLETED{color}
2019-11-08 14:40:00,439 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1573179594570_0001 CONTAINERID=container_e2053_1573179594570_0001_01_000003 RESOURCE=<memory:4096, vCores:1>
2019-11-08 14:40:00,440 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: app application_1573179594570_0001 {color:#FF0000}demand <memory:4096, vCores:1>resouceusage <memory:0, vCores:0>{color}
2019-11-08 14:40:00,440 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: app application_1573179594570_0001 demand <memory:4096, vCores:1>resouceusage <memory:0, vCores:0>
2019-11-08 14:40:00,440 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: app application_1573179594570_0001 pending demand <memory:4096, vCores:1>{color:#FF0000}schedulerKey []{color}
2019-11-08 14:40:00,442 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher
java.util.NoSuchElementException
at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2053)
at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:372)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:934)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1359)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:346)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:207)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1034)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:902)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1119)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
at java.lang.Thread.run(Thread.java:748)
2019-11-08 14:40:00,443 INFO org.apache.hadoop.yarn.event.EventDispatcher: Exiting, bbye..
{quote}
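The exception at the top of the trace is standard JDK behavior: ConcurrentSkipListSet.first() throws NoSuchElementException on an empty set, which matches the empty "schedulerKey []" in the log. The sketch below reproduces just that behavior with a stand-in set of integers. The class and method names are hypothetical, and the guarded variant is only one possible way to avoid the throw; it is not taken from the attached patches.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListSet;

// Stand-in for the scheduler key set; names are hypothetical, not Hadoop code.
class EmptyKeySetSketch {
    // Same shape as the call in the trace: throws when the set is empty.
    static Integer firstUnguarded(ConcurrentSkipListSet<Integer> keys) {
        return keys.first();
    }

    // Guarded variant: the iterator never throws on an empty set.
    static Optional<Integer> firstGuarded(ConcurrentSkipListSet<Integer> keys) {
        Iterator<Integer> it = keys.iterator();
        return it.hasNext() ? Optional.of(it.next()) : Optional.empty();
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> keys = new ConcurrentSkipListSet<>();
        try {
            firstUnguarded(keys);
        } catch (NoSuchElementException e) {
            System.out.println("first() threw on an empty set");
        }
        System.out.println(firstGuarded(keys).isPresent()); // false
        keys.add(3);
        keys.add(1);
        System.out.println(firstGuarded(keys).get()); // 1
    }
}
```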
> The first container we recover may not be the AM
> ------------------------------------------------
>
> Key: YARN-9957
> URL: https://issues.apache.org/jira/browse/YARN-9957
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.9.1
> Reporter: Xianghao Lu
> Assignee: Xianghao Lu
> Priority: Major
> Fix For: 2.9.1
>
> Attachments: 1.jpg, 2.jpg, YARN-9957-branch-2.9.1.001.patch,
> YARN-9957-branch-2.9.1.002.patch
>
>
> YARN-7382 says that, when not running unmanaged, the first container we
> recover is always the AM. In practice this is not always the case, which can
> lead to wrong AM resource usage after RM recovery.