[ https://issues.apache.org/jira/browse/YARN-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271453#comment-15271453 ]
Jason Lowe commented on YARN-5039:
----------------------------------
The screenshot also shows one app pending and one running -- I assume we're
discussing the app that's already running, since pending apps won't be scheduled
at all until they are activated, which happens once the AM resource and user
limits allow it.
It's interesting that most of the nodes that have any containers at all don't
have enough space for the container. Is it expected that the containers for
this app would cluster on just a few nodes like this? If not, then it looks as
if the scheduler is somehow ignoring the other nodes, which would explain why
the app hangs once the remaining nodes fill up.
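
To put rough numbers on this, the sketch below reuses the figures from the
LeafQueue log lines quoted in the issue description (memory values are in MB).
It shows that the aggregate free memory would fit about 15 of the reserved
50688 MB containers, while a cluster can still have no single node with room
for one. The per-node free-memory list is purely hypothetical, chosen only to
illustrate the fragmentation case; the real per-node figures are not in the
log.

```python
# Figures copied from the LeafQueue log lines quoted in this issue (MB).
CLUSTER_MB = 2658304   # cluster=<memory:2658304, ...>
USED_MB = 1894464      # used=<memory:1894464, ...>
REQUEST_MB = 50688     # resource=<memory:50688, vCores:1>


def used_capacity(used_mb, cluster_mb):
    """Fraction of cluster memory in use (matches usedCapacity in the log)."""
    return used_mb / cluster_mb


def aggregate_fit(used_mb, cluster_mb, request_mb):
    """Upper bound on containers that fit if all free memory were on one node."""
    return (cluster_mb - used_mb) // request_mb


def fits_on_any_node(free_mb_per_node, request_mb):
    """True if at least one node has enough free memory for the request."""
    return any(free >= request_mb for free in free_mb_per_node)


# HYPOTHETICAL layout: 16 nodes with 47740 MB free each. The total equals the
# cluster's free memory, but no single node can hold a 50688 MB container.
hypothetical_free = [47740] * 16

print(round(used_capacity(USED_MB, CLUSTER_MB), 7))
print(aggregate_fit(USED_MB, CLUSTER_MB, REQUEST_MB))
print(fits_on_any_node(hypothetical_free, REQUEST_MB))
```

If the real nodes look like the hypothetical layout, the scheduler can only
reserve and wait, which would be consistent with the repeated "Reserved
container" messages.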
We probably need some more logging to see exactly what's going on. It would be
helpful to turn on debug logging for
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue so
we can see when each node heartbeats and can get more visibility into the
scheduling for each node. That can be dynamically enabled/disabled via the
logLevel servlet at http://<RMwebaddr>/logLevel.
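
As a minimal sketch of what that request could look like, the helper below
builds the servlet URL for the LeafQueue logger. It assumes the servlet takes
`log` and `level` query parameters; the host and port `rm.example.com:8088` are
placeholders for the real RM web address, and the helper name is made up for
illustration.

```python
from urllib.parse import urlencode

LEAF_QUEUE_LOGGER = (
    "org.apache.hadoop.yarn.server.resourcemanager."
    "scheduler.capacity.LeafQueue"
)


def log_level_url(rm_web_addr, logger, level=None):
    """Build a logLevel servlet URL.

    Without `level` the URL only queries the logger; with `level`
    (e.g. "DEBUG" or "INFO") it asks the servlet to change it.
    Assumes the servlet reads the `log` and `level` query parameters.
    """
    params = {"log": logger}
    if level is not None:
        params["level"] = level
    return f"http://{rm_web_addr}/logLevel?{urlencode(params)}"


# Placeholder address; substitute the real RM web address.
print(log_level_url("rm.example.com:8088", LEAF_QUEUE_LOGGER, "DEBUG"))
```

Opening the printed URL in a browser (or fetching it with any HTTP client)
should switch the logger to DEBUG; the same URL with `INFO` switches it back
once enough heartbeats have been captured.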
> Applications ACCEPTED but not starting
> --------------------------------------
>
> Key: YARN-5039
> URL: https://issues.apache.org/jira/browse/YARN-5039
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Miles Crawford
> Attachments: Screen Shot 2016-05-04 at 1.57.19 PM.png
>
>
> Often when we submit applications to an incompletely utilized cluster, they
> sit, unable to start for no apparent reason.
> There are multiple nodes in the cluster with available resources, but the
> resourcemanager logs show that scheduling is being skipped. The scheduling is
> skipped because the application itself has reserved the node? I'm not sure
> how to interpret this log output:
> {code}
> 2016-05-04 20:19:21,315 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
> (ResourceManager Event Processor): Trying to fulfill reservation for
> application application_1462291866507_0025 on node:
> ip-10-12-43-54.us-west-2.compute.internal:8041
> 2016-05-04 20:19:21,316 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue
> (ResourceManager Event Processor): Reserved container
> application=application_1462291866507_0025 resource=<memory:50688, vCores:1>
> queue=default: capacity=1.0, absoluteCapacity=1.0,
> usedResources=<memory:1894464, vCores:33>, usedCapacity=0.7126589,
> absoluteUsedCapacity=0.7126589, numApps=2, numContainers=33
> usedCapacity=0.7126589 absoluteUsedCapacity=0.7126589 used=<memory:1894464,
> vCores:33> cluster=<memory:2658304, vCores:704>
> 2016-05-04 20:19:21,316 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
> (ResourceManager Event Processor): Skipping scheduling since node
> ip-10-12-43-54.us-west-2.compute.internal:8041 is reserved by application
> appattempt_1462291866507_0025_000001
> 2016-05-04 20:19:22,232 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
> (ResourceManager Event Processor): Trying to fulfill reservation for
> application application_1462291866507_0025 on node:
> ip-10-12-43-53.us-west-2.compute.internal:8041
> 2016-05-04 20:19:22,232 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue
> (ResourceManager Event Processor): Reserved container
> application=application_1462291866507_0025 resource=<memory:50688, vCores:1>
> queue=default: capacity=1.0, absoluteCapacity=1.0,
> usedResources=<memory:1894464, vCores:33>, usedCapacity=0.7126589,
> absoluteUsedCapacity=0.7126589, numApps=2, numContainers=33
> usedCapacity=0.7126589 absoluteUsedCapacity=0.7126589 used=<memory:1894464,
> vCores:33> cluster=<memory:2658304, vCores:704>
> 2016-05-04 20:19:22,232 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
> (ResourceManager Event Processor): Skipping scheduling since node
> ip-10-12-43-53.us-west-2.compute.internal:8041 is reserved by application
> appattempt_1462291866507_0025_000001
> 2016-05-04 20:19:22,316 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
> (ResourceManager Event Processor): Trying to fulfill reservation for
> application application_1462291866507_0025 on node:
> ip-10-12-43-54.us-west-2.compute.internal:8041
> 2016-05-04 20:19:22,316 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue
> (ResourceManager Event Processor): Reserved container
> application=application_1462291866507_0025 resource=<memory:50688, vCores:1>
> queue=default: capacity=1.0, absoluteCapacity=1.0,
> usedResources=<memory:1894464, vCores:33>, usedCapacity=0.7126589,
> absoluteUsedCapacity=0.7126589, numApps=2, numContainers=33
> usedCapacity=0.7126589 absoluteUsedCapacity=0.7126589 used=<memory:1894464,
> vCores:33> cluster=<memory:2658304, vCores:704>
> 2016-05-04 20:19:22,316 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
> (ResourceManager Event Processor): Skipping scheduling since node
> ip-10-12-43-54.us-west-2.compute.internal:8041 is reserved by application
> appattempt_1462291866507_0025_000001
> {code}