[
https://issues.apache.org/jira/browse/YARN-9449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brandon Scheller updated YARN-9449:
-----------------------------------
Description:
https://issues.apache.org/jira/browse/YARN-5342 added a counter to YARN so that
unscheduled resource requests are first attempted on unlabeled nodes.
This counter is reset only when a scheduling attempt happens on an unlabeled
node.
On Hadoop clusters with only labeled nodes, the counter can never be reset, so
the labeled node is never skipped.
Because the node is not skipped, the scheduler enters the loop shown in the
YARN RM logs below.
This can block scheduling of a Spark executor, for example, and leave the Spark
application stuck.
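A minimal, hypothetical sketch of the failure mode (class, field, and method names and the threshold value are illustrative only, not the actual YARN source): the counter that gates skipping the labeled node is reset only by an attempt on an unlabeled node, so on an all-labeled cluster it crosses its threshold once and never comes back down.

```java
/**
 * Illustrative sketch of the counter behavior described above.
 * All names and the threshold are hypothetical, not YARN internals.
 */
public class NonExclusiveLabelLoopDemo {

    // Attempts on labeled nodes since the last attempt on an unlabeled node.
    static int missedNonPartitionedOpportunities = 0;

    // While the counter is below this threshold, the labeled node is skipped
    // so the request can be tried on an unlabeled node first.
    static final int THRESHOLD = 1;

    static boolean skipLabeledNode() {
        return missedNonPartitionedOpportunities < THRESHOLD;
    }

    static void attemptSchedule(boolean nodeIsLabeled) {
        if (nodeIsLabeled) {
            missedNonPartitionedOpportunities++;
        } else {
            // The only reset path. On a cluster with no unlabeled nodes this
            // branch never runs, so once the threshold is crossed the labeled
            // node is never skipped again, and the reserve/unreserve loop in
            // the logs below repeats indefinitely.
            missedNonPartitionedOpportunities = 0;
        }
    }

    public static void main(String[] args) {
        System.out.println("before any attempt, skip = " + skipLabeledNode());
        // All nodes are labeled, so every attempt bumps the counter.
        attemptSchedule(true);
        attemptSchedule(true);
        System.out.println("after labeled-only attempts, skip = " + skipLabeledNode());
    }
}
```

With only labeled nodes in the cluster, `skipLabeledNode()` flips to false after the first attempt and stays false, which matches the repeating reserve/unreserve pattern in the log excerpt.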
{noformat}
2019-02-18 23:54:22,591 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_1550533628872_0003_01_000023 Container Transitioned from NEW to RESERVED
2019-02-18 23:54:22,591 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator (ResourceManager Event Processor): Reserved container application=application_1550533628872_0003 resource=<memory:11264, vCores:1> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6ffe0dc3 cluster=<memory:24576, vCores:16>
2019-02-18 23:54:22,592 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue (ResourceManager Event Processor): assignedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:24576, vCores:16>
2019-02-18 23:54:23,592 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Trying to fulfill reservation for application application_1550533628872_0003 on node: ip-10-0-0-122.ec2.internal:8041
2019-02-18 23:54:23,592 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp (ResourceManager Event Processor): Application application_1550533628872_0003 unreserved on node host: ip-10-0-0-122.ec2.internal:8041 #containers=1 available=<memory:1024, vCores:7> used=<memory:11264, vCores:1>, currently has 0 at priority 1; currentReservation <memory:0, vCores:0> on node-label=LABELED
2019-02-18 23:54:23,593 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_1550533628872_0003_01_000024 Container Transitioned from NEW to RESERVED
2019-02-18 23:54:23,593 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator (ResourceManager Event Processor): Reserved container application=application_1550533628872_0003 resource=<memory:11264, vCores:1> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6ffe0dc3 cluster=<memory:24576, vCores:16>
2019-02-18 23:54:23,593 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue (ResourceManager Event Processor): assignedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:24576, vCores:16>
2019-02-18 23:54:24,593 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Trying to fulfill reservation for application application_1550533628872_0003 on node: ip-10-0-0-122.ec2.internal:8041
2019-02-18 23:54:24,593 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp (ResourceManager Event Processor): Application application_1550533628872_0003 unreserved on node host: ip-10-0-0-122.ec2.internal:8041 #containers=1 available=<memory:1024, vCores:7> used=<memory:11264, vCores:1>, currently has 0 at priority 1; currentReservation <memory:0, vCores:0> on node-label=LABELED
2019-02-18 23:54:24,594 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_1550533628872_0003_01_000025 Container Transitioned from NEW to RESERVED
2019-02-18 23:54:24,594 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator (ResourceManager Event Processor): Reserved container application=application_1550533628872_0003 resource=<memory:11264, vCores:1> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6ffe0dc3 cluster=<memory:24576, vCores:16>
2019-02-18 23:54:24,594 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue (ResourceManager Event Processor): assignedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:24576, vCores:16>
2019-02-18 23:54:25,594 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Trying to fulfill reservation for application application_1550533628872_0003 on node: ip-10-0-0-122.ec2.internal:8041
2019-02-18 23:54:25,595 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp (ResourceManager Event Processor): Application application_1550533628872_0003 unreserved on node host: ip-10-0-0-122.ec2.internal:8041 #containers=1 available=<memory:1024, vCores:7> used=<memory:11264, vCores:1>, currently has 0 at priority 1; currentReservation <memory:0, vCores:0> on node-label=LABELED
{noformat}
was:
https://issues.apache.org/jira/browse/YARN-5342 Added a counter to Yarn so that
unscheduled resource requests were attempted to be scheduled on unlabeled nodes
first.
This counter is reset only when an attempt to schedule happens on an unlabeled
node.
On hadoop clusters with only labeled nodes, this counter can never be reset and
therefore it will block skipping that node.
Because the node will not be skipped, it creates the loop shown below in the
Yarn RM logs.
This can block scheduling of an app master and cause applications to get stuck.
(RM log excerpt identical to the one above.)
> Non-exclusive labels can create reservation loop on cluster without unlabeled
> node
> ----------------------------------------------------------------------------------
>
> Key: YARN-9449
> URL: https://issues.apache.org/jira/browse/YARN-9449
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.8.5
> Reporter: Brandon Scheller
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]