[
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benjamin Teke updated YARN-10295:
---------------------------------
Description:
When asynchronous scheduling is enabled in the CapacityScheduler, there is an
edge case where a NullPointerException can cause the scheduler thread to exit
and applications to get stuck without allocated resources. Consider the
following log:
{code:java}
2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14 available=<memory:2048, vCores:11> used=<memory:182272, vCores:14> with resource=<memory:4096, vCores:1>
2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved on node host: ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14 available=<memory:2048, vCores:11> used=<memory:182272, vCores:14>, currently has 0 at priority 11; currentReservation <memory:0, vCores:0> on node-label=
2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
{code}
A container gets reserved on a host, but the host doesn't have enough memory,
so after a short while it gets unreserved. However, because the scheduler
thread runs asynchronously, it might already have entered the following if
block, located at
[CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602],
because at that time _node.getReservedContainer()_ wasn't null. Calling it a
second time to get the ApplicationAttemptId throws an NPE, because the
container was unreserved in the meantime.
{code:java}
// Do not schedule if there are any reservations to fulfill on the node
if (node.getReservedContainer() != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping scheduling since node " + node.getNodeID()
        + " is reserved by application " + node.getReservedContainer()
        .getContainerId().getApplicationAttemptId());
  }
  return null;
}
{code}
A fix would be to store the reserved container object in a local variable
before the if block and use that reference throughout.
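The race and the fix can be sketched with a simplified stand-in for the node
class (the Node class and its methods below are illustrative only, not
Hadoop's actual API):
{code:java}
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in for the scheduler node (illustrative only): another
// thread may clear the reservation at any moment, mirroring the
// async-scheduling race described above.
class Node {
  private final AtomicReference<String> reservedContainer =
      new AtomicReference<>("container_1");

  String getReservedContainer() { return reservedContainer.get(); }
  void unreserve() { reservedContainer.set(null); }

  // Buggy pattern: the field is read twice, so a concurrent unreserve()
  // between the null check and the second read causes an NPE.
  String describeReservationUnsafe() {
    if (getReservedContainer() != null) {
      return "reserved by " + getReservedContainer().toUpperCase();
    }
    return "no reservation";
  }

  // Fixed pattern: read the field once into a local variable and use that
  // reference for both the null check and the later access.
  String describeReservationSafe() {
    String reserved = getReservedContainer();
    if (reserved != null) {
      return "reserved by " + reserved.toUpperCase();
    }
    return "no reservation";
  }
}
{code}
With the fixed pattern, even if unreserve() runs between the check and the
use, the local reference still points at the container that was reserved at
check time, so the debug message can be built without an NPE.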
Only branch-3.1 and branch-3.2 are affected, because the newer branches have
YARN-9664, which indirectly fixed this.
was:
When asynchronous scheduling is enabled in the CapacityScheduler, there is an
edge case where a NullPointerException can cause the scheduler thread to exit
and applications to get stuck without allocated resources. Consider the
following log:
{code:java}
2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14 available=<memory:2048, vCores:11> used=<memory:182272, vCores:14> with resource=<memory:4096, vCores:1>
2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved on node host: ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14 available=<memory:2048, vCores:11> used=<memory:182272, vCores:14>, currently has 0 at priority 11; currentReservation <memory:0, vCores:0> on node-label=
2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
{code}
A container gets reserved on a host, but the host doesn't have enough memory,
so after a short while it gets unreserved. However, because the scheduler
thread runs asynchronously, it might already have entered the following if
block, located at
[CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602],
because at that time _node.getReservedContainer()_ wasn't null. Calling it a
second time to get the ApplicationAttemptId throws an NPE, because the
container was unreserved in the meantime.
{code:java}
// Do not schedule if there are any reservations to fulfill on the node
if (node.getReservedContainer() != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping scheduling since node " + node.getNodeID()
        + " is reserved by application " + node.getReservedContainer()
        .getContainerId().getApplicationAttemptId());
  }
  return null;
}
{code}
A fix would be to store the reserved container object in a local variable
before the if block, and as a precaution the
org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl#getId/setId
methods should be declared synchronized, as they'll be accessed from multiple
threads.
Only branch-3.1 and branch-3.2 are affected, because the newer branches have
YARN-9664, which indirectly fixed this.
> CapacityScheduler NPE can cause apps to get stuck without resources
> -------------------------------------------------------------------
>
> Key: YARN-10295
> URL: https://issues.apache.org/jira/browse/YARN-10295
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 3.1.0, 3.2.0
> Reporter: Benjamin Teke
> Assignee: Benjamin Teke
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]