[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Teke updated YARN-10295:
---------------------------------
Description: When CapacityScheduler asynchronous scheduling is enabled, there is an edge case where a NullPointerException can cause the scheduler thread to exit and the apps to get stuck without allocated resources. Consider the following log:
{code:java}
2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14 available=<memory:2048, vCores:11> used=<memory:182272, vCores:14> with resource=<memory:4096, vCores:1>
2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved on node host: ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14 available=<memory:2048, vCores:11> used=<memory:182272, vCores:14>, currently has 0 at priority 11; currentReservation <memory:0, vCores:0> on node-label=
2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception.
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
{code}
A container gets reserved on a host, but because the host doesn't have enough memory, it gets unreserved after a short while. However, since the scheduler thread runs asynchronously, it may already have entered the following if block located in [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602], because _node.getReservedContainer()_ was not null at the time of the check. Calling it a second time to get the ApplicationAttemptId then throws an NPE, because the container was unreserved in the meantime.
{code:java}
// Do not schedule if there are any reservations to fulfill on the node
if (node.getReservedContainer() != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping scheduling since node " + node.getNodeID()
        + " is reserved by application " + node.getReservedContainer()
        .getContainerId().getApplicationAttemptId());
  }
  return null;
}
{code}
A fix would be to store the container object in a local variable before the if statement, and as a precaution the org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl#getId/setId methods should be declared synchronized, as they are accessed from multiple threads. Only branch-3.1/3.2 is affected, because the newer branches have YARN-9664, which indirectly fixed this.
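The shape of the fix can be illustrated with a minimal, self-contained sketch. Note that the {{Node}} and {{Container}} classes below are simplified stand-ins for illustration only, not the actual Hadoop SchedulerNode/RMContainer/ContainerPBImpl types: the reserved container is read into a local variable exactly once, so an unreserve racing in from another thread can no longer null the reference between the check and the use.

```java
// Simplified stand-ins for the scheduler types; NOT the real Hadoop classes.
class Container {
  private String id;

  Container(String id) { this.id = id; }

  // Declared synchronized as a precaution, mirroring the suggestion for
  // ContainerPBImpl#getId/setId, since multiple threads access the record.
  synchronized String getId() { return id; }
  synchronized void setId(String id) { this.id = id; }
}

class Node {
  // volatile so the scheduler thread observes an unreserve() from another thread
  private volatile Container reservedContainer;

  void reserve(Container c) { reservedContainer = c; }
  void unreserve() { reservedContainer = null; }
  Container getReservedContainer() { return reservedContainer; }

  // Race-free variant of the skip-if-reserved check: the reserved container
  // is read exactly once, so a concurrent unreserve() cannot null it between
  // the null check and the getId() call.
  String reservedContainerId() {
    Container reserved = getReservedContainer(); // single read
    return (reserved != null) ? reserved.getId() : null;
  }
}

public class ReservedContainerRaceSketch {
  public static void main(String[] args) {
    Node node = new Node();
    node.reserve(new Container("container_1"));
    System.out.println(node.reservedContainerId()); // prints container_1
    node.unreserve();
    System.out.println(node.reservedContainerId()); // prints null
  }
}
```

Reading a mutable reference once into a local is the standard defense against this kind of check-then-act race; it avoids taking a lock on the hot scheduling path while still guaranteeing the dereferenced object is the one that passed the null check.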
> CapacityScheduler NPE can cause apps to get stuck without resources
> -------------------------------------------------------------------
>
>                 Key: YARN-10295
>                 URL: https://issues.apache.org/jira/browse/YARN-10295
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.1.0, 3.2.0
>            Reporter: Benjamin Teke
>            Assignee: Benjamin Teke
>            Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org