[ https://issues.apache.org/jira/browse/YARN-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456229#comment-16456229 ]
Weiwei Yang commented on YARN-8222:
-----------------------------------

Very straightforward fix, +1. Pending Jenkins.
Thanks [~Tao Yang]

> NPE when calling rmContext.getRMApps().get(...).getCurrentAppAttempt()
> ----------------------------------------------------------------------
>
>                 Key: YARN-8222
>                 URL: https://issues.apache.org/jira/browse/YARN-8222
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Critical
>         Attachments: YARN-8222.001.patch
>
>
> Recently we ran some performance tests and found two NPE problems when
> calling rmContext.getRMApps().get(appId).get...
> These NPEs occurred occasionally during performance tests with a large
> number of fast-finishing applications. We checked the other places that
> call rmContext.getRMApps().get(...): most of them have a null check, and
> the rest do not need one (the process guarantees that the result will not
> be null).
> To fix these problems, we can add a null check on the application before
> getting the attempt from it.
> (1) NPE in RMContainerImpl$FinishedTransition#updateAttemptMetrics
> {noformat}
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.updateAttemptMetrics(RMContainerImpl.java:742)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:715)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:699)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:482)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:64)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.containerCompleted(FiCaSchedulerApp.java:195)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1793)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:2624)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:663)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1514)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:2396)
>         at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:205)
>         at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:60)
>         at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:834)
> {noformat}
> This NPE appears to happen when a node heartbeat is delayed and attempt
> metrics are then updated for an app that no longer exists.
> Reference code of RMContainerImpl$FinishedTransition#updateAttemptMetrics:
> {code:java}
> private static void updateAttemptMetrics(RMContainerImpl container) {
>   Resource resource = container.getContainer().getResource();
>   RMAppAttempt rmAttempt = container.rmContext.getRMApps()
>       .get(container.getApplicationAttemptId().getApplicationId())
>       .getCurrentAppAttempt();
>   if (rmAttempt != null) {
>     //....
>   }
> }
> {code}
> (2) NPE in SchedulerApplicationAttempt#incNumAllocatedContainers
> {noformat}
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.incNumAllocatedContainers(SchedulerApplicationAttempt.java:1268)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:638)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:3589)
>         at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.tryCommit(SLSCapacityScheduler.java:142)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:962)
> {noformat}
> This NPE should happen when applying an outdated proposal for an
> application that no longer exists in rmContext.
> Reference code:
> {code:java}
> RMAppAttempt attempt =
>     rmContext.getRMApps().get(attemptId.getApplicationId())
>         .getCurrentAppAttempt();
> if (attempt != null) {
>   attempt.getRMAppAttemptMetrics().incNumAllocatedContainers(containerType,
>       requestType);
> }
> {code}
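
For illustration only, here is a minimal, self-contained sketch of the null-check pattern the description proposes. It is not the content of YARN-8222.001.patch: the `App` and `Attempt` classes, the `String` application ids, and the helper names are hypothetical stand-ins for RMApp, RMAppAttempt and ApplicationId, chosen so the example runs without Hadoop on the classpath. The point is that both the map lookup and the attempt lookup can return null, so each result is checked before it is dereferenced.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullSafeLookup {

    /** Hypothetical stand-in for RMAppAttempt. */
    static final class Attempt {
        int allocatedContainers;
    }

    /** Hypothetical stand-in for RMApp. */
    static final class App {
        private final Attempt current;

        App(Attempt current) {
            this.current = current;
        }

        Attempt getCurrentAppAttempt() {
            return current;
        }
    }

    /**
     * Returns the current attempt, or null if the app has already been
     * removed from the map or has no current attempt.
     */
    static Attempt currentAttemptOrNull(Map<String, App> apps, String appId) {
        App app = apps.get(appId);
        if (app == null) {
            // The app finished and was removed before this event was handled.
            return null;
        }
        return app.getCurrentAppAttempt();
    }

    /** Updates the metric only when the attempt still exists. */
    static boolean incAllocated(Map<String, App> apps, String appId) {
        Attempt attempt = currentAttemptOrNull(apps, appId);
        if (attempt == null) {
            return false;
        }
        attempt.allocatedContainers++;
        return true;
    }

    public static void main(String[] args) {
        Map<String, App> apps = new ConcurrentHashMap<>();
        apps.put("app_1", new App(new Attempt()));

        System.out.println(incAllocated(apps, "app_1")); // prints "true"

        // Simulate a fast-finishing application that is gone by the time
        // a late event arrives.
        apps.remove("app_1");
        System.out.println(incAllocated(apps, "app_1")); // prints "false"
    }
}
```

Holding the looked-up application in a local variable also means the map is read only once, so the check and the later dereference see the same object even if another thread removes the entry in between.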