[ 
https://issues.apache.org/jira/browse/YARN-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8222:
------------------------------
    Summary: Fix potential NPE when gets RMApp from RM context  (was: NPE when 
calling rmContext.getRMApps().get(...).getCurrentAppAttempt())

> Fix potential NPE when gets RMApp from RM context
> -------------------------------------------------
>
>                 Key: YARN-8222
>                 URL: https://issues.apache.org/jira/browse/YARN-8222
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Critical
>         Attachments: YARN-8222.001.patch
>
>
> Recently we ran some performance tests and found two NPE problems when 
> calling rmContext.getRMApps().get(appId).get...
> These NPEs occurred occasionally during performance tests with a large 
> number of fast-finishing applications. We have checked the other places 
> that call rmContext.getRMApps().get(...): most of them have a null check, 
> and the rest do not need one (the process guarantees that the result will 
> not be null). 
> To fix these problems, we can add a null check for the application before 
> getting the attempt from it.
> (1) NPE in RMContainerImpl$FinishedTransition#updateAttemptMetrics
> {noformat}
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.updateAttemptMetrics(RMContainerImpl.java:742)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:715)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:699)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:482)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:64)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.containerCompleted(FiCaSchedulerApp.java:195)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1793)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:2624)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:663)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1514)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:2396)
>         at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:205)
>         at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:60)
>         at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:834)
> {noformat}
> This NPE appears to happen when a node heartbeat is delayed and the RM tries 
> to update attempt metrics for an app that no longer exists. 
> Reference code of RMContainerImpl$FinishedTransition#updateAttemptMetrics:
> {code:java}
> private static void updateAttemptMetrics(RMContainerImpl container) {
>       Resource resource = container.getContainer().getResource();
>       RMAppAttempt rmAttempt = container.rmContext.getRMApps()
>           .get(container.getApplicationAttemptId().getApplicationId())
>           .getCurrentAppAttempt();
>       if (rmAttempt != null) {
>          //....
>       }
> }
> {code}
> (2) NPE in SchedulerApplicationAttempt#incNumAllocatedContainers
> {noformat}
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.incNumAllocatedContainers(SchedulerApplicationAttempt.java:1268)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:638)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:3589)
>         at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.tryCommit(SLSCapacityScheduler.java:142)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:962)
> {noformat}
> This NPE presumably happens when an outdated proposal is applied for an 
> application that no longer exists in rmContext.
> Reference code:
> {code:java}
>     RMAppAttempt attempt =
>         rmContext.getRMApps().get(attemptId.getApplicationId())
>           .getCurrentAppAttempt();
>     if (attempt != null) {
>       attempt.getRMAppAttemptMetrics().incNumAllocatedContainers(containerType,
>           requestType);
>     }
> {code}
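
The null check proposed in the description can be sketched in isolation. This is a minimal illustration only, not the contents of YARN-8222.001.patch: SketchApp, SketchAttempt and NullSafeAttemptLookup are hypothetical stand-ins for RMApp, RMAppAttempt and the two call sites, and a plain String replaces ApplicationId. The only point it makes is that the result of the map lookup is checked before getCurrentAppAttempt() is called on it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for RMAppAttempt plus its metrics; not a YARN class.
final class SketchAttempt {
  private int allocatedContainers;

  void incNumAllocatedContainers() {
    allocatedContainers++;
  }

  int getAllocatedContainers() {
    return allocatedContainers;
  }
}

// Hypothetical stand-in for RMApp; not a YARN class.
final class SketchApp {
  private final SketchAttempt currentAttempt;

  SketchApp(SketchAttempt currentAttempt) {
    this.currentAttempt = currentAttempt;
  }

  SketchAttempt getCurrentAppAttempt() {
    return currentAttempt;
  }
}

final class NullSafeAttemptLookup {
  // Returns the current attempt, or null when the app is no longer in the
  // map or has no current attempt.
  static SketchAttempt currentAttempt(Map<String, SketchApp> apps, String appId) {
    SketchApp app = apps.get(appId); // null once the app has been removed
    return app == null ? null : app.getCurrentAppAttempt();
  }

  // Updates the metric only when the attempt can still be found.
  // Returns true if the counter was incremented.
  static boolean incAllocated(Map<String, SketchApp> apps, String appId) {
    SketchAttempt attempt = currentAttempt(apps, appId);
    if (attempt == null) {
      return false; // app already finished and removed: skip the update
    }
    attempt.incNumAllocatedContainers();
    return true;
  }

  public static void main(String[] args) {
    Map<String, SketchApp> apps = new ConcurrentHashMap<>();
    apps.put("app_1", new SketchApp(new SketchAttempt()));
    System.out.println(incAllocated(apps, "app_1"));
    apps.remove("app_1"); // simulate a fast-finished application
    System.out.println(incAllocated(apps, "app_1"));
  }
}
```

Keeping the lookup in one helper means both call sites would share the same guard. Whether the real fix uses a helper or inline checks is not stated in the issue.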



