[ https://issues.apache.org/jira/browse/YARN-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456133#comment-16456133 ]
Tao Yang commented on YARN-8222:
--------------------------------
Attached v1 patch for review.
> NPE when calling rmContext.getRMApps().get(...).getCurrentAppAttempt()
> ----------------------------------------------------------------------
>
> Key: YARN-8222
> URL: https://issues.apache.org/jira/browse/YARN-8222
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Critical
> Attachments: YARN-8222.001.patch
>
>
> Recently we ran some performance tests and found two NPE problems when
> calling rmContext.getRMApps().get(appId).get...
> These NPEs occurred occasionally during performance tests with a large
> number of fast-finishing applications. We have checked the other places
> that call rmContext.getRMApps().get(...): most of them have a null check, and
> some do not need one (the process guarantees that the returned result will
> not be null).
> To fix these problems, we can add a null check for the application before
> getting the attempt from it.
> (1) NPE in RMContainerImpl$FinishedTransition#updateAttemptMetrics
> {noformat}
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.updateAttemptMetrics(RMContainerImpl.java:742)
> at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:715)
> at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:699)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:482)
> at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:64)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.containerCompleted(FiCaSchedulerApp.java:195)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1793)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:2624)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:663)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1514)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:2396)
> at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:205)
> at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:60)
> at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:834)
> {noformat}
> This NPE appears to happen when a node heartbeat is delayed and the code
> tries to update attempt metrics for a non-existent app.
> Reference code of RMContainerImpl$FinishedTransition#updateAttemptMetrics:
> {code:java}
> private static void updateAttemptMetrics(RMContainerImpl container) {
>   Resource resource = container.getContainer().getResource();
>   RMAppAttempt rmAttempt = container.rmContext.getRMApps()
>       .get(container.getApplicationAttemptId().getApplicationId())
>       .getCurrentAppAttempt();
>   if (rmAttempt != null) {
>     //....
>   }
> }
> {code}
> (2) NPE in SchedulerApplicationAttempt#incNumAllocatedContainers
> {noformat}
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.incNumAllocatedContainers(SchedulerApplicationAttempt.java:1268)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:638)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:3589)
> at
> org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.tryCommit(SLSCapacityScheduler.java:142)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:962)
> {noformat}
> This NPE should happen when applying an outdated proposal for an
> application that no longer exists in rmContext.
> Reference code:
> {code:java}
> RMAppAttempt attempt =
>     rmContext.getRMApps().get(attemptId.getApplicationId())
>         .getCurrentAppAttempt();
> if (attempt != null) {
>   attempt.getRMAppAttemptMetrics().incNumAllocatedContainers(containerType,
>       requestType);
> }
> {code}
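
For illustration only, here is a minimal sketch of the kind of null check the description proposes. It is not the contents of YARN-8222.001.patch. The types App, AppAttempt and AttemptMetrics, the String application id, and the class NullSafeAttemptLookup are hypothetical stand-ins for the real YARN classes, so that the example is self-contained. The idea is to look the application up once, return early if it has already been removed from the map, and only then ask for the current attempt.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for the YARN types; these are NOT the real Hadoop classes.
class AttemptMetrics {
    private int numAllocatedContainers;
    void incNumAllocatedContainers() { numAllocatedContainers++; }
    int getNumAllocatedContainers() { return numAllocatedContainers; }
}

class AppAttempt {
    private final AttemptMetrics metrics = new AttemptMetrics();
    AttemptMetrics getMetrics() { return metrics; }
}

class App {
    private final AppAttempt currentAttempt;
    App(AppAttempt currentAttempt) { this.currentAttempt = currentAttempt; }
    AppAttempt getCurrentAppAttempt() { return currentAttempt; }
}

class NullSafeAttemptLookup {
    // Returns true if the metric was updated, false if the application
    // (or its current attempt) is no longer available.
    static boolean incNumAllocatedContainers(Map<String, App> apps, String appId) {
        App app = apps.get(appId);
        if (app == null) {
            // The application has already finished and been removed; skip the update.
            return false;
        }
        AppAttempt attempt = app.getCurrentAppAttempt();
        if (attempt == null) {
            return false;
        }
        attempt.getMetrics().incNumAllocatedContainers();
        return true;
    }

    public static void main(String[] args) {
        Map<String, App> apps = new ConcurrentHashMap<>();
        apps.put("app_1", new App(new AppAttempt()));
        System.out.println(incNumAllocatedContainers(apps, "app_1")); // prints "true"
        apps.remove("app_1"); // simulate a fast-finished application
        System.out.println(incNumAllocatedContainers(apps, "app_1")); // prints "false"
    }
}
```

Under these assumptions the same guard would apply to both call sites quoted above: the chained call get(...).getCurrentAppAttempt() is split so that a missing application is handled before it is dereferenced.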