[jira] [Updated] (YARN-10051) Throwing NoSuchElementException when even dispatcher handling NODE_UPDATE

2019-12-19 Thread Yong Xing (Jira)


 [ https://issues.apache.org/jira/browse/YARN-10051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yong Xing updated YARN-10051:
-
Description: 
 After restarting an NM, I found that the active RM had crashed. The exception
stack trace is as follows.
{code:java}
2019-12-16 18:12:20,286 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher
java.util.NoSuchElementException
    at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
    at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1374)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:345)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:958)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:130)
    at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
    at java.lang.Thread.run(Thread.java:748)
{code}
 

This issue looks similar to
[YARN-9552|https://issues.apache.org/jira/browse/YARN-9552] and
[YARN-7382|https://issues.apache.org/jira/browse/YARN-7382],
but the root cause is different.
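
For readers unfamiliar with the top two frames of the trace: ConcurrentSkipListSet.first() throws NoSuchElementException when the set is empty at the moment of the call. The sketch below only illustrates that JDK behavior and one defensive lookup. The class and method names (FirstKeyDemo, firstThrowsOnEmpty, firstOrEmpty) are hypothetical and are not part of YARN, and this is not a proposed fix for this issue, whose root cause is not stated here.

```java
import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical demo class; it does not reproduce AppSchedulingInfo.
final class FirstKeyDemo {

    // Returns true if first() throws on the given set, as it does when the set is empty.
    static boolean firstThrowsOnEmpty(ConcurrentSkipListSet<Integer> keys) {
        try {
            keys.first();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    // Defensive lookup using a single call to first(). An isEmpty() check
    // followed by first() could still throw if another thread removes the
    // last element between the two calls.
    static Optional<Integer> firstOrEmpty(ConcurrentSkipListSet<Integer> keys) {
        try {
            return Optional.of(keys.first());
        } catch (NoSuchElementException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> keys = new ConcurrentSkipListSet<>();
        System.out.println("first() throws on empty set: " + firstThrowsOnEmpty(keys));
        System.out.println("firstOrEmpty on empty set: " + firstOrEmpty(keys));
        keys.add(3);
        keys.add(1);
        System.out.println("firstOrEmpty after adding 3 and 1: " + firstOrEmpty(keys));
    }
}
```

Whether an empty key set is a legitimate state in the scheduler, or a symptom of an earlier inconsistency, is exactly the root-cause question this issue raises; the sketch does not answer it.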


> Throwing NoSuchElementException when even dispatcher handling NODE_UPDATE
> -
>
> Key: YARN-10051
> URL: https://issues.apache.org/jira/browse/YARN-10051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0
> Reporter: Yong Xing
> Priority: Critical
> Fix For: 3.0.0
>
>

[jira] [Created] (YARN-10051) Throwing NoSuchElementException when even dispatcher handling NODE_UPDATE

2019-12-19 Thread Yong Xing (Jira)
Yong Xing created YARN-10051:


 Summary: Throwing NoSuchElementException when even dispatcher 
handling NODE_UPDATE
 Key: YARN-10051
 URL: https://issues.apache.org/jira/browse/YARN-10051
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0
Reporter: Yong Xing
 Fix For: 3.0.0


 After restarting an NM, I found that the active RM had crashed. The exception
stack trace is as follows.
{code:java}
2019-12-16 18:12:20,286 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher
java.util.NoSuchElementException
    at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
    at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1374)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:345)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:958)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:130)
    at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
    at java.lang.Thread.run(Thread.java:748)
{code}
 





[jira] [Created] (YARN-10046) RM failed to transition to Active because of App recovery throwing java.lang.NullPointerException

2019-12-19 Thread Yong Xing (Jira)
Yong Xing created YARN-10046:


 Summary: RM failed to transition to Active because of App recovery 
throwing java.lang.NullPointerException
 Key: YARN-10046
 URL: https://issues.apache.org/jira/browse/YARN-10046
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0
Reporter: Yong Xing


 

CDH Distribution: Hadoop 3.0.0-cdh6.0.1

The exception stack is as follows.

2019-12-12 17:09:41,422 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:521)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1221)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:130)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1265)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1206)
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:907)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:116)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:1046)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$2000(RMAppImpl.java:118)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1110)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1051)
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:875)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:357)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:544)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1393)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1146)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1186)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1182)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1726)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1182)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320)
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:894)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:491)

 

During recovery of the application attempts, the state of one app attempt is
null. The following log shows this:

2019-12-12 17:09:41,381 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: 
application_1576136386231_0742 with 1 attempts and final state =