[jira] [Updated] (YARN-10301) "DIGEST-MD5: digest response format violation. Mismatched response." when network partition occurs
     [ https://issues.apache.org/jira/browse/YARN-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YCozy updated YARN-10301:
-------------------------
    Description:

We observed the "Mismatched response." error in the RM's log when an NM becomes network-partitioned after an RM failover. Here's how it happens:

Initially, we have a sleeper YARN service running in a cluster with two RMs (an active RM1 and a standby RM2) and one NM. At some point, we perform an RM failover from RM1 to RM2.

RM1's log:
{noformat}
2020-06-01 16:29:20,387 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to standby state{noformat}

RM2's log:
{noformat}
2020-06-01 16:29:27,818 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to active state{noformat}

After the RM failover, the NM is cut off by a network partition and fails to register with RM2. In other words, there is no "NodeManager from node *** registered" entry in RM2's log.

This does not affect the sleeper YARN service, which successfully recovers after the RM failover. We can see in RM2's log:
{noformat}
2020-06-01 16:30:06,703 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_6_0001_01 State change from LAUNCHED to RUNNING on event = REGISTERED{noformat}

Then, we stop the sleeper service. In RM2's log, we can see that:
{noformat}
2020-06-01 16:30:12,157 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: application_6_0001 unregistered successfully.
...
2020-06-01 16:31:09,861 INFO org.apache.hadoop.yarn.service.webapp.ApiServer: Successfully stopped service sleeper1{noformat}

And in the AM's log, we can see that:
{noformat}
2020-06-01 16:30:12,651 [shutdown-hook-0] INFO service.ServiceMaster - SHUTDOWN_MSG:{noformat}

Some time later, we observe the "Mismatched response" error in RM2's log:
{noformat}
2020-06-01 16:43:20,699 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:376)
    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:623)
    at org.apache.hadoop.ipc.Client$Connection.access$2400(Client.java:414)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:827)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:823)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:823)
    at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:414)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1667)
    at org.apache.hadoop.ipc.Client.call(Client.java:1483)
    at org.apache.hadoop.ipc.Client.call(Client.java:1436)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy102.stopContainers(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:147)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy103.stopContainers(Unknown Source)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:153)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:354)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at
[jira] [Created] (YARN-10301) "DIGEST-MD5: digest response format violation. Mismatched response." when network partition occurs
YCozy created YARN-10301:
-------------------------

             Summary: "DIGEST-MD5: digest response format violation. Mismatched response." when network partition occurs
                 Key: YARN-10301
                 URL: https://issues.apache.org/jira/browse/YARN-10301
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.3.0
            Reporter: YCozy


We observed the "Mismatched response." error in the RM's log when an NM becomes network-partitioned after an RM failover. Here's how it happens:

Initially, we have a sleeper YARN service running in a cluster with two RMs (an active RM1 and a standby RM2) and one NM. At some point, we perform an RM failover from RM1 to RM2.

RM1's log:
{noformat}
2020-06-01 16:29:20,387 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to standby state{noformat}

RM2's log:
{noformat}
2020-06-01 16:29:27,818 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to active state{noformat}

After the RM failover, the NM is cut off by a network partition and fails to register with RM2. In other words, there is no "NodeManager from node *** registered" entry in RM2's log.

This does not affect the sleeper YARN service, which successfully recovers after the RM failover. We can see in RM2's log:
{noformat}
2020-06-01 16:30:06,703 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_6_0001_01 State change from LAUNCHED to RUNNING on event = REGISTERED{noformat}

Then, we stop the sleeper service. In RM2's log, we can see that:
{noformat}
2020-06-01 16:30:12,157 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: application_6_0001 unregistered successfully.
...
2020-06-01 16:31:09,861 INFO org.apache.hadoop.yarn.service.webapp.ApiServer: Successfully stopped service sleeper1{noformat}

And in the AM's log, we can see that:
{noformat}
2020-06-01 16:30:12,651 [shutdown-hook-0] INFO service.ServiceMaster - SHUTDOWN_MSG:{noformat}

Some time later, we observe the "Mismatched response" error in RM2's log:
{noformat}
2020-06-01 16:43:20,699 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:376)
    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:623)
    at org.apache.hadoop.ipc.Client$Connection.access$2400(Client.java:414)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:827)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:823)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:823)
    at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:414)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1667)
    at org.apache.hadoop.ipc.Client.call(Client.java:1483)
    at org.apache.hadoop.ipc.Client.call(Client.java:1436)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy102.stopContainers(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:147)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy103.stopContainers(Unknown Source)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:153) at
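[Editor's note] The DIGEST-MD5 failure above boils down to a shared-secret disagreement: the SASL client derives its response from the token secret it holds, and the server recomputes the expected response from its own copy; when the two copies differ, the server reports "Mismatched response." The sketch below is purely illustrative, not Hadoop's SaslRpcClient code, and the secret names are hypothetical; it only demonstrates why diverging secrets (for example, a partitioned node holding a stale secret while the new active RM uses a different one) make the digest check fail.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

// Conceptual sketch: a DIGEST-style check succeeds only when both sides
// derive the response from the same shared secret for the same challenge.
public class DigestMismatchSketch {
    // Derive a digest response from a secret and a server challenge.
    static byte[] response(String secret, String challenge) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        return md5.digest((secret + ":" + challenge).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        String challenge = "nonce-42";
        // Hypothetical diverged secrets on the two sides of the partition.
        byte[] fromClient = response("stale-client-secret", challenge);
        byte[] expectedByServer = response("current-server-secret", challenge);
        boolean matched = Arrays.equals(fromClient, expectedByServer);
        System.out.println(matched ? "authenticated" : "Mismatched response.");
    }
}
```

With identical secrets on both sides the same code authenticates, which is why this error only surfaces after the partition/failover lets the two sides drift apart.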
[jira] [Commented] (YARN-10166) Add detail log for ApplicationAttemptNotFoundException
     [ https://issues.apache.org/jira/browse/YARN-10166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119886#comment-17119886 ]

YCozy commented on YARN-10166:
------------------------------

We encountered the same issue. An AM is killed during NM failover, but the AM still manages to send the allocate() heartbeat to the RM after the AM is unregistered and before the AM is completely gone. As a result, the confusing ERROR entry "Application attempt ... doesn't exist" shows up in the RM's log. Logging more information about the app would be a great way to clear up the confusion.

Btw, why do we want this to be an ERROR for the RM?

> Add detail log for ApplicationAttemptNotFoundException
> ------------------------------------------------------
>
>                 Key: YARN-10166
>                 URL: https://issues.apache.org/jira/browse/YARN-10166
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Youquan Lin
>            Priority: Minor
>              Labels: patch
>         Attachments: YARN-10166-001.patch, YARN-10166-002.patch, YARN-10166-003.patch, YARN-10166-004.patch
>
> Suppose user A killed the app; then ApplicationMasterService will call unregisterAttempt() for this app. Sometimes, the app's AM continues to call the allocate() method and reports an error as follows.
> {code:java}
> Application attempt appattempt_1582520281010_15271_01 doesn't exist in ApplicationMasterService cache.
> {code}
> If user B has been watching the AM log, he will be confused about why the attempt is no longer in the ApplicationMasterService cache. So I think we can add a detail log for ApplicationAttemptNotFoundException as follows.
> {code:java}
> Application attempt appattempt_1582630210671_14658_01 doesn't exist in ApplicationMasterService cache.App state: KILLED,finalStatus: KILLED ,diagnostics: App application_1582630210671_14658 killed by userA from 127.0.0.1
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
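[Editor's note] The improvement discussed above amounts to enriching the exception message with the app's last known state before throwing. The sketch below is illustrative only (the method and parameter names are hypothetical, not the actual YARN-10166 patch code); it shows how appending state, final status, and diagnostics produces the more informative message quoted in the issue.

```java
// Sketch: build a detailed "attempt not found" message so that someone
// reading the AM log can see *why* the attempt left the cache
// (e.g. the app was already KILLED), instead of a bare "doesn't exist".
public class AttemptNotFoundMessage {
    static String build(String attemptId, String state,
                        String finalStatus, String diagnostics) {
        return "Application attempt " + attemptId
            + " doesn't exist in ApplicationMasterService cache."
            + "App state: " + state
            + ",finalStatus: " + finalStatus
            + " ,diagnostics: " + diagnostics;
    }

    public static void main(String[] args) {
        System.out.println(build("appattempt_1582630210671_14658_01",
            "KILLED", "KILLED",
            "App application_1582630210671_14658 killed by userA from 127.0.0.1"));
    }
}
```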
[jira] [Updated] (YARN-10294) NodeManager shows a wrong reason when a YARN service fails to start
     [ https://issues.apache.org/jira/browse/YARN-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YCozy updated YARN-10294:
-------------------------
    Description:

We have a YARN cluster and try to start a sleeper service. A NodeManager NM1 gets assigned and tries to start the service. We can see from its log:
{noformat}
2020-05-28 14:48:18,650 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Starting container [container_6_0001_01_01]
2020-05-28 14:48:18,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_01_01 transitioned from SCHEDULED to RUNNING{noformat}

Due to some misconfiguration, the container fails to start. We can also see from the container's serviceam.log:
{noformat}
2020-05-28 14:48:56,651 [Curator-Framework-0] ERROR imps.CuratorFrameworkImpl - Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:972)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2020-05-28 14:49:04,621 [pool-5-thread-1] ERROR service.ServiceScheduler - Failed to register app sleeper1 in registry
org.apache.hadoop.registry.client.exceptions.RegistryIOException: `/registry/users/root/services/yarn-service': Failure of mkdir() on /registry/users/root/services/yarn-service: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
    at org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:440)
    at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:595)
    at org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.mknode(RegistryOperationsService.java:99)
    at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.putService(YarnRegistryViewForProviders.java:194)
    at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.registerSelf(YarnRegistryViewForProviders.java:210)
    at org.apache.hadoop.yarn.service.ServiceScheduler$2.run(ServiceScheduler.java:575)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)
    at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
    at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
    at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
    at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153) at
[jira] [Created] (YARN-10294) NodeManager shows a wrong reason when a YARN service fails to start
YCozy created YARN-10294:
-------------------------

             Summary: NodeManager shows a wrong reason when a YARN service fails to start
                 Key: YARN-10294
                 URL: https://issues.apache.org/jira/browse/YARN-10294
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 3.3.0
            Reporter: YCozy


We have a YARN cluster and try to start a sleeper service. A NodeManager NM1 gets assigned and tries to start the service. We can see from its log:
{noformat}
2020-05-28 14:48:18,650 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Starting container [container_6_0001_01_01]
2020-05-28 14:48:18,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_01_01 transitioned from SCHEDULED to RUNNING{noformat}

Due to some misconfiguration, the container fails to start. We can also see from the container's serviceam.log:
{noformat}
2020-05-28 14:48:56,651 [Curator-Framework-0] ERROR imps.CuratorFrameworkImpl - Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:972)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2020-05-28 14:49:04,621 [pool-5-thread-1] ERROR service.ServiceScheduler - Failed to register app sleeper1 in registry
org.apache.hadoop.registry.client.exceptions.RegistryIOException: `/registry/users/root/services/yarn-service': Failure of mkdir() on /registry/users/root/services/yarn-service: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
    at org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:440)
    at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:595)
    at org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.mknode(RegistryOperationsService.java:99)
    at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.putService(YarnRegistryViewForProviders.java:194)
    at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.registerSelf(YarnRegistryViewForProviders.java:210)
    at org.apache.hadoop.yarn.service.ServiceScheduler$2.run(ServiceScheduler.java:575)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)
    at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
    at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
    at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
    at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)
[jira] [Updated] (YARN-10288) InvalidStateTransitionException: LAUNCH_FAILED at FAILED
     [ https://issues.apache.org/jira/browse/YARN-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YCozy updated YARN-10288:
-------------------------
    Description:

We encountered the following exception when testing YARN (2.10.0) under network partition:
{noformat}
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at FAILED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:908)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:115)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:970)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:951)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
    at java.lang.Thread.run(Thread.java:748)
{noformat}
Upon investigation we found that it is similar to YARN-9201. Can we backport the fix for that bug to 2.10.0 as well?

  was:
We encountered the following exception when testing YARN (2.10.0) under network partition:
{noformat}
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at FAILED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:908)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:115)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:970)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:951)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
    at java.lang.Thread.run(Thread.java:748)
{noformat}
Upon investigation we find that it is similar to [YARN-9201|https://issues.apache.org/jira/browse/YARN-9201]. Can we backport the fix of that bug to YARN-2.10.0 as well?


> InvalidStateTransitionException: LAUNCH_FAILED at FAILED
> --------------------------------------------------------
>
>                 Key: YARN-10288
>                 URL: https://issues.apache.org/jira/browse/YARN-10288
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.10.0
>            Reporter: YCozy
>            Priority: Major
>
> We encountered the following exception when testing YARN (2.10.0) under network partition:
> {noformat}
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at FAILED
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:908)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:115)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:970)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:951)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>     at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Upon investigation we find that it is similar to YARN-9201. Can we backport the fix of that bug to 2.10.0 as well?
[jira] [Created] (YARN-10288) InvalidStateTransitionException: LAUNCH_FAILED at FAILED
YCozy created YARN-10288:
-------------------------

             Summary: InvalidStateTransitionException: LAUNCH_FAILED at FAILED
                 Key: YARN-10288
                 URL: https://issues.apache.org/jira/browse/YARN-10288
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.10.0
            Reporter: YCozy


We encountered the following exception when testing YARN (2.10.0) under network partition:
{noformat}
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at FAILED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:908)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:115)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:970)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:951)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
    at java.lang.Thread.run(Thread.java:748)
{noformat}
Upon investigation we find that it is similar to [YARN-9201|https://issues.apache.org/jira/browse/YARN-9201]. Can we backport the fix of that bug to 2.10.0 as well?
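[Editor's note] The "Invalid event: LAUNCH_FAILED at FAILED" error comes from YARN's table-driven state machine: a transition is looked up by (current state, event), and an unmapped pair throws InvalidStateTransitionException. The sketch below is a minimal, self-contained illustration of that lookup pattern, not the real StateMachineFactory; the enum values mirror the ones in the report, and the fix discussed in these threads roughly corresponds to registering an extra (typically no-op) transition so a late LAUNCH_FAILED at FAILED is tolerated instead of thrown.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal table-driven state machine sketch. Looking up an unmapped
// (state, event) pair throws, mirroring "Invalid event: X at Y".
public class AttemptStateMachineSketch {
    enum State { LAUNCHED, RUNNING, FAILED }
    enum Event { LAUNCH_FAILED, REGISTERED }

    private final Map<State, Map<Event, State>> table = new HashMap<>();
    private State current;

    AttemptStateMachineSketch(State start) {
        current = start;
        // FAILED deliberately has no LAUNCH_FAILED entry, like the
        // 2.10.0 behavior reported above.
        addTransition(State.LAUNCHED, Event.LAUNCH_FAILED, State.FAILED);
        addTransition(State.LAUNCHED, Event.REGISTERED, State.RUNNING);
    }

    final void addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, k -> new HashMap<>()).put(on, to);
    }

    State handle(Event e) {
        State next = table.getOrDefault(current, new HashMap<>()).get(e);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + e + " at " + current);
        }
        return current = next;
    }

    public static void main(String[] args) {
        AttemptStateMachineSketch m = new AttemptStateMachineSketch(State.LAUNCHED);
        System.out.println(m.handle(Event.REGISTERED)); // RUNNING
    }
}
```

Adding `addTransition(State.FAILED, Event.LAUNCH_FAILED, State.FAILED)` to this sketch would make the late event a harmless self-loop, which is the general shape of the backport being requested.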
[jira] [Issue Comment Deleted] (YARN-9194) Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and NullPointerException happens in RM while shutdown a NM
     [ https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YCozy updated YARN-9194:
------------------------
    Comment: was deleted

(was: Hi, we were able to trigger the same bug (LAUNCH_FAILED at FAILED) in 2.10.0. Can we also backport the fix to that version? Thanks!)

> Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and NullPointerException happens in RM while shutdown a NM
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9194
>                 URL: https://issues.apache.org/jira/browse/YARN-9194
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Critical
>             Fix For: 3.1.2, 3.3.0, 3.2.1
>
>         Attachments: YARN-9194_1.patch, YARN-9194_2.patch, YARN-9194_3.patch, YARN-9194_4.patch, YARN-9194_5.patch, YARN-9194_6.patch, hadoop-hires-resourcemanager-hadoop11.log
>
> While the attempt fails, the REGISTERED event arrives, hence the InvalidStateTransitionException happens.
>
> {code:java}
> 2019-01-13 00:41:57,127 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: App attempt: appattempt_1547311267249_0001_02 can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: REGISTERED at FAILED
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
[jira] [Commented] (YARN-9194) Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and NullPointerException happens in RM while shutdown a NM
     [ https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114123#comment-17114123 ]

YCozy commented on YARN-9194:
-----------------------------

Hi, we were able to trigger the same bug (LAUNCH_FAILED at FAILED) in 2.10.0. Can we also backport the fix to that version? Thanks!

> Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and NullPointerException happens in RM while shutdown a NM
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9194
>                 URL: https://issues.apache.org/jira/browse/YARN-9194
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Critical
>             Fix For: 3.1.2, 3.3.0, 3.2.1
>
>         Attachments: YARN-9194_1.patch, YARN-9194_2.patch, YARN-9194_3.patch, YARN-9194_4.patch, YARN-9194_5.patch, YARN-9194_6.patch, hadoop-hires-resourcemanager-hadoop11.log
>
> While the attempt fails, the REGISTERED event arrives, hence the InvalidStateTransitionException happens.
>
> {code:java}
> 2019-01-13 00:41:57,127 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: App attempt: appattempt_1547311267249_0001_02 can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: REGISTERED at FAILED
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
[jira] [Updated] (YARN-10232) InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at RUNNING
[ https://issues.apache.org/jira/browse/YARN-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YCozy updated YARN-10232:
-
Description:

We were testing YARN under network partition and found the following ERROR in the RM's log:

{code:java}
2020-04-11 13:10:39,739 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: App attempt: appattempt_6_0001_02 can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at RUNNING
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:916)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1097)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1078)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:222)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:138)
at java.lang.Thread.run(Thread.java:748)
{code}

After analyzing the logs, we have reconstructed how this bug is triggered:

* We have a cluster with one RM and one NM.
* A client tries to start a YARN service.
* The RM sends a request to the NM to start the containers. NM's log:

{code:java}
2020-04-11 14:23:44,030 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_6_0001_02 (auth:SIMPLE)
2020-04-11 14:23:44,229 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_6_0001_02_01 by user appattempt_6_0001_02
{code}

* The NM starts the containers successfully. NM's log:

{code:java}
2020-04-11 14:23:44,347 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_6_0001 transitioned from INITING to RUNNING
2020-04-11 14:23:44,357 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_02_01 transitioned from NEW to LOCALIZING
{code}

* However, due to the network partition, the NM fails to send back the RPC response.
* After a while, the AM registers and the application is running happily. RM's log:

{code:java}
2020-04-11 14:23:50,359 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_6_0001_02 State change from ALLOCATED to RUNNING on event = REGISTERED
2020-04-11 14:23:50,359 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_6_0001 State change from ACCEPTED to RUNNING on event = ATTEMPT_REGISTERED
{code}

* Then, since the RM never received the RPC response for startContainers, it retries. The network partition has healed by now, so the NM receives the retried startContainers RPC. NM's log:

{code:java}
2020-04-11 14:23:54,392 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_6_0001_02 (auth:SIMPLE)
{code}

* But since the attempt is already running, this launch request does not succeed. NM's log:

{code:java}
2020-04-11 14:23:54,401 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Unauthorized request to start container.
Attempt to relaunch the same container with id container_6_0001_02_01.
{code}

RM's log:

{code:java}
2020-04-11 14:23:54,428 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_6_0001_02. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
Attempt to relaunch the same container with id container_6_0001_02_01.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateExceptionImpl(SerializedExceptionPBImpl.java:171)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:182)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
{code}
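The retry interaction described above can be sketched as follows. This is a toy model, not the real ContainerManagerImpl (the class and method names are illustrative): the NM remembers which container ids it has already started, so an RPC retry of startContainers for the same container is rejected as a relaunch, and the RM then turns that rejection into a LAUNCH_FAILED event even though the first request actually succeeded and the attempt is already RUNNING.

```java
import java.util.HashSet;
import java.util.Set;

// Toy NM-side model: a container id may be started at most once. A retried
// startContainers RPC for an id that was already started is rejected,
// mirroring "Unauthorized request to start container. Attempt to relaunch
// the same container with id ...".
class ToyContainerManager {
    private final Set<String> started = new HashSet<>();

    /** Returns true if the container was started; throws on a duplicate start. */
    boolean startContainer(String containerId) {
        if (!started.add(containerId)) {
            throw new IllegalStateException(
                "Attempt to relaunch the same container with id " + containerId);
        }
        return true;
    }
}
```

The sketch also shows why this is fundamentally an at-least-once-RPC problem: the start is not idempotent from the NM's point of view, so a lost response plus a retry looks identical to a genuine relaunch attempt.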
[jira] [Created] (YARN-10232) InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at RUNNING
YCozy created YARN-10232:
-
Summary: InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at RUNNING
Key: YARN-10232
URL: https://issues.apache.org/jira/browse/YARN-10232
Project: Hadoop YARN
Issue Type: Bug
Reporter: YCozy

We were testing YARN under network partition and found an "Invalid event: LAUNCH_FAILED at RUNNING" ERROR in the RM's log.
[jira] [Created] (YARN-10231) When a NM is partitioned away, YARN service will complain about "Queue's AM resource limit exceeded"
YCozy created YARN-10231:
-
Summary: When a NM is partitioned away, YARN service will complain about "Queue's AM resource limit exceeded"
Key: YARN-10231
URL: https://issues.apache.org/jira/browse/YARN-10231
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.3.0
Reporter: YCozy

We were testing YARN's RM failover code under network partition and observed the following failure. We think this is a bug and would like to confirm with you. Basically, we were testing the following scenario:

# Start a YARN cluster with two RMs (e.g., RM1 and RM2) and one NM.
# Make RM1 active.
# Start a YARN service, e.g., the built-in sleeper service. Name it sleeper1.
# Fail over from RM1 to RM2.
# Stop sleeper1 and start another YARN service (again the sleeper service), named sleeper2.

When no network partition happens, everything is fine (e.g., sleeper2 starts successfully). However, if the NM is partitioned away after the RM failover, sleeper2 fails to start. After polling sleeper2's status for 30 seconds, its application report is still as follows:

{code:java}
Application Report :
  Application-Id : application_4_0001
  Application-Name : sleeper2
  Application-Type : yarn-service
  User : root
  Queue : default
  Application Priority : 0
  Start-Time : 1585525063950
  Finish-Time : 0
  Progress : 0%
  State : ACCEPTED
  Final-State : UNDEFINED
  Tracking-URL : N/A
  RPC Port : -1
  AM Host : N/A
  Aggregate Resource Allocation : 0 MB-seconds, 0 vcore-seconds
  Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
  Log Aggregation Status : DISABLED
  Diagnostics : [Sun Mar 29 23:37:44 + 2020] Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded. Details : AM Partition = ; AM Resource Request = ; Queue Resource Limit for AM = ; User AM Resource Limit of the queue = ; Queue AM Resource Usage = ;
  Unmanaged Application : false
  Application Node Label Expression : 
  AM container Node Label Expression : 
  TimeoutType : LIFETIME
  ExpiryTime : UNLIMITED
  RemainingTime : -1 seconds
{code}

Since the only injected fault is a network partition, the queue's AM resource limit should not actually be exceeded. We can reliably reproduce this bug using our fault injection engine. Please let us know if you need any other info for debugging.
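One plausible mechanism for the symptom above (an assumption on our part, not confirmed by the logs): the queue's AM resource limit is typically computed as a fraction of the resources the scheduler currently sees, so if the capacity of the lone, partitioned-away NM is no longer counted, the computed limit drops to zero and any AM request "exceeds" it. A hypothetical back-of-the-envelope sketch; the 10% factor and all names are illustrative, not YARN's actual API or configuration:

```java
// Hypothetical sketch: AM resource limit as a fraction of currently visible
// cluster memory. With the only NM partitioned away, visible capacity is 0,
// so the limit is 0 and no AM can be activated. All names and the 0.10
// factor are illustrative assumptions.
class ToyAmLimitCheck {
    static final double MAX_AM_PERCENT = 0.10;

    static long amResourceLimitMb(long visibleClusterMb) {
        return (long) (visibleClusterMb * MAX_AM_PERCENT);
    }

    static boolean amActivatable(long amRequestMb, long visibleClusterMb) {
        return amRequestMb <= amResourceLimitMb(visibleClusterMb);
    }
}
```

Under this reading, the "limit exceeded" diagnostic is misleading: nothing about the queue configuration changed, only the set of live nodes the limit is derived from.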