[jira] [Updated] (YARN-10301) "DIGEST-MD5: digest response format violation. Mismatched response." when network partition occurs

2020-06-01 Thread YCozy (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YCozy updated YARN-10301:
-
Description: 
We observed the "Mismatched response." error in the RM's log when an NM gets 
network-partitioned after an RM failover. Here's how it happens:

 

Initially, we have a sleeper YARN service running in a cluster with two RMs (an 
active RM1 and a standby RM2) and one NM. At some point, we perform an RM 
failover from RM1 to RM2.

RM1's log:
{noformat}
2020-06-01 16:29:20,387 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to 
standby state{noformat}
RM2's log:
{noformat}
2020-06-01 16:29:27,818 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to 
active state{noformat}
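
(For reference, one way to drive this failover step is the standard rmadmin 
transition. A minimal sketch follows, with rm1 and rm2 as placeholder HA ids; 
this is not necessarily how our test harness performs it.)
{code:java}
// Sketch only: drive the RM failover through the rmadmin CLI entry point,
// equivalent to `yarn rmadmin -transitionToStandby rm1` followed by
// `yarn rmadmin -transitionToActive rm2`. The HA ids rm1/rm2 are placeholders
// for whatever yarn.resourcemanager.ha.rm-ids is configured to; an HA-enabled
// yarn-site.xml is assumed, and --forcemanual is needed if automatic failover
// is enabled.
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.yarn.client.cli.RMAdminCLI;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ForceFailover {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    ToolRunner.run(conf, new RMAdminCLI(), new String[] {"-transitionToStandby", "rm1"});
    ToolRunner.run(conf, new RMAdminCLI(), new String[] {"-transitionToActive", "rm2"});
  }
}
{code}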
 

After the RM failover, the NM encounters a network partition and fails to 
register with RM2. In other words, there's no "NodeManager from node *** 
registered" in RM2's log.

 

This does not affect the sleeper YARN service. The sleeper service successfully 
recovers after the RM failover. We can see in RM2's log: 
{noformat}
2020-06-01 16:30:06,703 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_6_0001_01 State change from LAUNCHED to RUNNING on event = 
REGISTERED{noformat}
 

Then, we stop the sleeper service. In RM2's log, we can see that:
{noformat}
2020-06-01 16:30:12,157 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
application_6_0001 unregistered successfully.
...
2020-06-01 16:31:09,861 INFO org.apache.hadoop.yarn.service.webapp.ApiServer: 
Successfully stopped service sleeper1{noformat}
And in AM's log, we can see that: 
{noformat}
2020-06-01 16:30:12,651 [shutdown-hook-0] INFO  service.ServiceMaster - 
SHUTDOWN_MSG:{noformat}
 

Some time later, we observe the "Mismatched response" in RM2's log:
{noformat}
2020-06-01 16:43:20,699 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
  at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:376)
  at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:623)
  at org.apache.hadoop.ipc.Client$Connection.access$2400(Client.java:414)
  at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:827)
  at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:823)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
  at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:823)
  at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:414)
  at org.apache.hadoop.ipc.Client.getConnection(Client.java:1667)
  at org.apache.hadoop.ipc.Client.call(Client.java:1483)
  at org.apache.hadoop.ipc.Client.call(Client.java:1436)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
  at com.sun.proxy.$Proxy102.stopContainers(Unknown Source)
  at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:147)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
  at com.sun.proxy.$Proxy103.stopContainers(Unknown Source)
  at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:153)
  at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:354)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 

[jira] [Created] (YARN-10301) "DIGEST-MD5: digest response format violation. Mismatched response." when network partition occurs

2020-06-01 Thread YCozy (Jira)
YCozy created YARN-10301:


 Summary: "DIGEST-MD5: digest response format violation. Mismatched 
response." when network partition occurs
 Key: YARN-10301
 URL: https://issues.apache.org/jira/browse/YARN-10301
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: YCozy


We observed the "Mismatched response." error in the RM's log when an NM gets 
network-partitioned after an RM failover. Here's how it happens:

 

Initially, we have a sleeper YARN service running in a cluster with two RMs (an 
active RM1 and a standby RM2) and one NM. At some point, we perform an RM 
failover from RM1 to RM2.

RM1's log:

 
{noformat}
2020-06-01 16:29:20,387 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to 
standby state{noformat}
RM2's log:
{noformat}
2020-06-01 16:29:27,818 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to 
active state{noformat}
 

After the RM failover, the NM encounters a network partition and fails to 
register with RM2. In other words, there's no "NodeManager from node *** 
registered" in RM2's log.

 

This does not affect the sleeper YARN service. The sleeper service successfully 
recovers after the RM failover. We can see in RM2's log:

 
{noformat}
2020-06-01 16:30:06,703 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_6_0001_01 State change from LAUNCHED to RUNNING on event = 
REGISTERED{noformat}
 

Then, we stop the sleeper service. In RM2's log, we can see that:

 
{noformat}
2020-06-01 16:30:12,157 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
application_6_0001 unregistered successfully.
...
2020-06-01 16:31:09,861 INFO org.apache.hadoop.yarn.service.webapp.ApiServer: 
Successfully stopped service sleeper1{noformat}
And in AM's log, we can see that:
{noformat}
2020-06-01 16:30:12,651 [shutdown-hook-0] INFO  service.ServiceMaster - 
SHUTDOWN_MSG:{noformat}
 

Some time later, we observe the "Mismatched response" in RM2's log:
{noformat}
2020-06-01 16:43:20,699 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
  at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:376)
  at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:623)
  at org.apache.hadoop.ipc.Client$Connection.access$2400(Client.java:414)
  at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:827)
  at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:823)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
  at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:823)
  at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:414)
  at org.apache.hadoop.ipc.Client.getConnection(Client.java:1667)
  at org.apache.hadoop.ipc.Client.call(Client.java:1483)
  at org.apache.hadoop.ipc.Client.call(Client.java:1436)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
  at com.sun.proxy.$Proxy102.stopContainers(Unknown Source)
  at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:147)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
  at com.sun.proxy.$Proxy103.stopContainers(Unknown Source)
  at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:153)
  at 

[jira] [Commented] (YARN-10166) Add detail log for ApplicationAttemptNotFoundException

2020-05-29 Thread YCozy (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119886#comment-17119886
 ] 

YCozy commented on YARN-10166:
--

We encountered the same issue. An AM is killed during NM failover, but the AM 
still manages to send the allocate() heartbeat to the RM after the AM is 
unregistered and before it is completely gone. As a result, the confusing ERROR 
entry "Application attempt ... doesn't exist" appears in the RM's log. Logging 
more information about the app would be a great way to clear up the confusion.

 

Btw, why do we want this to be an ERROR for the RM?
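
A minimal sketch of the kind of detail we would find helpful (hypothetical 
helper class, not the attached patch; it only assumes the stock RMApp 
accessors):
{code:java}
// Hypothetical sketch, not the attached YARN-10166 patch: build a more
// descriptive ApplicationAttemptNotFoundException message from the RMApp
// state, so whoever reads the AM log can see why the attempt left the cache.
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMApp;

final class AttemptNotFoundMessage {
  static ApplicationAttemptNotFoundException of(ApplicationAttemptId attemptId, RMApp app) {
    return new ApplicationAttemptNotFoundException(
        "Application attempt " + attemptId
            + " doesn't exist in ApplicationMasterService cache."
            + " App state: " + app.getState()
            + ", finalStatus: " + app.getFinalApplicationStatus()
            + ", diagnostics: " + app.getDiagnostics());
  }
}
{code}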

> Add detail log for ApplicationAttemptNotFoundException
> --
>
> Key: YARN-10166
> URL: https://issues.apache.org/jira/browse/YARN-10166
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Youquan Lin
>Priority: Minor
>  Labels: patch
> Attachments: YARN-10166-001.patch, YARN-10166-002.patch, 
> YARN-10166-003.patch, YARN-10166-004.patch
>
>
>      Suppose user A killed the app; then ApplicationMasterService will call 
> unregisterAttempt() for this app. Sometimes, the app's AM continues to call the 
> allocate() method and reports an error as follows.
> {code:java}
> Application attempt appattempt_1582520281010_15271_01 doesn't exist in 
> ApplicationMasterService cache.
> {code}
>     If user B has been watching the AM log, he will be confused about why the 
> attempt is no longer in the ApplicationMasterService cache. So I think we can 
> add a detailed log message for ApplicationAttemptNotFoundException, as follows.
> {code:java}
> Application attempt appattempt_1582630210671_14658_01 doesn't exist in 
> ApplicationMasterService cache.App state: KILLED,finalStatus: KILLED 
> ,diagnostics: App application_1582630210671_14658 killed by userA from 
> 127.0.0.1
> {code}






[jira] [Updated] (YARN-10294) NodeManager shows a wrong reason when a YARN service fails to start

2020-05-28 Thread YCozy (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YCozy updated YARN-10294:
-
Description: 
We have a YARN cluster and try to start a sleeper service. A NodeManager NM1 
gets assigned and tries to start the service. We can see from its log:
{noformat}
2020-05-28 14:48:18,650 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler:
 Starting container [container_6_0001_01_01]
2020-05-28 14:48:18,710 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Container container_6_0001_01_01 transitioned from SCHEDULED to 
RUNNING{noformat}
Due to some misconfiguration, the container fails to start. We can also see 
from the container's serviceam.log:
{noformat}
2020-05-28 14:48:56,651 [Curator-Framework-0] ERROR imps.CuratorFrameworkImpl - Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:972)
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
  at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
2020-05-28 14:49:04,621 [pool-5-thread-1] ERROR service.ServiceScheduler - Failed to register app sleeper1 in registry
org.apache.hadoop.registry.client.exceptions.RegistryIOException: `/registry/users/root/services/yarn-service': Failure of mkdir()  on /registry/users/root/services/yarn-service: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
  at org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:440)
  at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:595)
  at org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.mknode(RegistryOperationsService.java:99)
  at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.putService(YarnRegistryViewForProviders.java:194)
  at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.registerSelf(YarnRegistryViewForProviders.java:210)
  at org.apache.hadoop.yarn.service.ServiceScheduler$2.run(ServiceScheduler.java:575)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
  at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)
  at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
  at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
  at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
  at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
  at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)
  at 

[jira] [Created] (YARN-10294) NodeManager shows a wrong reason when a YARN service fails to start

2020-05-28 Thread YCozy (Jira)
YCozy created YARN-10294:


 Summary: NodeManager shows a wrong reason when a YARN service 
fails to start
 Key: YARN-10294
 URL: https://issues.apache.org/jira/browse/YARN-10294
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.3.0
Reporter: YCozy


We have a YARN cluster and try to start a sleeper service. A NodeManager NM1 
gets assigned and tries to start the service. We can see from its log:
{noformat}
2020-05-28 14:48:18,650 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler:
 Starting container [container_6_0001_01_01]
2020-05-28 14:48:18,710 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Container container_6_0001_01_01 transitioned from SCHEDULED to 
RUNNING{noformat}
Due to some misconfiguration, the container fails to start. We can also see 
from the container's serviceam.log:
{noformat}
2020-05-28 14:48:56,651 [Curator-Framework-0] ERROR imps.CuratorFrameworkImpl - Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:972)
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
  at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
2020-05-28 14:49:04,621 [pool-5-thread-1] ERROR service.ServiceScheduler - Failed to register app sleeper1 in registry
org.apache.hadoop.registry.client.exceptions.RegistryIOException: `/registry/users/root/services/yarn-service': Failure of mkdir()  on /registry/users/root/services/yarn-service: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
  at org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:440)
  at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:595)
  at org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.mknode(RegistryOperationsService.java:99)
  at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.putService(YarnRegistryViewForProviders.java:194)
  at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.registerSelf(YarnRegistryViewForProviders.java:210)
  at org.apache.hadoop.yarn.service.ServiceScheduler$2.run(ServiceScheduler.java:575)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
  at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)
  at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
  at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
  at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
  at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
  at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)

[jira] [Updated] (YARN-10288) InvalidStateTransitionException: LAUNCH_FAILED at FAILED

2020-05-22 Thread YCozy (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YCozy updated YARN-10288:
-
Description: 
We encountered the following exception when testing YARN (2.10.0) under network 
partition:
{noformat}
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
LAUNCH_FAILED at FAILED
  at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
  at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
  at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
  at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:908)
  at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:115)
  at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:970)
  at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:951)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
  at java.lang.Thread.run(Thread.java:748)   {noformat}
Upon investigation we find that it is similar to YARN-9201. Can we backport the 
fix of that bug to 2.10.0 as well?
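
For context, the error itself just means that no transition is registered for 
LAUNCH_FAILED while the attempt is in the FAILED state. A self-contained toy 
sketch of that mechanism (made-up states and events, not RMAppAttemptImpl's 
real topology):
{code:java}
// Toy sketch of org.apache.hadoop.yarn.state.StateMachineFactory with made-up
// states/events: delivering an event that has no registered transition for the
// current state throws InvalidStateTransitionException, which is exactly what
// a late LAUNCH_FAILED hitting an already-FAILED attempt produces.
import org.apache.hadoop.yarn.state.StateMachine;
import org.apache.hadoop.yarn.state.StateMachineFactory;

public class ToyAttempt {
  enum State { NEW, FAILED }
  enum Event { LAUNCH_FAILED }

  private static final StateMachineFactory<ToyAttempt, State, Event, Event> FACTORY =
      new StateMachineFactory<ToyAttempt, State, Event, Event>(State.NEW)
          .addTransition(State.NEW, State.FAILED, Event.LAUNCH_FAILED)
          // No arc is registered for LAUNCH_FAILED in FAILED, so a second
          // delivery of the same event throws.
          .installTopology();

  public static void main(String[] args) {
    StateMachine<State, Event, Event> sm = FACTORY.make(new ToyAttempt());
    sm.doTransition(Event.LAUNCH_FAILED, Event.LAUNCH_FAILED); // NEW -> FAILED
    sm.doTransition(Event.LAUNCH_FAILED, Event.LAUNCH_FAILED); // "Invalid event: LAUNCH_FAILED at FAILED"
  }
}
{code}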

 

  was:
We encountered the following exception when testing YARN (2.10.0) under network 
partition:

 
{noformat}
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
LAUNCH_FAILED at FAILED
  at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
  at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
  at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
  at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:908)
  at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:115)
  at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:970)
  at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:951)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
  at java.lang.Thread.run(Thread.java:748)   {noformat}
Upon investigation we find that it is similar to 
[YARN-9201|https://issues.apache.org/jira/browse/YARN-9201]. Can we backport 
the fix of that bug to YARN-2.10.0 as well?

 


> InvalidStateTransitionException: LAUNCH_FAILED at FAILED
> 
>
> Key: YARN-10288
> URL: https://issues.apache.org/jira/browse/YARN-10288
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: YCozy
>Priority: Major
>
> We encountered the following exception when testing YARN (2.10.0) under 
> network partition:
> {noformat}
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> LAUNCH_FAILED at FAILED
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:908)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:115)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:951)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:748)   {noformat}
> Upon investigation we find that it is similar to YARN-9201. Can we backport 
> the fix of that bug to 2.10.0 as well?
>  




[jira] [Created] (YARN-10288) InvalidStateTransitionException: LAUNCH_FAILED at FAILED

2020-05-22 Thread YCozy (Jira)
YCozy created YARN-10288:


 Summary: InvalidStateTransitionException: LAUNCH_FAILED at FAILED
 Key: YARN-10288
 URL: https://issues.apache.org/jira/browse/YARN-10288
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.10.0
Reporter: YCozy


We encountered the following exception when testing YARN (2.10.0) under network 
partition:

 
{noformat}
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
LAUNCH_FAILED at FAILED
  at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
  at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
  at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
  at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:908)
  at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:115)
  at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:970)
  at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:951)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
  at java.lang.Thread.run(Thread.java:748)   {noformat}
Upon investigation we find that it is similar to 
[YARN-9201|https://issues.apache.org/jira/browse/YARN-9201]. Can we backport 
the fix of that bug to YARN-2.10.0 as well?

 






[jira] [Issue Comment Deleted] (YARN-9194) Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and NullPointerException happens in RM while shutdown a NM

2020-05-22 Thread YCozy (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YCozy updated YARN-9194:

Comment: was deleted

(was: Hi, we were able to trigger the same bug (LAUNCH_FAILED at FAILED) in 
2.10.0. Can we also backport the fix to that version? Thanks!)

> Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and 
> NullPointerException happens in RM while shutdown a NM
> -
>
> Key: YARN-9194
> URL: https://issues.apache.org/jira/browse/YARN-9194
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
> Fix For: 3.1.2, 3.3.0, 3.2.1
>
> Attachments: YARN-9194_1.patch, YARN-9194_2.patch, YARN-9194_3.patch, 
> YARN-9194_4.patch, YARN-9194_5.patch, YARN-9194_6.patch, 
> hadoop-hires-resourcemanager-hadoop11.log
>
>
> While the attempt fails, the REGISTERED event comes, hence the 
> InvalidStateTransitionException happens.
>  
> {code:java}
> 2019-01-13 00:41:57,127 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> App attempt: appattempt_1547311267249_0001_02 can't handle this event at 
> current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> REGISTERED at FAILED
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>  






[jira] [Commented] (YARN-9194) Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and NullPointerException happens in RM while shutdown a NM

2020-05-22 Thread YCozy (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114123#comment-17114123
 ] 

YCozy commented on YARN-9194:
-

Hi, we were able to trigger the same bug (LAUNCH_FAILED at FAILED) in 2.10.0. 
Can we also backport the fix to that version? Thanks!

> Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and 
> NullPointerException happens in RM while shutdown a NM
> -
>
> Key: YARN-9194
> URL: https://issues.apache.org/jira/browse/YARN-9194
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
> Fix For: 3.1.2, 3.3.0, 3.2.1
>
> Attachments: YARN-9194_1.patch, YARN-9194_2.patch, YARN-9194_3.patch, 
> YARN-9194_4.patch, YARN-9194_5.patch, YARN-9194_6.patch, 
> hadoop-hires-resourcemanager-hadoop11.log
>
>
> While the attempt fails, the REGISTERED event comes, hence the 
> InvalidStateTransitionException happens.
>  
> {code:java}
> 2019-01-13 00:41:57,127 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> App attempt: appattempt_1547311267249_0001_02 can't handle this event at 
> current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> REGISTERED at FAILED
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>  






[jira] [Updated] (YARN-10232) InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at RUNNING

2020-04-11 Thread YCozy (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YCozy updated YARN-10232:
-
Description: 
We were testing YARN under network partition and found the following ERROR in 
RM's log.
{code:java}
2020-04-11 13:10:39,739 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
App attempt: appattempt_6_0001_02 can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
LAUNCH_FAILED at RUNNING
  at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
  at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
  at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
  at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:916)
  at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
  at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1097)
  at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1078)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:222)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:138)
  at java.lang.Thread.run(Thread.java:748)   
{code}
After analyzing the logs, we reconstructed how this bug is triggered:
 * We have a cluster with one RM and one NM.
 * A client tries to start a YARN service.
 * RM sends a request to the NM to start the containers:

NM's log:
{code:java}
2020-04-11 14:23:44,030 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for appattempt_6_0001_02 (auth:SIMPLE)
2020-04-11 14:23:44,229 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 Start request for container_6_0001_02_01 by user appattempt_6_0001_02
{code}
 * NM starts the containers successfully:

NM's log:
{code:java}
2020-04-11 14:23:44,347 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
 Application application_6_0001 transitioned from INITING to RUNNING
2020-04-11 14:23:44,357 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Container container_6_0001_02_01 transitioned from NEW to LOCALIZING
{code}
 * However, due to the network partition, the NM failed to send back the RPC response.
 * After a while, the application is running happily:

RM's log:
{code:java}
2020-04-11 14:23:50,359 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_6_0001_02 State change from ALLOCATED to RUNNING on event = 
REGISTERED
2020-04-11 14:23:50,359 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_6_0001 State change from ACCEPTED to RUNNING on event = 
ATTEMPT_REGISTERED{code}
 * Then, since the RM didn't receive the RPC response for startContainers, it 
retries. The network partition has already ended, so the NM will receive the new 
startContainers RPC:

NM's log:
{code:java}
2020-04-11 14:23:54,392 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for appattempt_6_0001_02 (auth:SIMPLE)
{code}
 * But since the attempt is actually running, this launch request does not 
succeed: 

NM's log:
{code:java}
2020-04-11 14:23:54,401 ERROR 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 Unauthorized request to start container.
 Attempt to relaunch the same container with id container_6_0001_02_01.
{code}
RM's log:
{code:java}
2020-04-11 14:23:54,428 INFO 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error 
launching appattempt_6_0001_02. Got exception: 
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start 
container.
 Attempt to relaunch the same container with id container_6_0001_02_01.
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateExceptionImpl(SerializedExceptionPBImpl.java:171)
  at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:182)
  at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
  at 

[jira] [Updated] (YARN-10232) InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at RUNNING

2020-04-11 Thread YCozy (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YCozy updated YARN-10232:
-
Description: 
We were testing YARN under network partition and found the following ERROR in 
RM's log.
{code:java}
2020-04-11 13:10:39,739 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
App attempt: appattempt_6_0001_02 can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
LAUNCH_FAILED at RUNNING
  at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
  at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
  at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
  at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:916)
  at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
  at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1097)
  at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1078)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:222)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:138)
  at java.lang.Thread.run(Thread.java:748)   
{code}
After analyzing the logs, we reconstructed how this bug is triggered:
 * We have a cluster with one RM and one NM.
 * A client tries to start a YARN service.
 * RM sends a request to the NM to start the containers:

NM's log:
{code:java}
2020-04-11 14:23:44,030 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for appattempt_6_0001_02 (auth:SIMPLE)
2020-04-11 14:23:44,229 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 Start request for container_6_0001_02_01 by user appattempt_6_0001_02
{code}
 * NM starts the containers successfully:

NM's log:
{code:java}
2020-04-11 14:23:44,347 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
 Application application_6_0001 transitioned from INITING to RUNNING
2020-04-11 14:23:44,357 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Container container_6_0001_02_01 transitioned from NEW to LOCALIZING
{code}
 * However, due to the network partition, the NM failed to send back the RPC response.
 * After a while, the application is running happily:

RM's log:
{code:java}
2020-04-11 14:23:50,359 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_6_0001_02 State change from ALLOCATED to RUNNING on event = 
REGISTERED
2020-04-11 14:23:50,359 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_6_0001 State change from ACCEPTED to RUNNING on event = 
ATTEMPT_REGISTERED{code}
 * Then, since RM didn't receive the RPC response for startContainers, it 
retries:

NM's log:
{code:java}
2020-04-11 14:23:54,392 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for appattempt_6_0001_02 (auth:SIMPLE)
{code}
 * But since the attempt is actually running, this launch request does not 
succeed: 

NM's log:
{code:java}
2020-04-11 14:23:54,401 ERROR 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 Unauthorized request to start container.
 Attempt to relaunch the same container with id container_6_0001_02_01.
{code}
RM's log:
{code:java}
2020-04-11 14:23:54,428 INFO 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error 
launching appattempt_6_0001_02. Got exception: 
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start 
container.
 Attempt to relaunch the same container with id container_6_0001_02_01.
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateExceptionImpl(SerializedExceptionPBImpl.java:171)
  at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:182)
  at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
  at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:137)
  at 

[jira] [Created] (YARN-10232) InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at RUNNING

2020-04-11 Thread YCozy (Jira)
YCozy created YARN-10232:


 Summary: InvalidStateTransitionException: Invalid event: 
LAUNCH_FAILED at RUNNING
 Key: YARN-10232
 URL: https://issues.apache.org/jira/browse/YARN-10232
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: YCozy


We were testing YARN under network partition and found the following ERROR in 
RM's log.

 
{code:java}
2020-04-11 13:10:39,739 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
App attempt: appattempt_6_0001_02 can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
LAUNCH_FAILED at RUNNING
  at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
  at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
  at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
  at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:916)
  at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
  at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1097)
  at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1078)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:222)
  at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:138)
  at java.lang.Thread.run(Thread.java:748)   
{code}
After analyzing the logs, we reconstructed how this bug is triggered:

 
 * We have a cluster with one RM and one NM.
 * A client tries to start a YARN service.
 * RM sends a request to the NM to start the containers:

NM's log:

 
{code:java}
2020-04-11 14:23:44,030 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for appattempt_6_0001_02 (auth:SIMPLE)
2020-04-11 14:23:44,229 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 Start request for container_6_0001_02_01 by user appattempt_6_0001_02
{code}
 * NM starts the containers successfully:

NM's log:

 
{code:java}
2020-04-11 14:23:44,347 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
 Application application_6_0001 transitioned from INITING to RUNNING
2020-04-11 14:23:44,357 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Container container_6_0001_02_01 transitioned from NEW to LOCALIZING
{code}
 * However, due to the network partition, the NM failed to send back the RPC response.
 * After a while, the application is running happily:

RM's log:

 
{code:java}
2020-04-11 14:23:50,359 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_6_0001_02 State change from ALLOCATED to RUNNING on event = 
REGISTERED
2020-04-11 14:23:50,359 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_6_0001 State change from ACCEPTED to RUNNING on event = 
ATTEMPT_REGISTERED{code}
 
 * Then, since RM didn't receive the RPC response for startContainers, it 
retries:

NM's log:

 
{code:java}
2020-04-11 14:23:54,392 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for appattempt_6_0001_02 (auth:SIMPLE)
{code}
 * But since the attempt is actually running, this launch request does not 
succeed:

 

NM's log:

 
{code:java}
2020-04-11 14:23:54,401 ERROR 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 Unauthorized request to start container.
 Attempt to relaunch the same container with id container_6_0001_02_01.
{code}
RM's log:
{code:java}
2020-04-11 14:23:54,428 INFO 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error 
launching appattempt_6_0001_02. Got exception: 
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start 
container.
 Attempt to relaunch the same container with id container_6_0001_02_01.
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateExceptionImpl(SerializedExceptionPBImpl.java:171)
  at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:182)
  at 

[jira] [Created] (YARN-10231) When a NM is partitioned away, YARN service will complain about "Queue's AM resource limit exceeded"

2020-04-10 Thread YCozy (Jira)
YCozy created YARN-10231:


 Summary: When a NM is partitioned away, YARN service will complain 
about "Queue's AM resource limit exceeded" 
 Key: YARN-10231
 URL: https://issues.apache.org/jira/browse/YARN-10231
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: YCozy


We were testing YARN's RM failover code under a network partition, and we 
observed the following failure. We think this is a bug and would like to 
confirm it with you.

Basically, we were testing the following scenario:
 # Start a YARN cluster with two RMs (e.g., RM1 and RM2) and one NM.
 # Make RM1 active.
 # Start a YARN service, e.g., the built-in sleeper service. Name it sleeper1.
 # Failover from RM1 to RM2.
 # Stop sleeper1 and start another YARN service (again the sleeper service), and 
call it sleeper2.

When no network partition happens, everything is fine (e.g., sleeper2 can start 
successfully).

However, if the NM is partitioned after the RM failover, sleeper2 will fail to 
start: After polling sleeper2's status for 30 seconds, its application report 
is still as follows:
{code:java}
Application Report :
Application-Id : application_4_0001
Application-Name : sleeper2
Application-Type : yarn-service
User : root
Queue : default
Application Priority : 0
Start-Time : 1585525063950
Finish-Time : 0
Progress : 0%
State : ACCEPTED 
Final-State : UNDEFINED 
Tracking-URL : N/A 
RPC Port : -1 
AM Host : N/A
Aggregate Resource Allocation : 0 MB-seconds, 0 vcore-seconds
Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds 
Log Aggregation Status : DISABLED
Diagnostics : [Sun Mar 29 23:37:44 + 2020] Application is added to the 
scheduler and is not yet activated. Queue's AM resource limit exceeded.  
Details : AM Partition = ; AM Resource Request = 
; Queue Resource Limit for AM = ; 
User AM Resource Limit of the queue = ; Queue AM 
Resource Usage = ;  
Unmanaged Application : false 
Application Node Label Expression :  
AM container Node Label Expression :  
TimeoutType : LIFETIME ExpiryTime : UNLIMITED RemainingTime : -1seconds
{code}
Since the only fault that happens is a network partition, the "queue's AM 
resource limit" shouldn't be exceeded.

We can reliably reproduce this bug using our fault injection engine. Please let 
us know if you need any info for debugging.


