[ https://issues.apache.org/jira/browse/YARN-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YCozy updated YARN-10294:
-------------------------
    Description: 
We start a sleeper service on a YARN cluster. A NodeManager, NM1, is assigned 
and tries to start the service. Its log shows:
{noformat}
2020-05-28 14:48:18,650 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Starting container [container_6_0001_01_000001]
2020-05-28 14:48:18,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_01_000001 transitioned from SCHEDULED to RUNNING
{noformat}
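Due to a misconfiguration, the container fails to start. The container's 
serviceam.log shows the actual failure: the service AM cannot reach ZooKeeper 
(ConnectionLoss) and then hits a NullPointerException during startup: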
{noformat}
2020-05-28 14:48:56,651 [Curator-Framework-0] ERROR imps.CuratorFrameworkImpl - Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:972)
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
  at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
2020-05-28 14:49:04,621 [pool-5-thread-1] ERROR service.ServiceScheduler - Failed to register app sleeper1 in registry
org.apache.hadoop.registry.client.exceptions.RegistryIOException: `/registry/users/root/services/yarn-service': Failure of mkdir()  on /registry/users/root/services/yarn-service: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
  at org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:440)
  at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:595)
  at org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.mknode(RegistryOperationsService.java:99)
  at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.putService(YarnRegistryViewForProviders.java:194)
  at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.registerSelf(YarnRegistryViewForProviders.java:210)
  at org.apache.hadoop.yarn.service.ServiceScheduler$2.run(ServiceScheduler.java:575)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
  at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)
  at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
  at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
  at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
  at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
  at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)
  at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607)
  at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597)
  at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575)
  at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51)
  at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:587)
  ... 11 more
2020-05-28 14:49:12,351 [Listener at 0.0.0.0/43147] INFO  service.AbstractService - Service sleeper1 failed in state STARTED
java.lang.NullPointerException
  at org.apache.hadoop.yarn.api.records.ApplicationId.fromString(ApplicationId.java:120)
  at org.apache.hadoop.yarn.service.ServiceScheduler.recoverComponents(ServiceScheduler.java:463)
  at org.apache.hadoop.yarn.service.ServiceScheduler.serviceStart(ServiceScheduler.java:404)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
  at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122)
  at org.apache.hadoop.yarn.service.ServiceMaster.lambda$serviceStart$0(ServiceMaster.java:267)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
  at org.apache.hadoop.yarn.service.ServiceMaster.serviceStart(ServiceMaster.java:265)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
  at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:346)

...

2020-05-28 14:49:13,463 [Listener at 0.0.0.0/43147] ERROR service.ServiceMaster - Error starting service master
java.lang.NullPointerException
  at org.apache.hadoop.yarn.api.records.ApplicationId.fromString(ApplicationId.java:120)
  at org.apache.hadoop.yarn.service.ServiceScheduler.recoverComponents(ServiceScheduler.java:463)
  at org.apache.hadoop.yarn.service.ServiceScheduler.serviceStart(ServiceScheduler.java:404)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
  at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122)
  at org.apache.hadoop.yarn.service.ServiceMaster.lambda$serviceStart$0(ServiceMaster.java:267)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
  at org.apache.hadoop.yarn.service.ServiceMaster.serviceStart(ServiceMaster.java:265)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
  at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:346)
2020-05-28 14:49:13,463 [IPC Server Responder] INFO  ipc.Server - Stopping IPC Server Responder
2020-05-28 14:49:13,465 [Listener at 0.0.0.0/43147] INFO  util.ExitUtil - Exiting with status 1: Error starting service master
{noformat}
 

*However, NM1 reports an incorrect reason for the container's failure to start:*
{noformat}
2020-05-28 14:49:13,516 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_6_0001_01_000001 and exit code: 1
ExitCodeException exitCode=1: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)

  at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)
  at org.apache.hadoop.util.Shell.run(Shell.java:901)
  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
  at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:312)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:567)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:355)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_6_0001_01_000001
2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception message: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
2020-05-28 14:49:13,517 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
2020-05-28 14:49:13,519 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1.
2020-05-28 14:49:13,580 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
{noformat}
The error that NM1 reports is actually a warning from the container's 
prelaunch.err file; it is not the actual error, which is recorded in the 
container's serviceam.log.
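
To illustrate the behavior we would expect, here is a minimal sketch in Java. It is not the NodeManager's actual code and not a proposed patch: the class LaunchDiagnostics, the method pickDiagnostic, and the warning pattern are all hypothetical names and assumptions made for this example. The idea is only that benign shell warnings in the launch stderr should not be reported as the failure reason, and that when nothing else is left the message should point at the container's own logs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: chooses a diagnostic message from a
// container's launch stderr while skipping benign shell warnings.
final class LaunchDiagnostics {

    // Assumption: benign lines look like "bash: warning: ..." or
    // "/bin/bash: warning: ...", as in the prelaunch.err output quoted above.
    private static final Pattern SHELL_WARNING =
        Pattern.compile("^(/bin/)?bash: warning: .*");

    static String pickDiagnostic(List<String> stderrLines, int exitCode) {
        List<String> relevant = new ArrayList<>();
        for (String line : stderrLines) {
            String trimmed = line.trim();
            // Keep only non-empty lines that are not shell warnings.
            if (!trimmed.isEmpty() && !SHELL_WARNING.matcher(trimmed).matches()) {
                relevant.add(trimmed);
            }
        }
        if (relevant.isEmpty()) {
            // Nothing but warnings: do not present them as the failure reason.
            return "Container exited with exit code " + exitCode
                + "; see the container's own logs for the cause.";
        }
        return String.join("\n", relevant);
    }

    public static void main(String[] args) {
        List<String> stderr = Arrays.asList(
            "bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)",
            "/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)");
        System.out.println(pickDiagnostic(stderr, 1));
    }
}
```

With the stderr from this report, which contains only setlocale warnings, the sketch falls back to the generic exit-code message instead of repeating the warnings as the exception message.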


> NodeManager shows a wrong reason when a YARN service fails to start
> -------------------------------------------------------------------
>
>                 Key: YARN-10294
>                 URL: https://issues.apache.org/jira/browse/YARN-10294
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.3.0
>            Reporter: YCozy
>            Priority: Major



