[ https://issues.apache.org/jira/browse/YARN-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
YCozy updated YARN-10294: ------------------------- Description: We have a YARN cluster and try to start a sleeper service. A NodeManager NM1 gets assigned and tries to start the service. We can see from its log: {noformat} 2020-05-28 14:48:18,650 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Starting container [container_6_0001_01_000001] 2020-05-28 14:48:18,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_01_000001 transitioned from SCHEDULED to RUNNING{noformat} Due to some misconfiguration, the container fails to start. We can also see from the container's serviceam.log: {noformat} 2020-05-28 14:48:56,651 [Curator-Framework-0] ERROR imps.CuratorFrameworkImpl - Background retry gave up org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:972) at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943) at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66) at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2020-05-28 14:49:04,621 [pool-5-thread-1] ERROR service.ServiceScheduler - Failed to register app sleeper1 in registry org.apache.hadoop.registry.client.exceptions.RegistryIOException: `/registry/users/root/services/yarn-service': Failure of mkdir() on /registry/users/root/services/yarn-service: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service at org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:440) at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:595) at org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.mknode(RegistryOperationsService.java:99) at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.putService(YarnRegistryViewForProviders.java:194) at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.registerSelf(YarnRegistryViewForProviders.java:210) at org.apache.hadoop.yarn.service.ServiceScheduler$2.run(ServiceScheduler.java:575) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637) at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180) at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156) at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153) at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51) at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:587) ... 11 more 2020-05-28 14:49:12,351 [Listener at 0.0.0.0/43147] INFO service.AbstractService - Service sleeper1 failed in state STARTED java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.ApplicationId.fromString(ApplicationId.java:120) at org.apache.hadoop.yarn.service.ServiceScheduler.recoverComponents(ServiceScheduler.java:463) at org.apache.hadoop.yarn.service.ServiceScheduler.serviceStart(ServiceScheduler.java:404) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122) at org.apache.hadoop.yarn.service.ServiceMaster.lambda$serviceStart$0(ServiceMaster.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845) at org.apache.hadoop.yarn.service.ServiceMaster.serviceStart(ServiceMaster.java:265) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:346) ... 2020-05-28 14:49:13,463 [Listener at 0.0.0.0/43147] ERROR service.ServiceMaster - Error starting service master java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.ApplicationId.fromString(ApplicationId.java:120) at org.apache.hadoop.yarn.service.ServiceScheduler.recoverComponents(ServiceScheduler.java:463) at org.apache.hadoop.yarn.service.ServiceScheduler.serviceStart(ServiceScheduler.java:404) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122) at org.apache.hadoop.yarn.service.ServiceMaster.lambda$serviceStart$0(ServiceMaster.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845) at org.apache.hadoop.yarn.service.ServiceMaster.serviceStart(ServiceMaster.java:265) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:346) 2020-05-28 14:49:13,463 [IPC Server Responder] INFO ipc.Server - Stopping IPC Server Responder 2020-05-28 14:49:13,465 [Listener at 0.0.0.0/43147] INFO util.ExitUtil - Exiting with status 1: Error starting service master{noformat} *However, NM1 shows an incorrect reason for why the container fails to start:* {noformat} 2020-05-28 14:49:13,516 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_6_0001_01_000001 and exit code: 1 ExitCodeException exitCode=1: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at org.apache.hadoop.util.Shell.run(Shell.java:901) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:312) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:567) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:355) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch. 2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_6_0001_01_000001 2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1 2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception message: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) 2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) 2020-05-28 14:49:13,517 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) 2020-05-28 14:49:13,519 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1. 2020-05-28 14:49:13,580 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE{noformat} The above error shown by NM1 is actually a warning from the container's prelaunch.err file, which differs from the error as shown in the container's serviceam.log. was: We have a YARN cluster and try to start a sleeper service. A NodeManager NM1 gets assigned and tries to start the service. We can see from its log: {noformat} 2020-05-28 14:48:18,650 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Starting container [container_6_0001_01_000001] 2020-05-28 14:48:18,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_01_000001 transitioned from SCHEDULED to RUNNING{noformat} Due to some misconfiguration, the container fails to start. We can also see from the container's serviceam.log: {noformat} 2020-05-28 14:48:56,651 [Curator-Framework-0] ERROR imps.CuratorFrameworkImpl - Background retry gave up org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:972) at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943) at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66) at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2020-05-28 14:49:04,621 [pool-5-thread-1] ERROR service.ServiceScheduler - Failed to register app sleeper1 in registry org.apache.hadoop.registry.client.exceptions.RegistryIOException: `/registry/users/root/services/yarn-service': Failure of mkdir() on /registry/users/root/services/yarn-service: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service at org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:440) at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:595) at org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.mknode(RegistryOperationsService.java:99) at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.putService(YarnRegistryViewForProviders.java:194) at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.registerSelf(YarnRegistryViewForProviders.java:210) at org.apache.hadoop.yarn.service.ServiceScheduler$2.run(ServiceScheduler.java:575) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637) at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180) at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156) at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153) at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51) at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:587) ... 11 more 2020-05-28 14:49:12,351 [Listener at 0.0.0.0/43147] INFO service.AbstractService - Service sleeper1 failed in state STARTED java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.ApplicationId.fromString(ApplicationId.java:120) at org.apache.hadoop.yarn.service.ServiceScheduler.recoverComponents(ServiceScheduler.java:463) at org.apache.hadoop.yarn.service.ServiceScheduler.serviceStart(ServiceScheduler.java:404) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122) at org.apache.hadoop.yarn.service.ServiceMaster.lambda$serviceStart$0(ServiceMaster.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845) at org.apache.hadoop.yarn.service.ServiceMaster.serviceStart(ServiceMaster.java:265) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:346) ... 2020-05-28 14:49:13,463 [Listener at 0.0.0.0/43147] ERROR service.ServiceMaster - Error starting service master java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.ApplicationId.fromString(ApplicationId.java:120) at org.apache.hadoop.yarn.service.ServiceScheduler.recoverComponents(ServiceScheduler.java:463) at org.apache.hadoop.yarn.service.ServiceScheduler.serviceStart(ServiceScheduler.java:404) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122) at org.apache.hadoop.yarn.service.ServiceMaster.lambda$serviceStart$0(ServiceMaster.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845) at org.apache.hadoop.yarn.service.ServiceMaster.serviceStart(ServiceMaster.java:265) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:346) 2020-05-28 14:49:13,463 [IPC Server Responder] INFO ipc.Server - Stopping IPC Server Responder 2020-05-28 14:49:13,465 [Listener at 0.0.0.0/43147] INFO util.ExitUtil - Exiting with status 1: Error starting service master{noformat} However, NM1 shows an incorrect reason for why the container fails to start: {noformat} 2020-05-28 14:49:13,516 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_6_0001_01_000001 and exit code: 1 ExitCodeException exitCode=1: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at org.apache.hadoop.util.Shell.run(Shell.java:901) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:312) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:567) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:355) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch. 2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_6_0001_01_000001 2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1 2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception message: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) 2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) 2020-05-28 14:49:13,517 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) 2020-05-28 14:49:13,519 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1. 2020-05-28 14:49:13,580 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE{noformat} The error shown by NM1 is actually a warning from the container's prelaunch.err file, which differs from the error as shown in the container's serviceam.log. > NodeManager shows a wrong reason when a YARN service fails to start > ------------------------------------------------------------------- > > Key: YARN-10294 > URL: https://issues.apache.org/jira/browse/YARN-10294 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Affects Versions: 3.3.0 > Reporter: YCozy > Priority: Major > > We have a YARN cluster and try to start a sleeper service. A NodeManager NM1 > gets assigned and tries to start the service. We can see from its log: > {noformat} > 2020-05-28 14:48:18,650 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: > Starting container [container_6_0001_01_000001] > 2020-05-28 14:48:18,710 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_6_0001_01_000001 transitioned from SCHEDULED to > RUNNING{noformat} > Due to some misconfiguration, the container fails to start. We can also see > from the container's serviceam.log: > {noformat} > 2020-05-28 14:48:56,651 [Curator-Framework-0] ERROR imps.CuratorFrameworkImpl > - Background retry gave up > org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = > ConnectionLoss > at > org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:972) > at > org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943) > at > org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66) > at > org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2020-05-28 14:49:04,621 [pool-5-thread-1] ERROR service.ServiceScheduler - > Failed to register app sleeper1 in registry > org.apache.hadoop.registry.client.exceptions.RegistryIOException: > `/registry/users/root/services/yarn-service': Failure of mkdir() on > /registry/users/root/services/yarn-service: > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode > = ConnectionLoss for /registry/users/root/services/yarn-service: > KeeperErrorCode = ConnectionLoss for > /registry/users/root/services/yarn-service > at > org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:440) > at > org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:595) > at > org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.mknode(RegistryOperationsService.java:99) > at > org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.putService(YarnRegistryViewForProviders.java:194) > at > org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.registerSelf(YarnRegistryViewForProviders.java:210) > at > org.apache.hadoop.yarn.service.ServiceScheduler$2.run(ServiceScheduler.java:575) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > > Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: > KeeperErrorCode = ConnectionLoss for > /registry/users/root/services/yarn-service > at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) > > at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) > > at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637) > > at > org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180) > at > org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156) > at > org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) > at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) > > at > org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153) > > > > at > org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607) > at > org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597) > at > org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575) > at > org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51) > at > org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:587) > ... 11 more > 2020-05-28 14:49:12,351 [Listener at 0.0.0.0/43147] INFO > service.AbstractService - Service sleeper1 failed in state STARTED > java.lang.NullPointerException > > at > org.apache.hadoop.yarn.api.records.ApplicationId.fromString(ApplicationId.java:120) > at > org.apache.hadoop.yarn.service.ServiceScheduler.recoverComponents(ServiceScheduler.java:463) > at > org.apache.hadoop.yarn.service.ServiceScheduler.serviceStart(ServiceScheduler.java:404) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > > > > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122) > at > org.apache.hadoop.yarn.service.ServiceMaster.lambda$serviceStart$0(ServiceMaster.java:267) > at java.security.AccessController.doPrivileged(Native Method) > > at javax.security.auth.Subject.doAs(Subject.java:422) > > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845) > at > org.apache.hadoop.yarn.service.ServiceMaster.serviceStart(ServiceMaster.java:265) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:346) > ... > 2020-05-28 14:49:13,463 [Listener at 0.0.0.0/43147] ERROR > service.ServiceMaster - Error starting service master > java.lang.NullPointerException > > at > org.apache.hadoop.yarn.api.records.ApplicationId.fromString(ApplicationId.java:120) > at > org.apache.hadoop.yarn.service.ServiceScheduler.recoverComponents(ServiceScheduler.java:463) > at > org.apache.hadoop.yarn.service.ServiceScheduler.serviceStart(ServiceScheduler.java:404) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122) > at > org.apache.hadoop.yarn.service.ServiceMaster.lambda$serviceStart$0(ServiceMaster.java:267) > at java.security.AccessController.doPrivileged(Native Method) > > at javax.security.auth.Subject.doAs(Subject.java:422) > > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845) > at > org.apache.hadoop.yarn.service.ServiceMaster.serviceStart(ServiceMaster.java:265) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:346) > 2020-05-28 14:49:13,463 [IPC Server Responder] INFO ipc.Server - Stopping > IPC Server Responder > 2020-05-28 14:49:13,465 [Listener at 0.0.0.0/43147] INFO util.ExitUtil - > Exiting with status 1: Error starting service master{noformat} > > *However, NM1 shows an incorrect reason for why the container fails to start:* > {noformat} > 2020-05-28 14:49:13,516 WARN > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception > from container-launch with container ID: container_6_0001_01_000001 and exit > code: 1 > ExitCodeException exitCode=1: bash: warning: setlocale: LC_ALL: cannot change > locale (en_US.UTF-8) > /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) > > /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) > > at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) > > at org.apache.hadoop.util.Shell.run(Shell.java:901) > > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:312) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:567) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:355) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > > 2020-05-28 14:49:13,516 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from > container-launch. > 2020-05-28 14:49:13,516 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: > container_6_0001_01_000001 > 2020-05-28 14:49:13,516 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1 > 2020-05-28 14:49:13,516 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception > message: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) > 2020-05-28 14:49:13,516 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: /bin/bash: > warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) > 2020-05-28 14:49:13,517 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: /bin/bash: > warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) > 2020-05-28 14:49:13,519 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Container launch failed : Container exited with a non-zero exit code 1. > 2020-05-28 14:49:13,580 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_6_0001_01_000001 transitioned from RUNNING to > EXITED_WITH_FAILURE{noformat} > The above error shown by NM1 is actually a warning from the container's > prelaunch.err file, which differs from the error as shown in the container's > serviceam.log. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org