Hello,
I submit a Spark job to a YARN cluster with the spark-submit command. The environment is CDH 5.4 with Spark 1.3.0, on 6 compute nodes with 64 GB of memory per node. YARN sets a 16 GB maximum of memory for every container. The job requests 6 executors of 8 GB memory each, and 8 GB for the driver. However, I always get the errors below after trying to submit the job several times. Any help?

------------ here are the error logs of the Application Master for the job --------------

17/06/22 15:18:44 INFO yarn.YarnAllocator: Completed container container_1498115278902_0001_02_000013 (state: COMPLETE, exit status: 1)
17/06/22 15:18:44 INFO yarn.YarnAllocator: Container marked as failed: container_1498115278902_0001_02_000013. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1498115278902_0001_02_000013
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1

-------- Here is the YARN application log of the job. --------
LogLength: 2611
Log Contents:
17/06/22 15:18:09 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
17/06/22 15:18:10 INFO spark.SecurityManager: Changing view acls to: yarn,root
17/06/22 15:18:10 INFO spark.SecurityManager: Changing modify acls to: yarn,root
17/06/22 15:18:10 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, root); users with modify permissions: Set(yarn, root)
17/06/22 15:18:10 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/06/22 15:18:10 INFO Remoting: Starting remoting
17/06/22 15:18:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@dn006:45701]
17/06/22 15:18:10 INFO Remoting: Remoting now listens on addresses: [akka.tcp://driverPropsFetcher@dn006:45701]
17/06/22 15:18:10 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 45701.
17/06/22 15:18:40 WARN security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1684)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:139)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:235)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:155)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    ... 4 more

---- a snippet of the RM log for the job ---------

2017-06-22 15:18:41,586 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1498115278902_0001_02_000014 of capacity <memory:6656, vCores:4> on host dn006:8041, which currently has 0 containers, <memory:0, vCores:0> used and <memory:8192, vCores:32> available, release resources=true
2017-06-22 15:18:41,586 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1498115278902_0001_000002 released container container_1498115278902_0001_02_000014 on node: host: dn006:8041 #containers=0 available=8192 used=0 with event: FINISHED
2017-06-22 15:18:41,677 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1498115278902_0001_02_000012 Container Transitioned from RUNNING to COMPLETED
2017-06-22 15:18:41,678 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Completed container: container_1498115278902_0001_02_000012 in state: COMPLETED event:FINISHED
2017-06-22 15:18:41,678 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1498115278902_0001 CONTAINERID=container_1498115278902_0001_02_000012
2017-06-22 15:18:41,678 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1498115278902_0001_02_000012 of capacity <memory:6656, vCores:4> on host dn003:8041, which currently has 0 containers, <memory:0, vCores:0> used and <memory:8192, vCores:32> available, release resources=true
2017-06-22 15:18:41,678 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1498115278902_0001_000002 released container container_1498115278902_0001_02_000012 on node: host: dn003:8041 #containers=0 available=8192 used=0 with event: FINISHED
2017-06-22 15:18:41,678 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1498115278902_0001_02_000010 Container Transitioned from RUNNING to COMPLETED
2017-06-22 15:18:41,678 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Completed container: container_1498115278902_0001_02_000010 in state: COMPLETED event:FINISHED

Thanks in advance.
Link Qian
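P.S. To rule out a simple sizing problem, here is the arithmetic I used to check the 8 GB requests against the 16 GB container limit. This is only a sketch under an assumption: I believe Spark on YARN adds a memory overhead of max(384 MB, 7% of the requested memory) on top of --executor-memory / --driver-memory (the spark.yarn.executor.memoryOverhead default; the factor may differ in other Spark versions), and the helper name below is mine, not a Spark API.

```python
# Sanity-check of the YARN container sizing for the job described above.
# ASSUMPTION: default overhead of max(384 MB, 7% of the requested heap),
# which I believe is what Spark 1.3 applies on YARN.

YARN_CONTAINER_MAX_MB = 16 * 1024   # 16 GB per-container limit in our cluster
NODE_MEMORY_MB = 64 * 1024          # 64 GB per compute node

def yarn_request_mb(heap_mb, overhead_factor=0.07, overhead_min_mb=384):
    """Total memory asked of YARN: requested heap plus the assumed overhead."""
    return heap_mb + max(overhead_min_mb, int(heap_mb * overhead_factor))

executor_request = yarn_request_mb(8 * 1024)   # --executor-memory 8g
driver_request = yarn_request_mb(8 * 1024)     # --driver-memory 8g

print(executor_request, executor_request <= YARN_CONTAINER_MAX_MB)
print(driver_request, driver_request <= YARN_CONTAINER_MAX_MB)
```

Under that assumption each 8 GB request costs YARN roughly 8.7 GB, well below the 16 GB container maximum, which is why I don't think the failures are a plain over-request.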