Michael Le created YARN-3606:
--------------------------------
Summary: Spark container fails to launch if spark-assembly.jar
file has different timestamp
Key: YARN-3606
URL: https://issues.apache.org/jira/browse/YARN-3606
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Affects Versions: 2.6.0
Environment: YARN 2.6.0
Spark 1.3.1
Reporter: Michael Le
Priority: Minor
When a Spark job is submitted to a YARN cluster, the job fails to run because
YARN cannot launch containers on any node other than the one where the job was
submitted.
YARN decides whether the spark-assembly.jar on a node is the expected file by
comparing modification timestamps. This check fails when the jar has identical
contents but was copied into place at a different time on that node.
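The check described above can be modelled roughly as follows. This is a
simplified Python sketch, not Hadoop's actual code: the function name
check_resource_timestamp is hypothetical, and it assumes the check compares the
modification time recorded at submission with the one seen on the launching
node, as the error message in the log below suggests.

```python
def check_resource_timestamp(path: str, expected_ms: int, actual_ms: int) -> None:
    """Hypothetical model of a timestamp-only resource check.

    expected_ms: modification time recorded when the job was submitted.
    actual_ms:   modification time of the file as seen on the launching node.
    """
    if actual_ms != expected_ms:
        # Wording follows the message in the reported log (including the
        # missing closing parenthesis).
        raise IOError(
            f"Resource {path} changed on src filesystem "
            f"(expected {expected_ms}, was {actual_ms}"
        )
```

With the values from the log (expected 1430944255000, was 1430944249000) this
raises, even though the two copies of the jar may be byte-for-byte identical.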
YARN throws this exception:
15/05/07 20:13:22 INFO yarn.ExecutorRunnable: Setting up executor with
commands: List({{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill
%p', -Xms1024m, -Xmx1024m, -Djava.io.tmpdir={{PWD}}/tmp,
'-Dspark.driver.port=52357', -Dspark.yarn.app.container.log.dir=<LOG_DIR>,
org.apache.spark.executor.CoarseGrainedExecutorBackend, --driver-url,
akka.tcp://sparkDriver@xxx:52357/user/CoarseGrainedScheduler, --executor-id, 4,
--hostname, xxx, --cores, 1, --app-id, application_1431047540996_0001,
--user-class-path, file:$PWD/__app__.jar, 1>, <LOG_DIR>/stdout, 2>,
<LOG_DIR>/stderr)
15/05/07 20:13:22 INFO impl.ContainerManagementProtocolProxy: Opening proxy :
xxx:34165
15/05/07 20:13:27 INFO yarn.YarnAllocator: Completed container
container_1431047540996_0001_02_000005 (state: COMPLETE, exit status: -1000)
15/05/07 20:13:27 INFO yarn.YarnAllocator: Container marked as failed:
container_1431047540996_0001_02_000005. Exit status: -1000. Diagnostics:
Resource
file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar
changed on src filesystem (expected 1430944255000, was 1430944249000
java.io.IOException: Resource
file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar
changed on src filesystem (expected 1430944255000, was 1430944249000
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
The problem is easy to reproduce: set up two nodes, copy the spark-assembly.jar
to each node, and change the timestamp of the file on one of the nodes. Then
run spark-shell --master yarn-client and check the NodeManager log on the other
node for the error.
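The timestamp skew itself can be illustrated on a single machine. The sketch
below is only an illustration, not part of the cluster reproduction: the
function name, the file names and the 6 second skew (the difference between
the two timestamps in the log) are assumptions, and the two files merely stand
in for the copies of the jar on the two nodes.

```python
import os
import shutil
import tempfile


def make_copies_with_skewed_mtime(skew_seconds: int = 6):
    """Create two byte-identical files whose mtimes differ by skew_seconds."""
    workdir = tempfile.mkdtemp()
    node_a = os.path.join(workdir, "node_a.jar")  # stand-in for the submitting node
    node_b = os.path.join(workdir, "node_b.jar")  # stand-in for the other node
    with open(node_a, "wb") as f:
        f.write(b"placeholder jar contents")
    shutil.copy2(node_a, node_b)  # copy2 preserves the modification time
    st = os.stat(node_a)
    # Shift only node_b's mtime, as if it had been copied at a different time.
    os.utime(node_b, (st.st_atime, st.st_mtime - skew_seconds))
    return node_a, node_b
```

The two files returned have the same contents but different modification
times, which is the condition described above.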
The workaround is to make sure the jar file has the same timestamp on every
node. However, it looks like the function that copies and checks the jar file
(org.apache.hadoop.yarn.util.FSDownload.copy, FSDownload.java:253) should
compare files using a checksum rather than a timestamp.
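A checksum-based comparison along the lines suggested here could look like the
sketch below. It is written in Python purely for illustration and is not a
patch for FSDownload: the helper names file_sha256 and same_content are
hypothetical, and SHA-256 is just one possible choice of checksum.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """Compare two files by content only; modification times are ignored."""
    return file_sha256(path_a) == file_sha256(path_b)
```

Two copies of the same jar compare equal here regardless of when each was
copied. The likely trade-off is cost: hashing a large assembly jar reads the
whole file, whereas a timestamp check only needs file metadata.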