Michael Le created YARN-3606:
--------------------------------

             Summary: Spark container fails to launch if spark-assembly.jar 
file has different timestamp
                 Key: YARN-3606
                 URL: https://issues.apache.org/jira/browse/YARN-3606
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
    Affects Versions: 2.6.0
         Environment: YARN 2.6.0, Spark 1.3.1
            Reporter: Michael Le
            Priority: Minor


In a YARN cluster, a submitted Spark job fails to run because YARN cannot 
launch containers on the other nodes (any node other than the one where the 
job was submitted).

YARN decides whether the spark-assembly.jar file on a node is the expected one 
by comparing modification timestamps. This check fails when the 
spark-assembly.jar files are identical in content but were copied to their 
locations at different times.
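The effect of a timestamp-only check can be illustrated with a minimal Python 
sketch. This is not the Hadoop code: check_timestamp, mtime_ms, write_copy and 
demo are hypothetical names, and the error text is only modeled on the 
exception quoted in this report.

```python
import os
import tempfile


def mtime_ms(path):
    """Modification time in whole milliseconds (the unit used in the exception)."""
    return os.stat(path).st_mtime_ns // 1_000_000


def check_timestamp(path, expected_ms):
    """Hypothetical stand-in for the check: compares mtime only, never content."""
    actual_ms = mtime_ms(path)
    if actual_ms != expected_ms:
        raise IOError(
            f"Resource {path} changed on src filesystem "
            f"(expected {expected_ms}, was {actual_ms})"
        )


def write_copy(directory, name, mtime_s):
    """Write a placeholder 'jar' with fixed bytes and force its mtime."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(b"identical bytes")
    os.utime(path, (mtime_s, mtime_s))
    return path


def demo():
    """Two byte-identical copies; only the one with the recorded mtime passes."""
    with tempfile.TemporaryDirectory() as d:
        submit_copy = write_copy(d, "submit-node.jar", 1430944255)
        other_copy = write_copy(d, "other-node.jar", 1430944249)
        expected = mtime_ms(submit_copy)      # timestamp recorded at submission
        check_timestamp(submit_copy, expected)  # same mtime: no error
        try:
            check_timestamp(other_copy, expected)
        except IOError as err:
            return str(err)                   # same bytes, different mtime
    return None
```

Calling demo() returns the error message for the second copy even though both 
files hold the same bytes, which is the situation described above.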

YARN throws this exception:

15/05/07 20:13:22 INFO yarn.ExecutorRunnable: Setting up executor with 
commands: List({{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill 
%p', -Xms1024m, -Xmx1024m, -Djava.io.tmpdir={{PWD}}/tmp, 
'-Dspark.driver.port=52357', -Dspark.yarn.app.container.log.dir=<LOG_DIR>, 
org.apache.spark.executor.CoarseGrainedExecutorBackend, --driver-url, 
akka.tcp://sparkDriver@xxx:52357/user/CoarseGrainedScheduler, --executor-id, 4, 
--hostname, xxx, --cores, 1, --app-id, application_1431047540996_0001, 
--user-class-path, file:$PWD/__app__.jar, 1>, <LOG_DIR>/stdout, 2>, 
<LOG_DIR>/stderr)
15/05/07 20:13:22 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
xxx:34165
15/05/07 20:13:27 INFO yarn.YarnAllocator: Completed container 
container_1431047540996_0001_02_000005 (state: COMPLETE, exit status: -1000)
15/05/07 20:13:27 INFO yarn.YarnAllocator: Container marked as failed: 
container_1431047540996_0001_02_000005. Exit status: -1000. Diagnostics: 
Resource 
file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar
 changed on src filesystem (expected 1430944255000, was 1430944249000
java.io.IOException: Resource 
file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar
 changed on src filesystem (expected 1430944255000, was 1430944249000
        at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
        at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


The problem is easy to reproduce: set up two nodes, copy the 
spark-assembly.jar to each node, and change the timestamp of the file on one 
of the nodes. Then run spark-shell --master yarn-client and check the 
NodeManager log on the other node for the error.
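For the "change the timestamp" step, one possible way to do it without 
touching the file contents is sketched below in Python. shift_mtime is a 
hypothetical helper, and the path passed to it is whatever location the jar 
has on that node.

```python
import os


def shift_mtime(path, delta_seconds):
    """Move a file's modification time by delta_seconds; contents are untouched."""
    st = os.stat(path)
    new_mtime_ns = st.st_mtime_ns + int(delta_seconds * 1_000_000_000)
    # Keep the access time as it was and rewrite only the modification time.
    os.utime(path, ns=(st.st_atime_ns, new_mtime_ns))
    return new_mtime_ns
```

For example, shift_mtime(jar_path, -6) on one node would produce a 6 second 
difference like the one in the exception above (1430944255000 versus 
1430944249000).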

The workaround is to make sure the jar file has the same timestamp on every 
node. However, it looks like the function that copies and checks the jar file 
(org.apache.hadoop.yarn.util.FSDownload.copy, FSDownload.java:253) should 
check for file similarity using a checksum rather than a timestamp.
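Both ideas can be sketched in Python. This is an illustration only, not a 
patch for FSDownload: sha256_of, same_content and align_mtime are hypothetical 
names, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large assembly jar is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Checksum-based comparison: ignores timestamps entirely."""
    return sha256_of(path_a) == sha256_of(path_b)


def align_mtime(reference, target):
    """Workaround sketch: copy the reference file's mtime onto the target."""
    ref = os.stat(reference)
    os.utime(target, ns=(os.stat(target).st_atime_ns, ref.st_mtime_ns))
```

A checksum has to read the whole file, so it costs more I/O than a timestamp 
comparison. The workaround in align_mtime only helps when the two files are 
already known to be identical, which same_content can confirm first.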




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
