Steve Loughran commented on YARN-3606:

Checking the timestamp was chosen on a key assumption: there is a single 
artifact to localise, downloaded from a single shared filesystem. Using local 
filesystems, each holding its own cached copy of the artifact, isn't what the 
NM expects to be doing. If it is, then the normal 
localisation checks aren't

I think the checksum was probably omitted because you have to read the whole 
file to see if it has changed; plus there's the cost of recalculating that 
checksum prior to launching every container. Timestamps aren't great either, 
though: the check as it stands will reject the same file with two different 
times *or* two differently sized files with the same timestamp.
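
To make the trade-off concrete, here is a minimal sketch of the two kinds of 
check. It is not the FSDownload code: the function names are hypothetical, and 
SHA-256 is an arbitrary choice of digest for illustration. The two timestamps 
are the ones from the log in the issue below.

```python
import hashlib
import os
import shutil
import tempfile


def same_by_timestamp(path, expected_mtime_ms):
    """Cheap check: a single stat call, no file contents are read."""
    actual_ms = os.stat(path).st_mtime_ns // 1_000_000
    return actual_ms == expected_mtime_ms


def sha256_of(path, chunk_size=1 << 20):
    """Expensive check: every byte of the file has to be read."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_by_checksum(path, expected_digest):
    return sha256_of(path) == expected_digest


def demo():
    """Two byte-identical copies whose modification times differ."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.jar")  # hypothetical file names
        copy = os.path.join(tmp, "copy.jar")
        with open(original, "wb") as f:
            f.write(b"identical contents")
        shutil.copyfile(original, copy)

        # Timestamps taken from the log in the issue (milliseconds).
        expected_ms, actual_ms = 1430944255000, 1430944249000
        os.utime(original, ns=(expected_ms * 10**6, expected_ms * 10**6))
        os.utime(copy, ns=(actual_ms * 10**6, actual_ms * 10**6))

        expected_digest = sha256_of(original)
        return (same_by_timestamp(copy, expected_ms),
                same_by_checksum(copy, expected_digest))


if __name__ == "__main__":
    timestamp_ok, checksum_ok = demo()
    print("timestamp check passes:", timestamp_ok)
    print("checksum check passes:", checksum_ok)
```

The timestamp comparison rejects the copy even though the bytes match, while 
the digest comparison accepts it at the price of a full read of the file.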

> Spark container fails to launch if spark-assembly.jar file has different 
> timestamp
> ----------------------------------------------------------------------------------
>                 Key: YARN-3606
>                 URL: https://issues.apache.org/jira/browse/YARN-3606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.6.0
>         Environment: YARN 2.6.0
> Spark 1.3.1
>            Reporter: Michael Le
>            Priority: Minor
> In a YARN cluster, a submitted Spark job fails to run because YARN cannot 
> launch containers on the other nodes (not the node where the job submission 
> took place).
> YARN decides whether the spark-assembly.jar file is unchanged by comparing 
> timestamps. This check fails when the spark-assembly.jar is identical but 
> was copied to each location at a different time.
> YARN throws this exception:
> 15/05/07 20:13:22 INFO yarn.ExecutorRunnable: Setting up executor with 
> commands: List({{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill 
> %p', -Xms1024m, -Xmx1024m, -Djava.io.tmpdir={{PWD}}/tmp, 
> '-Dspark.driver.port=52357', -Dspark.yarn.app.container.log.dir=<LOG_DIR>, 
> org.apache.spark.executor.CoarseGrainedExecutorBackend, --driver-url, 
> akka.tcp://sparkDriver@xxx:52357/user/CoarseGrainedScheduler, --executor-id, 
> 4, --hostname, xxx, --cores, 1, --app-id, application_1431047540996_0001, 
> --user-class-path, file:$PWD/__app__.jar, 1>, <LOG_DIR>/stdout, 2>, 
> <LOG_DIR>/stderr)
> 15/05/07 20:13:22 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> xxx:34165
> 15/05/07 20:13:27 INFO yarn.YarnAllocator: Completed container 
> container_1431047540996_0001_02_000005 (state: COMPLETE, exit status: -1000)
> 15/05/07 20:13:27 INFO yarn.YarnAllocator: Container marked as failed: 
> container_1431047540996_0001_02_000005. Exit status: -1000. Diagnostics: 
> Resource 
> file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar
>  changed on src filesystem (expected 1430944255000, was 1430944249000
> java.io.IOException: Resource 
> file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar
>  changed on src filesystem (expected 1430944255000, was 1430944249000
>         at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>         at 
> org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
>         at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>         at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> The problem can be easily replicated by setting up two nodes and copying the 
> spark-assembly.jar to each node, but changing the timestamp of the file on 
> one of the nodes. Then execute spark-shell --master yarn-client and observe 
> the nodemanager log on the other node to find the error.
> The workaround is to make sure the jar file has the same timestamp on each 
> node. But it looks like the function that does the copy and check of the jar 
> file (org.apache.hadoop.yarn.util.FSDownload.copy, FSDownload.java:253) 
> should check for file similarity using a checksum rather than a timestamp.
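
The workaround described in the issue amounts to forcing the same modification 
time onto every node's copy of the jar. A minimal sketch, assuming it is run 
once per node against the local copy; `align_mtime` is a hypothetical helper, 
not part of YARN or Spark, and the path and timestamp in the usage lines are 
taken from the log above:

```python
import os


def align_mtime(path, mtime_ms):
    """Set the file's modification time to mtime_ms (milliseconds since epoch).

    The access time is left unchanged. Returns the modification time read
    back from the filesystem, in milliseconds, so the caller can confirm it.
    """
    atime_ns = os.stat(path).st_atime_ns
    os.utime(path, ns=(atime_ns, mtime_ms * 1_000_000))
    return os.stat(path).st_mtime_ns // 1_000_000


if __name__ == "__main__":
    # Path and expected timestamp as they appear in the log above.
    jar = ("/home/spark/spark-1.3.1-bin-hadoop2.6/lib/"
           "spark-assembly-1.3.1-hadoop2.6.0.jar")
    if os.path.exists(jar):
        print(align_mtime(jar, 1430944255000))
```

This only hides the mismatch; it does nothing to verify that the copies really 
are the same file.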
