[ https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-5655.
------------------------------
       Resolution: Fixed
    Fix Version/s: 1.3.0

> YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-5655
>                 URL: https://issues.apache.org/jira/browse/SPARK-5655
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.0, 1.2.1
>         Environment: Both CDH5.3.0 and CDH5.1.3, latest build on branch-1.2
>            Reporter: Andrew Rowson
>            Assignee: Andrew Rowson
>            Priority: Critical
>              Labels: hadoop
>             Fix For: 1.3.0
>
>
> When running a Spark job with the YARN auxiliary shuffle service on a 
> cluster that doesn't run containers under the same user as the 
> nodemanager, jobs fail with something similar to:
> {code:java}
> java.io.FileNotFoundException: /data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index (Permission denied)
> {code}
> The root cause is here: 
> https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287
> Spark attempts to chmod 700 any application directories it creates during 
> the job, which includes files created in the nodemanager's usercache 
> directory. These files are owned by the container's user: on a secure 
> cluster that is the user submitting the job, and on a nonsecure cluster 
> with a yarn.nodemanager.container-executor.class configured it is the 
> value of yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.
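> For reference, that chmod 700 goes through the java.io.File permission 
> API; a rough sketch (not the verbatim Spark source) looks like this:
> {code}
> import java.io.File
> 
> // Revoke each permission for all users, then re-grant it to the owner
> // only: the net effect is a chmod 700 on the file or directory.
> def chmod700(file: File): Boolean = {
>   file.setReadable(false, false) &&   // clear read for everyone
>   file.setReadable(true, true) &&     // owner-only read
>   file.setWritable(false, false) &&
>   file.setWritable(true, true) &&     // owner-only write
>   file.setExecutable(false, false) &&
>   file.setExecutable(true, true)      // owner-only execute
> }
> {code}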
> The problem is that the auxiliary shuffle manager runs as part of the 
> nodemanager, which typically runs as the user 'yarn' and therefore 
> can't access files that are readable only by their owner.
> YARN already tries to secure files created under appcache while keeping 
> them readable by the nodemanager: it sets the group of the appcache 
> directory to 'yarn' and sets the setgid flag, so files and directories 
> created underneath inherit the 'yarn' group. Normally the nodemanager 
> can therefore read these files, but Spark's chmod 700 wipes this out.
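> As a quick illustration of that inheritance, the group and permissions 
> of a file under the setgid appcache directory can be checked via 
> java.nio (the path below is illustrative):
> {code}
> import java.nio.file.{Files, Paths}
> import java.nio.file.attribute.PosixFileAttributes
> 
> // A file created under YARN's setgid appcache directory inherits the
> // directory's group ('yarn'), whatever the creating user's primary
> // group is.
> val f = Paths.get("/data/1/yarn/nm/usercache/username/appcache/somefile")
> val attrs = Files.readAttributes(f, classOf[PosixFileAttributes])
> println(attrs.group().getName)  // "yarn" under the setgid scheme
> println(attrs.permissions())    // only OWNER_* entries after chmod 700
> {code}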
> I'm not sure what the right approach is here. Commenting out the 
> chmod 700 call makes this work on YARN, and the application files 
> remain readable only by the owner and the group:
> {code}
> /data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c # ls -lah
> total 206M
> drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
> drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
> -rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data
> {code}
> But this may not be the right approach outside YARN. Perhaps an 
> additional check is needed to determine whether the chmod 700 step is 
> necessary (i.e. when not running on YARN). Sadly, I don't have a 
> non-YARN environment to test with, otherwise I'd suggest a patch.
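> If it helps, a hypothetical shape for such a guard (how runningOnYarn 
> would be detected, e.g. from an environment variable or config, is an 
> assumption, not existing Spark API):
> {code}
> import java.io.File
> 
> // Hypothetical guard: only apply the owner-only restriction when not
> // running under YARN, where the setgid appcache directory already
> // limits access to the 'yarn' group.
> def restrictPermissions(file: File, runningOnYarn: Boolean): Boolean = {
>   if (runningOnYarn) true   // keep the inherited group bits readable
>   else chmod700(file)       // owner-only, as sketched above
> }
> {code}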
> I believe there is a related issue in the MapReduce framework: 
> https://issues.apache.org/jira/browse/MAPREDUCE-3728


