When a Spark job on YARN writes to HDFS, either with RDD.saveAsTextFile(hdfsPath: String) or directly through an FSDataOutputStream, the UID of the process writing the file often conflicts with the UID of the user running the Spark job (the appUser). As a result, the owner of the newly written files or directories is the YARN user, not the appUser, which creates file access problems for downstream processes. Depending on the cluster setup, this conflict often results in write permission errors that kill the job.

In contrast, when one runs an equivalent YARN MapReduce2 job, submitted the usual way with "hadoop jar JARFILE INPUT OUTPUT args...", all application output files are owned by the appUser, the UID running the job. This occurs in the environments detailed below.
There are some workarounds, including changing the system umask to 000 or opening the output destination directories to 0777, but some environments do not allow any workarounds. None of the workarounds solve the downstream processing problems that result from incorrect UIDs.
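To be concrete, what I mean by these workarounds, expressed with the Hadoop client API, is roughly the following (a sketch only; fs.permissions.umask-mode is the Hadoop 2 umask key, and the path is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.{FsAction, FsPermission}

val conf = new Configuration()
// Workaround 1: relax the umask so newly created files and directories come out wide open.
// (This only helps if the setting actually reaches the process doing the writes.)
conf.set("fs.permissions.umask-mode", "000")

// Workaround 2: open the destination directory to 0777 before the job writes into it.
val fs = FileSystem.get(conf)
fs.setPermission(new Path("/user/klmarkey/output"),
  new FsPermission(FsAction.ALL, FsAction.ALL, FsAction.ALL)) // 0777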
I've scanned any number of configuration instructions, and I've jumped into the code and the various Spark and YARN scripts used by yarn-standalone and yarn-client modes, but to no avail! Any help would be appreciated!!!

Here are a few details...

In yarn-standalone mode, output directories and their contents are owned by YARN, not by the appUser UID:
# Spark yarn-standalone job
drwxrwxrwx   - klmarkey klmarkey    0 2014-01-16 16:09 output
drwxrwxrwx   - yarn     klmarkey    0 2014-01-16 16:09 output/SparkTest1
drwxrwxrwx   - yarn     klmarkey    0 2014-01-16 16:09 output/SparkTest1/summary
-rw-r--r--   3 yarn     klmarkey 4658 2014-01-16 16:09 output/SparkTest1/summary/summary.txt

# Yarn MapReduce job
drwxr-xr-x   - klmarkey klmarkey    0 2014-01-06 15:47 output/TestQueue4
drwxr-xr-x   - klmarkey klmarkey    0 2014-01-06 15:40 output/TestQueue4/maxfreq
drwxr-xr-x   - klmarkey klmarkey    0 2014-01-06 15:38 output/TestQueue4/maxfreq/0000
-rw-r--r--   3 klmarkey klmarkey    0 2014-01-06 15:38 output/TestQueue4/maxfreq/0000/_SUCCESS
-rw-r--r--   3 klmarkey klmarkey   65 2014-01-06 15:38 output/TestQueue4/maxfreq/0000/part-r-00000

In yarn-client mode, output directories are owned by the appUser, but contents written by the worker nodes are owned by the YARN UID. In the following example, an RDD.saveAsTextFile write failed when Spark worker nodes attempted to write temporary results into output/SparkClient1/values/column-0000/_temporary/0.

# Spark yarn-client job (failed; see error message below)
drwxrwxrwx   - klmarkey klmarkey    0 2014-01-30 22:58 output
drwxr-xr-x   - klmarkey klmarkey    0 2014-01-30 22:39 output/SparkClient1
drwxr-xr-x   - klmarkey klmarkey    0 2014-01-30 22:39 output/SparkClient1/values
drwxr-xr-x   - klmarkey klmarkey    0 2014-01-30 22:39 output/SparkClient1/values/column-0000
drwxr-xr-x   - klmarkey klmarkey    0 2014-01-30 22:39 output/SparkClient1/values/column-0000/_temporary
drwxr-xr-x   - klmarkey klmarkey    0 2014-01-30 22:39 output/SparkClient1/values/column-0000/_temporary/0

Here is the error message:

14/01/30 22:39:37 WARN ClusterTaskSetManager: Loss was due to org.apache.hadoop.security.AccessControlException
org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=WRITE, inode="/user/klmarkey/output/SparkClient1/values/column-0000/_temporary/0":klmarkey:klmarkey:drwxr-xr-x
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:234)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:214)

Finally, in all of these runs, "klmarkey" is reported as the appUser in all of the logs, yet Java system properties report "user.name" as "yarn" in the code path that writes through FSDataOutputStream (I can't instrument saveAsTextFile in the same way).

Thanks.

Kevin Markey
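P.S. For what it's worth, the instrumentation mentioned above amounts to something like this (a sketch; the UserGroupInformation line is an additional check, since that is the identity the HDFS permission checker actually sees):

import org.apache.hadoop.security.UserGroupInformation

// Dropped into the code path that writes through FSDataOutputStream.
// In these runs, user.name comes back as "yarn", even though the logs
// report klmarkey as the appUser.
println("user.name = " + System.getProperty("user.name"))
println("UGI user  = " + UserGroupInformation.getCurrentUser().getShortUserName())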
