[ https://issues.apache.org/jira/browse/SPARK-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen resolved SPARK-1030.
-------------------------------
    Resolution: Fixed
 Fix Version/s: 1.0.0

Closing this now, since it was addressed as part of Spark 1.0's PySpark on YARN patches (including SPARK-1004).

> unneeded file required when running pyspark program using yarn-client
> ---------------------------------------------------------------------
>
>                 Key: SPARK-1030
>                 URL: https://issues.apache.org/jira/browse/SPARK-1030
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, PySpark, YARN
>    Affects Versions: 0.8.1
>            Reporter: Diana Carroll
>            Assignee: Josh Rosen
>             Fix For: 1.0.0
>
> I can successfully run a pyspark program using the yarn-client master with the following command:
> {code}
> SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar \
> SPARK_YARN_APP_JAR=~/testdata.txt pyspark test1.py
> {code}
> However, SPARK_YARN_APP_JAR doesn't make any sense; it's a Python program, and therefore there's no JAR. If I don't set the variable, or if I set it to a non-existent file, Spark gives me an error message:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
> : org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set
>         at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:46)
> {code}
> or
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
> : java.io.FileNotFoundException: File file:dummy.txt does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
> {code}
> My program is very simple:
> {code}
> from pyspark import SparkContext
>
> def main():
>     sc = SparkContext("yarn-client", "Simple App")
>     logData = sc.textFile("hdfs://localhost/user/training/weblogs/2013-09-15.log")
>     numjpgs = logData.filter(lambda s: '.jpg' in s).count()
>     print "Number of JPG requests: " + str(numjpgs)
>
> if __name__ == "__main__":
>     main()
> {code}
> Although Spark reads the SPARK_YARN_APP_JAR file, it doesn't use the file at all; I can point it at anything, as long as it's a valid, accessible file, and the program works the same.
> Although there's an obvious workaround for this bug, it's high priority from my perspective because I'm working on a course to teach people how to do this, and it's really hard to explain why this variable is needed!

--
This message was sent by Atlassian JIRA
(v6.2#6252)
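The workaround the reporter alludes to can be sketched as follows: since Spark 0.8.1 only checks that SPARK_YARN_APP_JAR names a valid, accessible file, any readable placeholder satisfies it. This is a minimal sketch, assuming the behavior described in the report; the path `/tmp/dummy.jar` is a hypothetical placeholder, not anything Spark requires.

```shell
# Create an empty placeholder file; Spark only checks that it exists
# and is readable, never inspects its contents (per the report above).
touch /tmp/dummy.jar

# Point the otherwise-unused variable at the placeholder.
export SPARK_YARN_APP_JAR=/tmp/dummy.jar
echo "SPARK_YARN_APP_JAR=$SPARK_YARN_APP_JAR"

# Then launch the Python program as before (not run here):
# pyspark test1.py
```

This was only ever a stopgap; as the resolution notes, the variable requirement was removed by the PySpark-on-YARN work that shipped in Spark 1.0.0.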