[ https://issues.apache.org/jira/browse/SPARK-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen resolved SPARK-1030.
-------------------------------
    Resolution: Fixed
 Fix Version/s: 1.0.0

Closing this now, since it was addressed as part of Spark 1.0's PySpark on YARN patches (including SPARK-1004).

> unneeded file required when running pyspark program using yarn-client
> ---------------------------------------------------------------------
>
>                 Key: SPARK-1030
>                 URL: https://issues.apache.org/jira/browse/SPARK-1030
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, PySpark, YARN
>    Affects Versions: 0.8.1
>            Reporter: Diana Carroll
>            Assignee: Josh Rosen
>             Fix For: 1.0.0
>
> I can successfully run a pyspark program using the yarn-client master with the following command:
> {code}
> SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar \
> SPARK_YARN_APP_JAR=~/testdata.txt pyspark test1.py
> {code}
> However, SPARK_YARN_APP_JAR doesn't make any sense; it's a Python program, and therefore there's no JAR. If I don't set the variable, or if I set it to a non-existent file, Spark gives me an error message:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
> : org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set
>         at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:46)
> {code}
> or
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
> : java.io.FileNotFoundException: File file:dummy.txt does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
> {code}
> My program is very simple:
> {code}
> from pyspark import SparkContext
>
> def main():
>     sc = SparkContext("yarn-client", "Simple App")
>     logData = sc.textFile("hdfs://localhost/user/training/weblogs/2013-09-15.log")
>     numjpgs = logData.filter(lambda s: '.jpg' in s).count()
>     print "Number of JPG requests: " + str(numjpgs)
>
> if __name__ == "__main__":
>     main()
> {code}
> Although Spark reads the SPARK_YARN_APP_JAR file, it doesn't use the file at all; I can point it at anything, as long as it's a valid, accessible file, and the program works the same.
> Although there's an obvious workaround for this bug, it's high priority from my perspective because I'm working on a course to teach people how to do this, and it's really hard to explain why this variable is needed!

--
This message was sent by Atlassian JIRA
(v6.2#6252)
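The workaround the reporter alludes to can be sketched as follows: since Spark 0.8.1 only checks that SPARK_YARN_APP_JAR names a valid, accessible file, any readable placeholder satisfies it. This is a minimal sketch, assuming the behavior described in the report; the path `/tmp/dummy.jar` is a hypothetical placeholder, not anything Spark requires.

```shell
# Create an empty placeholder file; Spark only checks that it exists
# and is readable, never inspects its contents (per the report above).
touch /tmp/dummy.jar

# Point the otherwise-unused variable at the placeholder.
export SPARK_YARN_APP_JAR=/tmp/dummy.jar
echo "SPARK_YARN_APP_JAR=$SPARK_YARN_APP_JAR"

# Then launch the Python program as before (not run here):
# pyspark test1.py
```

This was only ever a stopgap; as the resolution notes, the variable requirement was removed by the PySpark-on-YARN work that shipped in Spark 1.0.0.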