Hi Simon,

You shouldn't have to install pyspark on every worker node. In YARN mode,
pyspark is packaged into your assembly jar and shipped to your executors
automatically, so this seems like a more general problem. There are a few
things to try:
1) Run a simple pyspark shell with yarn-client, and do

       sc.parallelize(range(10)).count()

   to see if you get the same error.

2) If so, check whether your assembly jar is compiled correctly. Run

       $ jar -tf <path/to/assembly/jar> pyspark
       $ jar -tf <path/to/assembly/jar> py4j

   to see if the files are there. For Py4j, you need both the python files
   and the Java class files.

3) If the files are there, try running a simple python shell (not a pyspark
   shell) with the assembly jar on the PYTHONPATH:

       $ PYTHONPATH=/path/to/assembly/jar python
       >>> import pyspark

4) If that works, try it on every worker node. If it doesn't work, there is
   probably something wrong with your jar. There is a known issue for
   PySpark on YARN: jars built with Java 7 cannot be properly opened by
   Java 6. I would either verify that the JAVA_HOME set on all of your
   workers points to Java 7 (by setting SPARK_YARN_USER_ENV), or simply
   build your jar with Java 6:

       $ cd /path/to/spark/home
       $ JAVA_HOME=/path/to/java6 ./make-distribution.sh --with-yarn --hadoop 2.3.0-cdh5.0.0

5) Check out
   http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
   which has more detailed information about how to debug a running
   application on YARN in general. In my experience, the steps outlined
   there are quite useful.

Let me know if you get it working (or not).

Cheers,
Andrew

2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:

> Hi folks,
>
> I have a weird problem when using pyspark with yarn. I started ipython as
> follows:
>
>     IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 \
>         --num-executors 4 --executor-memory 4G
>
> When I create a notebook, I can see workers being created, and indeed I
> see the Spark UI running on my client machine on port 4040.
>
> I have the following simple script:
>
> """
> import pyspark
> data = sc.textFile("hdfs://test/tmp/data/*").cache()
> oneday = data.map(lambda line: line.split(",")).\
>              map(lambda f: (f[0], float(f[1]))).\
>              filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
>              map(lambda t: (parser.parse(t[0]), t[1]))
> oneday.take(1)
> """
>
> By executing this, I see that it is my client machine (where ipython is
> launched) that is reading all the data from HDFS and producing the result
> of take(1), rather than my worker nodes...
>
> When I do "data.count()", things blow up altogether. But I do see in the
> error message something like this:
>
> """
> Error from python worker:
>     /usr/bin/python: No module named pyspark
> """
>
> Am I supposed to install pyspark on every worker node?
>
> Thanks.
> -Simon
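P.S. On step 4: if you want to confirm which Java version a jar's classes
were compiled for, you can read the class-file major version directly
(major 50 = Java 6, major 51 = Java 7). This is just a sketch, not part of
any Spark tooling: the printf below fabricates a Java 7 class-file header
so the parsing step is self-contained, and the jar path in the note
afterwards is a placeholder.

```shell
# Every .class file starts with the magic bytes CA FE BA BE, followed by a
# 2-byte minor version and a 2-byte major version (big-endian).
# Fabricate a header with major version 0x33 (51 decimal, i.e. Java 7):
printf '\312\376\272\276\000\000\000\063' > Demo.class

# Read bytes 6-7 (the major version) as unsigned decimals and combine them:
major=$(od -An -j6 -N2 -t u1 Demo.class | awk '{print $1 * 256 + $2}')
echo "class file major version: $major"   # 51 => compiled for Java 7
```

Against a real assembly jar you would extract a class entry instead of the
fabricated file, e.g. (placeholder path, assuming unzip is installed):

    $ unzip -p /path/to/assembly/jar org/apache/spark/SparkContext.class \
        | od -An -j6 -N2 -t u1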