1) Yes, sc.parallelize(range(10)).count() fails with the same error.

2) The files seem to be correct.

3) This is the step where I run into trouble: "ImportError: No module named
pyspark". The pyspark files do seem to be inside the assembly jar, though:

"""
$ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
>>> import pyspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark

$ jar -tf ~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar pyspark
pyspark/
pyspark/rddsampler.py
pyspark/broadcast.py
pyspark/serializers.py
pyspark/java_gateway.py
pyspark/resultiterable.py
pyspark/accumulators.py
pyspark/sql.py
pyspark/__init__.py
pyspark/daemon.py
pyspark/context.py
pyspark/cloudpickle.py
pyspark/join.py
pyspark/tests.py
pyspark/files.py
pyspark/conf.py
pyspark/rdd.py
pyspark/storagelevel.py
pyspark/statcounter.py
pyspark/shell.py
pyspark/worker.py
"""

4) All my nodes should be running Java 7, so this is probably not related.

5) I'll try that in a bit.

Any ideas on 3)? Thanks.
-Simon
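[On 3), one thing worth ruling out is whether Python's zipimport, the
machinery that actually handles .jar/.zip entries on PYTHONPATH, can open
the jar at all. "jar -tf" succeeding is no guarantee, because zipimport is
stricter than the jar tool; in particular, Python 2's zipimport cannot read
ZIP64 archives, which large assemblies produced by Java 7 tooling may end
up using. A minimal diagnostic sketch, using the jar path from above:

"""
# Diagnostic sketch: compare zipfile (which understands ZIP64) with
# zipimport (which is what PYTHONPATH uses for jar/zip entries).
import os
import zipfile
import zipimport

jar = os.path.expanduser("~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar")

# zipfile should be able to list the entries without trouble.
entries = zipfile.ZipFile(jar).namelist()
print("%d entries (more than 65535 entries forces ZIP64)" % len(entries))

# If this raises ZipImportError, "import pyspark" from PYTHONPATH will
# fail even though the files are visibly inside the jar.
try:
    importer = zipimport.zipimporter(jar)
    print("zipimport OK; pyspark found: %s"
          % (importer.find_module("pyspark") is not None))
except zipimport.ZipImportError as e:
    print("zipimport cannot read the jar: %s" % e)
"""

If zipfile can list the jar but zipimport rejects it, that lines up with
the Java 6 rebuild suggested in the quoted thread below.]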
On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <and...@databricks.com> wrote:

> Hi Simon,
>
> You shouldn't have to install pyspark on every worker node. In YARN mode,
> pyspark is packaged into your assembly jar and shipped to your executors
> automatically. This seems like a more general problem. There are a few
> things to try:
>
> 1) Run a simple pyspark shell with yarn-client, and do
> "sc.parallelize(range(10)).count()" to see if you get the same error.
>
> 2) If so, check whether your assembly jar is compiled correctly. Run
>
> $ jar -tf <path/to/assembly/jar> pyspark
> $ jar -tf <path/to/assembly/jar> py4j
>
> to see if the files are there. For Py4j, you need both the Python files
> and the Java class files.
>
> 3) If the files are there, try running a simple Python shell (not the
> pyspark shell) with the assembly jar on the PYTHONPATH:
>
> $ PYTHONPATH=/path/to/assembly/jar python
> >>> import pyspark
>
> 4) If that works, try it on every worker node. If it doesn't work, there
> is probably something wrong with your jar.
>
> There is a known issue for PySpark on YARN: jars built with Java 7 cannot
> be properly opened by Java 6. I would either verify that the JAVA_HOME set
> on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV),
> or simply build your jar with Java 6:
>
> $ cd /path/to/spark/home
> $ JAVA_HOME=/path/to/java6 ./make-distribution.sh --with-yarn --hadoop 2.3.0-cdh5.0.0
>
> 5) You can check out
> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
> which has more detailed information about how to debug an application
> running on YARN. In my experience, the steps outlined there are quite
> useful.
>
> Let me know if you get it working (or not).
>
> Cheers,
> Andrew
>
>
> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:
>
>> Hi folks,
>>
>> I have a weird problem when using pyspark with yarn. I started ipython
>> as follows:
>>
>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G
>>
>> When I create a notebook, I can see workers being created, and indeed I
>> see the Spark UI running on my client machine on port 4040.
>>
>> I have the following simple script:
>>
>> """
>> import pyspark
>> from dateutil import parser  # assuming dateutil, for parser.parse below
>>
>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>> oneday = data.map(lambda line: line.split(",")) \
>>              .map(lambda f: (f[0], float(f[1]))) \
>>              .filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02") \
>>              .map(lambda t: (parser.parse(t[0]), t[1]))
>> oneday.take(1)
>> """
>>
>> When I execute this, it is my client machine (where ipython is launched)
>> that reads all the data from HDFS and produces the result of take(1),
>> rather than my worker nodes...
>>
>> When I do "data.count()", things blow up altogether. But in the error
>> message I do see something like this:
>>
>> """
>> Error from python worker:
>>     /usr/bin/python: No module named pyspark
>> """
>>
>> Am I supposed to install pyspark on every worker node?
>>
>> Thanks.
>> -Simon
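[On the observation in the original mail that the client seems to read all
the data: take(1) always ships its result back to the driver, so
driver-side activity alone doesn't prove the tasks ran locally. Once the
import problem is fixed, a quick way to see where tasks actually execute
is to have each task report its hostname. A minimal sketch, assuming the
usual SparkContext "sc" provided by the pyspark shell:

"""
# Sketch: report the hostname each task runs on. In a healthy yarn-client
# setup these should be the YARN worker nodes, not the client machine.
import socket

hosts = (sc.parallelize(range(100), 10)
           .map(lambda _: socket.gethostname())
           .distinct()
           .collect())
print(hosts)
"""

If the only hostname printed is the client's, the job really is running
locally; if the worker nodes show up, the work is landing where it should
and only the results are flowing back through the driver.]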