So, I did specify SPARK_JAR in my pyspark program. I also checked the workers; the jar file seems to be distributed and included in the classpath correctly.
I think the problem is likely at step 3. I build my jar file with maven, like this:

mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1 -DskipTests clean package

Anything that I might have missed?

Thanks.
-Simon

On Mon, Jun 2, 2014 at 12:02 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:

> 1) Yes, sc.parallelize(range(10)).count() gives the same error.
>
> 2) The files seem to be correct.
>
> 3) I have trouble at this step ("ImportError: No module named pyspark"),
> even though the files seem to be in the jar:
> """
> $ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
> >>> import pyspark
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named pyspark
>
> $ jar -tf ~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar pyspark
> pyspark/
> pyspark/rddsampler.py
> pyspark/broadcast.py
> pyspark/serializers.py
> pyspark/java_gateway.py
> pyspark/resultiterable.py
> pyspark/accumulators.py
> pyspark/sql.py
> pyspark/__init__.py
> pyspark/daemon.py
> pyspark/context.py
> pyspark/cloudpickle.py
> pyspark/join.py
> pyspark/tests.py
> pyspark/files.py
> pyspark/conf.py
> pyspark/rdd.py
> pyspark/storagelevel.py
> pyspark/statcounter.py
> pyspark/shell.py
> pyspark/worker.py
> """
>
> 4) All my nodes should be running Java 7, so this is probably not related.
>
> 5) I'll do it in a bit.
>
> Any ideas on 3)?
>
> Thanks.
> -Simon
>
>
> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <and...@databricks.com> wrote:
>
>> Hi Simon,
>>
>> You shouldn't have to install pyspark on every worker node. In YARN mode,
>> pyspark is packaged into your assembly jar and shipped to your executors
>> automatically. This seems like a more general problem. There are a few
>> things to try:
>>
>> 1) Run a simple pyspark shell with yarn-client, and do
>> "sc.parallelize(range(10)).count()" to see if you get the same error.
>>
>> 2) If so, check if your assembly jar is compiled correctly.
>> Run
>>
>> $ jar -tf <path/to/assembly/jar> pyspark
>> $ jar -tf <path/to/assembly/jar> py4j
>>
>> to see if the files are there. For Py4j, you need both the python files
>> and the Java class files.
>>
>> 3) If the files are there, try running a simple python shell (not a
>> pyspark shell) with the assembly jar on the PYTHONPATH:
>>
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark
>>
>> 4) If that works, try it on every worker node. If it doesn't work, there
>> is probably something wrong with your jar.
>>
>> There is a known issue for PySpark on YARN: jars built with Java 7
>> cannot be properly opened by Java 6. I would either verify that the
>> JAVA_HOME set on all of your workers points to Java 7 (by setting
>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
>>
>> $ cd /path/to/spark/home
>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0
>>
>> 5) You can check out
>> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
>> which has more detailed information about how to debug a running
>> application on YARN in general. In my experience, the steps outlined
>> there are quite useful.
>>
>> Let me know if you get it working (or not).
>>
>> Cheers,
>> Andrew
>>
>>
>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:
>>
>>> Hi folks,
>>>
>>> I have a weird problem when using pyspark with yarn. I started ipython
>>> as follows:
>>>
>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
>>> --num-executors 4 --executor-memory 4G
>>>
>>> When I create a notebook, I can see workers being created, and indeed
>>> I see the Spark UI running on my client machine on port 4040.
>>> I have the following simple script:
>>> """
>>> import pyspark
>>> from dateutil import parser  # needed for parser.parse below
>>>
>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>>> oneday = data.map(lambda line: line.split(",")).\
>>>              map(lambda f: (f[0], float(f[1]))).\
>>>              filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
>>>              map(lambda t: (parser.parse(t[0]), t[1]))
>>> oneday.take(1)
>>> """
>>>
>>> When I execute this, I see that my client machine (where ipython is
>>> launched) is reading all the data from HDFS and producing the result
>>> of take(1), rather than my worker nodes...
>>>
>>> When I do "data.count()", things blow up altogether, but I do see
>>> something like this in the error message:
>>> """
>>> Error from python worker:
>>>   /usr/bin/python: No module named pyspark
>>> """
>>>
>>> Am I supposed to install pyspark on every worker node?
>>>
>>> Thanks.
>>>
>>> -Simon
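[Editor's note: the split/filter/parse chain in the quoted script can be sanity-checked locally on plain Python lists, without a cluster. A minimal sketch follows; the sample lines are invented to match the "date,value" shape the script implies, and datetime.strptime stands in for the dateutil parser the script assumes.]

```python
from datetime import datetime

# Invented sample lines in the "date,value" shape the script implies.
lines = ["2013-01-01,3.5", "2013-01-02,9.9", "2012-12-31,1.0"]

# Same pipeline as the RDD version: split each line, keep only records
# whose date string falls on 2013-01-01, then parse the date.
fields = [line.split(",") for line in lines]
oneday = [
    (datetime.strptime(f[0], "%Y-%m-%d"), float(f[1]))
    for f in fields
    if "2013-01-01" <= f[0] < "2013-01-02"
]
print(oneday)  # only the 2013-01-01 record survives the filter
```

Note the string comparison on f[0] works here only because ISO-formatted dates sort lexicographically, which is the same property the original filter relies on.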