>> I asked several people, no one seems to believe that we can do this:
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark

That is because people usually don't package Python files into their jars. For
pyspark, however, this works as long as the jar can be opened and its contents
read. In my experience, if I can import the pyspark module by explicitly
putting the assembly jar on the PYTHONPATH this way, then I can run pyspark on
YARN without fail.
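
If you want to sanity-check this without launching pyspark at all, a rough
sketch along these lines (the jar path is just a placeholder) exercises the
same zipimport machinery that the PYTHONPATH trick relies on:

# A rough check: can Python's zipimport open the assembly jar and find the
# pyspark package? The jar path is a placeholder -- point it at your own jar.
import zipimport

importer = zipimport.zipimporter("/path/to/assembly/jar")
# zipimporter() raises ZipImportError if the archive cannot be read at all.
print(importer.find_module("pyspark"))
# A non-None result means pyspark is importable straight from the jar.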

>> > OK, my colleague found this:
>> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
>> >
>> > And my jar file has 70011 files. Fantastic..

It seems that this problem is not specific to running Java 6 against a Java 7
jar: judging from that thread, Python's zipimport also chokes on archives with
more than 65535 entries, and at 70011 files your jar is well past that limit.
We definitely need to document the Java 7 jar issue and warn against it more
aggressively. For now, please do try building the jar with Java 6.
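
If you want to confirm which limit you are hitting, a quick check like the one
below (again, the jar path is a placeholder) counts the entries in the
assembly; anything over 65535 forces the Zip64 format, which is exactly what
older readers such as Java 6 and Python's zipimport trip over:

# Count the entries in the assembly jar. More than 65535 entries means the
# archive has to use Zip64. The jar path is a placeholder.
import zipfile

with zipfile.ZipFile("/path/to/assembly/jar") as jar:
    print(len(jar.namelist()))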



2014-06-03 4:42 GMT+02:00 Patrick Wendell <pwend...@gmail.com>:

> Yeah we need to add a build warning to the Maven build. Would you be
> able to try compiling Spark with Java 6? It would be good to narrow
> down if you are hitting this problem or something else.
>
> On Mon, Jun 2, 2014 at 1:15 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:
> > Nope... didn't try Java 6. The standard installation guide didn't say
> > anything about Java 7 and suggested doing "-DskipTests" for the build:
> > http://spark.apache.org/docs/latest/building-with-maven.html
> >
> > So, I didn't see the warning message...
> >
> >
> > On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell <pwend...@gmail.com>
> wrote:
> >>
> >> Are you building Spark with Java 6 or Java 7? Java 6 uses the extended
> >> Zip format and Java 7 uses Zip64. I think we've tried to add some
> >> build warnings if Java 7 is used, for this reason:
> >>
> >> https://github.com/apache/spark/blob/master/make-distribution.sh#L102
> >>
> >> Any luck if you use JDK 6 to compile?
> >>
> >>
> >> On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen <xche...@gmail.com>
> >> wrote:
> >> > OK, my colleague found this:
> >> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
> >> >
> >> > And my jar file has 70011 files. Fantastic..
> >> >
> >> >
> >> >
> >> >
> >> > On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <xche...@gmail.com>
> >> > wrote:
> >> >>
> >> >> I asked several people, no one seems to believe that we can do this:
> >> >> $ PYTHONPATH=/path/to/assembly/jar python
> >> >> >>> import pyspark
> >> >>
> >> >> This following pull request did mention something about generating a
> >> >> zip
> >> >> file for all python related modules:
> >> >> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
> >> >>
> >> >> I've tested that zipped modules can at least be imported via zipimport.
> >> >>
> >> >> Any ideas?
> >> >>
> >> >> -Simon
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <and...@databricks.com>
> >> >> wrote:
> >> >>>
> >> >>> Hi Simon,
> >> >>>
> >> >>> You shouldn't have to install pyspark on every worker node. In YARN
> >> >>> mode,
> >> >>> pyspark is packaged into your assembly jar and shipped to your
> >> >>> executors
> >> >>> automatically. This seems like a more general problem. There are a
> few
> >> >>> things to try:
> >> >>>
> >> >>> 1) Run a simple pyspark shell with yarn-client, and do
> >> >>> "sc.parallelize(range(10)).count()" to see if you get the same error
> >> >>> 2) If so, check if your assembly jar is compiled correctly. Run
> >> >>>
> >> >>> $ jar -tf <path/to/assembly/jar> pyspark
> >> >>> $ jar -tf <path/to/assembly/jar> py4j
> >> >>>
> >> >>> to see if the files are there. For Py4j, you need both the python
> >> >>> files
> >> >>> and the Java class files.
> >> >>>
> >> >>> 3) If the files are there, try running a simple python shell (not
> >> >>> pyspark
> >> >>> shell) with the assembly jar on the PYTHONPATH:
> >> >>>
> >> >>> $ PYTHONPATH=/path/to/assembly/jar python
> >> >>> >>> import pyspark
> >> >>>
> >> >>> 4) If that works, try it on every worker node. If it doesn't work,
> >> >>> there
> >> >>> is probably something wrong with your jar.
> >> >>>
> >> >>> There is a known issue for PySpark on YARN - jars built with Java 7
> >> >>> cannot be properly opened by Java 6. I would either verify that the
> >> >>> JAVA_HOME set on all of your workers points to Java 7 (by setting
> >> >>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
> >> >>>
> >> >>> $ cd /path/to/spark/home
> >> >>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
> >> >>> 2.3.0-cdh5.0.0
> >> >>>
> >> >>> 5) You can check out
> >> >>>
> >> >>>
> >> >>> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
> >> >>> which has more detailed information about how to debug an application
> >> >>> running on YARN in general. In my experience, the steps outlined there
> >> >>> are quite useful.
> >> >>>
> >> >>> Let me know if you get it working (or not).
> >> >>>
> >> >>> Cheers,
> >> >>> Andrew
> >> >>>
> >> >>>
> >> >>>
> >> >>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:
> >> >>>
> >> >>>> Hi folks,
> >> >>>>
> >> >>>> I have a weird problem when using pyspark with yarn. I started
> >> >>>> ipython
> >> >>>> as follows:
> >> >>>>
> >> >>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
> >> >>>> --num-executors 4 --executor-memory 4G
> >> >>>>
> >> >>>> When I create a notebook, I can see workers being created, and indeed
> >> >>>> I see the Spark UI running on my client machine on port 4040.
> >> >>>>
> >> >>>> I have the following simple script:
> >> >>>> """
> >> >>>> import pyspark
> >> >>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
> >> >>>> oneday = data.map(lambda line: line.split(",")).\
> >> >>>>               map(lambda f: (f[0], float(f[1]))).\
> >> >>>>               filter(lambda t: t[0] >= "2013-01-01" and t[0] <
> >> >>>> "2013-01-02").\
> >> >>>>               map(lambda t: (parser.parse(t[0]), t[1]))
> >> >>>> oneday.take(1)
> >> >>>> """
> >> >>>>
> >> >>>> By executing this, I see that it is my client machine (where ipython
> >> >>>> is launched) that is reading all the data from HDFS and producing the
> >> >>>> result of take(1), rather than my worker nodes...
> >> >>>>
> >> >>>> When I do "data.count()", things would blow up altogether. But I do
> >> >>>> see
> >> >>>> in the error message something like this:
> >> >>>> """
> >> >>>>
> >> >>>> Error from python worker:
> >> >>>>   /usr/bin/python: No module named pyspark
> >> >>>>
> >> >>>> """
> >> >>>>
> >> >>>>
> >> >>>> Am I supposed to install pyspark on every worker node?
> >> >>>>
> >> >>>>
> >> >>>> Thanks.
> >> >>>>
> >> >>>> -Simon
> >> >>>
> >> >>>
> >> >>
> >> >
> >
> >
>
