Hi folks, I have a weird problem when using PySpark with YARN. I started IPython as follows:
IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G

When I create a notebook, I can see workers being created, and the Spark UI comes up on my client machine on port 4040. I then run the following simple script:

"""
import pyspark
from dateutil import parser  # needed for parser.parse below

data = sc.textFile("hdfs://test/tmp/data/*").cache()
oneday = data.map(lambda line: line.split(",")) \
             .map(lambda f: (f[0], float(f[1]))) \
             .filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02") \
             .map(lambda t: (parser.parse(t[0]), t[1]))
oneday.take(1)
"""

When I execute this, it is my client machine (where IPython is launched) that reads all the data from HDFS and produces the result of take(1), rather than my worker nodes... When I run data.count(), things blow up altogether, and the error output contains something like this:

"""
Error from python worker: /usr/bin/python: No module named pyspark
"""

Am I supposed to install pyspark on every worker node?

Thanks.
-Simon
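P.S. In case it helps, here is a small check I was thinking of running to see which Python environment the executors actually use (just a rough sketch, assuming sc is the SparkContext created by the shell; the partition count is arbitrary). Although I suppose that if the workers really cannot import pyspark at all, even this will die with the same error:

"""
def env_info(_):
    # Runs inside an executor: report the host, the Python executable,
    # and whether pyspark is importable there.
    import socket, sys
    try:
        import pyspark
        found = pyspark.__file__
    except ImportError:
        found = "not importable"
    return (socket.gethostname(), sys.executable, found)

# 8 partitions is arbitrary, just enough to spread tasks across the executors.
print(sc.parallelize(range(8), 8).map(env_info).distinct().collect())
"""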