Hi folks, I have a weird problem when using PySpark with YARN. I started IPython as follows:
IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G

When I create a notebook, I can see workers being created, and the Spark UI comes up on my client machine on port 4040. I then run the following simple script:

"""
import pyspark
from dateutil import parser  # needed for parser.parse below

data = sc.textFile("hdfs://test/tmp/data/*").cache()
oneday = data.map(lambda line: line.split(",")) \
             .map(lambda f: (f[0], float(f[1]))) \
             .filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02") \
             .map(lambda t: (parser.parse(t[0]), t[1]))
oneday.take(1)
"""

When I execute this, it is my client machine (where IPython is launched) that reads all the data from HDFS and produces the result of take(1), rather than my worker nodes... When I run data.count(), things blow up altogether, and the error output contains something like this:

"""
Error from python worker: /usr/bin/python: No module named pyspark
"""

Am I supposed to install pyspark on every worker node?

Thanks.
-Simon
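P.S. In case it helps, here is a small check I was thinking of running to see which Python environment the executors actually use (just a rough sketch, assuming sc is the SparkContext created by the shell; the partition count is arbitrary). Although I suppose that if the workers really cannot import pyspark at all, even this will die with the same error:

"""
def env_info(_):
    # Runs inside an executor: report the host, the Python executable,
    # and whether pyspark is importable there.
    import socket, sys
    try:
        import pyspark
        found = pyspark.__file__
    except ImportError:
        found = "not importable"
    return (socket.gethostname(), sys.executable, found)

# 8 partitions is arbitrary, just enough to spread tasks across the executors.
print(sc.parallelize(range(8), 8).map(env_info).distinct().collect())
"""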