Hi Simon,

You shouldn't have to install pyspark on every worker node. In YARN mode,
pyspark is packaged into your assembly jar and shipped to your executors
automatically, so this seems like a more general problem. There are a few
things to try:
1) Run a simple pyspark shell with yarn-client, and do

       sc.parallelize(range(10)).count()

   to see if you get the same error.

2) If so, check whether your assembly jar is compiled correctly. Run

       $ jar -tf <path/to/assembly/jar> pyspark
       $ jar -tf <path/to/assembly/jar> py4j

   to see if the files are there. For Py4j, you need both the python files
   and the Java class files.

3) If the files are there, try running a simple python shell (not a pyspark
   shell) with the assembly jar on the PYTHONPATH:

       $ PYTHONPATH=/path/to/assembly/jar python
       >>> import pyspark

4) If that works, try it on every worker node. If it doesn't work, there is
   probably something wrong with your jar. There is a known issue for
   PySpark on YARN: jars built with Java 7 cannot be properly opened by
   Java 6. I would either verify that the JAVA_HOME set on all of your
   workers points to Java 7 (by setting SPARK_YARN_USER_ENV), or simply
   build your jar with Java 6:

       $ cd /path/to/spark/home
       $ JAVA_HOME=/path/to/java6 ./make-distribution.sh --with-yarn --hadoop 2.3.0-cdh5.0.0

5) Check out
   http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
   which has more detailed information about how to debug a running
   application on YARN in general. In my experience, the steps outlined
   there are quite useful.

Let me know if you get it working (or not).

Cheers,
Andrew

2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:

> Hi folks,
>
> I have a weird problem when using pyspark with yarn. I started ipython as
> follows:
>
>     IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 \
>         --num-executors 4 --executor-memory 4G
>
> When I create a notebook, I can see workers being created, and indeed I
> see the Spark UI running on my client machine on port 4040.
>
> I have the following simple script:
>
> """
> import pyspark
> data = sc.textFile("hdfs://test/tmp/data/*").cache()
> oneday = data.map(lambda line: line.split(",")).\
>              map(lambda f: (f[0], float(f[1]))).\
>              filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
>              map(lambda t: (parser.parse(t[0]), t[1]))
> oneday.take(1)
> """
>
> By executing this, I see that it is my client machine (where ipython is
> launched) that is reading all the data from HDFS and producing the result
> of take(1), rather than my worker nodes...
>
> When I do "data.count()", things blow up altogether. But I do see in the
> error message something like this:
>
> """
> Error from python worker:
>     /usr/bin/python: No module named pyspark
> """
>
> Am I supposed to install pyspark on every worker node?
>
> Thanks.
> -Simon
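P.S. On step 4: if you want to confirm which Java version a jar's classes
were compiled for, you can read the class-file major version directly
(major 50 = Java 6, major 51 = Java 7). This is just a sketch, not part of
any Spark tooling: the printf below fabricates a Java 7 class-file header
so the parsing step is self-contained, and the jar path in the note
afterwards is a placeholder.

```shell
# Every .class file starts with the magic bytes CA FE BA BE, followed by a
# 2-byte minor version and a 2-byte major version (big-endian).
# Fabricate a header with major version 0x33 (51 decimal, i.e. Java 7):
printf '\312\376\272\276\000\000\000\063' > Demo.class

# Read bytes 6-7 (the major version) as unsigned decimals and combine them:
major=$(od -An -j6 -N2 -t u1 Demo.class | awk '{print $1 * 256 + $2}')
echo "class file major version: $major"   # 51 => compiled for Java 7
```

Against a real assembly jar you would extract a class entry instead of the
fabricated file, e.g. (placeholder path, assuming unzip is installed):

    $ unzip -p /path/to/assembly/jar org/apache/spark/SparkContext.class \
        | od -An -j6 -N2 -t u1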