So, I did specify SPARK_JAR in my pyspark program. I also checked the workers; the jar file seems to be distributed and included in the classpath correctly.
I think the problem is likely at step 3. I build my jar file with maven, like this:

mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1 -DskipTests clean package

Anything that I might have missed?

Thanks.
-Simon

On Mon, Jun 2, 2014 at 12:02 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:

> 1) Yes, sc.parallelize(range(10)).count() gives the same error.
>
> 2) The files seem to be correct.
>
> 3) I have trouble at this step ("ImportError: No module named pyspark"),
> even though the files seem to be in the jar:
> """
> $ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
> >>> import pyspark
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named pyspark
>
> $ jar -tf ~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar pyspark
> pyspark/
> pyspark/rddsampler.py
> pyspark/broadcast.py
> pyspark/serializers.py
> pyspark/java_gateway.py
> pyspark/resultiterable.py
> pyspark/accumulators.py
> pyspark/sql.py
> pyspark/__init__.py
> pyspark/daemon.py
> pyspark/context.py
> pyspark/cloudpickle.py
> pyspark/join.py
> pyspark/tests.py
> pyspark/files.py
> pyspark/conf.py
> pyspark/rdd.py
> pyspark/storagelevel.py
> pyspark/statcounter.py
> pyspark/shell.py
> pyspark/worker.py
> """
>
> 4) All my nodes should be running Java 7, so this is probably not related.
>
> 5) I'll do it in a bit.
>
> Any ideas on 3)?
>
> Thanks.
> -Simon
>
>
> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <and...@databricks.com> wrote:
>
>> Hi Simon,
>>
>> You shouldn't have to install pyspark on every worker node. In YARN mode,
>> pyspark is packaged into your assembly jar and shipped to your executors
>> automatically. This seems like a more general problem. There are a few
>> things to try:
>>
>> 1) Run a simple pyspark shell with yarn-client, and do
>> "sc.parallelize(range(10)).count()" to see if you get the same error.
>>
>> 2) If so, check if your assembly jar is compiled correctly.
>> Run
>>
>> $ jar -tf <path/to/assembly/jar> pyspark
>> $ jar -tf <path/to/assembly/jar> py4j
>>
>> to see if the files are there. For Py4j, you need both the python files
>> and the Java class files.
>>
>> 3) If the files are there, try running a simple python shell (not a
>> pyspark shell) with the assembly jar on the PYTHONPATH:
>>
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark
>>
>> 4) If that works, try it on every worker node. If it doesn't work, there
>> is probably something wrong with your jar.
>>
>> There is a known issue for PySpark on YARN: jars built with Java 7
>> cannot be properly opened by Java 6. I would either verify that the
>> JAVA_HOME set on all of your workers points to Java 7 (by setting
>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
>>
>> $ cd /path/to/spark/home
>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0
>>
>> 5) You can check out
>> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
>> which has more detailed information about how to debug a running
>> application on YARN in general. In my experience, the steps outlined
>> there are quite useful.
>>
>> Let me know if you get it working (or not).
>>
>> Cheers,
>> Andrew
>>
>>
>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:
>>
>>> Hi folks,
>>>
>>> I have a weird problem when using pyspark with yarn. I started ipython
>>> as follows:
>>>
>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
>>> --num-executors 4 --executor-memory 4G
>>>
>>> When I create a notebook, I can see workers being created, and indeed
>>> I see the Spark UI running on my client machine on port 4040.
>>> I have the following simple script:
>>> """
>>> import pyspark
>>> from dateutil import parser  # needed for parser.parse below
>>>
>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>>> oneday = data.map(lambda line: line.split(",")).\
>>>              map(lambda f: (f[0], float(f[1]))).\
>>>              filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
>>>              map(lambda t: (parser.parse(t[0]), t[1]))
>>> oneday.take(1)
>>> """
>>>
>>> When I execute this, I see that my client machine (where ipython is
>>> launched) is reading all the data from HDFS and producing the result
>>> of take(1), rather than my worker nodes...
>>>
>>> When I do "data.count()", things blow up altogether, but I do see
>>> something like this in the error message:
>>> """
>>> Error from python worker:
>>>   /usr/bin/python: No module named pyspark
>>> """
>>>
>>> Am I supposed to install pyspark on every worker node?
>>>
>>> Thanks.
>>>
>>> -Simon
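[Editor's note: the split/filter/parse chain in the quoted script can be sanity-checked locally on plain Python lists, without a cluster. A minimal sketch follows; the sample lines are invented to match the "date,value" shape the script implies, and datetime.strptime stands in for the dateutil parser the script assumes.]

```python
from datetime import datetime

# Invented sample lines in the "date,value" shape the script implies.
lines = ["2013-01-01,3.5", "2013-01-02,9.9", "2012-12-31,1.0"]

# Same pipeline as the RDD version: split each line, keep only records
# whose date string falls on 2013-01-01, then parse the date.
fields = [line.split(",") for line in lines]
oneday = [
    (datetime.strptime(f[0], "%Y-%m-%d"), float(f[1]))
    for f in fields
    if "2013-01-01" <= f[0] < "2013-01-02"
]
print(oneday)  # only the 2013-01-01 record survives the filter
```

Note the string comparison on f[0] works here only because ISO-formatted dates sort lexicographically, which is the same property the original filter relies on.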