1) Yes, sc.parallelize(range(10)).count() fails with the same error.

2) The files seem to be correct.

3) This is the step where I run into trouble: "ImportError: No module named
pyspark". The pyspark files do seem to be inside the assembly jar, though:

"""
$ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
>>> import pyspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark

$ jar -tf ~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar pyspark
pyspark/
pyspark/rddsampler.py
pyspark/broadcast.py
pyspark/serializers.py
pyspark/java_gateway.py
pyspark/resultiterable.py
pyspark/accumulators.py
pyspark/sql.py
pyspark/__init__.py
pyspark/daemon.py
pyspark/context.py
pyspark/cloudpickle.py
pyspark/join.py
pyspark/tests.py
pyspark/files.py
pyspark/conf.py
pyspark/rdd.py
pyspark/storagelevel.py
pyspark/statcounter.py
pyspark/shell.py
pyspark/worker.py
"""

4) All my nodes should be running Java 7, so this is probably not related.

5) I'll try that in a bit.

Any ideas on 3)? Thanks.
-Simon
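[On 3), one thing worth ruling out is whether Python's zipimport, the
machinery that actually handles .jar/.zip entries on PYTHONPATH, can open
the jar at all. "jar -tf" succeeding is no guarantee, because zipimport is
stricter than the jar tool; in particular, Python 2's zipimport cannot read
ZIP64 archives, which large assemblies produced by Java 7 tooling may end
up using. A minimal diagnostic sketch, using the jar path from above:

"""
# Diagnostic sketch: compare zipfile (which understands ZIP64) with
# zipimport (which is what PYTHONPATH uses for jar/zip entries).
import os
import zipfile
import zipimport

jar = os.path.expanduser("~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar")

# zipfile should be able to list the entries without trouble.
entries = zipfile.ZipFile(jar).namelist()
print("%d entries (more than 65535 entries forces ZIP64)" % len(entries))

# If this raises ZipImportError, "import pyspark" from PYTHONPATH will
# fail even though the files are visibly inside the jar.
try:
    importer = zipimport.zipimporter(jar)
    print("zipimport OK; pyspark found: %s"
          % (importer.find_module("pyspark") is not None))
except zipimport.ZipImportError as e:
    print("zipimport cannot read the jar: %s" % e)
"""

If zipfile can list the jar but zipimport rejects it, that lines up with
the Java 6 rebuild suggested in the quoted thread below.]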
On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <and...@databricks.com> wrote:

> Hi Simon,
>
> You shouldn't have to install pyspark on every worker node. In YARN mode,
> pyspark is packaged into your assembly jar and shipped to your executors
> automatically. This seems like a more general problem. There are a few
> things to try:
>
> 1) Run a simple pyspark shell with yarn-client, and do
> "sc.parallelize(range(10)).count()" to see if you get the same error.
>
> 2) If so, check whether your assembly jar is compiled correctly. Run
>
> $ jar -tf <path/to/assembly/jar> pyspark
> $ jar -tf <path/to/assembly/jar> py4j
>
> to see if the files are there. For Py4j, you need both the Python files
> and the Java class files.
>
> 3) If the files are there, try running a simple Python shell (not the
> pyspark shell) with the assembly jar on the PYTHONPATH:
>
> $ PYTHONPATH=/path/to/assembly/jar python
> >>> import pyspark
>
> 4) If that works, try it on every worker node. If it doesn't work, there
> is probably something wrong with your jar.
>
> There is a known issue for PySpark on YARN: jars built with Java 7 cannot
> be properly opened by Java 6. I would either verify that the JAVA_HOME set
> on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV),
> or simply build your jar with Java 6:
>
> $ cd /path/to/spark/home
> $ JAVA_HOME=/path/to/java6 ./make-distribution.sh --with-yarn --hadoop 2.3.0-cdh5.0.0
>
> 5) You can check out
> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
> which has more detailed information about how to debug an application
> running on YARN. In my experience, the steps outlined there are quite
> useful.
>
> Let me know if you get it working (or not).
>
> Cheers,
> Andrew
>
>
> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:
>
>> Hi folks,
>>
>> I have a weird problem when using pyspark with yarn. I started ipython
>> as follows:
>>
>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G
>>
>> When I create a notebook, I can see workers being created, and indeed I
>> see the Spark UI running on my client machine on port 4040.
>>
>> I have the following simple script:
>>
>> """
>> import pyspark
>> from dateutil import parser  # assuming dateutil, for parser.parse below
>>
>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>> oneday = data.map(lambda line: line.split(",")) \
>>              .map(lambda f: (f[0], float(f[1]))) \
>>              .filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02") \
>>              .map(lambda t: (parser.parse(t[0]), t[1]))
>> oneday.take(1)
>> """
>>
>> When I execute this, it is my client machine (where ipython is launched)
>> that reads all the data from HDFS and produces the result of take(1),
>> rather than my worker nodes...
>>
>> When I do "data.count()", things blow up altogether. But in the error
>> message I do see something like this:
>>
>> """
>> Error from python worker:
>>     /usr/bin/python: No module named pyspark
>> """
>>
>> Am I supposed to install pyspark on every worker node?
>>
>> Thanks.
>> -Simon
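[On the observation in the original mail that the client seems to read all
the data: take(1) always ships its result back to the driver, so
driver-side activity alone doesn't prove the tasks ran locally. Once the
import problem is fixed, a quick way to see where tasks actually execute
is to have each task report its hostname. A minimal sketch, assuming the
usual SparkContext "sc" provided by the pyspark shell:

"""
# Sketch: report the hostname each task runs on. In a healthy yarn-client
# setup these should be the YARN worker nodes, not the client machine.
import socket

hosts = (sc.parallelize(range(100), 10)
           .map(lambda _: socket.gethostname())
           .distinct()
           .collect())
print(hosts)
"""

If the only hostname printed is the client's, the job really is running
locally; if the worker nodes show up, the work is landing where it should
and only the results are flowing back through the driver.]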