Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Andrew Or
>> I asked several people, no one seems to believe that we can do this:
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark
That is because people usually don't package python files into their jars. For pyspark, however, this will work as long as the jar can be opened and its contents read.
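A minimal sketch of that check (the jar path is a placeholder; substitute your own assembly jar):
"""
import zipfile

# A jar is just a zip archive, so Python can open it directly.
jar = "/path/to/assembly/jar"

zf = zipfile.ZipFile(jar)
# This read succeeds only if python files were actually packaged into the jar.
data = zf.read("pyspark/__init__.py")
print("pyspark/__init__.py: %d bytes" % len(data))
"""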

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Patrick Wendell
Yeah, we need to add a build warning to the Maven build. Would you be able to try compiling Spark with Java 6? It would be good to narrow down whether you are hitting this problem or something else. On Mon, Jun 2, 2014 at 1:15 PM, Xu (Simon) Chen wrote: > Nope... didn't try java 6. The standard instal

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
Nope... didn't try java 6. The standard installation guide didn't say anything about java 7 and suggested doing "-DskipTests" for the build: http://spark.apache.org/docs/latest/building-with-maven.html So, I didn't see the warning message... On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell wrote:

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Patrick Wendell
Are you building Spark with Java 6 or Java 7? Java 6 uses the standard Zip format and Java 7 uses Zip64 (the extended format). I think we've tried to add some build warnings if Java 7 is used, for this reason: https://github.com/apache/spark/blob/master/make-distribution.sh#L102 Any luck if you use JDK 6 to compile?
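A quick way to test whether a given assembly jar has crossed into Zip64 territory (jar path is a placeholder):
"""
import zipfile

jar = "/path/to/assembly/jar"

# zipfile can read Zip64 archives, so counting entries works either way.
names = zipfile.ZipFile(jar).namelist()
print("%d entries" % len(names))

# More than 65535 entries forces the Zip64 format, which the zipimport
# machinery in Python 2.x cannot read.
if len(names) > 65535:
    print("Zip64: 'import pyspark' from this jar will likely fail")
"""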

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
OK, my colleague found this: https://mail.python.org/pipermail/python-list/2014-May/671353.html And my jar file has 70011 files. Fantastic... On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen wrote:
> I asked several people, no one seems to believe that we can do this:
> $ PYTHONPATH=/path/to/a
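The failure described in that python-list thread can be reproduced directly with zipimport (jar path is a placeholder; the exact error text may vary):
"""
import zipimport

jar = "/path/to/assembly/jar"

try:
    importer = zipimport.zipimporter(jar)
    print(importer.find_module("pyspark"))
except zipimport.ZipImportError as e:
    # On a Zip64 archive (70011 entries here), CPython 2.x rejects the
    # jar outright, even though the zipfile module can read it fine.
    print("zipimport failed: %s" % e)
"""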

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
I asked several people, no one seems to believe that we can do this:
$ PYTHONPATH=/path/to/assembly/jar python
>>> import pyspark
The following pull request did mention something about generating a zip file for all python-related modules: https://www.mail-archive.com/reviews@spark.apache.org/msg0
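The same test can be run from inside an interpreter instead of via PYTHONPATH (jar path is a placeholder):
"""
import sys

# Equivalent to PYTHONPATH=/path/to/assembly/jar, done at runtime.
sys.path.insert(0, "/path/to/assembly/jar")

import pyspark  # succeeds only if zipimport can read the jar
print(pyspark.__file__)
"""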

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
So, I did specify SPARK_JAR in my pyspark program. I also checked the workers, and it seems that the jar file is distributed and included in the classpath correctly. I think the problem is likely at step 3. I built my jar file with maven, like this: "mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1
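One way to check which zip variant the build actually produced is to look for the Zip64 end-of-central-directory record near the end of the jar; a rough sketch (jar name taken from this thread):
"""
import os

jar = os.path.expanduser("~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar")

with open(jar, "rb") as f:
    f.seek(0, os.SEEK_END)
    size = f.tell()
    # The end-of-central-directory records live in the tail of the archive.
    f.seek(max(0, size - 65536))
    tail = f.read()

# b"PK\x06\x06" marks the Zip64 end-of-central-directory record; if it is
# present, the jar was written in Zip64 format and Python 2's zipimport
# cannot read it.
print("Zip64 record present: %s" % (b"PK\x06\x06" in tail))
"""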

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
1) Yes, sc.parallelize(range(10)).count() has the same error.
2) The files seem to be correct.
3) I have trouble at this step: "ImportError: No module named pyspark". But I seem to have files in the jar file:
"""
$ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
>>> import pyspark
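The apparent contradiction in 3) is explained by zipfile handling Zip64 while zipimport does not; listing the pyspark entries shows the files really are there:
"""
import os
import zipfile

jar = os.path.expanduser("~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar")

names = zipfile.ZipFile(jar).namelist()
pyspark_files = [n for n in names if n.startswith("pyspark/")]
print("%d entries total, %d under pyspark/" % (len(names), len(pyspark_files)))
"""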

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Andrew Or
Hi Simon, You shouldn't have to install pyspark on every worker node. In YARN mode, pyspark is packaged into your assembly jar and shipped to your executors automatically. This seems like a more general problem. There are a few things to try:
1) Run a simple pyspark shell with yarn-client, and do sc.parallelize(range(10)).count()
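A minimal version of step 1, run inside a pyspark shell started with --master yarn-client (where sc already exists):
"""
# A job with no external input isolates the executor-side python setup:
# an ImportError here means pyspark isn't reaching the executors.
print(sc.parallelize(range(10)).count())  # expect 10
"""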

pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows:
IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G
When I create a notebook, I can see workers being created and indeed I see the Spark UI running on my client