Hi,

Thanks for the quick reply. I have set up my configuration for Zeppelin exactly as you did, except for the port number. I had to add the following to zeppelin/conf/zeppelin-env.sh:

export PYTHONPATH=$PYTHONPATH:/home/hadoop/spark/python
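For reference, a notebook cell along these lines (an illustrative sketch, not verbatim from my notebook) shows what the pyspark interpreter actually has on its path:

%pyspark
import os, sys
# Show the PYTHONPATH the interpreter process inherited from zeppelin-env.sh
print "PYTHONPATH: %s" % os.environ.get("PYTHONPATH", "")
# Show the entries Python actually searches for pyspark/py4j
for p in sys.path:
    print "  %s" % p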
Before the interpreter patch, my PYTHONPATH environment variable looked like:

:/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/pyspark.zip:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip

After the patch, PYTHONPATH looked like:

:/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip

I am still getting the same errors even after I removed the extra Python path from conf/zeppelin-env.sh. Currently my Zeppelin environment looks like:

export MASTER=yarn-client
export HADOOP_CONF_DIR=/home/hadoop/conf
export ZEPPELIN_SPARK_USEHIVECONTEXT=false
export ZEPPELIN_JAVA_OPTS=""

From: moon soo Lee <m...@apache.org>
Reply-To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
Date: Sunday, July 12, 2015 at 8:59 AM
To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
Subject: Re: PySpark RDD method errors

Hi,

Thanks for sharing the problem. I have tried it with AWS EMR and I could make all the code work without error. I've set

export HADOOP_CONF_DIR=/home/hadoop/conf
export SPARK_HOME=/home/hadoop/spark
export ZEPPELIN_PORT=9090

with 'yarn-client' for the master property. Note that export SPARK_HOME does not work correctly without this patch:

https://github.com/apache/incubator-zeppelin/pull/151

Could you share your configuration of Zeppelin with the EMR cluster?

Thanks,
moon

On Thu, Jul 9, 2015 at 3:35 PM Chad Timmins <ctimm...@trulia.com> wrote:

Hi,

When I run the filter() method on an RDD object and then try to print its results using collect(), I get a Py4JJavaError. It is not only filter() but other methods as well that cause similar errors, and I cannot figure out what is causing this. PySpark from the command line works fine, but it does not work in the Zeppelin notebook. My setup is an AWS EMR instance running Spark 1.3.1 on Amazon's Hadoop 2.4.0. I have included a snippet of the code and the error below. Thank you, and please let me know if you need any additional information.

%pyspark
nums = [1,2,3,4,5,6]
rdd_nums = sc.parallelize(nums)
rdd_sq = rdd_nums.map(lambda x: pow(x,2))
rdd_cube = rdd_nums.map(lambda x: pow(x,3))
rdd_odd = rdd_nums.filter(lambda x: x%2 == 1)
print "nums: %s" % rdd_nums.collect()
print "squares: %s" % rdd_sq.collect()
print "cubes: %s" % rdd_cube.collect()
print "odds: %s" % rdd_odd.collect()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 107.0 failed 4 times, most recent failure: Lost task 0.3 in stage 107.0 (TID 263, ip-10-204-134-29.us-west-2.compute.internal): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
...
...
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108)
    ... 10 more

If I instead set rdd_odd = rdd_nums.filter(lambda x: x%2), I don't get an error.

Thanks,
Chad Timmins
Software Engineer Intern at Trulia
B.S. Electrical Engineering, UC Davis 2015
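P.S. Boiled down, the odd behavior from the quoted mail above is just the following (an illustrative consolidation of the quoted snippet, assuming the same SparkContext sc; not new code from the thread):

%pyspark
rdd = sc.parallelize([1,2,3,4,5,6])
# Reported to work: the lambda returns an int (0 or 1), which filter() treats as falsy/truthy
print "odds: %s" % rdd.filter(lambda x: x % 2).collect()
# Reported to fail with the Py4JJavaError / EOFException above, despite selecting the same elements
print "odds: %s" % rdd.filter(lambda x: x % 2 == 1).collect()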