Hi,

Thanks for sharing the problem. I have tried with AWS EMR and I could make all the code work without error.
I've set

export HADOOP_CONF_DIR=/home/hadoop/conf
export SPARK_HOME=/home/hadoop/spark
export ZEPPELIN_PORT=9090

with 'yarn-client' for the master property. Note that export SPARK_HOME does not work correctly without this patch: https://github.com/apache/incubator-zeppelin/pull/151

Could you share your configuration of Zeppelin with your EMR cluster?

Thanks,
moon

On Thu, Jul 9, 2015 at 3:35 PM Chad Timmins <ctimm...@trulia.com> wrote:
> Hi,
>
> When I run the filter() method on an RDD object and then try to print
> its results using collect(), I get a Py4JJavaError. It is not only filter
> but other methods that cause similar errors, and I cannot figure out what is
> causing this. PySpark from the command line works fine, but it does not
> work in the Zeppelin Notebook. My setup is on an AWS EMR instance running
> Spark 1.3.1 on Amazon's Hadoop 2.4.0. I have included a snippet of code
> (in blue) and the error (in red). Thank you, and please let me know if you
> need any additional information.
>
> %pyspark
>
> nums = [1,2,3,4,5,6]
>
> rdd_nums = sc.parallelize(nums)
> rdd_sq = rdd_nums.map(lambda x: pow(x,2))
> rdd_cube = rdd_nums.map(lambda x: pow(x,3))
> rdd_odd = rdd_nums.filter(lambda x: x%2 == 1)
>
> print "nums: %s" % rdd_nums.collect()
> print "squares: %s" % rdd_sq.collect()
> print "cubes: %s" % rdd_cube.collect()
> print "odds: %s" % rdd_odd.collect()
>
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 107.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 107.0 (TID 263, ip-10-204-134-29.us-west-2.compute.internal):
> org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
> …
> …
> Caused by: java.io.EOFException at
> java.io.DataInputStream.readInt(DataInputStream.java:392) at
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108) ...
> 10 more
>
> If I instead set rdd_odd = rdd_nums.filter(lambda x: x%2) I don't
> get an error.
>
> Thanks,
>
> Chad Timmins
> Software Engineer Intern at Trulia
> B.S. Electrical Engineering, UC Davis 2015
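For anyone following along, the environment settings described at the top of this reply would typically go into Zeppelin's conf/zeppelin-env.sh. A minimal sketch below — the paths are the EMR defaults mentioned in this thread, and the MASTER line is one common way to set the master (the same value can instead be set as the 'master' property in Zeppelin's interpreter settings); verify everything against your own cluster:

```shell
# conf/zeppelin-env.sh -- sketch based on the settings in this thread;
# paths assume EMR defaults and may differ on your cluster.

# Point Zeppelin at the cluster's Hadoop config so yarn-client mode
# can locate the ResourceManager and HDFS.
export HADOOP_CONF_DIR=/home/hadoop/conf

# Use the Spark build that ships on the EMR node rather than Zeppelin's
# embedded Spark (needs the patch linked in this thread to take effect).
export SPARK_HOME=/home/hadoop/spark

# Run the Zeppelin web UI on port 9090 instead of the default 8080.
export ZEPPELIN_PORT=9090

# Run Spark in yarn-client mode (assumed env-var spelling; the master
# can also be set via the interpreter's 'master' property).
export MASTER=yarn-client
```

After editing the file, restart Zeppelin (bin/zeppelin-daemon.sh restart) so the new environment is picked up by the interpreter processes.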