Could you try exporting the SPARK_HOME variable? For example:

export SPARK_HOME=/home/hadoop/spark
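For reference, a minimal conf/zeppelin-env.sh that combines the settings mentioned in this thread might look like the sketch below (the paths assume the EMR layout discussed here, and the port is only an example):

    # conf/zeppelin-env.sh -- minimal sketch, assuming the EMR paths from this thread
    export MASTER=yarn-client
    export HADOOP_CONF_DIR=/home/hadoop/conf
    export SPARK_HOME=/home/hadoop/spark
    export ZEPPELIN_PORT=9090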
On Mon, Jul 13, 2015 at 10:55 AM Chad Timmins <ctimm...@trulia.com> wrote:

> Hi,
>
> Thanks for the quick reply. I have set up my configuration for Zeppelin
> exactly as you did, except for the port number. I had to add the following
> to zeppelin/conf/zeppelin-env.sh:
>
> export PYTHONPATH=$PYTHONPATH:/home/hadoop/spark/python
>
> Before the interpreter patch, my PYTHONPATH env variable looked like:
>
> :/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/pyspark.zip:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip
>
> AFTER the patch, PYTHONPATH looked like:
>
> :/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip
>
> I am still getting the same errors even after I removed the extra Python
> path from conf/zeppelin-env.sh (a quick sys.path check is sketched after
> this thread). Currently my Zeppelin environment looks like:
>
> export MASTER=yarn-client
> export HADOOP_CONF_DIR=/home/hadoop/conf
> export ZEPPELIN_SPARK_USEHIVECONTEXT=false
> export ZEPPELIN_JAVA_OPTS=""
>
> From: moon soo Lee <m...@apache.org>
> Reply-To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
> Date: Sunday, July 12, 2015 at 8:59 AM
> To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
> Subject: Re: PySpark RDD method errors
>
> Hi,
>
> Thanks for sharing the problem. I tried this on AWS EMR and could run all
> of the code without errors.
>
> I set:
>
> export HADOOP_CONF_DIR=/home/hadoop/conf
> export SPARK_HOME=/home/hadoop/spark
> export ZEPPELIN_PORT=9090
>
> with 'yarn-client' for the master property.
> export SPARK_HOME does not work correctly without this patch:
> https://github.com/apache/incubator-zeppelin/pull/151
>
> Could you share your configuration of Zeppelin with the EMR cluster?
>
> Thanks,
> moon
>
> On Thu, Jul 9, 2015 at 3:35 PM Chad Timmins <ctimm...@trulia.com> wrote:
>
>> Hi,
>>
>> When I run the filter() method on an RDD object and then try to print
>> its results using collect(), I get a Py4JJavaError. It is not only
>> filter() but other methods as well that cause similar errors, and I
>> cannot figure out what is causing them. PySpark from the command line
>> works fine, but it does not work in the Zeppelin notebook. My setup is
>> an AWS EMR instance running Spark 1.3.1 on Amazon's Hadoop 2.4.0. I have
>> included a snippet of code and the resulting error below. Thank you, and
>> please let me know if you need any additional information.
>>
>> %pyspark
>>
>> nums = [1,2,3,4,5,6]
>>
>> rdd_nums = sc.parallelize(nums)
>> rdd_sq = rdd_nums.map(lambda x: pow(x,2))
>> rdd_cube = rdd_nums.map(lambda x: pow(x,3))
>> rdd_odd = rdd_nums.filter(lambda x: x%2 == 1)
>>
>> print "nums: %s" % rdd_nums.collect()
>> print "squares: %s" % rdd_sq.collect()
>> print "cubes: %s" % rdd_cube.collect()
>> print "odds: %s" % rdd_odd.collect()
>>
>> Py4JJavaError: An error occurred while calling
>> z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
>> org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task 0 in stage 107.0 failed 4 times, most recent failure: Lost task 0.3
>> in stage 107.0 (TID 263, ip-10-204-134-29.us-west-2.compute.internal):
>> org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
>> …
>> …
>> Caused by: java.io.EOFException at
>> java.io.DataInputStream.readInt(DataInputStream.java:392) at
>> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108) ...
>> 10 more
>>
>> If I instead set rdd_odd = rdd_nums.filter(lambda x: x%2), I don't get
>> an error.
>>
>> Thanks,
>>
>> Chad Timmins
>>
>> Software Engineer Intern at Trulia
>> B.S. Electrical Engineering, UC Davis 2015
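On the PYTHONPATH question above, one way to see what the %pyspark interpreter actually has on its Python path is to print sys.path from a notebook paragraph. This is a hypothetical diagnostic, not something from the original thread:

    %pyspark
    import sys
    # The interpreter's effective Python path should include
    # .../spark/python plus the pyspark and py4j zips; a missing
    # pyspark.zip would be consistent with the post-patch PYTHONPATH
    # quoted earlier in this thread.
    for p in sys.path:
        print p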
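On the final observation: filter() only tests truthiness, so in plain Python the two predicates select exactly the same elements, which suggests the difference in behavior comes from the worker environment rather than from the lambda itself. A minimal sketch:

    %pyspark
    nums = [1, 2, 3, 4, 5, 6]
    # x % 2 evaluates to 1 (truthy) for odd x and 0 (falsy) for even x,
    # so both predicates keep exactly the odd numbers.
    print [x for x in nums if x % 2]         # [1, 3, 5]
    print [x for x in nums if x % 2 == 1]    # [1, 3, 5]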