Hi Chad Timmins, I tried again with exactly the same procedure you describe and cannot reproduce the error (applied https://github.com/apache/incubator-zeppelin/pull/151).
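
In the meantime, a quick environment check from the notebook might help us
compare setups: could you run something like the following in a %pyspark
paragraph and share the output? It is just standard os/sys introspection,
nothing Zeppelin-specific:

%pyspark
import os, sys

# Show what the Zeppelin pyspark interpreter process actually sees,
# to compare against the command-line pyspark that works for you.
print "SPARK_HOME:  %s" % os.environ.get("SPARK_HOME")
print "PYTHONPATH:  %s" % os.environ.get("PYTHONPATH")
print "python:      %s" % sys.version.split()[0]
print "spark paths: %s" % [p for p in sys.path if "spark" in p]

In particular, I am curious whether pyspark.zip still shows up on sys.path
there, given the PYTHONPATH difference you noticed after the patch.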
Any other idea about reproducing the error?

Thanks,
moon

On Fri, Jul 17, 2015 at 9:57 PM moon soo Lee <m...@apache.org> wrote:

> I still cannot reproduce the problem.
> Let me try a little more and update here.
>
> Thanks,
> moon
>
> On Mon, Jul 13, 2015 at 2:50 PM Chad Timmins <ctimm...@trulia.com> wrote:
>
>> I already export SPARK_HOME in my .bashrc, and I confirmed it is
>> /home/hadoop/spark in the Zeppelin notebook.
>>
>> I configure Zeppelin using the following script (almost identical to a
>> gist another user posted):
>>
>> # Install Zeppelin
>> git clone https://github.com/apache/incubator-zeppelin.git /home/hadoop/zeppelin
>> cd /home/hadoop/zeppelin
>> mvn clean package -Pspark-1.3 -Dhadoop.version=2.4.0 -Phadoop-2.4 -Pyarn -DskipTests
>>
>> # Configure Zeppelin
>> SPARK_DEFAULTS=/home/hadoop/spark/conf/spark-defaults.conf
>>
>> declare -a ZEPPELIN_JAVA_OPTS
>> if [ -f $SPARK_DEFAULTS ]; then
>>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>>     $(grep spark.executor.instances $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>>     $(grep spark.executor.cores $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>>     $(grep spark.executor.memory $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>>     $(grep spark.default.parallelism $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>> fi
>> echo "${ZEPPELIN_JAVA_OPTS[@]}"
>>
>> cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh
>> cat <<EOF >> conf/zeppelin-env.sh
>> export MASTER=yarn-client
>> export HADOOP_CONF_DIR=$HADOOP_CONF_DIR
>> export ZEPPELIN_SPARK_USEHIVECONTEXT=false
>> export ZEPPELIN_JAVA_OPTS="${ZEPPELIN_JAVA_OPTS[@]}"
>> EOF
>>
>> Thank you so much for helping.
>>
>> -Chad
>>
>> From: moon soo Lee <m...@apache.org>
>> Reply-To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
>> Date: Monday, July 13, 2015 at 12:25 PM
>> To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
>> Subject: Re: PySpark RDD method errors
>>
>> Could you try exporting the SPARK_HOME variable? Like:
>>
>> export SPARK_HOME=/home/hadoop/spark
>>
>> On Mon, Jul 13, 2015 at 10:55 AM Chad Timmins <ctimm...@trulia.com> wrote:
>>
>>> Hi,
>>>
>>> Thanks for the quick reply. I have set up my configuration for
>>> Zeppelin exactly as you did, except for the port number.
>>> I had to add to zeppelin/conf/zeppelin-env.sh:
>>>
>>> export PYTHONPATH=$PYTHONPATH:/home/hadoop/spark/python
>>>
>>> Before the interpreter patch, my PYTHONPATH env variable looked like:
>>>
>>> :/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/pyspark.zip:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip
>>>
>>> AFTER the patch, PYTHONPATH looked like:
>>>
>>> :/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip
>>>
>>> I am still getting the same errors even after I removed the extra
>>> Python path from conf/zeppelin-env.sh.
>>> Currently my Zeppelin environment looks like:
>>>
>>> export MASTER=yarn-client
>>> export HADOOP_CONF_DIR=/home/hadoop/conf
>>> export ZEPPELIN_SPARK_USEHIVECONTEXT=false
>>> export ZEPPELIN_JAVA_OPTS=""
>>>
>>> From: moon soo Lee <m...@apache.org>
>>> Reply-To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
>>> Date: Sunday, July 12, 2015 at 8:59 AM
>>> To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
>>> Subject: Re: PySpark RDD method errors
>>>
>>> Hi,
>>>
>>> Thanks for sharing the problem.
>>> I tried this on AWS EMR and could make all the code work without error.
>>>
>>> I've set:
>>>
>>> export HADOOP_CONF_DIR=/home/hadoop/conf
>>> export SPARK_HOME=/home/hadoop/spark
>>> export ZEPPELIN_PORT=9090
>>>
>>> with 'yarn-client' for the master property.
>>> export SPARK_HOME does not work correctly without this patch:
>>> https://github.com/apache/incubator-zeppelin/pull/151
>>>
>>> Could you share your configuration of Zeppelin with the EMR cluster?
>>>
>>> Thanks,
>>> moon
>>>
>>> On Thu, Jul 9, 2015 at 3:35 PM Chad Timmins <ctimm...@trulia.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> When I run the filter() method on an RDD object and then try to print
>>>> its results using collect(), I get a Py4JJavaError. It is not only filter()
>>>> but also other methods that cause similar errors, and I cannot figure out
>>>> what is causing this. PySpark from the command line works fine, but it
>>>> does not work in the Zeppelin notebook. My setup is an AWS EMR instance
>>>> running Spark 1.3.1 on Amazon’s Hadoop 2.4.0. I have included a snippet
>>>> of code and the error below. Thank you, and please let me know if you
>>>> need any additional information.
>>>>
>>>> %pyspark
>>>>
>>>> nums = [1,2,3,4,5,6]
>>>>
>>>> rdd_nums = sc.parallelize(nums)
>>>> rdd_sq = rdd_nums.map(lambda x: pow(x,2))
>>>> rdd_cube = rdd_nums.map(lambda x: pow(x,3))
>>>> rdd_odd = rdd_nums.filter(lambda x: x%2 == 1)
>>>>
>>>> print "nums: %s" % rdd_nums.collect()
>>>> print "squares: %s" % rdd_sq.collect()
>>>> print "cubes: %s" % rdd_cube.collect()
>>>> print "odds: %s" % rdd_odd.collect()
>>>>
>>>> Py4JJavaError: An error occurred while calling
>>>> z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
>>>> org.apache.spark.SparkException: Job aborted due to stage failure:
>>>> Task 0 in stage 107.0 failed 4 times, most recent failure: Lost task 0.3
>>>> in stage 107.0 (TID 263, ip-10-204-134-29.us-west-2.compute.internal):
>>>> org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
>>>> …
>>>> …
>>>> Caused by: java.io.EOFException
>>>> at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>> at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108)
>>>> ... 10 more
>>>>
>>>> If I instead set rdd_odd = rdd_nums.filter(lambda x: x%2), I don't
>>>> get an error.
>>>>
>>>> Thanks,
>>>>
>>>> Chad Timmins
>>>>
>>>> Software Engineer Intern at Trulia
>>>> B.S. Electrical Engineering, UC Davis 2015
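
P.S. Regarding that last observation: filter(lambda x: x % 2 == 1) and
filter(lambda x: x % 2) should select exactly the same elements, because
filter() keeps an element whenever the predicate is truthy, and x % 2 is
1 (truthy) for odd numbers and 0 (falsy) for even ones. A quick check in
plain Python, with no Spark involved:

# Both predicates evaluate the same way on every element of the sample data.
pairs = [(x % 2 == 1, bool(x % 2)) for x in [1, 2, 3, 4, 5, 6]]
print "predicates agree on every element: %s" % all(a == b for a, b in pairs)

So if one form crashes and the other doesn't, that points away from the
lambda's logic and toward something in the Python worker environment, which
would be consistent with the "Python worker exited unexpectedly" and the
EOFException in your trace.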