Hi Chad Timmins, I tried again with exactly the same procedure you describe and cannot reproduce the error (applied https://github.com/apache/incubator-zeppelin/pull/151).
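
In the meantime, a quick environment check from the notebook might help us
compare setups: could you run something like the following in a %pyspark
paragraph and share the output? It is just standard os/sys introspection,
nothing Zeppelin-specific:

%pyspark
import os, sys

# Show what the Zeppelin pyspark interpreter process actually sees,
# to compare against the command-line pyspark that works for you.
print "SPARK_HOME:  %s" % os.environ.get("SPARK_HOME")
print "PYTHONPATH:  %s" % os.environ.get("PYTHONPATH")
print "python:      %s" % sys.version.split()[0]
print "spark paths: %s" % [p for p in sys.path if "spark" in p]

In particular, I am curious whether pyspark.zip still shows up on sys.path
there, given the PYTHONPATH difference you noticed after the patch.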
Any other idea about reproducing the error?

Thanks,
moon

On Fri, Jul 17, 2015 at 9:57 PM moon soo Lee <m...@apache.org> wrote:

> I still cannot reproduce the problem.
> Let me try a little more and update here.
>
> Thanks,
> moon
>
> On Mon, Jul 13, 2015 at 2:50 PM Chad Timmins <ctimm...@trulia.com> wrote:
>
>> I already export SPARK_HOME in my .bashrc, and I confirmed it is
>> /home/hadoop/spark in the Zeppelin notebook.
>>
>> I configure Zeppelin using the following script (almost identical to a
>> gist another user posted):
>>
>> # Install Zeppelin
>> git clone https://github.com/apache/incubator-zeppelin.git /home/hadoop/zeppelin
>> cd /home/hadoop/zeppelin
>> mvn clean package -Pspark-1.3 -Dhadoop.version=2.4.0 -Phadoop-2.4 -Pyarn -DskipTests
>>
>> # Configure Zeppelin
>> SPARK_DEFAULTS=/home/hadoop/spark/conf/spark-defaults.conf
>>
>> declare -a ZEPPELIN_JAVA_OPTS
>> if [ -f $SPARK_DEFAULTS ]; then
>>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>>     $(grep spark.executor.instances $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>>     $(grep spark.executor.cores $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>>     $(grep spark.executor.memory $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>>     $(grep spark.default.parallelism $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>> fi
>> echo "${ZEPPELIN_JAVA_OPTS[@]}"
>>
>> cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh
>> cat <<EOF >> conf/zeppelin-env.sh
>> export MASTER=yarn-client
>> export HADOOP_CONF_DIR=$HADOOP_CONF_DIR
>> export ZEPPELIN_SPARK_USEHIVECONTEXT=false
>> export ZEPPELIN_JAVA_OPTS="${ZEPPELIN_JAVA_OPTS[@]}"
>> EOF
>>
>> Thank you so much for helping.
>>
>> -Chad
>>
>> From: moon soo Lee <m...@apache.org>
>> Reply-To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
>> Date: Monday, July 13, 2015 at 12:25 PM
>> To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
>> Subject: Re: PySpark RDD method errors
>>
>> Could you try exporting the SPARK_HOME variable? Like:
>>
>> export SPARK_HOME=/home/hadoop/spark
>>
>> On Mon, Jul 13, 2015 at 10:55 AM Chad Timmins <ctimm...@trulia.com> wrote:
>>
>>> Hi,
>>>
>>> Thanks for the quick reply. I have set up my configuration for
>>> Zeppelin exactly as you did, except for the port number.
>>> I had to add to zeppelin/conf/zeppelin-env.sh:
>>>
>>> export PYTHONPATH=$PYTHONPATH:/home/hadoop/spark/python
>>>
>>> Before the interpreter patch, my PYTHONPATH env variable looked like:
>>>
>>> :/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/pyspark.zip:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip
>>>
>>> AFTER the patch, PYTHONPATH looked like:
>>>
>>> :/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip
>>>
>>> I am still getting the same errors even after I removed the extra
>>> Python path from conf/zeppelin-env.sh.
>>> Currently my Zeppelin environment looks like:
>>>
>>> export MASTER=yarn-client
>>> export HADOOP_CONF_DIR=/home/hadoop/conf
>>> export ZEPPELIN_SPARK_USEHIVECONTEXT=false
>>> export ZEPPELIN_JAVA_OPTS=""
>>>
>>> From: moon soo Lee <m...@apache.org>
>>> Reply-To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
>>> Date: Sunday, July 12, 2015 at 8:59 AM
>>> To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
>>> Subject: Re: PySpark RDD method errors
>>>
>>> Hi,
>>>
>>> Thanks for sharing the problem.
>>> I tried this on AWS EMR and could make all the code work without error.
>>>
>>> I've set:
>>>
>>> export HADOOP_CONF_DIR=/home/hadoop/conf
>>> export SPARK_HOME=/home/hadoop/spark
>>> export ZEPPELIN_PORT=9090
>>>
>>> with 'yarn-client' for the master property.
>>> export SPARK_HOME does not work correctly without this patch:
>>> https://github.com/apache/incubator-zeppelin/pull/151
>>>
>>> Could you share your configuration of Zeppelin with the EMR cluster?
>>>
>>> Thanks,
>>> moon
>>>
>>> On Thu, Jul 9, 2015 at 3:35 PM Chad Timmins <ctimm...@trulia.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> When I run the filter() method on an RDD object and then try to print
>>>> its results using collect(), I get a Py4JJavaError. It is not only filter()
>>>> but also other methods that cause similar errors, and I cannot figure out
>>>> what is causing this. PySpark from the command line works fine, but it
>>>> does not work in the Zeppelin notebook. My setup is an AWS EMR instance
>>>> running Spark 1.3.1 on Amazon’s Hadoop 2.4.0. I have included a snippet
>>>> of code and the error below. Thank you, and please let me know if you
>>>> need any additional information.
>>>>
>>>> %pyspark
>>>>
>>>> nums = [1,2,3,4,5,6]
>>>>
>>>> rdd_nums = sc.parallelize(nums)
>>>> rdd_sq = rdd_nums.map(lambda x: pow(x,2))
>>>> rdd_cube = rdd_nums.map(lambda x: pow(x,3))
>>>> rdd_odd = rdd_nums.filter(lambda x: x%2 == 1)
>>>>
>>>> print "nums: %s" % rdd_nums.collect()
>>>> print "squares: %s" % rdd_sq.collect()
>>>> print "cubes: %s" % rdd_cube.collect()
>>>> print "odds: %s" % rdd_odd.collect()
>>>>
>>>> Py4JJavaError: An error occurred while calling
>>>> z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
>>>> org.apache.spark.SparkException: Job aborted due to stage failure:
>>>> Task 0 in stage 107.0 failed 4 times, most recent failure: Lost task 0.3
>>>> in stage 107.0 (TID 263, ip-10-204-134-29.us-west-2.compute.internal):
>>>> org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
>>>> …
>>>> …
>>>> Caused by: java.io.EOFException
>>>> at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>> at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108)
>>>> ... 10 more
>>>>
>>>> If I instead set rdd_odd = rdd_nums.filter(lambda x: x%2), I don't
>>>> get an error.
>>>>
>>>> Thanks,
>>>>
>>>> Chad Timmins
>>>>
>>>> Software Engineer Intern at Trulia
>>>> B.S. Electrical Engineering, UC Davis 2015
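
P.S. Regarding that last observation: filter(lambda x: x % 2 == 1) and
filter(lambda x: x % 2) should select exactly the same elements, because
filter() keeps an element whenever the predicate is truthy, and x % 2 is
1 (truthy) for odd numbers and 0 (falsy) for even ones. A quick check in
plain Python, with no Spark involved:

# Both predicates evaluate the same way on every element of the sample data.
pairs = [(x % 2 == 1, bool(x % 2)) for x in [1, 2, 3, 4, 5, 6]]
print "predicates agree on every element: %s" % all(a == b for a, b in pairs)

So if one form crashes and the other doesn't, that points away from the
lambda's logic and toward something in the Python worker environment, which
would be consistent with the "Python worker exited unexpectedly" and the
EOFException in your trace.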