Hi,

Thanks for sharing the problem. I have tried with AWS EMR and I could make all the code work without error.
I've set

export HADOOP_CONF_DIR=/home/hadoop/conf
export SPARK_HOME=/home/hadoop/spark
export ZEPPELIN_PORT=9090

with 'yarn-client' for the master property. Note that export SPARK_HOME does not work correctly without this patch: https://github.com/apache/incubator-zeppelin/pull/151

Could you share your configuration of Zeppelin with your EMR cluster?

Thanks,
moon

On Thu, Jul 9, 2015 at 3:35 PM Chad Timmins <ctimm...@trulia.com> wrote:
> Hi,
>
> When I run the filter() method on an RDD object and then try to print
> its results using collect(), I get a Py4JJavaError. It is not only filter
> but other methods that cause similar errors, and I cannot figure out what is
> causing this. PySpark from the command line works fine, but it does not
> work in the Zeppelin Notebook. My setup is on an AWS EMR instance running
> Spark 1.3.1 on Amazon's Hadoop 2.4.0. I have included a snippet of code
> (in blue) and the error (in red). Thank you, and please let me know if you
> need any additional information.
>
> %pyspark
>
> nums = [1,2,3,4,5,6]
>
> rdd_nums = sc.parallelize(nums)
> rdd_sq = rdd_nums.map(lambda x: pow(x,2))
> rdd_cube = rdd_nums.map(lambda x: pow(x,3))
> rdd_odd = rdd_nums.filter(lambda x: x%2 == 1)
>
> print "nums: %s" % rdd_nums.collect()
> print "squares: %s" % rdd_sq.collect()
> print "cubes: %s" % rdd_cube.collect()
> print "odds: %s" % rdd_odd.collect()
>
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 107.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 107.0 (TID 263, ip-10-204-134-29.us-west-2.compute.internal):
> org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
> …
> …
> Caused by: java.io.EOFException at
> java.io.DataInputStream.readInt(DataInputStream.java:392) at
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108) ...
> 10 more
>
> If I instead set rdd_odd = rdd_nums.filter(lambda x: x%2) I don't
> get an error.
>
> Thanks,
>
> Chad Timmins
> Software Engineer Intern at Trulia
> B.S. Electrical Engineering, UC Davis 2015
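For anyone following along, the environment settings described at the top of this reply would typically go into Zeppelin's conf/zeppelin-env.sh. A minimal sketch below — the paths are the EMR defaults mentioned in this thread, and the MASTER line is one common way to set the master (the same value can instead be set as the 'master' property in Zeppelin's interpreter settings); verify everything against your own cluster:

```shell
# conf/zeppelin-env.sh -- sketch based on the settings in this thread;
# paths assume EMR defaults and may differ on your cluster.

# Point Zeppelin at the cluster's Hadoop config so yarn-client mode
# can locate the ResourceManager and HDFS.
export HADOOP_CONF_DIR=/home/hadoop/conf

# Use the Spark build that ships on the EMR node rather than Zeppelin's
# embedded Spark (needs the patch linked in this thread to take effect).
export SPARK_HOME=/home/hadoop/spark

# Run the Zeppelin web UI on port 9090 instead of the default 8080.
export ZEPPELIN_PORT=9090

# Run Spark in yarn-client mode (assumed env-var spelling; the master
# can also be set via the interpreter's 'master' property).
export MASTER=yarn-client
```

After editing the file, restart Zeppelin (bin/zeppelin-daemon.sh restart) so the new environment is picked up by the interpreter processes.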