Could you try exporting the SPARK_HOME variable? For example:

export SPARK_HOME=/home/hadoop/spark
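For reference, a minimal conf/zeppelin-env.sh that combines the settings mentioned in this thread might look like the sketch below (the paths assume the EMR layout discussed here, and the port is only an example):

    # conf/zeppelin-env.sh -- minimal sketch, assuming the EMR paths from this thread
    export MASTER=yarn-client
    export HADOOP_CONF_DIR=/home/hadoop/conf
    export SPARK_HOME=/home/hadoop/spark
    export ZEPPELIN_PORT=9090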
On Mon, Jul 13, 2015 at 10:55 AM Chad Timmins <ctimm...@trulia.com> wrote:

> Hi,
>
> Thanks for the quick reply. I have set up my configuration for Zeppelin
> exactly as you did, except for the port number. I had to add the following
> to zeppelin/conf/zeppelin-env.sh:
>
> export PYTHONPATH=$PYTHONPATH:/home/hadoop/spark/python
>
> Before the interpreter patch, my PYTHONPATH env variable looked like:
>
> :/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/pyspark.zip:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip
>
> AFTER the patch, PYTHONPATH looked like:
>
> :/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip
>
> I am still getting the same errors even after I removed the extra Python
> path from conf/zeppelin-env.sh (a quick sys.path check is sketched after
> this thread). Currently my Zeppelin environment looks like:
>
> export MASTER=yarn-client
> export HADOOP_CONF_DIR=/home/hadoop/conf
> export ZEPPELIN_SPARK_USEHIVECONTEXT=false
> export ZEPPELIN_JAVA_OPTS=""
>
> From: moon soo Lee <m...@apache.org>
> Reply-To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
> Date: Sunday, July 12, 2015 at 8:59 AM
> To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
> Subject: Re: PySpark RDD method errors
>
> Hi,
>
> Thanks for sharing the problem. I tried this on AWS EMR and could run all
> of the code without errors.
>
> I set:
>
> export HADOOP_CONF_DIR=/home/hadoop/conf
> export SPARK_HOME=/home/hadoop/spark
> export ZEPPELIN_PORT=9090
>
> with 'yarn-client' for the master property.
> export SPARK_HOME does not work correctly without this patch:
> https://github.com/apache/incubator-zeppelin/pull/151
>
> Could you share your configuration of Zeppelin with the EMR cluster?
>
> Thanks,
> moon
>
> On Thu, Jul 9, 2015 at 3:35 PM Chad Timmins <ctimm...@trulia.com> wrote:
>
>> Hi,
>>
>> When I run the filter() method on an RDD object and then try to print
>> its results using collect(), I get a Py4JJavaError. It is not only
>> filter() but other methods as well that cause similar errors, and I
>> cannot figure out what is causing them. PySpark from the command line
>> works fine, but it does not work in the Zeppelin notebook. My setup is
>> an AWS EMR instance running Spark 1.3.1 on Amazon's Hadoop 2.4.0. I have
>> included a snippet of code and the resulting error below. Thank you, and
>> please let me know if you need any additional information.
>>
>> %pyspark
>>
>> nums = [1,2,3,4,5,6]
>>
>> rdd_nums = sc.parallelize(nums)
>> rdd_sq = rdd_nums.map(lambda x: pow(x,2))
>> rdd_cube = rdd_nums.map(lambda x: pow(x,3))
>> rdd_odd = rdd_nums.filter(lambda x: x%2 == 1)
>>
>> print "nums: %s" % rdd_nums.collect()
>> print "squares: %s" % rdd_sq.collect()
>> print "cubes: %s" % rdd_cube.collect()
>> print "odds: %s" % rdd_odd.collect()
>>
>> Py4JJavaError: An error occurred while calling
>> z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
>> org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task 0 in stage 107.0 failed 4 times, most recent failure: Lost task 0.3
>> in stage 107.0 (TID 263, ip-10-204-134-29.us-west-2.compute.internal):
>> org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
>> …
>> …
>> Caused by: java.io.EOFException at
>> java.io.DataInputStream.readInt(DataInputStream.java:392) at
>> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108) ...
>> 10 more
>>
>> If I instead set rdd_odd = rdd_nums.filter(lambda x: x%2), I don't get
>> an error.
>>
>> Thanks,
>>
>> Chad Timmins
>>
>> Software Engineer Intern at Trulia
>> B.S. Electrical Engineering, UC Davis 2015
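On the PYTHONPATH question above, one way to see what the %pyspark interpreter actually has on its Python path is to print sys.path from a notebook paragraph. This is a hypothetical diagnostic, not something from the original thread:

    %pyspark
    import sys
    # The interpreter's effective Python path should include
    # .../spark/python plus the pyspark and py4j zips; a missing
    # pyspark.zip would be consistent with the post-patch PYTHONPATH
    # quoted earlier in this thread.
    for p in sys.path:
        print p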
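On the final observation: filter() only tests truthiness, so in plain Python the two predicates select exactly the same elements, which suggests the difference in behavior comes from the worker environment rather than from the lambda itself. A minimal sketch:

    %pyspark
    nums = [1, 2, 3, 4, 5, 6]
    # x % 2 evaluates to 1 (truthy) for odd x and 0 (falsy) for even x,
    # so both predicates keep exactly the odd numbers.
    print [x for x in nums if x % 2]         # [1, 3, 5]
    print [x for x in nums if x % 2 == 1]    # [1, 3, 5]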