Re: pyspark on yarn hdp hortonworks

2014-09-05 Thread Greg Hill
I'm running into a problem getting this working as well.  I have spark-submit 
and spark-shell working fine, but pyspark in interactive mode can't seem to 
find the lzo jar:

java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found

That class is in /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar, which is listed in 
my SPARK_CLASSPATH environment variable, but pyspark doesn't seem to pick 
that up.
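
For anyone who wants to rule out the obvious first, here is a rough sketch of a local check. It only confirms that some jar listed in SPARK_CLASSPATH really contains the class; it says nothing about whether pyspark passes that classpath on to the JVM. The helper name jars_containing is made up for illustration and is not part of Spark.

```python
import os
import zipfile


def jars_containing(class_name, classpath):
    """Return the jar files on `classpath` that contain `class_name`.

    Hypothetical helper for illustration only; it inspects jars on the
    local machine and does not talk to Spark or the JVM.
    """
    # A class a.b.C is stored inside a jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for path in classpath.split(os.pathsep):
        # Skip directories, wildcard entries and files that do not exist
        if not path.endswith(".jar") or not os.path.isfile(path):
            continue
        with zipfile.ZipFile(path) as jar:
            if entry in jar.namelist():
                hits.append(path)
    return hits


if __name__ == "__main__":
    # Reads the same environment variable mentioned above
    print(jars_containing("com.hadoop.compression.lzo.LzoCodec",
                          os.environ.get("SPARK_CLASSPATH", "")))
```

If this prints an empty list, the jar is missing or does not hold the class; if it prints the lzo jar, the classpath itself is fine and the problem lies in how it reaches the pyspark JVM.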

Any ideas?  I can't find much in the way of docs on getting the environment 
right for pyspark.

Greg

Re: pyspark on yarn hdp hortonworks

2014-09-03 Thread Andrew Or
Hi Oleg,

There isn't much you need to do to set up a YARN cluster to run PySpark. You
need to make sure all machines have Python installed, and... that's about
it. Your assembly jar will be shipped to all containers along with all the
pyspark and py4j files needed. One caveat, however, is that the jar needs
to be built with Maven, and not on a Red Hat-based OS:

http://spark.apache.org/docs/latest/building-with-maven.html#building-for-pyspark-on-yarn
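
As a rough way to confirm that an assembly jar actually carries the pyspark and py4j files, something like the sketch below could be run against it. The function name python_packages_in_jar is hypothetical, and it assumes the Python files sit under top-level pyspark/ and py4j/ directories inside the jar, which may not hold for every build.

```python
import zipfile


def python_packages_in_jar(jar_path, packages=("pyspark", "py4j")):
    """Map each package name to True if the jar has entries under it.

    Hypothetical helper for illustration; it assumes the packages are
    stored as top-level directories inside the jar.
    """
    with zipfile.ZipFile(jar_path) as jar:
        names = jar.namelist()
    return {pkg: any(name.startswith(pkg + "/") for name in names)
            for pkg in packages}
```

Calling it on the path to your own assembly jar returns a dict; a False for either package suggests the jar was built without the Python files.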

In addition, it should be built with Java 6 because of a known issue with
jars that are built with Java 7 and include Python files
(https://issues.apache.org/jira/browse/SPARK-1718). Lastly, if you have
trouble getting it to work, you can follow the steps I have listed in a
different thread to figure out what's wrong:

http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3ccamjob8mr1+ias-sldz_rfrke_na2uubnmhrac4nukqyqnun...@mail.gmail.com%3e

Let me know if you can get it working,
-Andrew





2014-09-03 5:03 GMT-07:00 Oleg Ruchovets oruchov...@gmail.com:

 Hi all.
 I have been trying to run PySpark on YARN for a couple of days now:

 http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/

 I posted the exception in previous posts. It looks like I didn't configure
 things correctly.
 I googled quite a lot and I can't find the steps needed to configure
 PySpark to run on YARN.

 Can you please share the steps (critical points) that need to be configured
 to use PySpark on YARN (Hortonworks distribution):
   Environment variables
   Classpath
   Copying jars to all machines
   Other configuration

 Thanks
 Oleg.