Hi Gerhard, I just stumbled upon some documentation on EMR - link below. It seems there is a -u option to add jars in S3 to your classpath; have you tried that?
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html

Best Regards,
Sonal
Founder, Nube Technologies <http://www.nubetech.co>
Reifier at Strata Hadoop World <https://www.youtube.com/watch?v=eD3LkpPQIgM>
Reifier at Spark Summit 2015 <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
<http://in.linkedin.com/in/sonalgoyal>

On Wed, Mar 9, 2016 at 11:50 AM, Wang, Daoyuan <[email protected]> wrote:

> Hi Gerhard,
>
> How does EMR set its conf for Spark? I think if you set SPARK_CLASSPATH
> and spark.driver.extraClassPath, Spark would ignore SPARK_CLASSPATH.
>
> I think you can do this by reading the configuration from SparkConf, then
> adding your custom settings to the corresponding key, and using the
> updated SparkConf to instantiate your SparkContext.
>
> Thanks,
> Daoyuan
>
> *From:* Gerhard Fiedler [mailto:[email protected]]
> *Sent:* Wednesday, March 09, 2016 5:41 AM
> *To:* [email protected]
> *Subject:* How to add a custom jar file to the Spark driver?
>
> We’re running Spark 1.6.0 on EMR, in YARN client mode. We run Python code,
> but we want to add a custom jar file to the driver.
>
> When running on a local one-node standalone cluster, we just use
> spark.driver.extraClassPath and everything works:
>
> spark-submit --conf spark.driver.extraClassPath=/path/to/our/custom/jar/* our-python-script.py
>
> But on EMR, this value is set to something that is needed to make their
> installation of Spark work. Setting it to point to our custom jar
> overwrites the original setting rather than adding to it, and breaks Spark.
>
> Our current workaround is to capture whatever EMR sets
> spark.driver.extraClassPath to once, then use that path and add our jar
> file to it. Of course this breaks when EMR changes the path in their
> cluster settings, and we wouldn’t necessarily notice that easily. This is
> how it looks:
>
> spark-submit --conf spark.driver.extraClassPath=/path/to/our/custom/jar/*:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/* our-python-script.py
>
> We prefer not to do this…
>
> We tried the spark-submit argument --jars, but it didn’t seem to do
> anything. Like this:
>
> spark-submit --jars /path/to/our/custom/jar/file.jar our-python-script.py
>
> We also tried to set CLASSPATH, but it didn’t seem to have any impact:
>
> export CLASSPATH=/path/to/our/custom/jar/*
> spark-submit our-python-script.py
>
> When using SPARK_CLASSPATH, we got warnings that it is deprecated, and the
> messages also seemed to imply that it affects the same configuration that
> is set by spark.driver.extraClassPath.
>
> So, my question is: Is there a clean way to add a custom jar file to a
> Spark configuration?
>
> Thanks,
> Gerhard
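For what it's worth, a minimal, untested sketch of what Daoyuan describes (read the value the cluster already set, append your jar, and build the SparkContext from the updated SparkConf) could look like the code below. The jar path is just a placeholder, and in YARN client mode the driver JVM may already be running when the script executes, so the driver class path change may not take effect there; the spark-submit --conf route might still be needed for the driver itself.

    # Sketch of Daoyuan's suggestion (untested): append to whatever the
    # cluster already configured instead of overwriting it.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    # Read the class path the cluster (e.g. EMR) already set, if any.
    existing = conf.get("spark.driver.extraClassPath", "")
    custom = "/path/to/our/custom/jar/*"   # placeholder path
    conf.set("spark.driver.extraClassPath",
             custom if not existing else existing + ":" + custom)
    sc = SparkContext(conf=conf)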
