At Cloudera we recommend bundling your application separately from the Spark libraries. The two biggest reasons are:

* No need to modify your application jar when upgrading or applying a patch.
* When running on YARN, the Spark jar can be cached as a YARN local resource, meaning it doesn't need to be transferred every time.
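In practice this means marking Spark as a "provided" dependency so that `sbt assembly` leaves it out of the application jar. Here is a minimal sketch of such a build file; the project name, Scala version, and Spark version are assumptions, so pick whatever matches your cluster:

    // build.sbt -- Spark is "provided": on the compile classpath, but left
    // out of the assembled application jar, since the cluster supplies it.
    name := "my-spark-app"       // hypothetical project name
    scalaVersion := "2.10.4"     // Spark 1.x is built against Scala 2.10
    libraryDependencies +=
      "org.apache.spark" %% "spark-core" % "1.0.1" % "provided"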
On Sun, Jul 27, 2014 at 8:52 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Mayur,
>
> I don't know if I exactly understand the context of what you are asking,
> but let me just mention issues I had with deploying.
>
> * As my application is a streaming application, it doesn't read any files
> from disk, so I have no Hadoop/HDFS in place and no need for it, either.
> There should be no dependency on Hadoop or HDFS, since you can perfectly
> well run Spark applications without them.
> * I use Mesos, and so far I have always made the downloaded Spark
> distribution accessible to all machines (e.g., via HTTP) and then added my
> application code by uploading a jar built with `sbt assembly`. As the
> Spark code itself must not be contained in that jar file, I had to add
> '% "provided"' in the sbt file, which in turn prevented me from running
> the application locally from IntelliJ IDEA (it would not find the
> libraries marked as "provided"); I always had to use `sbt run` instead.
> (A common sbt workaround is sketched after this message.)
> * When using Mesos, on the Spark slaves the Spark jar is loaded before
> the application jar, so the log4j file from the Spark jar is used instead
> of my custom one (this is different when running locally); I had to edit
> that file inside the Spark distribution jar to customize the logging of
> my Spark nodes. (An alternative is sketched after this message.)
>
> I wonder if the two latter problems would vanish if the Spark libraries
> were bundled together with the application. (That would be your approach
> #1, I guess.)
>
> Tobias
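Regarding the "provided" scope breaking local runs: a common sbt workaround (a sketch assuming sbt 0.13-era syntax; newer sbt spells this `run := Defaults.runTask(...).evaluated`) is to re-wire `run` onto the Compile classpath, which, unlike the Runtime classpath, still contains "provided" dependencies:

    // build.sbt -- let `sbt run` (and an IDE runner that delegates to sbt)
    // see dependencies marked "provided" by using the Compile classpath.
    run in Compile <<= Defaults.runTask(
      fullClasspath in Compile,
      mainClass in (Compile, run),
      runner in (Compile, run)
    )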
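Regarding the log4j file: instead of patching the Spark distribution jar, one alternative is to ship a custom log4j.properties to the executors and point log4j at the fetched copy. The sketch below uses `spark.files` and `spark.executor.extraJavaOptions`, both of which exist in Spark 1.x; whether the fetched file reliably takes precedence over the one inside the Spark jar on Mesos is an assumption worth testing:

    import org.apache.spark.SparkConf

    // Sketch: distribute log4j.properties to each executor's working
    // directory and tell log4j to load it from there, rather than editing
    // the copy baked into the Spark assembly jar.
    val conf = new SparkConf()
      .setAppName("my-streaming-app")   // hypothetical app name
      .set("spark.files", "/local/path/to/log4j.properties")
      .set("spark.executor.extraJavaOptions",
           "-Dlog4j.configuration=log4j.properties")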