Hi Shivani,

Adding JARs to the classpath (e.g. via the "-cp" option) is needed to run your _local_ Java application, whatever it is. To deliver them to _other machines_ for execution, you need to add them to the SparkContext, and you can do that in two different ways:

1. Add them right from your code (your suggested sparkContext.setJars(...)).
2. Use spark-submit and pass the JARs on the command line.

Note that both options are easier if you assemble your code and all of its dependencies into a single "fat" JAR instead of manually listing every library you need.
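For example, option 1 looks roughly like the following. This is a minimal sketch, assuming a fat JAR built with sbt-assembly; the app name, master URL, and JAR path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyApp")                       // placeholder name
      .setMaster("spark://master-host:7077")     // placeholder master URL
      // Ship the assembled JAR to the executors from code.
      .setJars(Seq("target/scala-2.10/myapp-assembly-1.0.jar"))
    val sc = new SparkContext(conf)
    // ... your job logic ...
    sc.stop()
  }
}

With option 2 you skip setJars() entirely and hand the same JAR to spark-submit on the command line (e.g. spark-submit --class MyApp myapp-assembly-1.0.jar).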
On Sat, Jun 21, 2014 at 1:47 AM, Shivani Rao <raoshiv...@gmail.com> wrote:

> Hello Shrikar,
>
> Thanks for your email. I have been using the same workflow as you, but my
> question was related to the creation of the SparkContext:
>
> If I am specifying JARs in "java -cp <jar-paths>" and adding them to my
> build.sbt, do I need to additionally add them in my code while creating
> the SparkContext (sparkContext.setJars(...))?
>
> Thanks,
> Shivani
>
> On Fri, Jun 20, 2014 at 11:03 AM, Shrikar archak <shrika...@gmail.com> wrote:
>
>> Hi Shivani,
>>
>> I use sbt-assembly to create a fat JAR:
>> https://github.com/sbt/sbt-assembly
>>
>> An example of the sbt build file is below.
>>
>> import AssemblyKeys._ // put this at the top of the file
>>
>> assemblySettings
>>
>> mainClass in assembly := Some("FifaSparkStreaming")
>>
>> name := "FifaSparkStreaming"
>>
>> version := "1.0"
>>
>> scalaVersion := "2.10.4"
>>
>> libraryDependencies ++= Seq(
>>   "org.apache.spark" %% "spark-core" % "1.0.0" % "provided",
>>   "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided",
>>   ("org.apache.spark" %% "spark-streaming-twitter" % "1.0.0")
>>     .exclude("org.eclipse.jetty.orbit", "javax.transaction")
>>     .exclude("org.eclipse.jetty.orbit", "javax.servlet")
>>     .exclude("org.eclipse.jetty.orbit", "javax.mail.glassfish")
>>     .exclude("org.eclipse.jetty.orbit", "javax.activation")
>>     .exclude("com.esotericsoftware.minlog", "minlog"),
>>   ("net.debasishg" % "redisclient_2.10" % "2.12")
>>     .exclude("com.typesafe.akka", "akka-actor_2.10"))
>>
>> mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
>>   {
>>     case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
>>     case PathList("org", "apache", xs @ _*) => MergeStrategy.first
>>     case "application.conf" => MergeStrategy.concat
>>     case "unwanted.txt" => MergeStrategy.discard
>>     case x => old(x)
>>   }
>> }
>>
>> resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
>>
>> And I run it as shown below.
>>
>> LOCALLY:
>> 1) sbt 'run AP1z4IYraYm5fqWhITWArY53x
>>    Cyyz3Zr67tVK46G8dus5tSbc83KQOdtMDgYoQ5WLQwH0mTWzB6
>>    115254720-OfJ4yFsUU6C6vBkEOMDlBlkIgslPleFjPwNcxHjN
>>    Qd76y2izncM7fGGYqU1VXYTxg1eseNuzcdZKm2QJyK8d1 fifa fifa2014'
>>
>> If you want to submit on the cluster:
>>
>> CLUSTER:
>> 2) spark-submit --class FifaSparkStreaming --master
>>    "spark://server-8-144:7077" --driver-memory 2048 --deploy-mode cluster
>>    FifaSparkStreaming-assembly-1.0.jar AP1z4IYraYm5fqWhITWArY53x
>>    Cyyz3Zr67tVK46G8dus5tSbc83KQOdtMDgYoQ5WLQwH0mTWzB6
>>    115254720-OfJ4yFsUU6C6vBkEOMDlBlkIgslPleFjPwNcxHjN
>>    Qd76y2izncM7fGGYqU1VXYTxg1eseNuzcdZKm2QJyK8d1 fifa fifa2014
>>
>> Hope this helps.
>>
>> Thanks,
>> Shrikar
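A note on Shrikar's build file: the `import AssemblyKeys._` line, assemblySettings, and the `assembly` task all come from the sbt-assembly plugin, which has to be registered in project/plugins.sbt. A minimal sketch, assuming sbt 0.13; the plugin version shown is only indicative of that era:

// project/plugins.sbt
// Registers sbt-assembly so `import AssemblyKeys._`, assemblySettings,
// and the `sbt assembly` task are available to build.sbt.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

Running `sbt assembly` then produces the single fat JAR (here target/scala-2.10/FifaSparkStreaming-assembly-1.0.jar) used in the cluster command above.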
>> On Fri, Jun 20, 2014 at 9:16 AM, Shivani Rao <raoshiv...@gmail.com> wrote:
>>
>>> Hello Michael,
>>>
>>> I have a quick question for you. Can you clarify the statement "build
>>> fat JARs and build dist-style TAR.GZ packages with launch scripts, JARs
>>> and everything needed to run a Job"? Can you give an example?
>>>
>>> I am using sbt-assembly as well to create a fat JAR, and I supply the
>>> Spark and Hadoop locations on the classpath. Inside the main() function
>>> where the SparkContext is created, I use SparkContext.jarOfClass(this).toList
>>> to add the fat JAR to my SparkContext. However, I seem to be running into
>>> issues with this approach. I was wondering if you had any inputs, Michael.
>>>
>>> Thanks,
>>> Shivani
>>>
>>> On Thu, Jun 19, 2014 at 10:57 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>>>
>>>> We use Maven for building our code and then invoke spark-submit through
>>>> the exec plugin, passing in our parameters. Works well for us.
>>>>
>>>> Best Regards,
>>>> Sonal
>>>> Nube Technologies <http://www.nubetech.co>
>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>
>>>> On Fri, Jun 20, 2014 at 3:26 AM, Michael Cutler <mich...@tumra.com> wrote:
>>>>
>>>>> P.S. Last but not least, we use sbt-assembly to build fat JARs and to
>>>>> build dist-style TAR.GZ packages with launch scripts, JARs, and
>>>>> everything needed to run a Job. These are automatically built from
>>>>> source by our Jenkins and stored in HDFS. Our Chronos/Marathon jobs
>>>>> fetch the latest release TAR.GZ directly from HDFS, unpack it, and
>>>>> launch the appropriate script.
>>>>>
>>>>> It makes for a much cleaner development/testing/deployment cycle to
>>>>> package everything required in one go instead of relying on
>>>>> cluster-specific classpath additions or any add-jars functionality.
>>>>>
>>>>> On 19 June 2014 22:53, Michael Cutler <mich...@tumra.com> wrote:
>>>>>
>>>>>> When you start seriously using Spark in production, there are
>>>>>> basically two things everyone eventually needs:
>>>>>>
>>>>>> 1. Scheduled Jobs - recurring hourly/daily/weekly jobs.
>>>>>> 2. Always-On Jobs - jobs that require monitoring, restarting, etc.
>>>>>>
>>>>>> There are lots of ways to implement these requirements, everything
>>>>>> from crontab through to workflow managers like Oozie.
>>>>>>
>>>>>> We opted for the following stack:
>>>>>>
>>>>>> - Apache Mesos <http://mesosphere.io/> (mesosphere.io distribution)
>>>>>> - Marathon <https://github.com/mesosphere/marathon> - an init/control
>>>>>>   system for starting, stopping, and maintaining always-on applications.
>>>>>> - Chronos <http://airbnb.github.io/chronos/> - a general-purpose
>>>>>>   scheduler for Mesos; supports job dependency graphs.
>>>>>> - Spark Job Server <https://github.com/ooyala/spark-jobserver> -
>>>>>>   primarily for its ability to reuse shared contexts across multiple
>>>>>>   jobs (a rough sketch of such a job follows this message).
>>>>>>
>>>>>> The majority of our jobs are periodic (batch) jobs run through
>>>>>> spark-submit, and we have several always-on Spark Streaming jobs (also
>>>>>> run through spark-submit).
>>>>>>
>>>>>> We always use "client mode" with spark-submit because the Mesos
>>>>>> cluster has direct connectivity to the Spark cluster, and it means all
>>>>>> the Spark stdout/stderr is externalised into the Mesos logs, which
>>>>>> helps with diagnosing problems.
>>>>>>
>>>>>> I thoroughly recommend you explore using Mesos/Marathon/Chronos to run
>>>>>> Spark and manage your jobs. The Mesosphere tutorials are awesome, and
>>>>>> you can be up and running in literally minutes. The web UIs for both
>>>>>> make it easy to get started without talking to REST APIs, etc.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Michael
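To make the Spark Job Server item concrete: a job there is a class implementing the server's SparkJob trait, so the server can run it inside a long-lived, shared SparkContext. A rough sketch against the ooyala spark-jobserver API of that era; the job name and config key are made up:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

object WordCountJob extends SparkJob {
  // Called by the server before runJob, to reject bad requests early.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    if (config.hasPath("input.path")) SparkJobValid
    else SparkJobInvalid("config must define input.path")

  // Runs inside a context the server owns, so successive jobs reuse it.
  override def runJob(sc: SparkContext, config: Config): Any =
    sc.textFile(config.getString("input.path"))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .take(10)
}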
>>>>>> On 19 June 2014 19:44, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>>>>>
>>>>>>> I use SBT, create an assembly, and then add the assembly JARs when I
>>>>>>> create my SparkContext. I run the main class with something like
>>>>>>> "java -cp ... MyDriver".
>>>>>>>
>>>>>>> That said, as of Spark 1.0 the preferred way to run Spark
>>>>>>> applications is via spark-submit:
>>>>>>> http://spark.apache.org/docs/latest/submitting-applications.html
>>>>>>>
>>>>>>> On Thu, Jun 19, 2014 at 11:36 AM, ldmtwo <ldm...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I want to ask this, not because I can't read endless documentation
>>>>>>>> and several tutorials, but because there seem to be many ways of
>>>>>>>> doing things and I keep having issues. How do you run _your_ Spark
>>>>>>>> app?
>>>>>>>>
>>>>>>>> I had it working when I was only using YARN + Hadoop 1 (Cloudera);
>>>>>>>> then I had to get Spark and Shark working, ended up upgrading
>>>>>>>> everything, and dropped CDH support. Anyway, this is what I used,
>>>>>>>> with master=yarn-client and APP_JAR being Scala code compiled with
>>>>>>>> Maven:
>>>>>>>>
>>>>>>>> java -cp $CLASSPATH -Dspark.jars=$APP_JAR -Dspark.master=$MASTER $CLASSNAME $ARGS
>>>>>>>>
>>>>>>>> Do you use this, or something else? I could never figure out this
>>>>>>>> method:
>>>>>>>>
>>>>>>>> SPARK_HOME/bin/spark jar APP_JAR ARGS
>>>>>>>>
>>>>>>>> For example:
>>>>>>>>
>>>>>>>> bin/spark-class jar /usr/lib/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 10 10
>>>>>>>>
>>>>>>>> Do you use SBT or Maven to compile? Or something else?
>>>>>>>>
>>>>>>>> ** It seems that I can't get subscribed to the mailing list; I tried
>>>>>>>> both my work and personal email.
>
> --
> Software Engineer
> Analytics Engineering Team @ Box
> Mountain View, CA
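Tying the thread together: the pattern Evan and Shivani both describe (adding the assembly JAR when the context is created, via SparkContext.jarOfClass) looks roughly like the sketch below. Names are placeholders, and .toSeq papers over the fact that jarOfClass has returned Seq or Option depending on the Spark version:

import org.apache.spark.{SparkConf, SparkContext}

object MyDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyDriver")                             // placeholder
      .setMaster(sys.env.getOrElse("MASTER", "local[*]")) // placeholder
      // jarOfClass locates the JAR that contains this class (the fat JAR
      // itself when launched via `java -cp myapp-assembly-1.0.jar MyDriver`).
      .setJars(SparkContext.jarOfClass(this.getClass).toSeq)
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}

As Evan notes, from Spark 1.0 onwards spark-submit handles this JAR distribution for you, which is the main reason it became the preferred launcher.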