Re: Instantiating/starting Spark jobs programmatically

2015-04-23 Thread Anshul Singhle
Hi firemonk9,

What you're doing looks interesting. Can you share some more details?
Are you running the same Spark context for each job, or are you running a
separate Spark context for each job?
Does your system need sharing of RDDs across multiple jobs? If yes, how do
you implement that?
Also, what prompted you to run YARN instead of standalone? Does this give
some performance benefit? Have you evaluated YARN vs Mesos?
Also, have you looked at spark-jobserver by Ooyala? It makes doing some of
the stuff I mentioned easier. IIRC it also works with YARN; it definitely
works with Mesos. Here's the link:
https://github.com/spark-jobserver/spark-jobserver

Thanks
Anshul
On 23 Apr 2015 20:32, Dean Wampler deanwamp...@gmail.com wrote:

 I strongly recommend spawning a new process for the Spark jobs. Much
 cleaner separation. Your driver program won't be clobbered if the Spark job
 dies, etc. It can even watch for failures and restart.

 In the Scala standard library, the sys.process package has classes for
 constructing and interoperating with external processes. Perhaps Java has
 something similar these days?

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Tue, Apr 21, 2015 at 2:15 PM, Steve Loughran ste...@hortonworks.com
 wrote:


  On 21 Apr 2015, at 17:34, Richard Marscher rmarsc...@localytics.com
 wrote:

 - There are System.exit calls built into Spark as of now that could kill
 your running JVM. We have shadowed some of the most offensive bits within
 our own application to work around this. You'd likely want to do that or to
 do your own Spark fork. For example, if the SparkContext can't connect to
 your cluster master node when it is created, it will System.exit.


 People can block errant System.exit calls by running under a
 SecurityManager. It's less than ideal (and there's a small performance hit),
 but possible.





Re: Instantiating/starting Spark jobs programmatically

2015-04-23 Thread Dean Wampler
I strongly recommend spawning a new process for the Spark jobs. Much
cleaner separation. Your driver program won't be clobbered if the Spark job
dies, etc. It can even watch for failures and restart.

In the Scala standard library, the sys.process package has classes for
constructing and interoperating with external processes. Perhaps Java has
something similar these days?
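
To make that concrete, here's a minimal sketch in Scala using sys.process; the
spark-submit path, master, class and JAR names are placeholders, not anything
from this thread. (On the Java side, java.lang.ProcessBuilder covers the same
ground.)

import scala.sys.process._

object LaunchSparkJob {
  def main(args: Array[String]): Unit = {
    // Placeholder command line -- point it at your own spark-submit and JAR.
    val submit = Seq(
      "/opt/spark/bin/spark-submit",
      "--master", "yarn-client",
      "--class", "com.example.MyMiniJob",
      "/opt/jobs/my-mini-job.jar")

    // Runs spark-submit as a child process and blocks until it exits.
    val exitCode = submit.!
    if (exitCode != 0) {
      // The calling JVM survives the failure, so it can log, alert, or retry.
      println(s"spark-submit exited with code $exitCode")
    }
  }
}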

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Tue, Apr 21, 2015 at 2:15 PM, Steve Loughran ste...@hortonworks.com
wrote:


  On 21 Apr 2015, at 17:34, Richard Marscher rmarsc...@localytics.com
 wrote:

 - There are System.exit calls built into Spark as of now that could kill
 your running JVM. We have shadowed some of the most offensive bits within
 our own application to work around this. You'd likely want to do that or to
 do your own Spark fork. For example, if the SparkContext can't connect to
 your cluster master node when it is created, it will System.exit.


 People can block errant System.exit calls by running under a
 SecurityManager. It's less than ideal (and there's a small performance hit),
 but possible.



Re: Instantiating/starting Spark jobs programmatically

2015-04-21 Thread Steve Loughran

On 21 Apr 2015, at 17:34, Richard Marscher rmarsc...@localytics.com wrote:

- There are System.exit calls built into Spark as of now that could kill your 
running JVM. We have shadowed some of the most offensive bits within our own 
application to work around this. You'd likely want to do that or to do your own 
Spark fork. For example, if the SparkContext can't connect to your cluster 
master node when it is created, it will System.exit.

People can block errant System.exit calls by running under a SecurityManager.
It's less than ideal (and there's a small performance hit), but possible.
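
For anyone who wants to try that, here's a bare-bones sketch of the idea using
the standard java.lang.SecurityManager hooks (this is not something Spark
ships; the class name is made up):

import java.security.Permission

// Turns System.exit into a catchable exception while allowing everything else.
class NoExitSecurityManager extends SecurityManager {
  override def checkExit(status: Int): Unit =
    throw new SecurityException(s"Blocked System.exit($status)")

  // No-op permission checks so the rest of the application behaves as usual.
  override def checkPermission(perm: Permission): Unit = ()
  override def checkPermission(perm: Permission, context: AnyRef): Unit = ()
}

// Install it before creating the SparkContext, e.g.
//   System.setSecurityManager(new NoExitSecurityManager)
// and catch SecurityException around calls that might otherwise exit the JVM.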


Re: Instantiating/starting Spark jobs programmatically

2015-04-21 Thread Richard Marscher
Could you possibly describe what you are trying to learn how to do in more
detail? Some basics of submitting programmatically:

- Create a SparkContext instance and use that to build your RDDs.
- You can only have one SparkContext per JVM, so if you need to satisfy
concurrent job requests you would need to manage the SparkContext as a
shared resource on that server. Keep in mind that if something goes wrong
with that SparkContext, all running jobs will probably end up in a failed
state and you'd need to create a new SparkContext.
- There are System.exit calls built into Spark as of now that could kill
your running JVM. We have shadowed some of the most offensive bits within
our own application to work around this. You'd likely want to do that or
maintain your own Spark fork. For example, if the SparkContext can't connect
to your cluster master node when it is created, it will System.exit.
- You'll need to provide all of the relevant classes that your platform
uses in the jobs on the classpath of the Spark cluster. We do this with a
JAR file loaded from S3 dynamically by a SparkContext, but there are other
options (see the sketch after this list).
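
A rough sketch of those basics in Scala, assuming a shared, lazily built
context as one way to handle the one-per-JVM constraint (the app name, master
URL and JAR path below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object SharedSpark {
  // Built once and reused: only one SparkContext may live in this JVM.
  lazy val sc: SparkContext = {
    val conf = new SparkConf()
      .setAppName("analytics-platform")          // placeholder name
      .setMaster("spark://master-host:7077")     // placeholder master URL
    val context = new SparkContext(conf)
    // Make your platform's job classes available to the executors
    // (Richard's team loads theirs from S3; a local path works too).
    context.addJar("/path/to/platform-jobs.jar")
    context
  }

  // An example job built on the shared context.
  def lineCount(path: String): Long = sc.textFile(path).count()
}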

On Mon, Apr 20, 2015 at 10:12 PM, firemonk9 dhiraj.peech...@gmail.com
wrote:

 I have built a data analytics SaaS platform by creating REST endpoints;
 based on the type of job request I invoke the necessary Spark job/jobs
 and return the results as JSON (async). I used yarn-client mode to submit
 the jobs to the YARN cluster.

 hope this helps.









Instantiating/starting Spark jobs programmatically

2015-04-20 Thread Ajay Singal
Greetings,

We have an analytics workflow system in production.  This system is built in
Java and utilizes other services (including Apache Solr).  It works fine
with moderate level of data/processing load.  However, when the load goes
beyond certain limit (e.g., more than 10 million messages/documents) delays
start to show up.  No doubt this is a scalability issue, and the Hadoop
ecosystem, especially Spark, can be handy in this situation.  The simplest
approach would be to rebuild the entire workflow using Spark, Kafka and
other components.  However, we decided to handle the problem in a couple of
phases.  In first phase we identified a few pain points (areas where
performance suffers most) and have started building corresponding mini Spark
applications (so as to take advantage of parallelism).

For now, my question is: how can we instantiate/start our mini Spark jobs
programmatically (e.g., from Java applications)?  The only option I see in
the documentation is to run the jobs through the command line (using
spark-submit).  Any insight in this area would be highly appreciated.

In the longer term, I want to construct a collection of mini Spark applications
(each performing one specific task, similar to web services), and
architect/design bigger Spark-based applications which in turn will call
these mini Spark applications programmatically.  There is a possibility that
the Spark community has already started building such a collection of
services.  Can you please provide some information/tips/best practices in
this regard?

Cheers!
Ajay







Re: Instantiating/starting Spark jobs programmatically

2015-04-20 Thread firemonk9
I have built a data analytics SaaS platform by creating REST endpoints;
based on the type of job request I invoke the necessary Spark job/jobs and
return the results as JSON (async). I used yarn-client mode to submit the
jobs to the YARN cluster.

hope this helps. 
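
As a rough illustration of that pattern (class and method names below are made
up, not from the original setup): the REST layer hands each request to a
Future that runs against a shared SparkContext configured for yarn-client
mode, and the JSON result is returned when it completes.

import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.{SparkConf, SparkContext}

class JobRunner(implicit ec: ExecutionContext) {
  // yarn-client mode: the driver runs inside this service's JVM and talks to YARN.
  private val sc = new SparkContext(
    new SparkConf().setAppName("saas-analytics").setMaster("yarn-client"))

  // Called from a REST endpoint; completes asynchronously with a JSON string.
  def lineCount(inputPath: String): Future[String] = Future {
    val n = sc.textFile(inputPath).count()
    s"""{"status":"finished","lines":$n}"""
  }
}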

   


