Re: Instantiating/starting Spark jobs programmatically
Hi firemonk9,

What you're doing looks interesting. Can you share some more details? Are you running the same SparkContext for every job, or a separate SparkContext per job? Does your system need to share RDDs across multiple jobs? If so, how do you implement that? Also, what prompted you to run YARN instead of standalone — does it give some performance benefit? Have you evaluated YARN vs. Mesos?

Have you also looked at spark-jobserver by Ooyala? It makes some of the things I mentioned easier. IIRC it also works with YARN; it definitely works with Mesos. Here's the link: https://github.com/spark-jobserver/spark-jobserver

Thanks,
Anshul

On 23 Apr 2015 20:32, Dean Wampler deanwamp...@gmail.com wrote:
> ...
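For readers unfamiliar with spark-jobserver: jobs implement a small trait and are invoked over REST against a jobserver-managed SparkContext. A minimal sketch, based on the SparkJob API the project documented around this time — class, package, and config-key names here are assumptions and should be checked against the version you deploy:

    // Sketch of a spark-jobserver job (API names per the jobserver README
    // of this era; verify against your deployed version).
    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

    object WordCountJob extends SparkJob {
      // Reject the request early if required input is missing.
      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        SparkJobValid

      // The jobserver calls this with its managed, shared SparkContext;
      // the return value is serialized back to the REST caller.
      override def runJob(sc: SparkContext, config: Config): Any = {
        val input = config.getString("input.string")  // hypothetical config key
        sc.parallelize(input.split(" ").toSeq).countByValue()
      }
    }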
Re: Instantiating/starting Spark jobs programmatically
I strongly recommend spawning a new process for the Spark jobs. Much cleaner separation. Your driver program won't be clobbered if the Spark job dies, etc. It can even watch for failures and restart.

In the Scala standard library, the sys.process package has classes for constructing and interoperating with external processes. Perhaps Java has something similar these days?

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly) http://shop.oreilly.com/product/0636920033073.do
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Tue, Apr 21, 2015 at 2:15 PM, Steve Loughran ste...@hortonworks.com wrote:
> ...
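A minimal sketch of that approach with scala.sys.process, assuming spark-submit is on the PATH; the jar path, main class, and arguments are placeholders:

    // Spawn spark-submit as an external process and watch its exit code,
    // so a failed Spark job can't take down the driver application.
    import scala.sys.process._

    val cmd = Seq(
      "spark-submit",
      "--master", "yarn-client",
      "--class", "com.example.MiniJob",  // hypothetical main class
      "/path/to/mini-job.jar",           // hypothetical application jar
      "hdfs:///data/in"                  // hypothetical job argument
    )

    val exitCode = cmd.!  // blocks until the child process exits
    if (exitCode != 0) {
      // Our JVM survives the failure and can retry or alert.
      println(s"Spark job failed with exit code $exitCode; scheduling restart")
    }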
Re: Instantiating/starting Spark jobs programmatically
On 21 Apr 2015, at 17:34, Richard Marscher rmarsc...@localytics.com wrote:
> There are System.exit calls built into Spark as of now that could kill your running JVM. We have shadowed some of the most offensive bits within our own application to work around this. You'd likely want to do that or do your own Spark fork. For example, if the SparkContext can't connect to your cluster master node when it is created, it will System.exit.

People can block errant System.exit calls by running under a SecurityManager. Less than ideal (and there's a small performance hit), but possible.
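A minimal sketch of that trick, assuming it's acceptable to turn exit attempts into catchable exceptions JVM-wide (the object name is illustrative):

    // Install a SecurityManager that turns System.exit into a catchable
    // SecurityException instead of killing the JVM.
    object NoExitSecurityManager extends SecurityManager {
      override def checkExit(status: Int): Unit =
        throw new SecurityException(s"Blocked System.exit($status)")

      // Permit everything else; we only care about intercepting exit.
      // (A real deployment would want a more careful policy.)
      override def checkPermission(perm: java.security.Permission): Unit = ()
    }

    System.setSecurityManager(NoExitSecurityManager)

    try {
      // code that might call System.exit, e.g. SparkContext creation
    } catch {
      case e: SecurityException => println(s"Intercepted: ${e.getMessage}")
    }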
Re: Instantiating/starting Spark jobs programmatically
Could you describe in more detail what you are trying to do? Some basics of submitting programmatically:

- Create a SparkContext instance and use that to build your RDDs.

- You can only have one SparkContext per JVM, so if you need to satisfy concurrent job requests you would need to manage the SparkContext as a shared resource on that server (see the sketch below this message). Keep in mind that if something goes wrong with that SparkContext, all running jobs will probably end up in a failed state and you'd need to get a new SparkContext.

- There are System.exit calls built into Spark as of now that could kill your running JVM. We have shadowed some of the most offensive bits within our own application to work around this. You'd likely want to do that or maintain your own Spark fork. For example, if the SparkContext can't connect to your cluster master node when it is created, it will System.exit.

- You'll need to provide all of the relevant classes that your platform uses in the jobs on the classpath of the Spark cluster. We do this with a JAR file loaded from S3 dynamically by the SparkContext, but there are other options.

On Mon, Apr 20, 2015 at 10:12 PM, firemonk9 dhiraj.peech...@gmail.com wrote:
> ...
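A minimal sketch of treating the one-per-JVM SparkContext as a shared, replaceable resource; all names, the master URL, and the recovery policy are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    // Guard the single SparkContext behind a small manager so concurrent
    // request handlers share it, and a broken context can be torn down
    // and replaced.
    object SharedSparkContext {
      private var sc: Option[SparkContext] = None

      private def create(): SparkContext = {
        val conf = new SparkConf()
          .setAppName("analytics-platform")  // hypothetical app name
          .setMaster("yarn-client")
        new SparkContext(conf)               // note: can System.exit on failure, per above
      }

      def get: SparkContext = synchronized {
        sc.getOrElse {
          val ctx = create()
          sc = Some(ctx)
          ctx
        }
      }

      // Call when a job observes the context is broken; the next
      // get will build a fresh one.
      def reset(): Unit = synchronized {
        sc.foreach(ctx => try ctx.stop() catch { case _: Throwable => () })
        sc = None
      }
    }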
Instantiating/starting Spark jobs programmatically
Greetings,

We have an analytics workflow system in production. This system is built in Java and uses other services (including Apache Solr). It works fine under a moderate data/processing load, but when the load goes beyond a certain limit (e.g., more than 10 million messages/documents), delays start to show up. This is clearly a scalability issue, and the Hadoop ecosystem, especially Spark, can be handy in this situation.

The simplest approach would be to rebuild the entire workflow using Spark, Kafka, and other components. However, we decided to handle the problem in a couple of phases. In the first phase we identified a few pain points (the areas where performance suffers most) and have started building corresponding mini Spark applications (so as to take advantage of parallelism).

For now my question is: how can we instantiate/start our mini Spark jobs programmatically (e.g., from Java applications)? The only option I see in the documentation is to run the jobs from the command line (using spark-submit). Any insight in this area would be highly appreciated.

In the longer term, I want to build a collection of mini Spark applications (each performing one specific task, similar to web services) and architect/design bigger Spark-based applications which in turn will call these mini applications programmatically. There is a possibility that the Spark community has already started building such a collection of services. Can you please provide some information/tips/best practices in this regard?

Cheers!
Ajay
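The replies in this thread cover two routes: embed the driver in your own JVM, or spawn spark-submit as a child process. A minimal sketch of the embedded route in Scala (the same calls are available from Java via JavaSparkContext), with a hypothetical master URL, app name, and input path:

    import org.apache.spark.{SparkConf, SparkContext}

    // Start a "mini" Spark job programmatically by creating the driver's
    // SparkContext inside your own application, no spark-submit involved.
    val conf = new SparkConf()
      .setAppName("mini-word-count")  // hypothetical job name
      .setMaster("yarn-client")       // or spark://host:7077, local[*]
    val sc = new SparkContext(conf)

    try {
      val counts = sc.textFile("hdfs:///data/docs")  // hypothetical input
        .flatMap(_.split("\\s+"))
        .map((_, 1L))
        .reduceByKey(_ + _)
      println(s"distinct tokens: ${counts.count()}")
    } finally {
      sc.stop()  // release cluster resources when the job finishes
    }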
Re: Instantiating/starting Spark jobs programmatically
I have built a data-analytics SaaS platform by creating REST endpoints; based on the type of job request, I invoke the necessary Spark job(s) and return the results as JSON (asynchronously). I used yarn-client mode to submit the jobs to the YARN cluster.

Hope this helps.
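A minimal sketch of that shape, with a plain function standing in for whatever REST framework is used; the request type, job kinds, and hand-rolled JSON are all hypothetical:

    import scala.concurrent.{ExecutionContext, Future}
    import ExecutionContext.Implicits.global

    // A request handler that kicks off a Spark job asynchronously and
    // eventually yields its result as JSON. The SparkContext is assumed
    // to be managed elsewhere (e.g., one yarn-client context per server
    // JVM, as discussed earlier in this thread).
    case class JobRequest(kind: String, inputPath: String)

    def handle(req: JobRequest, sc: org.apache.spark.SparkContext): Future[String] =
      Future {
        val result = req.kind match {
          case "lineCount" => sc.textFile(req.inputPath).count()
          case other       => sys.error(s"unknown job kind: $other")
        }
        s"""{"kind":"${req.kind}","result":$result}"""  // hand-rolled JSON for brevity
      }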