On 7 Apr 2017, at 18:40, Sam Elamin <hussam.ela...@gmail.com> wrote:
> Definitely agree with gourav there. I wouldn't want jenkins to run my workflow. Seems to me that you would only be using jenkins for its scheduling capabilities. Maybe I was just looking at this differently. Yes, you can run tests, but you wouldn't want it to run your orchestration of jobs. What happens if jenkins goes down for any particular reason? How do you have the conversation with your stakeholders that your pipeline is not working and they don't have data because the build server is going through an upgrade?

Well, I wouldn't use it as a replacement for Oozie, but I'd certainly consider it as the pipeline for getting your code out to the cluster, so you don't have to explain why you just pushed out something broken. As an example, here's Renault's pipeline as discussed last week in Munich: https://flic.kr/p/Tw3Emu

> However, to be fair, I understand what you are saying, Steve. If someone is in a place where you only have access to jenkins and have to go through hoops to set up / get access to new instances, then engineers will do what they always do: find ways to game the system to get their work done.

This isn't about trying to "game the system"; this is about what makes a replicable workflow for getting code into production, either at the press of a button or as part of a scheduled "we push out an update every night, rerun the deployment tests and then switch over to the new installation" mechanism. Put differently: how do you get your code from SCM into production?

Not just for CI, but what's your strategy for test data? That's always the trouble spot. Random selection of rows may work, although it will skip the odd outlier (a high-Unicode char in what should be a LATIN-1 field, a time set to 0, etc.), and for work joining > 1 table, you need rows which join well. I've never seen any good strategy there short of "throw it at a copy of the production dataset".
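[Editor's sketch] The "random rows plus known outliers" idea above can be made concrete. This is a minimal illustration, not anyone's production code: the row shape (dicts with hypothetical `name` and `ts` fields) and the `looks_odd` predicate are invented for the example.

```python
import random

def sample_test_rows(rows, k, is_outlier, seed=42):
    """Random sample of k rows, but always keep rows flagged as outliers,
    since plain random sampling tends to miss the odd bad record."""
    rng = random.Random(seed)  # fixed seed so the test set is reproducible
    outliers = [r for r in rows if is_outlier(r)]
    rest = [r for r in rows if not is_outlier(r)]
    return outliers + rng.sample(rest, min(k, len(rest)))

def looks_odd(row):
    # Hypothetical predicate: non-LATIN-1 text in a supposedly LATIN-1
    # field, or an epoch-zero timestamp.
    try:
        row["name"].encode("latin-1")
        bad_text = False
    except UnicodeEncodeError:
        bad_text = True
    return bad_text or row["ts"] == 0

rows = [
    {"name": "alice", "ts": 1491555600},
    {"name": "bob", "ts": 0},                       # zero-timestamp outlier
    {"name": "snowman \u2603", "ts": 1491555601},   # non-LATIN-1 outlier
    {"name": "carol", "ts": 1491555602},
]
test_set = sample_test_rows(rows, k=1, is_outlier=looks_odd)
```

Both outliers survive into `test_set` regardless of the sample size; only the unremarkable rows are down-sampled.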
-Steve

On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi Steve,

Why would you ever do that? You are suggesting the use of a CI tool as a workflow and orchestration engine.

Regards,
Gourav Sengupta

On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <ste...@hortonworks.com> wrote:

If you have Jenkins set up for some CI workflow, it can do scheduled builds and tests. Works well if you can do some build and test before even submitting it to a remote cluster.

On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote:

Hi Shyla,

You have multiple options really, some of which have already been listed, but let me try and clarify. Assuming you have a Spark application in a jar, you have a variety of options. You have to have an existing Spark cluster that is either running on EMR or somewhere else.

Super simple / hacky: a cron job on EC2 that calls a simple shell script which does a spark-submit to a Spark cluster, OR creates/adds a step to an EMR cluster.

More elegant: Airflow/Luigi/AWS Data Pipeline (which is just cron with a UI) that will do the above step but adds scheduling, potential backfilling and error handling (retries, alerts etc.).

AWS are coming out with Glue <https://aws.amazon.com/glue/> soon, which runs some Spark jobs, but I do not think it's available worldwide just yet.

Hope I cleared things up.

Regards,
Sam

On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi Shyla,

Why would you want to schedule a Spark job on EC2 instead of EMR?

Regards,
Gourav

On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:

I want to run a Spark batch job, maybe hourly, on AWS EC2. What is the easiest way to do this?

Thanks
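[Editor's sketch] The "super simple / hacky" cron option discussed in this thread could look like the following wrapper. Every path, host, jar name and class here is a made-up placeholder; substitute your own, and note that the exact spark-submit flags you need depend on your cluster.

```python
import subprocess

# Illustrative placeholders -- replace with your real values.
SPARK_SUBMIT = "/opt/spark/bin/spark-submit"
MASTER = "spark://master-host:7077"      # or "yarn" on an EMR cluster
APP_JAR = "/opt/jobs/my-batch-job.jar"   # hypothetical application jar
MAIN_CLASS = "com.example.BatchJob"      # hypothetical main class

def build_command():
    """Assemble the spark-submit invocation for the hourly batch job."""
    return [
        SPARK_SUBMIT,
        "--master", MASTER,
        "--class", MAIN_CLASS,
        "--deploy-mode", "cluster",
        APP_JAR,
    ]

def main():
    # check=True makes the script exit non-zero on failure, so cron's
    # mail (or whatever alerting wraps it) notices the broken run.
    subprocess.run(build_command(), check=True)
```

To schedule it hourly, you would call `main()` under an `if __name__ == "__main__":` guard and add a crontab entry such as `0 * * * * /usr/bin/python3 /opt/jobs/run_job.py`. As the thread notes, once you need retries, backfills and alerting, a scheduler like Airflow or Luigi is the more elegant home for this.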