On 7 Apr 2017, at 18:40, Sam Elamin <hussam.ela...@gmail.com> wrote:
> Definitely agree with gourav there. I wouldn't want jenkins to run my workflow. Seems to me that you would only be using jenkins for its scheduling capabilities. Maybe I was just looking at this differently. Yes, you can run tests, but you wouldn't want it to run your orchestration of jobs. What happens if jenkins goes down for any particular reason? How do you have the conversation with your stakeholders that your pipeline is not working and they don't have data because the build server is going through an upgrade?

Well, I wouldn't use it as a replacement for Oozie, but I'd certainly consider it as the pipeline for getting your code out to the cluster, so you don't have to explain why you just pushed out something broken. As an example, here's Renault's pipeline as discussed last week in Munich: https://flic.kr/p/Tw3Emu

> However, to be fair, I understand what you are saying, Steve. If someone is in a place where you only have access to jenkins and have to go through hoops to set up / get access to new instances, then engineers will do what they always do: find ways to game the system to get their work done.

This isn't about trying to "game the system"; this is about what makes a replicable workflow for getting code into production, either at the press of a button or as part of a scheduled "we push out an update every night, rerun the deployment tests and then switch over to the new installation" mechanism. Put differently: how do you get your code from SCM into production?

Not just for CI, but what's your strategy for test data? That's always the trouble spot. Random selection of rows may work, although it will skip the odd outlier (a high-Unicode char in what should be a LATIN-1 field, a time set to 0, etc.), and for work joining > 1 table, you need rows which join well. I've never seen any good strategy there short of "throw it at a copy of the production dataset".
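[Editor's sketch] The "random rows plus known outliers" idea above can be made concrete. This is a minimal illustration, not anyone's production code: the row shape (dicts with hypothetical `name` and `ts` fields) and the `looks_odd` predicate are invented for the example.

```python
import random

def sample_test_rows(rows, k, is_outlier, seed=42):
    """Random sample of k rows, but always keep rows flagged as outliers,
    since plain random sampling tends to miss the odd bad record."""
    rng = random.Random(seed)  # fixed seed so the test set is reproducible
    outliers = [r for r in rows if is_outlier(r)]
    rest = [r for r in rows if not is_outlier(r)]
    return outliers + rng.sample(rest, min(k, len(rest)))

def looks_odd(row):
    # Hypothetical predicate: non-LATIN-1 text in a supposedly LATIN-1
    # field, or an epoch-zero timestamp.
    try:
        row["name"].encode("latin-1")
        bad_text = False
    except UnicodeEncodeError:
        bad_text = True
    return bad_text or row["ts"] == 0

rows = [
    {"name": "alice", "ts": 1491555600},
    {"name": "bob", "ts": 0},                       # zero-timestamp outlier
    {"name": "snowman \u2603", "ts": 1491555601},   # non-LATIN-1 outlier
    {"name": "carol", "ts": 1491555602},
]
test_set = sample_test_rows(rows, k=1, is_outlier=looks_odd)
```

Both outliers survive into `test_set` regardless of the sample size; only the unremarkable rows are down-sampled.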
-Steve

On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi Steve,

Why would you ever do that? You are suggesting the use of a CI tool as a workflow and orchestration engine.

Regards,
Gourav Sengupta

On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <ste...@hortonworks.com> wrote:

If you have Jenkins set up for some CI workflow, it can do scheduled builds and tests. Works well if you can do some build and test before even submitting it to a remote cluster.

On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote:

Hi Shyla,

You have multiple options really, some of which have already been listed, but let me try and clarify. Assuming you have a Spark application in a jar, you have a variety of options. You have to have an existing Spark cluster that is either running on EMR or somewhere else.

Super simple / hacky: a cron job on EC2 that calls a simple shell script which does a spark-submit to a Spark cluster, OR creates/adds a step to an EMR cluster.

More elegant: Airflow/Luigi/AWS Data Pipeline (which is just cron with a UI) that will do the above step but adds scheduling, potential backfilling and error handling (retries, alerts etc.).

AWS are coming out with Glue <https://aws.amazon.com/glue/> soon, which runs some Spark jobs, but I do not think it's available worldwide just yet.

Hope I cleared things up.

Regards,
Sam

On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi Shyla,

Why would you want to schedule a Spark job on EC2 instead of EMR?

Regards,
Gourav

On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:

I want to run a Spark batch job, maybe hourly, on AWS EC2. What is the easiest way to do this?

Thanks
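[Editor's sketch] The "super simple / hacky" cron option discussed in this thread could look like the following wrapper. Every path, host, jar name and class here is a made-up placeholder; substitute your own, and note that the exact spark-submit flags you need depend on your cluster.

```python
import subprocess

# Illustrative placeholders -- replace with your real values.
SPARK_SUBMIT = "/opt/spark/bin/spark-submit"
MASTER = "spark://master-host:7077"      # or "yarn" on an EMR cluster
APP_JAR = "/opt/jobs/my-batch-job.jar"   # hypothetical application jar
MAIN_CLASS = "com.example.BatchJob"      # hypothetical main class

def build_command():
    """Assemble the spark-submit invocation for the hourly batch job."""
    return [
        SPARK_SUBMIT,
        "--master", MASTER,
        "--class", MAIN_CLASS,
        "--deploy-mode", "cluster",
        APP_JAR,
    ]

def main():
    # check=True makes the script exit non-zero on failure, so cron's
    # mail (or whatever alerting wraps it) notices the broken run.
    subprocess.run(build_command(), check=True)
```

To schedule it hourly, you would call `main()` under an `if __name__ == "__main__":` guard and add a crontab entry such as `0 * * * * /usr/bin/python3 /opt/jobs/run_job.py`. As the thread notes, once you need retries, backfills and alerting, a scheduler like Airflow or Luigi is the more elegant home for this.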