And once again Java programmers are trying to solve a data analytics and
data warehousing problem using programming paradigms. It is genuinely
painful to see this happen.
Regards,
Gourav

On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Hi Steve
>
> Thanks for the detailed response. I think this problem doesn't have an
> industry-standard solution as of yet, and I am sure a lot of people would
> benefit from the discussion.
>
> I realise now what you are saying, so thanks for clarifying. That said,
> let me try and explain how we approached the problem.
>
> There are 2 problems you highlighted: the first is moving the code from
> SCM to prod, and the other is ensuring the data your code uses is correct
> (using the latest data from prod).
>
> *"how do you get your code from SCM into production?"*
>
> We currently have our pipeline being run via Airflow, and we have our
> dags in S3. With regards to how we get our code from SCM to production:
>
> 1) A Jenkins build that builds our Spark applications and runs tests
> 2) Once the first build is successful, we trigger another build to copy
> the dags to an S3 folder
>
> We then routinely sync this folder to the local Airflow dags folder every
> X minutes.
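A minimal sketch of that S3-to-dags sync step, assuming boto3; the bucket,
prefix, and dags path are invented placeholders, and a cron entry running
`aws s3 sync s3://<bucket>/<prefix> <dags dir>` achieves the same thing:

```python
import os
import boto3

# Placeholder names - adjust to your environment.
BUCKET = "my-team-pipelines"          # S3 bucket the Jenkins build copies DAGs into
PREFIX = "airflow/dags/"              # folder the second build writes to
DAGS_DIR = "/usr/local/airflow/dags"  # local Airflow dags folder

def sync_dags():
    """Mirror the S3 dag folder into the local Airflow dags directory."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):  # skip folder placeholder keys
                continue
            local_path = os.path.join(DAGS_DIR, os.path.relpath(key, PREFIX))
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(BUCKET, key, local_path)

if __name__ == "__main__":
    sync_dags()  # run from cron every X minutes, e.g. */5 * * * *
```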
> Re test data:
> *"but what's your strategy for test data: that's always the troublespot."*
>
> Our application is using versioning against the data, so we expect the
> source data to be in a certain version and the output data to also be in
> a certain version.
>
> We have a test resources folder that follows the same versioning
> convention - this is the data that our application tests use - to ensure
> that the data is in the correct format.
>
> So, for example, if we have Table X with version 1 that depends on data
> from Tables A and B, also version 1, we run our Spark application and
> then ensure the transformed Table X has the correct columns and row
> values.
>
> Then, when we have a new version 2 of the source data, or are adding a
> new column in Table X (version 2), we generate a new version of the data
> and ensure the tests are updated.
>
> That way we ensure any new version of the data has tests against it.
>
> *"I've never seen any good strategy there short of 'throw it at a copy
> of the production dataset'."*
>
> I agree, which is why we have a sample of the production data and version
> the schemas we expect the source and target data to look like.
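A sketch of what such a versioned schema test might look like in PySpark;
the table names, columns, fixture paths, and the `transform_x` job are all
invented for illustration:

```python
from pyspark.sql import SparkSession

# Assumed layout: test resources versioned like prod data,
# e.g. tests/resources/v1/table_a.json
SOURCE_VERSION = "v1"
EXPECTED_COLUMNS = {"customer_id", "order_total", "updated_at"}  # assumed schema

def transform_x(table_a, table_b):
    # Stand-in for the real job: join the versioned sources, derive Table X.
    return (table_a.join(table_b, "customer_id")
                   .select("customer_id", "order_total", "updated_at"))

def test_table_x_schema_and_rows():
    spark = (SparkSession.builder
             .master("local[2]")
             .appName("table-x-test")
             .getOrCreate())
    table_a = spark.read.json("tests/resources/%s/table_a.json" % SOURCE_VERSION)
    table_b = spark.read.json("tests/resources/%s/table_b.json" % SOURCE_VERSION)

    result = transform_x(table_a, table_b)

    # Schema contract: the transformed table must expose exactly these columns.
    assert set(result.columns) == EXPECTED_COLUMNS

    # Row-level contract: spot-check a known row from the versioned fixture.
    row = result.filter(result.customer_id == "c-001").first()
    assert row is not None and row.order_total == 42.0
```

When the source data moves to version 2, the fixtures and expected columns
are bumped together, so the tests always describe the schema the job is
supposed to consume and produce.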
> If people are interested, I am happy to write a blog about it in the
> hope that it helps people build more reliable pipelines.
>
> Kind Regards
> Sam
>
> On Tue, Apr 11, 2017 at 11:31 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
>> On 7 Apr 2017, at 18:40, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>> Definitely agree with Gourav there. I wouldn't want Jenkins to run my
>> workflow. Seems to me that you would only be using Jenkins for its
>> scheduling capabilities.
>>
>> Maybe I was just looking at this differently.
>>
>> Yes, you can run tests, but you wouldn't want it to run your
>> orchestration of jobs.
>>
>> What happens if Jenkins goes down for any particular reason? How do you
>> have the conversation with your stakeholders that your pipeline is not
>> working and they don't have data because the build server is going
>> through an upgrade?
>>
>> Well, I wouldn't use it as a replacement for Oozie, but I'd certainly
>> consider it as the pipeline for getting your code out to the cluster,
>> so you don't have to explain why you just pushed out something broken.
>>
>> As an example, here's Renault's pipeline as discussed last week in
>> Munich: https://flic.kr/p/Tw3Emu
>>
>> However, to be fair, I understand what you are saying, Steve. If
>> someone is in a place where you only have access to Jenkins and have to
>> go through hoops to set up/get access to new instances, then engineers
>> will do what they always do: find ways to game the system to get their
>> work done.
>>
>> This isn't about trying to "game the system"; this is about what makes
>> a replicable workflow for getting code into production, either at the
>> press of a button or as part of a scheduled "we push out an update
>> every night, rerun the deployment tests and then switch over to the new
>> installation" mechanism.
>>
>> Put differently: how do you get your code from SCM into production? Not
>> just for CI, but what's your strategy for test data: that's always the
>> troublespot. Random selection of rows may work, although it will skip
>> the odd outlier (high-unicode char in what should be a LATIN-1 field,
>> time set to 0, etc.), and for work joining > 1 table, you need rows
>> which join well. I've never seen any good strategy there short of
>> "throw it at a copy of the production dataset".
>>
>> -Steve
>>
>> On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>
>>> Hi Steve,
>>>
>>> Why would you ever do that? You are suggesting the use of a CI tool as
>>> a workflow and orchestration engine.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <ste...@hortonworks.com> wrote:
>>>
>>>> If you have Jenkins set up for some CI workflow, it can do scheduled
>>>> builds and tests. Works well if you can do some build tests before
>>>> even submitting it to a remote cluster.
>>>>
>>>> On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>>>
>>>> Hi Shyla
>>>>
>>>> You have multiple options really, some of which have already been
>>>> listed, but let me try and clarify.
>>>>
>>>> Assuming you have a Spark application in a jar, you have a variety of
>>>> options. You have to have an existing Spark cluster that is either
>>>> running on EMR or somewhere else.
>>>>
>>>> *Super simple / hacky*
>>>> A cron job on EC2 that calls a simple shell script that does a
>>>> spark-submit to a Spark cluster, OR creates or adds a step to an EMR
>>>> cluster.
>>>>
>>>> *More elegant*
>>>> Airflow/Luigi/AWS Data Pipeline (which is just cron in the UI) that
>>>> will do the above step but with scheduling and potential backfilling
>>>> and error handling (retries, alerts, etc.).
>>>>
>>>> AWS are coming out with Glue <https://aws.amazon.com/glue/> soon,
>>>> which does some Spark jobs, but I do not think it's available
>>>> worldwide just yet.
>>>>
>>>> Hope I cleared things up.
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi Shyla,
>>>>>
>>>>> Why would you want to schedule a Spark job in EC2 instead of EMR?
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>
>>>>>> I want to run a Spark batch job, maybe hourly, on AWS EC2. What is
>>>>>> the easiest way to do this? Thanks
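For the "more elegant" option Sam lists above, a minimal Airflow DAG looks
roughly like this; the schedule, jar path, class name, and emails are
placeholders, and the import path is the Airflow 1.x one current at the
time:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x path

default_args = {
    "owner": "data-eng",
    "retries": 2,                      # the retry handling mentioned above
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,          # alerting hook
    "email": ["alerts@example.com"],   # placeholder address
}

dag = DAG(
    dag_id="hourly_spark_batch",
    default_args=default_args,
    start_date=datetime(2017, 4, 1),
    schedule_interval="@hourly",       # the cron part, plus backfill support
)

submit_job = BashOperator(
    task_id="spark_submit_batch",
    bash_command=(
        "spark-submit --master yarn --deploy-mode cluster "
        "--class com.example.BatchJob "              # hypothetical job class
        "s3://my-team-pipelines/jars/batch-job.jar"  # hypothetical jar location
    ),
    dag=dag,
)
```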
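And for the "add a step to an EMR cluster" route, the boto3 call looks
roughly like this, with a made-up cluster id, region, and jar location:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # id of the running EMR cluster
    Steps=[{
        "Name": "hourly-batch",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's generic step runner
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "--class", "com.example.BatchJob",           # hypothetical class
                "s3://my-team-pipelines/jars/batch-job.jar",  # hypothetical jar
            ],
        },
    }],
)
print(response["StepIds"])  # step ids to poll for completion
```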