And once again Java programmers are trying to solve a data analytics and
data warehousing problem using programming paradigms. It is genuinely
painful to see this happen.
Regards,
Gourav

On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Hi Steve
>
> Thanks for the detailed response. I think this problem doesn't have an
> industry-standard solution as of yet, and I am sure a lot of people would
> benefit from the discussion.
>
> I realise now what you are saying, so thanks for clarifying. That said,
> let me try and explain how we approached the problem.
>
> There are 2 problems you highlighted: the first is moving the code from
> SCM to prod, and the other is ensuring the data your code uses is correct
> (using the latest data from prod).
>
> *"how do you get your code from SCM into production?"*
>
> We currently have our pipeline being run via Airflow, and we have our
> dags in S3. With regards to how we get our code from SCM to production:
>
> 1) A Jenkins build that builds our Spark applications and runs tests
> 2) Once the first build is successful, we trigger another build to copy
> the dags to an S3 folder
>
> We then routinely sync this folder to the local Airflow dags folder every
> X minutes.
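A minimal sketch of that S3-to-dags sync step, assuming boto3; the bucket,
prefix, and dags path are invented placeholders, and a cron entry running
`aws s3 sync s3://<bucket>/<prefix> <dags dir>` achieves the same thing:

```python
import os
import boto3

# Placeholder names - adjust to your environment.
BUCKET = "my-team-pipelines"          # S3 bucket the Jenkins build copies DAGs into
PREFIX = "airflow/dags/"              # folder the second build writes to
DAGS_DIR = "/usr/local/airflow/dags"  # local Airflow dags folder

def sync_dags():
    """Mirror the S3 dag folder into the local Airflow dags directory."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):  # skip folder placeholder keys
                continue
            local_path = os.path.join(DAGS_DIR, os.path.relpath(key, PREFIX))
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(BUCKET, key, local_path)

if __name__ == "__main__":
    sync_dags()  # run from cron every X minutes, e.g. */5 * * * *
```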
> Re test data:
> *"but what's your strategy for test data: that's always the troublespot."*
>
> Our application is using versioning against the data, so we expect the
> source data to be in a certain version and the output data to also be in
> a certain version.
>
> We have a test resources folder that follows the same versioning
> convention - this is the data that our application tests use - to ensure
> that the data is in the correct format.
>
> So, for example, if we have Table X with version 1 that depends on data
> from Tables A and B, also version 1, we run our Spark application and
> then ensure the transformed Table X has the correct columns and row
> values.
>
> Then, when we have a new version 2 of the source data, or are adding a
> new column in Table X (version 2), we generate a new version of the data
> and ensure the tests are updated.
>
> That way we ensure any new version of the data has tests against it.
>
> *"I've never seen any good strategy there short of 'throw it at a copy
> of the production dataset'."*
>
> I agree, which is why we have a sample of the production data and version
> the schemas we expect the source and target data to look like.
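A sketch of what such a versioned schema test might look like in PySpark;
the table names, columns, fixture paths, and the `transform_x` job are all
invented for illustration:

```python
from pyspark.sql import SparkSession

# Assumed layout: test resources versioned like prod data,
# e.g. tests/resources/v1/table_a.json
SOURCE_VERSION = "v1"
EXPECTED_COLUMNS = {"customer_id", "order_total", "updated_at"}  # assumed schema

def transform_x(table_a, table_b):
    # Stand-in for the real job: join the versioned sources, derive Table X.
    return (table_a.join(table_b, "customer_id")
                   .select("customer_id", "order_total", "updated_at"))

def test_table_x_schema_and_rows():
    spark = (SparkSession.builder
             .master("local[2]")
             .appName("table-x-test")
             .getOrCreate())
    table_a = spark.read.json("tests/resources/%s/table_a.json" % SOURCE_VERSION)
    table_b = spark.read.json("tests/resources/%s/table_b.json" % SOURCE_VERSION)

    result = transform_x(table_a, table_b)

    # Schema contract: the transformed table must expose exactly these columns.
    assert set(result.columns) == EXPECTED_COLUMNS

    # Row-level contract: spot-check a known row from the versioned fixture.
    row = result.filter(result.customer_id == "c-001").first()
    assert row is not None and row.order_total == 42.0
```

When the source data moves to version 2, the fixtures and expected columns
are bumped together, so the tests always describe the schema the job is
supposed to consume and produce.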
> If people are interested, I am happy to write a blog about it in the
> hope that it helps people build more reliable pipelines.
>
> Kind Regards
> Sam
>
> On Tue, Apr 11, 2017 at 11:31 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
>> On 7 Apr 2017, at 18:40, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>> Definitely agree with Gourav there. I wouldn't want Jenkins to run my
>> workflow. Seems to me that you would only be using Jenkins for its
>> scheduling capabilities.
>>
>> Maybe I was just looking at this differently.
>>
>> Yes, you can run tests, but you wouldn't want it to run your
>> orchestration of jobs.
>>
>> What happens if Jenkins goes down for any particular reason? How do you
>> have the conversation with your stakeholders that your pipeline is not
>> working and they don't have data because the build server is going
>> through an upgrade?
>>
>> Well, I wouldn't use it as a replacement for Oozie, but I'd certainly
>> consider it as the pipeline for getting your code out to the cluster,
>> so you don't have to explain why you just pushed out something broken.
>>
>> As an example, here's Renault's pipeline as discussed last week in
>> Munich: https://flic.kr/p/Tw3Emu
>>
>> However, to be fair, I understand what you are saying, Steve. If
>> someone is in a place where you only have access to Jenkins and have to
>> go through hoops to set up/get access to new instances, then engineers
>> will do what they always do: find ways to game the system to get their
>> work done.
>>
>> This isn't about trying to "game the system"; this is about what makes
>> a replicable workflow for getting code into production, either at the
>> press of a button or as part of a scheduled "we push out an update
>> every night, rerun the deployment tests and then switch over to the new
>> installation" mechanism.
>>
>> Put differently: how do you get your code from SCM into production? Not
>> just for CI, but what's your strategy for test data: that's always the
>> troublespot. Random selection of rows may work, although it will skip
>> the odd outlier (high-unicode char in what should be a LATIN-1 field,
>> time set to 0, etc.), and for work joining > 1 table, you need rows
>> which join well. I've never seen any good strategy there short of
>> "throw it at a copy of the production dataset".
>>
>> -Steve
>>
>> On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>
>>> Hi Steve,
>>>
>>> Why would you ever do that? You are suggesting the use of a CI tool as
>>> a workflow and orchestration engine.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <ste...@hortonworks.com> wrote:
>>>
>>>> If you have Jenkins set up for some CI workflow, it can do scheduled
>>>> builds and tests. Works well if you can do some build tests before
>>>> even submitting it to a remote cluster.
>>>>
>>>> On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>>>
>>>> Hi Shyla
>>>>
>>>> You have multiple options really, some of which have already been
>>>> listed, but let me try and clarify.
>>>>
>>>> Assuming you have a Spark application in a jar, you have a variety of
>>>> options. You have to have an existing Spark cluster that is either
>>>> running on EMR or somewhere else.
>>>>
>>>> *Super simple / hacky*
>>>> A cron job on EC2 that calls a simple shell script that does a
>>>> spark-submit to a Spark cluster, OR creates or adds a step to an EMR
>>>> cluster.
>>>>
>>>> *More elegant*
>>>> Airflow/Luigi/AWS Data Pipeline (which is just cron in the UI) that
>>>> will do the above step but with scheduling and potential backfilling
>>>> and error handling (retries, alerts, etc.).
>>>>
>>>> AWS are coming out with Glue <https://aws.amazon.com/glue/> soon,
>>>> which does some Spark jobs, but I do not think it's available
>>>> worldwide just yet.
>>>>
>>>> Hope I cleared things up.
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi Shyla,
>>>>>
>>>>> Why would you want to schedule a Spark job in EC2 instead of EMR?
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>
>>>>>> I want to run a Spark batch job, maybe hourly, on AWS EC2. What is
>>>>>> the easiest way to do this? Thanks
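For the "more elegant" option Sam lists above, a minimal Airflow DAG looks
roughly like this; the schedule, jar path, class name, and emails are
placeholders, and the import path is the Airflow 1.x one current at the
time:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x path

default_args = {
    "owner": "data-eng",
    "retries": 2,                      # the retry handling mentioned above
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,          # alerting hook
    "email": ["alerts@example.com"],   # placeholder address
}

dag = DAG(
    dag_id="hourly_spark_batch",
    default_args=default_args,
    start_date=datetime(2017, 4, 1),
    schedule_interval="@hourly",       # the cron part, plus backfill support
)

submit_job = BashOperator(
    task_id="spark_submit_batch",
    bash_command=(
        "spark-submit --master yarn --deploy-mode cluster "
        "--class com.example.BatchJob "              # hypothetical job class
        "s3://my-team-pipelines/jars/batch-job.jar"  # hypothetical jar location
    ),
    dag=dag,
)
```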
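And for the "add a step to an EMR cluster" route, the boto3 call looks
roughly like this, with a made-up cluster id, region, and jar location:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # id of the running EMR cluster
    Steps=[{
        "Name": "hourly-batch",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's generic step runner
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "--class", "com.example.BatchJob",           # hypothetical class
                "s3://my-team-pipelines/jars/batch-job.jar",  # hypothetical jar
            ],
        },
    }],
)
print(response["StepIds"])  # step ids to poll for completion
```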