On 12 Apr 2017, at 17:25, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
Hi,

Your answer is like saying, I know how to code in assembly level language and I am going to build the next GUI in assembly level code and I think that there is a genuine functional requirement to see a color of a button in green on the screen.

Well, I reserve the right to have incomplete knowledge, and look forward to improving it.

Perhaps it may be pertinent to read the first preface of a CI/CD book and realize what kind of software development disciplines it is applicable to.

The original introduction on CI was probably Fowler's Cruise Control article, https://martinfowler.com/articles/originalContinuousIntegration.html

"The key is to automate absolutely everything and run the process so often that integration errors are found quickly"

Java Development with Ant, 2003, looks at Cruise Control, Anthill and Gump, again, with that focus on team coding and automated regression testing, both of unit tests, and, with things like HttpUnit, web UIs. There's no discussion of "Data" per se, though databases are implicit.

Apache Gump [Sam Ruby, 2001] was designed to address a single problem: "get the entire ASF project portfolio to build and test against the latest build of everything else". Lots of finger pointing there, especially when something foundational like Ant or Xerces did bad.

AFAIK, the earliest known in-print reference to Continuous Deployment is the HP Labs 2002 paper, Making Web Services that Work. That introduced the concept with a focus on automating deployment, staging testing and treating ops problems as use cases for which engineers could often write tests, and, perhaps, even design their applications to support.
"We are exploring extending this model to one we term Continuous Deployment - after passing the local test suite, a service can be automatically deployed to a public staging server for stress and acceptance testing by physically remote calling parties"

At this time, the applications weren't modern "big data" apps as they didn't have affordable storage or the tools to schedule work over it. It wasn't that the people writing the books and papers looked at big data and said "not for us", it just wasn't on their horizons. 1TB was a lot of storage in those days, not a high-end SSD.

Otherwise your approach is just another line of defense in saving your job by applying an impertinent, incorrect, and outdated skill and tool to a problem.

Please be a bit more constructive here, the ASF code of conduct encourages empathy and collaboration. https://www.apache.org/foundation/policies/conduct . Thanks.

Building data products is a very different discipline from that of building software.

Which is why we need to consider how to take what are core methodologies for software and apply them, and, where appropriate, supersede them with new workflows, ideas, technologies. But doing so with an understanding of the reasoning behind today's tools and workflows. I'm really interested in how we get from experimental notebook code to something usable in production, pushing it out, finding the dirty-data problems before it goes live, etc.

I do think today's tools have been outgrown by the applications we now build, and am thinking not so much "which tools to use", but one step further, "what are the new tools and techniques to use?".

I look forward to whatever insight people have here.
My genuine advice to everyone in all spheres of activities will be to first understand the problem to solve before solving it and definitely before selecting the tools to solve it, otherwise you will end up with a bowl of soup and fork in hand and argue that CI/CD is still applicable to building data products and data warehousing.

I concur

Regards,
Gourav

-Steve

On Wed, Apr 12, 2017 at 12:42 PM, Steve Loughran <ste...@hortonworks.com> wrote:

On 11 Apr 2017, at 20:46, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

And once again JAVA programmers are trying to solve a data analytics and data warehousing problem using programming paradigms. It is genuinely a pain to see this happen.

While I'm happy to be faulted for treating things as software processes, having a fully automated mechanism for testing the latest code before production is something I'd consider foundational today. This is what "Continuous Deployment" was about when it was first conceived. Does it mean you should blindly deploy that way? Well, not if you worry about security, but having that review process and then a final manual "deploy" button can address that.

Cloud infras let you integrate cluster instantiation into the process, which helps you automate things like "stage the deployment in some new VMs, run acceptance tests (*), then switch the load balancer over to the new cluster, being ready to switch back if you need".

I've not tried that with streaming apps though; I don't know how to do it there. Booting the new cluster off checkpointed state requires deserialization to work, which can't be guaranteed if you are changing the objects which get serialized.

I'd argue then, it's not a problem which has already been solved by data analytics/warehousing, though if you've got pointers there, I'd be grateful. Always good to see work by others.
Indeed, the telecoms industry have led the way in testing and HA deployment: if you look at Erlang you can see a system designed with hot upgrades in mind, the way Java code's "add a JAR to a web server" never was.

-Steve

(*) Do always make sure this is the test cluster with a snapshot of test data, not production machines/data. There are always horror stories there.

Regards,
Gourav

On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

Hi Steve

Thanks for the detailed response. I think this problem doesn't have an industry standard solution as of yet, and I am sure a lot of people would benefit from the discussion.

I realise now what you are saying, so thanks for clarifying. That said, let me try and explain how we approached the problem.

There are 2 problems you highlighted: the first is moving the code from SCM to prod, and the other is ensuring the data your code uses is correct (using the latest data from prod).

"how do you get your code from SCM into production?"

We currently have our pipeline being run via Airflow, and we have our DAGs in S3. With regards to how we get our code from SCM to production:

1) A Jenkins build that builds our Spark applications and runs tests
2) Once the first build is successful, we trigger another build to copy the DAGs to an S3 folder

We then routinely sync this folder to the local Airflow DAGs folder every X minutes.

Re test data

"but what's your strategy for test data: that's always the troublespot."
Our application is using versioning against the data, so we expect the source data to be in a certain version and the output data to also be in a certain version.

We have a test resources folder that follows the same versioning convention. This is the data that our application tests use to ensure that the data is in the correct format.

So for example, if we have Table X with version 1 that depends on data from Table A and B, also version 1, we run our Spark application, then ensure the transformed Table X has the correct columns and row values.

Then when we have a new version 2 of the source data, or are adding a new column in Table X (version 2), we generate a new version of the data and ensure the tests are updated.

That way we ensure any new version of the data has tests against it.

"I've never seen any good strategy there short of "throw it at a copy of the production dataset"."

I agree, which is why we have a sample of the production data and version the schemas we expect the source and target data to look like.

If people are interested, I am happy to write a blog post about it in the hope that it helps people build more reliable pipelines.

Love to see that.

Kind Regards
Sam
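
For anyone following along, the versioned test-data approach described above could be sketched roughly as below. This is only an illustration in plain Python, not the actual pipeline: the table names, column names, the EXPECTED_COLUMNS registry and both functions are hypothetical, and a real job would run the transformation in Spark against fixtures loaded from the versioned test resources folder.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]

# Hypothetical registry of expected output columns, keyed by (table, version).
EXPECTED_COLUMNS: Dict[Tuple[str, int], List[str]] = {
    ("table_x", 1): ["id", "total"],
    ("table_x", 2): ["currency_code", "id", "total"],
}


def build_table_x_v1(table_a: List[Row], table_b: List[Row]) -> List[Row]:
    """Toy stand-in for the Spark job: inner join A and B on id, sum the amounts."""
    b_amounts = {row["id"]: row["amount"] for row in table_b}
    return [
        {"id": row["id"], "total": row["amount"] + b_amounts[row["id"]]}
        for row in table_a
        if row["id"] in b_amounts
    ]


def schema_errors(rows: List[Row], table: str, version: int) -> List[str]:
    """Return one message per row whose columns differ from the expected version."""
    expected = sorted(EXPECTED_COLUMNS[(table, version)])
    return [
        f"row {i}: got {sorted(row)}, expected {expected}"
        for i, row in enumerate(rows)
        if sorted(row) != expected
    ]


# Version 1 fixtures, standing in for files in the versioned test resources folder.
table_a_v1 = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]
table_b_v1 = [{"id": 1, "amount": 3}]

table_x = build_table_x_v1(table_a_v1, table_b_v1)
assert schema_errors(table_x, "table_x", 1) == []   # columns match version 1
assert table_x == [{"id": 1, "total": 13}]          # row values match too
```

Keeping the expected columns in one registry keyed by version means that adding a version 2 of Table X forces a new entry and new fixtures, so the tests cannot silently keep passing against the old shape.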