Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

Steve Loughran Wed, 12 Apr 2017 10:46:25 -0700

On 12 Apr 2017, at 17:25, Gourav Sengupta 
<gourav.sengu...@gmail.com<mailto:gourav.sengu...@gmail.com>> wrote:


Hi,

Your answer is like saying, I know how to code in assembly level language and I 
am going to build the next GUI in assembly level code and I think that there is 
a genuine functional requirement to see a color of a button in green on the 
screen.


well, I reserve the right to have incomplete knowledge, and look forward to 
improving it.

Perhaps it may be pertinent to read the first preface of a CI/CD book and 
realize what kind of software development disciplines it is applicable to.

the original introduction on CI was probably Fowler's Cruise Control article,
https://martinfowler.com/articles/originalContinuousIntegration.html

"The key is to automate absolutely everything and run the process so often that 
integration errors are found quickly"

Java Development with Ant, 2003, looks at Cruise Control, Anthill and Gump, 
again, with that focus on team coding and automated regression testing, both of 
unit tests, and, with things like HttpUnit, web UIs. There's no discussion of 
"Data" per-se, though databases are implicit.

Apache Gump [Sam Ruby, 2001] was designed to address a single problem "get the 
entire ASF project portfolio to build and test against the latest build of 
everything else". Lots of finger pointing there, especially when something 
foundational like Ant or Xerces did bad.

AFAIK, the earliest known in-print reference to Continuous Deployment is the 
HP Labs 2002 paper, Making Web Services that Work. That introduced the concept 
with a focus on automating deployment, staging testing and treating ops 
problems as use cases which engineers could often write tests for, and, 
perhaps, even design their applications to support. "We are exploring extending 
this model to one we term Continuous Deployment —after passing the local test 
suite, a service can be automatically deployed to a public staging server for 
stress and acceptance testing by physically remote calling parties"

At this time, the applications weren't modern "big data" apps as they didn't 
have affordable storage or the tools to schedule work over it. It wasn't that 
the people writing the books and papers looked at big data and said "not for 
us", it just wasn't on their horizons. 1TB was a lot of storage in those days, 
not a high-end SSD.

Otherwise your approach is just another line of defense in saving your job by 
applying an impertinent, incorrect, and outdated skill and tool to a problem.


please be a bit more constructive here; the ASF code of conduct encourages 
empathy and collaboration: https://www.apache.org/foundation/policies/conduct 
Thanks.


Building data products is a very different discipline from that of building 
software.


Which is why we need to consider how to take what are core methodologies for 
software and apply them, and, where appropriate, supersede them with new 
workflows, ideas, technologies. But doing so with an understanding of the 
reasoning behind today's tools and workflows. I'm really interested in how we 
get from experimental notebook code to something usable in production, 
pushing it out, finding the dirty-data problems before it goes live, etc, etc. 
I do think today's tools have been outgrown by the applications we now build, 
and am thinking not so much "which tools to use", but one step further, "what 
are the new tools and techniques to use?".

I look forward to whatever insight people have here.


My genuine advice to everyone in all spheres of activities will be to first 
understand the problem to solve before solving it and definitely before 
selecting the tools to solve it, otherwise you will land up with a bowl of soup 
and fork in hand and argue that CI/ CD is still applicable to building data 
products and data warehousing.


I concur

Regards,
Gourav


-Steve

On Wed, Apr 12, 2017 at 12:42 PM, Steve Loughran 
<ste...@hortonworks.com<mailto:ste...@hortonworks.com>> wrote:

On 11 Apr 2017, at 20:46, Gourav Sengupta 
<gourav.sengu...@gmail.com<mailto:gourav.sengu...@gmail.com>> wrote:

And once again JAVA programmers are trying to solve a data analytics and data 
warehousing problem using programming paradigms. It is genuinely a pain to see 
this happen.



While I'm happy to be faulted for treating things as software processes, having 
a fully automated mechanism for testing the latest code before production is 
something I'd consider foundational today. This is what "Continuous Deployment" 
was about when it was first conceived. Does it mean you should blindly deploy 
that way? Well, not if you worry about security, but having that review process 
and then a final manual "deploy" button can address that.

Cloud infras let you integrate cluster instantiation into the process, which 
helps you automate things like "stage the deployment in some new VMs, run 
acceptance tests (*), then switch the load balancer over to the new cluster, 
being ready to switch back if you need". I've not tried that with streaming apps 
though; I don't know how to do it there. Booting the new cluster off checkpointed 
state requires deserialization to work, which can't be guaranteed if you are 
changing the objects which get serialized.
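
As a rough illustration of the "stage, test, switch, be ready to switch back" flow, here is a minimal Python sketch. Everything in it is hypothetical: the `LoadBalancer` class, `staged_deploy` and the cluster names are stand-ins invented for the example, not any cloud provider's API, and the acceptance tests are reduced to a callable returning True or False.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoadBalancer:
    """Hypothetical stand-in for whatever routes traffic to a cluster."""
    active: str
    history: List[str] = field(default_factory=list)

    def switch_to(self, cluster: str) -> None:
        # Remember the previous cluster so we can switch back later.
        self.history.append(self.active)
        self.active = cluster

    def roll_back(self) -> None:
        # Return traffic to the most recent previous cluster, if any.
        if self.history:
            self.active = self.history.pop()


def staged_deploy(lb: LoadBalancer, new_cluster: str,
                  acceptance_tests: Callable[[str], bool]) -> bool:
    """Switch traffic to new_cluster only if its acceptance tests pass."""
    if not acceptance_tests(new_cluster):
        return False  # the old cluster keeps serving
    lb.switch_to(new_cluster)
    return True
```

The old cluster stays in `history` after a successful switch, so `lb.roll_back()` is the "switch back if you need" step. This says nothing about the streaming case, where restoring checkpointed state is the hard part.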

I'd argue then, it's not a problem which has already been solved by data 
analytics/warehousing, though if you've got pointers there, I'd be grateful. 
Always good to see work by others. Indeed, the telecoms industry have led the 
way in testing and HA deployment: if you look at Erlang you can see a system 
designed with hot upgrades in mind, the way java code "add a JAR to a web 
server" never was.

-Steve


(*) do always make sure this is the test cluster with a snapshot of test data, 
not production machines/data. There are always horror stories there.


Regards,
Gourav

On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin 
<hussam.ela...@gmail.com<mailto:hussam.ela...@gmail.com>> wrote:
Hi Steve


Thanks for the detailed response, I think this problem doesn't have an industry 
standard solution as of yet and I am sure a lot of people would benefit from 
the discussion

I realise now what you are saying so thanks for clarifying, that said let me 
try and explain how we approached the problem

There are 2 problems you highlighted: the first is moving the code from SCM to 
prod, and the other is ensuring the data your code uses is correct (using the 
latest data from prod).


"how do you get your code from SCM into production?"

Our pipeline currently runs via Airflow, with our DAGs in S3. As for how we get 
our code from SCM to production:

1) Jenkins build that builds our spark applications and runs tests
2) Once the first build is successful we trigger another build to copy the dags 
to an s3 folder

We then routinely sync this folder to the local airflow dags folder every X 
amount of mins
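
The thread does not say how the periodic sync is implemented, so the following is only a sketch of the decision a one-way sync has to make. It assumes each side can be listed as a mapping of file name to content checksum; `plan_sync` and the file names are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_sync(source: Dict[str, str],
              target: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (to_copy, to_delete) so that target ends up mirroring source.

    Both arguments map a file name to a checksum of its contents.
    """
    # Copy anything missing from the target or whose checksum differs.
    to_copy = sorted(name for name, digest in source.items()
                     if target.get(name) != digest)
    # Delete anything in the target that no longer exists in the source.
    to_delete = sorted(name for name in target if name not in source)
    return to_copy, to_delete
```

Here `source` would be the S3 folder listing and `target` the local Airflow DAGs folder; running the plan every few minutes keeps the scheduler's view in line with the last successful build.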

Re test data
" but what's your strategy for test data: that's always the troublespot."

Our application is using versioning against the data, so we expect the source 
data to be in a certain version and the output data to also be in a certain 
version

We have a test resources folder that we have following the same convention of 
versioning - this is the data that our application tests use - to ensure that 
the data is in the correct format

so for example if we have Table X with version 1 that depends on data from 
Table A and B also version 1, we run our spark application then ensure the 
transformed table X has the correct columns and row values

Then when we have a new version 2 of the source data or add a new column in 
Table X (version 2), we generate a new version of the data and ensure the tests 
are updated

That way we ensure any new version of the data has tests against it
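
To make the versioning convention concrete, here is a small Python sketch. It is not the actual test suite: the column names, the `EXPECTED_COLUMNS` registry and the toy `build_table_x` join (standing in for the Spark application) are all assumptions made up for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]

# Hypothetical registry: (table, version) -> expected column names.
EXPECTED_COLUMNS: Dict[Tuple[str, int], List[str]] = {
    ("A", 1): ["id", "amount"],
    ("B", 1): ["id", "country"],
    ("X", 1): ["id", "amount", "country"],
}


def check_schema(table: str, version: int, rows: List[Row]) -> None:
    """Raise ValueError if any row's columns differ from the expected version."""
    expected = sorted(EXPECTED_COLUMNS[(table, version)])
    for row in rows:
        if sorted(row) != expected:
            raise ValueError(
                f"{table} v{version}: got {sorted(row)}, expected {expected}")


def build_table_x(a_rows: List[Row], b_rows: List[Row]) -> List[Row]:
    """Toy transformation standing in for the Spark job: join A and B on id."""
    countries = {r["id"]: r["country"] for r in b_rows}
    return [{"id": r["id"], "amount": r["amount"], "country": countries[r["id"]]}
            for r in a_rows if r["id"] in countries]
```

A test then checks the version 1 fixtures for A and B with `check_schema`, runs the transformation, and asserts both the schema and the row values of X. Adding version 2 means adding new registry entries and new fixtures, so the old tests keep guarding version 1.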

"I've never seen any good strategy there short of "throw it at a copy of the 
production dataset"."

I agree which is why we have a sample of the production data and version the 
schemas we expect the source and target data to look like.

If people are interested I am happy to write a blog post about it in the hope 
that it helps people build more reliable pipelines


Love to see that.

Kind Regards
Sam
