Hi Lewis, hi Markus,

> snappy compression, which is a massive improvement for large data shuffling jobs

Yes, I can confirm this. It's also worth considering zstd for all data that is
kept longer. We use it for a 25-billion-record CrawlDB: it's almost as fast as
snappy (both compression and decompression), and you get a compression ratio
not far from that of bzip2 (which is very slow).
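
As an illustration, a minimal sketch of how that could be wired up on the
command line for a Hadoop job that parses generic options; the jar, class,
and paths are placeholders:

    # hypothetical job invocation: zstd for the (long-lived) job output,
    # snappy for the intermediate map output (shuffle)
    hadoop jar my-job.jar MyJob \
      -D mapreduce.output.fileoutputformat.compress=true \
      -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec \
      -D mapreduce.map.output.compress=true \
      -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
      input/ output/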

> worker/task nodes were run on spot instances

We do the same. However, EMR is priced at 25% of the on-demand EC2 instance
price. As spot prices are usually 50-70% of the on-demand price, the EMR fee
adds a non-trivial share to the total cost.
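
To make that concrete, a back-of-the-envelope example with made-up round
numbers:

    on-demand price:            $1.00/h
    spot price (50-70%):        $0.50 - $0.70/h
    EMR fee (25% of on-demand): $0.25/h
    => the fee adds 36-50% on top of the spot price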

> backup logic

Yes. We checkpoint the output of every step to S3, unless the step runs for
less than one hour.
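
For reference, a sketch of how such a checkpoint could be written with distcp
(bucket and paths are placeholders, and the s3a connector is assumed to be
configured):

    # hypothetical: copy a step's output from HDFS to S3 as a checkpoint
    hadoop distcp \
      hdfs:///crawl/step-output/20240601 \
      s3a://my-backup-bucket/checkpoints/step-output/20240601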

> using Terraform or AWS CloudFormation

We use a shell script to bootstrap the instances.
Some parts (the VPC setup) have been added/rewritten using CloudFormation.
The long-term plan is to use more templates and also to bake a base machine
image to speed up the bootstrapping.
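
Rolling out such a template is a one-liner; a sketch with placeholder names:

    # hypothetical: deploy the VPC setup as a CloudFormation stack
    aws cloudformation deploy \
      --template-file vpc-setup.yaml \
      --stack-name crawler-vpc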

> ARM support

ARM instances (including spot instances) often offer a better price/CPU ratio.
We've already switched to ARM for a couple of services/tasks, and the effort
is minimal: choose the right base image; installing and running Java or Python
workflows does not change. But I haven't tried Hadoop on ARM yet.
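
For the base-image part, a sketch using AWS's public SSM parameters (here the
one for Amazon Linux 2 on arm64; the variable name is made up):

    # hypothetical: resolve the latest arm64 Amazon Linux 2 AMI
    AMI_ID=$(aws ssm get-parameter \
      --name /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-arm64-gp2 \
      --query Parameter.Value --output text)
    echo "Using AMI: $AMI_ID"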

Thanks for sharing your experiences! I'll keep you updated about our decisions
and progress.

Best,
Sebastian
