Hi Lewis, hi Markus,

> snappy compression, which is a massive improvement for large data shuffling jobs

Yes, I can confirm this. Also: it's worth considering zstd for any data kept longer-term. We use it for a 25-billion-record CrawlDB: it's almost as fast as snappy (both compression and decompression), and you get a compression ratio not far from that of bzip2 (which is very slow).
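For illustration, a minimal sketch of combining the two codecs in a Hadoop job configuration: snappy for the intermediate (shuffle) data, zstd for job output that is kept. The property names and codec classes are standard Hadoop (ZStandardCodec is available since 2.9); the wrapper class itself is hypothetical, not our actual setup:

    import org.apache.hadoop.conf.Configuration;

    public class CodecSetup {
        // Sketch only: fast codec for shuffle data, higher-ratio codec
        // for output that is kept long-term.
        public static Configuration withCodecs(Configuration conf) {
            // snappy for intermediate map output (speed matters most here)
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.set("mapreduce.map.output.compress.codec",
                     "org.apache.hadoop.io.compress.SnappyCodec");
            // zstd for the final job output (better ratio, still fast)
            conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
            conf.set("mapreduce.output.fileoutputformat.compress.codec",
                     "org.apache.hadoop.io.compress.ZStandardCodec");
            return conf;
        }
    }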
> worker/task nodes were run on spot instances

We do the same. However, the EMR fee adds about 25% of the on-demand EC2 instance price. As spot prices are usually 50-70% of the on-demand price, the EMR fee becomes a non-trivial part of the total: roughly a quarter to a third of what we pay per instance.

> backup logic

Yes. We checkpoint the output of every step to S3, unless the step runs in less than one hour.

> using Terraform or AWS CloudFormation

We use a shell script to bootstrap the instances. Some parts have been added or rewritten using CloudFormation (the VPC setup). The long-term plan is to use more templates and also to bake a base machine image to speed up the bootstrapping.

> ARM support

ARM instances (including spot instances) often offer a better price/CPU ratio. We've already switched to ARM for a couple of services/tasks - the effort is minimal: choose the right base image; installing and running Java or Python workflows does not change. But I haven't tried Hadoop yet.

Thanks for sharing your experiences, I'll keep you updated about our decisions and progress!

Best,
Sebastian