While evaluating tools for a project, I found that Beam suits my needs quite well from an abstraction point of view. Both the dead-simple way to scale up (and even down to a single machine for testing) and the ease of moving between different runners are key. Plus, I'm familiar with the model from having used Flume while at Google.
One thing I'd find useful is resource hints [1], particularly to run several parts of the processing on GPUs. Without hints and the ability to run easily on GPUs, I'd be happy to break up my pipeline and just spin up all my machines with GPUs for the sub-pipelines that need them. Some paths I'm considering:

- Find the easiest way to go from start-cluster-with-gpus (i.e. gcloud container clusters ... --accelerator=...) to run-dataflow-on-said-cluster. What would that be?
- Implement --accelerator in PipelineOptions and add support for it in Dataflow (rough sketch below).

Thanks for any advice,
Chris

[1] https://issues.apache.org/jira/browse/BEAM-2085
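
P.S. For the second option, here is a minimal sketch of what a custom option could look like in the Python SDK. The --accelerator flag and its value format are purely hypothetical (standardizing that is exactly what BEAM-2085 is about); only the custom-options mechanism (_add_argparse_args / view_as) is existing Beam API, and nothing in Dataflow would act on the value yet.

    from apache_beam.options.pipeline_options import PipelineOptions

    class AcceleratorOptions(PipelineOptions):
        """Hypothetical options for requesting GPU workers."""

        @classmethod
        def _add_argparse_args(cls, parser):
            # The value format below is made up for illustration,
            # e.g. "type:nvidia-tesla-k80;count:1".
            parser.add_argument(
                '--accelerator',
                default=None,
                help='Accelerator spec for GPU workers (hypothetical).')

    # The option parses and is visible to any runner that chooses to honor it.
    options = PipelineOptions(['--accelerator=type:nvidia-tesla-k80;count:1'])
    print(options.view_as(AcceleratorOptions).accelerator)

The missing piece is the actual wiring, i.e. having the Dataflow runner translate such an option into GPU-equipped worker pools, which is the part that would need runner support.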
