Hi Chris,

Dataflow does not support GPUs at the moment, but this feature is on our radar and we are considering it for future prioritization. Dataflow-on-GKE is also not supported.
Currently, the Dataflow worker pool is homogeneous. In the future, however, resource annotations in the pipeline should be the way to go. As you noted, resource annotation support needs to happen in the Beam SDK. This feature is not tied to a particular functionality (GPUs) or a particular runner (Dataflow), and can be implemented in the Beam codebase.

At the moment, you can try experimenting with the Direct runner on a single machine with a GPU, or try portable runners that use a stand-alone infrastructure, for example the Beam Flink runner with Flink on a Dataproc cluster with GPUs.

Thanks,
Valentyn

On Tue, Oct 1, 2019 at 11:24 AM Chris Roat <[email protected]> wrote:

> While evaluating many tools for a project, I found Beam suits my needs
> quite well from the abstraction point of view. Both the dead-simple way to
> scale up (and even down to single-machine for testing) and the ease of
> moving between different runners are key. Plus, I'm familiar with the
> framework from having used Flume while at Google.
>
> One thing I'd find useful in the implementation are resource hints[1],
> particularly to use GPUs for several parts of the processing. Forgoing
> hints and the ability to run easily on GPUs, I'd be happy to break up my
> pipeline, and just spin up all my machines with GPUs for the sub-pipelines
> that need it.
>
> Some paths I'm considering:
> - Find the easiest way to go from start-cluster-with-cpus (i.e. gcloud
> container clusters ... --accelerator=...) to run-dataflow-on-said-cluster.
> What would that be?
> - Implement --accelerator in PipelineOptions and implement for Dataflow
>
> Thanks for any advice,
> Chris
>
> [1] https://issues.apache.org/jira/browse/BEAM-2085
>
