Hi, some responses inline:

On Thu, Jan 30, 2020 at 2:52 PM Xander Song <[email protected]> wrote:
> Hello,
>
> I am new to the Apache ecosystem and am attempting to use Beam to build a
> horizontally scalable pipeline for feature extraction from video data. The
> extraction process for certain features can be accelerated using GPUs,
> while other features require only a CPU to compute. I have several
> questions, listed in order of decreasing priority:
>
> 1. Can I run a Beam pipeline with GPUs? (as far as I can tell, Google
> Cloud Dataflow does not currently support this option)

There was a thread on the user list [1] that discusses this. I think the status quo hasn't changed much since then.

> 1. Is it possible to achieve this functionality using Spark or Flink
> as a runner?

It should be possible, although I have not tried it. It is possible to run Beam on Flink/Spark clusters, and it is possible to create a Flink cluster with GPUs. Beam custom containers [4] can provide a way to manage the required GPU dependencies (CUDA toolkit, cuDNN, etc.). Google Cloud Dataproc offers a way to create managed Flink/Spark clusters and to attach GPUs to them [2].

> 1. Is it possible to mix hardware types in a Beam pipeline (e.g., to
> have certain features extracted by CPUs and others extracted by GPUs), or
> does this go against the Beam paradigm of abstracting away such details?

It does not go against the paradigm, but support for annotating parts of a Beam pipeline with hardware requirements has not been implemented yet [3].

> 1. Do the Spark and Flink runners have support for auto-scaling like
> Google Cloud Dataflow?

Support for autoscaling would have to be implemented in Flink/Spark itself rather than in the Beam Flink/Spark runners. To my knowledge, the answer is no.

> 1. What are relevant considerations when selecting between Spark vs.
> Flink as a runner?

Language support, pipeline type (batch/streaming), and runner capabilities are all relevant considerations.

There are two Spark runners: portable (Python, Java, Go; supports custom containers) and non-portable (Java only). I think you'd want to go with a portable runner for your use case. Among portable runners, I think Flink had the most capabilities implemented as of last year, see [5] and [6], but the information may be out of date.

> Any guidance, resources, or tips are appreciated. Thank you in advance!
>
> -Xander

[1] https://lists.apache.org/thread.html/00c1b5b44204b5c7f33bdae53da20d84739e1f80c3c286db8a9151b6%40%3Cuser.beam.apache.org%3E
[2] https://cloud.google.com/dataproc/docs/release-notes#September_24_2019
[3] https://issues.apache.org/jira/browse/BEAM-2085
[4] https://beam.apache.org/documentation/runtime/environments/
[5] https://s.apache.org/apache-beam-portability-support-table
[6] https://beam.apache.org/documentation/runners/capability-matrix/
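
P.S. To make the portable Flink runner plus custom container [4] suggestion a bit more concrete, here is a minimal sketch of the pipeline options a Python pipeline might be launched with. I have not tested this on a GPU cluster. The Flink master address and the container image name are placeholders I made up, and the helper function is hypothetical; the image would be one you build yourself with the CUDA dependencies installed.

```python
def portable_flink_args(flink_master, container_image):
    """Build Beam pipeline flags for the portable Flink runner.

    Hypothetical helper for illustration: both arguments are assumptions
    about your own cluster and your own custom SDK container image.
    """
    return [
        "--runner=FlinkRunner",
        # Address of the Flink cluster's REST endpoint (placeholder value below).
        f"--flink_master={flink_master}",
        # Run the SDK harness (your Python code) inside a Docker container.
        "--environment_type=DOCKER",
        # Custom container image carrying the GPU dependencies (placeholder).
        f"--environment_config={container_image}",
    ]


args = portable_flink_args("localhost:8081", "example.com/beam-gpu-sdk:latest")
print("\n".join(args))
```

With Beam installed, the list can be passed to `apache_beam.options.pipeline_options.PipelineOptions(args)` when constructing the pipeline. Whether the containers actually see the GPUs depends on how Docker is set up on the Flink workers, which these options do not cover.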
