Hi, some responses inline:

On Thu, Jan 30, 2020 at 2:52 PM Xander Song <[email protected]> wrote:

> Hello,
>
> I am new to the Apache ecosystem and am attempting to use Beam to build a
> horizontally scalable pipeline for feature extraction from video data. The
> extraction process for certain features can be accelerated using GPUs,
> while other features require only a CPU to compute. I have several
> questions, listed in order of decreasing priority:
>
>    1. Can I run a Beam pipeline with GPUs? (as far as I can tell, Google
>    Cloud Dataflow does not currently support this option)
>
There was a thread on user[1] that discusses this. I think the status quo
hasn't changed much since then.

>
>    2. Is it possible to achieve this functionality using Spark or Flink
>    as a runner?
>
It should be possible, although I have not tried it. It is possible to run
Beam pipelines on Flink/Spark clusters, and it is possible to create a Flink
cluster with GPUs. Beam custom containers [4] provide a way to manage the
required GPU dependencies (CUDA toolkit, cuDNN, etc.). Google Cloud Dataproc
offers a way to create managed Flink/Spark clusters and to attach GPUs to
Dataproc clusters [2].
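Concretely, launching a Python pipeline against a portable Flink runner with
a custom container comes down to a handful of pipeline options. A sketch
(the job endpoint and image name below are placeholders, not values from
this thread):

```python
# Sketch only: the flags a Beam Python pipeline would pass to the portable
# Flink runner so that workers run inside a custom container bundling GPU
# dependencies. "localhost:8099" and the image name are placeholders.

def portable_flink_args(job_endpoint, container_image):
    """Build pipeline options for Beam's portable Flink runner."""
    return [
        "--runner=PortableRunner",
        # Address of the Flink job server accepting Beam job submissions.
        f"--job_endpoint={job_endpoint}",
        # A DOCKER environment makes each worker run the custom image that
        # bundles the CUDA toolkit, cuDNN, etc.
        "--environment_type=DOCKER",
        f"--environment_config={container_image}",
    ]


args = portable_flink_args("localhost:8099", "gcr.io/my-project/beam-gpu:latest")
```

These would then be passed to the pipeline's `PipelineOptions`; the custom
containers page [4] covers how to build and publish the image itself.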

>
>    3. Is it possible to mix hardware types in a Beam pipeline (e.g., to
>    have certain features extracted by CPUs and others extracted by GPUs), or
>    does this go against the Beam paradigm of abstracting away such details?
>
It does not go against the paradigm, but support for annotating parts of
Beam pipelines with hardware requirements has not been implemented yet [3].

>
>    4. Do the Spark and Flink runners have support for auto-scaling like
>    Google Cloud Dataflow?
>
Support for autoscaling would have to be implemented in Flink/Spark itself,
not so much in the Beam Flink/Spark runners. To my knowledge, the answer is
no.

>
>    5. What are relevant considerations when selecting between Spark vs.
>    Flink as a runner?
>
Language support, pipeline type (batch/streaming), and runner capabilities
are all relevant considerations. There are two Spark runners: portable
(Python, Java, Go; supports custom containers) and non-portable (Java only).
I think you'd want to go with a portable runner for your use case. Among the
portable runners, I think Flink had the most capabilities implemented as of
last year; see [5] [6], but that information may be out of date.

> Any guidance, resources, or tips are appreciated. Thank you in advance!
> -Xander
>

[1]
https://lists.apache.org/thread.html/00c1b5b44204b5c7f33bdae53da20d84739e1f80c3c286db8a9151b6%40%3Cuser.beam.apache.org%3E
[2] https://cloud.google.com/dataproc/docs/release-notes#September_24_2019
[3] https://issues.apache.org/jira/browse/BEAM-2085
[4] https://beam.apache.org/documentation/runtime/environments/
[5] https://s.apache.org/apache-beam-portability-support-table
[6] https://beam.apache.org/documentation/runners/capability-matrix/
