Just started looking at Beam this week as a candidate for executing some
fairly CPU-intensive work.  I am curious whether the stream-oriented
features of Beam are a match for my use case. My users will submit a large
number of computations to the system (as a "job").  Those computations can
be expressed as a series of DoFns whose results can be stored (in, for
example, Google Datastore).
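To make the shape of the work concrete, here is a minimal sketch in plain
Python (no Beam dependency) of the per-element processing I have in mind:
each "DoFn" is modeled as an ordinary function, and a dict stands in for
the datastore. The names (square, label, run_job) are made up for
illustration only.

```python
def square(x):
    # First transform in the chain: the CPU-intensive computation.
    return x * x

def label(x):
    # Second transform: shape the result into a storable record.
    return {"result": x}

store = {}  # stands in for Google Datastore

def run_job(job_id, inputs):
    # Apply the transform chain to every computation in the job and
    # persist one record per computation, keyed by (job, index).
    for i, x in enumerate(inputs):
        store[(job_id, i)] = label(square(x))

run_job("job-1", [1, 2, 3])
```

Each element flows through the chain independently, which is what makes
the per-element part a natural fit for DoFns.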

My initial idea was to post the individual messages (anywhere from 1,000
to 1,000,000 per job) to a Google Pub/Sub topic that is watched by a
Google Dataflow job. I can write a prototype that performs the appropriate
transformation and posts the results.

But...

I cannot find any way to capture the notion of the completion of the
original "user job".  I want to post to Pub/Sub when all of the individual
calculations are complete. I was hoping I could write a CombineFn that
would post a message as each job finishes, but Combine needs a window and
I don't see how to define one.
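The counting logic I had hoped a per-job CombineFn would carry looks like
this in plain Python (the four methods mirror the shape of Beam's
CombineFn; the class and its use here are just a stand-in sketch, not
working Beam code):

```python
class CountCompletions:
    # Mirrors the CombineFn contract: build an accumulator, feed it
    # inputs, merge partial accumulators, then extract the final count.
    def create_accumulator(self):
        return 0

    def add_input(self, acc, element):
        return acc + 1

    def merge_accumulators(self, accs):
        return sum(accs)

    def extract_output(self, acc):
        return acc

cf = CountCompletions()

# Two partial accumulators, as if two workers each saw part of the job.
acc1 = cf.create_accumulator()
for e in ["result-1", "result-2"]:
    acc1 = cf.add_input(acc1, e)
acc2 = cf.add_input(cf.create_accumulator(), "result-3")

total = cf.extract_output(cf.merge_accumulators([acc1, acc2]))
```

The logic itself is trivial; my problem is purely that in Beam this
combine has to run inside some window, and I can't see how to scope that
window to one user job.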

Is there a way to define a window that corresponds to the user's job? I
know exactly which individual computations are part of the user's job (and
exactly how many).  But all the grouping that I've discovered so far seems
to be fixed at pipeline-construction time (e.g., the number of messages in
a window, or the number of partitions).
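In other words, what I want the grouping to do is something like the
following plain-Python stand-in: key the results by job id and fire a
completion signal once the known element count for that job has arrived
(expected_counts, publish, and on_result are all hypothetical names, and
the real counts would come from the job submission, not a hard-coded
dict):

```python
expected_counts = {"job-A": 3}  # known exactly at job-submission time
seen = {}
completed = []

def publish(msg):
    # Stands in for posting a message to a Pub/Sub topic.
    completed.append(msg)

def on_result(job_id):
    # Called once per finished computation; fires exactly once per job,
    # when the count of seen results reaches the expected count.
    seen[job_id] = seen.get(job_id, 0) + 1
    if seen[job_id] == expected_counts[job_id]:
        publish(f"{job_id} complete")

for _ in range(3):
    on_result("job-A")
```

Whether that per-key, count-until-N behavior can be expressed as a Beam
window (or some other grouping) is exactly what I'm unsure about.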

Is this the wrong use case for Dataflow/Beam?  Is there a better way to
express this problem?

Thanks,
rdm
