Re: Is this a valid usecase for Apache Beam and Google dataflow

2017-06-20 Thread Lukasz Cwik
Take a look at session windows[1]. As long as the messages you post to
Pub/Sub aren't spaced out farther than the session gap duration, they will
all get grouped together.
It seems as though it would be much simpler to just run a separate Apache
Beam job for each internal job you want to process, since you won't have to
deal with potentially late data exceeding the session gap duration.

1: https://cloud.google.com/dataflow/model/windowing#session-windows
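The session-window semantics described above can be modeled in plain Python (this is just a sketch of the grouping rule, not Beam API code; `assign_sessions` and the 10-unit gap are illustrative assumptions):

```python
def assign_sessions(timestamps, gap):
    """Model of session windowing: events are grouped into the same
    session as long as consecutive events are no more than `gap`
    apart; a larger pause starts a new session."""
    sessions = []
    current = []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            sessions.append(current)  # gap exceeded: close the session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# Events 1 unit apart stay in one session with a gap of 10;
# the 30-unit pause before 32 starts a new session.
print(assign_sessions([0, 1, 2, 32, 33], 10))  # → [[0, 1, 2], [32, 33]]
```

This is why the reply warns about late data: an element arriving after the gap has already elapsed would land in a new session rather than the one it logically belongs to.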

On Tue, Jun 20, 2017 at 9:10 AM, Randal Moore wrote:

> Just started looking at Beam this week as a candidate for executing some
> fairly CPU-intensive work.  I am curious if the stream-oriented features of
> Beam are a match for my use case. My user will submit a large number of
> computations to the system (as a "job").  Those computations can be
> expressed as a series of DoFns whose results can be stored (in, for
> example, Google Datastore).
>
> My initial idea was to post the individual messages (anywhere from 1,000 to
> 1,000,000 per job) to a Google Pub/Sub topic that is watched by a Google
> Dataflow job. I can write a prototype that performs the appropriate
> transformation and posts the results.
>
> But...
>
> I cannot find any way to capture the notion of the completion of the
> original "user job".  I want to post to Pub/Sub that all individual
> calculations are complete. I was hoping that I could write a CombineFn that
> would post a message as each job finishes, but Combine needs a window and I
> don't see how to define one.
>
> Is there a way to define a window that corresponds to the user's job?  I
> know exactly which individual computations are part of the user's job (and
> exactly how many).  But all the grouping that I've discovered so far seems
> to be defined at compile time (e.g., number of messages in the window,
> or number of partitions).
>
> Is this the wrong use case for Dataflow/Beam?  Is there a better way to
> express this problem?
>
> Thanks,
> rdm
>
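Since the submitter knows exactly how many computations make up each job, the completion check described in the question can be modeled as simple per-job counting (a plain-Python sketch of the bookkeeping only; `track_completions` and the job IDs are hypothetical names, and in a real pipeline this counting would have to live in some keyed, stateful stage rather than ordinary Python):

```python
def track_completions(expected_counts, finished_events):
    """expected_counts: {job_id: total number of computations in that job}.
    finished_events: iterable of job_ids, one per completed computation.
    Returns job_ids whose final computation has arrived, in arrival order."""
    seen = {}
    completed = []
    for job_id in finished_events:
        seen[job_id] = seen.get(job_id, 0) + 1
        if seen[job_id] == expected_counts[job_id]:
            completed.append(job_id)  # here one would publish "job done"
    return completed

# Job "a" needs 3 results, job "b" needs 2; each completes once its
# expected count is reached, regardless of interleaving.
events = ["a", "b", "a", "a", "b"]
print(track_completions({"a": 3, "b": 2}, events))  # → ['a', 'b']
```

The count-to-expected-total pattern sidesteps the need for a data-driven window: completion is detected by arithmetic on a known total rather than by window boundaries.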

