Re: Understanding GenerateSequence and SideInputs

Lukasz Cwik Thu, 24 May 2018 14:22:02 -0700

The runner is responsible for scheduling the work anywhere it chooses. It
can be the same node all the time or different nodes.

There is no precision guarantee on the upper bound of the delay between
elements (only on the lower bound): the withRate method states that it will
"generate at most a given number of elements per a given period". This is
because a DoFn can't control whether
and when the runner decides to schedule the work. A runner will attempt to
honor any processing commitments that it knows about such as timers but if
the runner has too much work and too few resources it may fall behind or
decide to group small work units into larger work units for performance
reasons.
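
To make the "lower bound only" point concrete, here is a toy model in plain
Python. It is not Beam code and not how any runner is implemented; the
function names and the delay values are made up for illustration. It only
shows that an element is never produced before its slot, while scheduling
delay can push it arbitrarily late.

```python
from typing import List


def earliest_emit_times(num_elements: int, period_s: float) -> List[float]:
    """Earliest allowed emission time of each element under a cap of
    one element per period_s seconds (hypothetical helper)."""
    return [i * period_s for i in range(num_elements)]


def actual_emit_times(earliest: List[float], delays_s: List[float]) -> List[float]:
    """Actual emission times when the runner gets to each element late by
    the given non-negative scheduling delay (hypothetical helper)."""
    if len(earliest) != len(delays_s):
        raise ValueError("need exactly one delay per element")
    if any(d < 0 for d in delays_s):
        raise ValueError("a scheduling delay cannot be negative")
    return [due + delay for due, delay in zip(earliest, delays_s)]


# One element every 5 minutes (300 s); the third one is scheduled 42 s late.
earliest = earliest_emit_times(4, 300.0)
actual = actual_emit_times(earliest, [0.0, 0.0, 42.0, 0.0])
lateness = [a - e for a, e in zip(actual, earliest)]

for due, got, late in zip(earliest, actual, lateness):
    print(f"due at {due:6.1f}s, emitted at {got:6.1f}s, late by {late:4.1f}s")
```

In this model every element is emitted at or after its slot, so the rate cap
holds, but nothing limits how large the lateness can get when the runner is
busy.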

On Thu, May 24, 2018 at 1:11 PM Carlos Alonso <car...@mrcalonso.com> wrote:

> Hi everyone!!
>
> I'm building a pipeline to store streaming data into BQ and I'm using the
> pattern: Slowly changing lookup cache described here:
> https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1
> to hold and refresh the table schemas (as they may change from time to
> time).
>
> Now I'd like to understand how that is scheduled on a distributed system.
> Who is running that code? One random node? One node but always the same?
> All nodes?
>
> Also, what are the GenerateSequence guarantees in terms of precision? I
> have it configured to generate 1 element every 5 minutes and most of the
> time it works exactly, but sometimes it doesn't... Is that expected?
>
> Regards
>
