On Thu, May 24, 2018 at 1:11 PM Carlos Alonso <car...@mrcalonso.com> wrote:

> Hi everyone!!
>
> I'm building a pipeline to store streaming data into BQ and I'm using the
> pattern: Slowly changing lookup cache described here:
> https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1
> to hold and refresh the table schemas (as they may change from time to
> time).
>
> Now I'd like to understand how that is scheduled on a distributed system.
> Who is running that code? One random node? One node but always the same?
> All nodes?
>

GenerateSequence() uses an unbounded source. Like any unbounded source,
it can have a set of 'splits' ('desiredNumSplits' [1] is set by the runtime).
The splits run in parallel, and a typical runtime distributes them across
the workers. They usually stay on a worker unless there is a reason to
redistribute them (autoscaling, unresponsive workers, etc.). From the user
application's point of view, there are no guarantees about affinity.

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L337
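
To make the idea of splits a bit more concrete, here is a toy model in plain
Java. It is not Beam code: the class SplitSketch and its methods are
hypothetical names, and it assumes a simple stride-based scheme in which each
of N splits produces every N-th value of the sequence. The runtime would then
be free to run each split on whichever worker it likes.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    /** One split of a counting sequence: start, start + stride, start + 2*stride, ... */
    static final class Split {
        final long start;
        final long stride;

        Split(long start, long stride) {
            this.start = start;
            this.stride = stride;
        }

        /** The n-th value (0-based) this split would produce. */
        long valueAt(int n) {
            return start + n * stride;
        }
    }

    /** Assumption: splitting is by stride, so the splits never overlap. */
    static List<Split> split(long start, int desiredNumSplits) {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < desiredNumSplits; i++) {
            splits.add(new Split(start + i, desiredNumSplits));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Split> splits = split(0, 3);
        for (int i = 0; i < splits.size(); i++) {
            Split s = splits.get(i);
            System.out.println("split " + i + ": "
                + s.valueAt(0) + ", " + s.valueAt(1) + ", " + s.valueAt(2));
        }
    }
}
```

With three splits, split 0 yields 0, 3, 6, split 1 yields 1, 4, 7 and split 2
yields 2, 5, 8. Nothing in this model ties a split to a particular worker,
which matches the lack of affinity guarantees described above.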


>
> Also, what are the GenerateSequence guarantees in terms of precision? I
> have it configured to generate 1 element every 5 minutes and most of the
> time it works exact, but sometimes it doesn't... Is that expected?
>

Each of the splits mentioned above essentially runs 'advance()' [2] in a
loop. It checks the current wall time to decide whether it needs to emit the
next element. If the value you see is off by a few seconds, it would imply
that 'advance()' was not called by the framework during that time. Runtime
frameworks usually don't provide any hard or soft deadlines for scheduling
work.

[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L426
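
As a rough illustration of why an element can show up a few seconds late,
here is a small simulation in plain Java. It is only a sketch, not the actual
CountingSource code: RateLimitedReader and emitTimes are hypothetical names,
and the clock is faked so the run is deterministic. The reader emits at most
one element per period, and only when something calls advance().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

public class AdvanceSketch {
    /** Hypothetical reader that releases one element per period, judged by wall time. */
    static final class RateLimitedReader {
        private final long periodMillis;
        private final LongSupplier clock;
        private final long firstStarted;
        private long emitted = 0;

        RateLimitedReader(long periodMillis, LongSupplier clock) {
            this.periodMillis = periodMillis;
            this.clock = clock;
            this.firstStarted = clock.getAsLong();
        }

        /** Returns true only if, at the current wall time, another element is due. */
        boolean advance() {
            long elapsed = clock.getAsLong() - firstStarted;
            long due = elapsed / periodMillis + 1; // first element is due at start
            if (emitted < due) {
                emitted++;
                return true;
            }
            return false;
        }
    }

    /** Calls advance() at the given (fake) wall times; returns the times it emitted. */
    static List<Long> emitTimes(long periodMillis, long[] callTimes) {
        long[] now = {callTimes[0]};
        RateLimitedReader reader = new RateLimitedReader(periodMillis, () -> now[0]);
        List<Long> emittedAt = new ArrayList<>();
        for (long t : callTimes) {
            now[0] = t; // the "framework" gets around to this split at time t
            if (reader.advance()) {
                emittedAt.add(t);
            }
        }
        return emittedAt;
    }

    public static void main(String[] args) {
        // 5-minute period; the framework skips the call at 300000 ms and
        // comes back 3 seconds later.
        long[] callTimes = {0, 299_000, 303_000, 600_000};
        System.out.println(emitTimes(300_000, callTimes));
    }
}
```

This prints [0, 303000, 600000]: the second element was due at 300000 ms but
is only emitted at 303000 ms, because nothing called advance() in between.
The third element is on time again, since in this sketch the schedule is
computed from the start time rather than from the previous emission.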


> Regards
>
