On Thu, May 24, 2018 at 1:11 PM Carlos Alonso <car...@mrcalonso.com> wrote:
> Hi everyone!!
>
> I'm building a pipeline to store streaming data into BQ and I'm using the
> pattern: Slowly changing lookup cache described here:
> https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1
> to hold and refresh the table schemas (as they may change from time to time).
>
> Now I'd like to understand how that is scheduled on a distributed system.
> Who is running that code? One random node? One node but always the same?
> All nodes?
>

GenerateSequence() uses an unbounded source. Like any unbounded source, it has a set of 'splits' ('desiredNumSplits' [1] is set by the runtime). The splits run in parallel, and a typical runtime distributes them across the workers. They usually stay on the same worker unless there is a reason to redistribute them (autoscaling, unresponsive workers, etc.). From the user application's point of view, there are no guarantees about affinity.

> Also, what are the GenerateSequence guarantees in terms of precision? I
> have it configured to generate 1 element every 5 minutes and most of the
> time it works exact, but sometimes it doesn't... Is that expected?
>

Each of the splits mentioned above essentially runs 'advance()' [2] in a loop. It checks the current wall-clock time to decide whether it needs to emit the next element. If the values you see are off by a few seconds, that implies the framework did not call 'advance()' during that time. Runtime frameworks usually don't provide any hard or soft deadlines for scheduling work.

[1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L337
[2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L426

> Regards
>
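To make the timing point concrete, here is a minimal sketch of the 'advance()' loop described above. It is a simplified model and not the actual CountingSource code: the class name PacedCounterSketch, the emissionTimes helper, the injected clock, and the call times in main are all assumptions made up for illustration. It only shows why an element is emitted late when the runner does not call 'advance()' at the moment the element becomes due.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for one split of a rate-limited counting source (not Beam code). */
class PacedCounterSketch {
    private final long startMillis;
    private final long periodMillis;
    private final LongSupplier clock; // injected so the simulation is deterministic
    private long next = 0;            // index of the next element to emit

    PacedCounterSketch(long startMillis, long periodMillis, LongSupplier clock) {
        this.startMillis = startMillis;
        this.periodMillis = periodMillis;
        this.clock = clock;
    }

    /** Emits the next element only if the wall clock says it is already due. */
    boolean advance() {
        long dueAt = startMillis + next * periodMillis;
        if (clock.getAsLong() < dueAt) {
            return false; // nothing to emit yet; the caller will try again later
        }
        next++;
        return true;
    }

    /** Simulates a runner that calls advance() only at the given wall-clock times. */
    static List<Long> emissionTimes(long periodMillis, long[] callTimes) {
        long[] now = {0};
        PacedCounterSketch reader = new PacedCounterSketch(0, periodMillis, () -> now[0]);
        List<Long> emitted = new ArrayList<>();
        for (long t : callTimes) {
            now[0] = t;
            if (reader.advance()) {
                emitted.add(t); // the time at which the element was actually observed
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        long fiveMinutes = 300_000L;
        // Assumed schedule: the runner happens to skip the exact 600000 ms mark.
        long[] calls = {0, 100_000, 300_000, 450_000, 607_000, 900_000};
        System.out.println(emissionTimes(fiveMinutes, calls));
        // prints [0, 300000, 607000, 900000]
    }
}
```

In this simulated run the third element is due at 600000 ms but is only emitted at 607000 ms, because no call to 'advance()' happened in between. The following element is still emitted on schedule at 900000 ms, since each due time is computed from the start time rather than from the previous emission.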