Hi, Not sure if this is useful for your use case but as you are using BQ with a changing schema the following may also be a interesting read ...
https://cloud.google.com/blog/big-data/2018/02/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix Cheers Reza On Fri, May 25, 2018, 5:50 AM Raghu Angadi <rang...@google.com> wrote: > > On Thu, May 24, 2018 at 1:11 PM Carlos Alonso <car...@mrcalonso.com> > wrote: > >> Hi everyone!! >> >> I'm building a pipeline to store streaming data into BQ and I'm using the >> pattern: Slowly changing lookup cache described here: >> https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1 >> to >> hold and refresh the table schemas (as they may change from time to time). >> >> Now I'd like to understand how that is scheduled on a distributed system. >> Who is running that code? One random node? One node but always the same? >> All nodes? >> > > GenerateSequence() is uses an unbounded source. Like any unbounded source, > it can has a set of 'splits' ('desiredNumSplits' [1] is set by runtime). > Each of the splits run in parallel.. a typical runtime distributes these > across the workers. Typically they stay on a worker unless there is a > reason to redistribute (autoscaling, workers unresponsive etc). W.r.t. user > application there are no guarantees about affinity. > > [1] > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L337 > > >> >> Also, what are the GenerateSequence guarantees in terms of precision? I >> have it configured to generate 1 element every 5 minutes and most of the >> time it works exact, but sometimes it doesn't... Is that expected? >> > > Each of the splits mentioned above essentially runs 'advance() [2]' in a > loop. It check current walltime to decide if it needs to emit next element. > If the value you see off by a few seconds, it would imply 'advance()' was > not called during that time by the framework. Runtime frameworks usually > don't provide any hard or soft deadlines for scheduling work. > > [1] > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L337 > > [2] > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L426 > > >> Regards >> >