Hi,

Not sure if this is useful for your use case but as you are using BQ with a
changing schema the following may also be a interesting read ...

https://cloud.google.com/blog/big-data/2018/02/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix

Cheers

Reza



On Fri, May 25, 2018, 5:50 AM Raghu Angadi <rang...@google.com> wrote:

>
> On Thu, May 24, 2018 at 1:11 PM Carlos Alonso <car...@mrcalonso.com>
> wrote:
>
>> Hi everyone!!
>>
>> I'm building a pipeline to store streaming data into BQ and I'm using the
>> pattern: Slowly changing lookup cache described here:
>> https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1
>>  to
>> hold and refresh the table schemas (as they may change from time to time).
>>
>> Now I'd like to understand how that is scheduled on a distributed system.
>> Who is running that code? One random node? One node but always the same?
>> All nodes?
>>
>
> GenerateSequence() is uses an unbounded source. Like any unbounded source,
> it can has a set of 'splits' ('desiredNumSplits' [1] is set by runtime).
> Each of the splits run in parallel.. a typical runtime distributes these
> across the workers. Typically they stay on a worker unless there is a
> reason to redistribute (autoscaling, workers unresponsive etc). W.r.t. user
> application there are no guarantees about affinity.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L337
>
>
>>
>> Also, what are the GenerateSequence guarantees in terms of precision? I
>> have it configured to generate 1 element every 5 minutes and most of the
>> time it works exact, but sometimes it doesn't... Is that expected?
>>
>
> Each of the splits mentioned above essentially runs 'advance() [2]' in a
> loop. It check current walltime to decide if it needs to emit next element.
> If the value you see off by a few seconds, it would imply 'advance()' was
> not called during that time by the framework. Runtime frameworks usually
> don't provide any hard or soft deadlines for scheduling work.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L337
>
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L426
>
>
>> Regards
>>
>

Reply via email to