Beam streaming BigQueryIO pipeline on a Flink cluster misses elements

Kaymak, Tobias Wed, 04 Nov 2020 04:37:57 -0800

Hello,

while investigating potential benefits of switching BigQueryIO from
FILE_LOADS to streaming inserts, I found a potential edge case that might
be related to the way the BigQueryIO is being handled on a Flink cluster:


Flink's task manager are run as pre-emptible instances in a GKE
cluster's node pool. This means they can be terminated any time by Google,
but will be respawned within 5 minutes or so.

As the job manager is being run on a fixed node pool, in theory this means
that a pipeline will be shortly interrupted, but resume as soon as the task
manager is respawned.

Now, with checkpointing and EXACTLY_ONCE processing enabled, comparing the
BigQuery streaming vs. the non streaming inserts showed that the streaming
one was missing a couple of elements, all from the same close timestamp
range.

Checking the GKE logs I saw that one task manager got respawned a couple of
minutes earlier. There were no ERROR messages regarding streaming insert
problems towards BigQuery so my suspicion is that the BigQuery sink somehow
might have lost some records here.

I ran the streaming-inserts pipeline again this morning, and the records
were correctly inserted into BigQuery - none was missed like during the
night.

Any advice for me on how to dig deeper here?

[Beam 2.24.0 / Flink 1.10.2]

Best,
Tobi

Beam streaming BigQueryIO pipeline on a Flink cluster misses elements

Reply via email to