Hello, while investigating potential benefits of switching BigQueryIO from FILE_LOADS to streaming inserts, I found a potential edge case that might be related to the way the BigQueryIO is being handled on a Flink cluster:
Flink's task manager are run as pre-emptible instances in a GKE cluster's node pool. This means they can be terminated any time by Google, but will be respawned within 5 minutes or so. As the job manager is being run on a fixed node pool, in theory this means that a pipeline will be shortly interrupted, but resume as soon as the task manager is respawned. Now, with checkpointing and EXACTLY_ONCE processing enabled, comparing the BigQuery streaming vs. the non streaming inserts showed that the streaming one was missing a couple of elements, all from the same close timestamp range. Checking the GKE logs I saw that one task manager got respawned a couple of minutes earlier. There were no ERROR messages regarding streaming insert problems towards BigQuery so my suspicion is that the BigQuery sink somehow might have lost some records here. I ran the streaming-inserts pipeline again this morning, and the records were correctly inserted into BigQuery - none was missed like during the night. Any advice for me on how to dig deeper here? [Beam 2.24.0 / Flink 1.10.2] Best, Tobi
