There is no error in Flink or Beam. Running the two pipelines in parallel
just revealed that we had late data we were not aware of before.
As we had a strict drop policy in the second pipeline (withAllowedLateness()),
these late records were dropped.
The second pipeline showed us that our mental model was wrong: after
digging into the source and looking at percentiles, we found that on some
days we receive *really late* data, which we need to account for upstream.
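
For anyone who wants to run a similar check, here is a minimal sketch of
the kind of percentile analysis described above. It is not the code we
actually used. The function name lateness_report and the record layout
(pairs of event time and arrival time) are hypothetical, and the delay
between arrival and event time is only a rough proxy: Beam decides
lateness relative to the watermark, so the "over_limit" count is an
estimate, not the exact number of records a pipeline would drop.

```python
from datetime import datetime, timedelta
from statistics import quantiles


def lateness_report(records, allowed_lateness):
    """Summarize how late records arrive.

    records: iterable of (event_time, arrival_time) datetime pairs
             (hypothetical layout, adapt to your source).
    allowed_lateness: timedelta used as the comparison threshold.
    """
    # Delay in seconds between event time and arrival; negative values
    # (clock skew) are clamped to zero.
    delays = sorted(
        max((arrival - event).total_seconds(), 0.0)
        for event, arrival in records
    )
    if len(delays) < 2:
        raise ValueError("need at least two records to compute percentiles")

    # 99 cut points: cuts[i - 1] is the i-th percentile.
    cuts = quantiles(delays, n=100, method="inclusive")
    limit = allowed_lateness.total_seconds()

    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": delays[-1],
        # Records whose delay exceeds the threshold (proxy for drops).
        "over_limit": sum(d > limit for d in delays),
    }
```

Comparing "p99" and "max" against the allowed lateness for each day makes
it easy to spot the days on which the tail is much longer than expected.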

Just another awesome finding while using Beam. Kudos to the people who
implemented windowing and triggers. This is amazing!

On Wed, Nov 4, 2020 at 1:36 PM Kaymak, Tobias <[email protected]>
wrote:

> Hello,
>
> while investigating potential benefits of switching BigQueryIO from
> FILE_LOADS to streaming inserts, I found a potential edge case that might
> be related to the way the BigQueryIO is being handled on a Flink cluster:
>
> Flink's task managers run as pre-emptible instances in a GKE
> cluster's node pool. This means they can be terminated at any time by
> Google, but will be respawned within 5 minutes or so.
>
> As the job manager runs on a fixed node pool, in theory this means
> that a pipeline will be briefly interrupted, but will resume as soon as
> the task manager is respawned.
>
> Now, with checkpointing and EXACTLY_ONCE processing enabled, comparing the
> BigQuery streaming inserts with the non-streaming inserts showed that the
> streaming pipeline was missing a couple of elements, all from the same
> narrow timestamp range.
>
> Checking the GKE logs, I saw that one task manager got respawned a couple
> of minutes earlier. There were no ERROR messages regarding streaming insert
> problems towards BigQuery, so my suspicion is that the BigQuery sink somehow
> might have lost some records here.
>
> I ran the streaming-inserts pipeline again this morning, and the records
> were correctly inserted into BigQuery. None were missed, unlike during the
> night.
>
> Any advice for me on how to dig deeper here?
>
> [Beam 2.24.0 / Flink 1.10.2]
>
> Best,
> Tobi
