Beam users,

We're attempting to write a Java pipeline that uses Count.perKey() to
collect event counts, and then flush those to an HTTP API every ten minutes
based on processing time.

We've tried expressing this using GlobalWindows with an AfterProcessingTime
trigger, but we find that when we drain the pipeline we end up with entries
in the droppedDueToLateness metric. This was initially surprising, but may
be line line with documented behavior [0]:

> When you issue the Drain command, Dataflow immediately closes any
in-process windows and fires all triggers. The system does not wait for any
outstanding time-based windows to finish. Dataflow causes open windows to
close by advancing the system watermark to infinity

Perhaps advancing watermark to infinity has no effect on GlobalWindows, so
we attempted to get around this by using a fixed but arbitrarily-long
window:

    FixedWindows.of(Duration.standardDays(36500))

The first few tests with this configuration came back clean, but the third
test again showed droppedDueToLateness after calling Drain. You can see
this current configuration in [1].

Is there a pattern for reliably flushing on Drain when doing processing
time-based aggregates like this?

[0]
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#effects_of_draining_a_job
[1]
https://github.com/mozilla/gcp-ingestion/pull/1689/files#diff-1d75ce2cbda625465d5971a83d842dd35e2eaded2c2dd2b6c7d0d7cdfd459115R58-R71

Reply via email to