Ah great point about the Kafka partition skew. I can see how the problem
domain is kind of stacked against sub-second latency expectations. Thanks
again.

On Fri, Dec 13, 2019 at 3:48 PM Robert Bradshaw <rober...@google.com> wrote:

> On Fri, Dec 13, 2019 at 1:20 PM Aaron Dixon <atdi...@gmail.com> wrote:
> >
> > Thank you Robert, this is really helpful context and confirmation.
> >
> > > You mention having an "aggressive" watermark--could you clarify what
> > > you mean by this?
> >
> > I'm using KafkaIO, and I customize the watermark to track my message
> > event timestamps as they arrive (instead of having it lag behind the
> > stream to allow for even slightly out-of-order events). That is why I
> > call it "aggressive".
>
> What about skew between partitions? That would show up as a timestamp
> difference between the oldest and newest elements being processed
> concurrently, and would hold back the watermark.
>
> > I also have a custom window that closes precisely on certain event
> > markers in the stream -- and I need these windows to trigger as fast
> > as possible. (I was hoping for sub-second here, but I suspected that
> > was not the focus of runners, as you're saying, so I think I can live
> > without it.)
>
> Correct.
>
> Essentially, state- and window-using operations are aggregating
> operations, and their latency is bounded by how nearly simultaneously
> all of the elements that must be aggregated together arrive. This kind
> of synchronization is difficult to do in a distributed system, and
> watermarks are an attempt to accurately track the minimum time one must
> wait to ensure all events have arrived and a result can be emitted.
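>
> (If speculative, incomplete results are acceptable, one way to see
> output before the watermark catches up is an early-firing trigger -- a
> minimal sketch, where MyEvent and myCustomWindowFn stand in for your
> own types:
>
>     input.apply(Window.<MyEvent>into(myCustomWindowFn)  // placeholder
>         .triggering(AfterWatermark.pastEndOfWindow()
>             .withEarlyFirings(AfterPane.elementCountAtLeast(1)))
>         .withAllowedLateness(Duration.ZERO)
>         .accumulatingFiredPanes());
>
> The final, complete pane still waits for the watermark, but you get a
> best-effort result per element in the meantime.)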
>
> > > Certainly all-in-process runners would likely take the cake here
> >
> > Besides the DirectRunner, which as I understand it is primarily for
> > testing, what are the other in-process runners?
>
> Flink and Spark have (test) all-in-process modes too.
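>
> For example, my understanding is that the FlinkRunner runs an embedded
> Flink cluster in-process when given a local master -- a sketch,
> assuming beam-runners-flink is on the classpath:
>
>     FlinkPipelineOptions opts =
>         PipelineOptionsFactory.as(FlinkPipelineOptions.class);
>     opts.setRunner(FlinkRunner.class);
>     opts.setFlinkMaster("[local]");  // embedded, in-process execution
>     Pipeline p = Pipeline.create(opts);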
>
> > On Fri, Dec 13, 2019 at 1:45 PM Robert Bradshaw <rober...@google.com>
> > wrote:
> >>
> >> In general, sub-second latencies are difficult because one must wait
> >> for the watermark to catch up before actually firing. This would
> >> require the oldest item in flight across all machines to have almost
> >> exactly the same timestamp as the newest. Furthermore, most sources
> >> cannot provide sub-second watermarks. You mention having an
> >> "aggressive" watermark--could you clarify what you mean by this? There
> >> are also (generally small) latencies involved with persisting state
> >> and then firing timers, and windowing is built on the same mechanisms.
> >>
> >> I am not aware of latency benchmarks for various runners--in my
> >> experience most people are interested in high throughput at
> >> O(second-minute) latency. There's nothing in the Beam model that
> >> prevents sub-second latencies, but I don't think this has been pushed
> >> very far on most runners. Gathering such data would certainly be
> >> interesting. (Certainly all-in-process runners would likely take the
> >> cake here, but it'd be interesting to see how it degrades as one adds
> >> more machines.)
> >>
> >> On Fri, Dec 13, 2019 at 10:58 AM Aaron Dixon <atdi...@gmail.com> wrote:
> >> >
> >> > I've been building pipelines and benchmarking Beam jobs in Dataflow.
> >> >
> >> > Without windowing, latencies look pretty good (reliably sub-second)
> >> > from ingest to sink.
> >> >
> >> > Once I introduce windowing, even with an aggressive watermark, it
> >> > seems to take at least a second (often multiple seconds) to see a
> >> > window fire.
> >> >
> >> > The same appears true for setting event-time timers in the State
> >> > API; there seems to be a delay between setting a timer at the
> >> > current event time and the timer callback firing.
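> >> >
> >> > (For concreteness, roughly what I'm doing -- a minimal sketch where
> >> > MyEvent is a placeholder; timers require a keyed input:
> >> >
> >> >     class FireAsap extends DoFn<KV<String, MyEvent>, String> {
> >> >       @TimerId("asap")
> >> >       private final TimerSpec asap =
> >> >           TimerSpecs.timer(TimeDomain.EVENT_TIME);
> >> >
> >> >       @ProcessElement
> >> >       public void process(
> >> >           ProcessContext c, @TimerId("asap") Timer timer) {
> >> >         // Set the timer to the element's own timestamp; it still
> >> >         // cannot fire until the watermark reaches that instant.
> >> >         timer.set(c.timestamp());
> >> >       }
> >> >
> >> >       @OnTimer("asap")
> >> >       public void onTimer(OnTimerContext c) {
> >> >         c.output("fired at " + c.timestamp());
> >> >       }
> >> >     }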
> >> >
> >> > Are there any good/canonical latency benchmarks/reports for
> >> > different runners?
> >> >
> >> > My next move may be to evaluate Flink in terms of these latencies,
> >> > but I'm curious whether I should be trying to get sub-second
> >> > latencies out of Beam (windowing) at all?
> >> >
>
