On Fri, Dec 13, 2019 at 1:20 PM Aaron Dixon <[email protected]> wrote: > > Thank you Robert, this is really helpful context and confirmation. > > > You mention having an "agressive" watermark--could you clarify what you > > mean by this? > > I'm using KafkaIO and I customize the watermark to follow my message event > timestamps as they come (instead of having it lag behind the stream to allow > for any even slightly out-of-order events.) So I call this "aggressive".
What about skew between partitions? This would show up as a difference between the oldest and newest element being concurrently processed, and would hold back the watermark. > I also have a custom window that closes precisely on certain event markers in > the stream -- and I need these windows to trigger as fast as possible (was > hoping for sub-second here, but I suspected that was not the focus of runners > as you're saying so I think I can live w/o this). Correct. Essentially, state and window-using operations are aggregating operations, and the latency is bound by how simultaneously every set of elements that need to be aggregated arrive together. This kind of synchronization is difficult to do in a distributed system, and watermarks are an attempt to accurately track the minimum time needed to wait to ensure all events have arrived and a result can be emitted. > > Certainly all-in-process runners would likely take the cake here > > Besides DirectRunner which as I understand it is primarily for testing, what > are other in-process runners? Flink and Spark have (test) all-in-process modes too. > > > > > > > > > > > On Fri, Dec 13, 2019 at 1:45 PM Robert Bradshaw <[email protected]> wrote: >> >> In general, sub-second latencies are difficult because one must wait >> for the watermark to catch up before actually firing. This would >> require the oldest item in flight across all machines to be almost >> exactly the same timestamp as the newest. Furthermore most sources >> cannot provide sub-second watermarks. You mention having an >> "agressive" watermark--could you clarify what you mean by this? There >> are also (generally small) latencies involved with persisting state >> and then firing timers, and windowing is built on the same mechanisms. >> >> I am not aware of latency benchmarks for various runners--in my >> experience most people are interested in high throughput at >> O(second-minute) latency. There's nothing in the Beam model that >> prevents sub-second latencies, but I don't think this has been pushed >> very far on most runners. Gathering such data would certainly be >> interesting. (Certainly all-in-process runners would likely take the >> cake here, but it'd be interesting to see how it degrades as one adds >> more machines.) >> >> On Fri, Dec 13, 2019 at 10:58 AM Aaron Dixon <[email protected]> wrote: >> > >> > I've been building pipelines and benchmarking Beam jobs in Dataflow. >> > >> > Without windowing, latencies look pretty good (reliably sub-second) from >> > ingest to sink. >> > >> > Once I introduce windowing even with aggressive watermark it seems it >> > takes at least a second (often multiple seconds) to see a window fire. >> > >> > Same appears true for setting events timers in the State API; there seems >> > a delay between setting a timer at current event time and the timer >> > callback fires. >> > >> > Are there any good/canonical latency benchmarks/reports for different >> > runners? >> > >> > My next move may be to evaluate Flink in terms of these latencies, but >> > curious if I should be trying to get sub-second latencies out of Beam >> > (windowing) at all? >> >
