Ah, great point about the Kafka partition skew. I can see how the problem domain is kind of stacked against sub-second latency expectations. Thanks again.
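For concreteness, here is roughly what my "aggressive" policy looks like: a minimal sketch against KafkaIO's TimestampPolicy, not my exact code. extractEventTime below is a stand-in for however the event timestamp gets pulled out of the record (here it just assumes the value is an epoch-millis string).

import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Instant;

// Per-partition watermark == max event timestamp seen so far; no
// out-of-order allowance, hence "aggressive".
class AggressiveTimestampPolicy extends TimestampPolicy<String, String> {
  private Instant maxEventTime = BoundedWindow.TIMESTAMP_MIN_VALUE;

  @Override
  public Instant getTimestampForRecord(
      PartitionContext ctx, KafkaRecord<String, String> record) {
    Instant eventTime = extractEventTime(record);
    if (eventTime.isAfter(maxEventTime)) {
      maxEventTime = eventTime;
    }
    return eventTime;
  }

  @Override
  public Instant getWatermark(PartitionContext ctx) {
    // Watermark advances as fast as the newest record on this partition.
    return maxEventTime;
  }

  private Instant extractEventTime(KafkaRecord<String, String> record) {
    // Stand-in for my real payload parsing.
    return new Instant(Long.parseLong(record.getKV().getValue()));
  }
}

Wired in via (previousWatermark ignored for brevity):

  KafkaIO.<String, String>read()
      .withTimestampPolicyFactory(
          (topicPartition, previousWatermark) -> new AggressiveTimestampPolicy())

Since the runner still takes the minimum watermark across partitions, one lagging partition holds everything back, which is exactly the skew effect you described.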
On Fri, Dec 13, 2019 at 3:48 PM Robert Bradshaw <rober...@google.com> wrote:
> On Fri, Dec 13, 2019 at 1:20 PM Aaron Dixon <atdi...@gmail.com> wrote:
> >
> > Thank you Robert, this is really helpful context and confirmation.
> >
> > > You mention having an "aggressive" watermark--could you clarify what
> > > you mean by this?
> >
> > I'm using KafkaIO and I customize the watermark to follow my message
> > event timestamps as they come (instead of having it lag behind the
> > stream to allow for even slightly out-of-order events). So I call this
> > "aggressive".
>
> What about skew between partitions? This would show up as a difference
> between the oldest and newest element being concurrently processed,
> and would hold back the watermark.
>
> > I also have a custom window that closes precisely on certain event
> > markers in the stream -- and I need these windows to trigger as fast
> > as possible (I was hoping for sub-second here, but I suspected that
> > was not the focus of runners, as you're saying, so I think I can live
> > w/o this).
>
> Correct.
>
> Essentially, state- and window-using operations are aggregating
> operations, and the latency is bounded by how close together every set
> of elements that needs to be aggregated arrives. This kind of
> synchronization is difficult to do in a distributed system, and
> watermarks are an attempt to accurately track the minimum time needed
> to wait to ensure all events have arrived and a result can be emitted.
>
> > > Certainly all-in-process runners would likely take the cake here
> >
> > Besides DirectRunner, which as I understand it is primarily for
> > testing, what are other in-process runners?
>
> Flink and Spark have (test) all-in-process modes too.
>
> > On Fri, Dec 13, 2019 at 1:45 PM Robert Bradshaw <rober...@google.com> wrote:
> >>
> >> In general, sub-second latencies are difficult because one must wait
> >> for the watermark to catch up before actually firing. This would
> >> require the oldest item in flight across all machines to have almost
> >> exactly the same timestamp as the newest. Furthermore, most sources
> >> cannot provide sub-second watermarks. You mention having an
> >> "aggressive" watermark--could you clarify what you mean by this? There
> >> are also (generally small) latencies involved with persisting state
> >> and then firing timers, and windowing is built on the same mechanisms.
> >>
> >> I am not aware of latency benchmarks for various runners--in my
> >> experience most people are interested in high throughput at
> >> O(second-minute) latency. There's nothing in the Beam model that
> >> prevents sub-second latencies, but I don't think this has been pushed
> >> very far on most runners. Gathering such data would certainly be
> >> interesting. (Certainly all-in-process runners would likely take the
> >> cake here, but it'd be interesting to see how it degrades as one adds
> >> more machines.)
> >>
> >> On Fri, Dec 13, 2019 at 10:58 AM Aaron Dixon <atdi...@gmail.com> wrote:
> >> >
> >> > I've been building pipelines and benchmarking Beam jobs in Dataflow.
> >> >
> >> > Without windowing, latencies look pretty good (reliably sub-second)
> >> > from ingest to sink.
> >> >
> >> > Once I introduce windowing, even with an aggressive watermark, it
> >> > seems to take at least a second (often multiple seconds) to see a
> >> > window fire.
> >> >
> >> > The same appears true for setting event-time timers in the State
> >> > API; there seems to be a delay between setting a timer at the
> >> > current event time and when the timer callback fires.
> >> >
> >> > Are there any good/canonical latency benchmarks/reports for
> >> > different runners?
> >> >
> >> > My next move may be to evaluate Flink in terms of these latencies,
> >> > but curious if I should be trying to get sub-second latencies out
> >> > of Beam (windowing) at all?
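P.S. For completeness, the event-time timer pattern I was measuring is roughly the following -- a sketch, not my exact pipeline (timers require keyed input, hence the KV):

import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Instant;

class TimerLatencyFn extends DoFn<KV<String, String>, String> {
  @TimerId("fire")
  private final TimerSpec fireSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(ProcessContext ctx, @TimerId("fire") Timer fire) {
    // Set the timer to the element's own event timestamp -- "now" in
    // event time. It still cannot fire until the watermark reaches it.
    fire.set(ctx.timestamp());
  }

  @OnTimer("fire")
  public void onFire(OnTimerContext ctx) {
    // The wall-clock gap between set() and this callback is the latency
    // I've been observing (a second or more).
    ctx.output("fired for event time " + ctx.timestamp() + " at " + Instant.now());
  }
}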