Interesting, thanks for the info! Combiner lifting definitely makes sense here, but as you mentioned, I'm curious how much it helps performance in a streaming pipeline. The blog post you linked is great. I wonder if it is possible to make this information more visible? It's pretty buried in the blog list now, and I'll admit I never even got that far, because there's another post on stateful processing almost directly above it.
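
On the combiner lifting point, here is a rough sketch of the idea in plain Python rather than Beam. The function names, and the modelling of a "bundle" as a list of (key, value) pairs, are assumptions made purely for illustration; this is not how any runner actually implements it. It only shows why pre-combining before the shuffle can reduce the number of records shuffled while leaving the final result unchanged.

```python
from collections import defaultdict


def shuffle_without_lifting(bundles):
    """Every (key, value) element crosses the shuffle; combining happens afterwards."""
    shuffled = [kv for bundle in bundles for kv in bundle]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_with_lifting(bundles):
    """Each bundle is pre-combined per key, so at most one partial sum
    per key per bundle crosses the shuffle."""
    shuffled = []
    for bundle in bundles:
        partial = defaultdict(int)
        for key, value in bundle:
            partial[key] += value
        shuffled.extend(partial.items())
    # Merge the partial sums after the (hypothetical) shuffle.
    totals = defaultdict(int)
    for key, partial_sum in shuffled:
        totals[key] += partial_sum
    return dict(totals), len(shuffled)


# Two hypothetical bundles of keyed integers, combined with a sum.
bundles = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]
plain_totals, plain_count = shuffle_without_lifting(bundles)
lifted_totals, lifted_count = shuffle_with_lifting(bundles)
```

In this toy input the lifted version shuffles four partial sums instead of six raw elements. How much that matters in streaming presumably depends on how many elements per key land in the same bundle, which is exactly what a benchmark would need to measure.
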
I still plan on trying to do some benchmarks here because it'd be interesting to see the differences. I'll make sure to post results when I do.

On Thu, Mar 14, 2019 at 3:43 PM Kenneth Knowles <[email protected]> wrote:

> Combine admits many more execution plans than stateful ParDo:
>
> - "Combiner lifting" or "mapper-side combine", in which the CombineFn is
> used to reduce data before shuffling. This is tremendous in batch, but can
> still matter in streaming.
> - Hot key fanout & recombine. This is important in both batch & streaming.
>
> I tried to cover the issues a little in this section of my blog post on
> state, because it also answers the converse question: why/when would you
> use state (without timers) when Combine is so similar?
> https://beam.apache.org/blog/2017/02/13/stateful-processing.html#how-does-stateful-processing-fit-into-the-beam-model
>
> And here's a slide with the same idea but side-by-side illustrations:
> https://s.apache.org/ffsf-2017-beam-state#slide=id.g1dbf0d46d2_0_258
>
> Kenn
>
> On Tue, Mar 12, 2019 at 6:55 AM Steve Niemitz <[email protected]> wrote:
>
>> Hi all.
>>
>> I'm curious if anyone has done any comparison of the performance of a
>> pipeline that uses CombineByKey, vs one that uses a stateful DoFn with
>> combining state. [1]
>>
>> More specifically, if I had a pipeline that had a CombineByKey configured
>> with early firings every N minutes, and I replaced the CBK with a stateful
>> DoFn with combining state and a timer that fired every N minutes instead,
>> would there be a (significant?) performance difference? Specifically I'm
>> using dataflow (with streaming engine) but I'd be curious for other runners
>> as well
>>
>> If no one has tried this I might do a benchmark to test, I'd be very
>> interested to see the results.
>>
>> [1]
>> https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/state/CombiningState.html
