It feels more natural because it uses GroupByKey once instead of once per combiner, which I think is still more efficient even accounting for combiner lifting (unless there's some pipeline optimization that merges multiple GroupByKeys on the same PCollection into a single GBK).
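For concreteness, here's a rough sketch of the two shapes I'm comparing (the sample data and transform labels are just for illustration; `beam.CombineValues` is the closest existing transform I know of for reducing already-grouped values, though I'm not sure it's the intended idiom here):
```
import apache_beam as beam

with beam.Pipeline() as p:
    keyed_nums = p | beam.Create([("a", 1.0), ("a", 2.0), ("b", 3.0)])

    # Shape 1: one grouping per combiner; combiner lifting can
    # pre-reduce each of these before the shuffle.
    means = keyed_nums | "Mean" >> beam.combiners.Mean.PerKey()
    counts = keyed_nums | "Count" >> beam.combiners.Count.PerKey()

    # Shape 2: group once, then reduce each group separately.
    # CombineValues applies a CombineFn to the values of each
    # (key, iterable) pair.
    grouped = keyed_nums | beam.GroupByKey()
    means2 = grouped | "GroupedMean" >> beam.CombineValues(
        beam.combiners.MeanCombineFn())
    counts2 = grouped | "GroupedCount" >> beam.CombineValues(
        beam.combiners.CountCombineFn())
```
(If `CombineValues` already covers this, then maybe the "PerGrouped" variants I'm imagining would just be thin wrappers around it.)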
You can imagine a different use case where this pattern might arise that isn't just about reducing GBKs, though. For example:
```
max_sample_size = 100_000
sampled = (
    keyed_nums
    | GroupByKey()
    # Keep at most max_sample_size values per group.
    | MapTuple(lambda k, nums: (k, list(nums)[:max_sample_size]))
    | ...  # ?? Mean.PerGrouped()?
)
```
To take the mean of each group's values using the current combiners, I think you'd have to "invert" the GroupByKey and then call `Mean.PerKey()`, unless I'm missing something (see the P.S. below for a sketch). I recognize that writing a Map that takes a mean is simple enough, but in a real use case we might have a more complicated combiner.

On Fri, Sep 27, 2024 at 1:31 PM Valentyn Tymofieiev via user <user@beam.apache.org> wrote:

>
> On Fri, Sep 27, 2024 at 8:35 AM Joey Tran <joey.t...@schrodinger.com> wrote:
>
>> Hey all,
>>
>> Just curious if this pattern comes up for others and if people have
>> worked out a good convention.
>>
>> There are many combiners, and a lot of them have two forms: a global
>> form (e.g. Count.Globally) and a per-key form (e.g. Count.PerKey). These
>> are convenient, but it feels like we often run into the case where we
>> GroupByKey a set of data once and then wish to perform a series of
>> combines on the groups, in which case neither of these forms works, and
>> it calls for another form that operates on pre-grouped KVs.
>>
>> Contrived example: maybe you have a PCollection of keyed numbers and you
>> want to calculate some summary statistics on them. You could do:
>> ```
>> keyed_means = (keyed_nums
>>                | Mean.PerKey())
>> keyed_counts = (keyed_nums
>>                 | Count.PerKey())
>> ... # other combines
>> ```
>> But it'd feel more natural to pre-group the PCollection.
>
> Does it feel more natural because it feels as though it would be more
> performant? It seems like it adds an extra grouping step to the pipeline
> code, which otherwise might not be necessary. Note that Dataflow has the
> "combiner lifting" optimization, and combiner-specified reduction happens
> before the data is written into shuffle as much as possible:
> https://cloud.google.com/dataflow/docs/pipeline-lifecycle#combine_optimization
>
>> ```
>> grouped_nums = keyed_nums | GBK()
>> keyed_means = (grouped_nums | Mean.PerGrouped())
>> keyed_counts = (grouped_nums | Count.PerGrouped())
>> ```
>> But these "PerGrouped" variants don't actually currently exist. Does
>> anyone else run into this pattern often? I might be missing an obvious
>> pattern here.
>>
>> --
>> Joey Tran | Staff Developer | AutoDesigner TL
>> *he/him*
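P.S. For reference, here's roughly what I meant by "inverting" the GroupByKey to reuse an existing per-key combiner (a sketch, where `sampled` is the truncated-groups PCollection from my example above):
```
import apache_beam as beam

mean_of_samples = (
    sampled  # PCollection of (key, [num, ...]) pairs
    # Explode each group back into individual (key, value) pairs just to
    # match the per-key combiner's expected input shape...
    | beam.FlatMap(lambda kv: ((kv[0], v) for v in kv[1]))
    # ...and then re-reduce, which implies a second shuffle.
    | beam.combiners.Mean.PerKey()
)
```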