This is a great question. Both a list side input and a GroupByKey on one key collect all of the contents of a window into a single iterable.
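To make the comparison concrete, here is a rough sketch of the two approaches side by side. The class and method names (CollectWholeCollection, viaSideInput, viaSingleKey) and the sample data are made up for illustration, and it assumes the Beam Java SDK plus a runner such as the DirectRunner on the classpath. Treat it as a sketch, not as the recommended way to structure a pipeline.

```java
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Hypothetical example class; none of these names come from Beam itself.
public class CollectWholeCollection {

  // Approach 1: every element of `main` can read all of `small` as a List side input.
  static PCollection<String> viaSideInput(PCollection<String> main, PCollection<String> small) {
    final PCollectionView<List<String>> smallView = small.apply(View.<String>asList());
    return main.apply(
        ParDo.of(
                new DoFn<String, String>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    // The whole (small) collection, materialized by the runner.
                    List<String> all = c.sideInput(smallView);
                    c.output(c.element() + " saw " + all.size() + " elements");
                  }
                })
            .withSideInputs(smallView));
  }

  // Approach 2: shuffle everything to one arbitrary key, then drop the key.
  // The output has one Iterable per window and pane holding that window's contents.
  static PCollection<Iterable<String>> viaSingleKey(PCollection<String> small) {
    return small
        .apply(WithKeys.<Integer, String>of(31337))
        .apply(GroupByKey.<Integer, String>create())
        .apply(Values.<Iterable<String>>create());
  }

  public static void main(String[] args) {
    // Uses default options; the runner is whatever is available on the classpath.
    Pipeline p = Pipeline.create();
    PCollection<String> small = p.apply("Small", Create.of("a", "b", "c"));
    PCollection<String> main = p.apply("Main", Create.of("x", "y"));
    viaSideInput(main, small);
    viaSingleKey(small);
    p.run().waitUntilFinish();
  }
}
```

In the first method the full list is available inside the DoFn for each main-input element; in the second it arrives as a single element that downstream transforms must process themselves.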
There are a few high-level differences:

Side inputs can be controlled by triggers, but may have additional latency compared to a GBK controlled by a trigger - technically there is no real spec here, just "eventually". On the other hand, side inputs are expected to be read many times, so runners will likely add caching that would not make sense for the output of a GBK, so a list side input is probably cheaper to read many times.

Dropping to a lower level of abstraction, a side input is generally a view on a materialization of an entire PCollection. A runner knows this and can choose a suitable materialization to support it. On the other hand, a GroupByKey will generally use the runner's underlying networked shuffle implementation; there's probably needless overhead since all data is being shuffled to a single key.

I've tried to answer somewhat vaguely, in terms of the Beam model, since this is an area where a Beam runner has a lot of discretion. The answers could get more specific if you are interested in a particular runner.

Kenn

On Sun, Sep 24, 2017 at 5:27 PM, Wesley Tanaka <[email protected]> wrote:

> What are the differences between the side input approach and
>
>     .apply(WithKeys.of(31337))
>     .apply(GroupByKey.create())
>     .apply(Values.create())
>
> ?
>
> ---
> Wesley Tanaka
> https://wtanaka.com/
>
> On Monday, July 10, 2017, 5:55:44 PM HST, Kenneth Knowles <[email protected]>
> wrote:
>
> Hi bluejoe,
>
> Assuming you know that you have a very small PCollection, the way you can
> do this is by reading it as a side input. See
> https://beam.apache.org/documentation/programming-guide/#transforms-sideio
>
> Here's a snippet as a teaser to read the docs I link to:
>
> PCollection<Whatever> mySmallCollection = ...
> PCollectionView<List<Whatever>> mySideInput =
>     mySmallCollection.apply(View.asList());
> someOtherCollection.apply(ParDo.of(...).withSideInputs(mySideInput));
>
> This won't work if your collection is actually large, where "large" means
> too big for memory. And it could be slow depending on your runner and
> access pattern, even if you have a medium-sized PCollection, aka fits in
> memory but still a lot to read without parallelism.
>
> Hope that helps,
>
> Kenn
>
> On Mon, Jul 10, 2017 at 8:35 PM, bluejoe <[email protected]> wrote:
>
> > Hi,
> >
> > Can anybody tell me how to convert a PCollection to an Array in local
> > memory?
> > I noticed there is a Create.of() which converts a local list into a
> > PCollection.
> > But how do I do the conversion in the inverse direction?
> >
> > Best regards,
> > bluejoe
