Hi Matthias, when the shuffle happens is not defined by the Beam model so it depends on the runner. You are right, though, that a runner can optimise execution when you specify a CombineFn. In that case a runner can choose to combine elements before shuffling to reduce the amount of data that we have to shuffle across the network. With a SerializableFunction that's not possible because we don't have an intermediate accumulation type as we have with a CombineFn. Therefore, the runner has to ship all elements for a given key to one machine to apply the SerializableFunction.
Regarding your second question, could you maybe send a code snipped? That would allow us to have a look and give a good answer. Cheers, Aljoscha On Tue, 22 Nov 2016 at 12:32 Matthias Baetens < [email protected]> wrote: > Hi there, > > I had some questions about the internal working of these two concepts and > where I could find more info on this (so I might be able similar problems > in the future myself. Here we go: > > + When doing a GroupByKey, when does the shuffling actually take place? > Could it be the behaviour is not the same when using a CombineFn to > aggregate compared to when using a Serializablefunction? (I have a feeling > in the first case not all the keys get shuffled to one machine, while it > does for the second). > > + When using Accumulators in a CombineFn, what are the actual internals? > Is there any docs on this? The problem I run into is that, when I try > adding elements to an ArrayList and then merge ArrayList, the output is an > empty list. The problem could probably be solved by using a > Serializablefunction to Combine everything at once, but you might loose the > advantages of parallellisation in that case (~ above). > > Thanks a lot :) > > Best, > > Matthias > -- > > *Matthias Baetens* > > > *datatonic | data power unleashed* > office +44 203 668 3680 <+44%2020%203668%203680> | mobile +44 74 918 > 20646 > > Level24 | 1 Canada Square | Canary Wharf | E14 5AB London >
