Hi there, I had some questions about the internal working of these two concepts and where I could find more info on this (so I might be able similar problems in the future myself. Here we go:
+ When doing a GroupByKey, when does the shuffling actually take place? Could it be the behaviour is not the same when using a CombineFn to aggregate compared to when using a Serializablefunction? (I have a feeling in the first case not all the keys get shuffled to one machine, while it does for the second). + When using Accumulators in a CombineFn, what are the actual internals? Is there any docs on this? The problem I run into is that, when I try adding elements to an ArrayList and then merge ArrayList, the output is an empty list. The problem could probably be solved by using a Serializablefunction to Combine everything at once, but you might loose the advantages of parallellisation in that case (~ above). Thanks a lot :) Best, Matthias -- *Matthias Baetens* *datatonic | data power unleashed* office +44 203 668 3680 | mobile +44 74 918 20646 Level24 | 1 Canada Square | Canary Wharf | E14 5AB London
