Re: working with hot keys

2018-02-13 Thread Lukasz Cwik
Both are effectively doing the same thing: the first case loads the entire iterable into memory, and the second loads the partitioned iterable into memory. Side input performance varies a lot depending on whether you're running a pipeline with bounded or unbounded PCollections,

Re: working with hot keys

2018-02-13 Thread Jacob Marble
On Mon, Feb 12, 2018 at 3:59 PM, Lukasz Cwik wrote:
> The optimization that you have done is that you have forced the V1
> iterable to reside in memory completely since it is now counted as a single
> element. This will fall apart as soon as your V1 iterable exceeds memory.
>
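To make the memory point in the quoted text concrete, here is a minimal plain-Python sketch. It does not use the Beam API, and the helper names `as_single_element` and `partitioned` are hypothetical, chosen only for illustration. It contrasts materializing a whole iterable as one element with holding one fixed-size partition at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def as_single_element(values: Iterable[T]) -> List[T]:
    """Materialize the whole iterable; it now behaves like one in-memory element."""
    return list(values)


def partitioned(values: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield fixed-size chunks so only one chunk needs to be held at a time."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(values)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Stand-in for the values of one hot key (hypothetical data).
whole = as_single_element(range(10))

# Collected into a list here only so the chunk sizes can be inspected;
# a real consumer would process each chunk and then discard it.
chunks = list(partitioned(range(10), 4))
largest_chunk = max(len(c) for c in chunks)
```

In this sketch the single-element version must hold all ten values at once, while the partitioned version never needs more than four. Either way the memory bound is whatever ends up counted as one element, so the approach only holds while that unit fits in memory.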

Re: working with hot keys

2018-02-12 Thread Lukasz Cwik
The optimization that you have done is that you have forced the V1 iterable to reside in memory completely since it is now counted as a single element. This will fall apart as soon as your V1 iterable exceeds memory. Runners like Dataflow allow re-iteration of a GBK/CoGBK result, allowing for the