Both approaches effectively do the same thing: the first loads the entire
iterable into memory, and the second loads the partitioned iterable into
memory.
The side input performance varies a lot depending on whether you're running
a pipeline with bounded or unbounded PCollections,
On Mon, Feb 12, 2018 at 3:59 PM, Lukasz Cwik wrote:
> The optimization that you have done is that you have forced the V1
> iterable to reside in memory completely since it is now counted as a single
> element. This will fall apart as soon as your V1 iterable exceeds memory.
>
The optimization that you have done is that you have forced the V1 iterable
to reside in memory completely since it is now counted as a single element.
This will fall apart as soon as your V1 iterable exceeds memory.
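
To make the trade-off concrete, here is a minimal plain-Python sketch (not
Beam code). The class ReiterableValues and both join functions are
hypothetical names made up for illustration; they only assume that the
grouped values can be scanned more than once. The materialized version
mirrors treating V1 as a single element, while the streaming version only
ever holds one value from each side at a time.

```python
class ReiterableValues:
    """Hypothetical stand-in for a grouped-values iterable that can be
    scanned several times without being held in memory."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # Values are produced on demand on every pass.
        return iter(range(self.n))


def join_materialized(v1, v2):
    # Treating v1 as one element: the whole iterable is copied into a list,
    # so memory use grows with the size of v1.
    v1_list = list(v1)
    for b in v2:
        for a in v1_list:
            yield (a, b)


def join_streaming(v1, v2):
    # Re-iterating v1 for each element of v2: nothing is copied, at the
    # cost of scanning v1 once per element of v2.
    for b in v2:
        for a in v1:
            yield (a, b)


v1 = ReiterableValues(3)
v2 = ReiterableValues(2)
materialized_pairs = list(join_materialized(v1, v2))
streaming_pairs = list(join_streaming(v1, v2))
```

Both functions produce the same pairs in the same order; the difference is
only in what has to fit in memory while they run.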
Runners like Dataflow allow re-iteration of a GBK/CoGBK result allowing for
the