There shouldn't be any swapping or memory concerns if you're using Dataflow
(unless each element is large (GiB++)). Dataflow will process small
segments of the files all in parallel and write these results out before
processing more so the entire PCollection is never required to be in memory
at a giv
Hi Lukasz, could you please elaborate a bit more around the 2nd part?
What's important to know, from the developer's perspective, about Dataflow's
memory management? How big can partitions grow? And what are the
performance considerations? This sounds as if the workers will "swap"
into disk if
I see, thanks Lukasz - I will try setting that up. Good shout on using
hashcode / ensuring the pipeline is deterministic!
On 23 Feb 2018 01:27, "Lukasz Cwik" wrote:
> 1) Creating a PartitionFn is the right way to go. I would suggest using
> something which would give you stable output so you cou
1) Creating a PartitionFn is the right way to go. I would suggest using
something which would give you stable output so you could replay your
pipeline, and this would be useful for tests as well. Something like taking
the object's hashcode and dividing the hash space into 80%/10%/10% segments
could work
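
To make the hash-space idea concrete, here is a minimal sketch of such a
partition function, written in Python for brevity. It is an illustration
under assumptions, not code from this thread: the names stable_bucket,
partition_for and SPLIT_BOUNDS are hypothetical, elements are assumed to be
strings (for example record IDs), and MD5 stands in for "the object's
hashcode" because Python's built-in hash() of a string varies between
processes, which would break replayability.

```python
import hashlib

# Hypothetical cumulative upper bounds (in percent) for an 80%/10%/10% split.
SPLIT_BOUNDS = (80, 90, 100)


def stable_bucket(key: str, buckets: int = 100) -> int:
    """Map a key to [0, buckets) with a hash that is identical across runs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % buckets


def partition_for(element: str, num_partitions: int = 3) -> int:
    """Return 0, 1 or 2 for the 80%, 10% and 10% segments respectively."""
    if num_partitions != len(SPLIT_BOUNDS):
        raise ValueError("this sketch only supports a three-way split")
    bucket = stable_bucket(element)
    for index, upper in enumerate(SPLIT_BOUNDS):
        if bucket < upper:
            return index
    # Unreachable while the last bound is 100; kept as a safe fallback.
    return num_partitions - 1
```

In the Beam Python SDK a function of this shape could presumably be passed as
beam.Partition(partition_for, 3); in the Java SDK the same logic would live in
a PartitionFn's partitionFor method. Because the partition depends only on the
element's content, rerunning the pipeline sends each element to the same
segment.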