Re: Partitioning a stream randomly and writing to files with TextIO

2018-02-23 Thread Lukasz Cwik
There shouldn't be any swapping or memory concerns if you're using Dataflow (unless each element is large (GiB++)). Dataflow will process small segments of the files in parallel and write those results out before processing more, so the entire PCollection is never required to be in memory at a given time.

Re: Partitioning a stream randomly and writing to files with TextIO

2018-02-23 Thread Carlos Alonso
Hi Lukasz, could you please elaborate a bit more on the second part? What's important to know, from the developer's perspective, about Dataflow's memory management? How big can partitions grow? And what are the performance considerations? It sounds as if the workers will "swap" to disk if the data doesn't fit in memory?

Re: Partitioning a stream randomly and writing to files with TextIO

2018-02-23 Thread Josh
I see, thanks Lukasz - I will try setting that up. Good shout on using hashCode / ensuring the pipeline is deterministic! On 23 Feb 2018 01:27, "Lukasz Cwik" wrote: > 1) Creating a PartitionFn is the right way to go. I would suggest using > something which would give you stable output so you could replay your pipeline.

Re: Partitioning a stream randomly and writing to files with TextIO

2018-02-22 Thread Lukasz Cwik
1) Creating a PartitionFn is the right way to go. I would suggest using something which would give you stable output so you could replay your pipeline; this would be useful for tests as well. Using something like the object's hashCode and dividing the hash space into 80%/10%/10% segments could work.
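A minimal sketch of what such a hash-based split could look like, assuming a Java/Beam pipeline (the GCS paths and the 80/10/10 bucket names are hypothetical; an unbounded streaming source would additionally need windowing and withWindowedWrites() on the TextIO sinks):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class HashPartitionExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical input location.
    PCollection<String> lines =
        p.apply("ReadInput", TextIO.read().from("gs://my-bucket/input/*"));

    // Deterministically route each element to one of three buckets based on
    // its hashCode, splitting the hash space roughly 80% / 10% / 10%.
    // Because the assignment depends only on the element itself, replaying
    // the pipeline produces the same split.
    PCollectionList<String> buckets =
        lines.apply(
            "SplitByHash",
            Partition.of(
                3,
                (Partition.PartitionFn<String>)
                    (element, numPartitions) -> {
                      // Map the hash into [0, 100) and pick a bucket by range.
                      int slot = Math.floorMod(element.hashCode(), 100);
                      if (slot < 80) {
                        return 0; // ~80% of elements
                      } else if (slot < 90) {
                        return 1; // ~10%
                      } else {
                        return 2; // ~10%
                      }
                    }));

    // Hypothetical output prefixes; each bucket is written independently.
    buckets.get(0).apply("Write80", TextIO.write().to("gs://my-bucket/output/train"));
    buckets.get(1).apply("Write10a", TextIO.write().to("gs://my-bucket/output/validate"));
    buckets.get(2).apply("Write10b", TextIO.write().to("gs://my-bucket/output/test"));

    p.run();
  }
}

The split is only approximately 80/10/10 and assumes the elements' hashCode values are reasonably uniformly distributed; Math.floorMod keeps the bucket index non-negative for negative hashes.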