I see, thanks Lukasz - I will try setting that up. Good shout on using hashcode / ensuring the pipeline is deterministic!
On 23 Feb 2018 01:27, "Lukasz Cwik" <[email protected]> wrote: > 1) Creating a PartitionFn is the right way to go. I would suggest using > something which would give you stable output so you could replay your > pipeline and this would be useful for tests as well. Use something like the > object's hashcode and divide the hash space into 80%/10%/10% segments could > work just make sure that if you go with hashcode the hashcode function > distribute elements well. > > 2) This is runner dependent but most runners don't require storing > everything in memory. For example if you were using Dataflow, you would > only need to store a couple of elements in memory not the entire > PCollection. > > On Thu, Feb 22, 2018 at 11:38 AM, Josh <[email protected]> wrote: > >> Hi all, >> >> I want to read a large dataset using BigQueryIO, and then randomly >> partition the rows into three chunks, where one partition has 80% of the >> data and there are two other partitions with 10% and 10%. I then want to >> write the three partitions to three files in GCS. >> >> I have a couple of quick questions: >> (1) What would be the best way to do this random partitioning with Beam? >> I think I can just use a PartitionFn which uses Math.random to determine >> which of the three partitions an element should go to, but not sure if >> there is a better approach. >> >> (2) I would then take the resulting PCollectionList and use TextIO to >> write each partition to a GCS file. For this, would I need all data for the >> largest partition to fit into the memory of a single worker? >> >> Thanks for any advice, >> >> Josh >> > >
