Hi all, I want to read a large dataset using BigQueryIO and then randomly split the rows into three partitions: one with 80% of the data and two with 10% each. I then want to write the three partitions to three files in GCS.
I have a couple of quick questions: (1) What would be the best way to do this random partitioning in Beam? I think I can just use a PartitionFn that uses Math.random to decide which of the three partitions an element should go to, but I'm not sure if there is a better approach. (2) I would then take the resulting PCollectionList and use TextIO to write each partition to a GCS file. For this, would all of the data for the largest partition need to fit into the memory of a single worker? Thanks for any advice, Josh
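For context, here is a minimal sketch of the thresholding logic I have in mind for (1). The class and method names (`RandomSplit`, `partitionFor`) are just placeholders of mine, and the Beam wiring is only shown in a comment; this is not tested pipeline code:

```java
import java.util.Random;

public class RandomSplit {
    // Map a uniform random draw in [0, 1) to one of three partition
    // indexes: index 0 gets ~80% of elements, indexes 1 and 2 ~10% each.
    static int partitionFor(double r) {
        if (r < 0.8) return 0;  // 80% partition
        if (r < 0.9) return 1;  // first 10% partition
        return 2;               // second 10% partition
    }

    public static void main(String[] args) {
        // In the pipeline I imagine this living inside a PartitionFn, e.g.:
        //   PCollectionList<TableRow> parts =
        //       rows.apply(Partition.of(3,
        //           (row, numPartitions) -> partitionFor(Math.random())));
        Random rng = new Random();
        int[] counts = new int[3];
        for (int i = 0; i < 100_000; i++) {
            counts[partitionFor(rng.nextDouble())]++;
        }
        // Rough split check: expect roughly 80k / 10k / 10k.
        System.out.printf("%d %d %d%n", counts[0], counts[1], counts[2]);
    }
}
```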
