I want to read a large dataset using BigQueryIO and then randomly
partition the rows into three partitions: one with 80% of the data and
two with 10% each. I then want to write the three partitions to three
files in GCS.
I have a couple of quick questions:
(1) What would be the best way to do this random partitioning with Beam? I
think I can just use a PartitionFn that calls Math.random() to decide
which of the three partitions an element should go to, but I am not sure
whether there is a better approach.
(2) I would then take the resulting PCollectionList and use TextIO to write
each partition to a GCS file. For this, would I need all data for the
largest partition to fit into the memory of a single worker?
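
To make (1) concrete, here is a minimal sketch of the selection logic I
have in mind, in plain Java with no Beam dependency. The names RandomSplit,
partitionFor and randomPartition are made up for illustration. My
assumption is that the body of PartitionFn.partitionFor(elem, numPartitions)
would simply return randomPartition(), with the PartitionFn passed to
Partition.of(3, ...).

```java
final class RandomSplit {
    // Three partitions: index 0 gets ~80%, indices 1 and 2 get ~10% each.
    static final int NUM_PARTITIONS = 3;

    // Maps a uniform draw u in [0, 1) to a partition index:
    // [0, 0.8) -> 0, [0.8, 0.9) -> 1, [0.9, 1) -> 2.
    static int partitionFor(double u) {
        if (u < 0.8) {
            return 0;
        }
        if (u < 0.9) {
            return 1;
        }
        return 2;
    }

    // Draws a fresh uniform number per element and picks its partition.
    // Hypothetical helper: this is what the PartitionFn would delegate to.
    static int randomPartition() {
        return partitionFor(Math.random());
    }

    // Small local check of the split ratios, independent of any pipeline.
    public static void main(String[] args) {
        int[] counts = new int[NUM_PARTITIONS];
        for (int i = 0; i < 1_000_000; i++) {
            counts[randomPartition()]++;
        }
        System.out.println(java.util.Arrays.toString(counts));
    }
}
```

Since each element is assigned independently, the split is only
approximately 80/10/10 and would differ between runs, which is fine for my
use case.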
Thanks for any advice,