Partitioning a stream randomly and writing to files with TextIO

Josh Thu, 22 Feb 2018 11:39:37 -0800

Hi all,

I want to read a large dataset using BigQueryIO, and then randomly
partition the rows into three chunks, where one partition has 80% of the
data and there are two other partitions with 10% and 10%. I then want to
write the three partitions to three files in GCS.


I have a couple of quick questions:
(1) What would be the best way to do this random partitioning with Beam? I
think I can just use a PartitionFn that calls Math.random to decide
which of the three partitions an element goes to, but I'm not sure whether
there is a better approach.

(2) I would then take the resulting PCollectionList and use TextIO to write
each partition to a GCS file. For this, would I need all data for the
largest partition to fit into the memory of a single worker?
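
To make question (1) concrete, here is a rough sketch of the assignment logic
I have in mind, written as plain Java with no Beam dependency. The class and
method names (WeightedSplit, partitionFor, histogram) are made up for
illustration. In a pipeline I assume the same comparison would sit inside a
Partition.PartitionFn's partitionFor(elem, numPartitions), with the draw
coming from Math.random or a similar source for each element. Since the
assignment is random, the split would only be approximately 80/10/10.

```java
import java.util.Random;

/** Sketch only: maps a uniform draw in [0, 1) to one of three partitions (80/10/10). */
final class WeightedSplit {
    // Cumulative upper bounds: [0, 0.8) -> 0, [0.8, 0.9) -> 1, [0.9, 1.0) -> 2.
    private static final double[] CUMULATIVE = {0.8, 0.9, 1.0};

    private WeightedSplit() {}

    /** Returns the partition index for a draw u, assumed to lie in [0, 1). */
    static int partitionFor(double u) {
        for (int i = 0; i < CUMULATIVE.length; i++) {
            if (u < CUMULATIVE[i]) {
                return i;
            }
        }
        // Defensive fallback in case u is 1.0 or larger.
        return CUMULATIVE.length - 1;
    }

    /** Counts how many of n seeded random draws land in each partition. */
    static int[] histogram(long seed, int n) {
        Random rng = new Random(seed);
        int[] counts = new int[CUMULATIVE.length];
        for (int i = 0; i < n; i++) {
            counts[partitionFor(rng.nextDouble())]++;
        }
        return counts;
    }
}
```

The histogram method is only there to check the proportions locally before
wiring the logic into a pipeline.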

Thanks for any advice,

Josh
