1) Creating a PartitionFn is the right way to go. I would suggest using
something that gives you stable output, so you can replay your pipeline;
this is useful for tests as well. Something like the object's hash code,
dividing the hash space into 80%/10%/10% segments, could work. Just make
sure that if you go with hashCode, the hash function distributes elements
well.
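
For illustration, here is a minimal sketch of such a PartitionFn
(assumptions: TableRow elements from BigQueryIO with a stable "id" field
used as the key; substitute whatever unique key your rows actually carry):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.transforms.Partition;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    static PCollectionList<TableRow> splitEightyTenTen(PCollection<TableRow> rows) {
      return rows.apply(Partition.of(3,
          (Partition.PartitionFn<TableRow>) (row, numPartitions) -> {
            // Hash a stable key into [0, 100) and slice the range.
            // "id" is an assumed field name; Math.floorMod keeps the
            // bucket non-negative even for negative hash codes.
            int bucket = Math.floorMod(row.get("id").hashCode(), 100);
            if (bucket < 80) return 0; // buckets 0..79  -> 80% partition
            if (bucket < 90) return 1; // buckets 80..89 -> 10% partition
            return 2;                  // buckets 90..99 -> 10% partition
          }));
    }

Because the split depends only on the key, rerunning the pipeline over the
same input reproduces the same three partitions, which a Math.random-based
PartitionFn would not.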

2) This is runner-dependent, but most runners don't require storing
everything in memory. For example, if you were using Dataflow, you would
only need to store a couple of elements in memory, not the entire
PCollection.
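
Continuing the sketch above, the write step might look like this (the GCS
path, partition names, and toString() line formatting are all placeholders):

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // Write each partition to its own set of files in GCS. Elements are
    // streamed out as they are written, so no worker has to hold an
    // entire partition in memory.
    PCollectionList<TableRow> parts = splitEightyTenTen(rows);
    String[] names = {"train-80", "eval-10", "test-10"}; // placeholder names
    for (int i = 0; i < parts.size(); i++) {
      parts.get(i)
          .apply(MapElements.into(TypeDescriptors.strings())
              .via((TableRow row) -> row.toString())) // one line per row
          .apply(TextIO.write().to("gs://my-bucket/splits/" + names[i]));
    }

Note that TextIO shards its output by default; if you need exactly one file
per partition, withoutSharding() will force that, at the cost of funnelling
each partition's write through a single worker.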

On Thu, Feb 22, 2018 at 11:38 AM, Josh <jof...@gmail.com> wrote:

> Hi all,
>
> I want to read a large dataset using BigQueryIO, and then randomly
> partition the rows into three chunks, where one partition has 80% of the
> data and the other two have 10% each. I then want to
> write the three partitions to three files in GCS.
>
> I have a couple of quick questions:
> (1) What would be the best way to do this random partitioning with Beam? I
> think I can just use a PartitionFn which uses Math.random to determine
> which of the three partitions an element should go to, but not sure if
> there is a better approach.
>
> (2) I would then take the resulting PCollectionList and use TextIO to
> write each partition to a GCS file. For this, would I need all data for the
> largest partition to fit into the memory of a single worker?
>
> Thanks for any advice,
>
> Josh
>
