Re: Partitioning a stream randomly and writing to files with TextIO

Josh Fri, 23 Feb 2018 00:22:32 -0800

I see, thanks Lukasz - I will try setting that up. Good shout on using
hashcode / ensuring the pipeline is deterministic!


On 23 Feb 2018 01:27, "Lukasz Cwik" <[email protected]> wrote:

> 1) Creating a PartitionFn is the right way to go. I would suggest using
> something which would give you stable output so you could replay your
> pipeline and this would be useful for tests as well. Use something like the
> object's hashcode and divide the hash space into 80%/10%/10% segments could
> work just make sure that if you go with hashcode the hashcode function
> distribute elements well.
>
> 2) This is runner dependent but most runners don't require storing
> everything in memory. For example if you were using Dataflow, you would
> only need to store a couple of elements in memory not the entire
> PCollection.
>
> On Thu, Feb 22, 2018 at 11:38 AM, Josh <[email protected]> wrote:
>
>> Hi all,
>>
>> I want to read a large dataset using BigQueryIO, and then randomly
>> partition the rows into three chunks, where one partition has 80% of the
>> data and there are two other partitions with 10% and 10%. I then want to
>> write the three partitions to three files in GCS.
>>
>> I have a couple of quick questions:
>> (1) What would be the best way to do this random partitioning with Beam?
>> I think I can just use a PartitionFn which uses Math.random to determine
>> which of the three partitions an element should go to, but not sure if
>> there is a better approach.
>>
>> (2) I would then take the resulting PCollectionList and use TextIO to
>> write each partition to a GCS file. For this, would I need all data for the
>> largest partition to fit into the memory of a single worker?
>>
>> Thanks for any advice,
>>
>> Josh
>>
>
>

Re: Partitioning a stream randomly and writing to files with TextIO

Reply via email to