Hi Lukasz,

Thank you for your reply. I understood the part about buffering the elements, but I did not clearly understand this: "a fixed number of keys where the fixed number of keys controls the write parallelization". Let's say I have only 4 keys, then 4 files will be created, but I need to create a new file when the buffer size is reached (10k), so it could happen that more than 4 files are created, correct?
Using the existing FileSystems API - you mean without using the existing FileIO, right? I was looking for an API similar to Flink's BucketingSink API, which rolls to a new file after the batch size is reached.

Regards,
Vinay Patil

On Mon, Nov 26, 2018 at 1:04 PM Lukasz Cwik <[email protected]> wrote:

> You could use a StatefulDoFn to buffer up 10k worth of data and write it.
> There is an example for batched RPC[1] that could be re-used using the
> FileSystems API to just create files. You'll want to use a reshuffle + a
> fixed number of keys where the fixed number of keys controls the write
> parallelization. The pipeline would look like:
>
> --> ParDo(assign deterministic random key in fixed key space) -->
> Reshuffle --> ParDo(10k buffering StatefulDoFn)
>
> You'll need to provide more details around any other constraints you have
> around writing 10k.
>
> 1: https://beam.apache.org/blog/2017/08/28/timely-processing.html
>
>
> On Mon, Nov 26, 2018 at 9:39 AM Vinay Patil <[email protected]>
> wrote:
>
>> Hi,
>>
>> I have a use case of writing only 10k records in a single file, after
>> which a new file should be created.
>>
>> Is there a way in Beam to roll a file after the specified number of
>> records?
>>
>> Regards,
>> Vinay Patil
>>
>
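For reference, here is a rough sketch of what the buffering StatefulDoFn that Lukasz describes could look like in Java, following the batched-RPC pattern from the linked post [1] but writing a file via the FileSystems API instead. The class name WriteEvery10kFn, the String element type, and the OUTPUT_DIR path are all placeholders I made up for illustration, not anything from the thread:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.CombiningState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;

/** Buffers up to 10k elements per key and writes each full buffer to a new file. */
class WriteEvery10kFn extends DoFn<KV<Integer, String>, String> {

  private static final int MAX_BUFFER_SIZE = 10_000;
  private static final String OUTPUT_DIR = "gs://my-bucket/output/"; // placeholder

  // Bag of buffered records for the current key.
  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec = StateSpecs.bag();

  // Running count of buffered records, so we don't have to iterate the bag.
  @StateId("count")
  private final StateSpec<CombiningState<Long, long[], Long>> countSpec =
      StateSpecs.combining(Sum.ofLongs());

  @ProcessElement
  public void processElement(
      ProcessContext context,
      @StateId("buffer") BagState<String> buffer,
      @StateId("count") CombiningState<Long, long[], Long> count)
      throws IOException {
    buffer.add(context.element().getValue());
    count.add(1L);
    if (count.read() >= MAX_BUFFER_SIZE) {
      // Buffer is full: roll to a new file and reset the state.
      String fileName = writeToNewFile(buffer.read());
      context.output(fileName);
      buffer.clear();
      count.clear();
    }
  }

  /** Writes the buffered records to a new, uniquely named file via the FileSystems API. */
  private static String writeToNewFile(Iterable<String> records) throws IOException {
    String fileName = OUTPUT_DIR + UUID.randomUUID() + ".txt";
    ResourceId resourceId = FileSystems.matchNewResource(fileName, /* isDirectory= */ false);
    try (WritableByteChannel channel = FileSystems.create(resourceId, "text/plain")) {
      for (String record : records) {
        channel.write(ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8)));
      }
    }
    return fileName;
  }
}

To wire it up along the lines of the suggested pipeline, you could assign each element a key from a small fixed key space (e.g. a hash modulo the number of keys) with WithKeys, apply Reshuffle.of(), and then ParDo.of(new WriteEvery10kFn()). Note that this sketch only flushes full 10k buffers; any leftover elements below 10k would still need to be flushed, e.g. with an event-time timer that fires at window expiry as shown in the timely-processing post [1].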
