On Mon, Nov 26, 2018 at 12:35 PM Vinay Patil <[email protected]>
wrote:

> Hi Lukasz,
>
> Thank you for your reply. I understood the part about buffering the
> elements, but I did not clearly understand this: "a fixed number of keys
> where the fixed number of keys controls the write parallelization". Let's
> say we have only 4 keys; then 4 files will be created, but I need to create
> a new file whenever the buffer size (10k) is reached, so more than 4 files
> could end up being created, correct?
>

Using 4 fixed keys means that you have 4 buffers of 10k records each being
filled concurrently (one buffer per key). A file is written each time a
buffer fills, so you will end up with many files of 10k records each.
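The keyed-buffer idea can be sketched outside Beam in plain Python (names and
key assignment here are hypothetical; in an actual pipeline each per-key
buffer would live in a StatefulDoFn's BagState behind a reshuffle):

```python
NUM_KEYS = 4          # fixed key space; controls write parallelism
BUFFER_SIZE = 10_000  # records per output file

buffers = {k: [] for k in range(NUM_KEYS)}
flushed = []  # stand-in for the files actually written out

def process(record):
    # Assign a deterministic key in the fixed key space, buffer the
    # record, and "roll" to a new file once the buffer is full.
    key = hash(record) % NUM_KEYS
    buffers[key].append(record)
    if len(buffers[key]) == BUFFER_SIZE:
        flushed.append(list(buffers[key]))  # would be a file write
        buffers[key].clear()

for i in range(45_000):
    process(i)
```

With 45k records spread over 4 keys, each key flushes one full 10k file and
keeps the remainder buffered until the next flush (or end of window).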


> By "using the existing FileSystem API" - you mean without using the
> existing FileIO, right?
>

Yes.
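A minimal sketch of what each flush could do, using plain local-file writes
in place of Beam's FileSystems calls (the naming scheme is hypothetical):

```python
import os
import tempfile

def write_chunk(records, out_dir, key, seq):
    # Each full buffer becomes its own file; the name encodes the key and a
    # per-key sequence number so concurrent writers never collide.
    # (In a real pipeline you would open the write channel via Beam's
    # FileSystems API rather than open().)
    path = os.path.join(out_dir, f"part-{key:02d}-{seq:05d}.txt")
    with open(path, "w") as f:
        for r in records:
            f.write(f"{r}\n")
    return path

out_dir = tempfile.mkdtemp()
written = write_chunk(["a", "b", "c"], out_dir, key=0, seq=0)
```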


> I was looking for an API similar to Flink's BucketingSink API, which rolls
> to a new file after the batch size is reached.
>
>
> Regards,
> Vinay Patil
>
>
> On Mon, Nov 26, 2018 at 1:04 PM Lukasz Cwik <[email protected]> wrote:
>
>> You could use a StatefulDoFn to buffer up 10k records' worth of data and
>> write it out. There is an example of batched RPCs[1] whose pattern could
>> be re-used with the FileSystems API to just create files. You'll want to
>> use a reshuffle plus a fixed number of keys, where the fixed number of
>> keys controls the write parallelization. The pipeline would look like:
>>
>> --> ParDo(assign deterministic random key in fixed key space) -->
>> Reshuffle --> ParDo(10k buffering StatefulDoFn)
>>
>> You'll need to provide more details about any other constraints you have
>> around writing the 10k-record files.
>>
>> 1: https://beam.apache.org/blog/2017/08/28/timely-processing.html
>>
>>
>> On Mon, Nov 26, 2018 at 9:39 AM Vinay Patil <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I have a use case of writing only 10k records to a single file, after
>>> which a new file should be created.
>>>
>>> Is there a way in Beam to roll to a new file after the specified number
>>> of records?
>>>
>>> Regards,
>>> Vinay Patil
>>>
>>
