Hi Lukasz,

Thank you for your reply. I understood the part about buffering the elements, but I did not clearly understand this: "a fixed number of keys where the fixed number of keys controls the write parallelization". Let's say I have only 4 keys, then 4 files will be created, but I need to create a new file when the buffer size is reached (10k), so it could happen that more than 4 files are created, correct?
Using the existing FileSystems API - you mean without using the existing FileIO, right? I was looking for an API similar to Flink's BucketingSink API, which rolls to a new file after the batch size is reached.

Regards,
Vinay Patil

On Mon, Nov 26, 2018 at 1:04 PM Lukasz Cwik <[email protected]> wrote:

> You could use a StatefulDoFn to buffer up 10k worth of data and write it.
> There is an example for batched RPC[1] that could be re-used using the
> FileSystems API to just create files. You'll want to use a reshuffle + a
> fixed number of keys where the fixed number of keys controls the write
> parallelization. The pipeline would look like:
>
> --> ParDo(assign deterministic random key in fixed key space) -->
> Reshuffle --> ParDo(10k buffering StatefulDoFn)
>
> You'll need to provide more details around any other constraints you have
> around writing 10k.
>
> 1: https://beam.apache.org/blog/2017/08/28/timely-processing.html
>
>
> On Mon, Nov 26, 2018 at 9:39 AM Vinay Patil <[email protected]>
> wrote:
>
>> Hi,
>>
>> I have a use case of writing only 10k records in a single file, after
>> which a new file should be created.
>>
>> Is there a way in Beam to roll a file after the specified number of
>> records?
>>
>> Regards,
>> Vinay Patil
>>
>
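For reference, here is a rough sketch of what the buffering StatefulDoFn that Lukasz describes could look like in Java, following the batched-RPC pattern from the linked post [1] but writing a file via the FileSystems API instead. The class name WriteEvery10kFn, the String element type, and the OUTPUT_DIR path are all placeholders I made up for illustration, not anything from the thread:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.CombiningState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;

/** Buffers up to 10k elements per key and writes each full buffer to a new file. */
class WriteEvery10kFn extends DoFn<KV<Integer, String>, String> {

  private static final int MAX_BUFFER_SIZE = 10_000;
  private static final String OUTPUT_DIR = "gs://my-bucket/output/"; // placeholder

  // Bag of buffered records for the current key.
  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec = StateSpecs.bag();

  // Running count of buffered records, so we don't have to iterate the bag.
  @StateId("count")
  private final StateSpec<CombiningState<Long, long[], Long>> countSpec =
      StateSpecs.combining(Sum.ofLongs());

  @ProcessElement
  public void processElement(
      ProcessContext context,
      @StateId("buffer") BagState<String> buffer,
      @StateId("count") CombiningState<Long, long[], Long> count)
      throws IOException {
    buffer.add(context.element().getValue());
    count.add(1L);
    if (count.read() >= MAX_BUFFER_SIZE) {
      // Buffer is full: roll to a new file and reset the state.
      String fileName = writeToNewFile(buffer.read());
      context.output(fileName);
      buffer.clear();
      count.clear();
    }
  }

  /** Writes the buffered records to a new, uniquely named file via the FileSystems API. */
  private static String writeToNewFile(Iterable<String> records) throws IOException {
    String fileName = OUTPUT_DIR + UUID.randomUUID() + ".txt";
    ResourceId resourceId = FileSystems.matchNewResource(fileName, /* isDirectory= */ false);
    try (WritableByteChannel channel = FileSystems.create(resourceId, "text/plain")) {
      for (String record : records) {
        channel.write(ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8)));
      }
    }
    return fileName;
  }
}

To wire it up along the lines of the suggested pipeline, you could assign each element a key from a small fixed key space (e.g. a hash modulo the number of keys) with WithKeys, apply Reshuffle.of(), and then ParDo.of(new WriteEvery10kFn()). Note that this sketch only flushes full 10k buffers; any leftover elements below 10k would still need to be flushed, e.g. with an event-time timer that fires at window expiry as shown in the timely-processing post [1].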
