On Mon, Nov 26, 2018 at 12:35 PM Vinay Patil <[email protected]> wrote:
> Hi Lukasz,
>
> Thank you for your reply. I understood the part of buffering the elements,
> however I did not clearly understand this: "a fixed number of keys where
> the fixed number of keys controls the write parallelization". Let's say I
> have only 4 keys, then 4 files will be created, but I need to create a new
> file when the buffer size is reached (10k), so it could happen that more
> than 4 files are created, correct?

Using 4 fixed keys means that you have 4 buffers of 10k records being filled
concurrently (one buffer per key). A file will be written whenever a buffer
is full, so you will end up with many files, each holding at least 10k
records.

> Using the existing FileSystems API - you mean without using the existing
> FileIO, right?

Yes.

> I was looking for an API similar to the BucketingSink API of Flink, where
> it rolls to a new file after the batch size is reached.
>
> Regards,
> Vinay Patil
>
> On Mon, Nov 26, 2018 at 1:04 PM Lukasz Cwik <[email protected]> wrote:
>
>> You could use a StatefulDoFn to buffer up 10k worth of data and write it.
>> There is an example for batched RPC [1] that could be re-used with the
>> FileSystems API to just create files. You'll want to use a reshuffle + a
>> fixed number of keys, where the fixed number of keys controls the write
>> parallelization. The pipeline would look like:
>>
>> --> ParDo(assign deterministic random key in fixed key space) -->
>> Reshuffle --> ParDo(10k buffering StatefulDoFn)
>>
>> You'll need to provide more details around any other constraints you have
>> around writing 10k records.
>>
>> 1: https://beam.apache.org/blog/2017/08/28/timely-processing.html
>>
>> On Mon, Nov 26, 2018 at 9:39 AM Vinay Patil <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I have a use case of writing only 10k records to a single file, after
>>> which a new file should be created.
>>>
>>> Is there a way in Beam to roll a file after a specified number of
>>> records?
>>>
>>> Regards,
>>> Vinay Patil
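For anyone following along, the pattern Lukasz describes (deterministic key in a fixed key space, reshuffle, per-key buffer flushed at 10k records) can be sketched in plain Python outside of Beam. This is only an illustration of the buffering logic; the names `assign_key` and `BufferingWriter` and the constants are made up for this sketch, and a real pipeline would keep the buffer in the StatefulDoFn's state (e.g. a BagState plus a counter, as in the batched-RPC blog post above) and write via the FileSystems API instead of collecting lists:

```python
import hashlib

NUM_KEYS = 4          # fixed key space -> controls write parallelization
BUFFER_SIZE = 10_000  # records per file ("10k")

def assign_key(record: str) -> int:
    """Deterministic 'random' key in a fixed key space (hypothetical helper)."""
    digest = hashlib.sha1(record.encode("utf-8")).digest()
    return digest[0] % NUM_KEYS

class BufferingWriter:
    """Stand-in for the 10k-buffering StatefulDoFn.

    State = one list per key; "writing a file" = appending the full batch
    to self.files so the rolling behavior can be inspected.
    """
    def __init__(self):
        self.buffers = {k: [] for k in range(NUM_KEYS)}
        self.files = []  # each entry plays the role of one output file

    def process(self, record: str) -> None:
        key = assign_key(record)
        buf = self.buffers[key]
        buf.append(record)
        if len(buf) >= BUFFER_SIZE:
            # Roll a "file" of exactly BUFFER_SIZE records and reset the state.
            self.files.append(list(buf))
            buf.clear()

    def finish(self) -> None:
        # On the final flush (a timer / end of input in real Beam),
        # write out any partially filled buffers as well.
        for buf in self.buffers.values():
            if buf:
                self.files.append(list(buf))
                buf.clear()

writer = BufferingWriter()
for i in range(50_000):
    writer.process(f"record-{i}")
writer.finish()
# Every file except the trailing partial ones holds exactly BUFFER_SIZE records.
```

Note that, as in the thread, more than 4 files come out even with 4 keys: each key rolls a new file every time its buffer fills, and a smaller remainder file per key is flushed at the end.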
