Have you considered using a stateful DoFn, buffering/batching based upon a certain number of elements is shown in this blog[1] and could be extended for your usecase.
1: https://beam.apache.org/blog/2017/08/28/timely-processing.html On Tue, Oct 17, 2017 at 2:46 PM, Jacob Marble <[email protected]> wrote: > My first streaming pipeline is pretty simple, it just pipes a queue into > files: > > - read JSON objects from PubsubIO > - event time = processing time > - 5 minute windows ( > - write n files to GCS, (TextIO.withNumShards() not dynamic) > > When the pipeline gets behind (for example, when the pipeline is stopped > for an hour and restarted) this creates problems because the amount of data > per file becomes too much, and the pipeline stays behind. > > I believe that what is needed is a new step, just before "write to GCS": > > - split/partition/window into ceil(totalElements / maxElements) groups > > My next idea is to implement my own Partition and PartitionDoFn that > accept a PCollectionView<Long> from Count.perElemen(). > > Is there a more built-in way to accomplish dynamic partitions by element > quantity? > > Jacob >
