My first streaming pipeline is pretty simple, it just pipes a queue into files:
- read JSON objects from PubsubIO - event time = processing time - 5 minute windows ( - write n files to GCS, (TextIO.withNumShards() not dynamic) When the pipeline gets behind (for example, when the pipeline is stopped for an hour and restarted) this creates problems because the amount of data per file becomes too much, and the pipeline stays behind. I believe that what is needed is a new step, just before "write to GCS": - split/partition/window into ceil(totalElements / maxElements) groups My next idea is to implement my own Partition and PartitionDoFn that accept a PCollectionView<Long> from Count.perElemen(). Is there a more built-in way to accomplish dynamic partitions by element quantity? Jacob