My first streaming pipeline is pretty simple; it just pipes a queue into
files:

- read JSON objects from PubsubIO
- event time = processing time
- 5-minute windows
- write n files to GCS (TextIO.withNumShards() is not dynamic)
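
The steps above could be sketched roughly like this, assuming the Beam Java SDK; the topic, bucket path, and shard count are placeholder values:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class PipelineSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(PubsubIO.readStrings()
            .fromTopic("projects/my-project/topics/events")) // placeholder topic
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply(TextIO.write()
            .to("gs://my-bucket/output") // placeholder path
            .withWindowedWrites()
            .withNumShards(10)); // fixed shard count, not dynamic
    p.run();
  }
}
```
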

When the pipeline falls behind (for example, when it is stopped for an
hour and restarted), the amount of data per file becomes too large and the
pipeline never catches up.

I believe that what is needed is a new step, just before "write to GCS":

- split/partition/window into ceil(totalElements / maxElements) groups
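
The arithmetic for that split step is just a ceiling division; a minimal sketch (the class and method names are mine):

```java
public class ShardMath {
  // ceil(totalElements / maxElementsPerGroup) via integer arithmetic,
  // clamped to at least one group so an empty window still has a target.
  static int numGroups(long totalElements, long maxElementsPerGroup) {
    return (int) Math.max(1, (totalElements + maxElementsPerGroup - 1) / maxElementsPerGroup);
  }

  public static void main(String[] args) {
    System.out.println(numGroups(10_000, 3_000)); // 4 groups of <= 3000 elements
    System.out.println(numGroups(2_500, 3_000));  // 1 group
  }
}
```
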

My next idea is to implement my own Partition and PartitionDoFn that accept
a PCollectionView<Long> from Count.perElement().
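
A rough sketch of that DoFn, assuming the Beam Java SDK; DynamicShardFn and maxElementsPerFile are illustrative names, and the per-window count would need to arrive as a singleton side input (e.g. via Combine.globally(Count.combineFn()).asSingletonView() rather than Count.perElement(), which produces per-key counts):

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollectionView;

// Illustrative DoFn: keys each element with a shard index so that, on
// average, no group exceeds maxElementsPerFile elements.
class DynamicShardFn extends DoFn<String, KV<Integer, String>> {
  private final PCollectionView<Long> countView;
  private final long maxElementsPerFile;

  DynamicShardFn(PCollectionView<Long> countView, long maxElementsPerFile) {
    this.countView = countView;
    this.maxElementsPerFile = maxElementsPerFile;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    long total = c.sideInput(countView); // per-window element count
    int shards = (int) Math.max(1, (total + maxElementsPerFile - 1) / maxElementsPerFile);
    c.output(KV.of(ThreadLocalRandom.current().nextInt(shards), c.element()));
  }
}
```

It would be applied with ParDo.of(new DynamicShardFn(countView, max)).withSideInputs(countView), followed by a GroupByKey (or similar) ahead of the write.
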

Is there a more built-in way to accomplish dynamic partitioning by element
count?

Jacob
