As far as I can tell, Beam expects runners to have full control over how individual elements are divided into bundles, leaving users with no say in the matter. Is that true? Or are there ways I might exert some influence over bundle sizes?
My main interest at the moment is investigating lighter-weight alternatives to FileIO for a simple but high-throughput Dataflow job that batches up messages from Pub/Sub and sinks them to GCS. I'm imagining a ParDo that buffers incoming messages and then writes them all out as a single GCS object in a @FinalizeBundle method, avoiding the multiple GroupByKey operations FileIO needs to write sharded output. The problem is that, in practice, bundles appear to be far too small for this to work.

To check, I deployed a four-node test job that simply reads ~30k messages per second from Pub/Sub and passes them to a transform that publishes some metrics about the bundles flowing through it. I found a mean bundle size of ~500 elements, corresponding to ~10 MB of data, which is too small for the approach above to pay off.

Are there any tricks I could use to coerce Dataflow into producing larger bundles? I realize this is basically an abuse of the Beam programming model, but the alternative I'm looking at is writing a custom application against the google-cloud APIs and deploying it on Kubernetes.
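For concreteness, here is the shape of the buffering I have in mind, as a plain-Java sketch rather than a working DoFn (class and method names are mine; in the real pipeline the buffering would happen in @ProcessElement and the write in the bundle-finalization hook, with the GCS client call stubbed out below as a callback):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of per-bundle buffering: elements accumulate in memory and are
// flushed as one batch when the bundle ends. In the actual DoFn, the
// sink callback would be a single GCS object write.
class BundleBuffer {
    private final List<String> buffer = new ArrayList<>();
    private final Consumer<List<String>> sink; // stand-in for the GCS write

    BundleBuffer(Consumer<List<String>> sink) {
        this.sink = sink;
    }

    // Analogous to @ProcessElement: just buffer the element.
    void process(String element) {
        buffer.add(element);
    }

    // Analogous to the bundle-finalization step: write the whole buffer
    // out as one object, then reset for the next bundle.
    void finishBundle() {
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

The point is that one flush per bundle means one GCS object per bundle, which is only economical if bundles are large — hence the question.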
