It seems you can 'hack' it with the State API. See the discussion on this ticket: https://issues.apache.org/jira/browse/BEAM-6886
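For context, the stateful approach amounts to buffering elements in per-key state and flushing once a count or byte-size threshold trips (Beam's GroupIntoBatches transform packages a version of this pattern). Below is a minimal standalone sketch of just that buffering logic, outside of Beam; the `Batcher` class, its thresholds, and its method names are all illustrative inventions, not Beam APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: the accumulate-and-flush logic that a stateful
// DoFn would keep in per-key state. Not a Beam API.
public class Batcher {
    private final int maxCount;
    private final long maxBytes;
    private final List<String> buffer = new ArrayList<>();
    private long bufferedBytes = 0;

    public Batcher(int maxCount, long maxBytes) {
        this.maxCount = maxCount;
        this.maxBytes = maxBytes;
    }

    // Buffer one element; return a full batch when a threshold trips,
    // otherwise null.
    public List<String> add(String element) {
        buffer.add(element);
        bufferedBytes += element.length();
        if (buffer.size() >= maxCount || bufferedBytes >= maxBytes) {
            return flush();
        }
        return null;
    }

    // Drain the buffer (in Beam this would also be driven by a timer,
    // so a slow key still flushes eventually).
    public List<String> flush() {
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        bufferedBytes = 0;
        return batch;
    }

    public static void main(String[] args) {
        Batcher b = new Batcher(3, 1024);
        b.add("a");
        b.add("b");
        List<String> batch = b.add("c");
        System.out.println(batch.size()); // prints 3
    }
}
```

In Beam terms, the buffer would live in a `BagState` keyed by shard, with a processing-time timer providing the eventual flush; the sketch above only shows the threshold logic.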
On Thu, Apr 4, 2019 at 9:42 PM Jeff Klukas <[email protected]> wrote:
>
> As far as I can tell, Beam expects runners to have full control over
> separation of individual elements into bundles and this is something users
> have no control over. Is that true? Or are there any ways that I might exert
> some influence over bundle sizes?
>
> My main interest at the moment is investigating lighter-weight alternatives
> to FileIO for a simple but high-throughput Dataflow job that batches up
> messages from Pub/Sub and sinks them to GCS. I'm imagining a ParDo that
> buffers incoming messages and then writes them all as an object to GCS in a
> @FinalizeBundle method, avoiding the multiple GroupByKey operations needed
> for writing sharded output from FileIO.
>
> The problem is that bundles in practice look to be far too small to make this
> feasible. I deployed a 4 node test job that simply reads ~30k messages per
> second from Pub/Sub and passes them to a transform that publishes some
> metrics about the bundles passing through. I found a mean bundle size of ~500
> elements corresponding to ~10 MB of data, which is too small for the proposed
> approach to be feasible. Are there any tricks I could use to coerce Dataflow
> to increase the size of bundles?
>
> I realize this is basically an abuse of the Beam programming model, but the
> alternative I'm looking at is having to write a custom application using the
> google-cloud APIs and deploying it on Kubernetes.
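For anyone wanting to reproduce the bundle-size measurement described in the quoted mail, the idea is just counters reset in a DoFn's start-bundle hook, updated per element, and recorded at finish-bundle. A standalone sketch of that bookkeeping follows; `BundleStats` and its method names mimic the DoFn lifecycle but are my own illustration, not Beam classes:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of per-bundle metrics: mirror a DoFn's
// startBundle/processElement/finishBundle lifecycle and record each
// bundle's element count and byte size so a mean can be reported.
public class BundleStats {
    private long currentElements = 0;
    private long currentBytes = 0;
    private final List<long[]> bundles = new ArrayList<>();

    public void startBundle() {          // @StartBundle equivalent
        currentElements = 0;
        currentBytes = 0;
    }

    public void processElement(byte[] payload) {  // @ProcessElement equivalent
        currentElements++;
        currentBytes += payload.length;
    }

    public void finishBundle() {         // @FinishBundle equivalent
        bundles.add(new long[] {currentElements, currentBytes});
    }

    public double meanElementsPerBundle() {
        return bundles.stream().mapToLong(b -> b[0]).average().orElse(0);
    }

    public static void main(String[] args) {
        BundleStats stats = new BundleStats();
        stats.startBundle();
        stats.processElement(new byte[10]);
        stats.processElement(new byte[10]);
        stats.finishBundle();
        stats.startBundle();
        stats.processElement(new byte[10]);
        stats.finishBundle();
        System.out.println(stats.meanElementsPerBundle()); // prints 1.5
    }
}
```

In a real pipeline these counts would be published through Beam's Metrics API rather than held in a list, but the arithmetic is the same.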
