On Tue, May 19, 2020 at 4:14 AM Marcin Kuthan wrote:
> I'm looking for the Pubsub publication details on unbounded collections
> when Dataflow runner is used and streaming engine is on.
>
> As I understood correctly the PubsubUnboundedSink transform is overridden
> by internal implementation.
>
> https://lists.apache.org/thread.html/26e2bfdb6eaa7319ea3cc65f9d8a0bfeb7be6a6d88f0167ebad0591d%40%3Cuser.beam.apache.org%3E
>

Only the streaming runner.

> Questions:
>
> 1. Should I expect that parameters maxBatchByteSize, batchSize are
> respected, or Dataflow internal implementation just ignores them?
>

I don't think that Dataflow pays attention to this.

> 2. What about pubsubClientFactory? The default one is
> PubsubJsonClientFactory, and this is somehow important if I want to
> configure maxBatchByteSize under Pubsub 10MB limit. Json factory encodes
> messages using base64, so the limit should be lowered to 10MB * 0.75 (minus
> some safety margin).
>

Similarly, this has no meaning when using Dataflow streaming.

> 3. Should I expect any differences for bounded and unbounded collections?
> There are different defaults in the Beam code: e.g: maxBatchByteSize is
> ~7.5MB for bounded and ~400kB for unbounded collections, batchSize is 100
> for bounded, and 1000 for unbounded. I also don't understand the reasons
> behind default settings.
> 4. How to estimate streaming engine costs for internal shuffling in
> PubsubUnboundedSink, if any? The default PubsubUnboundedSink implementation
> shuffles data before publication but I don't know how it is done by
> internal implementation. And I don't need to know, as long as it does not
> generate extra costs :)
>

The internal implementation does not add any extra cost. Dataflow charges for every MB read from Streaming Engine as a "shuffle" charge, and this includes the records read from PubSub. The external Beam implementation includes a full shuffle, which would be more expensive as it includes both a write and a read.

> Many questions about Dataflow internals but it would be nice to know some
> details, the details important from the performance and costs perspective.
>
> Thanks,
> Marcin
>
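
For reference, the "10MB * 0.75" arithmetic from question 2 can be sketched as below. This is only an illustration of base64 size overhead, not Beam or Dataflow code: the class Base64BatchLimit and its methods are hypothetical names, and the 10,000,000-byte limit is an assumed value for the Pubsub request limit. It does not account for JSON envelope overhead, which is why a further safety margin is still sensible. As noted above, none of this applies to the internal Dataflow streaming implementation.

```java
import java.util.Base64;

// Hypothetical helper (not part of Beam): shows how base64 encoding
// shrinks the usable raw payload under a fixed request size limit.
class Base64BatchLimit {

    // Length of the padded base64 encoding of rawBytes bytes:
    // every group of up to 3 raw bytes becomes 4 encoded characters.
    static long encodedLength(long rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    // Largest raw payload whose padded base64 encoding fits in limitBytes.
    static long maxRawBytes(long limitBytes) {
        return (limitBytes / 4) * 3;
    }

    public static void main(String[] args) {
        long limit = 10_000_000L; // assumed request limit, in bytes
        long raw = maxRawBytes(limit);
        System.out.println("max raw bytes: " + raw);
        System.out.println("encoded bytes: " + encodedLength(raw));

        // Cross-check the formula against the JDK encoder on a small sample.
        byte[] sample = new byte[1000];
        int actual = Base64.getEncoder().encode(sample).length;
        System.out.println("formula matches JDK: "
                + (actual == encodedLength(sample.length)));
    }
}
```

One more raw byte than maxRawBytes(limit) pushes the encoded size past the limit, since base64 output grows in 4-character steps.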
