On Tue, May 19, 2020 at 4:14 AM Marcin Kuthan <[email protected]>
wrote:

> I'm looking for the Pubsub publication details on unbounded collections
> when Dataflow runner is used and streaming engine is on.
>
> If I understand correctly, the PubsubUnboundedSink transform is overridden
> by an internal implementation.
>
>
> https://lists.apache.org/thread.html/26e2bfdb6eaa7319ea3cc65f9d8a0bfeb7be6a6d88f0167ebad0591d%40%3Cuser.beam.apache.org%3E
>

Only by the streaming runner.


>
>
> Questions:
>
> 1. Should I expect that parameters maxBatchByteSize, batchSize are
> respected, or Dataflow internal implementation just ignores them?
>

I don't think the Dataflow internal implementation pays attention to these
parameters.
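For reference, these are the Beam-side knobs in question; a minimal wiring sketch (the topic name is a placeholder), with the caveat above that the Dataflow streaming override does not honor them:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.Create;

public class PublishSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(Create.of("hello"))
        .apply(
            PubsubIO.writeStrings()
                .to("projects/my-project/topics/my-topic") // placeholder topic
                .withMaxBatchSize(1000)            // max messages per publish request
                .withMaxBatchBytesSize(7_500_000)); // stay under the Pub/Sub 10 MB limit
    // p.run() omitted: this is a wiring sketch, not meant to actually publish.
  }
}
```

On Dataflow with Streaming Engine these two hints are effectively documentation of intent; other runners using the Beam PubsubUnboundedSink do batch according to them.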


> 2. What about pubsubClientFactory? The default one is
> PubsubJsonClientFactory, and this is somehow important if I want to
> configure maxBatchByteSize under Pubsub 10MB limit. Json factory encodes
> messages using base64, so the limit should be lowered to 10MB * 0.75 (minus
> some safety margin).
>

Similarly, this has no meaning when using Dataflow streaming.
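That said, on runners where the JSON client is actually used, the headroom arithmetic from the question can be sketched like this (the class and method names are mine, for illustration only):

```java
public class PubsubBatchLimits {
  // Pub/Sub's hard request limit; the 10 MB figure is from the question above.
  static final long PUBSUB_LIMIT_BYTES = 10L * 1024 * 1024;

  // Base64 encodes every 3 payload bytes as 4 output bytes, so only
  // 3/4 of the request budget is usable payload when the JSON client is used.
  static long maxPayloadBytes(long requestLimitBytes) {
    return requestLimitBytes * 3 / 4;
  }

  public static void main(String[] args) {
    long usable = maxPayloadBytes(PUBSUB_LIMIT_BYTES);
    System.out.println("usable payload bytes: " + usable); // 7864320
  }
}
```

A safety margin for per-message JSON envelope overhead (attributes, field names) would come off the top of that 7,864,320-byte figure.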


> 3. Should I expect any differences for bounded and unbounded collections?
> There are different defaults in the Beam code: e.g: maxBatchByteSize is
> ~7.5MB for bounded and ~400kB for unbounded collections, batchSize is 100
> for bounded, and 1000 for unbounded. I also don't understand the reasons
> behind default settings.
> 4. How to estimate streaming engine costs for internal shuffling in
> PubsubUnboundedSink, if any? The default PubsubUnboundedSink implementation
> shuffles data before publication, but I don't know how it is done by the
> internal implementation. And I don't need to know, as long as it does not
> generate extra costs :)
>

The internal implementation does not add any extra cost. Dataflow charges
for every MB read from Streaming Engine as a "shuffle" charge, and this
includes the records read from PubSub. The external Beam implementation
includes a full shuffle, which would be more expensive as it includes both
a write and a read.
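As a back-of-the-envelope comparison of the two paths described above (the per-MB rate here is a hypothetical placeholder, not real Dataflow pricing):

```java
public class ShuffleCostSketch {
  // Internal sink: records are billed once, as part of the normal
  // Streaming Engine read ("shuffle" charge per MB read).
  static double internalCost(double mbRead, double ratePerMb) {
    return mbRead * ratePerMb;
  }

  // External PubsubUnboundedSink: the extra shuffle before publication pays
  // for both a shuffle write and a shuffle read, so roughly twice the data
  // volume is billed. The 2x factor is a rough model, not a quoted price.
  static double externalCost(double mbShuffled, double ratePerMb) {
    return 2 * mbShuffled * ratePerMb;
  }

  public static void main(String[] args) {
    System.out.println(internalCost(1000, 0.5)); // 500.0
    System.out.println(externalCost(1000, 0.5)); // 1000.0
  }
}
```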



> These are many questions about Dataflow internals, but it would be nice to
> know the details that matter from a performance and cost perspective.
>
> Thanks,
> Marcin
>
