Re: Using gRPC with PubsubIO?

2019-01-02 Thread Jeff Klukas
I believe this explains why I have been observing Pubsub write errors (about
messages being too large) in logs for the Dataflow "shuffler" rather than
the workers.

The specific error I saw was about a 7 MB message being too large, once
base64-encoded, to meet Pubsub requirements (10 MB max message size), which
makes me think the Dataflow Pubsub writer was still using JSON rather
than gRPC. But it sounds like this is not configurable from the client, and
Google has full control over the details of how Pubsub writing and reading
work in Dataflow jobs.
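
For context on the numbers above: base64 emits 4 output bytes for every 3
input bytes, so payloads inflate by roughly a third on the wire. A quick
sketch of the arithmetic (plain Java, nothing Beam-specific):

    public class Base64Inflation {
      public static void main(String[] args) {
        // base64 encodes each 3-byte group as 4 characters: ceil(n / 3) * 4
        long payloadBytes = 7_000_000L;                    // the 7 MB message above
        long encodedBytes = ((payloadBytes + 2) / 3) * 4;  // 9,333,336 bytes
        System.out.printf("payload=%d bytes, base64=%d bytes%n",
            payloadBytes, encodedBytes);
        // ~9.33 MB of base64 plus the JSON envelope and any attributes can
        // push a publish request past the 10 MB Pub/Sub cap even though the
        // raw payload is under it.
      }
    }

So a 7 MB payload is already within about 7% of the cap before any JSON
envelope or attribute overhead is counted.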


On Wed, Jan 2, 2019 at 1:04 PM Steve Niemitz wrote:

> Something to consider: if you're running in Dataflow, the entire Pubsub
> read step becomes a noop [1], and the underlying streaming implementation
> itself handles reading from pubsub (either windmill or the streaming
> engine).
>
> [1]
> https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L373
>
> On Wed, Jan 2, 2019 at 12:11 PM Jeff Klukas wrote:
>
>> I see that the Beam codebase includes a PubsubGrpcClient, but there
>> doesn't appear to be any way to configure PubsubIO to use that client over
>> the PubsubJsonClient.
>>
>> There's even a PubsubIO.Read#withClientFactory, but it's marked as for
>> testing only.
>>
>> Is gRPC support something that's still in development? Or am I missing
>> something about how to configure this?
>>
>> I'm particularly interested in using gRPC due to the message size
>> inflation of base64 encoding required for JSON transport. My payloads are
>> all below the 10 MB Pubsub limit, but I need to support some near the top
>> end of that range that are currently causing errors due to base64 inflation.
>>
>
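
As an aside, here is a minimal sketch of what wiring in the gRPC client
through the testing-only hook mentioned above might look like, assuming
PubsubIO.Read#withClientFactory and PubsubGrpcClient.FACTORY are accessible
from user code; since the hook is documented as for testing only, this is
not a supported production configuration:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubGrpcClient;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class GrpcReadSketch {
      public static void main(String[] args) {
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // withClientFactory is marked as for testing only; shown here just to
        // illustrate where the gRPC client would plug in. The subscription
        // name is a placeholder.
        PCollection<PubsubMessage> messages =
            pipeline.apply(
                PubsubIO.readMessages()
                    .fromSubscription("projects/my-project/subscriptions/my-sub")
                    .withClientFactory(PubsubGrpcClient.FACTORY));
        pipeline.run().waitUntilFinish();
      }
    }

Per the note above, on Dataflow this wouldn't change the read path anyway,
since the runner replaces the transform wholesale.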


Re: Using gRPC with PubsubIO?

2019-01-02 Thread Steve Niemitz
Something to consider: if you're running in Dataflow, the entire Pubsub
read step becomes a noop [1], and the underlying streaming implementation
itself handles reading from pubsub (either windmill or the streaming
engine).

[1]
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L373
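
Since the runner-native implementation handles the actual Pub/Sub I/O,
oversize-message failures surface in the service rather than in user code.
One possible workaround (a sketch, assuming Beam's PubsubMessage API; the
10 MB figure and 4/3 inflation factor are from the discussion above) is to
enforce the inflated size in a DoFn ahead of the write step:

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.transforms.DoFn;

    // Drops (or, in a real pipeline, side-outputs) messages whose
    // base64-inflated payload would exceed the Pub/Sub request cap.
    public class EnforcePubsubSize extends DoFn<PubsubMessage, PubsubMessage> {
      private static final long MAX_BYTES = 10_000_000L; // 10 MB Pub/Sub cap

      @ProcessElement
      public void process(ProcessContext c) {
        long inflated = ((c.element().getPayload().length + 2L) / 3) * 4;
        if (inflated <= MAX_BYTES) {
          c.output(c.element());
        }
        // else: route to a dead-letter output rather than dropping silently
      }
    }

Applying this immediately before PubsubIO.write() keeps size failures in
user code, where they can be routed to a dead-letter destination instead of
only surfacing in the runner's logs.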

On Wed, Jan 2, 2019 at 12:11 PM Jeff Klukas wrote:

> I see that the Beam codebase includes a PubsubGrpcClient, but there
> doesn't appear to be any way to configure PubsubIO to use that client over
> the PubsubJsonClient.
>
> There's even a PubsubIO.Read#withClientFactory, but it's marked as for
> testing only.
>
> Is gRPC support something that's still in development? Or am I missing
> something about how to configure this?
>
> I'm particularly interested in using gRPC due to the message size
> inflation of base64 encoding required for JSON transport. My payloads are
> all below the 10 MB Pubsub limit, but I need to support some near the top
> end of that range that are currently causing errors due to base64 inflation.
>