Oh, one other important thing I forgot to mention is that I can't reproduce
(the empty message issue at least) locally on the DirectRunner.

On Wed, Jul 10, 2019 at 6:04 PM Steve Niemitz <[email protected]> wrote:

> Thanks for making JIRAs for these, I was going to, I just wanted to do a
> sanity check first. :)
>
> I reproduced them all with the stock PubsubIO first, and then again with
> the gRPC client.  I can try to throw together a much more minimal repro
> case too.
>
> On Wed, Jul 10, 2019 at 4:21 PM Kenneth Knowles <[email protected]> wrote:
>
>> This is pretty surprising. Seems valuable to file separate Jiras so we
>> can track investigation and resolution.
>>
>>  - use gRPC: https://issues.apache.org/jira/browse/BEAM-7718
>>  - empty message bodies: https://issues.apache.org/jira/browse/BEAM-7716
>>  - watermark tracking: https://issues.apache.org/jira/browse/BEAM-7717
>>
>> You reproduced these with the original PubsubIO?
>>
>> Kenn
>>
>> On Mon, Jul 8, 2019 at 10:38 AM Steve Niemitz <[email protected]>
>> wrote:
>>
>>> I was trying to use the bundled PubsubIO.Read implementation in beam on
>>> dataflow (using --experiments=enable_custom_pubsub_source to prevent
>>> dataflow from overriding it with its own implementation) and ran into some
>>> interesting issues.  I was curious if people have any experience with
>>> these.  I'd assume anyone using PubsubIO on a runner other than dataflow
>>> would have run into the same things.
>>>
>>> - The default implementation uses the HTTP REST API, which seems to be
>>> much less performant than the gRPC implementation.  Is there a reason that
>>> the gRPC implementation is essentially unavailable from the public API?
>>> PubsubIO.Read.withClientFactory is package private.  I worked around this
>>> by making it public and rebuilding, which led me to...
>>>
>>> - Both the JSON and gRPC implementation return empty message bodies for
>>> all messages read (using readMessages).  When running with the
>>> dataflow-specific reader, this doesn't happen and the message bodies have
>>> the content as expected.  I took a pipeline that works as expected on
>>> dataflow using PubsubIO.Read, added the experiment flag, and then my
>>> pipeline broke from empty message bodies.  This obviously blocked me from
>>> really experimenting much more.
>>>
>>> - The watermark tracking seems off.  The dataflow UI was reporting my
>>> watermark as around (but not exactly) the epoch (it was ~1970-01-19), which
>>> makes me wonder if seconds/milliseconds got confused somewhere (ie, if you
>>> take the time since epoch in milliseconds now and interpret it as seconds,
>>> you'll get somewhere around 1970-01-18).
>>>
>>

Reply via email to