Oh, one other important thing I forgot to mention: I can't reproduce these issues (the empty message one at least) locally on the DirectRunner.

On Wed, Jul 10, 2019 at 6:04 PM Steve Niemitz <[email protected]> wrote:

> Thanks for making JIRAs for these, I was going to, I just wanted to do a
> sanity check first. :)
>
> I reproduced them all with the stock PubsubIO first, and then again with
> the gRPC client. I can try to throw together a much more minimal repro
> case too.
>
> On Wed, Jul 10, 2019 at 4:21 PM Kenneth Knowles <[email protected]> wrote:
>
>> This is pretty surprising. Seems valuable to file separate Jiras so we
>> can track investigation and resolution.
>>
>> - use gRPC: https://issues.apache.org/jira/browse/BEAM-7718
>> - empty message bodies: https://issues.apache.org/jira/browse/BEAM-7716
>> - watermark tracking: https://issues.apache.org/jira/browse/BEAM-7717
>>
>> You reproduced these with the original PubsubIO?
>>
>> Kenn
>>
>> On Mon, Jul 8, 2019 at 10:38 AM Steve Niemitz <[email protected]>
>> wrote:
>>
>>> I was trying to use the bundled PubsubIO.Read implementation in Beam on
>>> Dataflow (using --experiments=enable_custom_pubsub_source to prevent
>>> Dataflow from overriding it with its own implementation) and ran into
>>> some interesting issues. I was curious if people have any experience
>>> with these. I'd assume anyone using PubsubIO on a runner other than
>>> Dataflow would have run into the same things.
>>>
>>> - The default implementation uses the HTTP REST API, which seems to be
>>> much less performant than the gRPC implementation. Is there a reason
>>> that the gRPC implementation is essentially unavailable from the public
>>> API? PubsubIO.Read.withClientFactory is package private. I worked
>>> around this by making it public and rebuilding, which led me to...
>>>
>>> - Both the JSON and gRPC implementations return empty message bodies
>>> for all messages read (using readMessages). When running with the
>>> Dataflow-specific reader, this doesn't happen and the message bodies
>>> have the content as expected. I took a pipeline that works as expected
>>> on Dataflow using PubsubIO.Read, added the experiment flag, and then my
>>> pipeline broke from empty message bodies. This obviously blocked me
>>> from really experimenting much more.
>>>
>>> - The watermark tracking seems off. The Dataflow UI was reporting my
>>> watermark as around (but not exactly) the epoch (it was ~1970-01-19),
>>> which makes me wonder if seconds/milliseconds got confused somewhere
>>> (i.e., if you take the time since epoch in seconds now and interpret it
>>> as milliseconds, you'll get somewhere around 1970-01-18).
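
For what it's worth, the seconds/milliseconds theory for the watermark is easy to sanity-check with a few lines of Python. This is only a sketch of the arithmetic, not anything taken from the Beam or Dataflow code: the `sent` timestamp is a hypothetical stand-in for "now" around the time of the original report, and `from_millis` is a helper made up for the illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical "now" near the original report (assumed value, in UTC).
sent = datetime(2019, 7, 8, 14, 38, tzinfo=timezone.utc)

epoch_seconds = int((sent - EPOCH).total_seconds())
epoch_millis = epoch_seconds * 1000


def from_millis(value):
    """Interpret value as milliseconds since the Unix epoch (UTC)."""
    return EPOCH + timedelta(milliseconds=value)


# A millisecond value read as milliseconds round-trips to the same instant.
correct = from_millis(epoch_millis)

# The suspected mix-up: a value in seconds read as if it were milliseconds.
misread = from_millis(epoch_seconds)

print("correct:", correct.isoformat())
print("misread:", misread.isoformat())  # falls on 1970-01-19 in UTC
```

Under these assumptions the misread value lands about 18 days after the epoch, which lines up with the ~1970-01-19 watermark shown in the UI. It does not prove where (or whether) such a conversion happens, only that the numbers are consistent with it.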
