This is pretty surprising. Seems valuable to file separate Jiras so we can track investigation and resolution.
- use gRPC: https://issues.apache.org/jira/browse/BEAM-7718 - empty message bodies: https://issues.apache.org/jira/browse/BEAM-7716 - watermark tracking: https://issues.apache.org/jira/browse/BEAM-7717 You reproduced these with the original PubsubIO? Kenn On Mon, Jul 8, 2019 at 10:38 AM Steve Niemitz <[email protected]> wrote: > I was trying to use the bundled PubsubIO.Read implementation in beam on > dataflow (using --experiments=enable_custom_pubsub_source to prevent > dataflow from overriding it with its own implementation) and ran into some > interesting issues. I was curious if people have any experience with > these. I'd assume anyone using PubsubIO on a runner other than dataflow > would have run into the same things. > > - The default implementation uses the HTTP REST API, which seems to be > much less performant than the gRPC implementation. Is there a reason that > the gRPC implementation is essentially unavailable from the public API? > PubsubIO.Read.withClientFactory is package private. I worked around this > by making it public and rebuilding, which led me to... > > - Both the JSON and gRPC implementation return empty message bodies for > all messages read (using readMessages). When running with the > dataflow-specific reader, this doesn't happen and the message bodies have > the content as expected. I took a pipeline that works as expected on > dataflow using PubsubIO.Read, added the experiment flag, and then my > pipeline broke from empty message bodies. This obviously blocked me from > really experimenting much more. > > - The watermark tracking seems off. The dataflow UI was reporting my > watermark as around (but not exactly) the epoch (it was ~1970-01-19), which > makes me wonder if seconds/milliseconds got confused somewhere (ie, if you > take the time since epoch in milliseconds now and interpret it as seconds, > you'll get somewhere around 1970-01-18). >
