I was trying to use the bundled PubsubIO.Read implementation in Beam on Dataflow (using --experiments=enable_custom_pubsub_source to prevent Dataflow from overriding it with its own implementation) and ran into some interesting issues. I'm curious whether anyone has experience with these; I'd assume anyone using PubsubIO on a runner other than Dataflow would have hit the same things.
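For context, I'm enabling the experiment through the normal options parsing, roughly like this (the project name is just a placeholder):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Sketch of how the flag gets passed; "my-project" is a placeholder.
PipelineOptions options =
    PipelineOptionsFactory.fromArgs(
            "--runner=DataflowRunner",
            "--project=my-project",
            "--experiments=enable_custom_pubsub_source")
        .withValidation()
        .create();
Pipeline p = Pipeline.create(options);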
- The default implementation uses the HTTP/REST API, which seems to be much less performant than the gRPC implementation. Is there a reason the gRPC implementation is essentially unavailable from the public API? PubsubIO.Read.withClientFactory is package-private. I worked around this by making it public and rebuilding, which led me to...
- Both the JSON and gRPC implementations return empty message bodies for every message read (using readMessages). When running with the Dataflow-specific reader this doesn't happen, and the message bodies have the expected content. I took a pipeline that works as expected on Dataflow using PubsubIO.Read, added the experiment flag, and the pipeline broke with empty message bodies (a rough repro sketch follows this list). This obviously blocked me from experimenting much further.
- The watermark tracking seems off. The Dataflow UI was reporting my watermark as around (but not exactly) the epoch (~1970-01-19), which makes me wonder whether seconds and milliseconds got confused somewhere (i.e., if you take the current time in seconds since the epoch and interpret it as milliseconds, you end up somewhere around 1970-01-18).
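For reference, here's a rough sketch of the kind of pipeline I'm testing with. Subscription/project names are placeholders, and the withClientFactory line only compiles against my locally rebuilt SDK where the method has been made public:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class PubsubReadRepro {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    PubsubIO.Read<PubsubMessage> read =
        PubsubIO.readMessages()
            .fromSubscription("projects/my-project/subscriptions/my-subscription");
    // Only possible after making the method public and rebuilding; it is
    // package-private in the released SDK:
    // read = read.withClientFactory(PubsubGrpcClient.FACTORY);

    p.apply("ReadPubsub", read)
        .apply("CheckPayload", ParDo.of(new DoFn<PubsubMessage, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // With --experiments=enable_custom_pubsub_source this prints 0 for
            // every message; with Dataflow's own source the payloads are
            // populated as expected.
            System.out.println("payload bytes: " + c.element().getPayload().length);
          }
        }));

    p.run();
  }
}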
