I was trying to use the bundled PubsubIO.Read implementation in Beam on
Dataflow (using --experiments=enable_custom_pubsub_source to prevent
Dataflow from overriding it with its own implementation) and ran into some
interesting issues.  I'm curious whether anyone has experience with
these.  I'd assume anyone using PubsubIO on a runner other than Dataflow
would have run into the same things.
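
For reference, this is roughly what my setup looks like (the project and
subscription names here are placeholders, not my real ones):

    import java.util.Arrays;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class CustomPubsubSourceExample {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        // Keep Dataflow from swapping in its native Pub/Sub source.
        options.setExperiments(Arrays.asList("enable_custom_pubsub_source"));

        Pipeline p = Pipeline.create(options);
        p.apply(PubsubIO.readMessages()
            .fromSubscription("projects/my-project/subscriptions/my-sub"));
        // ... the rest of the pipeline ...
        p.run();
      }
    }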

- The default implementation uses the HTTP REST API, which seems to be much
less performant than the gRPC implementation.  Is there a reason the gRPC
implementation is essentially unavailable from the public API?
PubsubIO.Read.withClientFactory is package-private.  I worked around this
by making it public and rebuilding, which led me to...
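
To make that concrete, this is roughly the call I mean; I'm assuming
PubsubGrpcClient.FACTORY (in the same package) is the right factory to
pass, and it's only reachable after the rebuild since withClientFactory is
package-private in the released SDK:

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubGrpcClient;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;

    public class GrpcReadExample {
      static PubsubIO.Read<PubsubMessage> grpcRead() {
        return PubsubIO.readMessages()
            .fromSubscription("projects/my-project/subscriptions/my-sub")
            // Package-private upstream; only callable after my local rebuild.
            .withClientFactory(PubsubGrpcClient.FACTORY);
      }
    }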

- Both the JSON and gRPC implementations return empty message bodies for
all messages read (using readMessages).  When running with the
Dataflow-specific reader this doesn't happen, and the message bodies have
the expected content.  I took a pipeline that works as expected on Dataflow
using PubsubIO.Read, added the experiment flag, and the pipeline broke
because of the empty message bodies.  This obviously blocked me from
experimenting much further.
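
The check I'm using to see this is nothing fancy, just a DoFn right after
the read that logs payload sizes, roughly like the sketch below.  With the
Dataflow-specific reader it prints the real sizes; with the experiment flag
it prints 0 for every element:

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    public class PayloadSizeCheck {
      static PCollection<PubsubMessage> logPayloadSizes(
          PCollection<PubsubMessage> messages) {
        return messages.apply("LogPayloadSizes",
            ParDo.of(new DoFn<PubsubMessage, PubsubMessage>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                // Always 0 bytes for me when the bundled source is used.
                System.out.println(
                    "payload bytes: " + c.element().getPayload().length);
                c.output(c.element());
              }
            }));
      }
    }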

- The watermark tracking seems off.  The Dataflow UI was reporting my
watermark as around (but not exactly) the epoch (it was ~1970-01-19), which
makes me wonder if seconds and milliseconds got confused somewhere (i.e.,
if you take the current time since the epoch in seconds and interpret it as
milliseconds, you end up somewhere around 1970-01-18).
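
A quick way to see the scale confusion I suspect (Joda's Instant constructor
takes milliseconds, so feeding it a seconds value lands in mid-January 1970):

    import org.joda.time.Instant;

    public class WatermarkScaleCheck {
      public static void main(String[] args) {
        long nowSeconds = System.currentTimeMillis() / 1000L;
        // Interpreting seconds-since-epoch as milliseconds "rewinds" to
        // roughly 1970-01-18/19, which is about what the Dataflow UI shows.
        System.out.println(new Instant(nowSeconds));
      }
    }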
