Hi there, I'm building streaming pipelines in Beam (using the Google Dataflow runner) with Google Pub/Sub as the message broker. I've run a couple of experiments with a very simple pipeline: consume events from a Pub/Sub subscription, add a timestamp to the message body, and emit the new event to another Pub/Sub topic. I'm using all the default parameters for both producing and consuming messages.
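For reference, the "add a timestamp" step is trivial. Here is a minimal sketch of the per-element logic in plain Python (in the real pipeline it runs inside a Beam DoFn between the Pub/Sub read and write; the JSON field names `create_ts` and `consume_ts` are just placeholders for illustration):

```python
import json
import time


def stamp(message_bytes: bytes) -> bytes:
    """Decode the Pub/Sub message body, add a consume timestamp,
    and re-encode it for the output topic."""
    event = json.loads(message_bytes.decode("utf-8"))
    event["consume_ts"] = time.time()  # hypothetical field name
    return json.dumps(event).encode("utf-8")


# The producer embeds its own create timestamp in the body, so the
# per-event latency is simply the difference between the two fields:
original = json.dumps({"create_ts": time.time(), "payload": "x"}).encode("utf-8")
stamped = json.loads(stamp(original))
latency = stamped["consume_ts"] - stamped["create_ts"]
```

The latencies I quote below are averages of this `latency` value over many events.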
I've noticed pretty high latency when consuming messages from Pub/Sub in Dataflow. My observations show that the average duration between the event create timestamp (set by a simple producer that publishes events to Pub/Sub) and the event consume timestamp (recorded in Dataflow via PubsubIO) is more than 2 seconds. I've published messages at different rates, e.g. 10 msg/sec, 1,000 msg/sec, and 10,000 msg/sec, and the latency never dropped below 2 seconds. That looks really high. I also tried the direct runner, and it shows high latency too.

For comparison, I ran a few experiments with Kafka (a very small Kafka cluster) and the same kind of pipeline: consume from Kafka, add a timestamp, publish to another Kafka topic. There the latency is much lower, about 150 milliseconds on average. I suspect there is some batching in PubsubIO that drives the latency up.

My questions are: what latency should be expected in this kind of scenario? And are there any recommendations for achieving lower latency? I appreciate any help on this! Thank you, Dmitry.
