Hi there,

I'm building streaming pipelines in Beam (using the Google Dataflow runner) and 
using Google Pubsub as a message broker. I've run a couple of experiments with 
a very simple pipeline: consume events from a Pubsub subscription, add a 
timestamp to the message body, and emit the new event to another Pubsub topic. 
I'm using all the default parameters when producing and consuming messages.
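For context, here's roughly what the pipeline looks like (a Python SDK sketch; the project, subscription, and topic names are placeholders, and I'm assuming JSON message bodies):

```python
import json
import time

def add_consume_timestamp(body: bytes) -> bytes:
    """Decode the message body, attach a consume-time timestamp, re-encode."""
    event = json.loads(body)
    event["consume_ts_ms"] = int(time.time() * 1000)  # epoch milliseconds
    return json.dumps(event).encode("utf-8")

# Beam wiring (requires apache-beam[gcp]; resource names are placeholders):
#
# import apache_beam as beam
# from apache_beam.io.gcp.pubsub import ReadFromPubSub, WriteToPubSub
# from apache_beam.options.pipeline_options import PipelineOptions
#
# opts = PipelineOptions(streaming=True)
# with beam.Pipeline(options=opts) as p:
#     (p
#      | "Read" >> ReadFromPubSub(subscription="projects/P/subscriptions/in")
#      | "Stamp" >> beam.Map(add_consume_timestamp)
#      | "Write" >> WriteToPubSub(topic="projects/P/topics/out"))
```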

I've noticed pretty high latency while consuming messages from Pubsub in 
Dataflow. My observations show that the average duration between the event 
create timestamp (set by a simple producer that publishes events to Pubsub) and 
the event consume timestamp (set by Google Dataflow using PubsubIO) is more 
than 2 seconds. I've published messages at different rates, e.g. 10 msg/sec, 
1,000 msg/sec, and 10,000 msg/sec, and the latency never dropped below 2 
seconds, which seems really high. I've also tried the direct runner, and it 
shows high latency too.
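The latency figures above were computed roughly like this (pure-Python sketch; the field names create_ts_ms and consume_ts_ms are my own convention, not anything from PubsubIO):

```python
def avg_latency_ms(events: list[dict]) -> float:
    """Average create->consume latency in milliseconds.

    Each event carries the producer's publish timestamp (create_ts_ms)
    and the timestamp stamped on by the Dataflow pipeline (consume_ts_ms).
    """
    deltas = [e["consume_ts_ms"] - e["create_ts_ms"] for e in events]
    return sum(deltas) / len(deltas)

# Example: three messages observed with roughly 2 s end-to-end latency.
sample = [
    {"create_ts_ms": 1000, "consume_ts_ms": 3100},
    {"create_ts_ms": 2000, "consume_ts_ms": 4050},
    {"create_ts_ms": 3000, "consume_ts_ms": 5150},
]
print(avg_latency_ms(sample))  # 2100.0
```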

I've run a few other experiments with Kafka (a very small Kafka cluster) and 
the same kind of pipeline: consume from Kafka, add a timestamp, publish to 
another Kafka topic. There the latency is much lower: about 150 milliseconds 
on average.

I suspect there is some batching in PubsubIO that causes this high latency.

My questions are: what latency should be expected in this kind of scenario? 
Are there any recommendations for achieving lower latency?

I appreciate any help on this!

Thank you, 
Dmitry.
