Hi,

I've noticed that when we restart our Kafka consumers our consumer lag
metric sometimes looks "weird".

Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH

You can see lag go up around 15:00, when some consumers were restarted.
The "weird" thing is that the lag remains flat!
How could it remain flat if consumers are running? (they have enough juice
to catch up!)

What I think is happening is this:
1) consumers are initially not really lagging
2) consumers get stopped
3) lag grows
4) consumers get started again
5) something shifts around...not sure what...
6) consumers start consuming, and there is actually no lag, but the offsets
written to ZK sometime during 3) never get updated, because after the
restart the consumers are reading from somewhere else, not from the
partition(s) whose lag and offset delta jumped during 3)

Oh, and:
7) Kafka JMX still exposes all offsets, even those for partitions that are
no longer being read, so the consumer lag metric remains constant/flat,
even though consumers are actually not lagging on partitions from which
they are now consuming.
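
To make 6) and 7) concrete, here is a minimal Python sketch of the suspected effect. It is only an illustration of the hypothesis, not how Kafka or any monitoring agent actually computes lag. The class, the function, the topic name and all offset values are hypothetical. The idea: if lag is summed over every partition that still has an offset exposed, one stale partition keeps the total flat, while summing only over the partitions the consumers currently own gives the real picture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionReading:
    """One (hypothetical) offset reading for a topic partition."""
    topic: str
    partition: int
    log_end_offset: int   # latest offset on the broker
    consumer_offset: int  # last offset recorded for the consumer group

    @property
    def lag(self) -> int:
        # Lag is how far the recorded consumer offset trails the log end.
        return max(self.log_end_offset - self.consumer_offset, 0)


def total_lag(readings, owned=None):
    """Sum lag over readings.

    If `owned` is given, count only (topic, partition) pairs in that set,
    i.e. partitions the consumers are currently reading from.
    """
    return sum(
        r.lag
        for r in readings
        if owned is None or (r.topic, r.partition) in owned
    )


# Hypothetical snapshot taken after the restart.
readings = [
    # No longer read by anyone: its offset stopped moving during step 3.
    PartitionReading("events", 0, log_end_offset=1500, consumer_offset=1000),
    # Currently owned and fully caught up.
    PartitionReading("events", 1, log_end_offset=2000, consumer_offset=2000),
]
owned_now = {("events", 1)}

naive = total_lag(readings)                # counts the stale partition too
filtered = total_lag(readings, owned_now)  # counts only owned partitions

print("naive lag:", naive)
print("filtered lag:", filtered)
```

With these made-up numbers the naive sum stays at 500 no matter how fast the consumers run, while the filtered sum is 0. If the hypothesis is right, the flat line in the chart would correspond to the naive sum.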

What bugs me is 7), because it makes the lag info read from JMX look like
it's "lying".

Does this sound crazy or reasonable?

If anyone has any comments/advice/suggestions for what one can do about
this, I'm all ears!

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
