Hi, I've noticed that when we restart our Kafka consumers our consumer lag metric sometimes looks "weird".
Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH

You can see lag go up around 15:00, when some consumers were restarted. The "weird" thing is that the lag then remains flat! How could it remain flat if consumers are running? (They have enough juice to catch up!)

What I think is happening is this:

1) consumers are initially not really lagging
2) consumers get stopped
3) lag grows
4) consumers get started again
5) something shifts around... not sure what...
6) consumers start consuming, and there is actually no lag, but the offsets written to ZK sometime during 3) don't get updated, because after the restart consumers are reading from somewhere else, not from the partition(s) whose lag and offset delta jumped during 3)

Oh, and:

7) Kafka JMX still exposes all offsets, even those for partitions that are no longer being read, so the consumer lag metric remains constant/flat, even though consumers are actually not lagging on the partitions from which they are now consuming.

What bugs me is 7), because the lag info read from JMX looks like it's "lying".

Does this sound crazy or reasonable? If anyone has any comments/advice/suggestions for what one can do about this, I'm all ears!

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
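
P.S. To make the hypothesis in 6) and 7) concrete, here is a minimal Python sketch of how a summed lag metric could stay flat. Everything in it is an assumption for illustration: the offsets are made-up numbers (not taken from the report linked above), and `lag_by_partition`, `total_lag`, and `owned` are hypothetical names, not Kafka APIs. It only shows the arithmetic of a stale committed offset, not what Kafka or JMX actually does internally.

```python
# Hypothetical snapshot after the restart; none of these numbers come from real data.
# log_end:   latest offset per partition on the broker
# committed: last offset written to ZK per partition
log_end = {0: 1000, 1: 1000, 2: 1500}
committed = {0: 1000, 1: 1000, 2: 1000}  # partition 2's offset is stale (never updated)

# Partitions the restarted consumers are actually reading (assumed for illustration).
owned = {0, 1}


def lag_by_partition(log_end, committed):
    """Lag per partition = latest broker offset minus last committed offset."""
    return {p: log_end[p] - committed[p] for p in log_end}


def total_lag(lags, partitions=None):
    """Sum lag over all partitions, or only over the given subset."""
    if partitions is None:
        partitions = lags.keys()
    return sum(lags[p] for p in partitions)


lags = lag_by_partition(log_end, committed)
print("lag over every exposed offset:", total_lag(lags))
print("lag over partitions being read:", total_lag(lags, owned))
```

If the hypothesis is right, the first number is what the flat line on the chart corresponds to, and the second is the lag the running consumers actually have. Filtering by the partitions currently being read is just one possible way to look at it; I don't know whether that ownership info is easy to get at.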