I recently had a problem in production which I believe was a manifestation
of KAFKA-2978 (Topic partition is not sometimes consumed after
rebalancing of consumer group). This is fixed in 0.9.0.1 and we will upgrade
our client soon. However, it made me realise that I didn’t have any monitoring
set up for this. The only relevant metric I can find is
kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
which, if I understand correctly, is the maximum lag across the partitions
that this particular consumer is consuming.
1. If I had been monitoring this, and my consumer had been suffering from the
issue in KAFKA-2978, would I actually have been alerted? In other words, since
the consumer would think it is consuming correctly, would it simply not have
updated the metric?
2. Another way to see offset lag is to run the command
/usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
10.10.1.61:9092 --describe --group consumer_group_name and parse the response.
Is it safe or advisable to do this? I like the fact that it reports the lag of
each partition, although that information is not available if no consumer from
the group is currently consuming.
3. Is there a better way of doing this?
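
For context, the kind of per-partition check I have in mind is roughly the
following. This is only a sketch under assumptions: the function name and the
input dictionaries are hypothetical, and it assumes the committed offset and
log end offset of each partition have already been collected at two points in
time (for example by parsing the output of the command in question 2). It flags
a partition as stalled when it still has lag but its committed offset has not
moved between the two samples, which is what I would expect to see for a
partition that is silently not being consumed.

```python
def find_stalled_partitions(previous, current):
    """Return {(topic, partition): lag} for partitions that look stalled.

    previous, current: hypothetical samples taken some interval apart, each a
    dict mapping (topic, partition) -> (committed_offset, log_end_offset).
    """
    stalled = {}
    for tp, (committed, log_end) in current.items():
        if tp not in previous:
            # No earlier sample to compare against, so nothing can be concluded.
            continue
        prev_committed, _ = previous[tp]
        lag = log_end - committed
        # Messages are waiting, yet the committed offset has not advanced.
        if lag > 0 and committed <= prev_committed:
            stalled[tp] = lag
    return stalled


# Made-up numbers: partition 0 has not progressed, partition 1 has caught up.
earlier = {("events", 0): (100, 150), ("events", 1): (200, 200)}
later = {("events", 0): (100, 180), ("events", 1): (250, 250)}
print(find_stalled_partitions(earlier, later))
```

Unlike a single maximum-lag figure, this compares two samples per partition, so
it would not depend on the consumer itself reporting that something is wrong.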