I'm trying to track down an issue with one of our consumers. There are 4
threads (in 4 separate processes) in the same consumer group, which will
run happily for a few hours before inevitably one of them crashes with the
following exception:

org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be
completed due to group rebalance

This consumer does not use autocommit; instead it commits its own offsets
after a predetermined number of messages. Both the consumer and the broker
are 0.9.0.1.
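
To make the commit scheme concrete, here is a minimal sketch of the
"commit every N messages" logic. CommitEveryN is a hypothetical helper for
illustration, not our actual code and not part of the Kafka client API; the
commit action is passed in as a Runnable, which in a real consumer would
presumably wrap consumer.commitSync().

```java
// Hypothetical helper (not part of the Kafka client): triggers a commit
// after every `interval` processed messages.
final class CommitEveryN {
    private final int interval;
    private final Runnable commitAction;
    private int sinceLastCommit = 0;

    CommitEveryN(int interval, Runnable commitAction) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.interval = interval;
        this.commitAction = commitAction;
    }

    /** Call once per processed message; returns true if a commit ran. */
    boolean onMessageProcessed() {
        sinceLastCommit++;
        if (sinceLastCommit < interval) {
            return false;
        }
        // Assumption: in a real consumer this would call consumer.commitSync(),
        // which is where CommitFailedException would surface. If it throws,
        // the counter is left untouched so the next message retries the commit.
        commitAction.run();
        sinceLastCommit = 0;
        return true;
    }
}
```

With an interval of 3, the first two calls to onMessageProcessed() return
false and the third runs the commit action and returns true.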

From what I've read in other mailing list posts as well as the
documentation, this error seems to indicate that this consumer thread did
not send a heartbeat within session.timeout.ms and was kicked out of the
group by the coordinator.

I added some logging to check on this, and it shows that poll() is called
on the consumer far more often than session.timeout.ms (configured to
30,000 ms, with heartbeat.interval.ms = 1000). poll() is generally called
every few seconds, and the logs for specific consumer instances show no gap
between poll() calls larger than 4-5 seconds in the minutes preceding the
error.
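
For reference, the kind of check described above can be sketched as
follows. PollGapTracker is a hypothetical class written for illustration,
not the exact logging I added; timestamps are passed in explicitly as
milliseconds so the logic is easy to test. Note that it only measures the
time between poll() calls. It cannot tell whether a heartbeat actually
reached the coordinator.

```java
// Hypothetical helper: records the time between successive poll() calls
// and compares the largest gap seen against session.timeout.ms.
final class PollGapTracker {
    private final long sessionTimeoutMs;
    private boolean seenFirstPoll = false;
    private long lastPollMs = 0;
    private long maxGapMs = 0;

    PollGapTracker(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /** Record a poll() at nowMs; returns the gap since the previous poll (0 for the first). */
    long recordPoll(long nowMs) {
        long gap = seenFirstPoll ? nowMs - lastPollMs : 0;
        seenFirstPoll = true;
        lastPollMs = nowMs;
        if (gap > maxGapMs) {
            maxGapMs = gap;
        }
        return gap;
    }

    long maxGapMs() {
        return maxGapMs;
    }

    /** True if any recorded gap was longer than the session timeout. */
    boolean exceededSessionTimeout() {
        return maxGapMs > sessionTimeoutMs;
    }
}
```

In the consumer loop this would be called right before each poll(), for
example tracker.recordPoll(System.nanoTime() / 1_000_000), logging the
returned gap. With a 30,000 ms timeout and gaps of 4-5 seconds,
exceededSessionTimeout() stays false, which matches what the logs show.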

In addition to the exception, the following two messages are also logged
right before the crash:

Marking the coordinator 2147483644 dead.
Error UNKNOWN_MEMBER_ID occurred while committing offsets for group
<group_name>

This also seems to indicate that the consumer exceeded the
session.timeout.ms value, but again poll() appears to be called often enough.

Any idea what could be happening? Happy to provide more details or config
to help diagnose the issue.
