I'm trying to track down an issue with one of our consumers. There are 4 threads (in 4 separate processes) in the same consumer group, which will run happily for a few hours before inevitably one of them crashes with the following exception:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance

This consumer does not use auto-commit; it manages its own commits, committing after a predetermined number of messages. Both the consumer and the broker are 0.9.0.1.

From what I've read in other mailing list posts as well as the documentation, this error seems to indicate that this consumer thread did not send a heartbeat within session.timeout.ms and was kicked out of the group by the coordinator. I added some logging to check on this, and the logging indicates that poll() is called on the consumer much more often than session.timeout.ms (configured to 30,000 ms, with heartbeat.interval.ms = 1000). poll() is generally called every few seconds, and looking at the logs for specific consumer instances, I see no gap between poll() calls larger than 4-5 seconds in the minutes before the error occurs.

In addition to the exception, the following two messages are logged right before the crash:

Marking the coordinator 2147483644 dead.
Error UNKNOWN_MEMBER_ID occurred while committing offsets for group <group_name>

This also seems to indicate that the consumer exceeded session.timeout.ms, but again, poll() seems to be called often enough.

Any idea what could be happening? Happy to provide more details or config to help diagnose the issue.
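
For anyone who wants to reproduce the poll() gap check described above, here is a minimal sketch of one way to do it. It is not the actual logging code from this consumer: PollGapTracker and its methods are hypothetical names, not part of the Kafka client API, and the timestamp is passed in by the caller so the class has no dependency on Kafka.

```java
/**
 * Hypothetical helper (not part of the Kafka client API) that records
 * the time between successive poll() calls and remembers the largest gap.
 */
final class PollGapTracker {
    private final long sessionTimeoutMs;
    private long lastPollMs = -1L; // -1 means no poll has been recorded yet
    private long maxGapMs = 0L;

    PollGapTracker(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /**
     * Record one poll() call at the given timestamp (milliseconds).
     * Returns the gap since the previous recorded call, or 0 for the first call.
     */
    long recordPoll(long nowMs) {
        long gap = (lastPollMs < 0) ? 0L : nowMs - lastPollMs;
        lastPollMs = nowMs;
        if (gap > maxGapMs) {
            maxGapMs = gap;
        }
        return gap;
    }

    /** Largest gap seen so far between two recorded poll() calls. */
    long maxGapMs() {
        return maxGapMs;
    }

    /** True if any recorded gap reached or exceeded the session timeout. */
    boolean exceededSessionTimeout() {
        return maxGapMs >= sessionTimeoutMs;
    }
}
```

The intended usage would be to construct it with the configured session.timeout.ms (30,000 here) and call recordPoll(System.currentTimeMillis()) immediately before every consumer.poll(), logging the returned gap. Measured this way, each gap also covers the message processing and offset commit that happen between two polls.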