I'm trying to track down an issue with one of our consumers. It has 4 threads in the same consumer group, which run happily for a few hours before one of them crashes with the following exception:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance

This consumer does not use autocommit; it manages its own commits. Both the consumer and the broker are 0.9.0.1.

From what I've read in other mailing list posts and in the documentation, this seems to indicate that the consumer thread did not send a heartbeat within session.timeout.ms and was kicked out of the group by the coordinator. I added some logging to check this, and it shows that poll() is called far more often than session.timeout.ms requires (session.timeout.ms = 30000, heartbeat.interval.ms = 1000). The gap between poll() calls is a second or less; on average this consumer calls poll() 2-3 times a second.

In addition to the exception, the following two messages are logged right before the crash:

Marking the coordinator 2147483644 dead.
Error UNKNOWN_MEMBER_ID occurred while committing offsets for group <group_name>

These also seem to indicate that the consumer exceeded session.timeout.ms, but again, poll() appears to be called often enough.

Any idea what could be happening? Happy to provide more details or config to help diagnose the issue.
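
For anyone who wants to reproduce the poll() interval check described above, here is a rough sketch of that kind of logging. It is not the actual code from this consumer: PollGapTracker and its methods are hypothetical names made up for illustration, and it uses only the JDK, so nothing here is part of the Kafka client API. The idea is to record a timestamp right before every poll() call and keep the largest gap seen. Since the 0.9 consumer sends heartbeats from inside poll(), that gap includes message processing and commit time as well.

```java
/**
 * Hypothetical helper (not part of the Kafka client API): tracks the time
 * between consecutive poll() calls so the largest gap can be compared
 * against session.timeout.ms.
 */
final class PollGapTracker {
    private final long sessionTimeoutMs;
    private long lastPollMs = -1; // -1 means no poll() has been recorded yet
    private long maxGapMs = 0;

    PollGapTracker(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /** Call right before each consumer.poll(); returns the gap since the previous call. */
    long recordPoll(long nowMs) {
        long gap = (lastPollMs < 0) ? 0 : nowMs - lastPollMs;
        lastPollMs = nowMs;
        maxGapMs = Math.max(maxGapMs, gap);
        return gap;
    }

    /** Largest gap between two consecutive recorded polls, in milliseconds. */
    long maxGapMs() {
        return maxGapMs;
    }

    /** True if any recorded gap was longer than the session timeout. */
    boolean exceededSessionTimeout() {
        return maxGapMs > sessionTimeoutMs;
    }

    public static void main(String[] args) {
        // 30000 matches the session.timeout.ms value quoted in this post.
        PollGapTracker tracker = new PollGapTracker(30_000);

        // Simulated poll timestamps in ms; the last one follows a long stall.
        long[] pollTimes = {0, 400, 750, 1_200, 32_500};
        for (long t : pollTimes) {
            long gap = tracker.recordPoll(t);
            System.out.println("poll at " + t + " ms, gap " + gap + " ms");
        }
        System.out.println("max gap " + tracker.maxGapMs() + " ms, exceeded session timeout: "
                + tracker.exceededSessionTimeout());
    }
}
```

In a real consumer loop the timestamp would come from System.currentTimeMillis() (or System.nanoTime() converted to ms) taken immediately before poll(). Logging the maximum gap rather than the average matters here, because a single slow iteration can be hidden by an average of 2-3 polls a second.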