It seems, that we discovered a bug: In case if unclean leader election happened, KafkaConsumer may hang up indefinitely
Full version According to documentation, in case if `auto.offset.reset` is set to none or not set, the exception is thrown to a client code, allowing to handle it in a way that client want. In case if one will take a closer look on this mechanism, it will turn out that it is not working. Starting from kafka 2.3 new offset reset negotiation algorithm added (org.apache.kafka.clients.consumer.internals.Fetcher#validateOffsetsAsync) During this validation, Fetcher `org.apache.kafka.clients.consumer.internals.SubscriptionState` is held in `AWAIT_VALIDATION` fetch state. This effectively means that fetch requests are not issued and consumption stopped. In case if unclean leader election is happening during this time, `LogTruncationException` is thrown from future listener in method `validateOffsetsAsync`. The main problem is that this exception (thrown from listener of future) is effectively swallowed by `org.apache.kafka.clients.consumer.internals.AsyncClient#sendAsyncRequest` by this part of code ``` } catch (RuntimeException e) { if (!future.isDone()) { future.raise(e); } } ``` In the end the result is: The only way to get out of AWAIT_VALIDATION and continue consumption is to successfully finish validation, but it can not be finished. However - consumer is alive, but is consuming nothing. The only way to resume consumption is to terminate consumer and start another one. We discovered this situation by means of kstreams application, where valid value of `auto.offset.reset` provided by our code is replaced by `None` value for a purpose of position reset (org.apache.kafka.streams.processor.internals.StreamThread#create). And with kstreams it is even worse, as application may be working, logging warn messages of format `Truncation detected for partition ...,` but data is not generated for a long time and in the end is lost, making kstreams application unreliable. *Did someone saw it already, maybe there are some ways to reconfigure this behavior?* -- Dmitry Sorokin mailto://dmitry.soro...@gmail.com