We seem to have found a bug:
after an unclean leader election, KafkaConsumer may hang
indefinitely.

Full version

According to the documentation, when `auto.offset.reset` is set to
`none`, an exception is thrown to the client code, so that the client can
handle it however it wants.
On closer inspection, however, this mechanism turns out not to work.

Starting with Kafka 2.3, a new offset reset negotiation algorithm was added
(org.apache.kafka.clients.consumer.internals.Fetcher#validateOffsetsAsync).
During this validation, the partition's fetch state in
`org.apache.kafka.clients.consumer.internals.SubscriptionState` is
held at `AWAIT_VALIDATION`.
This effectively means that no fetch requests are issued and consumption
stops.
If an unclean leader election happens during this time, a
`LogTruncationException` is thrown from the future listener in
`validateOffsetsAsync`.
The main problem is that this exception, thrown from the future's listener,
is effectively swallowed
by `org.apache.kafka.clients.consumer.internals.AsyncClient#sendAsyncRequest`,
in this part of the code:
```
} catch (RuntimeException e) {
  if (!future.isDone()) {
    future.raise(e);
  }
}
```
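
To illustrate why that catch block drops the exception, here is a minimal,
self-contained sketch. It does not use the Kafka classes: `SketchFuture`,
`SwallowedListenerSketch`, and the `FetchState` enum are hypothetical
stand-ins, and `IllegalStateException` stands in for
`LogTruncationException`. It assumes, as the behavior described above
suggests, that the future is marked done before its listeners run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for the consumer's request future; not the real class. */
class SketchFuture<T> {
    private final List<Consumer<T>> listeners = new ArrayList<>();
    private boolean done;
    private RuntimeException error;

    void addListener(Consumer<T> listener) { listeners.add(listener); }

    boolean isDone() { return done; }

    RuntimeException error() { return error; }

    /** Assumption: the future is marked done first, then listeners run synchronously. */
    void complete(T value) {
        done = true;
        for (Consumer<T> listener : listeners) {
            listener.accept(value);
        }
    }

    void raise(RuntimeException e) {
        done = true;
        error = e;
    }
}

class SwallowedListenerSketch {
    enum FetchState { AWAIT_VALIDATION, FETCHING }

    /** Runs one validation round and returns the fetch state observed afterwards. */
    static FetchState run(boolean truncationDetected) {
        FetchState[] state = { FetchState.AWAIT_VALIDATION };
        SketchFuture<String> future = new SketchFuture<>();
        future.addListener(response -> {
            if (truncationDetected) {
                // Stands in for LogTruncationException thrown by the listener.
                throw new IllegalStateException("truncation detected");
            }
            state[0] = FetchState.FETCHING;
        });
        try {
            future.complete("validation response");
        } catch (RuntimeException e) {
            // Same shape as the quoted catch block: the future is already done,
            // so the exception is neither raised on the future nor rethrown.
            if (!future.isDone()) {
                future.raise(e);
            }
        }
        return state[0];
    }

    public static void main(String[] args) {
        System.out.println("clean validation: " + run(false));
        System.out.println("truncation:       " + run(true));
    }
}
```

In this sketch the clean round ends in `FETCHING`, while the truncation
round stays in `AWAIT_VALIDATION` and the caller never sees the exception.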

The end result: the only way to leave `AWAIT_VALIDATION` and continue
consumption is to finish validation successfully, but validation can never
finish.
The consumer stays alive but consumes nothing. The only way to
resume consumption is to terminate the consumer and start another one.

We discovered this with a Kafka Streams application, where the valid
`auto.offset.reset` value provided by our code is replaced
with `None` for the purpose of position reset
(org.apache.kafka.streams.processor.internals.StreamThread#create).
With Kafka Streams it is even worse: the application may appear to be
working, logging warn messages of the form
`Truncation detected for partition ...,` but no data is produced for a long
time and in the end it is lost, which makes the Kafka Streams application
unreliable.

*Has anyone seen this already? Is there a way to reconfigure this
behavior?*
-- 
Dmitry Sorokin
mailto://dmitry.soro...@gmail.com
