Hi Adrien,

Thanks for the report!
I had some questions about your observations:

> Upon waking, the thread simply continues its sleep operation without any
> awareness that a significant amount of time may have passed.

Were you able to take a stack trace, attach a debugger to see the state of
the thread/variables, or check logging to confirm this?

time.sleep() ultimately calls Thread.sleep(), which delegates to the JVM. If
there is a single Thread.sleep() invocation which never returns after the
system hibernates, I don't think there is much that Kafka can do to mitigate
that, and it would affect many more systems than just the credential refresh.
It's worth ruling this out with some more investigation.

What seems more likely to me is one of the following:
* The thread is exiting, possibly due to a thread interruption
* The thread is trying to refresh but is unable to complete it successfully
  for some reason
* The thread erroneously computes the refresh time to still be in the future
  and continues to sleep with additional Thread.sleep() calls that behave as
  expected

With regard to the last point, maybe we're running into this:
https://issues.apache.org/jira/browse/KAFKA-7945
which applies a fallback 10 minute delay. There are some pretty detailed logs
on that code path that could explain more about the state of the refresh
thread.

> Do you consider this behavior a bug that should be addressed? And would you
> recommend creating a KIP for this issue?

If this is a bug in Kafka and not in the JVM, and the fix is reasonable and
backwards-compatible, we can proceed without a KIP, just as a normal bug fix.
Please create a JIRA ticket with the results of your investigation, and if
you're interested, assign it to yourself and try to work out a solution.

Thanks,
Greg

On Thu, Mar 13, 2025 at 10:16 AM Adrien Wattez <adrien.wat...@gmail.com> wrote:

> Hello Kafka community,
>
> I've encountered an issue with OAuth authentication in Kafka when running
> on a system that goes to sleep/hibernates. I believe I've identified a flaw
> in the token refresh mechanism that affects reliability in certain
> environments.
>
> When using OAuth authentication between brokers and controllers, the token
> refresh mechanism fails after system sleep/hibernation, causing all
> authentication to fail until the service is restarted.
>
> I observed this on my Confluent Platform setup running on a MacBook:
>
> * OAuth token was set to refresh at 18:31:29
> * System went to sleep at 18:19
> * System woke up at 18:53, after the tokens had expired at 18:42
> * No refresh login attempt occurred after wakeup
> * All authentication failed with expired tokens
>
> After reviewing the ExpiringCredentialRefreshingLogin class code, I can
> see the issue stems from how the refresh thread sleeps until the next
> scheduled refresh time:
>
> log.info("[Principal={}]: Expiring credential re-login sleeping until: {}",
>         principalLogText(), new Date(nextRefreshMs));
> time.sleep(nextRefreshMs - nowMs);
>
> When the system goes to sleep, this thread's execution is suspended. Upon
> waking, the thread simply continues its sleep operation without any
> awareness that a significant amount of time may have passed. There's no
> mechanism to detect that the planned refresh window has been missed due to
> system suspension.
>
> I understand that hibernating a Kafka cluster isn't a common production
> scenario, and this issue might not affect many users in production
> environments. However, I believe this vulnerability in the token refresh
> mechanism could be problematic in certain scenarios like development
> environments, containerized setups, or any situation where process
> suspension might occur.
>
> Do you consider this behavior a bug that should be addressed? And would
> you recommend creating a KIP for this issue?
>
> I'm asking because while this might be a niche case for production, the
> authentication failure is particularly frustrating in development
> environments as it requires a full cluster restart to resolve.
>
> Thanks for your help,
>
> Adrien
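
For illustration only: if the investigation does point at a single long
sleep behaving badly across suspension, one possible mitigation is to sleep
in bounded slices and re-read the wall clock after each slice. Below is a
minimal sketch of that idea, not Kafka's current code and not a proposed
patch. The names BoundedSleeper, Sleeper and sleepUntil are hypothetical and
do not exist in ExpiringCredentialRefreshingLogin; the clock and the sleep
call are passed in so that the loop can be exercised with a fake clock.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper (not part of Kafka): sleeps in bounded slices. */
final class BoundedSleeper {

    /** Hypothetical stand-in for Thread.sleep(), so tests can fake it. */
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    /**
     * Blocks until clock.getAsLong() >= deadlineMs, never requesting more
     * than maxSliceMs from a single sleep call. The wall clock is re-read
     * after every slice, so a jump in wall-clock time (for example after
     * system suspension) is noticed at the next slice boundary.
     *
     * @return the number of slices that were slept
     */
    static int sleepUntil(long deadlineMs, long maxSliceMs,
                          LongSupplier clock, Sleeper sleeper)
            throws InterruptedException {
        if (maxSliceMs <= 0) {
            throw new IllegalArgumentException("maxSliceMs must be positive");
        }
        int slices = 0;
        while (true) {
            long remainingMs = deadlineMs - clock.getAsLong();
            if (remainingMs <= 0) {
                return slices; // deadline reached or already missed
            }
            sleeper.sleep(Math.min(remainingMs, maxSliceMs));
            slices++;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Real clock and real Thread.sleep(): wait ~300 ms in 100 ms slices.
        long deadlineMs = System.currentTimeMillis() + 300L;
        int slices = sleepUntil(deadlineMs, 100L,
                System::currentTimeMillis, Thread::sleep);
        System.out.println("Deadline reached after " + slices + " slice(s)");
    }
}
```

With a one-minute slice, a loop like this would be expected to notice a
missed refresh deadline within roughly a minute of wakeup, at the cost of a
few extra wakeups per refresh interval. Whether it is needed at all depends
on what the stack trace and logs show.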