Hello Kafka community,

I've encountered an issue with OAuth authentication in Kafka when running on a 
system that sleeps or hibernates. I believe I've identified a flaw in the 
token refresh mechanism that affects reliability in certain environments.

When using OAuth authentication between brokers and controllers, the token 
refresh mechanism fails after system sleep/hibernation, causing all 
authentication to fail until the service is restarted.

I observed this on my Confluent Platform setup running on a MacBook:

- OAuth token was set to refresh at 18:31:29
- System went to sleep at 18:19
- System woke up at 18:53, after the tokens had expired at 18:42
- No refresh login attempt occurred after wakeup
- All authentication failed with expired tokens

After reviewing the ExpiringCredentialRefreshingLogin class code, I can see the 
issue stems from how the refresh thread sleeps until the next scheduled refresh 
time:

    log.info("[Principal={}]: Expiring credential re-login sleeping until: {}",
            principalLogText(), new Date(nextRefreshMs));
    time.sleep(nextRefreshMs - nowMs);

When the system goes to sleep, this thread's execution is suspended. Upon 
waking, the thread simply continues its sleep operation without any awareness 
that a significant amount of time may have passed. There's no mechanism to 
detect that the planned refresh window has been missed due to system suspension.
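
To illustrate one possible direction (a rough sketch only, not the actual Kafka 
code and not a tested patch), the single long sleep could be split into short 
naps, with the wall clock re-read after each one. A suspension would then delay 
the refresh by at most one nap after wakeup. The names WakeAwareSleep, Sleeper, 
sleepUntil and MAX_NAP_MS are hypothetical and chosen only for this example:

```java
import java.util.function.LongSupplier;

public final class WakeAwareSleep {
    /** Hypothetical upper bound for a single nap; not a Kafka setting. */
    public static final long MAX_NAP_MS = 10_000L;

    /** Minimal stand-in for whatever performs the blocking sleep. */
    public interface Sleeper {
        void sleep(long ms) throws InterruptedException;
    }

    /**
     * Blocks until the wall clock reaches deadlineMs, napping at most
     * MAX_NAP_MS at a time and re-reading the clock after every nap.
     * Returns the number of naps taken. If interrupted, restores the
     * interrupt flag and returns early.
     */
    public static long sleepUntil(long deadlineMs, LongSupplier wallClockMs, Sleeper sleeper) {
        long naps = 0;
        long remainingMs;
        while ((remainingMs = deadlineMs - wallClockMs.getAsLong()) > 0) {
            try {
                sleeper.sleep(Math.min(remainingMs, MAX_NAP_MS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return naps;
            }
            naps++;
        }
        return naps;
    }

    public static void main(String[] args) {
        long deadline = System.currentTimeMillis() + 50;
        sleepUntil(deadline, System::currentTimeMillis, Thread::sleep);
        System.out.println("deadline reached: " + (System.currentTimeMillis() >= deadline));
    }
}
```

With something like this, if the machine is suspended during a nap, the next 
clock read after wakeup shows that the deadline has already passed, so the loop 
exits and the refresh can be attempted right away instead of waiting out the 
rest of the original sleep duration.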

I understand that hibernating a Kafka cluster isn't a common production 
scenario, and this issue might not affect many users in production 
environments. However, I believe this weakness in the token refresh 
mechanism could be problematic in scenarios such as development 
environments, containerized setups, or any situation where process suspension 
might occur.

Do you consider this behavior a bug that should be addressed? And would you 
recommend creating a KIP for this issue?

I'm asking because while this might be a niche case for production, the 
authentication failure is particularly frustrating in development environments 
as it requires a full cluster restart to resolve.

Thanks for your help,

Adrien
