Re: [DISCUSS] KIP-1288 SSL Hot Reload for Kafka Clients
Hello,

I've updated the KIP. If there are no further comments, when can we proceed to the vote?

KIP: https://cwiki.apache.org/confluence/x/to08G

Regards,
Skander

On Mon, Mar 9, 2026 at 1:14 AM Jakub Scholz wrote:
> [...]
Re: [DISCUSS] KIP-1288 SSL Hot Reload for Kafka Clients
*> That said, I’m not sure I fully understand why you mentioned that this would not work with mTLS and would only be useful for reloading server certificates.*

So imagine the following scenario:
* You use mTLS on the internal or control plane listeners used by the Kafka nodes to talk with each other.
* That means that the truststore needs to contain the CA that is used to sign all the server certificates of the other Kafka nodes. And the keystore needs to have a server/client key that is signed by a CA in the truststore of the other Kafka nodes.
* Without that, the communication within your cluster would fall apart.

So, when I need to move to a new CA, what do I need to do?
* First, roll the new CA alongside the old CA into the truststore of all the Kafka nodes.
* Once I know for certain that all of the nodes trust both the old and new CA, I can roll out the new server certificates.
* At this point, the cluster still works, because the nodes using the old server certificate are trusted through the old CA, and the nodes using the new server certificate are trusted through the new CA.
* Only once I'm 100% sure that all Kafka nodes use the new server certificate can I remove the old CA from the truststores.

Doing this step by step is important because at any point in time, things still work fine, and any random restart and so on will not break anything.

I do not think this is necessarily a rare scenario. Using private CAs is common - especially with mTLS. And I do think there is a demand for short-lived CAs, for example because certificate revocation is hard. Sure, they won't be 100 minutes short-lived. But, for example, 15 days short-lived.

Obviously, as I said, not everyone might need this. So while that might be a limitation for some users, others would not care.

*> How about a metric for the sha256 hash of the contents of the truststore?*
*> Since the hash is 256 bit wide, we can split it into 4x64bit (long) chunks and have 4 "tags" on the metric, one for each chunk. That way we limit the cardinality of the metric to 4. What do you think?*

I think that would work, yes. We could query the metrics through JMX or something to get the value and compare it. That would allow us to integrate it.

Thanks & Regards
Jakub

On Thu, Mar 5, 2026 at 11:23 AM Skander Soltane wrote:
> Hello Jakub, Gaurav,
>
> Thank you both for your feedback.
> [...]
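The staged CA rollout described in this message depends on two checks: "every node trusts the new CA" before new certificates go out, and "every node uses a certificate issued by the new CA" before the old CA is removed. A minimal sketch of those two gates follows. The class `CaRolloutGate`, its method names, and the idea that each node reports CA fingerprints as strings are all assumptions for illustration; how an orchestrator would actually obtain that state is exactly the open question in this thread.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical orchestration helper; not part of the KIP or of Kafka. */
final class CaRolloutGate {

    /**
     * Gate for rolling out new server certificates: true only when every
     * known node reports the given CA fingerprint in its truststore.
     * An empty view of the cluster is treated as "not safe".
     */
    static boolean allTrust(Map<String, Set<String>> trustedCasByNode, String caFingerprint) {
        return !trustedCasByNode.isEmpty()
            && trustedCasByNode.values().stream().allMatch(cas -> cas.contains(caFingerprint));
    }

    /**
     * Gate for removing the old CA: true only when every known node reports
     * that its current certificate was issued by the given CA.
     */
    static boolean allIssuedBy(Map<String, String> issuerByNode, String caFingerprint) {
        return !issuerByNode.isEmpty()
            && issuerByNode.values().stream().allMatch(caFingerprint::equals);
    }
}
```

With this shape, the old CA would only be dropped after `allIssuedBy(issuers, newCa)` returns true, which mirrors the last bullet of the procedure.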
RE: Re: [DISCUSS] KIP-1288 SSL Hot Reload for Kafka Clients
Hello Skander,

I think KIP-687, while accepted, never got implemented, so this is very much an issue on the broker side as well. Perhaps this can supersede it?

I also feel we should adopt some sort of last-run-at and failure metrics, as they would be useful for operators of both brokers and clients to alert on.

Regards,
Gaurav

On 2026/03/05 10:23:23 Skander Soltane wrote:
> Hello Jakub, Gaurav,
>
> Thank you both for your feedback.
> [...]
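To make the last-run-at and failure metrics suggestion concrete, here is a rough sketch of the bookkeeping a reloader could keep. `ReloadStats` and its methods are hypothetical names, and the sketch deliberately does not touch Kafka's metrics classes; wiring these two values into real metrics is left open.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bookkeeping for reload attempts; not Kafka's metrics API. */
final class ReloadStats {
    private final AtomicLong lastRunAtMs = new AtomicLong(-1L); // -1 means "never ran"
    private final AtomicLong failures = new AtomicLong();

    /** Runs one reload attempt, recording when it ran and whether it failed. */
    void record(Runnable reload, long nowMs) {
        lastRunAtMs.set(nowMs);
        try {
            reload.run();
        } catch (RuntimeException e) {
            // A failed reload is counted; the previous SSL material stays in use.
            failures.incrementAndGet();
        }
    }

    long lastRunAtMs() {
        return lastRunAtMs.get();
    }

    long failureCount() {
        return failures.get();
    }
}
```

An operator could then alert when `lastRunAtMs()` falls too far behind the wall clock or when `failureCount()` increases.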
Re: [DISCUSS] KIP-1288 SSL Hot Reload for Kafka Clients
Hello Jakub, Gaurav,

Thank you both for your feedback.

On Wed, Mar 4, 2026 at 8:23 PM Gaurav Narula wrote:

> Hi Skander and Jakub,
>
> Please find my comments inline
>
> > On 4 Mar 2026, at 17:58, Jakub Scholz wrote:
> >
> > Hi Skander,
> >
> > Thanks for the KIP. Here are some of my thoughts on it ...
> >
> > I think using a poller instead of the WatchService is a good choice. In the previous KIP (KIP-1119), this was my main concern about why it would not work.
> >
> > However, are you sure that Files.getLastModifiedTime() will work on Kubernetes with something like a mounted ConfigMap or Secret? The file itself is a symlink, and its dates do not change when a Secret is updated. At least when observed with something like bash's stat command. Only the dates of the file that the symlink points to change. So, off the top of my head, I'm not sure which timestamp Java would give you (I haven't tried it, to be honest - I'm just wondering if you did and if it really works). If the timestamp doesn't work, maybe one can just read the content of the file and store some checksum to compare it with in the next check?
>
> GN1: I think `Files.getLastModifiedTime()` has an overload for accepting LinkOption and if none is passed it follows symlinks.
> We should be fine as long as the timestamp for the file that the symlink points to is updated.

I think Gaurav is correct. According to the JavaDoc of getLastModifiedTime, “By default, symbolic links are followed and the file attribute of the final target of the link is read.”

In the Kubernetes setup I used to validate my work on the Kafka client, the PKCS#12 keystore and truststore are mounted via a volume, but they are actually generated from Vault Secrets Operator (VSO) secrets exposed in another volume. A sidecar container is responsible for creating the stores from the PEM files mounted by VSO and regenerating them whenever VSO rotates the certificates.

That said, you raise a valid point: if the stores were mounted directly from Kubernetes Secrets or ConfigMaps, would relying on getLastModifiedTime (which follows symbolic links to the final target) still be reliable? This needs to be validated.

If it proves reliable in that scenario, all the better. Otherwise, I can switch to computing and comparing a checksum of the files instead and update the KIP accordingly.

> > The other part of my comments in KIP-1119 was more about the usability for something like Strimzi. I do not think the debounce interval really solves the issue for us. With Kafka, you have a distributed system with:
> > * Multiple controllers
> > * Multiple brokers
> > * Additional components (e.g., an Operator, Cruise Control, etc.)
> >
> > So when I need to, for example, roll out a new Certificate Authority, and I use mTLS authentication, I have to:
> > * First, roll out the trust to the new CA to all the components
> > * Only once all components trust the new CA, I can start rolling out the new server/user certificates
> > * Once the new user and server certificates are used by all components, I can remove the old CA
> >
> > But the debounce interval works only locally within a single Kafka node. So while it allows me to safely reload the certificates within the node, which is good, it does not help me with the understanding of the state on the other nodes. To be able to orchestrate the whole system, I need a way to find out if it has been reloaded in order to proceed with the next steps. For example, open a TCP connection and sniff the actual TLS configuration. But that is pretty ugly, and leaves a mess in the logs and so on.
> >
> > Don't get me wrong. I think this is a useful KIP, and I guess that in many cases - especially when running things manually - it would work fine. It would also work fine for reloading server certificates only, without mTLS. Which is a useful feature as well, with CAs such as Let's Encrypt shortening the validity period of their server certificates.
> >
> > But for an automated solution like Strimzi, the main missing feature for the hot-reloading of certificates is not about the auto-reload being done by Kafka. It is an API that would tell us what is the current state of the system in order to orchestrate more complicated things.
>
> GN2: I think that's a good point and perhaps a pain shared by a few, as CAs are usually very long-lived (on the order of years).
> I do agree it would be useful to have an "API" to see the state of the system. How about a metric for the sha256 hash of the contents of the truststore?
> Since the hash is 256 bit wide, we can split it into 4x64bit (long) chunks and have 4 "tags" on the metric, one for each chunk. That way we limit the cardinality of the metric to 4. What do you think?

Jakub, I see your point about the limitations in setups like Strimzi. However, as Gaurav mentioned, in most cases the CA tend
Re: [DISCUSS] KIP-1288 SSL Hot Reload for Kafka Clients
Hi Skander and Jakub,

Please find my comments inline

> On 4 Mar 2026, at 17:58, Jakub Scholz wrote:
>
> Hi Skander,
>
> Thanks for the KIP. Here are some of my thoughts on it ...
>
> I think using a poller instead of the WatchService is a good choice. In the previous KIP (KIP-1119), this was my main concern about why it would not work.
>
> However, are you sure that Files.getLastModifiedTime() will work on Kubernetes with something like a mounted ConfigMap or Secret? The file itself is a symlink, and its dates do not change when a Secret is updated. At least when observed with something like bash's stat command. Only the dates of the file that the symlink points to change. So, off the top of my head, I'm not sure which timestamp Java would give you (I haven't tried it, to be honest - I'm just wondering if you did and if it really works). If the timestamp doesn't work, maybe one can just read the content of the file and store some checksum to compare it with in the next check?

GN1: I think `Files.getLastModifiedTime()` has an overload for accepting LinkOption and if none is passed it follows symlinks.
We should be fine as long as the timestamp for the file that the symlink points to is updated.

> The other part of my comments in KIP-1119 was more about the usability for something like Strimzi. I do not think the debounce interval really solves the issue for us. With Kafka, you have a distributed system with:
> * Multiple controllers
> * Multiple brokers
> * Additional components (e.g., an Operator, Cruise Control, etc.)
>
> So when I need to, for example, roll out a new Certificate Authority, and I use mTLS authentication, I have to:
> * First, roll out the trust to the new CA to all the components
> * Only once all components trust the new CA, I can start rolling out the new server/user certificates
> * Once the new user and server certificates are used by all components, I can remove the old CA
>
> But the debounce interval works only locally within a single Kafka node. So while it allows me to safely reload the certificates within the node, which is good, it does not help me with the understanding of the state on the other nodes. To be able to orchestrate the whole system, I need a way to find out if it has been reloaded in order to proceed with the next steps. For example, open a TCP connection and sniff the actual TLS configuration. But that is pretty ugly, and leaves a mess in the logs and so on.
>
> Don't get me wrong. I think this is a useful KIP, and I guess that in many cases - especially when running things manually - it would work fine. It would also work fine for reloading server certificates only, without mTLS. Which is a useful feature as well, with CAs such as Let's Encrypt shortening the validity period of their server certificates.
>
> But for an automated solution like Strimzi, the main missing feature for the hot-reloading of certificates is not about the auto-reload being done by Kafka. It is an API that would tell us what is the current state of the system in order to orchestrate more complicated things.

GN2: I think that's a good point and perhaps a pain shared by a few, as CAs are usually very long-lived (on the order of years).
I do agree it would be useful to have an "API" to see the state of the system. How about a metric for the sha256 hash of the contents of the truststore?
Since the hash is 256 bit wide, we can split it into 4x64bit (long) chunks and have 4 "tags" on the metric, one for each chunk. That way we limit the cardinality of the metric to 4. What do you think?

Regards,
Gaurav

> Thanks & Regards
> Jakub
>
> On Sat, Feb 21, 2026 at 3:58 PM Skander Soltane wrote:
>
>> Hi all,
>>
>> I'd like to start a discussion on a new KIP for SSL hot reload on the client side.
>>
>> You can find the KIP here:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1288%3A+SSL+Hot+Reload+for+Kafka+Clients
>>
>> I also drafted a PR implementing the KIP as I imagined it:
>> https://github.com/apache/kafka/pull/21488
>>
>> I'd love to hear your thoughts, especially on the polling approach vs WatchService, the debounce mechanism, and whether the registry design makes sense to you.
>>
>> Thank you!
>> Skander
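As an illustration of the GN2 idea, the sketch below hashes the truststore bytes with SHA-256 and splits the 32-byte digest into four 64-bit values. `TruststoreFingerprint`, `chunks` and `toHex` are hypothetical names, and the big-endian chunk order is an assumption; how the four values would be attached to a Kafka metric is not shown because the thread does not settle it.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper; not part of the KIP or of Kafka. */
final class TruststoreFingerprint {

    /** Returns the SHA-256 digest of the input as four longs (big-endian order). */
    static long[] chunks(byte[] truststoreBytes) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-256").digest(truststoreBytes);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
        ByteBuffer buffer = ByteBuffer.wrap(digest); // ByteBuffer is big-endian by default
        long[] result = new long[4];
        for (int i = 0; i < result.length; i++) {
            result[i] = buffer.getLong(); // consumes 8 bytes per chunk
        }
        return result;
    }

    /** Reassembles the four chunks into the usual 64-character hex digest. */
    static String toHex(long[] chunks) {
        StringBuilder sb = new StringBuilder(64);
        for (long chunk : chunks) {
            sb.append(String.format("%016x", chunk));
        }
        return sb.toString();
    }
}
```

An external tool could rebuild the hex digest from the four values with `toHex` and compare it against the output of a standard SHA-256 utility run on the truststore file it rolled out.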
Re: [DISCUSS] KIP-1288 SSL Hot Reload for Kafka Clients
Hi Skander,

Thanks for the KIP. Here are some of my thoughts on it ...

I think using a poller instead of the WatchService is a good choice. In the previous KIP (KIP-1119), this was my main concern about why it would not work.

However, are you sure that Files.getLastModifiedTime() will work on Kubernetes with something like a mounted ConfigMap or Secret? The file itself is a symlink, and its dates do not change when a Secret is updated. At least when observed with something like bash's stat command. Only the dates of the file that the symlink points to change. So, off the top of my head, I'm not sure which timestamp Java would give you (I haven't tried it, to be honest - I'm just wondering if you did and if it really works). If the timestamp doesn't work, maybe one can just read the content of the file and store some checksum to compare it with in the next check?

The other part of my comments in KIP-1119 was more about the usability for something like Strimzi. I do not think the debounce interval really solves the issue for us. With Kafka, you have a distributed system with:
* Multiple controllers
* Multiple brokers
* Additional components (e.g., an Operator, Cruise Control, etc.)

So when I need to, for example, roll out a new Certificate Authority, and I use mTLS authentication, I have to:
* First, roll out the trust to the new CA to all the components
* Only once all components trust the new CA, I can start rolling out the new server/user certificates
* Once the new user and server certificates are used by all components, I can remove the old CA

But the debounce interval works only locally within a single Kafka node. So while it allows me to safely reload the certificates within the node, which is good, it does not help me with the understanding of the state on the other nodes. To be able to orchestrate the whole system, I need a way to find out if it has been reloaded in order to proceed with the next steps. For example, open a TCP connection and sniff the actual TLS configuration. But that is pretty ugly, and leaves a mess in the logs and so on.

Don't get me wrong. I think this is a useful KIP, and I guess that in many cases - especially when running things manually - it would work fine. It would also work fine for reloading server certificates only, without mTLS. Which is a useful feature as well, with CAs such as Let's Encrypt shortening the validity period of their server certificates.

But for an automated solution like Strimzi, the main missing feature for the hot-reloading of certificates is not about the auto-reload being done by Kafka. It is an API that would tell us what is the current state of the system in order to orchestrate more complicated things.

Thanks & Regards
Jakub

On Sat, Feb 21, 2026 at 3:58 PM Skander Soltane wrote:

> Hi all,
>
> I'd like to start a discussion on a new KIP for SSL hot reload on the client side.
>
> You can find the KIP here:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1288%3A+SSL+Hot+Reload+for+Kafka+Clients
>
> I also drafted a PR implementing the KIP as I imagined it:
> https://github.com/apache/kafka/pull/21488
>
> I'd love to hear your thoughts, especially on the polling approach vs WatchService, the debounce mechanism, and whether the registry design makes sense to you.
>
> Thank you!
> Skander
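The checksum fallback suggested in this message could look roughly like the sketch below. `StoreChangeDetector` is a hypothetical class and does not come from the KIP or the PR. It reads the store on every poll and compares SHA-256 digests, so it does not depend on which timestamp a symlinked mount reports; `Files.readAllBytes` opens the path with default options, which follow symbolic links.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical content-based change detector; not part of the KIP or of Kafka. */
final class StoreChangeDetector {
    private final Path store;
    private byte[] lastDigest; // null until the first check

    StoreChangeDetector(Path store) {
        this.store = store;
    }

    /**
     * Returns true when the file content differs from the previous call.
     * The first call always returns true because there is nothing to compare with.
     */
    boolean hasChanged() throws IOException {
        byte[] current = sha256(Files.readAllBytes(store));
        boolean changed = !Arrays.equals(current, lastDigest);
        lastDigest = current;
        return changed;
    }

    private static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

The trade-off is that every poll reads the whole file, which should be acceptable for keystores and truststores of typical size but costs more than a timestamp lookup.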
