Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Hi all, Two quick updates on KIP-507: 1) According to https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#Mac, "With some MAC algorithms, the (secret-)key algorithm associated with the (secret-)key object used to initialize the Mac object does not matter (this is the case with the HMAC-MD5 and HMAC-SHA1 implementations of the SunJCE provider). With others, however, the (secret-)key algorithm does matter, and an InvalidKeyException is thrown if a (secret-)key object with an inappropriate (secret-)key algorithm is used.". Based on this, I'm inclined to keep the two separate configurations for key generation and signature algorithms; if there happens to be a use case out there that requires, for example, the DES key generation algorithm in conjunction with the HmacSHA256 MAC algorithm, I'd hate for someone to have to file a follow-up KIP just to add that single configuration to their Connect cluster. These will all be low-importance configurations and although unnecessary configs should be avoided to prevent documentation bloat, I believe the potential benefits of keeping these two separate outweigh the costs. 2) I've renamed the proposed configurations to try to take some of Randall's suggestions into account; the changes can be summarized as replacing "internal.request" with "inter.worker", and changing " inter.worker.key.rotation.interval.ms" to "inter.worker.key.ttl.ms". Cheers, Chris On Tue, Sep 17, 2019 at 10:36 AM Chris Egerton wrote: > Hi Randall, > > I've added your suggested paragraph to the KIP; it definitely clarifies > the intended behavior more than the content that it replaced. Thanks! > > As far as I can tell, the two items remaining open for discussion are the > names for the new configs (which should be made less REST > request-specific), and whether separate configs should be enabled for the > key generation and MAC signing algorithms. As soon as I've gotten some > research done on the latter, I'll report back and then we can hopefully > also begin discussing the former. > > Cheers, > > Chris > > On Mon, Sep 16, 2019 at 4:28 PM Randall Hauch wrote: > >> On Mon, Sep 16, 2019 at 3:06 PM Chris Egerton >> wrote: >> >> > Hi Randall, >> > >> > The new default value for the key size configuration will be "null". >> I've >> > clarified this in the KIP. This will still preserve backwards >> compatibility >> > and should not be an issue. >> > >> >> Thanks for clarifying this in the KIP. >> >> >> > >> > I understand your point about key generation vs MAC signing algorithms; >> > like I said, I'll need to do more research. >> > >> > I respectfully disagree that a single algorithm list would be easier to >> > understand as opposed to a list of accepted algorithms and a signature >> > algorithm. Placing special priority on the first element in that list is >> > fairly implicit and leaves room for confusion where users might have the >> > same algorithms in their lists for that config but in different orders >> for >> > different workers. The group coordination protocol selection behavior >> isn't >> > really a great example since users don't configure that directly >> > themselves. >> > >> > RE the proposed set of new configs: like I stated in my previous >> response, >> > I object to the use of "cluster." as a configuration prefix for any >> worker >> > configs: "most configurations deal with the worker and, transitively, >> the >> > cluster; there doesn't seem to be good enough cause for including that >> > prefix for these configurations." I also think this discussion should >> > probably continue a little more before we start proposing new >> configuration >> > names, given that it's still undecided which configurations we're going >> to >> > expose. >> > >> > > I don't think we have any data about how how often a follower will be >> > fully caught up, and it's possible that a worker's consumer fails to >> keep >> > the worker caught up quickly enough with the configs topic and the new >> > session key. So can we really say that a follower making a request with >> an >> > expired key will be rare? >> > It would depend on the key rotation interval, but I can't imagine the >> > likelihood of this occurring with an interval as low as even 10 minutes >> > would be that high. The config topic is low-volume; the consumer for >> that >> > topic isn't going to be flooded with writes and it seems fine to expect >> > fairly low latency for the consumer of this topic. >> > >> >> My concern is that we're really *assuming* it's not a problem. All I'm >> asking for is more clarity on what happens when a worker doesn't know >> about >> the new session key when it makes a request to this REST resource? The KIP >> makes it clear that It will be retried and that the existing error message >> will be replaced with a debug message, at least for a time being. Perhaps >> the KIP paragraph that talks about this can be reworded to make this more >> clear, something akin to: >>
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Hi Randall, I've added your suggested paragraph to the KIP; it definitely clarifies the intended behavior more than the content that it replaced. Thanks! As far as I can tell, the two items remaining open for discussion are the names for the new configs (which should be made less REST request-specific), and whether separate configs should be enabled for the key generation and MAC signing algorithms. As soon as I've gotten some research done on the latter, I'll report back and then we can hopefully also begin discussing the former. Cheers, Chris On Mon, Sep 16, 2019 at 4:28 PM Randall Hauch wrote: > On Mon, Sep 16, 2019 at 3:06 PM Chris Egerton wrote: > > > Hi Randall, > > > > The new default value for the key size configuration will be "null". I've > > clarified this in the KIP. This will still preserve backwards > compatibility > > and should not be an issue. > > > > Thanks for clarifying this in the KIP. > > > > > > I understand your point about key generation vs MAC signing algorithms; > > like I said, I'll need to do more research. > > > > I respectfully disagree that a single algorithm list would be easier to > > understand as opposed to a list of accepted algorithms and a signature > > algorithm. Placing special priority on the first element in that list is > > fairly implicit and leaves room for confusion where users might have the > > same algorithms in their lists for that config but in different orders > for > > different workers. The group coordination protocol selection behavior > isn't > > really a great example since users don't configure that directly > > themselves. > > > > RE the proposed set of new configs: like I stated in my previous > response, > > I object to the use of "cluster." as a configuration prefix for any > worker > > configs: "most configurations deal with the worker and, transitively, the > > cluster; there doesn't seem to be good enough cause for including that > > prefix for these configurations." I also think this discussion should > > probably continue a little more before we start proposing new > configuration > > names, given that it's still undecided which configurations we're going > to > > expose. > > > > > I don't think we have any data about how how often a follower will be > > fully caught up, and it's possible that a worker's consumer fails to keep > > the worker caught up quickly enough with the configs topic and the new > > session key. So can we really say that a follower making a request with > an > > expired key will be rare? > > It would depend on the key rotation interval, but I can't imagine the > > likelihood of this occurring with an interval as low as even 10 minutes > > would be that high. The config topic is low-volume; the consumer for that > > topic isn't going to be flooded with writes and it seems fine to expect > > fairly low latency for the consumer of this topic. > > > > My concern is that we're really *assuming* it's not a problem. All I'm > asking for is more clarity on what happens when a worker doesn't know about > the new session key when it makes a request to this REST resource? The KIP > makes it clear that It will be retried and that the existing error message > will be replaced with a debug message, at least for a time being. Perhaps > the KIP paragraph that talks about this can be reworded to make this more > clear, something akin to: > > "The leader will only accept requests signed with the most current key. > However, Connect follower workers may routinely experience small delays > when reading the new key. Rather than always logging such task > configuration failure and retry attempts as errors (the current behavior), > Connect's distributed herder will be modified slightly to handle such HTTP > 403 responses for this task configuration request by quietly retrying them > with the latest key for up to 1 minute. If failures persist for more than 1 > minute, they will be logged as errors." > > > > > Can we not retry for longer than 1 second if the request fails with > HTTP > > 403? I'm concerned what the UX will be if/when this happens, and that the > > user sees a very obtuse and seemingly unrelated error message they won't > > know how to fix. > > To be clear, the KIP doesn't propose any changes to the infinite retry > > logic that's present in the worker today. All that the KIP proposes is > that > > an existing error-level message be replaced with a debug-level message if > > it's suspected that a request has failed due to an out-of-date key. With > > that out of the way, sure, we can bump the grace period before beginning > to > > emit error-level log messages. I think going as high as minute might be > > acceptable; we should try to stay fairly low, however, in case the > request > > failures are due to some other reason that should be surfaced as soon as > > possible and with some urgency. > > > > Ack. See above . > > > > > > The text in the "Failure to relay task configurations to leader due to > > incorrect
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
On Mon, Sep 16, 2019 at 3:06 PM Chris Egerton wrote: > Hi Randall, > > The new default value for the key size configuration will be "null". I've > clarified this in the KIP. This will still preserve backwards compatibility > and should not be an issue. > Thanks for clarifying this in the KIP. > > I understand your point about key generation vs MAC signing algorithms; > like I said, I'll need to do more research. > > I respectfully disagree that a single algorithm list would be easier to > understand as opposed to a list of accepted algorithms and a signature > algorithm. Placing special priority on the first element in that list is > fairly implicit and leaves room for confusion where users might have the > same algorithms in their lists for that config but in different orders for > different workers. The group coordination protocol selection behavior isn't > really a great example since users don't configure that directly > themselves. > > RE the proposed set of new configs: like I stated in my previous response, > I object to the use of "cluster." as a configuration prefix for any worker > configs: "most configurations deal with the worker and, transitively, the > cluster; there doesn't seem to be good enough cause for including that > prefix for these configurations." I also think this discussion should > probably continue a little more before we start proposing new configuration > names, given that it's still undecided which configurations we're going to > expose. > > > I don't think we have any data about how how often a follower will be > fully caught up, and it's possible that a worker's consumer fails to keep > the worker caught up quickly enough with the configs topic and the new > session key. So can we really say that a follower making a request with an > expired key will be rare? > It would depend on the key rotation interval, but I can't imagine the > likelihood of this occurring with an interval as low as even 10 minutes > would be that high. The config topic is low-volume; the consumer for that > topic isn't going to be flooded with writes and it seems fine to expect > fairly low latency for the consumer of this topic. > My concern is that we're really *assuming* it's not a problem. All I'm asking for is more clarity on what happens when a worker doesn't know about the new session key when it makes a request to this REST resource? The KIP makes it clear that It will be retried and that the existing error message will be replaced with a debug message, at least for a time being. Perhaps the KIP paragraph that talks about this can be reworded to make this more clear, something akin to: "The leader will only accept requests signed with the most current key. However, Connect follower workers may routinely experience small delays when reading the new key. Rather than always logging such task configuration failure and retry attempts as errors (the current behavior), Connect's distributed herder will be modified slightly to handle such HTTP 403 responses for this task configuration request by quietly retrying them with the latest key for up to 1 minute. If failures persist for more than 1 minute, they will be logged as errors." > > Can we not retry for longer than 1 second if the request fails with HTTP > 403? I'm concerned what the UX will be if/when this happens, and that the > user sees a very obtuse and seemingly unrelated error message they won't > know how to fix. > To be clear, the KIP doesn't propose any changes to the infinite retry > logic that's present in the worker today. All that the KIP proposes is that > an existing error-level message be replaced with a debug-level message if > it's suspected that a request has failed due to an out-of-date key. With > that out of the way, sure, we can bump the grace period before beginning to > emit error-level log messages. I think going as high as minute might be > acceptable; we should try to stay fairly low, however, in case the request > failures are due to some other reason that should be surfaced as soon as > possible and with some urgency. > Ack. See above . > > > The text in the "Failure to relay task configurations to leader due to > incorrect configuration" section is similarly ambiguous. How will a user > know that this is occurring, and is this really an unlikely situation > (e.g., "This scenario is unlikely to occur with any normal usage of > Connect.")? Seems like users unintentionally misconfiguring some of their > Connect workers is quite common. > A user will know that this is occurring based on error-level log messages > emitted by the worker about a failure to relay task ("Failed to reconfigure > connector's tasks, retrying after backoff:"), plus a stack trace. Yes, this > situation is unlikely; most worker files will never contain the > newly-proposed configurations for this KIP since the default values should > suffice for most cases. If anyone tries to adjust these values, we will > have documentation available on
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Hi Randall, The new default value for the key size configuration will be "null". I've clarified this in the KIP. This will still preserve backwards compatibility and should not be an issue. I understand your point about key generation vs MAC signing algorithms; like I said, I'll need to do more research. I respectfully disagree that a single algorithm list would be easier to understand as opposed to a list of accepted algorithms and a signature algorithm. Placing special priority on the first element in that list is fairly implicit and leaves room for confusion where users might have the same algorithms in their lists for that config but in different orders for different workers. The group coordination protocol selection behavior isn't really a great example since users don't configure that directly themselves. RE the proposed set of new configs: like I stated in my previous response, I object to the use of "cluster." as a configuration prefix for any worker configs: "most configurations deal with the worker and, transitively, the cluster; there doesn't seem to be good enough cause for including that prefix for these configurations." I also think this discussion should probably continue a little more before we start proposing new configuration names, given that it's still undecided which configurations we're going to expose. > I don't think we have any data about how how often a follower will be fully caught up, and it's possible that a worker's consumer fails to keep the worker caught up quickly enough with the configs topic and the new session key. So can we really say that a follower making a request with an expired key will be rare? It would depend on the key rotation interval, but I can't imagine the likelihood of this occurring with an interval as low as even 10 minutes would be that high. The config topic is low-volume; the consumer for that topic isn't going to be flooded with writes and it seems fine to expect fairly low latency for the consumer of this topic. > Can we not retry for longer than 1 second if the request fails with HTTP 403? I'm concerned what the UX will be if/when this happens, and that the user sees a very obtuse and seemingly unrelated error message they won't know how to fix. To be clear, the KIP doesn't propose any changes to the infinite retry logic that's present in the worker today. All that the KIP proposes is that an existing error-level message be replaced with a debug-level message if it's suspected that a request has failed due to an out-of-date key. With that out of the way, sure, we can bump the grace period before beginning to emit error-level log messages. I think going as high as minute might be acceptable; we should try to stay fairly low, however, in case the request failures are due to some other reason that should be surfaced as soon as possible and with some urgency. > The text in the "Failure to relay task configurations to leader due to incorrect configuration" section is similarly ambiguous. How will a user know that this is occurring, and is this really an unlikely situation (e.g., "This scenario is unlikely to occur with any normal usage of Connect.")? Seems like users unintentionally misconfiguring some of their Connect workers is quite common. A user will know that this is occurring based on error-level log messages emitted by the worker about a failure to relay task ("Failed to reconfigure connector's tasks, retrying after backoff:"), plus a stack trace. Yes, this situation is unlikely; most worker files will never contain the newly-proposed configurations for this KIP since the default values should suffice for most cases. If anyone tries to adjust these values, we will have documentation available on what their behavior is and what to do to ensure your cluster doesn't break while changing them (including the rolling upgrade procedure outlined in the KIP). These configurations make it no likelier that a user will accidentally break their Connect cluster than the group ID, key/value converter, REST extension, and broker authentication configs (to name just a few), and since they will be left out of any sample worker configurations included in Kafka, the odds of someone accidentally messing with them are low enough that it doesn't seem worth investing a lot of effort into making it harder for someone to shoot themselves in the foot. I'll update the KIP to include possible indicators that the cluster has been misconfigured, but I don't think this scenario deserves a ton of priority. > Maybe provide a bit more description of what these error messages will be. I believe this is more of an implementation detail and should be left to PR review. KIPs should only be expected to be so fine-grained; proposing actual log messages doesn't seem necessary for the overall adoption/rejection of the mechanisms proposed here and the iteration cycle on GitHub is significantly faster and more efficient than on a mailing list. > Do we need a JMX metric that shows
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Thanks, Chris. I still have a number of questions and requests for clarification. If we don't provide a default value for the key size, then the following statement in the "Backwards compatibility" subsection is no longer true: "All of the proposed configurations here have default values, making them backwards compatible." IIUC, a user will not be able to upgrade an existing Connect cluster without editing their configurations, and this pretty much means it is not backwards compatible, which is a non-starter. Regarding the new configurations and our earlier discussion about whether all are required, I'm not convinced that we need separate configs for signing and key generation algorithms. If this is a common need, then the KIP should justify that with an explanation. But as it stands now, exposing multiple algorithm properties now means the UX is more complicated and we can't make things simpler in the future. I also think that a single algorithm list where the first is used for signing would be easier to understand (fewer is better) and would be more in line with the way rebalance protocols are chosen the broker (see https://www.youtube.com/watch?v=MmLezWRI3Ys for a decent explanation). If at some point in the future we do want that extra flexibility, then we can expose additional properties. I also asked earlier about renaming the config properties so they can be used in other places within the cluster other than the task configs request, which is something I think we'll want to do. If we minimize the number of configs, then how about `cluster.session.algorithms`, `cluster.session.key.size` and `cluster.session.key.ttl.ms`? The "Proposed Changes" section now mentions: The leader will only accept requests signed with the most current key. This > should not cause any major problems; if a follower attempts to make a > request with an expired key (which should be quite rare and only occur if > the request is made by a follower that is not fully caught up to the end of > the config topic), the initial request will fail, but will be subsequently > retried after a backoff period. This backoff period should leave sufficient > room for the rebalance to complete. > I don't think we have any data about how how often a follower will be fully caught up, and it's possible that a worker's consumer fails to keep the worker caught up quickly enough with the configs topic and the new session key. So can we really say that a follower making a request with an expired key will be rare? If the first four requests fail with HTTP 403, it will be assumed that this > is due to an out-of-date session key; a debug-level message about the > subsequent retry will be logged in place of the current error-level log > message of "Failed to reconfigure connector's tasks, retrying after > backoff: " followed by a stack trace. Since the backoff period is 250 > milliseconds, this should give at least one second of leeway for an > outdated key to be updated. If longer than that is required, the usual > error-level log messages will begin to be generated by the worker. > Can we not retry for longer than 1 second if the request fails with HTTP 403? I'm concerned what the UX will be if/when this happens, and that the user sees a very obtuse and seemingly unrelated error message they won't know how to fix. The text in the "Failure to relay task configurations to leader due to incorrect configuration" section is similarly ambiguous. How will a user know that this is occurring, and is this really an unlikely situation (e.g., "This scenario is unlikely to occur with any normal usage of Connect.")? Seems like users unintentionally misconfiguring some of their Connect workers is quite common. Additionally, it will be vital to design appropriate error messages for > this scenario so that users can (hopefully) dig themselves out of that hole > on their own. Maybe provide a bit more description of what these error messages will be. Do we need a JMX metric that shows the protocol that each worker is configured to use, and whether the workers are using session keys? This would be a great way to monitor the cluster's use of this feature. Best regards, Randall On Wed, Sep 11, 2019 at 7:00 PM Chris Egerton wrote: > Hi all, > > I've updated KIP-507 to reflect the changes inspired by Randall's recent > feedback. In addition, after some further research, I've decided to remove > the proposed default value for the internal.request.key.size and instead, > should no value be provided, rely on the default key size for the given key > algorithm. More detail can be found in the KIP if anyone's interested. > > Cheers, > > Chris > > On Tue, Sep 10, 2019 at 3:18 PM Chris Egerton wrote: > > > Hi Randall, > > > > Thanks so much for your comprehensive feedback! To avoid creating an > > extremely long response I'll address your comments/questions by > referencing > > the identifiers you've provided for them as opposed to copying them and > > responding
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Hi all, I've updated KIP-507 to reflect the changes inspired by Randall's recent feedback. In addition, after some further research, I've decided to remove the proposed default value for the internal.request.key.size and instead, should no value be provided, rely on the default key size for the given key algorithm. More detail can be found in the KIP if anyone's interested. Cheers, Chris On Tue, Sep 10, 2019 at 3:18 PM Chris Egerton wrote: > Hi Randall, > > Thanks so much for your comprehensive feedback! To avoid creating an > extremely long response I'll address your comments/questions by referencing > the identifiers you've provided for them as opposed to copying them and > responding inline; hopefully things are still readable :) > > > RE new configs: > > a) I believe exposing these configurations right now is the correct > approach. There are two scenarios that we should aim to support: running > Connect on a bare-bones JVM that doesn't implement anything beyond the > requirements of the JVM specs in terms of key generation and key signing > algorithms, and running Connect in an environment with hard security > requirements for, e.g., FIPS compliance. The default values for these > configurations are intended to satisfy the first use case; however, in > order to enable users to conform with higher security standards (the second > use case), some key generation and key signing algorithms that are not > guaranteed to be present on every implementation of the JVM may be > required. I don't see a way to satisfy both use cases without adding > configurability. > With regards to key size: yes, this needs to be configurable as there are > different ranges of acceptable key size for different key generation > algorithms; for example, the "AES" algorithm for key generation requires a > key size of either 128, 192 or 256 (at least on my local JVM) whereas the > "DES" algorithm requires a key size of exactly 56. > As far as enabling different algorithms for key generation vs. request > signing goes... I'm not sure. Local tests involving keys generated with > different algorithms from the mac algorithms used to sign inputs (e.g., > combining an HmacSHA256 key with an HmacMD5 mac or using a DES key with an > HmacSHA256 mac) haven't thrown any exceptions yet but this is a little > beyond the boundaries of my current knowledge, so I'll have to get back to > you on that point. At the very least, your question (and the research I've > done so far to try to address it) has highlighted a bug in my current PR > that'll definitely need to be fixed :) > > b) The riskiest scenario is if a follower worker is configured to use a > request signing algorithm that isn't allowed by the leader. In this case, > any failure will only occur if/when that follower starts up a connector and > then has to forward tasks for that connector to the leader, which may not > happen immediately. Once that failure occurs, an endless failure loop will > occur wherein the follower endlessly retries to send those task > configurations to the leader and pauses by the backoff interval in between > each failed request. > There are two ways to rectify this situation; either shut down the > follower and restart it after editing its configuration to use a request > signing algorithm permitted by the leader, or shut down all other workers > in the cluster that do not permit the request signing algorithm used by the > follower, reconfigure them to permit it, and then restart them. > This scenario is unlikely to occur with any normal usage of Connect, but > the circumstances leading to it will need to be called out very carefully > in order to avoid putting the cluster into a bad state and flooding logs > with scary-looking error messages. Speaking of, it'll be vital to design > appropriate error messages for this scenario so that users can (hopefully) > dig themselves out of that hole on their own. > Differing values for any of the other configs shouldn't actually be too > problematic, given that the three remaining configs all deal with key > generation/rotation, which is only handled by one worker at a time (the > leader). > > c) Good call. Yes, these configs are all "low" importance and I'll update > the KIP accordingly to reflect that. > > d) No, there is no window when multiple keys are valid. This is explicitly > called out in the KIP at the end of the "Proposed Changes" section: > > The leader will only accept requests signed with the most current key. > This should not cause any major problems; if a follower attempts to make a > request with an expired key (which should be quite rare and only occur if > the request is made by a follower that is not fully caught up to the end of > the config topic), the initial request will fail, but will be subsequently > retried after a backoff period. > > e) I'll update the KIP to reflect the fact that, barring any need for > heightened levels of security compliance, none of these configurations > should
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Hi Randall, Thanks so much for your comprehensive feedback! To avoid creating an extremely long response I'll address your comments/questions by referencing the identifiers you've provided for them as opposed to copying them and responding inline; hopefully things are still readable :) RE new configs: a) I believe exposing these configurations right now is the correct approach. There are two scenarios that we should aim to support: running Connect on a bare-bones JVM that doesn't implement anything beyond the requirements of the JVM specs in terms of key generation and key signing algorithms, and running Connect in an environment with hard security requirements for, e.g., FIPS compliance. The default values for these configurations are intended to satisfy the first use case; however, in order to enable users to conform with higher security standards (the second use case), some key generation and key signing algorithms that are not guaranteed to be present on every implementation of the JVM may be required. I don't see a way to satisfy both use cases without adding configurability. With regards to key size: yes, this needs to be configurable as there are different ranges of acceptable key size for different key generation algorithms; for example, the "AES" algorithm for key generation requires a key size of either 128, 192 or 256 (at least on my local JVM) whereas the "DES" algorithm requires a key size of exactly 56. As far as enabling different algorithms for key generation vs. request signing goes... I'm not sure. Local tests involving keys generated with different algorithms from the mac algorithms used to sign inputs (e.g., combining an HmacSHA256 key with an HmacMD5 mac or using a DES key with an HmacSHA256 mac) haven't thrown any exceptions yet but this is a little beyond the boundaries of my current knowledge, so I'll have to get back to you on that point. At the very least, your question (and the research I've done so far to try to address it) has highlighted a bug in my current PR that'll definitely need to be fixed :) b) The riskiest scenario is if a follower worker is configured to use a request signing algorithm that isn't allowed by the leader. In this case, any failure will only occur if/when that follower starts up a connector and then has to forward tasks for that connector to the leader, which may not happen immediately. Once that failure occurs, an endless failure loop will occur wherein the follower endlessly retries to send those task configurations to the leader and pauses by the backoff interval in between each failed request. There are two ways to rectify this situation; either shut down the follower and restart it after editing its configuration to use a request signing algorithm permitted by the leader, or shut down all other workers in the cluster that do not permit the request signing algorithm used by the follower, reconfigure them to permit it, and then restart them. This scenario is unlikely to occur with any normal usage of Connect, but the circumstances leading to it will need to be called out very carefully in order to avoid putting the cluster into a bad state and flooding logs with scary-looking error messages. Speaking of, it'll be vital to design appropriate error messages for this scenario so that users can (hopefully) dig themselves out of that hole on their own. Differing values for any of the other configs shouldn't actually be too problematic, given that the three remaining configs all deal with key generation/rotation, which is only handled by one worker at a time (the leader). c) Good call. Yes, these configs are all "low" importance and I'll update the KIP accordingly to reflect that. d) No, there is no window when multiple keys are valid. This is explicitly called out in the KIP at the end of the "Proposed Changes" section: > The leader will only accept requests signed with the most current key. This should not cause any major problems; if a follower attempts to make a request with an expired key (which should be quite rare and only occur if the request is made by a follower that is not fully caught up to the end of the config topic), the initial request will fail, but will be subsequently retried after a backoff period. e) I'll update the KIP to reflect the fact that, barring any need for heightened levels of security compliance, none of these configurations should need to be altered and the defaults should suffice. f) As mentioned above, a larger key size isn't necessarily possible for all key generation algorithms. I believe these defaults are the best we can work with until/unless other algorithms are added to the JVM spec. g) I don't believe there's a lot of precedent for configuration properties like this in AK. The closest I could find was the "password.encoder.*" broker properties, but those deal with storing and encrypting passwords, as opposed to generating keys and signing requests with them. I'm leaning towards the stance that it's not
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Thanks for putting this KIP together, Chris. It's thorough and well thought out, and you've done a great job responding to comments. It is indeed going to be nice to harden the REST API a bit more. I do have a few questions/concerns/comments, all of which I think can be incorporated relatively easily. First, regarding the new Connect worker configs mentioned in the "Public Interfaces" section: a. Do we need to expose all of these, or could Connect just internally use constants and only expo them in the future if necessary? Do users really need to explicitly set any of these, or will most users simply use the defaults? Do we need to specify a different algorithm for generating a session key versus signing a request? Does the key size need to be configured, or can we use a large enough value and grow this in the future with patch releases? b. What happens if one worker has different values for any of these new worker configs than the other workers? If there is an error, when will that error occur (before startup happens, only when that worker attempts to use the internal REST endpoint, etc.) and will the error be clear enough the user knows how to fix the problem? c. The section should mention the importance of each new configuration property. I suspect these will be "low" importance, since the defaults are likely sufficient for most users. d. Is there a window when both the new session key and the prior session key are still valid? If so, what is that window? e. Can the KIP emphasize more that the default values are likely to be sufficient for most users? f. Are all of the defaults sufficient for the foreseeable future? For example, what is the cost of using a larger session key size? g. Please mention whether these config properties follow or do not follow existing precedent within AK. h. Might these settings be used to validate other kinds of intra-cluster communication and messages? If so, the "internal.request" prefix might make that difficult. Was there any thought about using a bit more generic prefix, such as "cluster."? i. The `internal.request.verification.algorithms` and `internal.request.signature.algorithm` are dependent upon each other, and it seems like it's possible for the "verification" algorithms to not include the "signature" algorithm. We could catch cases like that (though not with simple Validator implementations since they are specific to a single property value), but would it be better to have a single list and to use the first item in the list as the signature algorithm? Second, the KIP proposes to write the session key to the existing config topic using record(s) with a new key. Konstantine was correct when he said earlier in the thread: > Indeed, broadcasting the session key to the followers is necessary. But > adding it to the configs topic with a new key is compatible with previous > versions (the key will be ignored by workers that don't support this > protocol) and works since all workers will need to read this topic to the > end to remain in the group. > This is true that the KafkaConfigBackingStore will not act upon config record keys with keys that do not match the known pattern, but such messages do result in an ERROR log message. I'm concerned that operators will be alarmed (both emotionally and technically) if they begin to upgrade a Connect cluster and start seeing ERROR log messages. We could change these error messages to be warnings, but that will take effect only after a worker is upgraded and restarted, and will not affect the yet-to-be-upgraded workers in the Connect cluster. What can we do to avoid these ERROR log messages? Should the default be to not change `connect.protocol`, allow the user to upgrade all clusters, and then to change `connect.protocol=sessioned` with a (second) rolling upgrade? Or, if we decide to rely upon an upgrade recipe to avoid these, then we should document that recipe here. Third, I think the KIP should address the potential for seeing one or more "Failed to reconfigure connector's tasks, retrying after backoff" error message during a key rotation. It would be good to eliminate the condition, maybe by returning a response indicating authorization failure and signifying a key rotation is required, and to prevent an ERROR log message. I don't think separate INFO level log messages saying to disregard earlier ERROR log messages is sufficient. Fourth, how will an operator of a Connect cluster know whether this internal endpoint is protected by authorization via this feature? And how can an operator know which Connect workers are preventing a cluster from enabling this feature? Should a warning or error be logged if this feature is disabled after being enabled? Finally, I have a few nits: 1. The "Backwards compatibility" section should be more focused on the upgrade UX. "All of the new worker configurations have sensible defaults, and most users can simply upgrade without needing to override them." (BTW, if this is
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Hi all, I've published a draft PR implementing the changes currently proposed in KIP-507, which can be found at https://github.com/apache/kafka/pull/7310. Thanks for all of the review and feedback so far! Given the lack of any major voiced reservations so far, if I don't hear anything else over the course of the next few days or so I'll be opening this for a vote. Cheers, Chris On Wed, Aug 28, 2019 at 12:21 PM Chris Egerton wrote: > HI all, > > Wow, thanks for all the feedback! Happy to see this proposal getting some > love :) > > > RE Konstantine's comments: > > > I've read the discussions regarding why the rebalancing protocol is used > here and your intention to follow the approach which was recently used in > order to elegantly support upgrades without requiring rolling restarts > makes sense. > > However, I'd like to revisit the proposed extension of the connect > protocol > going a little bit more in detail. > Indeed, broadcasting the session key to the followers is necessary. But > adding it to the configs topic with a new key is compatible with previous > versions (the key will be ignored by workers that don't support this > protocol) and works since all workers will need to read this topic to the > end to remain in the group. > > Additionally, to your point about consensus, meaning how the workers all > know during the same generation that the "sessioned" version of the > protocol was chosen by the leader, it seems to me that an explicit upgrade > of the protocol on the wire, on its name and on the metadata that are sent > with the join request by each worker might not be required. What I believe > would suffice is merely an increment of the version of the protocol, > because all the other information remains the same (for example, the list > of assigned and revoked tasks, etc). This would allow us not to extend the > metadata even more with yet another JoinGroupRequest and could achieve what > you want in terms of an orchestrated upgrade. > > I really like this idea, thanks for suggesting it. A simple bump of the > protocol version does seem sufficient to achieve consensus among workers > and would significantly reduce the implementation complexity of an entirely > separate protocol just for the sake of distributing keys, and using the > config topic to relay keys instead of the group coordination protocol would > prevent rebalances solely for the sake of key rotation (instead, the leader > can just periodically write a new key to the config topic). I'll update the > KIP by EOD today with these changes. > > > All the leader worker would have to check is whether all the assignments > that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if > you'd choose to go this route). > > I like this idea for dictating whether the leader requires request > signing, SGTM. I think we can have all workers use the latest session key > present in the config topic (if there is one) to sign their requests; since > the signatures are passed via header, this shouldn't cause any problems if > a follower signs a request that gets sent to a leader that isn't performing > request validation. > > > RE Magesh's comments: > > > Thanks a lot for the KIP. This will certainly be a useful feature. I > would > have preferred to use the topic approach as well but I also understand your > point of view about the operational complexity for upgrades. If not with > this KIP, I would certainly want to go that route at some point in the > future. > > I'm actually a little conflicted on this front. If this KIP passes and we > have a secure way to perform quick, synchronous, follower-to-leader > communication, it may be better to hold onto and even perhaps expand on in > case this functionality comes in handy later. Additionally, there would > have to be a lot of thought put into the semantics of using topic-based > communication for relaying task configs to the leader due to its > asynchronous nature; I'm not convinced it's not worth it, but I'm also not > entirely convinced it is. > > > As far as using the rebalance protocol goes, it would be great if you > could > elaborate on what exactly would be the rebalance impact when a key expires. > I see that you have called out saying that there should be no significant > impact but it will be great to explicitly state as what is to be expected. > I would prefer to not have any sorts of rebalancing when this happens since > the connector and task assignments should not change with this event. It > will be useful to explain this a bit more. > > Given Konstantine's comments, this question isn't strictly relevant > anymore (unless we decide to return to the approach of distributing session > keys during rebalance), but just for completeness it does seem worth > addressing. The "rebalances" proposed for key rotation would be achieved by > a request from the leader the rejoin the group. This would immediately > result in the leader being asked to assign connectors and
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
HI all, Wow, thanks for all the feedback! Happy to see this proposal getting some love :) RE Konstantine's comments: > I've read the discussions regarding why the rebalancing protocol is used here and your intention to follow the approach which was recently used in order to elegantly support upgrades without requiring rolling restarts makes sense. > However, I'd like to revisit the proposed extension of the connect protocol going a little bit more in detail. Indeed, broadcasting the session key to the followers is necessary. But adding it to the configs topic with a new key is compatible with previous versions (the key will be ignored by workers that don't support this protocol) and works since all workers will need to read this topic to the end to remain in the group. > Additionally, to your point about consensus, meaning how the workers all know during the same generation that the "sessioned" version of the protocol was chosen by the leader, it seems to me that an explicit upgrade of the protocol on the wire, on its name and on the metadata that are sent with the join request by each worker might not be required. What I believe would suffice is merely an increment of the version of the protocol, because all the other information remains the same (for example, the list of assigned and revoked tasks, etc). This would allow us not to extend the metadata even more with yet another JoinGroupRequest and could achieve what you want in terms of an orchestrated upgrade. I really like this idea, thanks for suggesting it. A simple bump of the protocol version does seem sufficient to achieve consensus among workers and would significantly reduce the implementation complexity of an entirely separate protocol just for the sake of distributing keys, and using the config topic to relay keys instead of the group coordination protocol would prevent rebalances solely for the sake of key rotation (instead, the leader can just periodically write a new key to the config topic). I'll update the KIP by EOD today with these changes. > All the leader worker would have to check is whether all the assignments that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if you'd choose to go this route). I like this idea for dictating whether the leader requires request signing, SGTM. I think we can have all workers use the latest session key present in the config topic (if there is one) to sign their requests; since the signatures are passed via header, this shouldn't cause any problems if a follower signs a request that gets sent to a leader that isn't performing request validation. RE Magesh's comments: > Thanks a lot for the KIP. This will certainly be a useful feature. I would have preferred to use the topic approach as well but I also understand your point of view about the operational complexity for upgrades. If not with this KIP, I would certainly want to go that route at some point in the future. I'm actually a little conflicted on this front. If this KIP passes and we have a secure way to perform quick, synchronous, follower-to-leader communication, it may be better to hold onto and even perhaps expand on in case this functionality comes in handy later. Additionally, there would have to be a lot of thought put into the semantics of using topic-based communication for relaying task configs to the leader due to its asynchronous nature; I'm not convinced it's not worth it, but I'm also not entirely convinced it is. > As far as using the rebalance protocol goes, it would be great if you could elaborate on what exactly would be the rebalance impact when a key expires. I see that you have called out saying that there should be no significant impact but it will be great to explicitly state as what is to be expected. I would prefer to not have any sorts of rebalancing when this happens since the connector and task assignments should not change with this event. It will be useful to explain this a bit more. Given Konstantine's comments, this question isn't strictly relevant anymore (unless we decide to return to the approach of distributing session keys during rebalance), but just for completeness it does seem worth addressing. The "rebalances" proposed for key rotation would be achieved by a request from the leader the rejoin the group. This would immediately result in the leader being asked to assign connectors and tasks to the group, which would exactly match the existing assignments, so with that and the help of the new incremental cooperative rebalance semantics there would be zero downtime for connectors or tasks. Truly, "rebalance" is a bit of a misnomer here; we're really talking about performing group assignment (which doesn't necessarily imply adding or removing connectors/tasks to/from any workers), but as far as I've seen the two are basically used synonymously w/r/t Connect so I elected to use the more common, albeit slightly less accurate, term. RE Greg's comments: > Does this roll-out
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Hey Chris, The KIP makes sense to me, and I think it's a very natural way to solve the issue at hand. I had two questions about the automatic rollout logic. Does this roll-out ratchet the protocol version irreversibly? What is the expected behavior when the feature has been enabled, and a worker joins the group while advertising an old protocol? I think it's reasonable to prevent single-worker downgrades, since this is an attack vector. But what happens if every worker in a cluster goes down, and at least the first worker comes back with an old protocol? This is a potential attack that requires access to the workers, but could also be used by customers to intentionally downgrade their cluster. You mentioned that this security improvement only adds security when there are a number of existing security changes in place, one of these being ACLs on the Kafka broker. Do you plan to gate this feature roll-out on detecting that those prerequisite features are enabled? If not, I think this feature adds no additional security to unsecured clusters, while still incurring the overhead on signing requests. Do you know how significant the overhead will be, as a rough estimate? Looks great! Thanks, Greg On Tue, Aug 27, 2019 at 8:38 PM Konstantine Karantasis < konstant...@confluent.io> wrote: > Thanks for the KIP Chris! > > This proposal will fill a gap in Kafka Connect deployments and will > strengthen their production readiness in even more diverse environments, in > which current workarounds are less practical. > > I've read the discussions regarding why the rebalancing protocol is used > here and your intention to follow the approach which was recently used in > order to elegantly support upgrades without requiring rolling restarts > makes sense. > > However, I'd like to revisit the proposed extension of the connect protocol > going a little bit more in detail. > Indeed, broadcasting the session key to the followers is necessary. But > adding it to the configs topic with a new key is compatible with previous > versions (the key will be ignored by workers that don't support this > protocol) and works since all workers will need to read this topic to the > end to remain in the group. > > Additionally, to your point about consensus, meaning how the workers all > know during the same generation that the "sessioned" version of the > protocol was chosen by the leader, it seems to me that an explicit upgrade > of the protocol on the wire, on its name and on the metadata that are sent > with the join request by each worker might not be required. What I believe > would suffice is merely an increment of the version of the protocol, > because all the other information remains the same (for example, the list > of assigned and revoked tasks, etc). This would allow us not to extend the > metadata even more with yet another JoinGroupRequest and could achieve what > you want in terms of an orchestrated upgrade. > > Without having exhaustively checked the various code paths, I feel that > such an approach would be less heavy-handed in terms of extending the > format of the connect protocol and would achieve a certain separation of > concerns between the information that is required for rebalancing and is > included in the protocol metadata and the information that is required for > securing the internal Connect REST endpoint. In theory, this method could > be used to even secure the eager version of the protocol, but I'd agree > that this out of the scope of the current proposal. > > All the leader worker would have to check is whether all the assignments > that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if > you'd choose to go this route). > > Overall, this does not differ a lot from your current proposal, which I > think is already in the right direction. > Let me know what you think. > > Cheers, > Konstantine > > > On Tue, Aug 27, 2019 at 4:19 PM Magesh Nandakumar > wrote: > > > Hi Chris, > > > > Thanks a lot for the KIP. This will certainly be a useful feature. I > would > > have preferred to use the topic approach as well but I also understand > your > > point of view about the operational complexity for upgrades. If not with > > this KIP, I would certainly want to go that route at some point in the > > future. > > > > As far as using the rebalance protocol goes, it would be great if you > could > > elaborate on what exactly would be the rebalance impact when a key > expires. > > I see that you have called out saying that there should be no significant > > impact but it will be great to explicitly state as what is to be > expected. > > I would prefer to not have any sorts of rebalancing when this happens > since > > the connector and task assignments should not change with this event. It > > will be useful to explain this a bit more. > > > > Thanks, > > Magesh > > > > On Wed, Aug 21, 2019 at 4:40 PM Chris Egerton > wrote: > > > > > Hi all, > > > > > > I've made some tweaks to the KIP that I
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Thanks for the KIP Chris! This proposal will fill a gap in Kafka Connect deployments and will strengthen their production readiness in even more diverse environments, in which current workarounds are less practical. I've read the discussions regarding why the rebalancing protocol is used here and your intention to follow the approach which was recently used in order to elegantly support upgrades without requiring rolling restarts makes sense. However, I'd like to revisit the proposed extension of the connect protocol going a little bit more in detail. Indeed, broadcasting the session key to the followers is necessary. But adding it to the configs topic with a new key is compatible with previous versions (the key will be ignored by workers that don't support this protocol) and works since all workers will need to read this topic to the end to remain in the group. Additionally, to your point about consensus, meaning how the workers all know during the same generation that the "sessioned" version of the protocol was chosen by the leader, it seems to me that an explicit upgrade of the protocol on the wire, on its name and on the metadata that are sent with the join request by each worker might not be required. What I believe would suffice is merely an increment of the version of the protocol, because all the other information remains the same (for example, the list of assigned and revoked tasks, etc). This would allow us not to extend the metadata even more with yet another JoinGroupRequest and could achieve what you want in terms of an orchestrated upgrade. Without having exhaustively checked the various code paths, I feel that such an approach would be less heavy-handed in terms of extending the format of the connect protocol and would achieve a certain separation of concerns between the information that is required for rebalancing and is included in the protocol metadata and the information that is required for securing the internal Connect REST endpoint. In theory, this method could be used to even secure the eager version of the protocol, but I'd agree that this out of the scope of the current proposal. All the leader worker would have to check is whether all the assignments that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if you'd choose to go this route). Overall, this does not differ a lot from your current proposal, which I think is already in the right direction. Let me know what you think. Cheers, Konstantine On Tue, Aug 27, 2019 at 4:19 PM Magesh Nandakumar wrote: > Hi Chris, > > Thanks a lot for the KIP. This will certainly be a useful feature. I would > have preferred to use the topic approach as well but I also understand your > point of view about the operational complexity for upgrades. If not with > this KIP, I would certainly want to go that route at some point in the > future. > > As far as using the rebalance protocol goes, it would be great if you could > elaborate on what exactly would be the rebalance impact when a key expires. > I see that you have called out saying that there should be no significant > impact but it will be great to explicitly state as what is to be expected. > I would prefer to not have any sorts of rebalancing when this happens since > the connector and task assignments should not change with this event. It > will be useful to explain this a bit more. > > Thanks, > Magesh > > On Wed, Aug 21, 2019 at 4:40 PM Chris Egerton wrote: > > > Hi all, > > > > I've made some tweaks to the KIP that I believe are improvements. More > > detail can be found on the KIP page itself, but as a brief summary, the > > three changes are: > > > > - The removal of the internal.request.verification property in favor of > > modifying the default value for the connect.protocol property from > > "compatible" to "sessioned" > > - The renaming of some configurations to use better terminology (mostly > > just "request" instead of "key" where appropriate) > > - The addition of two new configurations that dictate how session keys > are > > to be generated > > > > Thanks Ryanne for the feedback so far, hope to hear from some more of you > > soon :) > > > > Cheers, > > > > Chris > > > > On Mon, Aug 19, 2019 at 12:02 PM Chris Egerton > > wrote: > > > > > Hi Ryanne, > > > > > > The reasoning for this is provided in the KIP: "There would be no clear > > > way to achieve consensus amongst the workers in a cluster on whether to > > > switch to this new behavior." To elaborate on this--it would be bad if > a > > > follower worker began writing task configurations to the topic while > the > > > leader continued to only listen on the internal REST endpoint. It's > > > necessary to ensure that every worker in a cluster supports this new > > > behavior before switching it on. > > > > > > To be fair, it is technically possible to achieve consensus without > using > > > the group coordination protocol by adding new configurations and using > a > > > multi-phase rolling
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Hi Chris, Thanks a lot for the KIP. This will certainly be a useful feature. I would have preferred to use the topic approach as well but I also understand your point of view about the operational complexity for upgrades. If not with this KIP, I would certainly want to go that route at some point in the future. As far as using the rebalance protocol goes, it would be great if you could elaborate on what exactly would be the rebalance impact when a key expires. I see that you have called out saying that there should be no significant impact but it will be great to explicitly state as what is to be expected. I would prefer to not have any sorts of rebalancing when this happens since the connector and task assignments should not change with this event. It will be useful to explain this a bit more. Thanks, Magesh On Wed, Aug 21, 2019 at 4:40 PM Chris Egerton wrote: > Hi all, > > I've made some tweaks to the KIP that I believe are improvements. More > detail can be found on the KIP page itself, but as a brief summary, the > three changes are: > > - The removal of the internal.request.verification property in favor of > modifying the default value for the connect.protocol property from > "compatible" to "sessioned" > - The renaming of some configurations to use better terminology (mostly > just "request" instead of "key" where appropriate) > - The addition of two new configurations that dictate how session keys are > to be generated > > Thanks Ryanne for the feedback so far, hope to hear from some more of you > soon :) > > Cheers, > > Chris > > On Mon, Aug 19, 2019 at 12:02 PM Chris Egerton > wrote: > > > Hi Ryanne, > > > > The reasoning for this is provided in the KIP: "There would be no clear > > way to achieve consensus amongst the workers in a cluster on whether to > > switch to this new behavior." To elaborate on this--it would be bad if a > > follower worker began writing task configurations to the topic while the > > leader continued to only listen on the internal REST endpoint. It's > > necessary to ensure that every worker in a cluster supports this new > > behavior before switching it on. > > > > To be fair, it is technically possible to achieve consensus without using > > the group coordination protocol by adding new configurations and using a > > multi-phase rolling upgrade, but the operational complexity of this > > approach would significantly complicate things for the default case of > just > > wanting to bump your Connect cluster to the next version without having > to > > alter your existing worker config files and with only a single restart > per > > worker. The proposed approach provides a much simpler user experience for > > the most common use case and doesn't impose much additional complexity > for > > other use cases. > > > > I've updated the KIP to expand on points from your last emails; let me > > know what you think. > > > > Cheers, > > > > Chris > > > > On Thu, Aug 15, 2019 at 4:53 PM Ryanne Dolan > > wrote: > > > >> Thanks Chris, that makes sense. > >> > >> I know you have already considered this, but I'm not convinced we should > >> rule out using Kafka topics for this purpose. That would enable the same > >> level of security without introducing any new authentication or > >> authorization mechanisms (your keys). And as you say, you'd need to lock > >> down Connect's topics and groups anyway. > >> > >> Can you explain what you mean when you say using Kafka topics would > >> require > >> "reworking the group coordination protocol"? I don't see how these are > >> related. Why would it matter if the workers sent Kafka messages vs POST > >> requests among themselves? > >> > >> Ryanne > >> > >> On Thu, Aug 15, 2019 at 3:57 PM Chris Egerton > >> wrote: > >> > >> > Hi Ryanne, > >> > > >> > Yes, if the Connect group is left unsecured then that is a potential > >> > vulnerability. However, in a truly secure Connect cluster, the group > >> would > >> > need to be secured anyways to prevent attackers from joining the group > >> with > >> > the intent to either snoop on connector/task configurations or bring > the > >> > cluster to a halt by spamming the group with membership requests and > >> then > >> > not running the assigned connectors/tasks. Additionally, for a Connect > >> > cluster to be secure, access to internal topics (for configs, offsets, > >> and > >> > statuses) would also need to be restricted so that attackers could > not, > >> > e.g., write arbitrary connector/task configurations to the configs > >> topic. > >> > This is all currently possible in Kafka with the use of ACLs. > >> > > >> > I think the bottom line here is that there's a number of steps that > >> need to > >> > be taken to effectively lock down a Connect cluster; the point of this > >> KIP > >> > is to close a loophole that exists even after all of those steps are > >> taken, > >> > not to completely secure this one vulnerability even when no other > >> security > >> > measures are taken. > >> > > >> >
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Hi all, I've made some tweaks to the KIP that I believe are improvements. More detail can be found on the KIP page itself, but as a brief summary, the three changes are: - The removal of the internal.request.verification property in favor of modifying the default value for the connect.protocol property from "compatible" to "sessioned" - The renaming of some configurations to use better terminology (mostly just "request" instead of "key" where appropriate) - The addition of two new configurations that dictate how session keys are to be generated Thanks Ryanne for the feedback so far, hope to hear from some more of you soon :) Cheers, Chris On Mon, Aug 19, 2019 at 12:02 PM Chris Egerton wrote: > Hi Ryanne, > > The reasoning for this is provided in the KIP: "There would be no clear > way to achieve consensus amongst the workers in a cluster on whether to > switch to this new behavior." To elaborate on this--it would be bad if a > follower worker began writing task configurations to the topic while the > leader continued to only listen on the internal REST endpoint. It's > necessary to ensure that every worker in a cluster supports this new > behavior before switching it on. > > To be fair, it is technically possible to achieve consensus without using > the group coordination protocol by adding new configurations and using a > multi-phase rolling upgrade, but the operational complexity of this > approach would significantly complicate things for the default case of just > wanting to bump your Connect cluster to the next version without having to > alter your existing worker config files and with only a single restart per > worker. The proposed approach provides a much simpler user experience for > the most common use case and doesn't impose much additional complexity for > other use cases. > > I've updated the KIP to expand on points from your last emails; let me > know what you think. > > Cheers, > > Chris > > On Thu, Aug 15, 2019 at 4:53 PM Ryanne Dolan > wrote: > >> Thanks Chris, that makes sense. >> >> I know you have already considered this, but I'm not convinced we should >> rule out using Kafka topics for this purpose. That would enable the same >> level of security without introducing any new authentication or >> authorization mechanisms (your keys). And as you say, you'd need to lock >> down Connect's topics and groups anyway. >> >> Can you explain what you mean when you say using Kafka topics would >> require >> "reworking the group coordination protocol"? I don't see how these are >> related. Why would it matter if the workers sent Kafka messages vs POST >> requests among themselves? >> >> Ryanne >> >> On Thu, Aug 15, 2019 at 3:57 PM Chris Egerton >> wrote: >> >> > Hi Ryanne, >> > >> > Yes, if the Connect group is left unsecured then that is a potential >> > vulnerability. However, in a truly secure Connect cluster, the group >> would >> > need to be secured anyways to prevent attackers from joining the group >> with >> > the intent to either snoop on connector/task configurations or bring the >> > cluster to a halt by spamming the group with membership requests and >> then >> > not running the assigned connectors/tasks. Additionally, for a Connect >> > cluster to be secure, access to internal topics (for configs, offsets, >> and >> > statuses) would also need to be restricted so that attackers could not, >> > e.g., write arbitrary connector/task configurations to the configs >> topic. >> > This is all currently possible in Kafka with the use of ACLs. >> > >> > I think the bottom line here is that there's a number of steps that >> need to >> > be taken to effectively lock down a Connect cluster; the point of this >> KIP >> > is to close a loophole that exists even after all of those steps are >> taken, >> > not to completely secure this one vulnerability even when no other >> security >> > measures are taken. >> > >> > Cheers, >> > >> > Chris >> > >> > On Wed, Aug 14, 2019 at 10:56 PM Ryanne Dolan >> > wrote: >> > >> > > Chris, I don't understand how the rebalance protocol can be used to >> give >> > > out session tokens in a secure way. It seems that any attacker could >> just >> > > join the group and sign requests with the provided token. Am I missing >> > > something? >> > > >> > > Ryanne >> > > >> > > On Wed, Aug 14, 2019, 2:31 PM Chris Egerton >> wrote: >> > > >> > > > The KIP page can be found at >> > > > >> > > > >> > > >> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints >> > > > , >> > > > by the way. Apologies for neglecting to include it in my initial >> email! >> > > > >> > > > On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton > > >> > > > wrote: >> > > > >> > > > > Hi all, >> > > > > >> > > > > I'd like to start discussion on a KIP to secure the internal "POST >> > > > > /connectors//tasks" endpoint for the Connect framework. The >> > > > proposed >> > > > > changes address a vulnerability in the framework in its
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Hi Ryanne, The reasoning for this is provided in the KIP: "There would be no clear way to achieve consensus amongst the workers in a cluster on whether to switch to this new behavior." To elaborate on this--it would be bad if a follower worker began writing task configurations to the topic while the leader continued to only listen on the internal REST endpoint. It's necessary to ensure that every worker in a cluster supports this new behavior before switching it on. To be fair, it is technically possible to achieve consensus without using the group coordination protocol by adding new configurations and using a multi-phase rolling upgrade, but the operational complexity of this approach would significantly complicate things for the default case of just wanting to bump your Connect cluster to the next version without having to alter your existing worker config files and with only a single restart per worker. The proposed approach provides a much simpler user experience for the most common use case and doesn't impose much additional complexity for other use cases. I've updated the KIP to expand on points from your last emails; let me know what you think. Cheers, Chris On Thu, Aug 15, 2019 at 4:53 PM Ryanne Dolan wrote: > Thanks Chris, that makes sense. > > I know you have already considered this, but I'm not convinced we should > rule out using Kafka topics for this purpose. That would enable the same > level of security without introducing any new authentication or > authorization mechanisms (your keys). And as you say, you'd need to lock > down Connect's topics and groups anyway. > > Can you explain what you mean when you say using Kafka topics would require > "reworking the group coordination protocol"? I don't see how these are > related. Why would it matter if the workers sent Kafka messages vs POST > requests among themselves? > > Ryanne > > On Thu, Aug 15, 2019 at 3:57 PM Chris Egerton wrote: > > > Hi Ryanne, > > > > Yes, if the Connect group is left unsecured then that is a potential > > vulnerability. However, in a truly secure Connect cluster, the group > would > > need to be secured anyways to prevent attackers from joining the group > with > > the intent to either snoop on connector/task configurations or bring the > > cluster to a halt by spamming the group with membership requests and then > > not running the assigned connectors/tasks. Additionally, for a Connect > > cluster to be secure, access to internal topics (for configs, offsets, > and > > statuses) would also need to be restricted so that attackers could not, > > e.g., write arbitrary connector/task configurations to the configs topic. > > This is all currently possible in Kafka with the use of ACLs. > > > > I think the bottom line here is that there's a number of steps that need > to > > be taken to effectively lock down a Connect cluster; the point of this > KIP > > is to close a loophole that exists even after all of those steps are > taken, > > not to completely secure this one vulnerability even when no other > security > > measures are taken. > > > > Cheers, > > > > Chris > > > > On Wed, Aug 14, 2019 at 10:56 PM Ryanne Dolan > > wrote: > > > > > Chris, I don't understand how the rebalance protocol can be used to > give > > > out session tokens in a secure way. It seems that any attacker could > just > > > join the group and sign requests with the provided token. Am I missing > > > something? > > > > > > Ryanne > > > > > > On Wed, Aug 14, 2019, 2:31 PM Chris Egerton > wrote: > > > > > > > The KIP page can be found at > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints > > > > , > > > > by the way. Apologies for neglecting to include it in my initial > email! > > > > > > > > On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton > > > > wrote: > > > > > > > > > Hi all, > > > > > > > > > > I'd like to start discussion on a KIP to secure the internal "POST > > > > > /connectors//tasks" endpoint for the Connect framework. The > > > > proposed > > > > > changes address a vulnerability in the framework in its current > state > > > > that > > > > > allows malicious users to write arbitrary task configurations for > > > > > connectors; it is vital that this issue be addressed in order for > any > > > > > Connect cluster to be secure. > > > > > > > > > > Looking forward to your thoughts, > > > > > > > > > > Chris > > > > > > > > > > > > > > >
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Thanks Chris, that makes sense. I know you have already considered this, but I'm not convinced we should rule out using Kafka topics for this purpose. That would enable the same level of security without introducing any new authentication or authorization mechanisms (your keys). And as you say, you'd need to lock down Connect's topics and groups anyway. Can you explain what you mean when you say using Kafka topics would require "reworking the group coordination protocol"? I don't see how these are related. Why would it matter if the workers sent Kafka messages vs POST requests among themselves? Ryanne On Thu, Aug 15, 2019 at 3:57 PM Chris Egerton wrote: > Hi Ryanne, > > Yes, if the Connect group is left unsecured then that is a potential > vulnerability. However, in a truly secure Connect cluster, the group would > need to be secured anyways to prevent attackers from joining the group with > the intent to either snoop on connector/task configurations or bring the > cluster to a halt by spamming the group with membership requests and then > not running the assigned connectors/tasks. Additionally, for a Connect > cluster to be secure, access to internal topics (for configs, offsets, and > statuses) would also need to be restricted so that attackers could not, > e.g., write arbitrary connector/task configurations to the configs topic. > This is all currently possible in Kafka with the use of ACLs. > > I think the bottom line here is that there's a number of steps that need to > be taken to effectively lock down a Connect cluster; the point of this KIP > is to close a loophole that exists even after all of those steps are taken, > not to completely secure this one vulnerability even when no other security > measures are taken. > > Cheers, > > Chris > > On Wed, Aug 14, 2019 at 10:56 PM Ryanne Dolan > wrote: > > > Chris, I don't understand how the rebalance protocol can be used to give > > out session tokens in a secure way. It seems that any attacker could just > > join the group and sign requests with the provided token. Am I missing > > something? > > > > Ryanne > > > > On Wed, Aug 14, 2019, 2:31 PM Chris Egerton wrote: > > > > > The KIP page can be found at > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints > > > , > > > by the way. Apologies for neglecting to include it in my initial email! > > > > > > On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton > > > wrote: > > > > > > > Hi all, > > > > > > > > I'd like to start discussion on a KIP to secure the internal "POST > > > > /connectors//tasks" endpoint for the Connect framework. The > > > proposed > > > > changes address a vulnerability in the framework in its current state > > > that > > > > allows malicious users to write arbitrary task configurations for > > > > connectors; it is vital that this issue be addressed in order for any > > > > Connect cluster to be secure. > > > > > > > > Looking forward to your thoughts, > > > > > > > > Chris > > > > > > > > > >
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Hi Ryanne, Yes, if the Connect group is left unsecured then that is a potential vulnerability. However, in a truly secure Connect cluster, the group would need to be secured anyways to prevent attackers from joining the group with the intent to either snoop on connector/task configurations or bring the cluster to a halt by spamming the group with membership requests and then not running the assigned connectors/tasks. Additionally, for a Connect cluster to be secure, access to internal topics (for configs, offsets, and statuses) would also need to be restricted so that attackers could not, e.g., write arbitrary connector/task configurations to the configs topic. This is all currently possible in Kafka with the use of ACLs. I think the bottom line here is that there's a number of steps that need to be taken to effectively lock down a Connect cluster; the point of this KIP is to close a loophole that exists even after all of those steps are taken, not to completely secure this one vulnerability even when no other security measures are taken. Cheers, Chris On Wed, Aug 14, 2019 at 10:56 PM Ryanne Dolan wrote: > Chris, I don't understand how the rebalance protocol can be used to give > out session tokens in a secure way. It seems that any attacker could just > join the group and sign requests with the provided token. Am I missing > something? > > Ryanne > > On Wed, Aug 14, 2019, 2:31 PM Chris Egerton wrote: > > > The KIP page can be found at > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints > > , > > by the way. Apologies for neglecting to include it in my initial email! > > > > On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton > > wrote: > > > > > Hi all, > > > > > > I'd like to start discussion on a KIP to secure the internal "POST > > > /connectors//tasks" endpoint for the Connect framework. The > > proposed > > > changes address a vulnerability in the framework in its current state > > that > > > allows malicious users to write arbitrary task configurations for > > > connectors; it is vital that this issue be addressed in order for any > > > Connect cluster to be secure. > > > > > > Looking forward to your thoughts, > > > > > > Chris > > > > > >
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Chris, I don't understand how the rebalance protocol can be used to give out session tokens in a secure way. It seems that any attacker could just join the group and sign requests with the provided token. Am I missing something? Ryanne On Wed, Aug 14, 2019, 2:31 PM Chris Egerton wrote: > The KIP page can be found at > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints > , > by the way. Apologies for neglecting to include it in my initial email! > > On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton > wrote: > > > Hi all, > > > > I'd like to start discussion on a KIP to secure the internal "POST > > /connectors//tasks" endpoint for the Connect framework. The > proposed > > changes address a vulnerability in the framework in its current state > that > > allows malicious users to write arbitrary task configurations for > > connectors; it is vital that this issue be addressed in order for any > > Connect cluster to be secure. > > > > Looking forward to your thoughts, > > > > Chris > > >
Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
The KIP page can be found at https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints, by the way. Apologies for neglecting to include it in my initial email! On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton wrote: > Hi all, > > I'd like to start discussion on a KIP to secure the internal "POST > /connectors//tasks" endpoint for the Connect framework. The proposed > changes address a vulnerability in the framework in its current state that > allows malicious users to write arbitrary task configurations for > connectors; it is vital that this issue be addressed in order for any > Connect cluster to be secure. > > Looking forward to your thoughts, > > Chris >