Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-09-19 Thread Chris Egerton
Hi all,

Two quick updates on KIP-507:

1) According to
https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#Mac,
"With some MAC algorithms, the (secret-)key algorithm associated with the
(secret-)key object used to initialize the Mac object does not matter (this
is the case with the HMAC-MD5 and HMAC-SHA1 implementations of the SunJCE
provider). With others, however, the (secret-)key algorithm does matter,
and an InvalidKeyException is thrown if a (secret-)key object with an
inappropriate (secret-)key algorithm is used.". Based on this, I'm inclined
to keep the two separate configurations for key generation and signature
algorithms; if there happens to be a use case out there that requires, for
example, the DES key generation algorithm in conjunction with the
HmacSHA256 MAC algorithm, I'd hate for someone to have to file a follow-up
KIP just to add that single configuration to their Connect cluster. These
will all be low-importance configurations, and although unnecessary configs
should be avoided to prevent documentation bloat, I believe the potential
benefits of keeping these two separate outweigh the costs.
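To make the quoted javadoc behavior concrete, here is a minimal sketch (the key algorithm name and input are arbitrary; it assumes the SunJCE provider, whose HMAC implementations tolerate the mismatch per the javadoc):

```java
import javax.crypto.KeyGenerator;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class MacKeyAlgorithmSketch {
    public static void main(String[] args) throws Exception {
        // Generate raw key material, then declare it under a *different*
        // algorithm name than the Mac it will initialize.
        byte[] raw = KeyGenerator.getInstance("HmacSHA256").generateKey().getEncoded();
        SecretKeySpec mismatched = new SecretKeySpec(raw, "DES");

        // SunJCE's HMAC Macs accept keys regardless of the key's declared
        // algorithm; other algorithms/providers may throw InvalidKeyException here.
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(mismatched);
        byte[] signature = mac.doFinal("request-body".getBytes("UTF-8"));
        System.out.println(signature.length); // HmacSHA1 produces a 20-byte MAC
    }
}
```

Whether `init` throws is provider- and algorithm-dependent, which is exactly why hardcoding a single key-algorithm/MAC-algorithm pairing could paint users into a corner.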

2) I've renamed the proposed configurations to try to take some of
Randall's suggestions into account; the changes can be summarized as
replacing "internal.request" with "inter.worker", and changing "
inter.worker.key.rotation.interval.ms" to "inter.worker.key.ttl.ms".
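Under the renamed properties, a distributed worker config touching these settings might look like the following (names as proposed at this point in the discussion, values purely illustrative):

```properties
# Hedged example; these names reflect the current KIP revision and
# the values shown are not recommendations.
inter.worker.key.generation.algorithm=HmacSHA256
inter.worker.signature.algorithm=HmacSHA256
inter.worker.verification.algorithms=HmacSHA256
# inter.worker.key.size left unset (null): defer to the generation
# algorithm's default key size.
inter.worker.key.ttl.ms=3600000
```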

Cheers,

Chris

On Tue, Sep 17, 2019 at 10:36 AM Chris Egerton  wrote:

> Hi Randall,
>
> I've added your suggested paragraph to the KIP; it definitely clarifies
> the intended behavior more than the content that it replaced. Thanks!
>
> As far as I can tell, the two items remaining open for discussion are the
> names for the new configs (which should be made less REST
> request-specific), and whether separate configs should be enabled for the
> key generation and MAC signing algorithms. As soon as I've gotten some
> research done on the latter, I'll report back and then we can hopefully
> also begin discussing the former.
>
> Cheers,
>
> Chris
>
> On Mon, Sep 16, 2019 at 4:28 PM Randall Hauch  wrote:
>
>> On Mon, Sep 16, 2019 at 3:06 PM Chris Egerton 
>> wrote:
>>
>> > Hi Randall,
>> >
>> > The new default value for the key size configuration will be "null".
>> I've
>> > clarified this in the KIP. This will still preserve backwards
>> compatibility
>> > and should not be an issue.
>> >
>>
>> Thanks for clarifying this in the KIP.
>>
>>
>> >
>> > I understand your point about key generation vs MAC signing algorithms;
>> > like I said, I'll need to do more research.
>> >
>> > I respectfully disagree that a single algorithm list would be easier to
>> > understand as opposed to a list of accepted algorithms and a signature
>> > algorithm. Placing special priority on the first element in that list is
>> > fairly implicit and leaves room for confusion where users might have the
>> > same algorithms in their lists for that config but in different orders
>> for
>> > different workers. The group coordination protocol selection behavior
>> isn't
>> > really a great example since users don't configure that directly
>> > themselves.
>> >
>> > RE the proposed set of new configs: like I stated in my previous
>> response,
>> > I object to the use of "cluster." as a configuration prefix for any
>> worker
>> > configs: "most configurations deal with the worker and, transitively,
>> the
>> > cluster; there doesn't seem to be good enough cause for including that
>> > prefix for these configurations." I also think this discussion should
>> > probably continue a little more before we start proposing new
>> configuration
>> > names, given that it's still undecided which configurations we're going
>> to
>> > expose.
>> >
>> > I don't think we have any data about how often a follower will be
>> > fully caught up, and it's possible that a worker's consumer fails to
>> keep
>> > the worker caught up quickly enough with the configs topic and the new
>> > session key. So can we really say that a follower making a request with
>> an
>> > expired key will be rare?
>> > It would depend on the key rotation interval, but I can't imagine the
>> > likelihood of this occurring with an interval as low as even 10 minutes
>> > would be that high. The config topic is low-volume; the consumer for
>> that
>> > topic isn't going to be flooded with writes and it seems fine to expect
>> > fairly low latency for the consumer of this topic.
>> >
>>
>> My concern is that we're really *assuming* it's not a problem. All I'm
>> asking for is more clarity on what happens when a worker doesn't know
>> about
>> the new session key when it makes a request to this REST resource. The KIP
>> makes it clear that it will be retried and that the existing error message
>> will be replaced with a debug message, at least for the time being. Perhaps
>> the KIP paragraph that talks about this can be reworded to make this more
>> clear, something akin to:
>>

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-09-17 Thread Chris Egerton
Hi Randall,

I've added your suggested paragraph to the KIP; it definitely clarifies the
intended behavior more than the content that it replaced. Thanks!

As far as I can tell, the two items remaining open for discussion are the
names for the new configs (which should be made less REST
request-specific), and whether separate configs should be enabled for the
key generation and MAC signing algorithms. As soon as I've gotten some
research done on the latter, I'll report back and then we can hopefully
also begin discussing the former.

Cheers,

Chris

On Mon, Sep 16, 2019 at 4:28 PM Randall Hauch  wrote:

> On Mon, Sep 16, 2019 at 3:06 PM Chris Egerton  wrote:
>
> > Hi Randall,
> >
> > The new default value for the key size configuration will be "null". I've
> > clarified this in the KIP. This will still preserve backwards
> compatibility
> > and should not be an issue.
> >
>
> Thanks for clarifying this in the KIP.
>
>
> >
> > I understand your point about key generation vs MAC signing algorithms;
> > like I said, I'll need to do more research.
> >
> > I respectfully disagree that a single algorithm list would be easier to
> > understand as opposed to a list of accepted algorithms and a signature
> > algorithm. Placing special priority on the first element in that list is
> > fairly implicit and leaves room for confusion where users might have the
> > same algorithms in their lists for that config but in different orders
> for
> > different workers. The group coordination protocol selection behavior
> isn't
> > really a great example since users don't configure that directly
> > themselves.
> >
> > RE the proposed set of new configs: like I stated in my previous
> response,
> > I object to the use of "cluster." as a configuration prefix for any
> worker
> > configs: "most configurations deal with the worker and, transitively, the
> > cluster; there doesn't seem to be good enough cause for including that
> > prefix for these configurations." I also think this discussion should
> > probably continue a little more before we start proposing new
> configuration
> > names, given that it's still undecided which configurations we're going
> to
> > expose.
> >
> > > I don't think we have any data about how often a follower will be
> > fully caught up, and it's possible that a worker's consumer fails to keep
> > the worker caught up quickly enough with the configs topic and the new
> > session key. So can we really say that a follower making a request with
> an
> > expired key will be rare?
> > It would depend on the key rotation interval, but I can't imagine the
> > likelihood of this occurring with an interval as low as even 10 minutes
> > would be that high. The config topic is low-volume; the consumer for that
> > topic isn't going to be flooded with writes and it seems fine to expect
> > fairly low latency for the consumer of this topic.
> >
>
> My concern is that we're really *assuming* it's not a problem. All I'm
> asking for is more clarity on what happens when a worker doesn't know about
> the new session key when it makes a request to this REST resource. The KIP
> makes it clear that it will be retried and that the existing error message
> will be replaced with a debug message, at least for the time being. Perhaps
> the KIP paragraph that talks about this can be reworded to make this more
> clear, something akin to:
>
> "The leader will only accept requests signed with the most current key.
> However, Connect follower workers may routinely experience small delays
> when reading the new key. Rather than always logging such task
> configuration failure and retry attempts as errors (the current behavior),
> Connect's distributed herder will be modified slightly to handle such HTTP
> 403 responses for this task configuration request by quietly retrying them
> with the latest key for up to 1 minute. If failures persist for more than 1
> minute, they will be logged as errors."
>
>
> > > Can we not retry for longer than 1 second if the request fails with
> HTTP
> > 403? I'm concerned what the UX will be if/when this happens, and that the
> > user sees a very obtuse and seemingly unrelated error message they won't
> > know how to fix.
> > To be clear, the KIP doesn't propose any changes to the infinite retry
> > logic that's present in the worker today. All that the KIP proposes is
> that
> > an existing error-level message be replaced with a debug-level message if
> > it's suspected that a request has failed due to an out-of-date key. With
> > that out of the way, sure, we can bump the grace period before beginning
> to
> > emit error-level log messages. I think going as high as a minute might be
> > acceptable; we should try to stay fairly low, however, in case the
> request
> > failures are due to some other reason that should be surfaced as soon as
> > possible and with some urgency.
> >
>
> Ack. See above.
>
> >
> > > The text in the "Failure to relay task configurations to leader due to
> > incorrect 

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-09-16 Thread Randall Hauch
On Mon, Sep 16, 2019 at 3:06 PM Chris Egerton  wrote:

> Hi Randall,
>
> The new default value for the key size configuration will be "null". I've
> clarified this in the KIP. This will still preserve backwards compatibility
> and should not be an issue.
>

Thanks for clarifying this in the KIP.


>
> I understand your point about key generation vs MAC signing algorithms;
> like I said, I'll need to do more research.
>
> I respectfully disagree that a single algorithm list would be easier to
> understand as opposed to a list of accepted algorithms and a signature
> algorithm. Placing special priority on the first element in that list is
> fairly implicit and leaves room for confusion where users might have the
> same algorithms in their lists for that config but in different orders for
> different workers. The group coordination protocol selection behavior isn't
> really a great example since users don't configure that directly
> themselves.
>
> RE the proposed set of new configs: like I stated in my previous response,
> I object to the use of "cluster." as a configuration prefix for any worker
> configs: "most configurations deal with the worker and, transitively, the
> cluster; there doesn't seem to be good enough cause for including that
> prefix for these configurations." I also think this discussion should
> probably continue a little more before we start proposing new configuration
> names, given that it's still undecided which configurations we're going to
> expose.
>
> I don't think we have any data about how often a follower will be
> fully caught up, and it's possible that a worker's consumer fails to keep
> the worker caught up quickly enough with the configs topic and the new
> session key. So can we really say that a follower making a request with an
> expired key will be rare?
> It would depend on the key rotation interval, but I can't imagine the
> likelihood of this occurring with an interval as low as even 10 minutes
> would be that high. The config topic is low-volume; the consumer for that
> topic isn't going to be flooded with writes and it seems fine to expect
> fairly low latency for the consumer of this topic.
>

My concern is that we're really *assuming* it's not a problem. All I'm
asking for is more clarity on what happens when a worker doesn't know about
the new session key when it makes a request to this REST resource. The KIP
makes it clear that it will be retried and that the existing error message
will be replaced with a debug message, at least for the time being. Perhaps
the KIP paragraph that talks about this can be reworded to make this more
clear, something akin to:

"The leader will only accept requests signed with the most current key.
However, Connect follower workers may routinely experience small delays
when reading the new key. Rather than always logging such task
configuration failure and retry attempts as errors (the current behavior),
Connect's distributed herder will be modified slightly to handle such HTTP
403 responses for this task configuration request by quietly retrying them
with the latest key for up to 1 minute. If failures persist for more than 1
minute, they will be logged as errors."
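The proposed policy in the paragraph above could be sketched roughly as follows (a hypothetical illustration only; the names and the simulated request do not correspond to actual Connect internals):

```java
import java.time.Duration;
import java.time.Instant;

public class TaskConfigRetrySketch {
    static final Duration QUIET_PERIOD = Duration.ofMinutes(1);
    static final long BACKOFF_MS = 250;

    // Simulated request to the leader: returns an HTTP status code.
    // Assume the follower's key catches up after a few attempts.
    static int forwardTaskConfigs(int attempt) {
        return attempt < 3 ? 403 : 200;
    }

    public static void main(String[] args) throws Exception {
        Instant firstFailure = null;
        for (int attempt = 0; ; attempt++) {
            int status = forwardTaskConfigs(attempt);
            if (status == 200) {
                System.out.println("succeeded after " + attempt + " retries");
                return;
            }
            if (firstFailure == null) firstFailure = Instant.now();
            boolean withinQuietPeriod =
                Duration.between(firstFailure, Instant.now()).compareTo(QUIET_PERIOD) < 0;
            // A 403 inside the quiet period is logged at debug level and
            // quietly retried with the latest key; once the quiet period is
            // exceeded, fall back to the existing error-level logging.
            System.out.println((withinQuietPeriod ? "DEBUG" : "ERROR")
                + ": task config request failed with 403, retrying");
            Thread.sleep(BACKOFF_MS);
        }
    }
}
```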


> > Can we not retry for longer than 1 second if the request fails with HTTP
> 403? I'm concerned what the UX will be if/when this happens, and that the
> user sees a very obtuse and seemingly unrelated error message they won't
> know how to fix.
> To be clear, the KIP doesn't propose any changes to the infinite retry
> logic that's present in the worker today. All that the KIP proposes is that
> an existing error-level message be replaced with a debug-level message if
> it's suspected that a request has failed due to an out-of-date key. With
> that out of the way, sure, we can bump the grace period before beginning to
> emit error-level log messages. I think going as high as a minute might be
> acceptable; we should try to stay fairly low, however, in case the request
> failures are due to some other reason that should be surfaced as soon as
> possible and with some urgency.
>

Ack. See above.

>
> > The text in the "Failure to relay task configurations to leader due to
> incorrect configuration" section is similarly ambiguous. How will a user
> know that this is occurring, and is this really an unlikely situation
> (e.g., "This scenario is unlikely to occur with any normal usage of
> Connect.")? Seems like users unintentionally misconfiguring some of their
> Connect workers is quite common.
> A user will know that this is occurring based on error-level log messages
> emitted by the worker about a failure to relay task ("Failed to reconfigure
> connector's tasks, retrying after backoff:"), plus a stack trace. Yes, this
> situation is unlikely; most worker files will never contain the
> newly-proposed configurations for this KIP since the default values should
> suffice for most cases. If anyone tries to adjust these values, we will
> have documentation available on 

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-09-16 Thread Chris Egerton
Hi Randall,

The new default value for the key size configuration will be "null". I've
clarified this in the KIP. This will still preserve backwards compatibility
and should not be an issue.

I understand your point about key generation vs MAC signing algorithms;
like I said, I'll need to do more research.

I respectfully disagree that a single algorithm list would be easier to
understand as opposed to a list of accepted algorithms and a signature
algorithm. Placing special priority on the first element in that list is
fairly implicit and leaves room for confusion where users might have the
same algorithms in their lists for that config but in different orders for
different workers. The group coordination protocol selection behavior isn't
really a great example since users don't configure that directly
themselves.

RE the proposed set of new configs: like I stated in my previous response,
I object to the use of "cluster." as a configuration prefix for any worker
configs: "most configurations deal with the worker and, transitively, the
cluster; there doesn't seem to be good enough cause for including that
prefix for these configurations." I also think this discussion should
probably continue a little more before we start proposing new configuration
names, given that it's still undecided which configurations we're going to
expose.

> I don't think we have any data about how often a follower will be
fully caught up, and it's possible that a worker's consumer fails to keep
the worker caught up quickly enough with the configs topic and the new
session key. So can we really say that a follower making a request with an
expired key will be rare?
It would depend on the key rotation interval, but I can't imagine the
likelihood of this occurring with an interval as low as even 10 minutes
would be that high. The config topic is low-volume; the consumer for that
topic isn't going to be flooded with writes and it seems fine to expect
fairly low latency for the consumer of this topic.

> Can we not retry for longer than 1 second if the request fails with HTTP
403? I'm concerned what the UX will be if/when this happens, and that the
user sees a very obtuse and seemingly unrelated error message they won't
know how to fix.
To be clear, the KIP doesn't propose any changes to the infinite retry
logic that's present in the worker today. All that the KIP proposes is that
an existing error-level message be replaced with a debug-level message if
it's suspected that a request has failed due to an out-of-date key. With
that out of the way, sure, we can bump the grace period before beginning to
emit error-level log messages. I think going as high as a minute might be
acceptable; we should try to stay fairly low, however, in case the request
failures are due to some other reason that should be surfaced as soon as
possible and with some urgency.

> The text in the "Failure to relay task configurations to leader due to
incorrect configuration" section is similarly ambiguous. How will a user
know that this is occurring, and is this really an unlikely situation
(e.g., "This scenario is unlikely to occur with any normal usage of
Connect.")? Seems like users unintentionally misconfiguring some of their
Connect workers is quite common.
A user will know that this is occurring based on error-level log messages
emitted by the worker about a failure to relay task ("Failed to reconfigure
connector's tasks, retrying after backoff:"), plus a stack trace. Yes, this
situation is unlikely; most worker files will never contain the
newly-proposed configurations for this KIP since the default values should
suffice for most cases. If anyone tries to adjust these values, we will
have documentation available on what their behavior is and what to do to
ensure your cluster doesn't break while changing them (including the
rolling upgrade procedure outlined in the KIP). These configurations make
it no likelier that a user will accidentally break their Connect cluster
than the group ID, key/value converter, REST extension, and broker
authentication configs (to name just a few), and since they will be left
out of any sample worker configurations included in Kafka, the odds of
someone accidentally messing with them are low enough that it doesn't seem
worth investing a lot of effort into making it harder for someone to shoot
themselves in the foot.
I'll update the KIP to include possible indicators that the cluster has
been misconfigured, but I don't think this scenario deserves a ton of
priority.

> Maybe provide a bit more description of what these error messages will be.
I believe this is more of an implementation detail and should be left to PR
review. KIPs should only be expected to be so fine-grained; proposing
actual log messages doesn't seem necessary for the overall
adoption/rejection of the mechanisms proposed here and the iteration cycle
on GitHub is significantly faster and more efficient than on a mailing list.

> Do we need a JMX metric that shows 

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-09-16 Thread Randall Hauch
Thanks, Chris. I still have a number of questions and requests for
clarification.

If we don't provide a default value for the key size, then the following
statement in the "Backwards compatibility" subsection is no longer true:
"All of the proposed configurations here have default values, making them
backwards compatible." IIUC, a user will not be able to upgrade an existing
Connect cluster without editing their configurations, and this pretty much
means it is not backwards compatible, which is a non-starter.

Regarding the new configurations and our earlier discussion about whether
all are required, I'm not convinced that we need separate configs for
signing and key generation algorithms. If this is a common need, then the
KIP should justify that with an explanation. But as it stands now, exposing
multiple algorithm properties now means the UX is more complicated and we
can't make things simpler in the future. I also think that a single
algorithm list where the first is used for signing would be easier to
understand (fewer is better) and would be more in line with the way
rebalance protocols are chosen by the broker (see
https://www.youtube.com/watch?v=MmLezWRI3Ys for a decent explanation). If
at some point in the future we do want that extra flexibility, then we can
expose additional properties.

I also asked earlier about renaming the config properties so they can be
used in other places within the cluster other than the task configs
request, which is something I think we'll want to do. If we minimize the
number of configs, then how about `cluster.session.algorithms`,
`cluster.session.key.size` and `cluster.session.key.ttl.ms`?

The "Proposed Changes" section now mentions:

The leader will only accept requests signed with the most current key. This
> should not cause any major problems; if a follower attempts to make a
> request with an expired key (which should be quite rare and only occur if
> the request is made by a follower that is not fully caught up to the end of
> the config topic), the initial request will fail, but will be subsequently
> retried after a backoff period. This backoff period should leave sufficient
> room for the rebalance to complete.
>

I don't think we have any data about how often a follower will be fully
caught up, and it's possible that a worker's consumer fails to keep the
worker caught up quickly enough with the configs topic and the new session
key. So can we really say that a follower making a request with an expired
key will be rare?

If the first four requests fail with HTTP 403, it will be assumed that this
> is due to an out-of-date session key; a debug-level message about the
> subsequent retry will be logged in place of the current error-level log
> message of "Failed to reconfigure connector's tasks, retrying after
> backoff: " followed by a stack trace. Since the backoff period is 250
> milliseconds, this should give at least one second of leeway for an
> outdated key to be updated. If longer than that is required, the usual
> error-level log messages will begin to be generated by the worker.
>

Can we not retry for longer than 1 second if the request fails with HTTP
403? I'm concerned what the UX will be if/when this happens, and that the
user sees a very obtuse and seemingly unrelated error message they won't
know how to fix.

The text in the "Failure to relay task configurations to leader due to
incorrect configuration" section is similarly ambiguous. How will a user
know that this is occurring, and is this really an unlikely situation
(e.g., "This scenario is unlikely to occur with any normal usage of
Connect.")? Seems like users unintentionally misconfiguring some of their
Connect workers is quite common.

Additionally, it will be vital to design appropriate error messages for
> this scenario so that users can (hopefully) dig themselves out of that hole
> on their own.


Maybe provide a bit more description of what these error messages will be.

Do we need a JMX metric that shows the protocol that each worker is
configured to use, and whether the workers are using session keys? This
would be a great way to monitor the cluster's use of this feature.

Best regards,

Randall


On Wed, Sep 11, 2019 at 7:00 PM Chris Egerton  wrote:

> Hi all,
>
> I've updated KIP-507 to reflect the changes inspired by Randall's recent
> feedback. In addition, after some further research, I've decided to remove
> the proposed default value for the internal.request.key.size and instead,
> should no value be provided, rely on the default key size for the given key
> algorithm. More detail can be found in the KIP if anyone's interested.
>
> Cheers,
>
> Chris
>
> On Tue, Sep 10, 2019 at 3:18 PM Chris Egerton  wrote:
>
> > Hi Randall,
> >
> > Thanks so much for your comprehensive feedback! To avoid creating an
> > extremely long response I'll address your comments/questions by
> referencing
> > the identifiers you've provided for them as opposed to copying them and
> > responding 

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-09-11 Thread Chris Egerton
Hi all,

I've updated KIP-507 to reflect the changes inspired by Randall's recent
feedback. In addition, after some further research, I've decided to remove
the proposed default value for the internal.request.key.size and instead,
should no value be provided, rely on the default key size for the given key
algorithm. More detail can be found in the KIP if anyone's interested.

Cheers,

Chris

On Tue, Sep 10, 2019 at 3:18 PM Chris Egerton  wrote:

> Hi Randall,
>
> Thanks so much for your comprehensive feedback! To avoid creating an
> extremely long response I'll address your comments/questions by referencing
> the identifiers you've provided for them as opposed to copying them and
> responding inline; hopefully things are still readable :)
>
>
> RE new configs:
>
> a) I believe exposing these configurations right now is the correct
> approach. There are two scenarios that we should aim to support: running
> Connect on a bare-bones JVM that doesn't implement anything beyond the
> requirements of the JVM specs in terms of key generation and key signing
> algorithms, and running Connect in an environment with hard security
> requirements for, e.g., FIPS compliance. The default values for these
> configurations are intended to satisfy the first use case; however, in
> order to enable users to conform with higher security standards (the second
> use case), some key generation and key signing algorithms that are not
> guaranteed to be present on every implementation of the JVM may be
> required. I don't see a way to satisfy both use cases without adding
> configurability.
> With regards to key size: yes, this needs to be configurable as there are
> different ranges of acceptable key size for different key generation
> algorithms; for example, the "AES" algorithm for key generation requires a
> key size of either 128, 192 or 256 (at least on my local JVM) whereas the
> "DES" algorithm requires a key size of exactly 56.
> As far as enabling different algorithms for key generation vs. request
> signing goes... I'm not sure. Local tests involving keys generated with
> different algorithms from the mac algorithms used to sign inputs (e.g.,
> combining an HmacSHA256 key with an HmacMD5 mac or using a DES key with an
> HmacSHA256 mac) haven't thrown any exceptions yet but this is a little
> beyond the boundaries of my current knowledge, so I'll have to get back to
> you on that point. At the very least, your question (and the research I've
> done so far to try to address it) has highlighted a bug in my current PR
> that'll definitely need to be fixed :)
>
> b) The riskiest scenario is if a follower worker is configured to use a
> request signing algorithm that isn't allowed by the leader. In this case,
> any failure will only occur if/when that follower starts up a connector and
> then has to forward tasks for that connector to the leader, which may not
> happen immediately. Once that failure occurs, an endless failure loop will
> occur wherein the follower endlessly retries to send those task
> configurations to the leader and pauses by the backoff interval in between
> each failed request.
> There are two ways to rectify this situation; either shut down the
> follower and restart it after editing its configuration to use a request
> signing algorithm permitted by the leader, or shut down all other workers
> in the cluster that do not permit the request signing algorithm used by the
> follower, reconfigure them to permit it, and then restart them.
> This scenario is unlikely to occur with any normal usage of Connect, but
> the circumstances leading to it will need to be called out very carefully
> in order to avoid putting the cluster into a bad state and flooding logs
> with scary-looking error messages. Speaking of, it'll be vital to design
> appropriate error messages for this scenario so that users can (hopefully)
> dig themselves out of that hole on their own.
> Differing values for any of the other configs shouldn't actually be too
> problematic, given that the three remaining configs all deal with key
> generation/rotation, which is only handled by one worker at a time (the
> leader).
>
> c) Good call. Yes, these configs are all "low" importance and I'll update
> the KIP accordingly to reflect that.
>
> d) No, there is no window when multiple keys are valid. This is explicitly
> called out in the KIP at the end of the "Proposed Changes" section:
> > The leader will only accept requests signed with the most current key.
> This should not cause any major problems; if a follower attempts to make a
> request with an expired key (which should be quite rare and only occur if
> the request is made by a follower that is not fully caught up to the end of
> the config topic), the initial request will fail, but will be subsequently
> retried after a backoff period.
>
> e) I'll update the KIP to reflect the fact that, barring any need for
> heightened levels of security compliance, none of these configurations
> should 

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-09-10 Thread Chris Egerton
Hi Randall,

Thanks so much for your comprehensive feedback! To avoid creating an
extremely long response I'll address your comments/questions by referencing
the identifiers you've provided for them as opposed to copying them and
responding inline; hopefully things are still readable :)


RE new configs:

a) I believe exposing these configurations right now is the correct
approach. There are two scenarios that we should aim to support: running
Connect on a bare-bones JVM that doesn't implement anything beyond the
requirements of the JVM specs in terms of key generation and key signing
algorithms, and running Connect in an environment with hard security
requirements for, e.g., FIPS compliance. The default values for these
configurations are intended to satisfy the first use case; however, in
order to enable users to conform with higher security standards (the second
use case), some key generation and key signing algorithms that are not
guaranteed to be present on every implementation of the JVM may be
required. I don't see a way to satisfy both use cases without adding
configurability.
With regards to key size: yes, this needs to be configurable as there are
different ranges of acceptable key size for different key generation
algorithms; for example, the "AES" algorithm for key generation requires a
key size of either 128, 192 or 256 (at least on my local JVM) whereas the
"DES" algorithm requires a key size of exactly 56.
As far as enabling different algorithms for key generation vs. request
signing goes... I'm not sure. Local tests involving keys generated with
different algorithms from the mac algorithms used to sign inputs (e.g.,
combining an HmacSHA256 key with an HmacMD5 mac or using a DES key with an
HmacSHA256 mac) haven't thrown any exceptions yet but this is a little
beyond the boundaries of my current knowledge, so I'll have to get back to
you on that point. At the very least, your question (and the research I've
done so far to try to address it) has highlighted a bug in my current PR
that'll definitely need to be fixed :)
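The question of mixing key generation and MAC algorithms can be exercised directly against the JCE APIs. The sketch below (class and method names are illustrative, not the actual Connect worker code) generates a key with one algorithm and signs a payload with a possibly different MAC algorithm; whether a mismatched pair works or throws InvalidKeyException depends on the provider, per the JavaDoc quoted earlier in the thread.

```java
import java.nio.charset.StandardCharsets;
import javax.crypto.KeyGenerator;
import javax.crypto.Mac;
import javax.crypto.SecretKey;

public class SessionKeySigning {

    // Generate a session key with one algorithm and sign a request body with a
    // (possibly different) MAC algorithm.
    static byte[] sign(String keyAlgorithm, int keySize,
                       String macAlgorithm, byte[] requestBody) throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance(keyAlgorithm);
        gen.init(keySize); // valid sizes depend on the key algorithm (e.g. AES: 128/192/256)
        SecretKey key = gen.generateKey();

        Mac mac = Mac.getInstance(macAlgorithm);
        // Some providers throw InvalidKeyException here if the key's algorithm
        // is incompatible with the MAC algorithm -- the open question above.
        mac.init(key);
        return mac.doFinal(requestBody);
    }

    public static void main(String[] args) throws Exception {
        byte[] sig = sign("HmacSHA256", 256, "HmacSHA256",
                "task-configs".getBytes(StandardCharsets.UTF_8));
        System.out.println("signature bytes: " + sig.length); // 32 for HMAC-SHA256
    }
}
```

Running this with deliberately mismatched algorithm pairs is a quick way to probe a given JVM's behavior before settling on whether the two configs must stay separate.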

b) The riskiest scenario is if a follower worker is configured to use a
request signing algorithm that isn't allowed by the leader. In this case,
any failure will only occur if/when that follower starts up a connector and
then has to forward tasks for that connector to the leader, which may not
happen immediately. Once that failure occurs, an endless failure loop will
occur wherein the follower endlessly retries to send those task
configurations to the leader and pauses by the backoff interval in between
each failed request.
There are two ways to rectify this situation; either shut down the follower
and restart it after editing its configuration to use a request signing
algorithm permitted by the leader, or shut down all other workers in the
cluster that do not permit the request signing algorithm used by the
follower, reconfigure them to permit it, and then restart them.
This scenario is unlikely to occur with any normal usage of Connect, but
the circumstances leading to it will need to be called out very carefully
in order to avoid putting the cluster into a bad state and flooding logs
with scary-looking error messages. Speaking of, it'll be vital to design
appropriate error messages for this scenario so that users can (hopefully)
dig themselves out of that hole on their own.
Differing values for any of the other configs shouldn't actually be too
problematic, given that the three remaining configs all deal with key
generation/rotation, which is only handled by one worker at a time (the
leader).
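The failure mode in (b) boils down to the leader checking an incoming signature against its list of permitted algorithms. A minimal sketch of that check (hypothetical names; the real implementation lives in the KIP-507 PR and may differ):

```java
import java.security.MessageDigest;
import java.util.List;
import javax.crypto.Mac;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

public class RequestVerifier {

    // Leader-side check: a request is accepted only if its signature matches
    // under one of the algorithms the leader permits.
    static boolean verify(SecretKey sessionKey, byte[] body, byte[] signature,
                          List<String> permittedAlgorithms) throws Exception {
        for (String algorithm : permittedAlgorithms) {
            Mac mac = Mac.getInstance(algorithm);
            mac.init(sessionKey);
            // Constant-time comparison to avoid timing side channels.
            if (MessageDigest.isEqual(mac.doFinal(body), signature)) {
                return true;
            }
        }
        // A follower signing with a disallowed algorithm lands here on every
        // attempt, producing the endless retry loop described above.
        return false;
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = new SecretKeySpec(new byte[32], "HmacSHA256");
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(key);
        byte[] sig = mac.doFinal("body".getBytes());
        System.out.println(verify(key, "body".getBytes(), sig, List.of("HmacSHA256"))); // true
    }
}
```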

c) Good call. Yes, these configs are all "low" importance and I'll update
the KIP accordingly to reflect that.

d) No, there is no window when multiple keys are valid. This is explicitly
called out in the KIP at the end of the "Proposed Changes" section:
> The leader will only accept requests signed with the most current key.
This should not cause any major problems; if a follower attempts to make a
request with an expired key (which should be quite rare and only occur if
the request is made by a follower that is not fully caught up to the end of
the config topic), the initial request will fail, but will be subsequently
retried after a backoff period.

e) I'll update the KIP to reflect the fact that, barring any need for
heightened levels of security compliance, none of these configurations
should need to be altered and the defaults should suffice.

f) As mentioned above, a larger key size isn't necessarily possible for all
key generation algorithms. I believe these defaults are the best we can
work with until/unless other algorithms are added to the JVM spec.
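For concreteness, a hedged sketch of what these worker properties might look like with defaults filled in. The signature and verification names appear earlier in this thread, and the rotation-interval name is implied by the later rename; the key generation/size names and every value shown are illustrative guesses, not the KIP's final definitions:

```properties
# All low-importance; the defaults should suffice outside of strict
# compliance environments (names/values illustrative).
internal.request.key.generation.algorithm=HmacSHA256
internal.request.key.size=256
internal.request.key.rotation.interval.ms=3600000
internal.request.signature.algorithm=HmacSHA256
internal.request.verification.algorithms=HmacSHA256
```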

g) I don't believe there's a lot of precedent for configuration properties
like this in AK. The closest I could find was the "password.encoder.*"
broker properties, but those deal with storing and encrypting passwords, as
opposed to generating keys and signing requests with them. I'm leaning
towards the stance that it's not 

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-09-09 Thread Randall Hauch
Thanks for putting this KIP together, Chris. It's thorough and well thought
out, and you've done a great job responding to comments. It is indeed going
to be nice to harden the REST API a bit more.

I do have a few questions/concerns/comments, all of which I think can be
incorporated relatively easily.

First, regarding the new Connect worker configs mentioned in the "Public
Interfaces" section:
a. Do we need to expose all of these, or could Connect just internally use
constants and only expose them in the future if necessary? Do users really
need to explicitly set any of these, or will most users simply use the
defaults? Do we need to specify a different algorithm for generating a
session key versus signing a request? Does the key size need to be
configured, or can we use a large enough value and grow this in the future
with patch releases?
b. What happens if one worker has different values for any of these new
worker configs than the other workers? If there is an error, when will that
error occur (before startup happens, only when that worker attempts to use
the internal REST endpoint, etc.) and will the error be clear enough that the
user knows how to fix the problem?
c. The section should mention the importance of each new configuration
property. I suspect these will be "low" importance, since the defaults are
likely sufficient for most users.
d. Is there a window when both the new session key and the prior session
key are still valid? If so, what is that window?
e. Can the KIP emphasize more that the default values are likely to be
sufficient for most users?
f. Are all of the defaults sufficient for the foreseeable future? For
example, what is the cost of using a larger session key size?
g. Please mention whether these config properties follow or do not follow
existing precedent within AK.
h. Might these settings be used to validate other kinds of intra-cluster
communication and messages? If so, the "internal.request" prefix might make
that difficult. Was there any thought about using a bit more generic
prefix, such as "cluster."?
i. The `internal.request.verification.algorithms` and
`internal.request.signature.algorithm` are dependent upon each other, and
it seems like it's possible for the "verification" algorithms to not
include the "signature" algorithm. We could catch cases like that (though
not with simple Validator implementations since they are specific to a
single property value), but would it be better to have a single list and to
use the first item in the list as the signature algorithm?

Second, the KIP proposes to write the session key to the existing config
topic using record(s) with a new key. Konstantine was correct when he said
earlier in the thread:


> Indeed, broadcasting the session key to the followers is necessary. But
> adding it to the configs topic with a new key is compatible with previous
> versions (the key will be ignored by workers that don't support this
> protocol) and works since all workers will need to read this topic to the
> end to remain in the group.
>

It is true that the KafkaConfigBackingStore will not act upon config
records whose keys do not match the known pattern, but such
messages do result in an ERROR log message. I'm concerned that operators
will be alarmed (both emotionally and technically) if they begin to upgrade
a Connect cluster and start seeing ERROR log messages. We could change
these error messages to be warnings, but that will take effect only after a
worker is upgraded and restarted, and will not affect the
yet-to-be-upgraded workers in the Connect cluster. What can we do to avoid
these ERROR log messages? Should the default be to not change
`connect.protocol`, allow the user to upgrade all clusters, and then to
change `connect.protocol=sessioned` with a (second) rolling upgrade? Or, if
we decide to rely upon an upgrade recipe to avoid these, then we should
document that recipe here.
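The two-phase recipe suggested here could be sketched as follows (the `connect.protocol` values are from the KIP; treat this as an outline of the proposal, not a finalized upgrade procedure):

```properties
# Rolling restart 1: upgrade the worker software on every node while
# pinning the previous protocol, so no session-key records are written
# that old workers would log ERRORs about.
connect.protocol=compatible

# Rolling restart 2: once every worker runs the new version, enable
# session keys cluster-wide.
connect.protocol=sessioned
```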

Third, I think the KIP should address the potential for seeing one or more
"Failed to reconfigure connector's tasks, retrying after backoff" error
message during a key rotation. It would be good to eliminate the condition,
maybe by returning a response indicating authorization failure and
signifying a key rotation is required, and to prevent an ERROR log message.
I don't think separate INFO level log messages saying to disregard earlier
ERROR log messages is sufficient.

Fourth, how will an operator of a Connect cluster know whether this
internal endpoint is protected by authorization via this feature? And how
can an operator know which Connect workers are preventing a cluster from
enabling this feature? Should a warning or error be logged if this feature
is disabled after being enabled?

Finally, I have a few nits:

   1. The "Backwards compatibility" section should be more focused on the
   upgrade UX. "All of the new worker configurations have sensible defaults,
   and most users can simply upgrade without needing to override them." (BTW,
   if this is 

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-09-06 Thread Chris Egerton
Hi all,

I've published a draft PR implementing the changes currently proposed in
KIP-507, which can be found at https://github.com/apache/kafka/pull/7310.
Thanks for all of the review and feedback so far! Given the lack of any
major voiced reservations so far, if I don't hear anything else over the
course of the next few days or so I'll be opening this for a vote.

Cheers,

Chris


On Wed, Aug 28, 2019 at 12:21 PM Chris Egerton  wrote:

> Hi all,
>
> Wow, thanks for all the feedback! Happy to see this proposal getting some
> love :)
>
>
> RE Konstantine's comments:
>
> > I've read the discussions regarding why the rebalancing protocol is used
> here and your intention to follow the approach which was recently used in
> order to elegantly support upgrades without requiring rolling restarts
> makes sense.
> > However, I'd like to revisit the proposed extension of the connect
> protocol
> going a little bit more in detail.
> Indeed, broadcasting the session key to the followers is necessary. But
> adding it to the configs topic with a new key is compatible with previous
> versions (the key will be ignored by workers that don't support this
> protocol) and works since all workers will need to read this topic to the
> end to remain in the group.
> > Additionally, to your point about consensus, meaning how the workers all
> know during the same generation that the "sessioned" version of the
> protocol was chosen by the leader, it seems to me that an explicit upgrade
> of the protocol on the wire, on its name and on the metadata that are sent
> with the join request by each worker might not be required. What I believe
> would suffice is merely an increment of the version of the protocol,
> because all the other information remains the same (for example, the list
> of assigned and revoked tasks, etc). This would allow us not to extend the
> metadata even more with yet another JoinGroupRequest and could achieve what
> you want in terms of an orchestrated upgrade.
>
> I really like this idea, thanks for suggesting it. A simple bump of the
> protocol version does seem sufficient to achieve consensus among workers
> and would significantly reduce the implementation complexity of an entirely
> separate protocol just for the sake of distributing keys, and using the
> config topic to relay keys instead of the group coordination protocol would
> prevent rebalances solely for the sake of key rotation (instead, the leader
> can just periodically write a new key to the config topic). I'll update the
> KIP by EOD today with these changes.
>
> > All the leader worker would have to check is whether all the assignments
> that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if
> you'd choose to go this route).
>
> I like this idea for dictating whether the leader requires request
> signing, SGTM. I think we can have all workers use the latest session key
> present in the config topic (if there is one) to sign their requests; since
> the signatures are passed via header, this shouldn't cause any problems if
> a follower signs a request that gets sent to a leader that isn't performing
> request validation.
>
>
> RE Magesh's comments:
>
> > Thanks a lot for the KIP. This will certainly be a useful feature. I
> would
> have preferred to use the topic approach as well but I also understand your
> point of view about the operational complexity for upgrades. If not with
> this KIP, I would certainly want to go that route at some point in the
> future.
>
> I'm actually a little conflicted on this front. If this KIP passes and we
> have a secure way to perform quick, synchronous, follower-to-leader
> communication, it may be better to hold onto and even perhaps expand on in
> case this functionality comes in handy later. Additionally, there would
> have to be a lot of thought put into the semantics of using topic-based
> communication for relaying task configs to the leader due to its
> asynchronous nature; I'm not convinced it's not worth it, but I'm also not
> entirely convinced it is.
>
> > As far as using the rebalance protocol goes, it would be great if you
> could
> elaborate on what exactly would be the rebalance impact when a key expires.
> I see that you have called out saying that there should be no significant
> impact but it will be great to explicitly state what is to be expected.
> I would prefer to not have any sorts of rebalancing when this happens since
> the connector and task assignments should not change with this event. It
> will be useful to explain this a bit more.
>
> Given Konstantine's comments, this question isn't strictly relevant
> anymore (unless we decide to return to the approach of distributing session
> keys during rebalance), but just for completeness it does seem worth
> addressing. The "rebalances" proposed for key rotation would be achieved by
> a request from the leader to rejoin the group. This would immediately
> result in the leader being asked to assign connectors and 

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-08-28 Thread Chris Egerton
Hi all,

Wow, thanks for all the feedback! Happy to see this proposal getting some
love :)


RE Konstantine's comments:

> I've read the discussions regarding why the rebalancing protocol is used
here and your intention to follow the approach which was recently used in
order to elegantly support upgrades without requiring rolling restarts
makes sense.
> However, I'd like to revisit the proposed extension of the connect
protocol
going a little bit more in detail.
Indeed, broadcasting the session key to the followers is necessary. But
adding it to the configs topic with a new key is compatible with previous
versions (the key will be ignored by workers that don't support this
protocol) and works since all workers will need to read this topic to the
end to remain in the group.
> Additionally, to your point about consensus, meaning how the workers all
know during the same generation that the "sessioned" version of the
protocol was chosen by the leader, it seems to me that an explicit upgrade
of the protocol on the wire, on its name and on the metadata that are sent
with the join request by each worker might not be required. What I believe
would suffice is merely an increment of the version of the protocol,
because all the other information remains the same (for example, the list
of assigned and revoked tasks, etc). This would allow us not to extend the
metadata even more with yet another JoinGroupRequest and could achieve what
you want in terms of an orchestrated upgrade.

I really like this idea, thanks for suggesting it. A simple bump of the
protocol version does seem sufficient to achieve consensus among workers
and would significantly reduce the implementation complexity of an entirely
separate protocol just for the sake of distributing keys, and using the
config topic to relay keys instead of the group coordination protocol would
prevent rebalances solely for the sake of key rotation (instead, the leader
can just periodically write a new key to the config topic). I'll update the
KIP by EOD today with these changes.
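The "leader periodically writes a new key to the config topic" idea reduces to a TTL check plus key generation. A small sketch under assumed names (the method names, the one-hour interval, and hard-coded "HmacSHA256" are all illustrative):

```java
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class SessionKeyRotation {

    // TTL check mirroring the proposed key rotation interval config.
    static boolean shouldRotate(long lastRotationMs, long nowMs, long ttlMs) {
        return nowMs - lastRotationMs >= ttlMs;
    }

    // Replacement key the leader would write to the config topic under a
    // new record key, to be picked up by all followers reading that topic.
    static SecretKey newSessionKey(String algorithm) throws Exception {
        return KeyGenerator.getInstance(algorithm).generateKey();
    }

    public static void main(String[] args) throws Exception {
        long ttlMs = 3_600_000L; // e.g. one hour
        if (shouldRotate(0L, ttlMs, ttlMs)) {
            SecretKey next = newSessionKey("HmacSHA256");
            System.out.println("rotating to new " + next.getAlgorithm() + " key");
        }
    }
}
```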

> All the leader worker would have to check is whether all the assignments
that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if
you'd choose to go this route).

I like this idea for dictating whether the leader requires request signing,
SGTM. I think we can have all workers use the latest session key present in
the config topic (if there is one) to sign their requests; since the
signatures are passed via header, this shouldn't cause any problems if a
follower signs a request that gets sent to a leader that isn't performing
request validation.


RE Magesh's comments:

> Thanks a lot for the KIP. This will certainly be a useful feature. I would
have preferred to use the topic approach as well but I also understand your
point of view about the operational complexity for upgrades. If not with
this KIP, I would certainly want to go that route at some point in the
future.

I'm actually a little conflicted on this front. If this KIP passes and we
have a secure way to perform quick, synchronous, follower-to-leader
communication, it may be better to hold onto and even perhaps expand on in
case this functionality comes in handy later. Additionally, there would
have to be a lot of thought put into the semantics of using topic-based
communication for relaying task configs to the leader due to its
asynchronous nature; I'm not convinced it's not worth it, but I'm also not
entirely convinced it is.

> As far as using the rebalance protocol goes, it would be great if you
could
elaborate on what exactly would be the rebalance impact when a key expires.
I see that you have called out saying that there should be no significant
impact but it will be great to explicitly state what is to be expected.
I would prefer to not have any sorts of rebalancing when this happens since
the connector and task assignments should not change with this event. It
will be useful to explain this a bit more.

Given Konstantine's comments, this question isn't strictly relevant anymore
(unless we decide to return to the approach of distributing session keys
during rebalance), but just for completeness it does seem worth addressing.
The "rebalances" proposed for key rotation would be achieved by a request
from the leader to rejoin the group. This would immediately result in the
leader being asked to assign connectors and tasks to the group, which would
exactly match the existing assignments, so with that and the help of the
new incremental cooperative rebalance semantics there would be zero
downtime for connectors or tasks. Truly, "rebalance" is a bit of a misnomer
here; we're really talking about performing group assignment (which doesn't
necessarily imply adding or removing connectors/tasks to/from any workers),
but as far as I've seen the two are basically used synonymously w/r/t
Connect so I elected to use the more common, albeit slightly less accurate,
term.


RE Greg's comments:

> Does this roll-out 

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-08-28 Thread Greg Harris
Hey Chris,

The KIP makes sense to me, and I think it's a very natural way to solve the
issue at hand. I had two questions about the automatic rollout logic.

Does this roll-out ratchet the protocol version irreversibly? What is the
expected behavior when the feature has been enabled, and a worker joins the
group while advertising an old protocol? I think it's reasonable to prevent
single-worker downgrades, since this is an attack vector. But what happens
if every worker in a cluster goes down, and at least the first worker comes
back with an old protocol? This is a potential attack that requires access
to the workers, but could also be used by customers to intentionally
downgrade their cluster.

You mentioned that this security improvement only adds security when there
are a number of existing security changes in place, one of these being ACLs
on the Kafka broker. Do you plan to gate this feature roll-out on detecting
that those prerequisite features are enabled? If not, I think this feature
adds no additional security to unsecured clusters, while still incurring
the overhead of signing requests. Do you know how significant the overhead
will be, as a rough estimate?
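On the rough-estimate question: the per-request cost is essentially one HMAC computation, which a crude micro-benchmark like the one below can bound (results vary with JVM, hardware, and payload size, and this is in no way a rigorous benchmark):

```java
import javax.crypto.KeyGenerator;
import javax.crypto.Mac;
import javax.crypto.SecretKey;

public class SigningOverhead {

    // Very rough estimate of per-request HMAC signing cost, in microseconds.
    static double avgMicrosPerSignature(int iterations, byte[] payload) throws Exception {
        SecretKey key = KeyGenerator.getInstance("HmacSHA256").generateKey();
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(key); // doFinal() resets the Mac, so it can be reused below
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            mac.doFinal(payload);
        }
        return (System.nanoTime() - start) / 1_000.0 / iterations;
    }

    public static void main(String[] args) throws Exception {
        byte[] body = new byte[4096]; // a modestly sized task-config request body
        System.out.printf("~%.2f us per signature%n", avgMicrosPerSignature(10_000, body));
    }
}
```

On typical hardware this lands well under the cost of the HTTP round trip the signature rides on, which suggests the overhead is negligible relative to the request itself.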

Looks great!

Thanks,
Greg


On Tue, Aug 27, 2019 at 8:38 PM Konstantine Karantasis <
konstant...@confluent.io> wrote:

> Thanks for the KIP Chris!
>
> This proposal will fill a gap in Kafka Connect deployments and will
> strengthen their production readiness in even more diverse environments, in
> which current workarounds are less practical.
>
> I've read the discussions regarding why the rebalancing protocol is used
> here and your intention to follow the approach which was recently used in
> order to elegantly support upgrades without requiring rolling restarts
> makes sense.
>
> However, I'd like to revisit the proposed extension of the connect protocol
> going a little bit more in detail.
> Indeed, broadcasting the session key to the followers is necessary. But
> adding it to the configs topic with a new key is compatible with previous
> versions (the key will be ignored by workers that don't support this
> protocol) and works since all workers will need to read this topic to the
> end to remain in the group.
>
> Additionally, to your point about consensus, meaning how the workers all
> know during the same generation that the "sessioned" version of the
> protocol was chosen by the leader, it seems to me that an explicit upgrade
> of the protocol on the wire, on its name and on the metadata that are sent
> with the join request by each worker might not be required. What I believe
> would suffice is merely an increment of the version of the protocol,
> because all the other information remains the same (for example, the list
> of assigned and revoked tasks, etc). This would allow us not to extend the
> metadata even more with yet another JoinGroupRequest and could achieve what
> you want in terms of an orchestrated upgrade.
>
> Without having exhaustively checked the various code paths, I feel that
> such an approach would be less heavy-handed in terms of extending the
> format of the connect protocol and would achieve a certain separation of
> concerns between the information that is required for rebalancing and is
> included in the protocol metadata and the information that is required for
> securing the internal Connect REST endpoint. In theory, this method could
> be used to even secure the eager version of the protocol, but I'd agree
> that this is out of the scope of the current proposal.
>
> All the leader worker would have to check is whether all the assignments
> that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if
> you'd choose to go this route).
>
> Overall, this does not differ a lot from your current proposal, which I
> think is already in the right direction.
> Let me know what you think.
>
> Cheers,
> Konstantine
>
>
> On Tue, Aug 27, 2019 at 4:19 PM Magesh Nandakumar 
> wrote:
>
> > Hi Chris,
> >
> > Thanks a lot for the KIP. This will certainly be a useful feature. I
> would
> > have preferred to use the topic approach as well but I also understand
> your
> > point of view about the operational complexity for upgrades. If not with
> > this KIP, I would certainly want to go that route at some point in the
> > future.
> >
> > As far as using the rebalance protocol goes, it would be great if you
> could
> > elaborate on what exactly would be the rebalance impact when a key
> expires.
> > I see that you have called out saying that there should be no significant
> impact but it will be great to explicitly state what is to be
> expected.
> > I would prefer to not have any sorts of rebalancing when this happens
> since
> > the connector and task assignments should not change with this event. It
> > will be useful to explain this a bit more.
> >
> > Thanks,
> > Magesh
> >
> > On Wed, Aug 21, 2019 at 4:40 PM Chris Egerton 
> wrote:
> >
> > > Hi all,
> > >
> > > I've made some tweaks to the KIP that I 

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-08-27 Thread Konstantine Karantasis
Thanks for the KIP Chris!

This proposal will fill a gap in Kafka Connect deployments and will
strengthen their production readiness in even more diverse environments, in
which current workarounds are less practical.

I've read the discussions regarding why the rebalancing protocol is used
here and your intention to follow the approach which was recently used in
order to elegantly support upgrades without requiring rolling restarts
makes sense.

However, I'd like to revisit the proposed extension of the connect protocol
going a little bit more in detail.
Indeed, broadcasting the session key to the followers is necessary. But
adding it to the configs topic with a new key is compatible with previous
versions (the key will be ignored by workers that don't support this
protocol) and works since all workers will need to read this topic to the
end to remain in the group.

Additionally, to your point about consensus, meaning how the workers all
know during the same generation that the "sessioned" version of the
protocol was chosen by the leader, it seems to me that an explicit upgrade
of the protocol on the wire, on its name and on the metadata that are sent
with the join request by each worker might not be required. What I believe
would suffice is merely an increment of the version of the protocol,
because all the other information remains the same (for example, the list
of assigned and revoked tasks, etc). This would allow us not to extend the
metadata even more with yet another JoinGroupRequest and could achieve what
you want in terms of an orchestrated upgrade.

Without having exhaustively checked the various code paths, I feel that
such an approach would be less heavy-handed in terms of extending the
format of the connect protocol and would achieve a certain separation of
concerns between the information that is required for rebalancing and is
included in the protocol metadata and the information that is required for
securing the internal Connect REST endpoint. In theory, this method could
be used to even secure the eager version of the protocol, but I'd agree
that this is out of the scope of the current proposal.

All the leader worker would have to check is whether all the assignments
that it took were in "sessioned" version (possibly CONNECT_PROTOCOL_V2=2 if
you'd choose to go this route).

Overall, this does not differ a lot from your current proposal, which I
think is already in the right direction.
Let me know what you think.

Cheers,
Konstantine


On Tue, Aug 27, 2019 at 4:19 PM Magesh Nandakumar 
wrote:

> Hi Chris,
>
> Thanks a lot for the KIP. This will certainly be a useful feature. I would
> have preferred to use the topic approach as well but I also understand your
> point of view about the operational complexity for upgrades. If not with
> this KIP, I would certainly want to go that route at some point in the
> future.
>
> As far as using the rebalance protocol goes, it would be great if you could
> elaborate on what exactly would be the rebalance impact when a key expires.
> I see that you have called out saying that there should be no significant
> impact but it will be great to explicitly state what is to be expected.
> I would prefer to not have any sorts of rebalancing when this happens since
> the connector and task assignments should not change with this event. It
> will be useful to explain this a bit more.
>
> Thanks,
> Magesh
>
> On Wed, Aug 21, 2019 at 4:40 PM Chris Egerton  wrote:
>
> > Hi all,
> >
> > I've made some tweaks to the KIP that I believe are improvements. More
> > detail can be found on the KIP page itself, but as a brief summary, the
> > three changes are:
> >
> > - The removal of the internal.request.verification property in favor of
> > modifying the default value for the connect.protocol property from
> > "compatible" to "sessioned"
> > - The renaming of some configurations to use better terminology (mostly
> > just "request" instead of "key" where appropriate)
> > - The addition of two new configurations that dictate how session keys
> are
> > to be generated
> >
> > Thanks Ryanne for the feedback so far, hope to hear from some more of you
> > soon :)
> >
> > Cheers,
> >
> > Chris
> >
> > On Mon, Aug 19, 2019 at 12:02 PM Chris Egerton 
> > wrote:
> >
> > > Hi Ryanne,
> > >
> > > The reasoning for this is provided in the KIP: "There would be no clear
> > > way to achieve consensus amongst the workers in a cluster on whether to
> > > switch to this new behavior." To elaborate on this--it would be bad if
> a
> > > follower worker began writing task configurations to the topic while
> the
> > > leader continued to only listen on the internal REST endpoint. It's
> > > necessary to ensure that every worker in a cluster supports this new
> > > behavior before switching it on.
> > >
> > > To be fair, it is technically possible to achieve consensus without
> using
> > > the group coordination protocol by adding new configurations and using
> a
> > > multi-phase rolling 

Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-08-27 Thread Magesh Nandakumar
Hi Chris,

Thanks a lot for the KIP. This will certainly be a useful feature. I would
have preferred to use the topic approach as well but I also understand your
point of view about the operational complexity for upgrades. If not with
this KIP, I would certainly want to go that route at some point in the
future.

As far as using the rebalance protocol goes, it would be great if you could
elaborate on what exactly would be the rebalance impact when a key expires.
I see that you have called out saying that there should be no significant
impact but it will be great to explicitly state what is to be expected.
I would prefer to not have any sorts of rebalancing when this happens since
the connector and task assignments should not change with this event. It
will be useful to explain this a bit more.

Thanks,
Magesh


Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-08-21 Thread Chris Egerton
Hi all,

I've made some tweaks to the KIP that I believe are improvements. More
detail can be found on the KIP page itself, but as a brief summary, the
three changes are:

- The removal of the internal.request.verification property in favor of
modifying the default value for the connect.protocol property from
"compatible" to "sessioned"
- The renaming of some configurations to use better terminology (mostly
just "request" instead of "key" where appropriate)
- The addition of two new configurations that dictate how session keys are
to be generated
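
As a rough illustration of what session key generation via the JCA could look like under configurations along these lines (the config names in the comments are placeholders, not the KIP's final property names):

```java
import javax.crypto.KeyGenerator;
import javax.crypto.Mac;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;

public class SessionKeySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical values for the proposed configurations; both the
        // names and the defaults here are illustrative only.
        String keyAlgorithm = "HmacSHA256"; // e.g. a key generation algorithm config
        int keySizeBits = 256;              // e.g. a key size config

        KeyGenerator keyGen = KeyGenerator.getInstance(keyAlgorithm);
        keyGen.init(keySizeBits);
        SecretKey sessionKey = keyGen.generateKey();

        // The MAC algorithm is chosen independently of the key generation
        // algorithm; with some providers the two must be compatible.
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(sessionKey);
        byte[] signature = mac.doFinal("request body".getBytes(StandardCharsets.UTF_8));

        System.out.println("key algorithm: " + sessionKey.getAlgorithm());
        System.out.println("signature length: " + signature.length);
    }
}
```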

Thanks Ryanne for the feedback so far, hope to hear from some more of you
soon :)

Cheers,

Chris


Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-08-19 Thread Chris Egerton
Hi Ryanne,

The reasoning for this is provided in the KIP: "There would be no clear way
to achieve consensus amongst the workers in a cluster on whether to switch
to this new behavior." To elaborate on this--it would be bad if a follower
worker began writing task configurations to the topic while the leader
continued to only listen on the internal REST endpoint. It's necessary to
ensure that every worker in a cluster supports this new behavior before
switching it on.

To be fair, it is technically possible to achieve consensus without using
the group coordination protocol by adding new configurations and using a
multi-phase rolling upgrade, but that approach would significantly
complicate things for the default case of just
wanting to bump your Connect cluster to the next version without having to
alter your existing worker config files and with only a single restart per
worker. The proposed approach provides a much simpler user experience for
the most common use case and doesn't impose much additional complexity for
other use cases.

I've updated the KIP to expand on points from your last emails; let me know
what you think.

Cheers,

Chris



Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-08-15 Thread Ryanne Dolan
Thanks Chris, that makes sense.

I know you have already considered this, but I'm not convinced we should
rule out using Kafka topics for this purpose. That would enable the same
level of security without introducing any new authentication or
authorization mechanisms (your keys). And as you say, you'd need to lock
down Connect's topics and groups anyway.

Can you explain what you mean when you say using Kafka topics would require
"reworking the group coordination protocol"? I don't see how these are
related. Why would it matter if the workers sent Kafka messages vs POST
requests among themselves?

Ryanne



Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-08-15 Thread Chris Egerton
Hi Ryanne,

Yes, if the Connect group is left unsecured then that is a potential
vulnerability. However, in a truly secure Connect cluster, the group would
need to be secured anyways to prevent attackers from joining the group with
the intent to either snoop on connector/task configurations or bring the
cluster to a halt by spamming the group with membership requests and then
not running the assigned connectors/tasks. Additionally, for a Connect
cluster to be secure, access to internal topics (for configs, offsets, and
statuses) would also need to be restricted so that attackers could not,
e.g., write arbitrary connector/task configurations to the configs topic.
This is all currently possible in Kafka with the use of ACLs.
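
As a rough illustration (not part of the KIP), locking down the Connect group and the config topic with ACLs could look something like the following; the principal, group, and topic names are examples only and should be replaced with your cluster's actual worker principal and internal topic names:

```shell
# Allow only the worker principal to join the Connect group
bin/kafka-acls.sh --bootstrap-server localhost:9092 \
  --add --allow-principal User:connect-worker \
  --operation Read --group connect-cluster

# Restrict the config topic so only the workers can read and write it
bin/kafka-acls.sh --bootstrap-server localhost:9092 \
  --add --allow-principal User:connect-worker \
  --operation Read --operation Write --topic connect-configs
```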

I think the bottom line here is that there are a number of steps that need to
be taken to effectively lock down a Connect cluster; the point of this KIP
is to close a loophole that exists even after all of those steps are taken,
not to completely secure this one vulnerability even when no other security
measures are taken.

Cheers,

Chris



Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-08-14 Thread Ryanne Dolan
Chris, I don't understand how the rebalance protocol can be used to give
out session tokens in a secure way. It seems that any attacker could just
join the group and sign requests with the provided token. Am I missing
something?

Ryanne
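
For context, the kind of session-key signing under discussion can be sketched as follows. This is illustrative only (the key material, payload, and method names are made up), but it shows that producing a valid signature requires possession of the session key, which is only distributed to authenticated group members:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RequestSigningSketch {
    // Sign a request body with the shared session key
    static byte[] sign(byte[] key, String body) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(body.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // In the proposal, a key like this would be distributed to workers
        // through the group coordination protocol, so anyone outside the
        // (secured) group never sees it.
        byte[] sessionKey = "example-session-key-not-for-real-use"
                .getBytes(StandardCharsets.UTF_8);

        String body = "{\"task\":0,\"config\":{}}";
        byte[] signature = sign(sessionKey, body);

        // The leader recomputes the MAC and compares in constant time
        boolean valid = MessageDigest.isEqual(signature, sign(sessionKey, body));
        boolean forged = MessageDigest.isEqual(signature, sign(sessionKey, body + "x"));

        System.out.println("signature valid: " + valid);
        System.out.println("tampered request accepted: " + forged);
    }
}
```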



Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints

2019-08-14 Thread Chris Egerton
The KIP page can be found at
https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints,
by the way. Apologies for neglecting to include it in my initial email!

On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton  wrote:

> Hi all,
>
> I'd like to start discussion on a KIP to secure the internal "POST
> > > /connectors/<connector>/tasks" endpoint for the Connect framework. The
> changes address a vulnerability in the framework in its current state that
> allows malicious users to write arbitrary task configurations for
> connectors; it is vital that this issue be addressed in order for any
> Connect cluster to be secure.
>
> Looking forward to your thoughts,
>
> Chris
>