Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-18 Thread Tolbert, Andy
Given what I think initially motivated this hot reloading
capability, a big win it provides is avoiding having to
bounce your cluster as your certificates near expiry.  If not watched
closely you can get yourself into a state where the cert on every
node in the cluster has expired, which is effectively an outage.

I see the appeal of draining connections on a change of trust,
although the necessity of being able to "do it live" (as opposed to
doing a bounce) seems less important than avoiding the outage
condition of your certificates expiring, especially since you can sort
of already do this without bouncing by toggling nodetool
disablebinary/enablebinary.  I agree with Dinesh that most operators
would prefer that a reload not interrupt connections, as that can
be disruptive to applications if they don't have retries configured,
but I also agree it'd be a nice improvement to support draining
existing connections in some way.

+1 on the idea of having a "timed connection" capability brought up
here, and implementing it in a way such that connection lifetimes can
be dynamically adjusted.  That way, on a trust store change Cassandra
could simply adjust the connection lifetimes, so that connections are
either disconnected immediately or drained over a time period like
Josh proposed.
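
To make the "timed connection" idea concrete, here is a minimal sketch
of per-connection deadlines that can be adjusted at runtime.  It is not
Cassandra code: LifetimeTracker, its methods and the drain_period
parameter are hypothetical names chosen for illustration, and time is
passed in as plain seconds to keep the example deterministic.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LifetimeTracker:
    """Tracks a close-by deadline per open connection (hypothetical helper)."""
    max_lifetime: Optional[float] = None  # seconds; None means unbounded
    _opened: Dict[str, float] = field(default_factory=dict, init=False)
    _deadline: Dict[str, float] = field(default_factory=dict, init=False)

    def register(self, conn_id: str, now: float) -> None:
        """Record a new connection and, if bounded, its deadline."""
        self._opened[conn_id] = now
        if self.max_lifetime is not None:
            self._deadline[conn_id] = now + self.max_lifetime

    def set_max_lifetime(self, seconds: Optional[float]) -> None:
        """Change the lifetime at runtime and recompute every deadline."""
        self.max_lifetime = seconds
        self._deadline.clear()
        if seconds is not None:
            for conn_id, opened in self._opened.items():
                self._deadline[conn_id] = opened + seconds

    def on_trust_change(self, now: float, drain_period: float = 0.0,
                        rng: Optional[random.Random] = None) -> None:
        """Pull deadlines in: immediately (drain_period=0) or spread at
        random over drain_period. Never extends an existing deadline."""
        rng = rng or random.Random()
        for conn_id in self._opened:
            candidate = now + rng.uniform(0.0, drain_period)
            old = self._deadline.get(conn_id)
            self._deadline[conn_id] = candidate if old is None else min(old, candidate)

    def due(self, now: float) -> List[str]:
        """Connections whose deadline has passed and should be closed."""
        return sorted(c for c, d in self._deadline.items() if d <= now)

    def close(self, conn_id: str) -> None:
        """Forget a connection once it has been closed."""
        self._opened.pop(conn_id, None)
        self._deadline.pop(conn_id, None)
```

Spreading the deadlines over the drain period, rather than using one
shared cutoff, is one way to avoid every client reconnecting at the same
moment.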

Thanks,
Andy


Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-18 Thread Josh McKenzie
I think it's all part of the same issue and you're not derailing IMO Abe. For 
the user Pabbireddy here, the unexpected behavior was not closing internode 
connections on that keystore refresh. So ISTM, from a "featureset that would be 
nice to have here" perspective, we could theoretically provide:
 1. An option to disconnect all connections on cert update, disabled by default
 2. An option to drain and recycle connections on a time period, also disabled 
by default
Leave the current behavior in place but allow for these kinds of strong 
cert-guarantees if a user needs it in their env.
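
As a rough sketch of how those two opt-in settings might be modelled, 
assuming nothing about how Cassandra would actually expose them: 
CertReloadPolicy and both of its field names are hypothetical and are not 
existing cassandra.yaml options.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CertReloadPolicy:
    # Hypothetical settings; neither is an existing Cassandra option.
    disconnect_on_cert_update: bool = False          # option 1, off by default
    recycle_period_seconds: Optional[float] = None   # option 2, off by default


def on_cert_update(policy: CertReloadPolicy) -> str:
    """What to do with established connections when the stores are reloaded."""
    return "disconnect_all" if policy.disconnect_on_cert_update else "keep"


def should_recycle(policy: CertReloadPolicy, connection_age_seconds: float) -> bool:
    """True once a connection has outlived the configured recycle period."""
    period = policy.recycle_period_seconds
    return period is not None and connection_age_seconds >= period
```

With the defaults, on_cert_update returns "keep" and should_recycle is 
always False, which matches the current behavior of leaving established 
connections alone.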

On Mon, Apr 15, 2024, at 9:51 PM, Abe Ratnofsky wrote:
> Not to derail from the original conversation too far, but wanted to agree 
> that maximum connection establishment time on native transport would be 
> useful. That would provide a maximum duration before an updated client 
> keystore is used for connections, which can be used to safely roll out client 
> keystore updates.
> 
> For example, if the maximum connection establishment time is 12 hours, then 
> you can update the keystore on a canary client, wait 24 hours, confirm that 
> connectivity is maintained, then upgrade keystores across the rest of the 
> fleet.
> 
> With unbounded connection establishment, reconnection isn't tested as often 
> and issues can hide behind long-lived connections.
> 
>> On Apr 15, 2024, at 5:14 PM, Jeff Jirsa wrote:
>> 
>> It seems like if folks really want the life of a connection to be finite 
>> (either client/server or server/server), adding in an option to quietly 
>> drain and recycle a connection on some period isn’t that difficult.
>> 
>> That type of requirement shows up in a number of environments, usually on 
>> interactive logins (cqlsh, login, walk away, the connection needs to become 
>> invalid in a short and finite period of time), but adding it to internode 
>> could also be done, and may help in some weird situations (if you changed 
>> certs because you believe a key/cert is compromised, having the connection 
>> remain active is decidedly inconvenient, so maybe it does make sense to add 
>> an expiration timer/condition on each connection).
>> 
>> 
>> 
>>> On Apr 15, 2024, at 12:28 PM, Dinesh Joshi wrote:
>>> 
>>> In addition to what Andy mentioned, I want to point out that for the vast 
>>> majority of use-cases, we would like to _avoid_ interruptions when a 
>>> certificate is updated so it is by design. If you're dealing with a 
>>> situation where you want to ensure that the connections are cycled, you can 
>>> follow Andy's advice. It will require automation outside the database that 
>>> you might already have. If there is demand, we can consider adding a 
>>> feature to slowly cycle the connections so the old SSL context is not used 
>>> anymore.
>>> 
>>> One more thing you should bear in mind is that Cassandra will not load the 
>>> new SSL context if it cannot successfully initialize it. This is again by 
>>> design to prevent an outage when the updated truststore is corrupted or 
>>> could not be read in some way.
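
The "only swap in a context that initializes cleanly" pattern described 
above can be illustrated with Python's standard ssl module. This is an 
analogy only, not Cassandra's Java implementation; ReloadableContext is a 
hypothetical name.

```python
import ssl
from typing import Optional


class ReloadableContext:
    """Holds the active SSLContext; replaces it only if the new one builds."""

    def __init__(self) -> None:
        self.current: Optional[ssl.SSLContext] = None

    def reload(self, cafile: str) -> bool:
        """Try to build a context from cafile. On any failure (missing or
        unreadable file, unparsable certificates) keep the old context."""
        try:
            candidate = ssl.create_default_context(cafile=cafile)
        except (ssl.SSLError, OSError):
            return False  # previous context stays in use
        self.current = candidate
        return True
```

Because the candidate is fully built before the assignment, a bad file 
never leaves the holder without a usable context.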
>>> 
>>> thanks,
>>> Dinesh
>>> 
>>> On Mon, Apr 15, 2024 at 9:45 AM Tolbert, Andy wrote:
 I should mention, when toggling disablebinary/enablebinary across
 instances, you will probably want to leave some time between nodes
 so connections can reestablish, and you will want to verify that the
 connections can actually reestablish.  You also need to be mindful of
 this being disruptive to inflight queries (if your client is
 configured for retries it will probably be fine).  Semantically to
 your applications it should look a lot like a rolling cluster bounce.
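
 A minimal sketch of that rolling toggle, assuming nodetool is on PATH
 and each node's JMX port is reachable from where the script runs. The
 verify callback, the pause length and the function names are
 hypothetical; how you confirm that clients have reconnected depends on
 your own monitoring.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def run_nodetool(host: str, command: str) -> None:
    # Assumption: nodetool is on PATH and can reach the node over JMX.
    subprocess.run(["nodetool", "-h", host, command], check=True)


def rolling_binary_toggle(
    hosts: Iterable[str],
    verify: Callable[[str], bool],
    pause_seconds: float = 60.0,
    run: Callable[[str, str], None] = run_nodetool,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Toggle native transport one node at a time. Stops at the first node
    where verify() reports that clients did not come back. Returns the
    hosts that completed successfully."""
    done: List[str] = []
    for host in hosts:
        run(host, "disablebinary")
        run(host, "enablebinary")
        sleep(pause_seconds)  # give clients time to reestablish
        if not verify(host):
            raise RuntimeError(f"clients did not reconnect to {host}")
        done.append(host)
    return done
```

 Stopping at the first failed verification keeps a bad keystore or
 truststore from disconnecting clients on every node.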
 
 Thanks,
 Andy
 
 On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash wrote:
 >
 > Thanks Andy for your reply. We will test the scenario you mentioned.
 >
 > Regards
 > Avinash
 >
 > On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy wrote:
 >>
 >> Hi Avinash,
 >>
 >> As far as I understand it, if the underlying keystore/truststore(s)
 >> Cassandra is configured for are updated, this *will not* provoke
 >> Cassandra to interrupt existing connections, it's just that the new
 >> stores will be used for future TLS initialization.
 >>
 >> Via: 
 >> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
 >>
 >> > When the files are updated, Cassandra will reload them and use them 
 >> > for subsequent connections
 >>
 >> I suppose one could do a rolling disablebinary/enablebinary (if it's
 >> only client connections) after you roll out a keystore/truststore
 >> change as a way of forcing the existing connections to reestablish.
 >>
 >> Thanks,
 >> Andy
 >>
 >>
 >> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash wrote:
 >> >
 >> > Dear Community,
 >> >
 >> > I hope this email