[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428638#comment-16428638 ]

Jason Brown commented on CASSANDRA-13993:
-----------------------------------------

[~iamaleksey] Yup, that's basically what I meant, as well :)

> Add optional startup delay to wait until peers are ready
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-13993
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13993
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Lifecycle
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>            Priority: Minor
>             Fix For: 4.0
>
> When bouncing a node in a large cluster, it can take a while to recognize the
> rest of the cluster as available. This is especially true if using TLS on
> internode messaging connections. The bouncing node (and any clients connected
> to it) may see a series of Unavailable or Timeout exceptions until the node
> is 'warmed up', as connecting to the rest of the cluster is asynchronous from
> the rest of the startup process.
> There are two aspects that drive a node's ability to successfully communicate
> with a peer after a bounce:
> - marking the peer as 'alive' (state that is held in gossip). This affects
> the unavailable exceptions.
> - having both outbound and inbound connections open and ready to each
> peer. This affects timeouts.
> Details of each of these mechanisms are described in the comments below.
> This ticket proposes adding a mechanism, optional and configurable, to delay
> opening the client native protocol port until some percentage of the peers in
> the cluster is marked alive and connected to/from. Thus while we potentially
> slow down startup (delay opening the client port), we reduce the chance
> that queries made by clients hit transient unavailable/timeout
> exceptions.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428607#comment-16428607 ]

Aleksey Yeschenko commented on CASSANDRA-13993:
-----------------------------------------------

[~jasobrown] I don't mean backporting this whole ticket - just the ability to parse {{PING}} messages. We can just discard them once parsed (:
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428604#comment-16428604 ]

Jason Brown commented on CASSANDRA-13993:
-----------------------------------------

On the surface, it seems like backporting Ping-related stuff is more invasive than just skipping some arbitrary bytes in the stream. However, if I understand [~iamaleksey]'s reasoning, skipping some bytes in the stream has larger implications and is essentially a larger behavior change than simply adding a new message. If that's true, then I agree that backporting the Ping message is the behavior-wise best route to go.

I'll go ahead and start working on the changes as discussed.
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428460#comment-16428460 ]

Aleksey Yeschenko commented on CASSANDRA-13993:
-----------------------------------------------

I agree with basically everything you said here - except what we should backport, so:

bq. if we do get an unknown verb id, skip the payload bytes in MessageIn. This leaves the input stream clean to process future messages.

Yes, please, for 4.0+.

bq. Further, I think we can eliminate the whole UNUSED_ verbs thing as that was an incomplete defense against unknown verbs, and it didn't account for message payload.

Yes please. Keep the five we have - or, four, rather, because one will be consumed by {{PING}} - and I'd still say let it be {{UNUSED_4}} or 5, but don't introduce any more in 4.0, or after 4.0. We will reclaim the existing ones eventually as we EOL older releases.

bq. backport part of CASSANDRA-13283 to get the Verb from a map, not an index array offset. This gives us safety for future-proofing against unknown verbs.

Not a bad idea, but we should probably be a bit more conservative re: what we backport to 3.0, and especially 2.2 at this point. How about, instead, we just backport {{PING}} to 3.11 and 3.0, so in the upgrade scenario there will be no harm to connections?

So, TL;DR, maybe do this?
1. Make 4.0 robust against {{null}} verb and skip remainders of messages we can't parse. There is precedent for it as well, see https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/hints/HintMessage.java#L126-L128
2. Stop introducing new {{UNUSED_}} verbs starting with 4.0.
3. Backport {{PING}} to 3.0 and 3.11, so upgraders from recent 3.0 and 3.11 with the fix will have a smoother experience when going to 4.0.
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428361#comment-16428361 ]

Jason Brown commented on CASSANDRA-13993:
-----------------------------------------

Responding to @aleksey's comments out of order, but hopefully it makes sense at the end.

bq. we should handle ordinals that are outside of our known range robustly instead.

Yeah, I think this is where we should get to. In {{MessageIn#read()}}, we read the verb id from the stream, and then fetch the {{Verb}} instance. In pre-4.0, we literally index into the {{Verb[]}} in {{MessagingService}}, so any unknown {{Verb}}s would blow up there with an ArrayIndexOutOfBoundsException. With CASSANDRA-13283, committed on trunk, we are more intelligently resistant to unknown {{Verb}}s, and would just get a null {{Verb}}.

Unfortunately, trunk would still have problems with an unknown {{Verb}} as it would not know how to deserialize the message (pre-4.0, of course, suffers the same problem). It just reads the basic header data, and passes it down, where [it would be dropped by {{MessageDeliveryTask}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessageDeliveryTask.java#L66]. Unfortunately, if the message had more bytes in the stream which we didn't try to deserialize, trying to read the next message on the connection would fail spectacularly.

It's easy enough to avoid that, though, as we [already know the {{payloadSize}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessageIn.java#L146], so we can easily skip over the payload, and leave the incoming stream in a clean state after we account for the unknown message. Note: {{payloadSize}} is required by the internode messaging protocol, so we are sure to have the payload size. Thus, we can just safely skip the stream forward when we don't know how to deserialize the message, send it forward, and just discard it at {{MessageDeliveryTask}}.

bq. So I was thinking about a major upgrade bounce scenario. Think the first ever node to upgrade to 4.0 in a cluster of 3.0 nodes - will send out pings to every node, but receive no pongs, correct? So every node until a threshold will have a significantly longer bounce. Do we care about this case?

As the {{PingMessage}} contains a one-byte payload, it would leave the stream in a bad (unconsumed) state. This is a bug for the upgrade scenario. It's not a terrible bug, but it will cause the connection that we tried to eagerly create (to the un-upgraded peer) to be thrown away, as it will fail on the next message on the connection. See proposal at the end.

bq. As implemented currently, we are going to send PINGs potentially to 3.11/3.0 - unless we switch to gating by version, which we do sometimes.

So here's the rub: we don't necessarily know the peer's version yet. The ping messages are sent on the large/small connections, but we're not guaranteed that at least one round of gossip has completed wherein we would learn the version of the peers (we're still in the startup process). The un-upgraded node won't know how to respond to the unknown {{Verb}}, which is acceptable, but we shouldn't leave the stream on that connection in a broken state (see above).

Proposal:
- backport part of CASSANDRA-13283 to get the {{Verb}} from a map, not an index array offset. This gives us safety for future-proofing against unknown verbs.
- if we do get an unknown verb id, skip the payload bytes in {{MessageIn}}. This leaves the input stream clean to process future messages.
- Further, I think we can eliminate the whole {{UNUSED_}} verbs thing as that was an incomplete defense against unknown verbs, and it didn't account for message payload.

I think if we backport this to at least 3.0 (maybe 2.2?) that should be sufficient for future-proofing against unknown messages. If this sounds reasonable, I'll open a separate ticket for that work.
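For readers following along, the "look up the verb in a map, and skip the payload when the id is unknown" idea can be sketched roughly as below. This is only an illustration under stated assumptions: the framing ({{int verbId}}, {{int payloadSize}}, payload bytes), the class {{UnknownVerbSkipSketch}}, the two-entry verb table, and the id 99 are all hypothetical and are not Cassandra's actual wire format or its {{MessageIn}} code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified framing: [int verbId][int payloadSize][payload bytes].
// This is NOT Cassandra's real wire format or its MessageIn class.
class UnknownVerbSkipSketch {
    enum Verb { MUTATION, READ }

    // A map lookup yields null for an unknown id, where indexing into an
    // array would throw ArrayIndexOutOfBoundsException.
    static final Map<Integer, Verb> VERBS_BY_ID = new HashMap<>();
    static {
        VERBS_BY_ID.put(0, Verb.MUTATION);
        VERBS_BY_ID.put(1, Verb.READ);
    }

    // Reads messageCount framed messages and reports what happened to each.
    static List<String> readAll(DataInputStream in, int messageCount) throws IOException {
        List<String> handled = new ArrayList<>();
        for (int i = 0; i < messageCount; i++) {
            int verbId = in.readInt();
            int payloadSize = in.readInt();
            Verb verb = VERBS_BY_ID.get(verbId);
            if (verb == null) {
                // Unknown verb: consume exactly payloadSize bytes so the next
                // message starts on a clean boundary, then drop the message.
                skipFully(in, payloadSize);
                handled.add("DROPPED(" + verbId + ")");
                continue;
            }
            byte[] payload = new byte[payloadSize];
            in.readFully(payload);
            handled.add(verb + ":" + payloadSize);
        }
        return handled;
    }

    // skipBytes may skip fewer bytes than requested, so loop until done.
    static void skipFully(DataInputStream in, int n) throws IOException {
        int remaining = n;
        while (remaining > 0) {
            int skipped = in.skipBytes(remaining);
            if (skipped <= 0) throw new EOFException("stream ended while skipping payload");
            remaining -= skipped;
        }
    }

    static void write(DataOutputStream out, int verbId, byte[] payload) throws IOException {
        out.writeInt(verbId);
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Three messages on one stream; the middle one has an unknown verb id
    // and a one-byte payload.
    static List<String> demo() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            write(out, 0, new byte[] {1, 2, 3});
            write(out, 99, new byte[] {7});
            write(out, 1, new byte[] {4, 5});
            return readAll(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())), 3);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the message after the unknown one is still parsed correctly, which is the property the proposal is after; without the skip, the leftover payload byte would be misread as part of the next header.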
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427583#comment-16427583 ]

Aleksey Yeschenko commented on CASSANDRA-13993:
-----------------------------------------------

The out-of-range problem, however, feels a bit silly. We shouldn't have padding just to avoid going out of ordinal bounds - we should handle ordinals that are outside of our known range robustly instead.
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427566#comment-16427566 ]

Aleksey Yeschenko commented on CASSANDRA-13993:
-----------------------------------------------

Disregard my last comment here, I was wrong, by a big margin. My apologies.

As implemented currently, we are going to send PINGs potentially to 3.11/3.0 - unless we switch to gating by version, which we do sometimes. And if you pick a verb after {{UNUSED_5}}, it would error out on the 3.11/3.0 side. So, again, unless we gate by version (on which - see below*), we need to pick an ordinal that is within the range of 3.0/3.11 - so one of the {{UNUSED_1..5}} verbs.

The latest still supported release is 2.2, which has only 3 {{UNUSED}} verbs. To be super paranoid and maximise the # of available {{UNUSED}} verbs in case of bad things happening that would force us to introduce new verbs in old versions - which is very unlikely to happen, but did happen before - we should use one of the {{UNUSED_4}} or {{UNUSED_5}} verbs here, in my opinion. But not insert a verb before {{UNUSED_1}} like it is now - it's essentially taking up the {{UNUSED_1}} verb, but confusing things between 4.0 and 3.0/3.11, where everything would slide by one and might introduce mistakes.

* So I was thinking about a major upgrade bounce scenario. Think the first ever node to upgrade to 4.0 in a cluster of 3.0 nodes - will send out pings to every node, but receive no pongs, correct? So every node until a threshold will have a significantly longer bounce. Do we care about this case?
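The ordinal arithmetic behind the placement concern above can be illustrated with a small sketch. The verb lists here are abbreviated and hypothetical (the real lists are much longer), and {{VerbOrdinalSketch}} and {{resolveOnPeer}} are made-up names; the only point is to show what an older receiver would resolve for the ordinal a newer sender puts on the wire.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, abbreviated verb lists; positions stand in for enum ordinals.
class VerbOrdinalSketch {
    // Stand-in for a 3.0/3.11-style list ending in five UNUSED slots.
    static final List<String> OLD = Arrays.asList(
        "MUTATION", "READ", "UNUSED_1", "UNUSED_2", "UNUSED_3", "UNUSED_4", "UNUSED_5");

    // PING inserted before UNUSED_1: it occupies the old UNUSED_1 ordinal and
    // every UNUSED_n slides by one.
    static final List<String> NEW_INSERTED = Arrays.asList(
        "MUTATION", "READ", "PING", "UNUSED_1", "UNUSED_2", "UNUSED_3", "UNUSED_4", "UNUSED_5");

    // PING takes over the UNUSED_5 slot in place: nothing else moves.
    static final List<String> NEW_REUSED = Arrays.asList(
        "MUTATION", "READ", "UNUSED_1", "UNUSED_2", "UNUSED_3", "UNUSED_4", "PING");

    // What the receiver resolves when the sender transmits the ordinal of `verb`.
    static String resolveOnPeer(List<String> sender, List<String> receiver, String verb) {
        int ordinal = sender.indexOf(verb);
        if (ordinal < 0) throw new IllegalArgumentException("sender does not know " + verb);
        return ordinal < receiver.size() ? receiver.get(ordinal) : "OUT_OF_RANGE";
    }

    public static void main(String[] args) {
        System.out.println(resolveOnPeer(NEW_INSERTED, OLD, "PING"));
        System.out.println(resolveOnPeer(NEW_INSERTED, OLD, "UNUSED_5"));
        System.out.println(resolveOnPeer(NEW_REUSED, OLD, "PING"));
    }
}
```

With the insertion variant, the last verb of the newer list falls outside the older list's range and the names no longer line up between versions; with the in-place reuse, every ordinal the newer node can send is still within the older node's range.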
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424314#comment-16424314 ]

Aleksey Yeschenko commented on CASSANDRA-13993:
-----------------------------------------------

So, while the comment is before the {{UNUSED_}} verbs, we should still be doing what the comment says, and add new verbs in the end. In our case - after {{UNUSED_5}}.

Now, it doesn't often happen that things go wrong in a way that forces us to retroactively add new verbs to already released majors, but it does sometimes. Imagine for example there is a bug that causes us to add a new verb to 2.2 and 3.0, to address some issue with reads. Normally we would go and see which unused ranges overlap. In this case, {{UNUSED_1}} to {{UNUSED_3}} could be appropriated. This is why we keep the buffer there. If 4.0 appropriates the slot just before {{UNUSED_1}} - it's essentially taking over the {{UNUSED_1}} spot, reducing that available buffer by 1.

Now, it is unlikely that we are going to need 3 new verbs in 3.11/3.0/2.2, but it's not like extra ordinals are a precious resource. So we might as well stick to the ways of the old, and either
a) move the {{PING}} verb to the end of the list, after {{UNUSED_5}}, or
b) reuse one of the ancient deprecated verbs (we did that at least for hints and batchlog recently).
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390394#comment-16390394 ]

Joseph Lynch commented on CASSANDRA-13993:
------------------------------------------

I cut CASSANDRA-14297 for follow up, will iterate in that ticket.
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16377336#comment-16377336 ]

Joseph Lynch commented on CASSANDRA-13993:
------------------------------------------

{quote}Further, I intentionally wanted this feature to "just work out of the box", without requiring extra configuration (for local vs each dc, and so on).
{quote}

[~jasobrown], I completely agree, and I believe there is a difference in "percent UP" from "count of DOWN" from a usability perspective, in particular "percent UP" is harder (or impossible) for users of the database to set properly (it will do what they want) or consistently (they leave it to the default or if they change it they use one setting everywhere), and the best default I can think of is 100%. Compare this to a "count DOWN" which is more likely to be a constant 1 or 2.

Consider a user who has two multi-region clusters, one that has 12 nodes and one with 120 nodes. Seventy percent is an ok default for the first cluster, but a very bad one in the second and in either case you still have no guarantee that you will not see latency or errors even if you put the timeout at 2 days, and reflecting on it I think {{(percent_up, timeout) = (100%, 10-30s)}} would be the only default that gives users what they expect (restarting their database does not lead to errors). That aggressive setting would have clients doing local CLs waiting on all remote replicas, however, which other than preventing hint replay is a bit wasteful. On the other hand, in both clusters a {{block_for_peers_local_dc=1}} default setting is quite reasonable.

The way that my patch implemented the three options it works out of the box for all deployments (vnodes, no vnodes, large clusters, small clusters, etc) whereas percent up only works well if the user _changes_ the default percentage to 100% or is not using vnodes.
{quote}I'm reticent to tie this new behavior to one of those values as the use cases are different; meaning, if you change the value for one semantic meaning, you alter the other.
{quote}

Ok, that makes sense.

{quote}This is a fair point, and I'd be open to bumping up the default threshold. However, remember that behavior exists already in cassandra (it's what you buy in to when using vnodes); this patch helps to alleviate the unavailables/timeouts, not eliminate nor accentuate them.
{quote}

I agree, this is a great step forward, but with a small change I think this strategy could practically eliminate the unavailables/timeouts. If I implemented the functionality with unit tests in a separate Jira would you consider reviewing it or do you think the slight additional complexity is not worth it? Even separating percentage up by local/remote datacenters would be a big step forward I think, and if we went with counts I could reduce the number of settings to 2 or 1 instead of 3 to give the advanced users less control if you think that would be less confusing for newer users.
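To make the 12-node versus 120-node comparison above concrete, here is a small arithmetic sketch. It assumes the node counts are the number of peers being waited on and that a percentage threshold is rounded up to a whole number of peers; both are assumptions for illustration, and {{StartupThresholdSketch}} and its methods are hypothetical names, not settings or code from the patch.

```java
// Hypothetical helper comparing a "percent UP" threshold with a "count DOWN" one.
class StartupThresholdSketch {
    // Number of peers that must be up to satisfy percentUp, rounded up.
    static int requiredUp(int peers, int percentUp) {
        return (peers * percentUp + 99) / 100;
    }

    // Number of peers that may still be down once a percent threshold is met.
    static int toleratedDownByPercent(int peers, int percentUp) {
        return peers - requiredUp(peers, percentUp);
    }

    public static void main(String[] args) {
        for (int peers : new int[] {12, 120}) {
            System.out.println(peers + " peers at 70%: startup proceeds with up to "
                + toleratedDownByPercent(peers, 70) + " peers down");
        }
    }
}
```

Under these assumptions a 70% threshold lets startup proceed with 3 of 12 peers down but 36 of 120 peers down, whereas a count-based threshold of 1 tolerates one down peer regardless of cluster size.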
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16376945#comment-16376945 ]

Jason Brown commented on CASSANDRA-13993:
-----------------------------------------

[~jolynch] I understand what you are saying. I think the difference between a "percent UP" and a "count of DOWN nodes" isn't that much, so either one is probably fine. Further, I intentionally wanted this feature to "just work out of the box", without requiring extra configuration (for local vs each dc, and so on).

bq. relying on the timeout for large clusters (although it would be awesome if this timeout re-used or defaulted to an existing timeout relevant to gossip convergence such as BROADCAST_INTERVAL or RING_DELAY).

I'm reticent to tie this new behavior to one of those values as the use cases are different; meaning, if you change the value for one semantic meaning, you alter the other.

bq. especially with vnode=256 clusters where any 2 nodes down in different racks essentially guarantees an unavailable error for some intersecting token range.

This is a fair point, and I'd be open to bumping up the default threshold. However, remember that behavior exists already in cassandra (it's what you buy in to when using vnodes); this patch helps to alleviate the unavailables/timeouts, not eliminate nor accentuate them.
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371686#comment-16371686 ]

Ariel Weisberg commented on CASSANDRA-13993:
--------------------------------------------

I am generally +1, except that I would like to see it spin more aggressively when checking whether the responses have come back.

I'm not sure about Joseph's point. This is going to improve the situation just by virtue of priming all the connections, even if it doesn't wait for all of them to complete setup. Nodes that are going to be available might now become available within the timeout budget of subsequent reads and writes. Nodes that aren't available in time might not have become available anyway.
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370969#comment-16370969 ]

Joseph Lynch commented on CASSANDRA-13993:
------------------------------------------

[~jasobrown] this is a great idea and would definitely help with the pain of rolling restarts! I'm curious, though, about the choice to make this a percentage instead of a raw count, either for only the local datacenter or for each datacenter (e.g. block until N or fewer nodes are marked down in this node's local datacenter).

In particular, I'm concerned that in typical setups (maybe something like 2 datacenters, <60 nodes, RF=3, mostly {{NetworkTopologyStrategy}} keyspaces) having more than one node down in gossip in the same datacenter will mean a high probability of getting unavailable exceptions at {{LOCAL_QUORUM}} or timeouts, especially with vnode=256 clusters where any 2 nodes down in different racks essentially guarantees an unavailable error for some intersecting token range.

What if, instead of a percentage, the system waited for a fixed number (or fewer) of endpoints to be marked as down in the local datacenter, such as 1 by default, relying on the timeout for large clusters (although it would be awesome if this timeout re-used or defaulted to an existing timeout relevant to gossip convergence such as {{BROADCAST_INTERVAL}} or {{RING_DELAY}})? What do you think?

I worked up a quick proof of concept implementation that implements counts for the local DC, each DC, or all DCs (for users that are using {{LOCAL_QUORUM}} vs {{EACH_QUORUM}} vs {{QUORUM}}) over on [github|https://github.com/jasobrown/cassandra/compare/13993...jolynch:13993] to show roughly what I'm thinking. I didn't fix the unit tests, but if you think it's a good idea I can fix them up and add some more (that test multi-DC setups).

I guess that, to make it even smarter, a previous {{CassandraDaemon.stop}} could persist how many {{DOWN}} nodes there were in a local table or some such, and then {{CassandraDaemon.start}} would wait for the maximum of that persisted number and the configured default. That adds more complexity, though, and given the flexibility of the three counts I am not sure it's worth it.
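To make the count-based proposal above concrete, here is a minimal Java sketch. It is not taken from the linked proof of concept: the class {{DownCountSketch}}, the {{Peer}} type, and all method names are hypothetical, and the only facts assumed are the usual quorum formula (RF / 2 + 1) and the idea of gating on the number of down peers in the local datacenter.

```java
import java.util.List;

/** Hypothetical sketch; names are illustrative and not taken from the proof of concept. */
final class DownCountSketch
{
    /** Minimal view of a peer: its datacenter and whether gossip marks it alive. */
    static final class Peer
    {
        final String dc;
        final boolean alive;

        Peer(String dc, boolean alive)
        {
            this.dc = dc;
            this.alive = alive;
        }
    }

    /** Replicas needed for a quorum at the given replication factor. */
    static int quorum(int replicationFactor)
    {
        return replicationFactor / 2 + 1;
    }

    /** True if enough replicas of one token range remain alive to reach quorum. */
    static boolean quorumReachable(int replicationFactor, int downReplicas)
    {
        return replicationFactor - downReplicas >= quorum(replicationFactor);
    }

    /** Number of peers in the given datacenter that are marked down. */
    static long downInDc(List<Peer> peers, String dc)
    {
        return peers.stream().filter(p -> p.dc.equals(dc) && !p.alive).count();
    }

    /** Count-based gate: ready once at most maxDown peers in the local DC are down. */
    static boolean readyByCount(List<Peer> peers, String localDc, int maxDown)
    {
        return downInDc(peers, localDc) <= maxDown;
    }
}
```

With RF=3 the quorum is 2, so a token range tolerates one down replica but not two, which is the reason a default of "at most 1 down in the local DC" is suggested above.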
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368255#comment-16368255 ]

Jason Brown commented on CASSANDRA-13993:
-----------------------------------------

Addressed [~aweisberg]'s first round of feedback. Now initializing all connection types at startup.

Also, I've modified {{MessageOut}} to allow senders to declare the connection type they want to use. Related to this, I corrected the behavior of the gossip {{EchoMessage}} on the peer's side by sending the response out on the gossip channel (it currently responds on the small message channel, because it's sending a {{REQUEST_RESPONSE}}). However, it's still necessary to distinguish between {{EchoMessage}} and {{PingMessage}}, as {{PingMessage}} includes an extra byte to express the connection type the peer should use. Deserializing an {{EchoMessage}} on a node that doesn't know to read the extra byte at the end would cause problems on that connection when it tries to deserialize the next message, because of the extra byte it wasn't expecting.

Also, I don't need to make {{PongMessage}} a verb as it won't need a custom {{VerbHandler}}; it can just use {{ResponseVerbHandler}}, which is assigned to {{REQUEST_RESPONSE}} messages.

The main logic of this patch was originally in {{MessagingService}}, but I've moved it into its own class ({{StartupClusterConnectivityChecker}}) and slightly refactored it to make unit testing easier. Also, added a unit test.

Cleaned up the comments on {{MessagingService.Verb}} to be more correct and clearer wrt intent and use.

Added a sanity check in the static block within {{MessagingService.Verb}} where we build up the {{#idToVerbMap}}. We should never allow two verbs to have the same id.
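The extra-byte concern in the comment above can be illustrated with a toy framing example. This is a hedged Java sketch, not Cassandra's wire format: the verb ids, the one-byte framing, and the class {{ExtraByteFramingSketch}} are all assumptions made for illustration. It only shows why a reader that does not consume a trailing byte misinterprets whatever follows on the same connection.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; the ids and framing are illustrative, not Cassandra's real protocol. */
final class ExtraByteFramingSketch
{
    static final int ECHO = 1; // assumed id: message with no payload
    static final int PING = 2; // assumed id: message followed by one connection-type byte

    /** Writes a PING (verb id plus connection-type byte) followed by an ECHO (verb id only). */
    static byte[] pingThenEcho(int connectionType)
    {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        bytes.write(PING);
        bytes.write(connectionType);
        bytes.write(ECHO);
        return bytes.toByteArray();
    }

    /**
     * Reads verb ids from the stream. A reader that knows about the PING payload
     * skips the connection-type byte; one that does not treats it as the next verb id.
     */
    static List<Integer> readVerbIds(byte[] stream, boolean readsPingPayload)
    {
        List<Integer> ids = new ArrayList<>();
        int i = 0;
        while (i < stream.length)
        {
            int verb = stream[i++];
            ids.add(verb);
            if (verb == PING && readsPingPayload)
                i++; // consume the connection-type byte
        }
        return ids;
    }
}
```

A reader that understands the payload sees two messages, while a reader that does not sees three and interprets the connection-type byte as a verb id, which is the kind of desynchronization the comment warns about.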
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358383#comment-16358383 ]

Joshua McKenzie commented on CASSANDRA-13993:
---------------------------------------------

{quote}wdyt?{quote}

Passes the smell test. Legacy code is such a delight.

Anyone that's relying on extending these verbs can do the leg work to better integrate with 13283's impl after this change if they haven't yet anyway, as it's a cleaner solution to this than just "keep adding a little breathing room so we maybe don't overflow".
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357555#comment-16357555 ]

Jason Brown commented on CASSANDRA-13993:
-----------------------------------------

Back in the mists of time, in cassandra 1.2 we had two comments in the Verbs enum:

- a message about [backward compatibility|https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/net/MessagingService.java#L119], which appeared before the {{UNUSED_}} verbs
{code:java}
// use as padding for backwards compatability where a previous version needs to validate a verb from the future.
{code}
- a message about [adding new verbs|https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/net/MessagingService.java#L124] to the end of the list (after the UNUSED verbs)
{code:java}
// remember to add new verbs at the end, since we serialize by ordinal
{code}

The former message assumes we can receive some limited number of messages with verb ids that are unknown, and not blow up trying to deserialize the message.

In 2.0, both of those comments were moved:
- the backward compatibility comment is [now before the newly introduced paxos verbs|https://github.com/apache/cassandra/blob/cassandra-2.0/src/java/org/apache/cassandra/net/MessagingService.java#L125]
- the new verbs comment is [*before* the UNUSED verbs|https://github.com/apache/cassandra/blob/cassandra-2.0/src/java/org/apache/cassandra/net/MessagingService.java#L130]

I think this is where things become confusing. But read on ...

The situation stayed the same until 3.0, where we deleted the backward compatibility comment, but kept the message about adding new verbs in the same place. This is more or less what we have in trunk.

Hence, looking at trunk now, it's not clear if the UNUSED verbs are for future proofing the deserialization or are some sort of external party-specific messages. Further, in the current scheme there is no guarantee that someone can create a custom verb and have it be safe across versions and upgrades, at least not until CASSANDRA-13283 (committed for 4.0).

It seems that the original intent of the UNUSED verbs was to allow "verbs from the future" to be "validated"; that is, [not throw an ArrayIndexOutOfBoundsException|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/net/MessageIn.java#L87] when a node sees a message with a verb id it doesn't know about (assuming that verb id matches one of the UNUSED verb ids). That message would ultimately [be thrown away in {{MessageDeliveryTask}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/net/MessageDeliveryTask.java#L58] as we would have no {{IVerbHandler}} for the unused verb. Further, if we assume the UNUSED verbs to be future proofing, then new verbs should, in fact, be added *before* the UNUSED verbs.

As the ability to add new, custom verbs and be future proof from new conflicting verbs (assuming all verbs got their id from the enum's ordinal) didn't arrive until CASSANDRA-13283 (basically 4.0), I think it's reasonable to assume that nobody is currently running with custom verbs (unless they have backported CASSANDRA-13283). Thus, I think it should be safe to add new verbs to 4.0 before the UNUSED verbs, as long as the new verb ids fall into the UNUSED verb ids that 3.0 and 3.11 have declared. I believe this is what we have done all along.

wdyt? [~aweisberg] [~JoshuaMcKenzie]
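The ordinal argument in the comment above can be sketched in a few lines of Java. The enums below are deliberately tiny and hypothetical (they are not Cassandra's real verb lists, and {{VerbOrdinalSketch}} is not a class in the codebase); the sketch only assumes what the comment states, namely that verbs are serialized by ordinal and that padding verbs have no handler.

```java
/** Hypothetical sketch: the verb names and enum sizes are illustrative, not Cassandra's real lists. */
final class VerbOrdinalSketch
{
    /** An "old" node: real verbs followed by padding slots. */
    enum OldVerb { MUTATION, READ, ECHO, UNUSED_1, UNUSED_2, UNUSED_3 }

    /** A "new" node that adds PING before the padding, so PING reuses the id of the old UNUSED_1. */
    enum NewVerb { MUTATION, READ, ECHO, PING, UNUSED_1, UNUSED_2, UNUSED_3 }

    /** A risky layout: PING inserted among existing verbs shifts the ordinals of READ and ECHO. */
    enum ShiftedVerb { MUTATION, PING, READ, ECHO, UNUSED_1, UNUSED_2, UNUSED_3 }

    /** How an old node maps a wire id to a verb; ids past the end throw ArrayIndexOutOfBoundsException. */
    static OldVerb decodeOnOldNode(int id)
    {
        return OldVerb.values()[id];
    }

    /** Padding verbs have no handler, so a message carrying one would simply be dropped. */
    static boolean hasHandler(OldVerb verb)
    {
        return !verb.name().startsWith("UNUSED_");
    }
}
```

In this toy layout an old node decodes the new PING as {{UNUSED_1}} and drops it for lack of a handler, whereas the shifted layout makes the old node mistake a READ for an ECHO. That difference is why the new ids need to land inside the range the older versions declared as UNUSED.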
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357409#comment-16357409 ]

Joshua McKenzie commented on CASSANDRA-13993:
---------------------------------------------

Regardless of whether the unused slots are currently used by other consumers (be it DSE or otherwise), inserting an enum in the middle explicitly violates the contract / comment in the code:
{code:java}
// remember to add new verbs at the end, since we serialize by ordinal
UNUSED_1,
UNUSED_2,
UNUSED_3,
UNUSED_4,
UNUSED_5,
;
{code}

So I'd recommend against inserting a new verb in the middle.
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357354#comment-16357354 ]

Ariel Weisberg commented on CASSANDRA-13993:
--------------------------------------------

I put up a review on a pull request: https://github.com/apache/cassandra/pull/191#pullrequestreview-95170964

Those unused slots in the enum are relevant for DSE, so I'm not sure whether we can actually take them. Maybe they are there for us to use? [~JoshuaMcKenzie] do you know?
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345006#comment-16345006 ]

Jason Brown commented on CASSANDRA-13993:
-----------------------------------------

bq. I think the new/unknown messages should just be ignored at MessageDeliveryTask#run()

Even though these are new messages, and we don't have CASSANDRA-13283 in pre-4.0, I don't think 3.0/3.11 nodes will fail to deserialize them, as the new Ping/Pong messages will get the next ordinal value from the {{Verb}} enum (in 4.0), and it looks like we have some "UNUSED_" slots in the enum for safety. Thus a 3.11 node could successfully deserialize the {{PingMessage}}, but it won't have a {{VerbHandler}} to send back a {{PongMessage}}. This is acceptable, as the connection will be successfully established (one way, at least), and the message won't deserialize incorrectly and cause the connection to be thrown away. This would only be a transient issue during upgrade to 4.0.

I still need to test this, but at least the initial code reading seems reasonable.
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239556#comment-16239556 ]

Jason Brown commented on CASSANDRA-13993:
-----------------------------------------

A mostly complete branch here:

||13993||
|[branch|https://github.com/jasobrown/cassandra/tree/13993]|
|[utests|https://circleci.com/gh/jasobrown/cassandra/tree/13993]|

The patch proposes to allow the operator to configure some extra time to wait until a configurable percentage of the peers in the cluster are marked alive (in {{Gossiper.endpointStateMap}}) and connected to.

For the alives, we simply check each known peer's state in {{Gossiper.endpointStateMap}} to see if it is marked alive, using all the existing infrastructure in Gossiper (see {{Gossiper#markAlive()}}).

For the connections, the bouncing node sends a new {{PingMessage}} to the peer, which will be sent on the small message channel. The peer responds with a {{PongMessage}}, sent on its own small message channel. Thus, we eagerly create the outbound and inbound connections (small message channel) with each peer in the cluster before the client native protocol port is opened. Note: the gossip outbound and inbound connections will be created by the {{EchoMessage}} and response that is sent by {{Gossiper#markAlive()}}.

There are a couple of open questions I'm still thinking through:
- should the configurable parameters be yaml properties? The current implementation naively uses System props, and hard coded default values at that (which will need to change before commit).
- I need to test how upgrades work, to make sure that nodes which do not know about the new messages (and their verbs) do not fail spectacularly. I think the new/unknown messages should just [be ignored at {{MessageDeliveryTask#run()}}|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/net/MessageDeliveryTask.java#L58]. If there is a problem, I'll need to add a version check before sending the new message.
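The wait described in the comment above (block until some percentage of peers is both marked alive and connected, or until a timeout passes) can be sketched as a simple polling loop. This is a hedged illustration, not the code in the linked branch: the class {{PeerReadinessSketch}}, the {{PeerState}} fields, and the parameter names are hypothetical, and the real patch may track state quite differently.

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch; not the class from the patch. */
final class PeerReadinessSketch
{
    /** Per-peer state as seen by the bouncing node (both fields are illustrative). */
    static final class PeerState
    {
        final boolean markedAlive;  // e.g. the peer is alive according to gossip
        final boolean pongReceived; // e.g. the reply to our ping has arrived

        PeerState(boolean markedAlive, boolean pongReceived)
        {
            this.markedAlive = markedAlive;
            this.pongReceived = pongReceived;
        }
    }

    /** Percentage (0 to 100) of peers that are both alive and connected. */
    static double readyPercent(Map<String, PeerState> peers)
    {
        if (peers.isEmpty())
            return 100.0; // no peers, nothing to wait for
        long ready = peers.values().stream()
                          .filter(p -> p.markedAlive && p.pongReceived)
                          .count();
        return 100.0 * ready / peers.size();
    }

    /**
     * Polls the supplied view until targetPercent is reached or the timeout elapses.
     * Returns true if the threshold was met, false on timeout or interruption.
     */
    static boolean awaitPeers(Supplier<Map<String, PeerState>> view,
                              double targetPercent,
                              long timeoutMillis,
                              long pollMillis)
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true)
        {
            if (readyPercent(view.get()) >= targetPercent)
                return true;
            if (System.nanoTime() - deadline >= 0)
                return false;
            try
            {
                Thread.sleep(pollMillis);
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

In this sketch a timeout is not treated as an error: the caller would simply go on to open the client port, which matches the "optional delay" framing of the ticket.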
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239555#comment-16239555 ]

Jason Brown commented on CASSANDRA-13993:
-----------------------------------------

The details of the timeouts after startup are:
- a client request comes in on the native protocol, either a read or write
- the newly bounced node figures out which peers are responsible for the data (by partition key)
- the node sends the request to the peers; to do so, we have to build up both the outbound and inbound connections (note: internode messaging connections are unidirectional)
- if building those connections is not fast enough, the request will time out (either at the coordinator or the client driver)

On each connection we have to build the TCP connection, possibly perform the TLS handshake, and then perform the c* internode messaging handshake. The time for this is exacerbated with nodes that are in remote datacenters, where the round trip time is significantly higher. In pre-4.0 (before CASSANDRA-8457), this is even worse as all those actions were performed sequentially, for each connection attempt, [on the (single) accept thread|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/net/MessagingService.java#L1284].
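As a rough back-of-the-envelope illustration of why sequential setup on a single accept thread hurts, the Java sketch below compares the total setup time when connections are established one after another with an idealized case where they fully overlap. Everything here is an assumption for illustration: the class {{ConnectionSetupCostSketch}}, the round-trip count per connection, and the latency figures are hypothetical, not measurements from Cassandra.

```java
/** Hypothetical sketch; the round-trip counts and latencies are assumed inputs, not measured values. */
final class ConnectionSetupCostSketch
{
    /** Time to set up one connection, given the RTT and an assumed number of handshake round trips. */
    static long perConnectionMillis(long rttMillis, int roundTrips)
    {
        return rttMillis * roundTrips;
    }

    /** Total time when every connection is set up one after another on a single thread. */
    static long sequentialMillis(int connections, long rttMillis, int roundTrips)
    {
        return connections * perConnectionMillis(rttMillis, roundTrips);
    }

    /** Idealized total time when all connection setups fully overlap. */
    static long overlappedMillis(int connections, long rttMillis, int roundTrips)
    {
        return connections == 0 ? 0 : perConnectionMillis(rttMillis, roundTrips);
    }
}
```

For example, with 100 remote peers, an assumed 80 ms round trip, and an assumed 4 round trips per connection (TCP, TLS, and the internode handshake combined), the sequential case costs 32 seconds in this model while the fully overlapped case costs 320 ms. Real numbers will differ, but the linear growth with peer count is the point.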
[jira] [Commented] (CASSANDRA-13993) Add optional startup delay to wait until peers are ready
[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239554#comment-16239554 ]

Jason Brown commented on CASSANDRA-13993:

The details of the causes of unavailables after startup are:
- a client request comes in on the native protocol, either a read or a write
- the newly bounced node figures out which peers are responsible for the data (by partition key)
- the node checks to see if it thinks the peers are available (see below)
- if not enough replicas are alive to fulfill the request, the unavailable error is returned to the client

A bouncing node determines whether a peer is alive as follows:
- In {{StorageService#initServer()}}, add the IP addresses of previously known peers to gossip via {{Gossiper#addSavedEndpoint}}
- {{Gossiper#addSavedEndpoint}} sets up the local state about the peer, and marks the peer as dead ({{EndpointState#markDead}})

... time passes in the process startup sequence ...

- when we get gossip data from any peer in the cluster, we will start updating the known state in gossip about each peer
- for each peer updated that we think will be a live node (not decommissioned, shutdown, whatever), {{Gossiper#markAlive()}} will send an {{EchoMessage}} to the peer. This is sent on the {{OutboundMessagingPool#gossipChannel}} socket, which opens up a TCP socket and does the TCP handshake; when we go to write the message to the socket (which will be the cassandra internode handshake), the TLS handshake is initiated and completed before the message bytes are sent.
- The peer will respond with a simple request-response message. This (should be) sent on the peer's {{OutboundMessagingPool#gossipChannel}} [1], which requires its own socket, TCP handshake, TLS handshake, and so on before the request-response bytes are sent to the socket.
- The bounced node receives the request-response, and invokes the callback {{Gossiper#markRealAlive()}}. In that method we finally mark the peer as alive by invoking {{EndpointState#markAlive()}}.
- All client-initiated DML operations will look into the EndpointState for a peer inside of Gossiper to check if the peer is alive.

Thus, we must have a successful {{EchoMessage}} and response between any two nodes for the initiator to consider a peer as available for user-initiated queries.

[1] Actually, there is a bug wherein the response is sent on the {{OutboundMessagingPool#smallMessageChannel}}. CASSANDRA-13714 exists to address it.
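The liveness sequence in the comment above can be pictured as a small state machine per peer. The sketch below is only a toy model, not Cassandra's Gossiper: the class, the state names, and the methods are all hypothetical, peers are identified by a plain array index, and the network hops (TCP, TLS, internode handshake) are reduced to two method calls.

```java
/** Toy model of the dead -> echo sent -> alive sequence; all names are hypothetical. */
final class PeerLivenessSketch {
    enum State { DEAD, ECHO_SENT, ALIVE }

    private final State[] peers;

    PeerLivenessSketch(int peerCount) {
        peers = new State[peerCount];
        for (int i = 0; i < peerCount; i++)
            peers[i] = State.DEAD;   // saved endpoints start out marked dead at startup
    }

    /** Gossip state arrived for a peer that looks live: an echo is sent to it. */
    void onGossipState(int peer) {
        if (peers[peer] == State.DEAD)
            peers[peer] = State.ECHO_SENT;
    }

    /** The echo response came back: only now is the peer considered alive. */
    void onEchoResponse(int peer) {
        if (peers[peer] == State.ECHO_SENT)
            peers[peer] = State.ALIVE;
    }

    /** Coordinator-side check: false here corresponds to returning an unavailable error. */
    boolean canServe(int[] replicas, int required) {
        int alive = 0;
        for (int replica : replicas)
            if (peers[replica] == State.ALIVE)
                alive++;
        return alive >= required;
    }
}
```

In this model a request that needs two of three replicas is rejected until two complete echo round trips have finished, even if gossip has already reported all three peers as healthy. That is the window in which a freshly bounced node returns unavailable errors.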