[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505212#comment-15505212 ]

Stefania edited comment on CASSANDRA-9318 at 9/20/16 1:13 AM:
--

Committed to trunk as d43b9ce5092f8879a1a66afebab74d86e9e127fb, thank you for this excellent patch [~sbtourist]!

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>             Fix For: 3.10
>
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests and, if it reaches a high watermark, disable read on client connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other issues.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
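The high/low watermark scheme in the issue description above can be sketched as a small in-flight byte tracker. The class and method names below are hypothetical, not Cassandra code: reads are paused once in-flight bytes cross the high watermark and resumed only after they fall back below the low watermark, giving hysteresis rather than flapping.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the high/low watermark idea from the description;
// not Cassandra's actual API. Reads are paused above the high watermark and
// resumed only once in-flight bytes drop back below the low watermark.
class InflightTracker {
    private final long high, low;
    private final AtomicLong inflightBytes = new AtomicLong();
    private volatile boolean readPaused = false;

    InflightTracker(long high, long low) { this.high = high; this.low = low; }

    // Called when a request is admitted from a client connection.
    void onRequestStart(long bytes) {
        if (inflightBytes.addAndGet(bytes) >= high)
            readPaused = true;   // stop reading from client connections
    }

    // Called when the request completes, times out, or is shed.
    void onRequestDone(long bytes) {
        if (inflightBytes.addAndGet(-bytes) <= low)
            readPaused = false;  // resume reading
    }

    boolean isReadPaused() { return readPaused; }

    public static void main(String[] args) {
        InflightTracker t = new InflightTracker(100, 50);
        t.onRequestStart(120);
        System.out.println(t.isReadPaused()); // true: above high watermark
        t.onRequestDone(80);
        System.out.println(t.isReadPaused()); // false: back below low watermark
    }
}
```

The gap between the two watermarks is what prevents the server from toggling reads on and off on every request when load hovers near a single threshold.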
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502215#comment-15502215 ]

Stefania edited comment on CASSANDRA-9318 at 9/19/16 4:09 AM:
--

Thanks [~sbtourist], the latest commits and the entire patch LGTM. The test failures are unrelated and we have tickets for all of them: CASSANDRA-12664, CASSANDRA-12656 and CASSANDRA-12140. I've launched one more dtest build to cover the final commit and to hopefully shake off the CASSANDRA-12656 failures, since these tests shouldn't even be running now.

I've squashed your entire patch into [one commit|https://github.com/stef1927/cassandra/commit/1632f2e9892624f611ac3629fb84a82594fec726] and fixed some formatting issues (mostly trailing spaces) [here|https://github.com/stef1927/cassandra/commit/e3346e5f5a49b2933e10a84405730] on this [branch|https://github.com/stef1927/cassandra/commits/9318]. If you could double-check the formatting nits, I can squash them and commit once the final dtest build has also completed.
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386978#comment-15386978 ]

Stefania edited comment on CASSANDRA-9318 at 7/21/16 1:51 AM:
--

bq. I think if we add too many responsibilities to the strategy it really starts to look less and less like an actual strategy; we could rather add a "BackPressureManager", but to be honest, the current responsibility assignment feels natural to me, and back-pressure is naturally a "messaging layer" concern, so I'd keep everything as it is unless you can show any actual advantages in changing the design.

I don't think we want a "BackPressureManager". We just need to decide whether we are happy having a back-pressure state for each OutboundTcpConnectionPool, even though back-pressure may be disabled or some strategies may use different metrics (e.g. total in-flight requests or memory), or whether the states should be hidden inside the strategy.

bq. This was intentional, so that outgoing requests could have been processed while pausing; but it had the drawback of making the rate limiting less "predictable" (as dependent on which thread was going to be called next), so I ended up changing it. Now the back pressure is applied before hinting, or inserting local mutations, or sending to remote data centers, but _after_ sending to local replicas.

I think we may need to rework SP.sendToHintedEndpoints a little bit: we want to fire the local insert, then block, then send all messages. There is one more potential issue in the case of *non-local replicas*. We send the mutation to only one remote replica, which then forwards it to the other replicas in that DC. So remote replicas may have an outgoing rate of zero, and therefore a limiter rate of positive infinity, which means we won't throttle at all with FAST flow selected. In fact, we have no way of reliably knowing the status of remote replicas, or of replicas to which we haven't sent any mutations recently, and we should perhaps exclude them.

bq. It's not a matter of making the sort stable, but a matter of making the comparator fully consistent with equals, which should be already documented by the treeset/comparator javadoc.

It's in the comparator javadoc but not in the TreeSet constructor doc; that's why I missed it yesterday. Thanks.

bq. Agreed. If we all (/cc Aleksey Yeschenko Sylvain Lebresne Jonathan Ellis) agree on the core concepts of the current implementation, I will rebase to trunk, remove the trailing spaces, and then I would move into testing it, and in the meantime think/work on adding metrics and making any further adjustments.

I think we can start testing once we have sorted out the second discussion point above. As for the API issues, we'll eventually need to reach consensus, but we don't need to block the tests on it. One more tiny nit:
* The documentation of {{RateBasedBackPressureState}} still mentions the overloaded flag.
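The corner case described above (a remote replica with no recent outgoing traffic ending up completely unthrottled) can be illustrated with a simplified version of a rate-based calculation. The names and the exact formula below are assumptions for illustration only, not the actual RateBasedBackPressureState logic:

```java
// Illustrative sketch of a rate-based backpressure calculation; the names and
// formula are assumptions, not the actual RateBasedBackPressureState code.
class RateBasedSketch {
    // incoming = responses received in the window, outgoing = requests sent.
    static double ratio(double incoming, double outgoing) {
        // With no outgoing requests we know nothing about the replica.
        return outgoing == 0 ? Double.POSITIVE_INFINITY : incoming / outgoing;
    }

    // Limiter rate derived from the ratio: an "unknown" replica (infinite
    // ratio) is left unthrottled, which is exactly the problem noted above
    // for remote replicas that only ever receive forwarded mutations.
    static double limiterRate(double incoming, double outgoing) {
        double r = ratio(incoming, outgoing);
        return r >= 1.0 ? Double.POSITIVE_INFINITY  // replica keeping up
                        : incoming;                 // pin to observed sustainable rate
    }

    public static void main(String[] args) {
        // Remote replica we never sent to directly: no throttling at all.
        System.out.println(RateBasedSketch.limiterRate(0, 0));    // Infinity
        // Replica acking 10 of 100 requests: throttle to its incoming rate.
        System.out.println(RateBasedSketch.limiterRate(10, 100)); // 10.0
    }
}
```

With FAST flow selected, the infinite limiter rate for the zero-outgoing case means writes to that replica are never delayed, regardless of its actual health.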
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375752#comment-15375752 ]

Jonathan Ellis edited comment on CASSANDRA-9318 at 7/13/16 9:54 PM:

bq. if we make the strategy a bit more generic as mentioned above so the decision is made from all replicas involved (maybe the strategy should also keep track of the replica state completely internally, so we can implement a basic strategy like a simple high watermark very easily), and we make sure not to throttle too quickly (typically, if a single replica is slow and we don't really need it, start by just hinting it), then I'd be happy moving to the "actually test this" phase and seeing how it goes.

I suppose that's reasonable in principle, with some caveats:
# Throwing exceptions shouldn't be part of the API. OverloadedException dates from the Thrift days, when our flow control options were very limited and this was the best we could do to tell clients "back off." Now that we have our own protocol and full control over Netty, we should simply not read more requests until we shed some load. (Since shedding load is a gradual process -- requests time out, we write hints, our load goes down -- clients will just perceive this as slowing down, which is what we want.)
# The API should provide for reporting load to clients so they can do real load balancing across coordinators and not just round-robin.
# Throttling requests to the speed of the slowest replica is not something we should ship, even as an option.
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373624#comment-15373624 ]

Sergio Bossa edited comment on CASSANDRA-9318 at 7/12/16 8:23 PM:
--

[~Stefania],

bq. if a message expires before it is sent, we consider this negatively for that replica, since we increment the outgoing rate but not the incoming rate when the callback expires, and still it may have nothing to do with the replica: if the message was not sent, it may be due to the coordinator dealing with too many messages.

Right, but there isn't much we can do without far more invasive changes. Anyway, I don't think that's actually a problem: if the coordinator is overloaded, we'll end up generating too many hints and failing with {{OverloadedException}} (this time with its original meaning), so we should be covered.

bq. I also observe that if a replica has a low rate, then we may block when acquiring the limiter, and this will indirectly throttle all following replicas, even if they were ready to receive mutations sooner.

See my answer at the end.

bq. AbstractWriteResponseHandler sets the start time in the constructor, so the time spent acquiring a rate limiter for slow replicas counts towards the total time before the coordinator throws a write timeout exception.

See my answer at the end.

bq. SP.sendToHintedEndpoints(), we should apply backpressure only if the destination is alive.

I know; I'm holding off on these changes until we settle on a plan for the whole write path (in terms of what to do with the CL, the exception to be thrown, etc.).

bq. Let's use UnavailableException since WriteFailureException indicates a non-timeout failure when processing a mutation, and so it is not appropriate for this case. For protocol V4 we cannot change UnavailableException, but for V5 we should add a new parameter to it. At the moment it contains , we should add the number of overloaded replicas, so that drivers can treat the two cases differently.

Does that mean we should advance the protocol version in this issue, or delegate that to a new issue?

bq. Marking messages as throttled would let the replica know if backpressure was enabled, that's true, but it also makes the existing mechanism even more complex.

How so? In implementation terms, it should be literally as easy as:
1) Add a byte parameter to {{MessageOut}}.
2) Read such byte parameter from {{MessageIn}} and eventually skip dropping it replica-side.
3) If possible (didn't check it), when a "late" response is received on the coordinator, try to cancel the related hint.
Do you see any complexity I'm missing there?

bq. dropping mutations that have been in the queue for longer than the RPC write timeout is done not only to shed load on the replica, but also to avoid wasting resources performing a mutation when the coordinator has already returned a timeout exception to the client.

This is very true, and that's why I said it's a bit of a wild idea. Obviously, the same is true outside of back-pressure, as even now it is possible to return a write timeout to clients and still have some or all mutations applied. In the end, it might be good to optionally enable such behaviour, as the advantage would be increased consistency at the expense of more resource consumption, which is a tradeoff some users might want to make. To be clear, I'm not strictly lobbying to implement it, just trying to reason about pros and cons.

bq. I still have concerns regarding additional write timeout exceptions and whether an overloaded or slow replica can slow everything down.

These are valid concerns of course, and given similar concerns from [~jbellis], I'm working on some changes to avoid write timeouts due to healthy replicas unnaturally throttled by unhealthy ones, and, depending on [~jbellis]'s answer to my last comment above, maybe only actually apply back-pressure if the CL is not met. Stay tuned.
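The numbered steps above for marking throttled mutations can be sketched roughly as follows. The parameter key, map-based representation, and method names are hypothetical; the real MessageOut/MessageIn API differs:

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the "mark throttled mutations" idea discussed above.
// The parameter key, map representation, and method names are hypothetical;
// they are not the actual MessageOut/MessageIn API.
class ThrottleMark {
    static final String BACKPRESSURE = "BP"; // hypothetical parameter key

    // Step 1: coordinator tags the outgoing mutation as sent under backpressure.
    static Map<String, byte[]> mark(Map<String, byte[]> params) {
        params.put(BACKPRESSURE, new byte[]{1});
        return params;
    }

    // Step 2: replica reads the parameter and skips dropping a "late" marked
    // mutation, since the throttling coordinator still expects it to be applied.
    static boolean shouldDrop(Map<String, byte[]> params, boolean pastTimeout) {
        return pastTimeout && !params.containsKey(BACKPRESSURE);
    }

    public static void main(String[] args) {
        Map<String, byte[]> params = new HashMap<>();
        System.out.println(shouldDrop(params, true));        // true: unmarked and late
        System.out.println(shouldDrop(mark(params), true));  // false: marked, kept
    }
}
```

Step 3 (cancelling the related hint when a late response arrives) has no equivalent here; it would live in the coordinator's response callback path.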
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373537#comment-15373537 ]

Jeremiah Jordan edited comment on CASSANDRA-9318 at 7/12/16 7:31 PM:

[~jbellis] then we can default this to off. I have interacted with *many* people using Cassandra who would actually like to see some rate limiting applied for cases 1 and 2, so that things don't fall over (which shouldn't happen with new hints, hopefully) or hint like crazy. Turning this on for them would allow that to happen.
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373512#comment-15373512 ]

Jonathan Ellis edited comment on CASSANDRA-9318 at 7/12/16 7:23 PM:

The more I think about it, the more I think the entire approach may be a bad fit for Cassandra. Consider:
# If a node has a "hiccup" of slow performance, e.g. due to a GC pause, we want to hint those writes and return success to the client. No need to rate limit.
# If a node has a sustained period of slow performance, we want to hint those writes and return success to the client. No need to rate limit, unless we are overwhelmed with hints. (Not sure if hint overload is actually a problem with the new file-based hints.)
# Where we DO want to rate limit is when the client is throwing more updates at the coordinator than the system can handle, whether that is for a single token range or globally across all nodes.

So I see this approach as doing the wrong thing for 1 and 2 and only partially helping with 3. Put another way: we do NOT want to limit performance to the slowest node in a set of replicas. That is kind of the opposite of the redundancy we want to provide.
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370533#comment-15370533 ]

Sergio Bossa edited comment on CASSANDRA-9318 at 7/11/16 4:06 PM:
--

[~Stefania],

bq. Do we track the case where we receive a failed response? Specifically, in ResponseVerbHandler.doVerb, shouldn't we call updateBackPressureState() also when the message is a failure response?

Good point. I focused more on the success case, due to dropped mutations, but that sounds like a good thing to do.

bq. If we receive a response after it has timed out, won't we count that request twice, incorrectly increasing the rate for that window?

But can that really happen? {{ResponseVerbHandler}} returns _before_ incrementing back-pressure if the callback is null (i.e. expired), and {{OutboundTcpConnection}} doesn't even send outbound messages if they're timed out, or am I missing something?

bq. I also argue that it is quite easy to comment out the strategy and to have an empty strategy in the code that means no backpressure.

Again, I believe this would make enabling/disabling back-pressure via JMX less user-friendly.

bq. I think what we may need is a new companion snitch that sorts the replicas by backpressure ratio

I do not think sorting replicas is what we really need, as you have to send the mutation to all replicas anyway. I think what you rather need is a way to pre-emptively fail if the write consistency level is not met by enough "non-overloaded" replicas, i.e.:
* If CL.ONE, fail if *all* replicas are overloaded.
* If CL.QUORUM, fail if *quorum* replicas are overloaded.
* If CL.ALL, fail if *any* replica is overloaded.
This can be easily accomplished in {{StorageProxy#sendToHintedEndpoints}}.

bq. the exception needs to be different. native_protocol_v4.spec clearly states

I missed that too :( This leaves us with two options:
* Adding a new exception to the native protocol.
* Reusing a different exception, with {{WriteFailureException}} and {{UnavailableException}} the most likely candidates.
I'm currently leaning towards the latter option.

bq. By "load shedding by the replica" do we mean dropping mutations that have timed out, or something else?

Yes.

bq. Regardless, there is the problem of ensuring that all nodes have backpressure enabled, which may not be trivial.

We only need to ensure the coordinator for that specific mutation has back-pressure enabled, and we could do this by "marking" the {{MessageOut}} with a special parameter. What do you think?
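The per-consistency-level rules above reduce to a single comparison: fail pre-emptively when the non-overloaded replicas can no longer satisfy the CL. A minimal sketch (hypothetical helper, not the actual StorageProxy code):

```java
// Minimal sketch of the CL-aware overload check proposed above; this is a
// hypothetical helper, not actual StorageProxy#sendToHintedEndpoints code.
class OverloadCheck {
    // requiredByCL: 1 for ONE, replicas/2 + 1 for QUORUM, replicas for ALL.
    static boolean shouldFail(int replicas, int overloaded, int requiredByCL) {
        return replicas - overloaded < requiredByCL;
    }

    public static void main(String[] args) {
        // RF = 3:
        System.out.println(shouldFail(3, 3, 1)); // ONE: fails only if all overloaded
        System.out.println(shouldFail(3, 2, 2)); // QUORUM: fails, a quorum is overloaded
        System.out.println(shouldFail(3, 1, 3)); // ALL: fails, one replica is overloaded
    }
}
```

Checking the three bullet cases against this comparison shows they are all instances of the same inequality, which is why the check slots naturally into one place on the write path.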
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371063#comment-15371063 ] Sergio Bossa edited comment on CASSANDRA-9318 at 7/11/16 4:05 PM: --

bq. One thing that worries me is, how do you distinguish between “node X is slow because we are writing too fast and we need to throttle clients down” and “node X is slow because it is dying, we need to ignore it and accept writes based on other replicas?” I.e. this seems to implicitly push everyone to a kind of CL.ALL model once your threshold triggers, where if one replica is slow then we can't make progress.

This was already noted by [~slebresne], and you're both right: this initial implementation is heavily biased towards my specific use case :) But the solution proposed above should fix it:

bq. I think what you rather need is a way to pre-emptively fail if the write consistency level is not met by enough "non-overloaded" replicas, i.e.: if CL.ONE, fail if all replicas are overloaded...

Also, the exception would be sent to the client only if the low threshold is met, and only the first time it is met, for the duration of the back-pressure window (the write RPC timeout), i.e.:
* The threshold is 0.1, outgoing requests are 100, incoming responses are 10, so the ratio is 0.1.
* An exception is thrown by all write requests for the current back-pressure window.
* The outgoing rate limiter is set at 10, which means the next ratio calculation will approach the sustainable rate; even if replicas still lag behind, the ratio will not drop to 0.1 again _unless_ the incoming rate dramatically falls to 1.

This is to say the chances of a meltdown due to "overloaded exceptions" are moderate (also, clients are supposed to adjust themselves when getting such an exception), and the above proposal should make things play nicely with the CL too. If you all agree with that, I'll move forward and make that change.
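The ratio-based back-pressure window described above can be sketched as follows; this is an illustrative model only (the class and method names are hypothetical, not Cassandra's actual implementation):

```python
class BackPressureWindow:
    """Tracks outgoing requests vs. incoming responses for one replica over a
    single back-pressure window (the write RPC timeout)."""

    def __init__(self, low_threshold=0.1):
        self.low_threshold = low_threshold
        self.outgoing = 0
        self.incoming = 0

    def on_request(self):
        self.outgoing += 1

    def on_response(self):
        self.incoming += 1

    def close_window(self):
        """Returns (ratio, overloaded, next_rate_limit) when the window ends."""
        ratio = self.incoming / self.outgoing if self.outgoing else 1.0
        overloaded = ratio <= self.low_threshold
        # Limit the next window's outgoing rate to the observed incoming rate,
        # so the ratio approaches the sustainable level even if the replica
        # still lags behind.
        next_rate_limit = max(self.incoming, 1)
        return ratio, overloaded, next_rate_limit

# The example from the comment: 100 outgoing, 10 incoming, threshold 0.1.
w = BackPressureWindow(low_threshold=0.1)
for _ in range(100):
    w.on_request()
for _ in range(10):
    w.on_response()
ratio, overloaded, limit = w.close_window()
# ratio is 0.1, so the window is overloaded and the next window is capped at 10.
```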
> Bound the number of in-flight requests at the coordinator > - > > Key: CASSANDRA-9318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9318 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths, Streaming and Messaging >Reporter: Ariel Weisberg >Assignee: Sergio Bossa > Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, > limit.btm, no_backpressure.png > > > It's possible to somewhat bound the amount of load accepted into the cluster > by bounding the number of in-flight requests and request bytes. > An implementation might do something like track the number of outstanding > bytes and requests and if it reaches a high watermark disable read on client > connections until it goes back below some low watermark. > Need to make sure that disabling read on the client connection won't > introduce other issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091963#comment-15091963 ] Jonathan Ellis edited comment on CASSANDRA-9318 at 1/11/16 2:26 PM:

Since most of that discussion is implementation details, I'll quote the relevant part:

bq. With a consistency level less than ALL, mutation processing can move to the background (meaning the client was answered, but there is still work to do on behalf of the request). If the background request completion rate is lower than the incoming request rate, background requests will accumulate and eventually exhaust all memory resources. This patch's aim is to prevent this situation by monitoring how much memory all current background requests take and, when some threshold is passed, to stop moving requests to the background (by not replying to a client until either memory consumption moves below the threshold or the request is fully completed).

bq. There are two main points where each background mutation consumes memory: holding the frozen mutation until the operation is complete (in order to hint it if it does not complete), and on the RPC queue to each replica, where it sits until it's sent out on the wire. The patch accounts for both of those separately and limits the former to 10% of total memory and the latter to 6M. Why 6M? The best answer I can give is: why not :) But on a more serious note, the number should be small enough that all the data can be sent out in a reasonable amount of time, and one shard is not capable of achieving even close to full bandwidth, so empirical evidence shows 6M to be a good number.
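The two memory bounds quoted above (10% of the heap for retained frozen mutations, 6M per outbound replica queue) amount to a simple admission check before a request may move to the background. A minimal sketch, where the class name, the 8 GB heap figure, and the tracking scheme are assumptions for illustration:

```python
HEAP_BYTES = 8 * 1024**3                 # assumed heap size for the example
MUTATION_POOL_LIMIT = HEAP_BYTES // 10   # ~10% of heap for retained mutations
REPLICA_QUEUE_LIMIT = 6 * 1024 * 1024    # 6M per outbound replica queue

class BackgroundWriteTracker:
    """Illustrative accounting of in-flight background mutations."""

    def __init__(self):
        self.mutation_bytes = 0   # total retained frozen-mutation bytes
        self.queue_bytes = {}     # replica -> bytes queued for the wire

    def try_admit(self, replica, size):
        """Admit a mutation to the background only if both bounds hold;
        otherwise the coordinator withholds the reply to the client until
        memory consumption drops or the request fully completes."""
        queued = self.queue_bytes.get(replica, 0)
        if self.mutation_bytes + size > MUTATION_POOL_LIMIT:
            return False
        if queued + size > REPLICA_QUEUE_LIMIT:
            return False
        self.mutation_bytes += size
        self.queue_bytes[replica] = queued + size
        return True

    def on_complete(self, replica, size):
        """Release both counts once the mutation is acknowledged."""
        self.mutation_bytes -= size
        self.queue_bytes[replica] -= size

tracker = BackgroundWriteTracker()
small_ok = tracker.try_admit("replica1", 1024)            # fits both bounds
big_rejected = tracker.try_admit("replica1", 7 * 1024**2) # overflows the 6M queue
```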
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056411#comment-15056411 ] Ariel Weisberg edited comment on CASSANDRA-9318 at 12/14/15 6:20 PM: -

Quick note. 65k mutations pending in the mutation stage, 7 memtables pending flush. [I hooked memtables pending flush into the backpressure mechanism.|https://github.com/apache/cassandra/commit/494eabf48ab48f1e86c058c0b583166ab39dcc39] That absolutely wrecked performance, as throughput periodically dropped to zero, but throughput is infinitely higher than when the database has OOMed.

Kicked off a few performance runs to demonstrate what happens when you do have backpressure and you try various large limits on in-flight memtables/requests:
[9318 w/backpressure 64m 8g heap memtables count|http://cstar.datastax.com/tests/id/fa769eec-a283-11e5-bbc9-0256e416528f]
[9318 w/backpressure 1g 8g heap memtables count|http://cstar.datastax.com/tests/id/4c52dd6e-a286-11e5-bbc9-0256e416528f]
[9318 w/backpressure 2g 8g heap memtables count|http://cstar.datastax.com/tests/id/b3d5b470-a286-11e5-bbc9-0256e416528f]

I am setting the point where backpressure turns off to almost the same limit as where it turns on. This smooths out performance just enough for stress not to constantly emit huge numbers of errors as writes time out because the database stops serving requests for a long time while waiting for a memtable to flush.

With pressure from memtables somewhat accounted for, the remaining source of pressure that can bring down a node is remotely delivered mutations. I can throw those into the calculation and add a listener that blocks reads from other cluster nodes. It's a nasty thing to do, but maybe not that different from OOM. I am going to hack together something to force a node to be slow so I can demonstrate overwhelming it with remotely delivered mutations first.
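Setting the "off" point almost equal to the "on" point, as described above, is high/low-watermark flow control with a deliberately narrow hysteresis band. A minimal sketch (names and the gate abstraction are illustrative, not actual Cassandra code):

```python
class WatermarkGate:
    """Backpressure engages when in-flight bytes reach `high` and releases
    once they fall back to `low`; the comment above sets `low` very close
    to `high` to smooth throughput rather than oscillate in long stalls."""

    def __init__(self, high, low):
        assert low <= high
        self.high = high
        self.low = low
        self.paused = False

    def update(self, in_flight_bytes):
        """Returns True while new requests should not be read."""
        if not self.paused and in_flight_bytes >= self.high:
            self.paused = True
        elif self.paused and in_flight_bytes <= self.low:
            self.paused = False
        return self.paused

# Narrow band: turn off at 90 when we turned on at 100.
gate = WatermarkGate(high=100, low=90)
states = [gate.update(x) for x in (50, 120, 95, 90, 50)]
# Paused only while draining from 120 down through 95; released at 90.
```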
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605817#comment-14605817 ] Jonathan Ellis edited comment on CASSANDRA-9318 at 6/29/15 4:11 PM:

https://issues.apache.org/jira/browse/CASSANDRA-9318?focusedCommentId=14604775&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14604775

bq. [It is not] a good idea to try to allow extra reads when write capacity is full or vice versa. They both ultimately use the same resources (cpu, heap, disk i/o).
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604649#comment-14604649 ] Jonathan Ellis edited comment on CASSANDRA-9318 at 6/28/15 12:08 PM: -

Here's where I've ended up:
# Continuing to accept writes faster than a coordinator can deliver them to replicas is bad. Even perfect load shedding is worse from a client perspective than throttling, since if we load shed and time out, the client needs to guess the right rate to retry at.
# For the same reason, accepting a write but then refusing it with UnavailableException is worse than waiting to accept the write until we have capacity for it.
# It's more important to throttle writes because, while we can get in trouble with large reads too (a small request turns into a big reply), in practice reads are naturally throttled because a client needs to wait for the read before taking action on it. With writes, on the other hand, a new user's first inclination is to see how fast s/he can bulk load stuff.

In practice, I see load shedding and throttling as complementary. Replicas can continue to rely on load shedding. Perhaps we can attempt distributed back pressure later (if every replica is overloaded, we should again throttle clients), but for now let's narrow our scope to throttling clients to the capacity of a coordinator to send out.

*I propose we define a limit on the amount of memory MessagingService can consume and pause reading additional requests whenever that limit is hit.* Note that:
# If MS's load is distributed evenly across all destinations, then this is trivially the right thing to do.
# If MS's load is caused by a single replica falling over or unable to keep up, this is still the right thing to do, because the alternative is worse. MS will load shed timed-out requests, but if clients are sending more requests to a single replica than we can shed (if rate * timeout > capacity), then we still need to throttle or we will exhaust the heap and fall over.

(The hint-based UnavailableException tries to help with scenario 2, and I will open a ticket to test how well that actually works. But the hint threshold cannot help with scenario 1 at all, and that is the hole this ticket needs to plug.)
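The "rate * timeout > capacity" condition in point 2 above follows from back-of-envelope arithmetic: load shedding only drops a request after it times out, so at steady state a coordinator still holds roughly rate × timeout × request-size bytes in memory. A small sketch (the specific numbers are assumptions, not measurements from the ticket):

```python
def must_throttle(req_per_s, timeout_s, avg_request_bytes, capacity_bytes):
    """True when load shedding alone cannot keep in-flight memory under the
    limit, so the coordinator must throttle (pause reading) instead.
    At steady state, roughly req_per_s * timeout_s requests are in flight
    before shedding can reclaim any of them."""
    in_flight_bytes = req_per_s * timeout_s * avg_request_bytes
    return in_flight_bytes > capacity_bytes

# e.g. 50k writes/s with a 2s write timeout at ~1 KiB each holds ~100 MB
# in flight, regardless of how aggressively timed-out requests are shed.
heavy = must_throttle(50_000, 2, 1024, 64 * 1024 * 1024)   # 64 MiB budget
light = must_throttle(50_000, 2, 1024, 256 * 1024 * 1024)  # 256 MiB budget
```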
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604746#comment-14604746 ] Benedict edited comment on CASSANDRA-9318 at 6/28/15 3:48 PM: --

bq. This in no way affects our contract or guarantees, since we don't do anything at all in the intervening period except consume memory.
bq. The whole point is that coordinators are falling over from OOM. This isn't just something we can wave away as negligible.

I was referring here to the status quo, FTR. Also FTR, we do clearly state that hints are best effort (they also aren't guaranteed to be persisted), so as far as contracts/guarantees are concerned, I don't know that we make any (and I wasn't aware of this one). It would be really helpful for these (and many other) discussions if all of the assumptions, contracts and guarantees we make about correctness and delivery were made available in a single clearly spelled out document (and that, like the code style, this document be the final arbiter of what action to take).
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604763#comment-14604763 ] Jonathan Ellis edited comment on CASSANDRA-9318 at 6/28/15 4:31 PM:

Hinting is better than leaving things in an unknown state, but it's not something we should opt users into if we have a better option, since it basically turns the write into CL.ANY. I think you're [Benedict] overselling how scary it is to stop reading new requests until we can free up some memory from MS. We're not dropping connections; we're just imposing some flow control, which is something that already happens at different levels anyway.
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604735#comment-14604735 ] Jonathan Ellis edited comment on CASSANDRA-9318 at 6/28/15 3:35 PM:

Let's pull optimizing hints into a separate ticket. It is complementary to "don't accept more requests than you can handle."
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604728#comment-14604728 ] Jonathan Ellis edited comment on CASSANDRA-9318 at 6/28/15 3:24 PM:

bq. This in no way affects our contract or guarantees, since we don't do anything at all in the intervening period except consume memory.

The whole point is that coordinators are falling over from OOM. This isn't just something we can wave away as negligible.
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604775#comment-14604775 ] Jonathan Ellis edited comment on CASSANDRA-9318 at 6/28/15 4:52 PM:

Replica and coordinator are only identical on writes when RF=1, hardly the most common case. Nor is it a good idea to try to allow extra reads when write capacity is full or vice versa. They both ultimately use the same resources (cpu, heap, disk i/o).
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536846#comment-14536846 ] Jonathan Shook edited comment on CASSANDRA-9318 at 5/10/15 12:32 AM:
I would venture that a solid load shedding system may improve the degenerate overloading case, but it is not the preferred method for dealing with overloading for most users. The concept of back-pressure is more squarely what people expect, for better or worse. Here is what I think reasonable users want to see, with some variations:
1) The system performs with stability, up to the workload that it is able to handle with stability.
2a) Once it reaches that limit, it starts pushing back on how quickly it accepts new work. This means that it simply blocks the submission of new requests within some useful bound determined by the system. It does not yet have to shed load. It does not yet have to throw exceptions. This is a very reasonable expectation for most users; it is what they expect. Load shedding is a term of art which does not change the users' expectations.
2b) Once it reaches that limit, it starts throwing OverloadedException (OE) to the client. It does not have to shed load yet. (Perhaps this exception, or something like it, can be thrown _before_ load shedding occurs.) This is a very reasonable expectation for users who are savvy enough to do active load management at the client level. It may have to start writing hints, but if you are writing hints merely because of load, that might not be the best justification for having the hints system kick in. To me this is inherently a convenient remedy for the wrong problem, even if it works well. Yes, hints are there as a general mechanism, but they do not solve the problem of needing to know when the system is being pushed beyond capacity and how to handle it proactively. You could also say that hints sometimes actively hurt capacity when you need it most. They are expensive to process given the current implementation, and will always be load shifting even at theoretical best. Still, we need them for node availability concerns, although we should be careful not to use them as a crutch for general capacity issues.
2c) Once it reaches that limit, it starts backlogging (without a helpful signature of such in the responses, maybe BackloggingException with some queue estimate). This is a very reasonable expectation for users who are savvy enough to manage their peak and valley workloads in a sensible way. Sometimes you actually want to tax the ingest and flush side of the system for a bit before allowing it to switch modes and catch up with compaction. The fact that C* can do this is an interesting capability, but those who want backpressure will not easily see it that way.
2d) If the system is being pushed beyond its capacity, then it may have to shed load. This should only happen if the user has decided that they want to be responsible for such and has pushed the system beyond the reasonable limit without paying attention to the indications in 2a, 2b, and 2c. In the current system, this decision is already made for them. They have no choice.
In a more optimistic world, users would get near-optimal performance for a well-tuned workload with back-pressure active throughout the system, or something very much like it. We could call it a different kind of scheduler, different queue management methods, or whatever. As long as users could prioritize stability at some bounded load over possible instability at an over-saturating load, I think they would in most cases. Like I said, they really don't have this choice right now.
I know this is not trivial. We can't remove the need to make sane judgments about sizing and configuration. We might, however, be able to make the system ramp more predictably up to saturation, and behave more reasonably at that level.
Order of precedence, how to designate a mode of operation, and other such concerns aren't really addressed here. I just provided the examples above as types of behaviors which are nuanced yet perfectly valid for different types of system designs. The real point is that there is no single overall QoS/capacity/back-pressure behavior which is going to be acceptable to all users. Still, we need to ensure stability under saturating load where possible.
I would like to think that with CASSANDRA-8099 we can start discussing some of the client-facing back-pressure ideas more earnestly. I do believe that these ideas all sit on a spectrum of behavior; they are not mutually exclusive from a design/implementation perspective. It's possible that they could even be specified per operation, with some traffic yielding to others due to client policies. For example, a lower priority client could yield when it knows the
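As a rough illustration of modes 2a and 2b above, admission could be gated on a bounded semaphore that either blocks the submitter or fails fast. This is a hypothetical sketch; `AdmissionController` and this `OverloadedException` class are invented names, not Cassandra APIs:

```python
import threading

class OverloadedException(Exception):
    pass

class AdmissionController:
    """Gate new work on a bounded number of in-flight requests.
    mode="block" models 2a (back-pressure by blocking the submitter);
    mode="fail-fast" models 2b (reject with OverloadedException)."""

    def __init__(self, max_in_flight, mode="block"):
        self.slots = threading.BoundedSemaphore(max_in_flight)
        self.mode = mode

    def submit(self, work):
        if self.mode == "block":
            self.slots.acquire()                       # 2a: wait for capacity
        elif not self.slots.acquire(blocking=False):
            raise OverloadedException("at capacity")   # 2b: push back now
        try:
            return work()
        finally:
            self.slots.release()


ctl = AdmissionController(max_in_flight=2, mode="block")
result = ctl.submit(lambda: "done")

failfast = AdmissionController(max_in_flight=1, mode="fail-fast")
failfast.slots.acquire()            # occupy the only slot
try:
    failfast.submit(lambda: None)
    rejected = False
except OverloadedException:
    rejected = True
```

The two modes share one accounting mechanism; only the behavior at the bound differs, which is the point of the spectrum described above.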
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536307#comment-14536307 ] Benedict edited comment on CASSANDRA-9318 at 5/9/15 5:23 PM:
bq. Where? Are you talking about the hint limit?
I was, and I realise that was a mistake; I didn't fully understand the existing logic (and your proposal took me by surprise). Now that I do, I think I understand what you are proposing. There are a few problems that I see with it, though:
# the cluster as a whole, especially in large clusters, can still send a _lot_ of requests to a single node
# it has the opposite impact of (and likely prevents) CASSANDRA-3852, with older operations completely blocking newer ones
# it might mean a lot more OE than users are used to during temporary blips, pushing problems down to clients when the cluster is actually quite capable of coping (through hinting)
#* It seems like this would in fact seriously compromise our A property, with any failure for any node in a token range rapidly making the entire token range unavailable for writes\*
# tuning it is hard; network latencies, query processing times, and cluster size (which changes over time) will each impact it
I'm wary of a feature like this when we could simply improve our current work shedding to make it more robust (MessagingService, the MUTATION stage and the ExpiringMap all, effectively, shed; just not with sufficient predictability), but I think I've made all my concerns sufficiently clear, so I'll leave it with you.
\* At the very least we would have to first fall back to hints, rather than throwing OE, and wait for hints to saturate before throwing (AFAICT). In which case we're _in effect_ introducing LIFO-leaky pruning of the ExpiringMap, MessagingService, and the receiving node's MUTATION stage, but under a new mechanism (as opposed to inline FIFO pruning (tbd)). I don't really have anything against this, since it is functionally equivalent, although I think FIFO pruning is preferable: having fewer pruning mechanisms is probably better; these mechanisms would apply more universally; and they would insulate the node from the many-to-one effect (by making the MUTATION stage itself robust to overload).
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536846#comment-14536846 ] Jonathan Shook edited comment on CASSANDRA-9318 at 5/9/15 7:46 PM:
I would venture that a solid load shedding system may improve the degenerate overloading case, but it is not the preferred method for dealing with overloading for most users. The concept of back-pressure is more squarely what people expect, for better or worse. Here is what I think reasonable users want to see, with some variations:
1) The system performs with stability, up to the workload that it is able to handle with stability.
2a) Once it reaches that limit, it starts pushing back on how quickly it accepts new work. This means that it simply blocks the submission of new requests within some useful bound determined by the system. It does not yet have to shed load. It does not yet have to throw exceptions. This is a very reasonable expectation for most users; it is what they expect. Load shedding is a term of art which does not change the users' expectations.
2b) Once it reaches that limit, it starts throwing OE to the client. It does not have to shed load yet. This is a very reasonable expectation for users who are savvy enough to do active load management at the client level. It may have to start writing hints, but if you are writing hints because of load, that might not be the best justification for having the hints system kick in. To me this is inherently a convenient remedy for the wrong problem, even if it works well. Yes, hints are there as a general mechanism, but they do not relieve us of the problem of needing to know when the system is at capacity and how to handle it proactively. You could also say that hints sometimes actively hurt capacity when you need it most. They are expensive to process given the current implementation, and will always be load shifting even at theoretical best. Still, we need them for node availability concerns, although we should be careful not to use them as a crutch for general capacity issues.
2c) Once it reaches that limit, it starts backlogging (without a helpful signature of such in the responses, maybe BackloggingException with some queue estimate). This is a very reasonable expectation for users who are savvy enough to manage their peak and valley workloads in a sensible way. Sometimes you actually want to tax the ingest and flush side of the system for a bit before allowing it to switch modes and catch up with compaction. The fact that C* can do this is an interesting capability, but those who want backpressure will not easily see it that way.
2d) If the system is being pushed beyond its capacity, then it may have to shed load. This should only happen if the user has decided that they want to be responsible for such and has pushed the system beyond the reasonable limit without paying attention to the indications in 2a, 2b, and 2c.
Order of precedence, designated mode of operation, and other such concerns aren't really addressed here. I just provided the examples above as types of behaviors which are nuanced yet perfectly valid for different types of system designs. The real point is that there is no single overall QoS/capacity/back-pressure behavior which is going to be acceptable to all users. Still, we need to ensure stability under saturating load where possible.
I would like to think that with CASSANDRA-8099 we can start discussing some of the client-facing back-pressure ideas more earnestly. We can come up with methods to improve the reliable and responsive capacity of the system even with some internal load management. If the first cut ends up being sub-optimal, then we can measure it against non-bounded workload tests and strive to close the gap.
If it is implemented in a way that can support multiple usage scenarios, as described above, then such a limitation might be unlimited, bounded at level ___, or bounded by inline resource management, but in any case it would be controllable by the user/admin or client. If we could ultimately give the categories of users above the ability to enable the various modes, then the 2a) scenario would be perfectly desirable for many users already, even if the back-pressure logic only gave you 70% of the effective system capacity. Once testing shows that performance with active back-pressure to the client is close enough to the unbounded workloads, it could be enabled by default.
Summary: We still need reasonable back-pressure support throughout the system and eventually to the client. Features like this that can be a stepping stone towards such are still needed. The most perfect load shedding and hinting systems will still not be a sufficient replacement for back-pressure and capacity management.
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534260#comment-14534260 ] Benedict edited comment on CASSANDRA-9318 at 5/8/15 11:24 AM:
bq. In my mind in-flight means that until the response is sent back to the client the request counts against the in-flight limit.
bq. We already keep the original request around until we either get acks from all replicas, or they time out (and we write a hint).
Perhaps we need to clarify more clearly what this ticket is proposing. I understood it to mean what Ariel commented here, whereas it sounds like Jonathan is suggesting we simply prune our ExpiringMap based on bytes tracked as well as time?
The ExpiringMap requests are already in-flight and cannot be cancelled, so their effect on other nodes cannot be rescinded, and imposing a limit does not stop us issuing more requests to the nodes in the cluster that are failing to keep up and respond to us. It _might_ be sensible, if introducing more aggressive shedding at processing nodes, to also shed the response handlers more aggressively locally, but I'm not convinced it would have a significant impact on cluster health by itself; cluster instability spreads out from problematic nodes, and this scheme still permits us to inundate those nodes with requests they cannot keep up with.
Alternatively, the approach of forbidding new requests if you have items in the ExpiringMap causes the collapse of other nodes to spread throughout the cluster, as rapidly (especially on small clusters) all requests in the system are destined for the collapsing node, and every coordinator stops accepting requests. The system seizes up, and that node is still failing since it has requests from the entire cluster queued up with it. In general the coordinator has no way of distinguishing between a failed node, a network partition, or a node that is just struggling, so we don't know if we should wait. Some mix of the two might be possible: wait while a node is just slow, then drop our response handlers for the node once it's marked as down. This latter may not be a bad thing to do anyway, but I would not want to depend on this behaviour to maintain our precious A.
It still seems the simplest and most robust solution is to make our work queues leaky, since this insulates the processing nodes from cluster-wide inundation, which the coordinator approach cannot (even with the loss of A and cessation of processing, there is a whole cluster vs a potentially single node; with hash partitioning it doesn't take long for all processing to begin involving a single failing node). We can do this just on the number of requests and still be much better than we are currently. We could also pair this with coordinator-level dropping of handlers for down nodes, and above a size/count threshold. This latter, though, could result in widespread uncoordinated dropping of requests, which may leave us open to a multiplying effect of cluster overload, with each node dropping different requests, possibly leading to only a tiny fraction of requests being serviced to their required CL across the cluster. I'm not sure how we can best model this risk, or avoid it without notifying coordinators of the drop of a message, and I don't see that being delivered for 2.1.
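The "prune the ExpiringMap based on bytes tracked as well as time" idea being debated might look something like the following sketch. This is hypothetical code in the spirit of a coordinator's response-callback map, not Cassandra's real ExpiringMap; all names here are invented:

```python
class ExpiringCallbackMap:
    """Rough model of a coordinator response-callback map, extended to
    prune on total tracked bytes as well as age. Hypothetical sketch."""

    def __init__(self, expiry, max_bytes):
        self.expiry = expiry
        self.max_bytes = max_bytes
        self.total_bytes = 0
        self.entries = {}            # msg_id -> (created, size, on_drop)

    def put(self, msg_id, size, on_drop, now):
        self.entries[msg_id] = (now, size, on_drop)
        self.total_bytes += size
        self._prune(now)

    def _drop(self, msg_id):
        _, size, on_drop = self.entries.pop(msg_id)
        self.total_bytes -= size
        on_drop(msg_id)              # e.g. record a hint or signal overload

    def _prune(self, now):
        # time-based pruning, as today
        for msg_id in [m for m, (t, _, _) in self.entries.items()
                       if now - t > self.expiry]:
            self._drop(msg_id)
        # byte-based pruning: shed oldest entries until under the bound
        while self.total_bytes > self.max_bytes:
            oldest = min(self.entries, key=lambda m: self.entries[m][0])
            self._drop(oldest)


shed = []
cbmap = ExpiringCallbackMap(expiry=10, max_bytes=100)
cbmap.put("a", 60, shed.append, now=0)
cbmap.put("b", 60, shed.append, now=1)     # 120 bytes: sheds oldest ("a")
cbmap.put("c", 10, shed.append, now=20)    # "b" has now exceeded the expiry
```

Note that, per the objection above, dropping an entry here only releases coordinator memory; the request itself is already in flight and its effect on the replica cannot be rescinded.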
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535369#comment-14535369 ] Benedict edited comment on CASSANDRA-9318 at 5/8/15 7:52 PM:
bq. because our existing load shedding is fine at recovering from temporary spikes in load
Are you certain? The recent testing Ariel did on CASSANDRA-8670 demonstrated that the MUTATION stage, not the ExpiringMap, was what was bringing the cluster down; and this was in a small cluster. If anything, I suspect our ability to prune these messages is also theoretically worse, on top of this practical datapoint, because it is done on dequeue, whereas for the ExpiringMap and MessagingService (whilst having a slightly longer expiry) it is done asynchronously (or on enqueue) and cannot be blocked by e.g. flush.
What I'm effectively suggesting is simply making all of the load shedding happen on enqueue, and be based on queue length as well as time, so that our load shedding really is more robust.
The coordinator is also on the right side of the equation: as the cluster grows, any single node's problems should spread out to the coordinators more slowly, whereas the coordinator's ability to flood a processing node scales up at the same (well, inverted) rate.
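Shedding on enqueue, based on queue length as well as time, could look roughly like this. A hypothetical sketch only, not Cassandra's stage implementation; the class name and drop policy are invented for illustration:

```python
import collections

class ShedOnEnqueueQueue:
    """Stage queue that sheds at enqueue time, both by message age and
    by queue length, dropping oldest-first (hypothetical sketch)."""

    def __init__(self, max_len, timeout):
        self.max_len = max_len
        self.timeout = timeout
        self.items = collections.deque()   # (enqueue_time, payload)

    def enqueue(self, payload, now):
        dropped = []
        # time-based shedding: expire the oldest items first
        while self.items and now - self.items[0][0] > self.timeout:
            dropped.append(self.items.popleft()[1])
        # length-based shedding: make room by dropping the oldest item
        if len(self.items) >= self.max_len:
            dropped.append(self.items.popleft()[1])
        self.items.append((now, payload))
        return dropped                     # caller could hint/NACK these


q = ShedOnEnqueueQueue(max_len=2, timeout=10)
q.enqueue("a", now=0)
q.enqueue("b", now=1)
over_length = q.enqueue("c", now=2)    # queue full: oldest item "a" is shed
expired = q.enqueue("d", now=20)       # "b" and "c" have aged out
```

Because shedding happens at enqueue rather than dequeue, a stalled consumer (e.g. one blocked behind flush) cannot prevent old work from being dropped.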
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535356#comment-14535356 ]

Jonathan Ellis edited comment on CASSANDRA-9318 at 5/8/15 7:40 PM:
--------------------------------------------------------------------

bq. it sounds like Jonathan is suggesting we simply prune our ExpiringMap based on bytes tracked as well as time?

No, I'm suggesting we abort requests more aggressively with OverloadedException *before sending them to replicas*. One place this might make sense is sendToHintedEndpoints, where we already throw OE. Right now we only throw OE once we start writing hints for a node that is in trouble. This doesn't seem to be aggressive enough. (Although, most of our users are on 2.0 where we allowed 8x as many hints in flight before starting to throttle.) So, I am suggesting we also track requests outstanding (perhaps with the ExpiringMap as you suggest) as well and stop accepting requests once we hit a reasonable limit of you can't possibly process more requests than this in parallel.

bq. The ExpiringMap requests are already in-flight and cannot be cancelled, so their effect on other nodes cannot be rescinded, and imposing a limit does not stop us issuing more requests to the nodes in the cluster that are failing to keep up and respond to us.

Right, and I'm fine with that. The goal is not to keep the replica completely out of trouble. The goal is to keep the coordinator from falling over from buffering EM and MessagingService entries that it can't drain fast enough. Secondarily, this will help the replica too because our existing load shedding is fine at recovering from temporary spikes in load. But our load shedding isn't good enough to save it when the coordinators keep throwing more at it when it's already overwhelmed.

was (Author: jbellis):
bq. it sounds like Jonathan is suggesting we simply prune our ExpiringMap based on bytes tracked as well as time?

No, I'm suggesting we abort requests more aggressively with OverloadedException *before sending them to replicas*. One place this might make sense is sendToHintedEndpoints, where we already throw OE. Right now we only throw OE once we start writing hints for a node that is in trouble. This doesn't seem to be aggressive enough. (Although, most of our users are on 2.0 where we allowed 8x as many hints in flight before starting to throttle.) So, I am suggesting we also track requests outstanding (perhaps with the ExpiringMap as you suggest) as well and stop accepting requests once we hit a reasonable limit of you can't possibly process more requests than this in parallel.

The ExpiringMap requests are already in-flight and cannot be cancelled, so their effect on other nodes cannot be rescinded, and imposing a limit does not stop us issuing more requests to the nodes in the cluster that are failing to keep up and respond to us.

Right, and I'm fine with that. The goal is not to keep the replica completely out of trouble. The goal is to keep the coordinator from falling over from buffering EM and MessagingService entries that it can't drain fast enough. Secondarily, this will help the replica too because our existing load shedding is fine at recovering from temporary spikes in load. But our load shedding isn't good enough to save it when the coordinators keep throwing more at it when it's already overwhelmed.

> Bound the number of in-flight requests at the coordinator
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x
>
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other issues.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
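Jonathan's suggestion above — track outstanding requests at the coordinator and refuse new ones with OverloadedException before they are sent to any replica — could be sketched like this. This is a hypothetical sketch (class name and limit invented for illustration); a standard exception stands in for Cassandra's OverloadedException.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a coordinator-side admission limit. Rejecting *before* sending
// to replicas means no hint is owed and the client knows the write landed
// on no replica, which is why OverloadedException is safe at this point.
public class AdmissionLimiter
{
    private final int maxInFlight;
    private final AtomicInteger inFlight = new AtomicInteger();

    public AdmissionLimiter(int maxInFlight)
    {
        this.maxInFlight = maxInFlight;
    }

    // Call before dispatching a mutation; throws if the coordinator is saturated.
    public void admit()
    {
        if (inFlight.incrementAndGet() > maxInFlight)
        {
            inFlight.decrementAndGet();
            // Stand-in for Cassandra's OverloadedException.
            throw new IllegalStateException("Overloaded: too many in-flight requests");
        }
    }

    // Call when the request completes: response received, timed out, or failed.
    public void release()
    {
        inFlight.decrementAndGet();
    }

    public int current()
    {
        return inFlight.get();
    }
}
```

In practice the counter would have to be decremented on every completion path (including ExpiringMap expiry), otherwise a slow replica would permanently eat the budget.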
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535449#comment-14535449 ]

Jonathan Ellis edited comment on CASSANDRA-9318 at 5/8/15 8:36 PM:
--------------------------------------------------------------------

bq. It currently bounds the number of in flight requests low enough

Where? Are you talking about the hint limit?

bq. I thought if we were capable of writing a hint we had to do it.

The coordinator must write a hint *if a replica times out after it sends the mutation out.* (Because otherwise we leave the client wondering what state the cluster is left in; it might be on all replicas, or on none.) No hint is written for UnavailableException or OverloadedException, because we can guarantee the state -- it is on no replicas.

was (Author: jbellis):
bq. It currently bounds the number of in flight requests low enough

Where? Are you talking about the hint limit?

bq. I thought if we were capable of writing a hint we had to do it.

We have to write a hint *if we send the mutation off to the replicas.* (Because otherwise we leave the client wondering what state the cluster is left in; it might be on all replicas, or on none.) No hint is written for UnavailableException or OverloadedException, because we can guarantee the state -- it is on no replicas.

> Bound the number of in-flight requests at the coordinator
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x
>
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other issues.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535449#comment-14535449 ]

Jonathan Ellis edited comment on CASSANDRA-9318 at 5/8/15 8:36 PM:
--------------------------------------------------------------------

bq. It currently bounds the number of in flight requests low enough

Where? Are you talking about the hint limit?

bq. I thought if we were capable of writing a hint we had to do it.

The coordinator must write a hint *if a replica times out after the coordinator sends the mutation out.* (Because otherwise we leave the client wondering what state the cluster is left in; it might be on all replicas, or on none.) No hint is written for UnavailableException or OverloadedException, because we can guarantee the state -- it is on no replicas.

was (Author: jbellis):
bq. It currently bounds the number of in flight requests low enough

Where? Are you talking about the hint limit?

bq. I thought if we were capable of writing a hint we had to do it.

The coordinator must write a hint *if a replica times out after it sends the mutation out.* (Because otherwise we leave the client wondering what state the cluster is left in; it might be on all replicas, or on none.) No hint is written for UnavailableException or OverloadedException, because we can guarantee the state -- it is on no replicas.

> Bound the number of in-flight requests at the coordinator
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x
>
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other issues.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533657#comment-14533657 ]

Benedict edited comment on CASSANDRA-9318 at 5/8/15 12:23 AM:
---------------------------------------------------------------

CL=1 (or any CL < ALL)

was (Author: benedict):
CL=1

> Bound the number of in-flight requests at the coordinator
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x
>
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests and if it reaches a high watermark disable read on client connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other issues.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
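The high/low watermark scheme from the ticket description — track outstanding request bytes, stop reading from client connections above a high watermark, and resume only once the total drains below a low watermark — could be sketched as follows. The class name and thresholds are hypothetical; in a real implementation the flag would drive something like a Netty channel's autoRead setting.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of watermark-based backpressure with hysteresis: the gap between
// the high and low watermarks prevents rapid flapping between the
// reading-enabled and reading-disabled states.
public class WatermarkGate
{
    private final long highWatermark;
    private final long lowWatermark;
    private final AtomicLong outstandingBytes = new AtomicLong();
    private volatile boolean readEnabled = true;

    public WatermarkGate(long highWatermark, long lowWatermark)
    {
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    // Called when a request is accepted from a client connection.
    public void onRequest(long bytes)
    {
        if (outstandingBytes.addAndGet(bytes) >= highWatermark)
            readEnabled = false;   // stop reading from client connections
    }

    // Called when a request completes (response, timeout, or failure).
    public void onComplete(long bytes)
    {
        if (outstandingBytes.addAndGet(-bytes) <= lowWatermark)
            readEnabled = true;    // drained enough to resume reading
    }

    public boolean readEnabled()
    {
        return readEnabled;
    }
}
```

As the description notes, the open question is whether disabling read on the client connection introduces other issues (e.g. TCP backpressure propagating to well-behaved clients sharing the connection).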