[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2016-09-19 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15505212#comment-15505212
 ] 

Stefania edited comment on CASSANDRA-9318 at 9/20/16 1:13 AM:
--

Committed to trunk as d43b9ce5092f8879a1a66afebab74d86e9e127fb, thank you for 
this excellent patch [~sbtourist]!


was (Author: stefania):
Committed to trunk as d43b9ce5092f8879a1a66afebab74d86e9e127fb, thank you for 
your excellent patch [~sbtourist]!

> Bound the number of in-flight requests at the coordinator
> -
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths, Streaming and Messaging
>Reporter: Ariel Weisberg
>Assignee: Sergio Bossa
> Fix For: 3.10
>
> Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.





[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2016-09-18 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502215#comment-15502215
 ] 

Stefania edited comment on CASSANDRA-9318 at 9/19/16 4:09 AM:
--

Thanks [~sbtourist], the latest commits and the entire patch LGTM.

The test failures are unrelated and we have tickets for all of them: 
CASSANDRA-12664, CASSANDRA-12656 and CASSANDRA-12140. I've launched one more 
dtest build to cover the final commit and to hopefully shake off the 
CASSANDRA-12656 failures since these tests shouldn't even be running now.

I've squashed your entire patch into [one 
commit|https://github.com/stef1927/cassandra/commit/1632f2e9892624f611ac3629fb84a82594fec726]
 and fixed some formatting issues (mostly trailing spaces) 
[here|https://github.com/stef1927/cassandra/commit/e3346e5f5a49b2933e10a84405730]
 on this [branch|https://github.com/stef1927/cassandra/commits/9318].

If you could double check the formatting nits, I can squash them and commit 
once the final dtest build has also completed. 


was (Author: stefania):
Thanks [~sbtourist], the latest commits and the entire patch LGTM.

The test failures are all unrelated and we have tickets for all of them: 
CASSANDRA-12664, CASSANDRA-12656 and CASSANDRA-12140. I've launched one more 
dtest build to cover the final commit and to hopefully shake off the 
CASSANDRA-12656 failures since these tests shouldn't even be running now.

I've squashed your entire patch into [one 
commit|https://github.com/stef1927/cassandra/commit/1632f2e9892624f611ac3629fb84a82594fec726]
 and fixed some formatting issues (mostly trailing spaces) 
[here|https://github.com/stef1927/cassandra/commit/e3346e5f5a49b2933e10a84405730]
 on this [branch|https://github.com/stef1927/cassandra/commits/9318].

If you could double check the formatting nits, I can squash them and commit 
once the final dtest build has also completed. 

> Bound the number of in-flight requests at the coordinator
> -
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths, Streaming and Messaging
>Reporter: Ariel Weisberg
>Assignee: Sergio Bossa
> Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.





[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2016-07-20 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386978#comment-15386978
 ] 

Stefania edited comment on CASSANDRA-9318 at 7/21/16 1:51 AM:
--

bq. I think if we add too many responsibilities to the strategy it really 
starts to look less and less as an actual strategy; we could rather add a 
"BackPressureManager", but to be honest, the current responsibility assignment 
feels natural to me, and back-pressure is naturally a "messaging layer" 
concern, so I'd keep everything as it is unless you can show any actual 
advantages in changing the design.

I don't think we want a "BackPressureManager". We just need to decide whether we are 
happy having a back-pressure state for each OutboundTcpConnectionPool -- even though 
back pressure may be disabled, or some strategies may use different metrics (e.g. 
total in-flight requests or memory) -- or whether the states should be hidden inside 
the strategy.
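
For illustration only, here is a minimal sketch of the two options being weighed (hypothetical names, not the committed API): the connection pool holds an opaque per-replica state handle, while only the strategy knows what is inside it, so a disabled strategy or one using different metrics simply supplies a different state implementation.

{code:java}
// Sketch of the design question above: the pool keeps an opaque state handle,
// but only the strategy interprets it. Names are illustrative, not Cassandra's API.
import java.net.InetAddress;
import java.util.Collection;

interface BackPressureState
{
    void onMessageSent();      // outgoing-rate bookkeeping
    void onResponseReceived(); // incoming-rate bookkeeping
}

interface BackPressureStrategy<S extends BackPressureState>
{
    S newState(InetAddress replica);  // the pool asks the strategy for its own state
    void apply(Collection<S> states); // block or rate-limit based on all involved replicas
}

final class OutboundPoolSketch
{
    private final BackPressureState state;

    OutboundPoolSketch(InetAddress replica, BackPressureStrategy<? extends BackPressureState> strategy)
    {
        // The pool never looks inside the state; "no backpressure" is then just a
        // strategy whose state objects do nothing.
        this.state = strategy.newState(replica);
    }

    BackPressureState backPressureState()
    {
        return state;
    }
}
{code}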

bq. This was intentional, so that outgoing requests could have been processed 
while pausing; but it had the drawback of making the rate limiting less 
"predictable" (as dependent on which thread was going to be called next), so I 
ended up changing it.

Now the back pressure is applied before hinting, or inserting local mutations, 
or sending to remote data centers, but _after_ sending to local replicas. I 
think we may need to rework SP.sendToHintedEndpoints a little bit: fire the 
local insert, then block, then send all messages.
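
A rough sketch of that reordering, with hypothetical helpers standing in for the real StorageProxy logic (this is not the actual code path):

{code:java}
// Sketch only: fire the local insert first, then block on back-pressure,
// then send the remote messages. Helper names are stand-ins, not Cassandra's API.
import java.net.InetAddress;
import java.util.List;

final class WritePathSketch
{
    interface Mutation {}

    static void insertLocal(Mutation m) { /* apply on the coordinator, asynchronously */ }
    static void applyBackPressure(List<InetAddress> replicas) { /* may block/sleep */ }
    static void sendToReplica(Mutation m, InetAddress replica) { /* enqueue on the outbound connection */ }

    static void sendToHintedEndpoints(Mutation m, boolean coordinatorIsReplica, List<InetAddress> otherReplicas)
    {
        if (coordinatorIsReplica)
            insertLocal(m);                // 1. fire the local insert first
        applyBackPressure(otherReplicas);  // 2. then block, so the local write is never delayed
        for (InetAddress replica : otherReplicas)
            sendToReplica(m, replica);     // 3. then send all remote messages
    }
}
{code}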

There is one more potential issue in the case of *non local replicas*. We send 
the mutation only to one remote replica, which then forwards it to other 
replicas in that DC. So remote replicas may have their outgoing rate set to zero, 
and therefore their limiter rate set to positive infinity, which means we won't 
throttle at all with the FAST flow selected. In fact, we have no way of reliably 
knowing the status of remote replicas, or of replicas to which we haven't sent 
any mutations recently, and we should perhaps exclude them.
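
To make the concern concrete, here is a small sketch of the ratio computation being discussed, using Guava's RateLimiter; the field names and the zero-rate guard are assumptions for illustration, not the actual RateBasedBackPressure code:

{code:java}
// Sketch: a replica we never send to directly has outgoingRate == 0, so its
// ratio is undefined/infinite and it would never be throttled; such replicas
// should probably be excluded from the calculation instead.
import com.google.common.util.concurrent.RateLimiter;

final class ReplicaRateSketch
{
    double incomingRate; // responses received from the replica in the current window
    double outgoingRate; // requests sent to the replica in the current window
    final RateLimiter limiter = RateLimiter.create(Double.MAX_VALUE);

    void updateLimiter(double lowRatioThreshold)
    {
        if (outgoingRate == 0)
            return; // nothing sent recently: we know nothing about this replica, exclude it

        double ratio = incomingRate / outgoingRate;
        if (ratio < lowRatioThreshold)
            limiter.setRate(Math.max(1, incomingRate)); // throttle towards what was acknowledged
        else
            limiter.setRate(Double.MAX_VALUE);          // healthy: effectively unthrottled
    }
}
{code}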

bq. It's not a matter of making the sort stable, but a matter of making the 
comparator fully consistent with equals, which should be already documented by 
the treeset/comparator javadoc.

It's in the comparator javadoc but not in the TreeSet constructor doc, which is 
why I missed it yesterday. Thanks.

bq. Agreed. If we all (/cc Aleksey Yeschenko Sylvain Lebresne Jonathan Ellis) 
agree on the core concepts of the current implementation, I will rebase to 
trunk, remove the trailing spaces, and then I would move into testing it, and 
in the meantime think/work on adding metrics and making any further adjustments.

I think we can start testing once we have sorted the second discussion point 
above, as for the API issues, we'll eventually need to reach consensus but we 
don't need to block the tests for it.

One more tiny nit:

* The documentation of {{RateBasedBackPressureState}} still mentions the 
overloaded flag.


was (Author: stefania):
bq. I think if we add too many responsibilities to the strategy it really 
starts to look less and less as an actual strategy; we could rather add a 
"BackPressureManager", but to be honest, the current responsibility assignment 
feels natural to me, and back-pressure is naturally a "messaging layer" 
concern, so I'd keep everything as it is unless you can show any actual 
advantages in changing the design.

I don't think we want a "BackPressureManager". We just need to decide if we are 
happy having a back pressure state for each OutboundTcpConnectionPool, even 
though back pressure may be disabled or some strategies may use different 
metrics, e.g. total in-flight requests or memory, or whether the states should 
be hidden inside the strategy.

bq. This was intentional, so that outgoing requests could have been processed 
while pausing; but it had the drawback of making the rate limiting less 
"predictable" (as dependant on which thread was going to be called next), so I 
ended up changing it.

Now the back pressure is applied before hinting, or inserting local mutations, 
or sending to remote data centers, but _after_ sending to local replicas. I 
think we may need to rework SP.sendToHintedEndpoints a little bit, I think we 
want to fire the insert local, then block, then send all messages. 

There is one more potential issue in the case of *non local replicas*. We send 
the mutation only to one remote replica, which then forwards it to other 
replicas in that DC. So remote replicas may have the outgoing rate set to zero, 
and therefore the limiter rate set to positive infinity, which means we won't 
throttle at all with FAST flow selected. In fact, we have no way of reliably 
knowing the status of remote replicas, or replicas to which we haven't sent any 
mutations recently,  and we should perhaps exclude them.

bq. It's not a matter of making the sort stable, but a matter 

[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2016-07-13 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375752#comment-15375752
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 7/13/16 9:54 PM:


bq. if we make the strategy a bit more generic as mentioned above so the 
decision is made from all replicas involved (maybe the strategy should also keep 
track of the replica-state completely internally so we can implement a basic 
strategy like having a simple high watermark very easily), and we make sure not 
to throttle too quickly (typically, if a single replica is slow and we don't 
really need it, start by just hinting it), then I'd be happy moving to the 
"actually test this" phase and see how it goes.

I suppose that's reasonable in principle, with some caveats:

# Throwing exceptions shouldn't be part of the API.  OverloadedException dates 
from the Thrift days, when our flow control options were very limited and this 
was the best we could do to tell clients, "back off."  Now that we have our own 
protocol and full control over Netty we should simply not read more requests 
until we shed some load (see the sketch after this list).  (Since shedding load 
is a gradual process -- requests time out, we write hints, our load goes down -- 
clients will just perceive this as slowing down, which is what we want.)
# The API should provide for reporting load to clients so they can do real load 
balancing across coordinators and not just round-robin.
# Throttling requests to the speed of the slowest replica is not something we 
should ship, even as an option.
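
A minimal Netty sketch of the "stop reading until we shed some load" idea from point 1; the shared in-flight counter and watermark values are assumptions, and this is not the actual native transport code:

{code:java}
// Sketch: pause reading from a client channel once too many request bytes are
// in flight, and resume below a low watermark. Values and wiring are illustrative.
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import java.util.concurrent.atomic.AtomicLong;

final class InflightLimitingHandler extends ChannelInboundHandlerAdapter
{
    private static final AtomicLong inflightBytes = new AtomicLong();
    private static final long HIGH_WATERMARK = 128L << 20; // assumed values
    private static final long LOW_WATERMARK  = 112L << 20;

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg)
    {
        if (msg instanceof ByteBuf)
            inflightBytes.addAndGet(((ByteBuf) msg).readableBytes());
        if (inflightBytes.get() > HIGH_WATERMARK)
            ctx.channel().config().setAutoRead(false); // stop reading; clients just see us slow down
        ctx.fireChannelRead(msg);
    }

    // Called when a request finishes: it completed, timed out, or was hinted.
    static void onRequestDone(ChannelHandlerContext ctx, int bytes)
    {
        if (inflightBytes.addAndGet(-bytes) < LOW_WATERMARK && !ctx.channel().config().isAutoRead())
            ctx.channel().config().setAutoRead(true);
    }
}
{code}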



was (Author: jbellis):
bq. if we make the strategy a bit more generic as mentioned above so the 
decision is made from all replica involved (maybe the strategy should also keep 
track of the replica-state completely internally so we can implement basic 
strategy like having a simple high watermark very easy), and we make sure to 
not throttle too quickly (typically, if a single replica is slow and we don't 
really need it, start by just hinting him), then I'd be happy moving to the 
"actually test this" phase and see how it goes.

I suppose that's reasonable in principle, with some caveats:

# Throwing exceptions shouldn't be part of the API.  OverloadedException dates 
from the Thrift days, where our flow control options were very limited and this 
was the best we could do to tell clients, "back off."  Now that we have our own 
protocol and full control over Netty we should simply not read more requests 
until we shed some load.  (Since shedding load is a gradual process--requests 
time out, we write hints, our load goes down--clients will just perceive this 
as slowing down, which is what we want.)
# The API should provide for reporting load to clients so they can do real load 
balancing across coordinators and not just round-robin.
# Throttling requests to the speed of the slowest replica is not something we 
should ship, even as an option.


> Bound the number of in-flight requests at the coordinator
> -
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths, Streaming and Messaging
>Reporter: Ariel Weisberg
>Assignee: Sergio Bossa
> Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.





[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2016-07-12 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373624#comment-15373624
 ] 

Sergio Bossa edited comment on CASSANDRA-9318 at 7/12/16 8:23 PM:
--

[~Stefania],

bq. if a message expires before it is sent, we consider this negatively for 
that replica, since we increment the outgoing rate but not the incoming rate 
when the callback expires, and still it may have nothing to do with the replica 
if the message was not sent, it may be due to the coordinator dealing with too 
many messages.

Right, but there isn't much we can do without way more invasive changes. 
Anyway, I don't think that's actually a problem, as if the coordinator is 
overloaded we'll end up generating too many hints and fail with 
{{OverloadedException}} (this time with its original meaning), so we should be 
covered.

bq. I also observe that if a replica has a low rate, then we may block when 
acquiring the limiter, and this will indirectly throttle for all following 
replicas, even if they were ready to receive mutations sooner.

See my answer at the end.

bq. AbstractWriteResponseHandler sets the start time in the constructor, so the 
time spent acquiring a rate limiter for slow replicas counts towards the total 
time before the coordinator throws a write timeout exception.

See my answer at the end.

bq. SP.sendToHintedEndpoints(), we should apply backpressure only if the 
destination is alive.

I know, I'm holding on these changes until we settle on a plan for the whole 
write path (in terms of what to do with CL, the exception to be thrown etc.).
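
For illustration, the deferred guard might look roughly like the following sketch; both interfaces are stand-ins (the real code would consult the gossip failure detector), so treat it as an assumption, not the planned change:

{code:java}
// Sketch only: skip back-pressure for endpoints already considered dead, since
// their writes will be hinted anyway and throttling on them only hurts healthy nodes.
import java.net.InetAddress;

final class BackPressureGuardSketch
{
    interface AliveChecker { boolean isAlive(InetAddress endpoint); } // stand-in for the failure detector
    interface BackPressure { void apply(InetAddress endpoint); }      // hypothetical, may block

    static void maybeApply(InetAddress destination, AliveChecker fd, BackPressure bp)
    {
        if (fd.isAlive(destination))
            bp.apply(destination);
        // else: destination is down, the mutation will be hinted; no point throttling on it
    }
}
{code}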

bq. Let's use UnavailableException since WriteFailureException indicates a 
non-timeout failure when processing a mutation, and so it is not appropriate 
for this case. For protocol V4 we cannot change UnavailableException, but for 
V5 we should add a new parameter to it. At the moment it contains the 
consistency level and the required and alive replica counts; we should add the 
number of overloaded replicas, so that drivers can treat the two cases differently.

Does it mean we should advance the protocol version in this issue, or delegate 
to a new issue?

bq. Marking messages as throttled would let the replica know if backpressure 
was enabled, that's true, but it also makes the existing mechanism even more 
complex.

How so? In implementation terms, it should be literally as easy as:
1) Add a byte parameter to {{MessageOut}}.
2) Read that byte parameter from {{MessageIn}} and, if it is set, skip dropping 
the mutation replica-side.
3) If possible (didn't check it), when a "late" response is received on the 
coordinator, try to cancel the related hint.

Do you see any complexity I'm missing there?
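
A sketch of steps 1 and 2, with the message types reduced to plain parameter maps rather than the real {{MessageOut}}/{{MessageIn}} API, and with an invented parameter name:

{code:java}
// Sketch of "marking" throttled mutations so the replica does not drop them.
// The BACKPRESSURE parameter name and the map-based messages are illustrative only.
import java.util.HashMap;
import java.util.Map;

final class ThrottledMarkerSketch
{
    static final String BACKPRESSURE_PARAM = "BACKPRESSURE";

    // 1) Coordinator side: attach a one-byte parameter to the outgoing message.
    static Map<String, byte[]> markOutgoing(Map<String, byte[]> parameters)
    {
        Map<String, byte[]> withMark = new HashMap<>(parameters);
        withMark.put(BACKPRESSURE_PARAM, new byte[]{ 1 });
        return withMark;
    }

    // 2) Replica side: if the coordinator throttled this mutation, do not drop it
    //    even when it has sat in the queue longer than the write RPC timeout.
    static boolean shouldDrop(Map<String, byte[]> parameters, boolean pastRpcTimeout)
    {
        return pastRpcTimeout && !parameters.containsKey(BACKPRESSURE_PARAM);
    }
}
{code}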

bq. dropping mutations that have been in the queue for longer than the RPC 
write timeout is done not only to shed load on the replica, but also to avoid 
wasting resources to perform a mutation when the coordinator has already 
returned a timeout exception to the client.

This is very true, and that's why I said it's a bit of a wild idea. Obviously, 
that is true outside of back-pressure too: even now it is possible to return a 
write timeout to clients and still have some or all mutations applied. In the 
end, it might be good to optionally enable such behaviour, as the advantage 
would be increased consistency at the expense of more resource consumption, 
which is a tradeoff some users might want to make. To be clear, I'm not 
strictly lobbying to implement it, just trying to reason about pros and cons.

bq. I still have concerns regarding additional write timeout exceptions and 
whether an overloaded or slow replica can slow everything down.

These are valid concerns of course. Given similar concerns from [~jbellis], 
I'm working on some changes to avoid write timeouts caused by healthy replicas 
being unnaturally throttled by unhealthy ones and, depending on [~jbellis]'s 
answer to my last comment above, maybe only actually apply back-pressure if the 
CL is not met.

Stay tuned.


was (Author: sbtourist):
[~Stefania],

bq. if a message expires before it is sent, we consider this negatively for 
that replica, since we increment the outgoing rate but not the incoming rate 
when the callback expires, and still it may have nothing to do with the replica 
if the message was not sent, it may be due to the coordinator dealing with too 
many messages.

Right, but there isn't much we can do without way more invasive changes. 
Anyway, I don't think that's actually a problem, as if the coordinator is 
overloaded we'll end up generating too many hints and fail with 
{{OverloadedException}} (this time with its original meaning), so we should be 
covered.

bq. I also observe that if a replica has a low rate, then we may block when 
acquiring the limiter, and this will indirectly throttle for all following 
replicas, even if they were ready to receive mutations sooner.

See my answer at the end.

bq. AbstractWriteResponseHandler sets the 

[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2016-07-12 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373537#comment-15373537
 ] 

Jeremiah Jordan edited comment on CASSANDRA-9318 at 7/12/16 7:31 PM:
-

[~jbellis] then we can default this to off.

I have interacted with *many* people using Cassandra who would actually like 
to see some rate limiting applied for cases 1 and 2, so that things don't fall 
over (which hopefully shouldn't happen with the new hints) or hint like crazy.  
Turning this on for them would allow that to happen.


was (Author: jjordan):
[~jbellis] then we can default this to off.

I have interacted with *many* people using Cassandra that would actually like 
to see some rate limiting applied for cases 1 and 2 such that things don't fall 
over (shouldn't happen with new hints hopefully) or even hint like crazy.  
Turning this on would allow that to happen.

> Bound the number of in-flight requests at the coordinator
> -
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths, Streaming and Messaging
>Reporter: Ariel Weisberg
>Assignee: Sergio Bossa
> Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.





[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2016-07-12 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373512#comment-15373512
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 7/12/16 7:23 PM:


The more I think about it the more I think the entire approach may be a bad fit 
for Cassandra.  Consider:

# If a node has a "hiccup" of slow performance, e.g. due to a GC pause, we want 
to hint those writes and return success to the client.  No need to rate limit.
# If a node has a sustained period of slow performance, we want to hint those 
writes and return success to the client.  No need to rate limit, unless we are 
overwhelmed with hints.  (Not sure if hint overload is actually a problem with 
the new file based hints.)
# Where we DO want to rate limit is when the client is throwing more updates at 
the coordinator than the system can handle, whether that is for a single token 
range or globally across all nodes.

So I see this approach as doing the wrong thing for 1 and 2 and only partially 
helping with 3.

Put another way: we do NOT want to limit performance to the slowest node in a 
set of replicas.  That is kind of the opposite of the redundancy we want to 
provide.


was (Author: jbellis):
The more I think about it the more I think the entire approach may be a bad fit 
for Cassandra.  Consider:

# If a node has a "hiccup" of slow performance, e.g. due to a GC pause, we want 
to hint those writes and return success to the client.  No need to rate limit.
# If a node has a sustained period of slow performance, we want to hint those 
writes and return success to the client.  No need to rate limit, unless we are 
overwhelmed with hints.  (Not sure if hint overload is actually a problem with 
the new file based hints.)
# Where we DO want to rate limit is when the client is throwing more updates at 
the coordinator than the system can handle, whether that is for a single token 
range or globally across many nodes.

So I see this approach as doing the wrong thing for 1 and 2 and only partially 
helping with 3.

Put another way: we do NOT want to limit performance to the slowest node in a 
set of replicas.  That is kind of the opposite of the redundancy we want to 
provide.

> Bound the number of in-flight requests at the coordinator
> -
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths, Streaming and Messaging
>Reporter: Ariel Weisberg
>Assignee: Sergio Bossa
> Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.





[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2016-07-12 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373512#comment-15373512
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 7/12/16 7:22 PM:


The more I think about it the more I think the entire approach may be a bad fit 
for Cassandra.  Consider:

# If a node has a "hiccup" of slow performance, e.g. due to a GC pause, we want 
to hint those writes and return success to the client.  No need to rate limit.
# If a node has a sustained period of slow performance, we want to hint those 
writes and return success to the client.  No need to rate limit, unless we are 
overwhelmed with hints.  (Not sure if hint overload is actually a problem with 
the new file based hints.)
# Where we DO want to rate limit is when the client is throwing more updates at 
the coordinator than the system can handle, whether that is for a single token 
range or globally across many nodes.

So I see this approach as doing the wrong thing for 1 and 2 and only partially 
helping with 3.

Put another way: we do NOT want to limit performance to the slowest node in a 
set of replicas.  That is kind of the opposite of the redundancy we want to 
provide.


was (Author: jbellis):
The more I think about it the more I think the entire approach may be a bad fit 
for Cassandra.  Consider:

# If a node has a "hiccup" of slow performance, e.g. due to a GC pause, we want 
to hint those writes and return success to the client.  No need to rate limit.
# If a node has a sustained period of slow performance, we want to hint those 
writes and return success to the client.  No need to rate limit, unless we are 
overwhelmed with hints.  (Not sure if hint overload is actually a problem with 
the new file based hints.)
# Where we DO want to rate limit is when the client is throwing more updates at 
the coordinator than the system can handle, whether that is for a single node 
or globally across many nodes.

So I see this approach as doing the wrong thing for 1 and 2 and only partially 
helping with 3.

Put another way: we do NOT want to limit performance to the slowest node in a 
set of replicas.  That is kind of the opposite of the redundancy we want to 
provide.

> Bound the number of in-flight requests at the coordinator
> -
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths, Streaming and Messaging
>Reporter: Ariel Weisberg
>Assignee: Sergio Bossa
> Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.





[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2016-07-11 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370533#comment-15370533
 ] 

Sergio Bossa edited comment on CASSANDRA-9318 at 7/11/16 4:06 PM:
--

[~Stefania],

bq. Do we track the case where we receive a failed response? Specifically, in 
ResponseVerbHandler.doVerb, shouldn't we call updateBackPressureState() also 
when the message is a failure response?

Good point. I focused more on the success case, due to dropped mutations, but 
that sounds like a good thing to do.
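
A simplified sketch of the behaviour just agreed on; the handler below is a stand-in, not the real {{ResponseVerbHandler}}:

{code:java}
// Sketch: update back-pressure state for failure responses too, not only successes.
final class ResponseHandlingSketch
{
    interface Callback { void onResponse(Object msg); void onFailure(Object msg); }
    interface BackPressureUpdater { void updateBackPressureState(); }

    static void doVerb(Object message, boolean isFailureResponse, Callback callback, BackPressureUpdater backPressure)
    {
        if (callback == null)
            return;                              // callback already expired: nothing to count

        backPressure.updateBackPressureState();  // count the incoming response either way

        if (isFailureResponse)
            callback.onFailure(message);
        else
            callback.onResponse(message);
    }
}
{code}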

bq. If we receive a response after it has timed out, won't we count that 
request twice, incorrectly increasing the rate for that window?

But can that really happen? {{ResponseVerbHandler}} returns _before_ 
incrementing back-pressure if the callback is null (i.e. expired), and 
{{OutboundTcpConnection}} doesn't even send outbound messages if they're timed 
out, or am I missing something?

bq. I also argue that it is quite easy to comment out the strategy and to have 
an empty strategy in the code that means no backpressure.

Again, I believe this would make enabling/disabling back-pressure via JMX less 
user friendly.

bq. I think what we may need is a new companion snitch that sorts the replicas 
by backpressure ratio

I do not think sorting replicas is what we really need, as you have to send the 
mutation to all replicas anyway. I think what you rather need is a way to 
pre-emptively fail if the write consistency level is not met by enough 
"non-overloaded" replicas, i.e.:
* If CL.ONE, fail if *all* replicas are overloaded.
* If CL.QUORUM, fail if *quorum* replicas are overloaded.
* If CL.ALL, fail if *any* replica is overloaded.

This can be easily accomplished in {{StorageProxy#sendToHintedEndpoints}}.
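
A sketch of that check, assuming the caller already knows how many of the target replicas are currently flagged as overloaded; names are illustrative, not the actual StorageProxy code:

{code:java}
// Sketch: pre-emptively fail a write when too many target replicas are overloaded
// to still meet the requested consistency level.
final class OverloadCheckSketch
{
    enum ConsistencyLevel { ONE, QUORUM, ALL }

    static boolean shouldFail(ConsistencyLevel cl, int totalReplicas, int overloadedReplicas)
    {
        int healthy = totalReplicas - overloadedReplicas;
        switch (cl)
        {
            case ONE:    return healthy < 1;                       // fail only if *all* replicas are overloaded
            case QUORUM: return healthy < (totalReplicas / 2) + 1; // fail if a quorum cannot be reached
            case ALL:    return overloadedReplicas > 0;            // fail if *any* replica is overloaded
            default:     return false;
        }
    }
}
{code}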

bq. the exception needs to be different. native_protocol_v4.spec clearly states

I missed that too :(

This leaves us with two options:
* Adding a new exception to the native protocol.
* Reusing a different exception, with {{WriteFailureException}} and 
{{UnavailableException}} the most likely candidates.

I'm currently leaning towards the latter option.

bq. By "load shedding by the replica" do we mean dropping mutations that have 
timed out or something else?

Yes.

bq. Regardless, there is the problem of ensuring that all nodes have 
backpressure enabled, which may not be trivial.

We only need to ensure the coordinator for that specific mutation has 
back-pressure enabled, and we could do this by "marking" the {{MessageOut}} 
with a special parameter; what do you think?


was (Author: sbtourist):
[~Stefania],

bq. Do we track the case where we receive a failed response? Specifically, in 
ResponseVerbHandler.doVerb, shouldn't we call updateBackPressureState() also 
when the message is a failure response?

Good point, I focused more on the success case, due to dropped mutations, but 
that sounds like a good thing to do.

bq. If we receive a response after it has timed out, won't we count that 
request twice, incorrectly increasing the rate for that window?

But can that really happen? {{ResponseVerbHandler}} returns _before_ 
incrementing back-pressure if the callback is null (i.e. expired), and 
{{OutboundTcpConnection}} doesn't even send outbound messages if they're timed 
out, or am I missing something?

bq. I also argue that it is quite easy to comment out the strategy and to have 
an empty strategy in the code that means no backpressure.

Again, I believe this would make enabling/disabling back-pressure via JMX less 
user friendly.

bq. I think what we may need is a new companion snitch that sorts the replica 
by backpressure ratio

I do not think sorting replicas is what we really need, as you have to send the 
mutation to all replicas anyway. I think what you rather need is a way to 
pre-emptively fail if the write consistency level is not met by enough 
"non-overloaded" replicas, i.e.:
* If CL.ONE, fail in *all* replicas are overloaded.
* If CL.QUORUM, fail if *quorum* replicas are overloaded.
* if CL.ALL, fail if *any* replica is overloaded.

This can be easily accomplished in {{StorageProxy#sendToHintedEndpoints}}.

bq. the exception needs to be different. native_protocol_v4.spec clearly states

I missed that too :(

This leaves us with two options:
* Adding a new exception to the native protocol.
* Reusing a different exception, with {{WriteFailureException}} and 
{{UnavailableException}} the most likely candidates.

I'm currently leaning towards the latter option.

bq. By "load shedding by the replica" do we mean dropping mutations that have 
timed out or something else?

Yes.

bq. Regardless, there is the problem of ensuring that all nodes have 
backpressure enabled, which may not be trivial.

We only need to ensure the coordinator for that specific mutation has 
back-pressure enabled, and we could do this by "marking" the {{MessageOut}} 
with a special parameter, what do 

[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2016-07-11 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371063#comment-15371063
 ] 

Sergio Bossa edited comment on CASSANDRA-9318 at 7/11/16 4:05 PM:
--

bq. One thing that worries me is, how do you distinguish between “node X is 
slow because we are writing too fast and we need to throttle clients down” and 
“node X is slow because it is dying, we need to ignore it and accept writes 
based on other replicas?” I.e. this seems to implicitly push everyone to a kind 
of CL.ALL model once your threshold triggers, where if one replica is slow then 
we can't make progress.

This was already noted by [~slebresne], and you're both right: this initial 
implementation is heavily biased towards my specific use case :)
But the solution proposed above should fix it:

bq. I think what you rather need is a way to pre-emptively fail if the write 
consistency level is not met by enough "non-overloaded" replicas, i.e.: If 
CL.ONE, fail if all replicas are overloaded...

Also, the exception would be sent to the client only if the low threshold is 
met, and only the first time it is met, for the duration of the back-pressure 
window (write RPC timeout), i.e.:
* Threshold is 0.1, outgoing requests are 100, incoming responses are 10, ratio 
is 0.1.
* Exception is thrown by all write requests for the current back-pressure 
window.
* The outgoing rate limiter is set at 10, which means the next ratio 
calculation will approach the sustainable rate, and even if replicas still 
lag behind, the ratio will not go down to 0.1 _unless_ the incoming rate 
dramatically goes down to 1.

This is to say the chances of getting a meltdown due to "overloaded exceptions" 
are moderate (also, clients are supposed to adjust themselves when getting such 
an exception), and the above proposal should make things play nicely with the 
CL too.
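
The arithmetic in the bullets above, spelled out as a small worked example using Guava's RateLimiter; the update rule is an assumption made to match the description, not the committed strategy:

{code:java}
// Worked example of the numbers above: ratio 10/100 = 0.1 trips the threshold,
// the limiter is pinned to the incoming rate, and the next window only trips
// again if the incoming rate collapses further.
import com.google.common.util.concurrent.RateLimiter;

public final class BackPressureWindowExample
{
    public static void main(String[] args)
    {
        double threshold = 0.1;
        double outgoing = 100, incoming = 10;

        double ratio = incoming / outgoing;           // 0.1 -> low threshold met
        boolean throwOverloaded = ratio <= threshold; // exception for this window only

        RateLimiter limiter = RateLimiter.create(outgoing);
        if (throwOverloaded)
            limiter.setRate(incoming);                // throttle outgoing to ~10/s

        System.out.printf("ratio=%.2f overloaded=%b limiterRate=%.1f%n",
                          ratio, throwOverloaded, limiter.getRate());
    }
}
{code}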

If you all agree with that, I'll move forward and make that change.


was (Author: sbtourist):
bq. One thing that worries me is, how do you distinguish between “node X is 
slow because we are writing too fast and we need to throttle clients down” and 
“node X is slow because it is dying, we need to ignore it and accept writes 
based on other replicas?”
I.e. this seems to implicitly push everyone to a kind of CL.ALL model once your 
threshold triggers, where if one replica is slow then we can't make progress.

This was already noted by [~slebresne], and you're both right, this initial 
implementation is heavily biased towards my specific use case :)
But, the above proposed solution should fix it:

bq. I think what you rather need is a way to pre-emptively fail if the write 
consistency level is not met by enough "non-overloaded" replicas, i.e.: If 
CL.ONE, fail if all replicas are overloaded...

Also, the exception would be sent to the client only if the low threshold is 
met, and only the first time it is met, for the duration of the back-pressure 
window (write RPC timeout), i.e.:
* Threshold is 0.1, outgoing requests are 100, incoming responses are 10, ratio 
is 0.1.
* Exception is thrown by all write requests for the current back-pressure 
window.
* The outgoing rate limiter is set at 10, which means the next ratio 
calculation will approach the sustainable rate, and even if replicas will still 
lag behind, the ratio will not go down to 0.1 _unless_ the incoming rate 
dramatically goes down to 1.

This is to say the chances of getting a meltdown due to "overloaded exceptions" 
is moderate (also, clients are supposed to adjust themselves when getting such 
exception), and the above proposal should make things play nicely with the CL 
too.

If you all agree with that, I'll move forward and make that change.

> Bound the number of in-flight requests at the coordinator
> -
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths, Streaming and Messaging
>Reporter: Ariel Weisberg
>Assignee: Sergio Bossa
> Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.





[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2016-01-11 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091963#comment-15091963
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 1/11/16 2:26 PM:


Since most of that discussion is implementation details, I'll quote the 
relevant part:

bq. With consistency level less than ALL, mutation processing can move to the 
background (meaning the client was answered, but there is still work to do on 
behalf of the request). If the background request completion rate is lower than 
the incoming request rate, background requests will accumulate and eventually 
exhaust all memory resources. This patch's aim is to prevent this situation by 
monitoring how much memory all current background requests take and, when some 
threshold is passed, to stop moving requests to the background (by not replying 
to a client until either memory consumption moves below the threshold or the 
request is fully completed).

bq. There are two main points where each background mutation consumes memory: 
holding the frozen mutation until the operation is complete (in order to hint it 
if it does not), and the RPC queue to each replica where it sits until it's sent 
out on the wire. The patch accounts for both of those separately and limits the 
former to 10% of total memory and the latter to 6M. Why 6M? The best answer I 
can give is why not :) But on a more serious note, the number should be small 
enough that all the data can be sent out in a reasonable amount of time and one 
shard is not capable of achieving even close to full bandwidth, so empirical 
evidence shows 6M to be a good number. 
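
A self-contained sketch of the accounting described in the quote: frozen mutations held for hinting capped at roughly 10% of the heap, per-replica RPC queues capped at 6M, and a flag telling the coordinator to stop moving requests to the background. Class and method names are illustrative, not the patch itself:

{code:java}
// Sketch of the two memory limits described above. The blocking "don't reply to
// the client yet" gate is reduced to a boolean check for illustration.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class BackgroundWriteAccounting
{
    private static final long FROZEN_MUTATION_LIMIT = Runtime.getRuntime().maxMemory() / 10; // ~10% of heap
    private static final long PER_REPLICA_QUEUE_LIMIT = 6L * 1024 * 1024;                    // 6M

    private final AtomicLong frozenMutationBytes = new AtomicLong();
    private final Map<String, AtomicLong> queuedBytesPerReplica = new ConcurrentHashMap<>();

    void onMutationHeld(long bytes)     { frozenMutationBytes.addAndGet(bytes); }
    void onMutationReleased(long bytes) { frozenMutationBytes.addAndGet(-bytes); }

    void onQueued(String replica, long bytes) { counter(replica).addAndGet(bytes); }
    void onSent(String replica, long bytes)   { counter(replica).addAndGet(-bytes); }

    /** True when the coordinator should stop moving requests to the background,
     *  i.e. hold the client reply until consumption drops or the write completes. */
    boolean overThreshold(String replica)
    {
        return frozenMutationBytes.get() > FROZEN_MUTATION_LIMIT
            || counter(replica).get() > PER_REPLICA_QUEUE_LIMIT;
    }

    private AtomicLong counter(String replica)
    {
        return queuedBytesPerReplica.computeIfAbsent(replica, r -> new AtomicLong());
    }
}
{code}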


was (Author: jbellis):
Since most of that discussion is implementation details, I'll quote the 
relevant part:

bq. With consistency level less then ALL mutation processing can move to
background (meaning client was answered, but there is still work to
do on behalf of the request). If background request rate completion
is lower than incoming request rate background request will accumulate
and eventually will exhaust all memory resources. This patch's aim is
to prevent this situation by monitoring how much memory all current
background request take and when some threshold is passed stop moving
request to background (by not replying to a client until either memory
consumptions moves below the threshold or request is fully completed).

bq. There are two main point where each background mutation consumes memory:
holding frozen mutation until operation is complete in order to hint it
if it does not) and on rpc queue to each replica where it sits until it's
sent out on the wire. The patch accounts for both of those separately
and limits the former to be 10% of total memory and the later to be 6M.
Why 6M? The best answer I can give is why not :) But on a more serious
note the number should be small enough so that all the data can be
sent out in a reasonable amount of time and one shard is not capable to
achieve even close to a full bandwidth, so empirical evidence shows 6M
to be a good number. 

> Bound the number of in-flight requests at the coordinator
> -
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths, Streaming and Messaging
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
> Fix For: 2.1.x, 2.2.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.





[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2016-01-11 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091963#comment-15091963
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 1/11/16 2:27 PM:


Since most of that discussion is implementation details, I'll quote the 
relevant part:

bq. With consistency level less than ALL, mutation processing can move to the 
background (meaning the client was answered, but there is still work to do on 
behalf of the request). If the background request completion rate is lower than 
the incoming request rate, background requests will accumulate and eventually 
exhaust all memory resources. This patch's aim is to prevent this situation by 
monitoring how much memory all current background requests take and, when some 
threshold is passed, to stop moving requests to the background (by not replying 
to a client until either memory consumption moves below the threshold or the 
request is fully completed).

bq. There are two main points where each background mutation consumes memory: 
holding the frozen mutation until the operation is complete (in order to hint it 
if it does not), and the RPC queue to each replica where it sits until it's sent 
out on the wire. The patch accounts for both of those separately and limits the 
former to 10% of total memory and the latter to 6M. Why 6M? The best answer I 
can give is why not :) But on a more serious note, the number should be small 
enough that all the data can be sent out in a reasonable amount of time and one 
shard is not capable of achieving even close to full bandwidth, so empirical 
evidence shows 6M to be a good number. 


was (Author: jbellis):
Since most of that discussion is implementation details, I'll quote the 
relevant part:

bq. With consistency level less then ALL mutation processing can move to 
background (meaning client was answered, but there is still work to do on 
behalf of the request). If background request rate completion is lower than 
incoming request rate background request will accumulate and eventually will 
exhaust all memory resources. This patch's aim is to prevent this situation by 
monitoring how much memory all current background request take and when some 
threshold is passed stop moving request to background (by not replying to a 
client until either memory consumptions moves below the threshold or request is 
fully completed).

bq. There are two main point where each background mutation consumes memory: 
holding frozen mutation until operation is complete in order to hint 
if it does not) and on rpc queue to each replica where it sits until it's sent 
out on the wire. The patch accounts for both of those separately and limits the 
former to be 10% of total memory and the later to be 6M. Why 6M? The best 
answer I can give is why not :) But on a more serious note the number should be 
small enough so that all the data can be sent out in a reasonable amount of 
time and one shard is not capable to achieve even close to a full bandwidth, so 
empirical evidence shows 6M to be a good number. 

> Bound the number of in-flight requests at the coordinator
> -
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths, Streaming and Messaging
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
> Fix For: 2.1.x, 2.2.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.





[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-12-14 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056411#comment-15056411
 ] 

Ariel Weisberg edited comment on CASSANDRA-9318 at 12/14/15 6:20 PM:
-

Quick note. 65k mutations pending in the mutation stage. 7 memtables pending 
flush. [I hooked memtables pending flush into the backpressure 
mechanism.|https://github.com/apache/cassandra/commit/494eabf48ab48f1e86c058c0b583166ab39dcc39]
 That absolutely wrecked performance as throughput dropped to zero 
periodically, but throughput is infinitely higher than when the database has 
OOMed.

Kicked off a few performance runs to demonstrate what happens when you do have 
backpressure and you try various large limits on in flight memtables/requests.

[9318 w/backpressure 64m 8g heap memtables 
count|http://cstar.datastax.com/tests/id/fa769eec-a283-11e5-bbc9-0256e416528f]
[9318 w/backpressure 1g 8g heap memtables 
count|http://cstar.datastax.com/tests/id/4c52dd6e-a286-11e5-bbc9-0256e416528f]
[9318 w/backpressure 2g 8g heap memtables 
count|http://cstar.datastax.com/tests/id/b3d5b470-a286-11e5-bbc9-0256e416528f]

I am setting the point where backpressure turns off to almost the same limit as 
the point where it turns on. This smooths out performance just enough for stress 
not to constantly emit huge numbers of errors as writes time out while the 
database stops serving requests for a long time waiting for a memtable to flush.

With pressure from memtables somewhat accounted for, the remaining source of 
pressure that can bring down a node is remotely delivered mutations. I can 
throw those into the calculation and add a listener that blocks reads from 
other cluster nodes. It's a nasty thing to do, but maybe not that different 
from OOM.

I am going to hack together something to force a node to be slow so I can 
demonstrate overwhelming it with remotely delivered mutations first.
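
Roughly what hooking memtables pending flush into the backpressure mechanism amounts to, as a sketch with made-up thresholds and a deliberately small gap between turn-on and turn-off; the linked commit is the authoritative version:

{code:java}
// Sketch: combine memtables pending flush and pending mutations into one
// "pause reads" signal with a small hysteresis gap. Thresholds are illustrative.
import java.util.function.IntSupplier;

final class FlushAwareBackpressureSketch
{
    private final IntSupplier memtablesPendingFlush; // e.g. fed from metrics
    private final IntSupplier pendingMutations;
    private boolean paused;

    FlushAwareBackpressureSketch(IntSupplier memtablesPendingFlush, IntSupplier pendingMutations)
    {
        this.memtablesPendingFlush = memtablesPendingFlush;
        this.pendingMutations = pendingMutations;
    }

    /** True while client reads should stay paused. The off threshold sits just
     *  below the on threshold, so dips stay short instead of oscillating wildly. */
    boolean shouldPause()
    {
        int flushes = memtablesPendingFlush.getAsInt();
        int pending = pendingMutations.getAsInt();

        if (!paused && (flushes >= 6 || pending >= 65_000))
            paused = true;
        else if (paused && flushes <= 5 && pending <= 60_000)
            paused = false;

        return paused;
    }
}
{code}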


was (Author: aweisberg):
Quick note. 65k mutations pending in the mutation stage. 7 memtables pending 
flush. [I hooked memtables pending flush into the backpressure 
mechanism.|https://github.com/apache/cassandra/commit/494eabf48ab48f1e86c058c0b583166ab39dcc39]
 That absolutely wrecked performance as throughput dropped to 0 zero 
periodically, but throughput is infinitely higher than when the database hasn't 
OOMed.

Kicked off a few performance runs to demonstrate what happens when you do have 
backpressure and you try various large limits on in flight memtables/requests.

[9318 w/backpressure 64m 8g heap memtables 
count|http://cstar.datastax.com/tests/id/fa769eec-a283-11e5-bbc9-0256e416528f]
[9318 w/backpressure 1g 8g heap memtables 
count|http://cstar.datastax.com/tests/id/4c52dd6e-a286-11e5-bbc9-0256e416528f]
[9318 w/backpressure 2g 8g heap memtables 
count|http://cstar.datastax.com/tests/id/b3d5b470-a286-11e5-bbc9-0256e416528f]

I am setting the point where backpressure turns off to almost the same limit as 
to when it turns on. This is smooths out performance just enough for stress to 
not constantly emit huge numbers of errors as writes time out because the 
database stops serving requests for a long time waiting for a memtable to flush.

With pressure from memtables somewhat accounted for the remaining source of 
pressure that can bring down a node is remotely delivered mutations. I can 
throw those into the calculation and add a listener that blocks reads from 
other cluster nodes. It's a nasty thing to do, but maybe not that different 
from OOM.

I am going to hack together something to force a node to be slow so I can 
demonstrate overwhelming it with remotely delivered mutations first.

> Bound the number of in-flight requests at the coordinator
> -
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths, Streaming and Messaging
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
> Fix For: 2.1.x, 2.2.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.





[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-06-29 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605817#comment-14605817
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 6/29/15 4:11 PM:


https://issues.apache.org/jira/browse/CASSANDRA-9318?focusedCommentId=14604775&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14604775

bq. [It is not] a good idea to try to allow extra reads when write capacity is 
full or vice versa. They both ultimately use the same resources (cpu, heap, 
disk i/o).


was (Author: jbellis):
https://issues.apache.org/jira/browse/CASSANDRA-9318?focusedCommentId=14604775&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14604775

 [It is not] a good idea to try to allow extra reads when write capacity is 
 full or vice versa. They both ultimately use the same resources (cpu, heap, 
 disk i/o).

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.1.x, 2.2.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.





[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-06-29 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605817#comment-14605817
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 6/29/15 4:10 PM:


https://issues.apache.org/jira/browse/CASSANDRA-9318?focusedCommentId=14604775&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14604775

 [It is not] a good idea to try to allow extra reads when write capacity is 
 full or vice versa. They both ultimately use the same resources (cpu, heap, 
 disk i/o).


was (Author: jbellis):
https://issues.apache.org/jira/browse/CASSANDRA-9318?focusedCommentId=14604775&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14604775

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.1.x, 2.2.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.





[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-06-28 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604649#comment-14604649
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 6/28/15 12:08 PM:
-

Here's where I've ended up:

# Continuing to accept writes faster than a coordinator can deliver them to 
replicas is bad.  Even perfect load shedding is worse from a client perspective 
than throttling, since if we load shed and time out the client needs to try to 
guess the right rate to retry at.
# For the same reason, accepting a write but then refusing it with 
UnavailableException is worse than waiting to accept the write until we have 
capacity for it.
# It's more important to throttle writes because while we can get in trouble 
with large reads too (a small request turns into a big reply), in practice 
reads are naturally throttled because a client needs to wait for the read 
before taking action on it.  With writes on the other hand a new user's first 
inclination is to see how fast s/he can bulk load stuff.

In practice, I see load shedding and throttling as complementary.  Replicas can 
continue to rely on load shedding.  Perhaps we can attempt distributed back 
pressure later (if every replica is overloaded, we should again throttle 
clients) but for now let's narrow our scope to throttling clients to the 
capacity of a coordinator to send out.

*I propose we define a limit on the amount of memory MessagingService can 
consume and pause reading additional requests whenever that limit is hit.*  
Note that:

# If MS's load is distributed evenly across all destinations then this is 
trivially the right thing to do.
# If MS's load is caused by a single replica falling over or unable to keep up, 
this is still the right thing to do because the alternative is worse.  MS will 
load shed timed out requests, but if clients are sending more requests to a 
single replica than we can shed (if rate * timeout > capacity) then we still 
need to throttle or we will exhaust the heap and fall over.  

(The hint-based UnavailableException tries to help with scenario 2, and I will 
open a ticket to test how well that actually works.  But the hint threshold 
cannot help with scenario 1 at all and that is the hole this ticket needs to 
plug.)
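
A sketch of the proposal in bold: account for the bytes MessagingService is holding for all destinations and tell the transport to pause or resume reading when the limit is crossed. The limit value and the listener interface are assumptions for illustration:

{code:java}
// Sketch: track memory tied up in messages that have been accepted but not yet
// delivered (or shed), and pause reading new client requests while over the limit.
import java.util.concurrent.atomic.AtomicLong;

final class MessagingMemoryGate
{
    interface ReadToggle { void pauseReads(); void resumeReads(); }

    private final AtomicLong bytesInFlight = new AtomicLong();
    private final long limitBytes;
    private final ReadToggle transport;
    private volatile boolean paused;

    MessagingMemoryGate(long limitBytes, ReadToggle transport)
    {
        this.limitBytes = limitBytes;
        this.transport = transport;
    }

    /** Called when a request's messages are serialized and queued for replicas. */
    void onEnqueue(long serializedBytes)
    {
        if (bytesInFlight.addAndGet(serializedBytes) > limitBytes && !paused)
        {
            paused = true;
            transport.pauseReads(); // clients simply perceive the coordinator slowing down
        }
    }

    /** Called when a response arrives, the request times out, or the message is shed. */
    void onRelease(long serializedBytes)
    {
        if (bytesInFlight.addAndGet(-serializedBytes) <= limitBytes && paused)
        {
            paused = false;
            transport.resumeReads();
        }
    }
}
{code}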


was (Author: jbellis):
Here's where I've ended up:

# Continuing to accept writes faster than a coordinator can deliver them to 
replicas is bad.  Even perfect load shedding is worse from a client perspective 
than throttling, since if we load shed and time out the client needs to try to 
guess the right rate to retry at.
# For the same reason, accepting a write but then refusing it with 
UnavailableException is worse than waiting to accept the write until we have 
capacity for it.
# It's more important to throttle writes because while we can get in trouble 
with large reads too (a small request turns into a big reply), in practice 
reads are naturally throttled because a client needs to wait for the read 
before taking action on it.  With writes on the other hand a new user's first 
inclination is to see how fast s/he can bulk load stuff.

In practice, I see load shedding and throttling as complementary.  Replicas can 
continue to rely on load shedding.  Perhaps we can attempt distributed back 
pressure later (if every replica is overloaded, we should again throttle 
clients) but for now let's narrow our scope to throttling clients to the 
capacity of a coordinator to send out.

I propose we define a limit on the amount of memory MessagingService can 
consume and pause reading additional requests whenever that limit is hit.  Note 
that:

# If MS's load is distributed evenly across all destinations then this is 
trivially the right thing to do.
# If MS's load is caused by a single replica falling over or unable to keep up, 
this is still the right thing to do because the alternative is worse.  MS will 
load shed timed out requests, but if clients are sending more requests to a 
single replica than we can shed (if rate * timeout > capacity) then we still 
need to throttle or we will exhaust the heap and fall over.  

(The hint-based UnavailableException tries to help with scenario 2, and I will 
open a ticket to test how well that actually works.  But the hint threshold 
cannot help with scenario 1 at all and that is the hole this ticket needs to 
plug.)

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.2.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight 

[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-06-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604746#comment-14604746
 ] 

Benedict edited comment on CASSANDRA-9318 at 6/28/15 3:48 PM:
--

bq. This in no way affects our contract or guarantees, since we don't do 
anything at all in the intervening period except consume memory.
bq. The whole point is that coordinators are falling over from OOM. This isn't 
just something we can wave away as negligible.

I was referring here to the status quo, FTR.

Also FTR, we do clearly state hints are best effort (they also aren't 
guaranteed to be persisted), so as far as contracts / guarantees are concerned, 
I don't know we make any (and I wasn't aware of this one). It would be really 
helpful for these (and many other) discussions if all of the assumptions, 
contracts and guarantees we make about correctness and delivery were made 
available in a single clearly spelled out document (and that, like the code 
style, this document is the final arbiter of what action to take).


was (Author: benedict):
bq. This in no way affects our contract or guarantees, since we don't do 
anything at all in the intervening period except consume memory.
bq. The whole point is that coordinators are falling over from OOM. This isn't 
just something we can wave away as negligible.

I was referring here to the status quo, FTR.

Also FTR, we do clearly state hints are best effort (they also aren't 
guaranteed to be persisted), so as far as contracts / guarantees are concerned, 
I don't know we make any (and I wasn't aware of this one). It would be really 
helpful for these (and many other) discussions if all of the assumptions, 
contracts and guarantees we make about correctness and delivery were made 
available in a single clearly spelled out document.

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.1.x, 2.2.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-06-28 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604763#comment-14604763
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 6/28/15 4:31 PM:


Hinting is better than leaving things in an unknown state but it's not 
something we should opt users into if we have a better option, since it 
basically turns the write into CL.ANY.

I think you're [Benedict] overselling how scary it is to stop reading new 
requests until we can free up some memory from MS.  We're not dropping 
connections.  We're just imposing some flow control.  Which is something that 
already happens at different levels anyway.


was (Author: jbellis):
Hinting is better than leaving things in an unknown state but it's not 
something we should opt users into if we have a better option, since it 
basically turns the write into CL.ANY.

I think you're overselling how scary it is to stop reading new requests until 
we can free up some memory from MS.  We're not dropping connections.  We're 
just imposing some flow control.  Which is something that already happens at 
different levels anyway.

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.1.x, 2.2.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-06-28 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604735#comment-14604735
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 6/28/15 3:35 PM:


Let's pull optimizing hints to a separate ticket.  It is complementary to 
"don't accept more requests than you can handle".


was (Author: jbellis):
Let's pull optimizing hints to a separate ticket.  It is complementary to 
"don't accept more than you can handle".

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.1.x, 2.2.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-06-28 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604728#comment-14604728
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 6/28/15 3:24 PM:


bq. This in no way affects our contract or guarantees, since we don't do 
anything at all in the intervening period except consume memory. 

The whole point is that coordinators are falling over from OOM.  This isn't 
just something we can wave away as negligible.


was (Author: jbellis):
bq. This in no way affects our contract or guarantees, since we don't do 
anything at all in the intervening period except consume memory. 

The whole point is that coordinators are falling over from OOM.

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.2.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-06-28 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604775#comment-14604775
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 6/28/15 4:52 PM:


Replica and coordinator are only identical on writes when RF=1, hardly the most 
common case.

Nor is it a good idea to try to allow extra reads when write capacity is full 
or vice versa.  They both ultimately use the same resources (cpu, heap, disk 
i/o).


was (Author: jbellis):
Replica and coordinator are only identical on writes when RF=1.

Nor is it a good idea to try to allow extra reads when write capacity is full 
or vice versa.  They both ultimately use the same resources (cpu, heap, disk 
i/o).

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.1.x, 2.2.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-05-09 Thread Jonathan Shook (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536846#comment-14536846
 ] 

Jonathan Shook edited comment on CASSANDRA-9318 at 5/10/15 12:32 AM:
-

I would venture that a solid load shedding system may improve the degenerate 
overloading case, but it is not the preferred method for dealing with 
overloading for most users. The concept of back-pressure is more squarely what 
people expect, for better or worse.

Here is what I think reasonable users want to see, with some variations:
1) The system performs with stability, up to the workload that it is able to 
handle with stability.
2a) Once it reaches that limit, it starts pushing back in terms of how quickly 
it accepts new work. This means that it simply blocks the operations or 
submissions of new requests with some useful bound that is determined by the 
system. It does not yet have to shed load. It does not yet have to give 
exceptions. This is a very reasonable expectation for most users. This is what 
they expect. Load shedding is a term of art which does not change the users' 
expectations.
2b) Once it reaches that limit, it starts throwing OE to the client. It does 
not have to shed load yet. (Perhaps this exception or something like it can be 
thrown _before_ load shedding occurs.) This is a very reasonable expectation 
for users who are savvy enough to do active load management at the client 
level. It may have to start writing hints, but if you are writing hints merely 
because of load, this might not be the best justification for having the hints 
system kick in. To me this is inherently a convenient remedy for the wrong 
problem, even if it works well. Yes, hints are there as a general mechanism, 
but it does not solve the problem of needing to know when the system is being 
pushed beyond capacity and how to handle it proactively. You could also say 
that hints actively hurt capacity when you need them most sometimes. They are 
expensive to process given the current implementation, and will always be load 
shifting even at theoretical best. Still we need them for node availability 
concerns, although we should be careful not to use them as a crutch for general 
capacity issues.
2c) Once it reaches that limit, it starts backlogging (without a helpful 
signature of such in the responses, maybe BackloggingException with some queue 
estimate). This is a very reasonable expectation for users who are savvy enough 
to manage their peak and valley workloads in a sensible way. Sometimes you 
actually want to tax the ingest and flush side of the system for a bit before 
allowing it to switch modes and catch up with compaction. The fact that C* can 
do this is an interesting capability, but those who want backpressure will not 
easily see it that way.
2d) If the system is being pushed beyond its capacity, then it may have to shed 
load. This should only happen if the user has decided that they want to be 
responsible for such and have pushed the system beyond the reasonable limit 
without paying attention to the indications in 2a, 2b, and 2c. In the current 
system, this decision is already made for them. They have no choice.

In a more optimistic world, users would get near optimal performance for a well 
tuned workload with back-pressure active throughout the system, or something 
very much like it. We could call it a different kind of scheduler, different 
queue management methods, or whatever. 
As long as the user could prioritize stability at some bounded load over 
possible instability at an over-saturating load, I think they would in most 
cases. Like I said, they really don't have this choice right now. I know this 
is not trivial. We can't remove the need to make sane judgments about sizing 
and configuration. We might be able to, however, make the system ramp more 
predictably up to saturation, and behave more reasonably at that level.

Order of precedence, how to designate a mode of operation, or any other 
concerns aren't really addressed here. I just provided the examples above as 
types of behaviors which are nuanced yet perfectly valid for different types of 
system designs. The real point here is that there is not a single overall 
QoS/capacity/back-pressure behavior which is going to be acceptable to all 
users. Still, we need to ensure stability under saturating load where possible. 
I would like to think that with CASSANDRA-8099 that we can start discussing 
some of the client-facing back-pressure ideas more earnestly. I do believe that 
these ideas are all compatible ideas on a spectrum of behavior. They are not 
mutually exclusive from a design/implementation perspective. It's possible that 
they could be specified per operation, even, with some traffic yield to others 
due to client policies. For example, a lower priority client could yield when 
it knows the 

[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-05-09 Thread Jonathan Shook (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536846#comment-14536846
 ] 

Jonathan Shook edited comment on CASSANDRA-9318 at 5/10/15 12:26 AM:
-

I would venture that a solid load shedding system may improve the degenerate 
overloading case, but it is not the preferred method for dealing with 
overloading for most users. The concept of back-pressure is more squarely what 
people expect, for better or worse.

Here is what I think reasonable users want to see, with some variations:
1) The system performs with stability, up to the workload that it is able to 
handle with stability.
2a) Once it reaches that limit, it starts pushing back in terms of how quickly 
it accepts new work. This means that it simply blocks the operations or 
submissions of new requests with some useful bound that is determined by the 
system. It does not yet have to shed load. It does not yet have to give 
exceptions. This is a very reasonable expectation for most users. This is what 
they expect. Load shedding is a term of art which does not change the users' 
expectations.
2b) Once it reaches that limit, it starts throwing OE to the client. It does 
not have to shed load yet. (Perhaps this exception or something like it can be 
thrown _before_ load shedding occurs.) This is a very reasonable expectation 
for users who are savvy enough to do active load management at the client 
level. It may have to start writing hints, but if you are writing hints merely 
because of load, this might not be the best justification for having the hints 
system kick in. To me this is inherently a convenient remedy for the wrong 
problem, even if it works well. Yes, hints are there as a general mechanism, 
but it does not solve the problem of needing to know when the system is being 
pushed beyond capacity and how to handle it proactively. You could also say 
that hints actively hurt capacity when you need them most sometimes. They are 
expensive to process given the current implementation, and will always be load 
shifting even at theoretical best. Still we need them for node availability 
concerns, although we should be careful not to use them as a crutch for general 
capacity issues.
2c) Once it reaches that limit, it starts backlogging (without a helpful 
signature of such in the responses, maybe BackloggingException with some queue 
estimate). This is a very reasonable expectation for users who are savvy enough 
to manage their peak and valley workloads in a sensible way. Sometimes you 
actually want to tax the ingest and flush side of the system for a bit before 
allowing it to switch modes and catch up with compaction. The fact that C* can 
do this is an interesting capability, but those who want backpressure will not 
easily see it that way.
2d) If the system is being pushed beyond its capacity, then it may have to shed 
load. This should only happen if the user has decided that they want to be 
responsible for such and have pushed the system beyond the reasonable limit 
without paying attention to the indications in 2a, 2b, and 2c. In the current 
system, this decision is already made for them. They have no choice.

In a more optimistic world, users would get near optimal performance for a well 
tuned workload with back-pressure active throughout the system, or something 
very much like it. We could call it a different kind of scheduler, different 
queue management methods, or whatever. 
As long as the user could prioritize stability at some bounded load over 
possible instability at an over-saturating load, I think they would in most 
cases. Like I said, they really don't have this choice right now. I know this 
is not trivial. We can't remove the need to make sane judgments about sizing 
and configuration. We might be able to, however, make the system ramp more 
predictably up to saturation, and behave more reasonable at that level.

Order of precedence, how to designate a mode of operation, or any other 
concerns aren't really addressed here. I just provided the examples above as 
types of behaviors which are nuanced yet perfectly valid for different types of 
system designs. The real point here is that there is not a single overall 
QoS/capacity/back-pressure behavior which is going to be acceptable to all 
users. Still, we need to ensure stability under saturating load where possible. 
I would like to think that with CASSANDRA-8099 that we can start discussing 
some of the client-facing back-pressure ideas more earnestly. I do believe that 
these ideas are all compatible ideas on a spectrum of behavior. They are not 
mutually exclusive from a design/implementation perspective. It's possible that 
they could be specified per operation, even, with some traffic yield to others 
due to client policies. For example, a lower priority client could yield when 
it knows the 

[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-05-09 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536307#comment-14536307
 ] 

Benedict edited comment on CASSANDRA-9318 at 5/9/15 5:23 PM:
-

bq. Where? Are you talking about the hint limit?

I was, and I realise that was a mistake; I didn't fully understand the existing 
logic (and your proposal took me by surprise). Now that I do, I think I 
understand what you are proposing. There are a few problems that I see with it, 
though:

# the cluster as a whole, especially in large clusters, can still send a _lot_ 
of requests to a single node
# it has the opposite impact of (and likely prevents) CASSANDRA-3852, with 
older operations completely blocking newer ones 
# it might mean a lot more OE than users are used to during temporary blips, 
pushing problems down to clients, when the cluster is actually quite capable of 
coping (through hinting)
#* It seems like this would in fact seriously compromise our A property, with 
any failure for any node in a token range rapidly making the entire token range 
unavailable for writes\*
# tuning it is hard; network latencies, query processing times, and cluster 
size (which changes over time) will each impact it

I'm wary about a feature like this, when we could simply improve our current 
work shedding to make it more robust (MessagingService, MUTATION stage and 
ExpiringMap all, effectively, shed; just not with sufficient predictability), 
but I think I've made all my concerns sufficiently clear so I'll leave it with 
you.

\* At the very least we would have to first fallback to hints, rather than 
throwing OE, and wait for hints to saturate before throwing (AFAICT). In which 
case we're _in effect_ introducing LIFO-leaky pruning of the ExpiringMap, MS, 
and the receiving node's MUTATION stage, but under a new mechanism (as opposed 
to inline FIFO? (tbd) pruning). I don't really have anything against this, 
since it is functionally equivalent, although I think FIFO-pruning is 
preferable; having fewer pruning mechanisms is probably preferable; these 
mechanisms would apply more universally; and they would insulate the node from 
the many-to-one effect (by making the MUTATION stage itself robust to overload).


was (Author: benedict):
bq. Where? Are you talking about the hint limit?

I was, and I realise that was a mistake; I didn't fully understand the existing 
logic (and your proposal took me by surprise). Now that I do, I think I 
understand what you are proposing. There are a few problems that I see with it, 
though:

# the cluster as a whole, especially in large clusters, can still send a _lot_ 
of requests to a single node
# it has the opposite impact of (and likely prevents) CASSANDRA-3852, with 
older operations completely blocking newer ones 
# it might mean a lot more OE than users are used to during temporary blips, 
pushing problems down to clients, when the cluster is actually quite capable of 
coping (through hinting)
# tuning it is hard; network latencies, query processing times, and cluster 
size (which changes over time) will each impact it

I'm wary about a feature like this, when we could simply improve our current 
work shedding to make it more robust (MessagingService, MUTATION stage and 
ExpiringMap all, effectively, shed; just not with sufficient predictability), 
but I think I've made all my concerns sufficiently clear so I'll leave it with 
you.

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.1.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-05-09 Thread Jonathan Shook (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536846#comment-14536846
 ] 

Jonathan Shook edited comment on CASSANDRA-9318 at 5/9/15 7:42 PM:
---

I would venture that a solid load shedding system may improve the degenerate 
overloading case, but it is not the preferred method for dealing with 
overloading for most users. The concept of back-pressure is more squarely what 
people expect, for better or worse.

Here is what I think reasonable users want to see, with some variations:
1) The system performs with stability, up to the workload that it is able to 
handle with stability.
2a) Once it reaches that limit, it starts pushing back in terms of how quickly 
it accepts new work. This means that it simply blocks the operations or 
submissions of new requests with some useful bound that is determined by the 
system. It does not yet have to shed load. It does not yet have to give 
exceptions. This is a very reasonable expectation for most users. This is what 
they expect. Load shedding is a term of art which does not change the users 
expectations.
2b) Once it reaches that limit, it starts throwing OE to the client. It does 
not have to shed load yet. This is a very reasonable expectation for users who 
are savvy enough to do active load management at the client level. It may have 
to start writing hints, but if you are writing hints because of load, this 
might not be the best justification for having the hints system kick in. To me 
this is inherently a convenient remedy for the wrong problem, even if it works 
well. Yes, hints are there as a general mechanism, but it does not relieve us 
of the problem of needing to know when the system is at capacity and how to 
handle it proactively. You could also say that hints actively hurt capacity 
when you need them most sometimes. They are expensive to process given the 
current implementation, and will always be load shifting even at theoretical 
best. Still we need them for node availability concerns, although we should be 
careful not to use them as a crutch for general capacity issues.
2c) Once it reaches that limit, it starts backlogging (without a helpful 
signature of such in the responses, maybe BackloggingException with some queue 
estimate). This is a very reasonable expectation for users who are savvy enough 
to manage their peak and valley workloads in a sensible way. Sometimes you 
actually want to tax the ingest and flush side of the system for a bit before 
allowing it to switch modes and catch up with compaction. The fact that C* can 
do this is an interesting capability, but those who want backpressure will not 
easily see it that way.
2d) If the system is being pushed beyond its capacity, then it may have to shed 
load. This should only happen if the user has decided that they want to be 
responsible for such and have pushed the system beyond the reasonable limit 
without paying attention to the indications in 2a, 2b, and 2c.

Order of precedence, designated mode of operation, or any other concerns aren't 
really addressed here. I just provided them as examples of types of behaviors 
which are nuanced yet perfectly valid for different types of system designers. 
The real point here is that there is not a single overall design which is going 
to be acceptable to all users. Still, we need to ensure stability under 
saturating load where possible. I would like to think that with CASSANDRA-8099 
that we can start discussing some of the client-facing back-pressure ideas more 
earnestly.

We can come up with methods to improve the reliable and responsive capacity of 
the system even with some internal load management. If the first cut ends up 
being sub-optimal, then we can measure it against non-bounded workload tests 
and strive to close the gap. If it is implemented in a way that can support 
multiple usage scenarios, as described above, then such a limitation might be 
"unlimited", "bounded at level ___", or "bounded by inline resource 
management"... But in any case it would be controllable by some user, admin, or 
client. If we could ultimately give the categories of users above the ability 
to enable the various modes, then the 2a) scenario would be perfectly desirable 
for many users already even if the back-pressure logic only gave you 70% of the 
effective system capacity. Once testing shows that performance with active 
back-pressure to the client is close enough to the unbounded workloads, it 
could be enabled by default.

Summary: We still need reasonable back-pressure support throughout the system 
and eventually to the client. Features like this that can be a stepping stone 
towards such are still needed. The most perfect load shedding and hinting 
systems will still not be a sufficient replacement for back-pressure and 
capacity management.
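
Read one way, the modes 2a-2d above amount to a per-deployment overload policy; 
a rough sketch of that idea follows, with a made-up ClientConnection stand-in 
rather than any real Cassandra interface:

    // Illustrative only: the behaviours 2a-2d expressed as a policy the coordinator
    // could consult once it detects saturation. Every name here is hypothetical.
    final class SaturationHandler
    {
        enum OverloadPolicy { BLOCK_NEW_WORK, THROW_OVERLOADED, BACKLOG, SHED_LOAD }

        interface ClientConnection
        {
            void pauseReads();              // 2a: stop accepting new work for a while
            void notifyBacklog(int queued); // 2c: tell the client how far behind we are
            void dropOldestPending();       // 2d: shed load as a last resort
        }

        static void onSaturation(OverloadPolicy policy, ClientConnection conn, int queued)
        {
            switch (policy)
            {
                case BLOCK_NEW_WORK:   conn.pauseReads(); break;
                case THROW_OVERLOADED: throw new IllegalStateException("overloaded (2b)"); // stand-in for OE
                case BACKLOG:          conn.notifyBacklog(queued); break;
                case SHED_LOAD:        conn.dropOldestPending(); break;
            }
        }
    }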


was (Author: jshook):
I would venture that a 

[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-05-09 Thread Jonathan Shook (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536846#comment-14536846
 ] 

Jonathan Shook edited comment on CASSANDRA-9318 at 5/9/15 7:46 PM:
---

I would venture that a solid load shedding system may improve the degenerate 
overloading case, but it is not the preferred method for dealing with 
overloading for most users. The concept of back-pressure is more squarely what 
people expect, for better or worse.

Here is what I think reasonable users want to see, with some variations:
1) The system performs with stability, up to the workload that it is able to 
handle with stability.
2a) Once it reaches that limit, it starts pushing back in terms of how quickly 
it accepts new work. This means that it simply blocks the operations or 
submissions of new requests with some useful bound that is determined by the 
system. It does not yet have to shed load. It does not yet have to give 
exceptions. This is a very reasonable expectation for most users. This is what 
they expect. Load shedding is a term of art which does not change the users 
expectations.
2b) Once it reaches that limit, it starts throwing OE to the client. It does 
not have to shed load yet. This is a very reasonable expectation for users who 
are savvy enough to do active load management at the client level. It may have 
to start writing hints, but if you are writing hints because of load, this 
might not be the best justification for having the hints system kick in. To me 
this is inherently a convenient remedy for the wrong problem, even if it works 
well. Yes, hints are there as a general mechanism, but it does not relieve us 
of the problem of needing to know when the system is at capacity and how to 
handle it proactively. You could also say that hints actively hurt capacity 
when you need them most sometimes. They are expensive to process given the 
current implementation, and will always be load shifting even at theoretical 
best. Still we need them for node availability concerns, although we should be 
careful not to use them as a crutch for general capacity issues.
2c) Once it reaches that limit, it starts backlogging (without a helpful 
signature of such in the responses, maybe BackloggingException with some queue 
estimate). This is a very reasonable expectation for users who are savvy enough 
to manage their peak and valley workloads in a sensible way. Sometimes you 
actually want to tax the ingest and flush side of the system for a bit before 
allowing it to switch modes and catch up with compaction. The fact that C* can 
do this is an interesting capability, but those who want backpressure will not 
easily see it that way.
2d) If the system is being pushed beyond its capacity, then it may have to shed 
load. This should only happen if the user has decided that they want to be 
responsible for such and have pushed the system beyond the reasonable limit 
without paying attention to the indications in 2a, 2b, and 2c.

Order of precedence, designated mode of operation, or any other concerns aren't 
really addressed here. I just provided the examples above as types of behaviors 
which are nuanced yet perfectly valid for different types of system designs. 
The real point here is that there is not a single overall 
QoS/capacity/back-pressure behavior which is going to be acceptable to all 
users. Still, we need to ensure stability under saturating load where possible. 
I would like to think that with CASSANDRA-8099 that we can start discussing 
some of the client-facing back-pressure ideas more earnestly.

We can come up with methods to improve the reliable and responsive capacity of 
the system even with some internal load management. If the first cut ends up 
being sub-optimal, then we can measure it against non-bounded workload tests 
and strive to close the gap. If it is implemented in a way that can support 
multiple usage scenarios, as described above, then such a limitation might be 
"unlimited", "bounded at level ___", or "bounded by inline resource 
management"... But in any case it would be controllable by some user, admin, or 
client. If we could ultimately give the categories of users above the ability 
to enable the various modes, then the 2a) scenario would be perfectly desirable 
for many users already even if the back-pressure logic only gave you 70% of the 
effective system capacity. Once testing shows that performance with active 
back-pressure to the client is close enough to the unbounded workloads, it 
could be enabled by default.

Summary: We still need reasonable back-pressure support throughout the system 
and eventually to the client. Features like this that can be a stepping stone 
towards such are still needed. The most perfect load shedding and hinting 
systems will still not be a sufficient replacement for back-pressure and 
capacity management.


was (Author: 

[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-05-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534260#comment-14534260
 ] 

Benedict edited comment on CASSANDRA-9318 at 5/8/15 11:24 AM:
--

bq. In my mind in-flight means that until the response is sent back to the 
client the request counts against the in-flight limit.
bq. We already keep the original request around until we either get acks from 
all replicas, or they time out (and we write a hint).

Perhaps we need to clarify more clearly what this ticket is proposing. I 
understood it to mean what Ariel commented here, whereas it sounds like 
Jonathan is suggesting we simply prune our ExpiringMap based on bytes tracked 
as well as time?

The ExpiringMap requests are already in-flight and cannot be cancelled, so 
their effect on other nodes cannot be rescinded, and imposing a limit does not 
stop us issuing more requests to the nodes in the cluster that are failing to 
keep up and respond to us. It _might_ be sensible if introducing more 
aggressive shedding at processing nodes to also shed the response handlers more 
aggressively locally, but I'm not convinced it would have a significant impact 
on cluster health by itself; cluster instability spreads out from problematic 
nodes, and this scheme still permits us to inundate those nodes with requests 
they cannot keep up with.

Alternatively, the approach of forbidding new requests if you have items in the 
ExpiringMap causes the collapse of other nodes to spread throughout the 
cluster, as rapidly (especially on small clusters) all requests on the system 
are destined for the collapsing node, and every coordinator stops accepting 
requests. The system seizes up, and that node is still failing since it's got 
requests from the entire cluster queued up with it. In general on the 
coordinator there's no way of distinguishing between a failed node, network 
partition, or just struggling, so we don't know if we should wait.

Some mix of the two might be possible, if we were to wait while a node is just 
slow, then drop our response handlers for the node if it's marked as down. This 
latter may not be a bad thing to do anyway, but I would not want to depend on 
this behaviour to maintain our precious A

It still seems the simplest and most robust solution is to make our work queues 
leaky, since this insulates the processing nodes from cluster-wide inundation, 
which the coordinator approach cannot (even with the loss of A and cessation 
of processing, there is a whole cluster vs a potentially single node; with hash 
partition it doesn't take long for all processing to begin involving a single 
failing node). We can do this just on the number of requests and still be much 
better than we are currently. 

We could also pair this with coordinator-level dropping of handlers for down 
nodes, and above a size/count threshold. This latter, though, could result in 
widespread uncoordinated dropping of requests, which may leave us open to a 
multiplying effect of cluster overload, with each node dropping different 
requests, possibly leading to only a tiny fraction of requests being serviced 
to their required CL across the cluster. I'm not sure how we can best model 
this risk, or avoid it without notifying coordinators of the drop of a message, 
and I don't see that being delivered for 2.1
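
A minimal sketch of the "leaky" work queue idea, shedding on enqueue by request 
count (a hypothetical class, not the actual MUTATION stage):

    import java.util.ArrayDeque;

    // Illustrative only: a bounded work queue that sheds the oldest entry on
    // enqueue (FIFO pruning) instead of waiting for it to time out on dequeue.
    final class LeakyWorkQueue<T>
    {
        private final ArrayDeque<T> queue = new ArrayDeque<>();
        private final int maxRequests;

        LeakyWorkQueue(int maxRequests) { this.maxRequests = maxRequests; }

        synchronized T add(T task)
        {
            T shed = null;
            if (queue.size() >= maxRequests)
                shed = queue.pollFirst(); // shed the oldest queued request
            queue.addLast(task);
            return shed;                  // caller decides how to respond for the shed request
        }

        synchronized T poll() { return queue.pollFirst(); }
    }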


was (Author: benedict):
bq. In my mind in-flight means that until the response is sent back to the 
client the request counts against the in-flight limit.
bq. We already keep the original request around until we either get acks from 
all replicas, or they time out (and we write a hint).

Perhaps we need to clarify more clearly what this ticket is proposing. I 
understood it to mean what Ariel commented here, whereas it sounds like 
Jonathan is suggesting we simply prune our ExpiringMap based on bytes tracked 
as well as time?

The ExpiringMap requests are already in-flight and cannot be cancelled, so 
their effect on other nodes cannot be rescinded, and imposing a limit does not 
stop us issuing more requests to the nodes in the cluster that are failing to 
keep up and respond to us. It _might_ be sensible if introducing more 
aggressive shedding at processing nodes to also shed the response handlers more 
aggressively locally, but I'm not convinced it would have a significant impact 
on cluster health by itself; cluster instability spreads out from problematic 
nodes, and this scheme still permits us to inundate those nodes with requests 
they cannot keep up with.

Alternatively, the approach of forbidding new requests if you have items in the 
ExpiringMap causes the collapse of other nodes to spread throughout the 
cluster, as rapidly (especially on small clusters) all requests on the system 
are destined for the collapsing node, and every coordinator stops accepting 
requests. The system seizes up, and that node 

[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-05-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535369#comment-14535369
 ] 

Benedict edited comment on CASSANDRA-9318 at 5/8/15 7:52 PM:
-

bq. because our existing load shedding is fine at recovering from temporary 
spikes in load

Are you certain? The recent testing Ariel did on CASSANDRA-8670 demonstrated 
the MUTATION stage was what was bringing the cluster down, not the ExpiringMap; 
and this was in a small cluster.

If anything, I suspect our ability to prune these messages is also 
theoretically worse, on top of this practical datapoint, because it is done on 
dequeue, whereas the ExpiringMap and MessagingService (whilst having a slightly 
longer expiry) is done asynchronously (or on enqueue) and cannot be blocked by 
e.g. flush. What I'm effectively suggesting is simply making all of the load 
shedding happen on enqueue, and basing it on queue length as well as time, 
so that our load shedding really is simply more robust.

The coordinator is also on the right side of the equation: as the cluster 
grows, any single problems should spread out to the coordinators more slowly, 
whereas the coordinator's ability to flood a processing node scales up at the 
same (well, inverted) rate.
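
A sketch of shedding decided at enqueue time, using both the message's age 
against its timeout and the current queue length (parameter names are 
illustrative assumptions):

    // Illustrative only: decide on enqueue whether a message should be shed,
    // based on how old it already is and how long the queue has grown.
    final class EnqueueShedding
    {
        static boolean shouldShed(long createdAtMillis, long timeoutMillis,
                                  int queueLength, int maxQueueLength)
        {
            long ageMillis = System.currentTimeMillis() - createdAtMillis;
            return ageMillis > timeoutMillis || queueLength >= maxQueueLength;
        }
    }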


was (Author: benedict):
bq. because our existing load shedding is fine at recovering from temporary 
spikes in load

Are you certain? The recent testing Ariel did on CASSANDRA-8670 demonstrated 
the MUTATION stage was what was bringing the cluster down, not the ExpiringMap; 
and this was in a small cluster.

If anything, I suspect our ability to prune these messages is also 
theoretically worse, on top of this practical datapoint, because it is done on 
dequeue, whereas the ExpiringMap (whilst having a slightly longer expiry) is 
done asynchronously and cannot be blocked by e.g. flush.

The coordinator is also on the right side of the equation: as the cluster 
grows, any single problems should spread out to the coordinators more slowly, 
whereas the coordinator's ability to flood a processing node scales up at the 
same (well, inverted) rate.

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.1.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-05-08 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535356#comment-14535356
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 5/8/15 7:40 PM:
---

bq. it sounds like Jonathan is suggesting we simply prune our ExpiringMap based 
on bytes tracked as well as time?

No, I'm suggesting we abort requests more aggressively with OverloadedException 
*before sending them to replicas*.  One place this might make sense is 
sendToHintedEndpoints, where we already throw OE.

Right now we only throw OE once we start writing hints for a node that is in 
trouble.  This doesn't seem to be aggressive enough.  (Although, most of our 
users are on 2.0 where we allowed 8x as many hints in flight before starting to 
throttle.)

So, I am suggesting we also track requests outstanding (perhaps with the 
ExpiringMap as you suggest) as well and stop accepting requests once we hit a 
reasonable limit of "you can't possibly process more requests than this in 
parallel".

bq. The ExpiringMap requests are already in-flight and cannot be cancelled, 
so their effect on other nodes cannot be rescinded, and imposing a limit does 
not stop us issuing more requests to the nodes in the cluster that are failing 
to keep up and respond to us.

Right, and I'm fine with that.  The goal is not to keep the replica completely 
out of trouble.  The goal is to keep the coordinator from falling over from 
buffering EM and MessagingService entries that it can't drain fast enough.  
Secondarily, this will help the replica too because our existing load shedding 
is fine at recovering from temporary spikes in load.  But our load shedding 
isn't good enough to save it when the coordinators keep throwing more at it 
when it's already overwhelmed.
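
A rough sketch of the before-send limit described here: count outstanding 
requests and refuse new ones past a threshold, before any mutation leaves the 
coordinator (the names and the exception are stand-ins, not the real 
OverloadedException path):

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative only: refuse new work before it is sent to replicas once the
    // number of outstanding (unacknowledged) requests crosses a limit.
    final class OutstandingRequestGate
    {
        private final AtomicInteger outstanding = new AtomicInteger();
        private final int maxOutstanding;

        OutstandingRequestGate(int maxOutstanding) { this.maxOutstanding = maxOutstanding; }

        void acquire()
        {
            if (outstanding.incrementAndGet() > maxOutstanding)
            {
                outstanding.decrementAndGet();
                // nothing has been sent yet, so the client can safely retry later
                throw new IllegalStateException("coordinator has too many requests in flight");
            }
        }

        void release() { outstanding.decrementAndGet(); }
    }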


was (Author: jbellis):
bq. it sounds like Jonathan is suggesting we simply prune our ExpiringMap based 
on bytes tracked as well as time?

No, I'm suggesting we abort requests more aggressively with OverloadedException 
*before sending them to replicas*.  One place this might make sense is 
sendToHintedEndpoints, where we already throw OE.

Right now we only throw OE once we start writing hints for a node that is in 
trouble.  This doesn't seem to be aggressive enough.  (Although, most of our 
users are on 2.0 where we allowed 8x as many hints in flight before starting to 
throttle.)

So, I am suggesting we also track requests outstanding (perhaps with the 
ExpiringMap as you suggest) as well and stop accepting requests once we hit a 
reasonable limit of "you can't possibly process more requests than this in 
parallel".

 The ExpiringMap requests are already in-flight and cannot be cancelled, so 
 their effect on other nodes cannot be rescinded, and imposing a limit does 
 not stop us issuing more requests to the nodes in the cluster that are 
 failing to keep up and respond to us.

Right, and I'm fine with that.  The goal is not to keep the replica completely 
out of trouble.  The goal is to keep the coordinator from falling over from 
buffering EM and MessagingService entries that it can't drain fast enough.  
Secondarily, this will help the replica too because our existing load shedding 
is fine at recovering from temporary spikes in load.  But our load shedding 
isn't good enough to save it when the coordinators keep throwing more at it 
when it's already overwhelmed.

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.1.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-05-08 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535449#comment-14535449
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 5/8/15 8:36 PM:
---

bq. It currently bounds the number of in flight requests low enough

Where?  Are you talking about the hint limit?

bq. I thought if we were capable of writing a hint we had to do it. 

The coordinator must write a hint *if a replica times out after it sends the 
mutation out.*  (Because otherwise we leave the client wondering what state the 
cluster is left in; it might be on all replicas, or on none.)  No hint is 
written for UnavailableException or OverloadedException, because we can 
guarantee the state -- it is on no replicas.
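
The rule above reduces to a simple condition; a sketch (the boolean inputs are 
illustrative, not actual Cassandra state):

    // Illustrative only: a hint is owed exactly when the mutation was sent and the
    // replica then timed out; Unavailable/Overloaded rejections happen before any
    // send, so the write is on no replica and no hint is written.
    final class HintRule
    {
        static boolean shouldWriteHint(boolean mutationSentToReplica, boolean replicaTimedOut)
        {
            return mutationSentToReplica && replicaTimedOut;
        }
    }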


was (Author: jbellis):
bq. It currently bounds the number of in flight requests low enough

Where?  Are you talking about the hint limit?

bq. I thought if we were capable of writing a hint we had to do it. 

We have to write a hint *if we send the mutation off to the replicas.*  
(Because otherwise we leave the client wondering what state the cluster is left 
in; it might be on all replicas, or on none.)  No hint is written for 
UnavailableException or OverloadedException, because we can guarantee the state 
-- it is on no replicas.

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.1.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-05-08 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535449#comment-14535449
 ] 

Jonathan Ellis edited comment on CASSANDRA-9318 at 5/8/15 8:36 PM:
---

bq. It currently bounds the number of in flight requests low enough

Where?  Are you talking about the hint limit?

bq. I thought if we were capable of writing a hint we had to do it. 

The coordinator must write a hint *if a replica times out after the coordinator 
sends the mutation out.*  (Because otherwise we leave the client wondering what 
state the cluster is left in; it might be on all replicas, or on none.)  No 
hint is written for UnavailableException or OverloadedException, because we can 
guarantee the state -- it is on no replicas.


was (Author: jbellis):
bq. It currently bounds the number of in flight requests low enough

Where?  Are you talking about the hint limit?

bq. I thought if we were capable of writing a hint we had to do it. 

The coordinator must write a hint *if a replica times out after it sends the 
mutation out.*  (Because otherwise we leave the client wondering what state the 
cluster is left in; it might be on all replicas, or on none.)  No hint is 
written for UnavailableException or OverloadedException, because we can 
guarantee the state -- it is on no replicas.

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.1.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator

2015-05-07 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533657#comment-14533657
 ] 

Benedict edited comment on CASSANDRA-9318 at 5/8/15 12:23 AM:
--

CL=1 (or any CL < ALL)


was (Author: benedict):
CL=1

 Bound the number of in-flight requests at the coordinator
 -

 Key: CASSANDRA-9318
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
 Fix For: 2.1.x


 It's possible to somewhat bound the amount of load accepted into the cluster 
 by bounding the number of in-flight requests and request bytes.
 An implementation might do something like track the number of outstanding 
 bytes and requests and if it reaches a high watermark disable read on client 
 connections until it goes back below some low watermark.
 Need to make sure that disabling read on the client connection won't 
 introduce other issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)