[jira] [Updated] (CASSANDRA-3852) use LIFO queueing policy when queue size exceeds thresholds

2018-11-18 Thread C. Scott Andreas (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

C. Scott Andreas updated CASSANDRA-3852:

Component/s: Core

> use LIFO queueing policy when queue size exceeds thresholds
> ---
>
> Key: CASSANDRA-3852
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3852
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Peter Schuller
>Priority: Major
>  Labels: performance
> Fix For: 4.x
>
>
> A strict FIFO policy for queueing (between stages) is detrimental to latency 
> and forward progress. Whenever a node is saturated beyond incoming request 
> rate, *all* requests become slow. If it is consistently saturated, you start 
> effectively timing out on *all* requests.
> A much better strategy from the point of view of latency is to serve a subset 
> requests quickly, and letting some time out, rather than letting all either 
> time out or be slow.
> Care must be taken such that:
> * We still guarantee that requests are processed reasonably timely (we 
> couldn't go strict LIFO for example as that would result in requests getting 
> stuck potentially forever on a loaded node).
> * Maybe, depending on the previous point's solution, ensure that some 
> requests bypass the policy and get prioritized (e.g., schema migrations, or 
> anything "internal" to a node).
> A possible implementation is to go LIFO whenever there are requests in the 
> queue that are older than N milliseconds (or a certain queue size, etc).
> Benefits:
> * All cases where the client is directly, or is indirectly affecting through 
> other layers, a system which has limited concurrency (e.g., thread pool size 
> of X to serve some incoming request rate), it is *much* better for a few 
> requests to time out while most are serviced quickly, than for all requests 
> to become slow, as it doesn't explode concurrency. Think any random 
> non-super-advanced php app, ruby web app, java servlet based app, etc. 
> Essentially, it optimizes very heavily for improved average latencies.
> * Systems with strict p95/p99/p999 requirements on latencies should greatly 
> benefit from such a policy. For example, suppose you have a system at 85% of 
> capacity, and it takes a write spike (or has a hiccup like GC pause, blocking 
> on a commit log write, etc). Suppose the hiccup racks up 500 ms worth of 
> requests. At 15% margin at steady state, that takes 500ms * 100/15 = 3.2 
> seconds to recover. Instead of *all* requests for an entire 3.2 second window 
> being slow, we'd serve requests quickly for 2.7 of those seconds, with the 
> incoming requests during that 500 ms interval being the ones primarily 
> affected. The flip side though is that once you're at the point where more 
> than N percent of requests end up having to wait for others to take LIFO 
> priority, the p(100-N) latencies will actually be *worse* than without this 
> change (but at this point you have to consider what the root reason for those 
> pXX requirements are).
> * In the case of complete saturation, it allows forward progress. Suppose 
> you're taking 25% more traffic than you are able to handle. Instead of 
> getting backed up and ending up essentially timing out *every single 
> request*, you will succeed in processing up to 75% of them (I say "up to" 
> because it depends; for example on a {{QUORUM}} request you need at least two 
> of the requests from the co-ordinator to succeed so the percentage is brought 
> down) and allowing clients to make forward progress and get work done, rather 
> than being stuck.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-3852) use LIFO queueing policy when queue size exceeds thresholds

2015-03-19 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-3852:

Assignee: (was: Benedict)

> use LIFO queueing policy when queue size exceeds thresholds
> ---
>
> Key: CASSANDRA-3852
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3852
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Peter Schuller
>  Labels: performance
> Fix For: 3.1
>
>
> A strict FIFO policy for queueing (between stages) is detrimental to latency 
> and forward progress. Whenever a node is saturated beyond incoming request 
> rate, *all* requests become slow. If it is consistently saturated, you start 
> effectively timing out on *all* requests.
> A much better strategy from the point of view of latency is to serve a subset 
> requests quickly, and letting some time out, rather than letting all either 
> time out or be slow.
> Care must be taken such that:
> * We still guarantee that requests are processed reasonably timely (we 
> couldn't go strict LIFO for example as that would result in requests getting 
> stuck potentially forever on a loaded node).
> * Maybe, depending on the previous point's solution, ensure that some 
> requests bypass the policy and get prioritized (e.g., schema migrations, or 
> anything "internal" to a node).
> A possible implementation is to go LIFO whenever there are requests in the 
> queue that are older than N milliseconds (or a certain queue size, etc).
> Benefits:
> * All cases where the client is directly, or is indirectly affecting through 
> other layers, a system which has limited concurrency (e.g., thread pool size 
> of X to serve some incoming request rate), it is *much* better for a few 
> requests to time out while most are serviced quickly, than for all requests 
> to become slow, as it doesn't explode concurrency. Think any random 
> non-super-advanced php app, ruby web app, java servlet based app, etc. 
> Essentially, it optimizes very heavily for improved average latencies.
> * Systems with strict p95/p99/p999 requirements on latencies should greatly 
> benefit from such a policy. For example, suppose you have a system at 85% of 
> capacity, and it takes a write spike (or has a hiccup like GC pause, blocking 
> on a commit log write, etc). Suppose the hiccup racks up 500 ms worth of 
> requests. At 15% margin at steady state, that takes 500ms * 100/15 = 3.2 
> seconds to recover. Instead of *all* requests for an entire 3.2 second window 
> being slow, we'd serve requests quickly for 2.7 of those seconds, with the 
> incoming requests during that 500 ms interval being the ones primarily 
> affected. The flip side though is that once you're at the point where more 
> than N percent of requests end up having to wait for others to take LIFO 
> priority, the p(100-N) latencies will actually be *worse* than without this 
> change (but at this point you have to consider what the root reason for those 
> pXX requirements are).
> * In the case of complete saturation, it allows forward progress. Suppose 
> you're taking 25% more traffic than you are able to handle. Instead of 
> getting backed up and ending up essentially timing out *every single 
> request*, you will succeed in processing up to 75% of them (I say "up to" 
> because it depends; for example on a {{QUORUM}} request you need at least two 
> of the requests from the co-ordinator to succeed so the percentage is brought 
> down) and allowing clients to make forward progress and get work done, rather 
> than being stuck.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-3852) use LIFO queueing policy when queue size exceeds thresholds

2015-03-17 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-3852:

Fix Version/s: (was: 3.0)
   3.1

> use LIFO queueing policy when queue size exceeds thresholds
> ---
>
> Key: CASSANDRA-3852
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3852
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Peter Schuller
>  Labels: performance
> Fix For: 3.1
>
>
> A strict FIFO policy for queueing (between stages) is detrimental to latency 
> and forward progress. Whenever a node is saturated beyond incoming request 
> rate, *all* requests become slow. If it is consistently saturated, you start 
> effectively timing out on *all* requests.
> A much better strategy from the point of view of latency is to serve a subset 
> requests quickly, and letting some time out, rather than letting all either 
> time out or be slow.
> Care must be taken such that:
> * We still guarantee that requests are processed reasonably timely (we 
> couldn't go strict LIFO for example as that would result in requests getting 
> stuck potentially forever on a loaded node).
> * Maybe, depending on the previous point's solution, ensure that some 
> requests bypass the policy and get prioritized (e.g., schema migrations, or 
> anything "internal" to a node).
> A possible implementation is to go LIFO whenever there are requests in the 
> queue that are older than N milliseconds (or a certain queue size, etc).
> Benefits:
> * All cases where the client is directly, or is indirectly affecting through 
> other layers, a system which has limited concurrency (e.g., thread pool size 
> of X to serve some incoming request rate), it is *much* better for a few 
> requests to time out while most are serviced quickly, than for all requests 
> to become slow, as it doesn't explode concurrency. Think any random 
> non-super-advanced php app, ruby web app, java servlet based app, etc. 
> Essentially, it optimizes very heavily for improved average latencies.
> * Systems with strict p95/p99/p999 requirements on latencies should greatly 
> benefit from such a policy. For example, suppose you have a system at 85% of 
> capacity, and it takes a write spike (or has a hiccup like GC pause, blocking 
> on a commit log write, etc). Suppose the hiccup racks up 500 ms worth of 
> requests. At 15% margin at steady state, that takes 500ms * 100/15 = 3.2 
> seconds to recover. Instead of *all* requests for an entire 3.2 second window 
> being slow, we'd serve requests quickly for 2.7 of those seconds, with the 
> incoming requests during that 500 ms interval being the ones primarily 
> affected. The flip side though is that once you're at the point where more 
> than N percent of requests end up having to wait for others to take LIFO 
> priority, the p(100-N) latencies will actually be *worse* than without this 
> change (but at this point you have to consider what the root reason for those 
> pXX requirements are).
> * In the case of complete saturation, it allows forward progress. Suppose 
> you're taking 25% more traffic than you are able to handle. Instead of 
> getting backed up and ending up essentially timing out *every single 
> request*, you will succeed in processing up to 75% of them (I say "up to" 
> because it depends; for example on a {{QUORUM}} request you need at least two 
> of the requests from the co-ordinator to succeed so the percentage is brought 
> down) and allowing clients to make forward progress and get work done, rather 
> than being stuck.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-3852) use LIFO queueing policy when queue size exceeds thresholds

2015-02-03 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-3852:
--
Assignee: (was: Peter Schuller)

> use LIFO queueing policy when queue size exceeds thresholds
> ---
>
> Key: CASSANDRA-3852
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3852
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Peter Schuller
>  Labels: performance
> Fix For: 3.0
>
>
> A strict FIFO policy for queueing (between stages) is detrimental to latency 
> and forward progress. Whenever a node is saturated beyond incoming request 
> rate, *all* requests become slow. If it is consistently saturated, you start 
> effectively timing out on *all* requests.
> A much better strategy from the point of view of latency is to serve a subset 
> requests quickly, and letting some time out, rather than letting all either 
> time out or be slow.
> Care must be taken such that:
> * We still guarantee that requests are processed reasonably timely (we 
> couldn't go strict LIFO for example as that would result in requests getting 
> stuck potentially forever on a loaded node).
> * Maybe, depending on the previous point's solution, ensure that some 
> requests bypass the policy and get prioritized (e.g., schema migrations, or 
> anything "internal" to a node).
> A possible implementation is to go LIFO whenever there are requests in the 
> queue that are older than N milliseconds (or a certain queue size, etc).
> Benefits:
> * All cases where the client is directly, or is indirectly affecting through 
> other layers, a system which has limited concurrency (e.g., thread pool size 
> of X to serve some incoming request rate), it is *much* better for a few 
> requests to time out while most are serviced quickly, than for all requests 
> to become slow, as it doesn't explode concurrency. Think any random 
> non-super-advanced php app, ruby web app, java servlet based app, etc. 
> Essentially, it optimizes very heavily for improved average latencies.
> * Systems with strict p95/p99/p999 requirements on latencies should greatly 
> benefit from such a policy. For example, suppose you have a system at 85% of 
> capacity, and it takes a write spike (or has a hiccup like GC pause, blocking 
> on a commit log write, etc). Suppose the hiccup racks up 500 ms worth of 
> requests. At 15% margin at steady state, that takes 500ms * 100/15 = 3.2 
> seconds to recover. Instead of *all* requests for an entire 3.2 second window 
> being slow, we'd serve requests quickly for 2.7 of those seconds, with the 
> incoming requests during that 500 ms interval being the ones primarily 
> affected. The flip side though is that once you're at the point where more 
> than N percent of requests end up having to wait for others to take LIFO 
> priority, the p(100-N) latencies will actually be *worse* than without this 
> change (but at this point you have to consider what the root reason for those 
> pXX requirements are).
> * In the case of complete saturation, it allows forward progress. Suppose 
> you're taking 25% more traffic than you are able to handle. Instead of 
> getting backed up and ending up essentially timing out *every single 
> request*, you will succeed in processing up to 75% of them (I say "up to" 
> because it depends; for example on a {{QUORUM}} request you need at least two 
> of the requests from the co-ordinator to succeed so the percentage is brought 
> down) and allowing clients to make forward progress and get work done, rather 
> than being stuck.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-3852) use LIFO queueing policy when queue size exceeds thresholds

2014-08-14 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-3852:


Fix Version/s: 3.0

> use LIFO queueing policy when queue size exceeds thresholds
> ---
>
> Key: CASSANDRA-3852
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3852
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Peter Schuller
>Assignee: Peter Schuller
>  Labels: performance
> Fix For: 3.0
>
>
> A strict FIFO policy for queueing (between stages) is detrimental to latency 
> and forward progress. Whenever a node is saturated beyond incoming request 
> rate, *all* requests become slow. If it is consistently saturated, you start 
> effectively timing out on *all* requests.
> A much better strategy from the point of view of latency is to serve a subset 
> requests quickly, and letting some time out, rather than letting all either 
> time out or be slow.
> Care must be taken such that:
> * We still guarantee that requests are processed reasonably timely (we 
> couldn't go strict LIFO for example as that would result in requests getting 
> stuck potentially forever on a loaded node).
> * Maybe, depending on the previous point's solution, ensure that some 
> requests bypass the policy and get prioritized (e.g., schema migrations, or 
> anything "internal" to a node).
> A possible implementation is to go LIFO whenever there are requests in the 
> queue that are older than N milliseconds (or a certain queue size, etc).
> Benefits:
> * All cases where the client is directly, or is indirectly affecting through 
> other layers, a system which has limited concurrency (e.g., thread pool size 
> of X to serve some incoming request rate), it is *much* better for a few 
> requests to time out while most are serviced quickly, than for all requests 
> to become slow, as it doesn't explode concurrency. Think any random 
> non-super-advanced php app, ruby web app, java servlet based app, etc. 
> Essentially, it optimizes very heavily for improved average latencies.
> * Systems with strict p95/p99/p999 requirements on latencies should greatly 
> benefit from such a policy. For example, suppose you have a system at 85% of 
> capacity, and it takes a write spike (or has a hiccup like GC pause, blocking 
> on a commit log write, etc). Suppose the hiccup racks up 500 ms worth of 
> requests. At 15% margin at steady state, that takes 500ms * 100/15 = 3.2 
> seconds to recover. Instead of *all* requests for an entire 3.2 second window 
> being slow, we'd serve requests quickly for 2.7 of those seconds, with the 
> incoming requests during that 500 ms interval being the ones primarily 
> affected. The flip side though is that once you're at the point where more 
> than N percent of requests end up having to wait for others to take LIFO 
> priority, the p(100-N) latencies will actually be *worse* than without this 
> change (but at this point you have to consider what the root reason for those 
> pXX requirements are).
> * In the case of complete saturation, it allows forward progress. Suppose 
> you're taking 25% more traffic than you are able to handle. Instead of 
> getting backed up and ending up essentially timing out *every single 
> request*, you will succeed in processing up to 75% of them (I say "up to" 
> because it depends; for example on a {{QUORUM}} request you need at least two 
> of the requests from the co-ordinator to succeed so the percentage is brought 
> down) and allowing clients to make forward progress and get work done, rather 
> than being stuck.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-3852) use LIFO queueing policy when queue size exceeds thresholds

2014-06-24 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-3852:
--

Labels: performance  (was: )

/cc [~benedict]

> use LIFO queueing policy when queue size exceeds thresholds
> ---
>
> Key: CASSANDRA-3852
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3852
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Peter Schuller
>Assignee: Peter Schuller
>  Labels: performance
>
> A strict FIFO policy for queueing (between stages) is detrimental to latency 
> and forward progress. Whenever a node is saturated beyond incoming request 
> rate, *all* requests become slow. If it is consistently saturated, you start 
> effectively timing out on *all* requests.
> A much better strategy from the point of view of latency is to serve a subset 
> requests quickly, and letting some time out, rather than letting all either 
> time out or be slow.
> Care must be taken such that:
> * We still guarantee that requests are processed reasonably timely (we 
> couldn't go strict LIFO for example as that would result in requests getting 
> stuck potentially forever on a loaded node).
> * Maybe, depending on the previous point's solution, ensure that some 
> requests bypass the policy and get prioritized (e.g., schema migrations, or 
> anything "internal" to a node).
> A possible implementation is to go LIFO whenever there are requests in the 
> queue that are older than N milliseconds (or a certain queue size, etc).
> Benefits:
> * All cases where the client is directly, or is indirectly affecting through 
> other layers, a system which has limited concurrency (e.g., thread pool size 
> of X to serve some incoming request rate), it is *much* better for a few 
> requests to time out while most are serviced quickly, than for all requests 
> to become slow, as it doesn't explode concurrency. Think any random 
> non-super-advanced php app, ruby web app, java servlet based app, etc. 
> Essentially, it optimizes very heavily for improved average latencies.
> * Systems with strict p95/p99/p999 requirements on latencies should greatly 
> benefit from such a policy. For example, suppose you have a system at 85% of 
> capacity, and it takes a write spike (or has a hiccup like GC pause, blocking 
> on a commit log write, etc). Suppose the hiccup racks up 500 ms worth of 
> requests. At 15% margin at steady state, that takes 500ms * 100/15 = 3.2 
> seconds to recover. Instead of *all* requests for an entire 3.2 second window 
> being slow, we'd serve requests quickly for 2.7 of those seconds, with the 
> incoming requests during that 500 ms interval being the ones primarily 
> affected. The flip side though is that once you're at the point where more 
> than N percent of requests end up having to wait for others to take LIFO 
> priority, the p(100-N) latencies will actually be *worse* than without this 
> change (but at this point you have to consider what the root reason for those 
> pXX requirements are).
> * In the case of complete saturation, it allows forward progress. Suppose 
> you're taking 25% more traffic than you are able to handle. Instead of 
> getting backed up and ending up essentially timing out *every single 
> request*, you will succeed in processing up to 75% of them (I say "up to" 
> because it depends; for example on a {{QUORUM}} request you need at least two 
> of the requests from the co-ordinator to succeed so the percentage is brought 
> down) and allowing clients to make forward progress and get work done, rather 
> than being stuck.



--
This message was sent by Atlassian JIRA
(v6.2#6252)