[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Hanna updated CASSANDRA-7032: Labels: LWT performance vnodes (was: performance vnodes) > Improve vnode allocation > > > Key: CASSANDRA-7032 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict >Assignee: Branimir Lambov >Priority: Major > Labels: LWT, performance, vnodes > Fix For: 3.0 alpha 1 > > Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, > TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, > TestVNodeAllocation.java > > > It's been known for a little while that random vnode allocation causes > hotspots of ownership. It should be possible to improve dramatically on this > with deterministic allocation. I have quickly thrown together a simple greedy > algorithm that allocates vnodes efficiently, and will repair hotspots in a > randomly allocated cluster gradually as more nodes are added, and also > ensures that token ranges are fairly evenly spread between nodes (somewhat > tunably so). The allocation still permits slight discrepancies in ownership, > but it is bound by the inverse of the size of the cluster (as opposed to > random allocation, which strangely gets worse as the cluster size increases). > I'm sure there is a decent dynamic programming solution to this that would be > even better. > If on joining the ring a new node were to CAS a shared table where a > canonical allocation of token ranges lives after running this (or a similar) > algorithm, we could then get guaranteed bounds on the ownership distribution > in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Yeschenko updated CASSANDRA-7032: - Fix Version/s: (was: 3.x) 3.0 beta 1 Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 beta 1 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Branimir Lambov updated CASSANDRA-7032: --- Attachment: TestVNodeAllocation.java Attached new version that does not depend on a fixed number of vnodes per token. It distributes the load proportionally to the number of tokens allocated to the node and aims to minimize the over/underutilization as a percentage of the node's load. The scheme is the same as the above, only in order to take varying loads into account the load of nodes and tokens in the improvement calculation is normalized via the inverse of the size of the node. The results achieved in asymmetric node configurations are very similar to the homogeneous scenario. The behaviour for fixed vnode counts is exactly the same as the previous version. I am done tinkering with the algorithm and have started work to attach it to the code base. Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-7032: -- Reviewer: Benedict Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Branimir Lambov updated CASSANDRA-7032: --- Attachment: TestVNodeAllocation.java New version with acceptable performance (adding a node to 1000 node 256 vnode cluster takes ~0.1s). No changes in the algorithm in this one, just restructures the way things are stored to replace a lot of previously complex operations with O(1) steps. Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Branimir Lambov updated CASSANDRA-7032: --- Attachment: TestVNodeAllocation.java A work-in-progress algorithm for selecting vnodes in the replicated case is attached. The main idea of the algorithm is to select token positions for each new vnode in such a way as to get best improvement in replicated load variance (i.e. standard deviation) across nodes and vnodes *1. More specifically, it prepares a selection of token positions to try (by picking the middle positions between existing vnodes *2), evaluates the expected improvement in variance for each selection and chooses the best *3, continuing until all the vnodes of the new node have assigned tokens. To improve average performance, the expected improvement for all choices is calculated once; for the second and later vnode we only recalculate it for the best candidate until we find one that does not deteriorate to worse than the next option in the list *4. Tested with simple factor-3 replication, it maintains the following utilization ranges: - 1 vnode: 70% - 135% - 4 vnodes: 80% - 115% - 16 vnodes: 83 - 106% - 64 vnodes: 86 - 103% - 256 vnodes: 87 - 102% Unlike random allocation, the overutilization does not grow with the number of nodes, and a much smaller number of vnodes suffice (4 or 8 vnodes would probably be enough for most usecases). The underutilization for this algorithm is affected less by the number of vnodes; this is due to the effect of replication on newly added vnodes: they necessarily have to take the share of one fewer vnode replica than the vnode they split (regardless of the algorithm we use, if we add a new node to a large enough perfectly balanced cluster where all vnodes are responsible for the same share of tokens, the new node will necessarily have at most 2/3 (for RF=3) of the average load.). This could possibly be improved if we manage to keep enough individual tokens with load closer to RF / (RF - 1), which I've yet to try. The algorithm is implemented in the {{ReplicationAwareTokenDistributor}} in the attached file. Running the file simulates the effect of adding nodes using this algorithm on a randomly-generated cluster and prints out the minimum and maximum per-node and per-token replicated load after each step, as well as the standard deviation of the load. Sample results: {code} Random generation of 500 nodes with 8 tokens each Size 500 node max 1.88 min 0.51 stddev 0.22193 Adding 1 node(s) using ReplicationAwareTokenDistributor Size 501 node max 1.90 min 0.51 stddev 0.21922Simple 3 replicas Adding 4 node(s) using ReplicationAwareTokenDistributor Size 505 node max 1.63 min 0.51 stddev 0.20580 token max 3.72 min 0.01 stddev 0.58768Simple 3 replicas Adding 15 node(s) using ReplicationAwareTokenDistributor Size 520 node max 1.51 min 0.53 stddev 0.17369 token max 3.83 min 0.01 stddev 0.54526Simple 3 replicas Adding 105 node(s) using ReplicationAwareTokenDistributor Size 625 node max 1.15 min 0.63 stddev 0.08069 token max 2.73 min 0.00 stddev 0.40190Simple 3 replicas Adding 375 node(s) using ReplicationAwareTokenDistributor Size 1000 node max 1.08 min 0.84 stddev 0.03041 token max 1.99 min 0.00 stddev 0.22341Simple 3 replicas Losing 1 nodes Size 999 node max 1.09 min 0.84 stddev 0.03081 token max 1.98 min 0.00 stddev 0.22429Simple 3 replicas Adding 1 node(s) using ReplicationAwareTokenDistributor Size 1000 node max 1.08 min 0.84 stddev 0.03019 token max 1.99 min 0.00 stddev 0.22335Simple 3 replicas Losing 5 nodes Size 995 node max 1.17 min 0.83 stddev 0.03380 token max 2.01 min 0.00 stddev 0.22565Simple 3 replicas Adding 5 node(s) using ReplicationAwareTokenDistributor Size 1000 node max 1.08 min 0.84 stddev 0.03000 token max 1.99 min 0.00 stddev 0.22181Simple 3 replicas Losing 20 nodes Size 980 node max 1.19 min 0.88 stddev 0.04362 token max 2.44 min 0.00 stddev 0.23370Simple 3 replicas Adding 20 node(s) using ReplicationAwareTokenDistributor Size 1000 node max 1.08 min 0.89 stddev 0.02962 token max 1.99 min 0.00 stddev 0.21681Simple 3 replicas Losing 125 nodes Size 875 node max 1.31 min 0.79 stddev 0.08499 token max 2.81 min 0.00 stddev 0.28763Simple 3 replicas Adding 125 node(s) using ReplicationAwareTokenDistributor Size 1000 node max 1.08 min 0.90 stddev 0.02805 token max 1.85 min 0.00 stddev 0.19258Simple 3 replicas {code} This is far from finished as it is much slower than I'd like it to be. Notes / other things I've tried: - *1 Only controlling individual vnode load: Because of the replication effect mentioned above, the ratio between largest and smallest node has to necessarily be at best 3:2 (for RF=3). If we don't control overall node size, about 30% over/underutilization is the best we can
[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Branimir Lambov updated CASSANDRA-7032: --- Attachment: TestVNodeAllocation.java Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7032: Assignee: Branimir Lambov Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7032: Attachment: TestVNodeAllocation.java Attached a very simple algorithm implementation, and some debug code for viewing the resulting cluster token range distribution for various sizes of cluster. Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7032: Attachment: (was: TestVNodeAllocation.java) Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7032: Attachment: TestVNodeAllocation.java Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7032: Attachment: TestVNodeAllocation.java Another updated version with improved statistics on disk distribution. Not up to scratch for CASSANDRA-6696 unfortunately. However there's a simpler algorithm still that's much better and trivial to implement: when we randomise our token selection, instead of randomising each token over the global token range, simply split the token range into vnode regions, and randomise a token uniformly within each region. This actually gives near optimal behaviour across all cluster sizes - disk distributions are still not perfect, on average having as much as 30% gap in max-to-min range allocated to each disk with many disks, but this can be improved by having more vnodes. Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.2#6252)