[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184132#comment-14184132 ]

Jonathan Ellis commented on CASSANDRA-7546:
--------------------------------------------

fixed CHANGES

> AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7546
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: graham sanderson
>            Assignee: graham sanderson
>             Fix For: 2.1.1
>
>         Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 7546.20_4.txt, 7546.20_5.txt,
>                      7546.20_6.txt, 7546.20_7.txt, 7546.20_7b.txt, 7546.20_alt.txt,
>                      7546.20_async.txt, 7546.21_v1.txt, cassandra-2.1-7546-v2.txt,
>                      cassandra-2.1-7546-v3.txt, cassandra-2.1-7546.txt, graph2_7546.png,
>                      graph3_7546.png, graph4_7546.png, graphs1.png, hint_spikes.png,
>                      suggestion1.txt, suggestion1_21.txt, young_gen_gc.png
>
> In order to preserve atomicity, this code attempts to read, clone/update, then CAS the state of the partition.
> Under heavy contention for updating a single partition this can cause some fairly staggering memory growth (the more cores on your machine, the worse it gets).
> Whilst many usage patterns don't do highly concurrent updates to the same partition, hinting today does, and in this case wild (order(s) of magnitude more than expected) memory allocation rates can be seen (especially when the updates being hinted are small updates to different partitions, which can happen very fast on their own) - see CASSANDRA-7545.
> It would be best to eliminate/reduce/limit the spinning memory allocation whilst not slowing down the very common un-contended case.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
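For readers new to the ticket, the pattern the description refers to is the classic optimistic copy-on-write loop. A minimal, hypothetical sketch follows - Snapshot and Update are illustrative stand-ins, not the actual AtomicSortedColumns types:

{code}
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only of the read/clone/CAS pattern described above.
final class SpinLoopSketch {
    static final class Update { final long size; Update(long size) { this.size = size; } }

    static final class Snapshot {
        static final Snapshot EMPTY = new Snapshot(0);
        final long size;
        Snapshot(long size) { this.size = size; }
        Snapshot cloneAndMerge(Update u) {
            // Stands in for cloning the sorted column container and merging
            // the update into the copy - a fresh allocation on every attempt.
            return new Snapshot(size + u.size);
        }
    }

    private final AtomicReference<Snapshot> ref = new AtomicReference<>(Snapshot.EMPTY);

    long addAllWithSizeDelta(Update update) {
        while (true) {
            Snapshot current = ref.get();
            Snapshot modified = current.cloneAndMerge(update); // allocated per attempt
            if (ref.compareAndSet(current, modified))
                return modified.size - current.size;
            // CAS lost: a concurrent writer won, the clone above becomes
            // garbage, and we spin again. Under heavy contention on a single
            // partition this is the staggering allocation rate described above.
        }
    }
}
{code}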
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175425#comment-14175425 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Thanks [~yukim]... note I just noticed that in CHANGES.txt this is recorded in the "merge from 2.0:" section.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168841#comment-14168841 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Actually, this is the first time I've looked at the Locks.java code in detail myself - it should probably not throw an AssertionError on failure (it should log instead), since it is optional - and maybe the methods should be renamed to indicate that they may be a no-op.
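A sketch of the failure handling being argued for here: acquire Unsafe reflectively and, if that fails, log once and degrade to a no-op rather than asserting. The class and method names are illustrative (note the "maybe" naming suggested above); the actual Locks.java in the tree may differ:

{code}
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical shape for an optional lock helper (Java 7/8 era Unsafe).
final class OptionalLocks {
    private static final Unsafe unsafe;
    static {
        Unsafe u = null;
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            u = (Unsafe) f.get(null);
        } catch (Exception e) {
            // Optional optimisation: log and carry on rather than assert.
            System.err.println("Unsafe unavailable; lock hints are a no-op: " + e);
        }
        unsafe = u;
    }

    // "maybe*" signals to callers that these may silently do nothing.
    static void maybeMonitorEnter(Object o) { if (unsafe != null) unsafe.monitorEnter(o); }
    static void maybeMonitorExit(Object o)  { if (unsafe != null) unsafe.monitorExit(o); }
}
{code}

If Unsafe is unavailable, callers simply keep spinning as before - which matches the "without Unsafe you just get the old behavior" point made later in the thread.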
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168489#comment-14168489 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

For what it's worth, I happened to be poking around the JVM source today debugging something, and so stopped to take a look - monitorEnter does indeed just revoke any bias and inflate the lock... so it seems perfectly fine for our purposes (since we expect lock contention anyway).
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167633#comment-14167633 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Just to be clear from the graphs - that is 70GB of GC during the 913-thread-count run!
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167613#comment-14167613 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Sorry [~yukim], I somehow missed your update - I'm about to attach the test results here... note they show much worse GC in native_obj than heap_buffers without the fix, I'm guessing because the spinning is much faster with native_obj.

As for monitorEnter/monitorExit, Benedict and I had a discussion about that above (I originally had it with either multiple copies of the code, or nested functions), but it complicated stuff, and I was unable to prove any issues with monitorEnter or monitorExit (or indeed reference any, other than some vague suspicions I had that maybe this excludes biased locking or anything else which assumes these are neatly paired in a stack frame). In any case we don't really care, because if we are using them we've already proved we're contended, and the monitor would be inflated anyway.

The other issue was the use of Unsafe, but Benedict seemed fine with that also, since without Unsafe (which most people have) you just get the old behavior.

So, I say go ahead and promote the fix as is (yes, current 2.1 trunk seemed to have Locks.java already added - I didn't diff them, but I peeked briefly and it looked about the same).

It is possible someone will find a usage scenario that this makes slower, in which case we can look at that, but I suspect, as mentioned before, that in all of the cases where we degrade performance the original performance is probably just on a lucky knife edge between under-utilization and a complete mess!

Finally, I'll summarize what Benedict said up above: whilst we could add a switch for this, this is really an internal implementation fix, the goal of which is that eventually there should be no bottleneck even when mutating the same partition (something he planned to address in version >= 3.0 with lazy updates and repair on read).
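For context, the overall shape of the fix being promoted - a bounded optimistic spin followed by a pessimistic monitor fallback, here using the OptionalLocks helper sketched earlier - might look roughly like the following. The attempt count and structure are assumptions for illustration, not the committed patch:

{code}
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical hybrid update: CAS a couple of times, then serialize on a
// monitor so contended losers stop burning allocations.
final class HybridUpdateSketch<S> {
    interface Merger<S> { S cloneAndMerge(S current); }

    private final AtomicReference<S> ref;
    HybridUpdateSketch(S initial) { ref = new AtomicReference<>(initial); }

    S update(Merger<S> merger) {
        // Fast path: bounded optimistic spin for the common un-contended case.
        for (int attempt = 0; attempt < 2; attempt++) {
            S current = ref.get();
            S modified = merger.cloneAndMerge(current);
            if (ref.compareAndSet(current, modified))
                return modified;
        }
        // Slow path: contention is proven, so the monitor is inflated anyway;
        // holding it means at most one wasted clone per waiter. If Unsafe is
        // unavailable these calls are no-ops and we simply keep spinning.
        OptionalLocks.maybeMonitorEnter(this);
        try {
            while (true) {
                S current = ref.get();
                S modified = merger.cloneAndMerge(current);
                if (ref.compareAndSet(current, modified))
                    return modified;
            }
        } finally {
            OptionalLocks.maybeMonitorExit(this);
        }
    }
}
{code}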
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164208#comment-14164208 ]

Yuki Morishita commented on CASSANDRA-7546:
--------------------------------------------

+1 to v2 (it lacks Locks.java, but I assume it is unchanged). My concern is the use of monitorEnter/monitorExit, as I'm not sure of their downsides, but I don't think I have a better alternative.

[~graham.sanderson] can I go ahead and commit to 2.1, or do you want me to wait until you do the native_objects test?
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153780#comment-14153780 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Just a little update: I have numbers for one node down & hinting with heap_buffers; I just need to re-run a few tests, since there were a couple of spurious points (which might have been due to not using a totally clean cluster every time - this is not a cluster I can easily re-create) that I want to verify before I post them.

Generally this patch thus far seems to be good, and while there is a non-"sweet spot" where it can be mildly harmful, that is basically on the knife edge of where you are almost overcommitting your hardware, which is probably not where people are hoping to be running.

The other point to note is that while the excess GC allocation here does not cause huge issues on its own, in a busy cluster that had a huge number of resident slabs to start off with, it can cause major knock-on GC headaches (with slabs spilling into old gen along with other garbage, etc.).

The GC issue isn't as much of a problem with the native allocators in 2.1 (though they do seem to become a bottleneck under high allocation rates); the fact that it is still generally faster with this patch suggests we should keep it on for those too.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150409#comment-14150409 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Busy week - I did the native_objects graphs. The patch actually really helps out here too - it seems native allocation starts taking a hit with too much concurrency.

I was about to do the hinting graphs, but cassandra-stress seems to be pulling the server names from the server (so I can't start it with one node down) - or maybe I can, and I should just ignore the errors (I just tried giving it 4/5 nodes on the command line).

What would you like me to do for n= ... I do have the full raw output for all these runs.

!graphs2_7546.png!
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141858#comment-14141858 ]

Benedict commented on CASSANDRA-7546:
--------------------------------------

In general the idea for the auto mode is to get a general overview of the various conditions, _especially_ when run in target uncertainty (err<) mode, which is the default. I've just committed a minor change, previously discussed, that supports running all thread counts in the range unconditionally; however, it will log a warning if you run this in target uncertainty mode, as the workloads will be different. Really we should be tearing down and rebuilding the cluster between runs.

However, it looks like the results are pretty much a wash for all modes except those where high contention on a single partition is to be expected. It's a bit strange that the .999%ile is higher with the patch for the highest thread counts but lower contention - that may be noise. Certainly the heap reduction looks promising.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141812#comment-14141812 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Make of this what you will (these are the 1-1024 partition runs with and without the patch, as mentioned above)... You can clearly see the higher memory usage without the patch. Beyond that there looks to be some noise from compaction. As expected, the patch helps under high contention... it doesn't seem to hurt at the low end (some of the low thread count results look like they might be cassandra-stress related), and I'm not sure yet whether the small differences at the middle thread counts are just noise.

!graphs1.png!
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141766#comment-14141766 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

I'll try and make a graph of the data I have so far at some point over the weekend anyway.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141764#comment-14141764 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Thanks - I updated, and have run 1/16/256/1024 partitions against both my baseline 2.1.1 and patched (with 7546.21_v1.txt) 2.1.1, using heap_buffers and all 5 nodes up. Things look promising so far; I still need to run with a node down (I assume I take it out of the seeds list), and also with native_objects/native_buffers... this is something I can do in parallel with other work, but it will still take some time.

Random cassandra-stress question: generally the threadCount at which it stops seems to be the one after it has started overloading the system. Maybe this is what is wanted for the final results, but the latency of this final run is not representative of the previous one or two thread counts, which were doing about the same number of ops/second (hence why it stopped). Not sure what the thinking is on that; I'm sure it has come up before.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140158#comment-14140158 ]

Benedict commented on CASSANDRA-7546:
--------------------------------------

Force-pushed another update that both enforces the sample size _if it is likely that multiple visits will be needed_, and reduces local contention by changing the saved seed position from an int[] to a scalar, which can be incremented much more cheaply.
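The scalar-versus-int[] point might look like the following hypothetical sketch (not the actual stress internals): a plain scalar field, with a field updater where atomicity is needed, avoids the indirection and extra allocation of an int[1] cell.

{code}
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Illustrative only: the kind of change described above.
final class SeedPosition {
    // before (hypothetical): int[] position = new int[1]; ... position[0]++;
    private volatile int position;

    private static final AtomicIntegerFieldUpdater<SeedPosition> POS =
            AtomicIntegerFieldUpdater.newUpdater(SeedPosition.class, "position");

    // Increment directly on the field - no boxed cell to chase or allocate.
    int next() { return POS.incrementAndGet(this); }
}
{code}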
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140087#comment-14140087 ]

Benedict commented on CASSANDRA-7546:
--------------------------------------

I meant to mention, but forgot, in case you were worried about this: for simplicity and performance, we don't guarantee that we only generate as many partitions as the sample defines; we only guarantee that when sampling we follow that distribution (and so will ignore any overshoot that we generated). Essentially any thread sampling the working set that hits _past the end of the set_ (i.e. either into an area not yet populated, or one that has been finished and not replaced) will asynchronously generate a new seed, write to it, and _then_ update the sample. This is because updating the sample is itself costly, and for workloads where the work is likely to be completed in one shot we don't want to incur that cost. That said, it should be quite possible to decide upfront whether the workload meets these characteristics and, if it doesn't (like this one), update the sample in advance.

There's also sort-of an off-by-1 error explaining the 1025, though. We're not subtracting the minimum index from the generated sample index, so with a distribution of 1..1024 we're never sampling index 0, and our sample size will be 1025. I've pushed a fix for this.
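A hypothetical reconstruction of the off-by-one being described - using a 1..1024 sampled id directly as a 0-based working-set index never visits slot 0 and implies 1025 slots; subtracting the distribution minimum restores 0..1023 (names are illustrative, not stress code):

{code}
import java.util.Random;

// Illustrative demo of the sample-index off-by-one and its fix.
final class SampleIndexSketch {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int min = 1, max = 1024;
        int sampled = min + rnd.nextInt(max - min + 1); // uniform 1..1024
        int buggy = sampled;        // 1..1024: index 0 unreachable, needs 1025 slots
        int fixed = sampled - min;  // 0..1023: exactly the 1024 partitions
        System.out.printf("sampled=%d buggy=%d fixed=%d%n", sampled, buggy, fixed);
    }
}
{code}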
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140011#comment-14140011 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Oh, I should mention the warmup ended up generating 20 partitions, and during the course of the whole test it got bumped to 21... maybe that'll give you an "aha" moment.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140007#comment-14140007 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Didn't want to deep dive, but out of curiosity I did do one run configured for a single partition:

{code}
Results:
op rate                   : 5760
partition rate            : 5760
row rate                  : 5760
latency mean              : 158.7
latency median            : 151.2
latency 95th percentile   : 221.5
latency 99th percentile   : 262.3
latency 99.9th percentile : 282.4
latency max               : 396.0
total gc count            : 3
total gc mb               : 18779
total gc time (s)         : 0
avg gc time(ms)           : 67
stdev gc time(ms)         : 26
Total operation time      : 00:00:35
Improvement over 609 threadCount: 4%

 id, total ops, adj row/s, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, gc: #, max ms, sum ms, sdv ms, mb
  4 threadCount, 6782,   -0,  120,  120,  120,  33.3,  43.8,  50.6,  63.0,  83.9,  85.7, 56.6, 0.01940, 0,   0,   0,  0,     0
  8 threadCount, 6629,   -0,  212,  212,  212,  37.7,  39.1,  57.0,  75.0, 127.2, 138.2, 31.3, 0.00868, 0,   0,   0,  0,     0
 16 threadCount, 27730,  -0,  566,  566,  566,  28.2,  26.2,  50.6,  75.7, 125.5, 170.4, 49.0, 0.01963, 0,   0,   0,  0,     0
 24 threadCount, 51763,  798,  796,  796,  796,  30.1,  29.5,  51.0,  76.9,  90.8, 144.4, 65.0, 0.01977, 2, 203, 203, 10, 12877
 36 threadCount, 74953,  -0, 1253, 1253, 1253,  28.7,  27.8,  50.7,  60.5,  79.6, 308.0, 59.8, 0.01938, 0,   0,   0,  0,     0
 54 threadCount, 56948,  -0, 1807, 1807, 1807,  29.8,  27.6,  52.6,  63.1,  78.1, 121.1, 31.5, 0.01170, 3, 176, 176, 12, 19816
 81 threadCount, 74856,  -0, 2369, 2369, 2369,  34.1,  33.2,  57.2,  67.6,  76.6, 108.6, 31.6, 0.00946, 0,   0,   0,  0,     0
121 threadCount, 100526, -0, 3158, 3158, 3158,  38.2,  37.8,  63.4,  78.9,  89.1, 446.6, 31.8, 0.01805, 2,  93,  93,  1, 13063
181 threadCount, 277875, -0, 4491, 4491, 4491,  40.2,  40.2,  63.1,  79.1,  94.0, 679.7, 61.9, 0.01985, 5, 286, 286, 28, 32541
271 threadCount, 169870, -0, 5205, 5205, 5205,  52.0,  49.2,  84.9, 110.5, 140.5, 843.9, 32.6, 0.01320, 3, 157, 157, 11, 19408
406 threadCount, 187985, 5648,    ,     ,     ,  73.0,  64.2, 122.1, 156.0, 285.3, 848.6, 33.8, 0.01421, 3, 173, 173, 12, 19570
609 threadCount, 201184, 5540, 5534, 5534, 5534, 110.1, 101.1, 160.5, 230.1, 378.9, 555.6, 36.4, 0.01917, 3, 163, 163, 17, 19709
913 threadCount, 205466, 5787, 5760, 5760, 5760, 158.7, 151.2, 221.5, 262.3, 282.4, 396.0, 35.7, 0.01335, 3, 200, 200, 26, 18779
{code}

Obviously I don't know if the slowdown is on the load end or the server end (though there is some GC increase here - we'll see what the patch for this issue does). Note that if this is still a synchronization problem with the load generator, we do know for a fact that hinting is a good way of turning a large partition domain into a small partition domain (so I'll obviously be testing that too, though that isn't apples to apples either).
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139986#comment-14139986 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

FYI, in case I didn't mention it: this is a 5-node cluster, and we're running LOCAL_QUORUM with replication factor 3.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139982#comment-14139982 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Ok, cool, thanks - I've upgraded my 2.1.0 to 2.1.1... {{7cfd3ed}} for what it's worth. I merged {{7964+7926}} into that and updated my load machine with it. I switched to 40x40x40x40 clustering keys as suggested and changed the 10M entries in the command line args to 256 accordingly (it now runs successfully). The output is below.

Note I ended up with 1275 partitions (and during the warmup I ended up with 1025, so there may be an off-by-one bug there too, either in stress or my config!)... I'm still not sure this is what we expect - each node has only seen about 3M mutations total (and I've run the stress test twice - once without the GC stuff working).

Anyway, let me know what you think - I won't be running more tests until tomorrow US time.

Another question - what do you usually do to get comparable results? Right now I have been blowing away the stresscql keyspace every time, to at least keep compaction out of the equation. Given the length of the cassandra-stress run, I'm not sure there is much to be gained by bouncing the cluster in between runs, but you probably know better, having used it before.

{code}
Results:
op rate                   : 10595
partition rate            : 10595
row rate                  : 10595
latency mean              : 85.8
latency median            : 49.9
latency 95th percentile   : 360.0
latency 99th percentile   : 417.9
latency 99.9th percentile : 491.9
latency max               : 552.2
total gc count            : 3
total gc mb               : 19471
total gc time (s)         : 0
avg gc time(ms)           : 67
stdev gc time(ms)         : 5
Total operation time      : 00:00:40
Improvement over 609 threadCount: -1%

 id, total ops, adj row/s, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, gc: #, max ms, sum ms, sdv ms, mb
  4 threadCount, 6939,   -0,   226,   226,   226, 17.6, 16.3,  40.3,  49.4,  51.1, 131.8, 30.6, 0.01464, 0,   0,   0,  0,     0
  8 threadCount, 11827,  385,   385,   385,   385, 20.7, 15.1,  47.5,  51.3,  82.1, 111.7, 30.7, 0.02511, 0,   0,   0,  0,     0
 16 threadCount, 19068,  -0,   612,   612,   612, 26.1, 28.8,  49.9,  60.6,  89.7, 172.1, 31.2, 0.01924, 0,   0,   0,  0,     0
 24 threadCount, 24441,  -0,   775,   775,   775, 30.9, 32.6,  52.1,  80.3,  88.3, 150.4, 31.5, 0.01508, 0,   0,   0,  0,     0
 36 threadCount, 36641,  -0,  1155,  1155,  1155, 31.1, 30.2,  59.0,  78.1,  89.7, 172.1, 31.7, 0.01127, 0,   0,   0,  0,     0
 54 threadCount, 55220,  -0,  1730,  1730,  1730, 31.1, 29.1,  54.5,  74.3,  84.3, 164.4, 31.9, 0.00883, 0,   0,   0,  0,     0
 81 threadCount, 83460,  -0,  2609,  2609,  2609, 31.0, 28.9,  51.2,  71.0,  79.2, 175.4, 32.0, 0.01678, 0,   0,   0,  0,     0
121 threadCount, 140705, -0,  4402,  4402,  4402, 27.4, 25.8,  49.7,  53.2,  70.3, 302.8, 32.0, 0.01438, 2, 462, 462, 11, 12889
181 threadCount, 226213, -0,  7116,  7116,  7116, 25.4, 24.2,  48.8,  51.8,  60.1, 279.0, 31.8, 0.01335, 1, 230, 230,  0,  6401
271 threadCount, 320658, -0, 10089, 10089, 10089, 26.8, 25.0,  48.3,  50.1,  57.4, 297.0, 31.8, 0.01256, 2, 425, 425, 14, 12786
406 threadCount, 342451, -0, 10609, 10609, 10609, 38.2, 40.3,  59.0,  77.5,  81.7, 142.4, 32.3, 0.00920, 0,   0,   0,  0,     0
609 threadCount, 381058, -0, 10651, 10651, 10651, 57.0, 48.6, 171.5, 224.4, 248.4, 342.0, 35.8, 0.01234, 1,  66,  66,  0,  6520
913 threadCount, 432518, -0, 10595, 10595, 10595, 85.8, 49.9, 360.0, 417.9, 491.9, 552.2, 40.8, 0.01471, 3, 200, 200,  5, 19471
END
{code}
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139341#comment-14139341 ]

Benedict commented on CASSANDRA-7546:
--------------------------------------

I've uploaded a patch [here|https://github.com/belliottsmith/cassandra/tree/7964-simultinserts], and another [here|https://github.com/belliottsmith/cassandra/tree/7964+7926] which combines it with another stress patch that reduces the risk of OOM (although this risk is pretty low, and almost certainly not what you were hitting - but as you scale the thread count up it becomes more of a risk).

The main 7964 patch includes a couple of small bug fixes as well, and I've tested it against your schema and some other related schemas that are trickier to process.

One thing I would suggest considering is expanding the clustering column count to increase the speed of generation, as 1200 items is still quite a few to create for only sending 1 item, which might end up reducing contention server side. Possibly reduce to only 30-40 items per tier.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137888#comment-14137888 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Yeah, this is only a cluster for my testing of this... I just don't want a massive breakage that stops it working completely! I'll install head.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137887#comment-14137887 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Hmm - I'll definitely have to try again - it didn't respond to SIGHUP or a non -F jstack, and isn't responding to Ctrl+C, so it's maybe close to OOM.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137885#comment-14137885 ]

Benedict commented on CASSANDRA-7546:
--------------------------------------

It's hard to say for certain, but glancing at CHANGES.txt, it looks like 2.1.1-HEAD is in the same ballpark of safe-to-run as 2.1.0. There are a lot of changes merged, but mostly for tools like cqlsh, and the things in the core application are pretty minor. I don't officially endorse it, though, since we only just shipped 2.1.0 and haven't had much time to QA 2.1.1.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137872#comment-14137872 ]

graham sanderson commented on CASSANDRA-7546:
----------------------------------------------

Cool, thanks, I'll wait on your patch (I have plenty of other things to do ;-) ). That said, am I relatively safe to upgrade the actual nodes to the current head of the 2.1 branch (and thus pick up your latest GC monitoring stuff) if I have a spare moment before then? Ideally I'd upgrade to the last commit in 2.1 that needs to be in place on the test nodes for the latest cassandra-stress to operate correctly.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137856#comment-14137856 ] Benedict commented on CASSANDRA-7546: - Hmm. This looks like a subtle "bug" in the latest stress when operating over such a small domain, but it also highlights a problem with using it for this workload - I may need to do some tweaking tomorrow to make it suitable. To keep our procedurally generated state for a partition intact, we only let one insert thread operate on a given partition at a time. If there is a conflict, we fall back to the underlying id distribution to avoid wasting time. This means that with a small domain we will steadily visit more and more partitions, but also that we will never have competing updates to the same partition, which is a glaring limitation (especially here). As it happens, with the latest version of the procedural generation it is reasonably easy to safely partition the work across multiple threads without mutual exclusivity, so I'll try to patch that ASAP.
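To make the conflict path concrete, here is a toy sketch (illustrative only - not stress's actual code; the class and method names are invented): the losing thread draws fresh seeds from the underlying id distribution, which is why a small domain steadily accumulates distinct partitions.
{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Toy model of the fallback described above: only one insert thread may
// operate on a partition at a time; a thread that loses the race draws fresh
// seeds from the underlying id distribution until it finds a free partition.
final class ExclusiveSeeds
{
    private final Set<Long> inFlight = ConcurrentHashMap.newKeySet();

    long acquire(long preferredSeed, LongSupplier idDistribution)
    {
        long seed = preferredSeed;
        while (!inFlight.add(seed))            // conflict: another thread owns this partition
            seed = idDistribution.getAsLong(); // fall back to the underlying id distribution
        return seed;
    }

    void release(long seed)
    {
        inFlight.remove(seed);
    }
}
{code}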
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1413#comment-1413 ] graham sanderson commented on CASSANDRA-7546: - OK, so I'm running the latest stress.jar on my load machine - given the number of changes to stress in 2.1.1 (and, by the looks of things, the addition of remote GC logging via cassandra-stress, which would be useful in this case), I guess I'll upgrade the cluster as well. Here is my current config (minus the comments) and the launch command... note there were some typos in our conversation above
{code}
keyspace: stresscql
keyspace_definition: |
  CREATE KEYSPACE stresscql WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
table: testtable
table_definition: |
  CREATE TABLE testtable (
        p text,
        c1 int,
        c2 int,
        c3 int,
        v blob,
        PRIMARY KEY(p, c1, c2, c3)
  ) WITH COMPACT STORAGE
    AND compaction = { 'class':'LeveledCompactionStrategy' }
    AND comment='TestTable'
columnspec:
  - name: p
    size: fixed(16)
  - name: c1
    cluster: fixed(100)
  - name: c2
    cluster: fixed(100)
  - name: c3
    cluster: fixed(1000) # note I made it slightly bigger since 10M is better than 1M for a max - 1M happens pretty quickly
  - name: v
    size: gaussian(50..250)
queries:
  simple1:
    cql: select * from testtable where k = ? and v = ? LIMIT 10
    fields: samerow
{code}
{code}
./cassandra-stress user profile=~/cqlstress-7546.yaml ops\(insert=1\) cl=LOCAL_QUORUM -node $NODES -mode native prepared cql3 -pop seq=1..10M -insert visits=fixed\(10M\) revisit=uniform\(1..1024\) | tee results/results-2.1.0-p1024-a.txt
{code}
As of right now, we're still (8 minutes later) at:
{code}
INFO 19:11:51 Using data-center name 'Austin' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
Connected to cluster: Austin Multi-Tenant Cassandra 1
INFO 19:11:51 New Cassandra host cassandra4.aus.vast.com/172.17.26.14:9042 added
Datatacenter: Austin; Host: cassandra4.aus.vast.com/172.17.26.14; Rack: 98.9
Datatacenter: Austin; Host: /172.17.26.15; Rack: 98.9
Datatacenter: Austin; Host: /172.17.26.13; Rack: 98.9
Datatacenter: Austin; Host: /172.17.26.12; Rack: 98.9
Datatacenter: Austin; Host: /172.17.26.11; Rack: 98.9
INFO 19:11:51 New Cassandra host /172.17.26.12:9042 added
INFO 19:11:51 New Cassandra host /172.17.26.11:9042 added
INFO 19:11:51 New Cassandra host /172.17.26.13:9042 added
INFO 19:11:51 New Cassandra host /172.17.26.15:9042 added
Created schema. Sleeping 5s for propagation.
Warming up insert with 25 iterations...
Failed to connect over JMX; not collecting these stats
Generating batches with [1..1] partitions and [1..1] rows (of [1000..1000] total rows in the partitions)
{code}
The number of distinct partitions is currently 2365 and growing. Is this what we expect? It doesn't seem like 250,000 should have exhausted any partitions?
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135637#comment-14135637 ] graham sanderson commented on CASSANDRA-7546: - OK, thanks Sylvain; yes, I was a bit confused (also because Benedict's changes included in the incorrect tag had a CHANGES.txt listing his new stress change among the 2.1.0 changes - which of course now makes sense); anyway... this is good news for me. I'll leave the test cluster on what I deployed (2.1.0-tentative == the real 2.1.0, as expected according to how the vote was looking at the time), and update stress.jar on my load machine to come from the head of 2.1.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135201#comment-14135201 ] Sylvain Lebresne commented on CASSANDRA-7546: -
bq. https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=log;h=f099e086f3f002789e24bd6c58e52b7553cd5381 is what was released according to the 2.1.0 tag in git vs what Sylvain Lebresne said in the email thread regarding no changes after c6a2c65a75adea9a62896269da98dd036c8e57f3 which was 2.1.0-tentative
I messed up when tagging it; it's the vote email that was correct, and I apologize for the confusion. I've updated the tag to reflect what was actually released.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135004#comment-14135004 ] Benedict commented on CASSANDRA-7546: - 1: that's great news :) 3: if you want lots of unique clustering key values per partition, stress currently has some limitations, and you will need/want multiple clustering columns for it to be able to generate them smoothly without taking donkey's years per insert (on the workload generation side). Its minimum unit of generation (not insertion) is a single tier of clustering values, so with your spec it would generate all 100B values each time you wanted to insert any number of them. So, you want to consider a yaml like this:
{noformat}
table_definition: |
  CREATE TABLE testtable (
        p text,
        c1 int,
        c2 int,
        c3 int,
        v blob,
        PRIMARY KEY(p, c1, c2, c3)
  ) WITH COMPACT STORAGE
    AND compaction = { 'class':'LeveledCompactionStrategy' }
    AND comment='TestTable'

columnspec:
  - name: p
    size: fixed(16)
  - name: c1
    cluster: fixed(100)
  - name: c2
    cluster: fixed(100)
  - name: c3
    cluster: fixed(100)
  - name: v
    size: gaussian(50..250)
{noformat}
Then you want to consider passing -pop seq=1..1M -insert visits=fixed(1M) revisits=uniform(1..1024). The visits parameter here tells stress to split each partition into 1M distinct inserts, which, given the deterministic 1M rows per partition, means exactly 1 row inserted per visit. The revisits distribution defines the number of partition keys we operate over at once, exhausting one before selecting another to include in our working set. Notice I've removed the population spec from your partition key in the yaml. This is because it is not necessary to constrain it here, as you can constrain the _seed_ population with the -pop parameter, which is the better way to do it here (so you can use the same yaml across runs). However, in this case, given our revisits() distribution, we can also leave the seed population unconstrained, since once our first 1024 partitions have been generated no other PK will be visited until one of these has been fully exhausted (i.e. 1024 * 1M inserts - quite a few...). You may also constrain the seed to the same range, which, once a key is exhausted, would always result in filling that key's slot in the working set straight back in. It doesn't matter what distribution you choose in this case, since stress will keep generating values until one not present in the stash crops up, which, if they operate over the same domain, can only result in 1 candidate regardless of distribution, so I suggest a sequential distribution to ensure determinism.
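Putting the suggested profile and flags together, the invocation would look something like the following (a sketch only: the profile path is a placeholder, $NODES is assumed to hold the node list, and note that the released stress spells the flag revisit=, as in the command graham runs above):
{code}
./cassandra-stress user profile=~/cqlstress-7546.yaml ops\(insert=1\) cl=LOCAL_QUORUM -node $NODES -mode native prepared cql3 -pop seq=1..1M -insert visits=fixed\(1M\) revisit=uniform\(1..1024\)
{code}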
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134695#comment-14134695 ] graham sanderson commented on CASSANDRA-7546: -
# https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=log;h=f099e086f3f002789e24bd6c58e52b7553cd5381 is what was released according to the 2.1.0 tag in git, despite what [~slebresne] said in the email thread regarding no changes after c6a2c65a75adea9a62896269da98dd036c8e57f3, which was 2.1.0-tentative
# OK, I'll try offheap_objects instead (or as well)
# I'm still a bit confused about visit/revisit (which are in the 2.1.0 tagged release)... I want to evenly spread the load across all my partitions (generally using a new clustering key every time), though I want to put a practical limit on the size of the partitions, so I was hoping to let it wrap at 10M or so unique clustering key values... so it sounds like I want visits=fixed(1) and revisits=not quite sure
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134674#comment-14134674 ] Benedict commented on CASSANDRA-7546: - Hi Graham, I must admit I'm a bit confused, and it's partially self-inflicted. In 2.1.1 we have changed stress again from what we released in 2.1.0, and I can't tell which version you're referring to, though it seems to be 2.1.1. Neither version has a 'visits' property in the yaml, but 2.1.1 supports -insert visits= revisit=, which are certainly functions worth exploring, and I recommend you use 2.1.1 for stress functionality either way. As far as using these functions is concerned, 'visits' splits a wide row up into multiple inserts: if a visits value of 10 is produced, and there are on average 100 rows generated for the partition, approximately 10 rows will be inserted, then the state of the partition will be stashed away, and the next insert that operates on that partition will pick up where the previous one left off. Which partition is operated on next is decided by the 'revisit' distribution, which selects from the stash of partially completed inserts, with a value of 1 selecting the most recently stashed (the max value of this distribution defines the total number of partitions to stash); if it ever selects outside of the current stash, a new partition is generated instead. So the value for 'visits' is related to the number of unique clustering key values you generate for each partition, whereas the value for 'revisit' is determined by how diverse the data you operate over in a given time window is. Separately, it's worth mentioning that offheap_objects is likely a better choice than offheap_buffers, since it is considerably more memory-dense.
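As a toy model of the visit/revisit mechanics just described (illustrative only - this is not stress's internal code; the class name, the uniform draw, and the fixed rows-per-visit are all assumptions made for the sketch):
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Toy model of 'visits'/'revisit': each partition is split into chunks of
// rowsPerVisit rows; a partially completed partition is stashed, and the
// revisit draw decides whether the next insert resumes a stashed partition
// (1 = most recently stashed) or, falling outside the stash, seeds a new one.
final class RevisitModel
{
    static final class PartitionState
    {
        final long seed;
        int rowsLeft;
        PartitionState(long seed, int rowsLeft) { this.seed = seed; this.rowsLeft = rowsLeft; }
    }

    private final List<PartitionState> stash = new ArrayList<>(); // index 0 = most recently stashed
    private final int maxRevisit;   // max of the revisit distribution = max partitions stashed
    private final int rowsPerVisit; // ~ rows per partition / visits
    private long nextSeed = 1;      // stand-in for the -pop seed stream

    RevisitModel(int maxRevisit, int rowsPerVisit)
    {
        this.maxRevisit = maxRevisit;
        this.rowsPerVisit = rowsPerVisit;
    }

    // Returns the seed of the partition the next insert operates on.
    long nextInsert(int rowsPerPartition)
    {
        int r = ThreadLocalRandom.current().nextInt(maxRevisit) + 1; // revisit ~ uniform(1..maxRevisit)
        PartitionState p = r <= stash.size()
                         ? stash.remove(r - 1)                               // resume a stashed partition
                         : new PartitionState(nextSeed++, rowsPerPartition); // outside the stash: new partition
        p.rowsLeft -= rowsPerVisit;
        if (p.rowsLeft > 0)
            stash.add(0, p); // partially complete: stash it, most recent first
        return p.seed;
    }
}
{code}
With maxRevisit=1024 and one row per visit, this reproduces the behaviour described above: at most 1024 partially completed partitions are in play, and a new one is only seeded when the draw falls outside the stash.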
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134635#comment-14134635 ] graham sanderson commented on CASSANDRA-7546: - Finally getting back to this; I've been doing other things (this is slightly lower priority as we have it in production already)... I just realized that the version c6a2c65a75ade being voted on for 2.1.0, which I deployed, is not the same as the 2.1.0 released. I am now upgrading, since cassandra-stress changes snuck in. Note that I plan to stress using 1024, 256, 16, and 1 partitions, with all 5 nodes up, and then with 4 nodes up and one down to test the effect of hinting (note the replication factor of 3 and cl=LOCAL_QUORUM). I want to do one cell insert per batch... I'm upgrading in part because of the new visit/revisit stuff - I'm not 100% sure how to use them correctly; I'll keep playing, but you may answer before I have finished upgrading and tried with this. My first attempt on the original 2.1.0 revision ended up with only one clustering key value per partition, which is not what I wanted (because it'll make trees small). Sample YAML for 1024 partitions:
{code}
#
# This is an example YAML profile for cassandra-stress
#
# insert data
# cassandra-stress user profile=/home/jake/stress1.yaml ops(insert=1)
#
# read, using query simple1:
# cassandra-stress profile=/home/jake/stress1.yaml ops(simple1=1)
#
# mixed workload (90/10)
# cassandra-stress user profile=/home/jake/stress1.yaml ops(insert=1,simple1=9)

#
# Keyspace info
#
keyspace: stresscql

#
# The CQL for creating a keyspace (optional if it already exists)
#
keyspace_definition: |
  CREATE KEYSPACE stresscql WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

#
# Table info
#
table: testtable

#
# The CQL for creating a table you wish to stress (optional if it already exists)
#
table_definition: |
  CREATE TABLE testtable (
        p text,
        c text,
        v blob,
        PRIMARY KEY(p, c)
  ) WITH COMPACT STORAGE
    AND compaction = { 'class':'LeveledCompactionStrategy' }
    AND comment='TestTable'

#
# Optional meta information on the generated columns in the above table
# The min and max only apply to text and blob types
# The distribution field represents the total unique population
# distribution of that column across rows.  Supported types are
#
#      EXP(min..max)                  An exponential distribution over the range [min..max]
#      EXTREME(min..max,shape)        An extreme value (Weibull) distribution over the range [min..max]
#      GAUSSIAN(min..max,stdvrng)     A gaussian/normal distribution, where mean=(min+max)/2, and stdev is (mean-min)/stdvrng
#      GAUSSIAN(min..max,mean,stdev)  A gaussian/normal distribution, with explicitly defined mean and stdev
#      UNIFORM(min..max)              A uniform distribution over the range [min, max]
#      FIXED(val)                     A fixed distribution, always returning the same value
#      Aliases: extr, gauss, normal, norm, weibull
#
#      If preceded by ~, the distribution is inverted
#
# Defaults for all columns are size: uniform(4..8), population: uniform(1..100B), cluster: fixed(1)
#
columnspec:
  - name: p
    size: fixed(16)
    population: uniform(1..1024) # the range of unique values to select for the field (default is 100Billion)
  - name: c
    size: fixed(26)
    #cluster: uniform(1..100B)
  - name: v
    size: gaussian(50..250)

insert:
  partitions: fixed(1)   # number of unique partitions to update in a single operation
                         # if batchcount > 1, multiple batches will be used but all partitions will
                         # occur in all batches (unless they finish early); only the row counts will vary
  batchtype: LOGGED      # type of batch to use
  visits: fixed(10M)     # not sure about this

queries:
  simple1: select * from testtable where k = ? and v = ? LIMIT 10
{code}
Command-line
{code}
./cassandra-stress user profile=~/cqlstress-1024.yaml ops\(insert=1\) cl=LOCAL_QUORUM -node $NODES -mode native prepared cql3 | tee results/results-2.1.0-p1024-a.txt
{code}
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123920#comment-14123920 ] graham sanderson commented on CASSANDRA-7546: - Yes, I'm actually waiting on one of our main Cassandra Ops guys to come back from vacation on Monday to upgrade one of our clusters to 2.1 before I can run the stress tests, but we do have the patch running in production on 2.0.x. It detects hints and, it would also seem (which makes sense), fast hint playback of things with low-cardinality keys. I will certainly change the log level to INFO or DEBUG, though, as this shouldn't really be a WARNING.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123917#comment-14123917 ] Benedict commented on CASSANDRA-7546: - Overall the patch LVGTM, though I'm not giving it an official +1 until I'm closer to 100%. I look forward to seeing the results.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117896#comment-14117896 ] graham sanderson commented on CASSANDRA-7546: - Hi Benedict, I hope you are OK and get well soon... it will likely be a week or two before we can prove in production that this is fixing the problem. I have also been on vacation and then sick, so I have a lot of other catching up to do. Once I have some time, I will play with the new stress-testing stuff in 2.1 along with this and try to get some firm evidence there. All I ask is that it doesn't get pushed to 3.0.x ;-)
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117882#comment-14117882 ] Benedict commented on CASSANDRA-7546: - Hi Graham, just an FYI: having had an accident, I won't be in a position to perform a formal review of something this critical for a little while. I just wanted to let you know I'm not ignoring progress, though, and will get to it soon enough.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117845#comment-14117845 ] graham sanderson commented on CASSANDRA-7546: - Ok, NP, we can do our own custom builds with it in 2.0.x... I'll make and attach a 2.1.x patch for this sensible (sensitive?) part of the code soon.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117276#comment-14117276 ] Sylvain Lebresne commented on CASSANDRA-7546: -
bq. Assuming all is well, then I would like to request this be targeted for 2.0.11 too
I'm afraid this is a bit too complex in a bit too sensible part of the code to be eligible for 2.0 at this point.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116226#comment-14116226 ] graham sanderson commented on CASSANDRA-7546: - In beta, the patch worked well at detecting hint activity. Next week we will put it on half the production nodes, to verify that those nodes don't go into memory-allocation craziness in response to hinting under heavy load. Assuming all is well, I would then like to request that this be targeted for 2.0.11 too.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092462#comment-14092462 ] graham sanderson commented on CASSANDRA-7546: - Had a lot going on... I have this running in beta right now (without the double counting), but haven't had a chance to deliberately test it with a node down. That said, it does detect OpsCenter.pdps in beta (we generally have OpsCenter turned off in production for high-volume stuff, and this would seem to validate our decision). Anyway, I myself am now on vacation for the next 10 days... I'd be super interested if we could see some results from 2.1.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089617#comment-14089617 ] graham sanderson commented on CASSANDRA-7546: - Doh - I should have asked about the double counting - I didn't see it; now I do.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088738#comment-14088738 ] graham sanderson commented on CASSANDRA-7546: - I ran my smoke test on this and it behaves as expected; I have added the patch (with a warn log statement at memtable flush if we have resorted to pessimistic concurrency for some rows) to our 2.0.9 beta env... I will try to repro there with a node down (though this cluster is pretty much limited by commit volumes under high load, so can't equal production concurrency); that said, I just want to check that everything is OK before I patch a single node in production (also 2.0.9). On a separate note (I don't have access to a 2.1 cluster ATM), it would be interesting to try something similar to http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster with a node down & hinting as a test case for this on 2.1.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087885#comment-14087885 ] Benedict commented on CASSANDRA-7546: - Sounds good, thanks!
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087850#comment-14087850 ] Benedict commented on CASSANDRA-7546: -
bq. We probably mean "to the left" of... "before" or "after" are a bit confusing here!
Yep, good catch!
bq. Volatile read of the wasteTracker in the "fast" path.
At the moment we mostly optimise for x86, and it's essentially free here, as you say. Even on platforms where it isn't, it's unlikely to be a significant part of the overall costs, so better to keep it cleaner.
bq. Adjacent-in-memory CASed vars in the AtomicSortedColumns - again, not majorly worried here... I don't think the (CASed) variables themselves are highly contended; it is more that we are doing lots of slow concurrent work and then failing the CAS.
Absolutely not worried about this. Like you say, most of the cost is elsewhere. It would be much worse to pollute the cache with padding to avoid it.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087841#comment-14087841 ] graham sanderson commented on CASSANDRA-7546: - Cool, will do; addColumns also CASes a thread-locally modified Holder anyway. Yes, I agree it is ugly to have a non-final field in something like Holder (which is CASed immutable state), but I think we can live with it since it is not mutated after the CAS. As said, we can revert to monitor enter/exit if you wish... I can't prove it is worse, and there isn't a whole lot that needs optimization here. Note you have a comment:
{quote}
in wasteTracker we maintain within EXCESS_WASTE_OFFSET either side of the current time
{quote}
We probably mean "to the left" of... "before" or "after" are a bit confusing here! I thought about a couple of things while you were on vacation:
# Volatile read of the wasteTracker in the "fast" path. We could avoid this through some ugliness, by hijacking the top bit in the tree size to mark pessimistic locking too. Not too concerned about this - I believe it is free on Intel anyway.
# Adjacent-in-memory CASed vars in the AtomicSortedColumns - again, not majorly worried here... I don't think the (CASed) variables themselves are highly contended; it is more that we are doing lots of slow concurrent work and then failing the CAS.
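For readers following the thread, the basic mechanism being agreed on is roughly the following (a minimal sketch, not the attached patch: Holder is a stand-in for the real cloned state, a fixed byte budget replaces the time-windowed wasteTracker/EXCESS_WASTE_OFFSET accounting discussed above, and a ReentrantLock stands in for the monitor enter/exit being debated):
{code}
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: state is an immutable Holder swapped in by CAS. Every failed CAS
// throws away the freshly allocated copy, so we account for that waste and,
// once a budget is exceeded, serialize further updates behind a lock so the
// spin loop stops burning allocation under heavy contention.
final class ContendedPartition
{
    static final class Holder
    {
        final long size; // stand-in for the cloned column/btree state
        Holder(long size) { this.size = size; }
    }

    private static final long WASTE_BUDGET_BYTES = 10L * 1024 * 1024; // illustrative threshold

    private final AtomicReference<Holder> ref = new AtomicReference<>(new Holder(0));
    private final AtomicLong wastedBytes = new AtomicLong(); // simplified: the patch tracks waste
                                                             // relative to a sliding time window
    private final ReentrantLock pessimistic = new ReentrantLock();

    long add(long delta, long estimatedCopyBytes)
    {
        boolean locked = wastedBytes.get() >= WASTE_BUDGET_BYTES; // partition already flagged contended
        if (locked)
            pessimistic.lock();
        try
        {
            while (true)
            {
                Holder cur = ref.get();
                Holder next = new Holder(cur.size + delta); // the clone/update: this is the allocation
                if (ref.compareAndSet(cur, next))
                    return next.size;
                // CAS lost: the copy above is garbage. Record the waste; once over
                // budget, take the lock so contended writers queue instead of spinning.
                if (wastedBytes.addAndGet(estimatedCopyBytes) >= WASTE_BUDGET_BYTES && !locked)
                {
                    pessimistic.lock();
                    locked = true;
                }
            }
        }
        finally
        {
            if (locked)
                pessimistic.unlock();
        }
    }
}
{code}
The un-contended fast path is unchanged - one volatile read of the waste counter plus the CAS - which is the property the thread is trying to preserve.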
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087814#comment-14087814 ] Benedict commented on CASSANDRA-7546: - Well, technically we never call addColumn() directly, but in 2.0 we haven't removed / UnsupportedOperationException'd that path, so I'm not totally comfortable leaving it as a regular int, as an external call to addColumn would break it (but then, this probably isn't the end of the world). However, I actually introduced a double-counting bug in changing that :/ ... and since we don't want to incur the incAndGet on every change, and we don't want to dup code, let's settle for the possible race in maintaining size if somebody uses the API in a way it isn't used in the codebase right now. However, I think I would prefer to make size final in this case.
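For readers skimming: the race being tolerated is the classic lost update on a plain field (a toy illustration only, not the patch; the class is invented):
{code}
// Two threads calling the legacy addColumn() path concurrently can both read
// the same old value of size, and one increment is lost - tolerable here only
// because the codebase never actually calls addColumn() directly.
final class SizeRace
{
    private int size; // plain field: 'size += delta' is a non-atomic read/modify/write

    void addColumn(int delta)
    {
        size += delta; // lost update possible under concurrent callers
    }

    int size() { return size; }
}
{code}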
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087799#comment-14087799 ] graham sanderson commented on CASSANDRA-7546: - +1 on another set of eyes (yes, the isSynchronized is ugly) - that said, I can move ahead on testing the main functionality of this patch (the waste detection), since I think we are all agreed on the basic mechanism. I am reading your patch (thanks for cleaning up - mine was a bit verbose, for discussion purposes); I will read it in more detail now, but just from an initial glance at its raw form: why did you make the size in Holder volatile/atomically updated? The Holder instances should only be mutated by a single thread.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077057#comment-14077057 ] graham sanderson commented on CASSANDRA-7546: - Ok, thank you... yeah, my only reason for recording something in the actual codebase was to indicate to the user that they had ultra-heavy partition contention that might be detrimental to performance, and that they should perhaps review their schema. Given that this may not be the case at all in 3.0 (i.e. it may be gracefully handled in all cases), I'll try it out locally with a WARN statement instead. I'll probably do it at memtable flush anyway, which has more useful context (e.g. the CF in question) and would be less spammy (i.e. one WARN with the number of contended partitions, though perhaps the contended key(s) are interesting at a lower log level)... whether we include such logging in the final patch I don't know.
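A rough sketch of what that flush-time WARN might look like - the logger, counter, and hook names are hypothetical placeholders, not from any attached patch:
{code}
import java.util.concurrent.atomic.AtomicInteger;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical flush-time logging sketch; all names are illustrative.
final class ContentionLoggingSketch
{
    private static final Logger logger = LoggerFactory.getLogger(ContentionLoggingSketch.class);

    // bumped once per partition that crosses the contention threshold
    final AtomicInteger contendedPartitions = new AtomicInteger();

    // called once when the memtable is flushed, so we emit a single WARN
    // per flush rather than one per contended update
    void onFlush(String columnFamilyName)
    {
        int contended = contendedPartitions.get();
        if (contended > 0)
            logger.warn("{} contended partition(s) in {}; heavy concurrent updates to " +
                        "single partitions may hurt performance - consider reviewing the schema",
                        contended, columnFamilyName);
    }
}
{code}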
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077018#comment-14077018 ] Benedict commented on CASSANDRA-7546: - My biggest concern with metrics is that what we expose as a metric will probably change when we change tack to a lock-free lazy-update design, since it will be more expensive to maintain. Certainly tracking the amount of 'wasted' work will be meaningless then, although possibly we could track the raw occurrences of failure to make a change atomically without interference (which in the lazy case would be failure to acquire exclusivity to merge your changes in). I'm currently on holiday, but will try to review your patch shortly.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074127#comment-14074127 ] graham sanderson commented on CASSANDRA-7546: - Actually, looking at my numbers here on the production-level h/w, I certainly don't think the numbers are too aggressive (i.e. if anything they kick in too late), but as I say it'd be nice to actually watch this in the real world.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074124#comment-14074124 ] graham sanderson commented on CASSANDRA-7546: - Once again numbers - note I'm still using the same test driver as before (hence the 0 up/down, count numbers etc), though I have updated it to pass a column cloner in the transform.
{code}
[junit] --
[junit] 1 THREAD; ELEMENT SIZE 64
[junit]
[junit] Threads = 1 elements = 10 (of size 64) partitions = 1
[junit] original code:
[junit] Duration = 1020ms maxConcurrency = 1
[junit] GC for PS Scavenge: 37 ms for 3 collections
[junit] Approx allocation = 589MB vs 8MB; ratio to raw data size = 73.61468285714285
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 963ms maxConcurrency = 1
[junit] GC for PS Scavenge: 22 ms for 2 collections
[junit] Approx allocation = 584MB vs 8MB; ratio to raw data size = 72.99738571428571
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit]
[junit] Threads = 1 elements = 10 (of size 64) partitions = 16
[junit] original code:
[junit] Duration = 826ms maxConcurrency = 1
[junit] GC for PS Scavenge: 24 ms for 2 collections
[junit] Approx allocation = 496MB vs 8MB; ratio to raw data size = 61.99165047619048
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 746ms maxConcurrency = 1
[junit] GC for PS Scavenge: 25 ms for 2 collections
[junit] Approx allocation = 477MB vs 8MB; ratio to raw data size = 59.63136380952381
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit]
[junit] Threads = 1 elements = 10 (of size 64) partitions = 256
[junit] original code:
[junit] Duration = 617ms maxConcurrency = 1
[junit] GC for PS Scavenge: 11 ms for 1 collections
[junit] Approx allocation = 362MB vs 8MB; ratio to raw data size = 45.24315523809524
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 602ms maxConcurrency = 1
[junit] GC for PS Scavenge: 11 ms for 1 collections
[junit] Approx allocation = 366MB vs 8MB; ratio to raw data size = 45.77833523809524
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit]
[junit] Threads = 1 elements = 10 (of size 64) partitions = 1024
[junit] original code:
[junit] Duration = 443ms maxConcurrency = 1
[junit] GC for PS Scavenge: 11 ms for 1 collections
[junit] Approx allocation = 308MB vs 8MB; ratio to raw data size = 38.4688464
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 422ms maxConcurrency = 1
[junit] GC for PS Scavenge: 10 ms for 1 collections
[junit] Approx allocation = 309MB vs 8MB; ratio to raw data size = 38.667831428571425
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit]
[junit] --
[junit] 100 THREADS; ELEMENT SIZE 64
[junit]
[junit] Threads = 100 elements = 10 (of size 64) partitions = 1
[junit] original code:
[junit] Duration = 2039ms maxConcurrency = 100
[junit] GC for PS Scavenge: 118 ms for 34 collections
[junit] Approx allocation = 11178MB vs 8MB; ratio to raw data size = 1395.417500952381
[junit] loopRatio (closest to 1 best) 18.20478 raw 10/1820478 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 1299ms maxConcurrency = 100
[junit] GC for PS Scavenge: 14 ms for 1 collections
[junit] Approx allocation = 614MB vs 8MB; ratio to raw data size = 76.68355047619048
[junit] loopRatio (closest to 1 best) 1.05291 raw 779/6045 counted 0/0 sync 99246/99246 up 0 down 0
[junit]
[junit]
[junit] Threads = 100 elements = 10 (of size 64) partitions = 16
[junit] original code:
[junit] Duration = 224ms maxConcurrency = 100
[junit] GC for PS Scavenge: 22 ms for 2 collections
[junit] Approx allocation = 832MB vs 8MB; ratio to raw data size = 103.971206
[junit] loopRatio (closest to 1 best) 1.89634 raw 10/189634 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 226ms maxConcurrency = 99
[junit] GC for PS Scavenge: 22 ms for 2 collections
[junit] Approx allocation = 810MB vs 8MB; ratio to raw data size = 101.20042857142857
[junit] loopRatio (closest to 1 best) 1.92036 r
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072718#comment-14072718 ] graham sanderson commented on CASSANDRA-7546: - Cool - makes sense now. It'll have to be tomorrow, but I'll put up a new version.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072535#comment-14072535 ] Benedict commented on CASSANDRA-7546: - Yes, I'm referring to the memtables - an AtomicSortedColumns instance lives until its containing memtable is flushed. 100MB/s is around 1M snaptree node allocations, so that is maybe a little high for deciding there's too much competition (although with ~1000 items present this is only 100k inserts), so how about we fix it to 10MB/s, to be exceeded by 10MB. We could certainly hit 100MB of waste, no trouble (under high competition we'll see orders of magnitude more wasted than used, and memtables usually store 1GB+), but I think it's better to trigger a little more readily.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072530#comment-14072530 ] graham sanderson commented on CASSANDRA-7546: - Good point - I was mixing the two types of memory allocation in my head... that said, when we see this in production I don't know how long each AtomicSortedColumns instance lives.
bq. they stick around until they fill up
I assume you are referring to the memtables there... what defines "full", besides:
- there is a hard(ish) memory limit in yaml
- MeteredFlusher flushes high-traffic stuff
Basically, I'm just checking that we don't think our 100MB/s wastage trigger may never fire due to aggressive flushing... theoretically we must be wasting MUCH more than we are really writing, but I don't have numbers (I could look at the logs to get them) to see how often hints memtables were being flushed during this process and how big they were.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072494#comment-14072494 ] Benedict commented on CASSANDRA-7546: - Under load they don't last very long (i.e. they stick around until they fill up, which can be just a few minutes, or even faster under really high load); however, we don't care about how much we're allocating _to the memtable_ - we care about how much memory we allocate wastefully that does _not_ make it into the memtable, i.e. all that GC overhead you were seeing - in the worst case you saw 12GB in only 2.5s against one partition. So whatever numbers we fix for this scheme, we will avoid anything like that kind of extreme scenario.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072483#comment-14072483 ] graham sanderson commented on CASSANDRA-7546: - Ignoring the monotonic bit ;-) as you say, it has to be relative to something anyway.
bq. AtomicBTreeColumns is unlikely to live past 3.1
Sorry, I meant: how long is an instance of one of those classes likely to last? i.e. is it possible to see 100MB of allocation into one single instance, or would another instance have taken over by then? I assume since you are suggesting it that it is possible, but thought I'd double-check that that is what you mean.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072469#comment-14072469 ] Benedict commented on CASSANDRA-7546: - bq. it is the low 64 bits of a monotonic number
That's pretty pedantic, since with nanos that stretches to 600 years before overflow! Either way, I'm not sure if I clarified this or not, but we should be offsetting this number from the memtable creation time so we can safely stick within 32 bits. I suggest we use the top bit being set as the indicator that we've hit contention, so we naturally avoid problematic overflow (although really overflow would just result in our optimisation not running properly, so it would also be fine).
bq. how long you expect AtomicSorted/BTreeColumns to last
AtomicBTreeColumns is unlikely to live past 3.1. I would like to get rid of it in 3.0, but that is probably ambitious. So another year or so at the bleeding edge; a few more years at various stages downstream, no doubt. AtomicSortedColumns will be around as long as 2.0.x is, which is decided by the community really. Either way, tuning this value is probably not super helpful, since the goal is simply to avoid lots of wasted memory allocations. We can simply define a sensible, slightly cautious criterion for this, and that should be sufficient, since if we are slightly overly cautious the end result is only a small number of partitions seeing slightly reduced throughput for writes. It is not a huge deal either way. It's only really likely to have a measurable impact at all on very highly contended partitions, on which any sane value will likely yield a very similar improvement.
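A minimal sketch of the "offset from memtable creation, top bit as contention flag" idea above - the class and unit choices are assumptions for illustration, not the actual patch:
{code}
import java.util.concurrent.TimeUnit;

// Sketch only: one int of per-partition state, rebased on memtable creation.
final class ContentionClock
{
    private final long creationNanos = System.nanoTime();

    // Milliseconds since the memtable was created; fits comfortably in an
    // int for any realistic memtable lifetime (2^31 ms is ~24 days).
    int now()
    {
        return (int) TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - creationNanos);
    }

    // Top bit set = "this partition is contended". Because elapsed times are
    // non-negative, a negative state can never be confused with a timestamp,
    // and overflow merely disables the optimisation rather than breaking it.
    static final int CONTENDED = Integer.MIN_VALUE;

    static boolean isContended(int state)
    {
        return state < 0;
    }
}
{code}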
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072465#comment-14072465 ] graham sanderson commented on CASSANDRA-7546: - OK - I had something else come up today, but yeah, I realized my math was wrong... it is certainly a bit of a massage to fit the correct fidelity of information within 32 bits, without overflowing too soon, and with enough padding that bursty allocation under the sustained limit doesn't cause problems.
bq. It is monotonic; that's its main purpose.
I guess we (me?) are being pedantic here... it is the low 64 bits of a monotonic number (even this was broken on early OS/JVM combinations due to bugs, but we can take it as fact now, I think); what the actual number is is undefined. It does seem on UNIX variants to be rebased to nanoseconds since epoch, and on all modern systems it is probably some counter that was reset at least on power cycle, so you are probably ok. In any case, doing the right thing is pretty much always trivial (assuming you don't expect your JVM to run for 200+ years). -- As an aside, can you give me a hint as to how long you expect AtomicSorted/BTreeColumns to last... tuning does seem critical here, since wasting 100MB would probably be a reasonable value, but I don't know in practice whether something else would likely end up flushing the memtable before it ever got that far.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072225#comment-14072225 ] Benedict commented on CASSANDRA-7546: - bq. however I assume that the end result is that you don't want either the Atomic***Columns or the Holder object to grow at all (i.e. another 8 bytes), and I'm assuming you're calculating space based on compressedoops object layout
Right, yes. There's room for one 'free' 32-bit value in AtomicBTreeColumns, is what I meant.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072180#comment-14072180 ] Benedict commented on CASSANDRA-7546: - Well, actually the scheme I outlined isn't _exactly_ requiring a rate of 100MB/s; all that actually needs to happen is for it to consistently exceed a rate of 1MB/s by a total allocation of 100MB (which can happen if > 100MB are allocated in < 1 second, i.e. 100MB/s, but also if 110MB is allocated over 10s). We can tweak those numbers however we like (within some window of representable numbers with enough range). For instance: exceed a rate of 10MB/s consistently by a total of 10MB, which would require e.g. dividing our bytes allocated by 1k, measuring time in 100µs intervals, and offsetting the present by 10 * 1024. To capture a rate of 100MB/s, we would need to either expect that memtables never live for more than 0.5 days (probably reasonable, i.e. represent time in 10µs intervals) or require that a single mutator allocates 10k in one run (also quite reasonable), but we're pushing the limits of what we can safely represent.
bq. nanoTime is not monotonic
It is monotonic; that's its main purpose. Although there are no doubt caveats on a given machine/processor for how strictly that is guaranteed.
bq. which clones are you talking about
Mistype. I mean the number/size of objects we estimate we've allocated wastefully. We can estimate this in 2.0 with 200+100*lg2(N), and in 2.1 we measure it exactly.
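A worked example of the unit arithmetic above (values taken from the discussion; names are illustrative). The shared counter stores "ticks"; each wasted KB advances it one tick, so the permitted sustained rate is 1KB per tick length, and the initial offset is the total excess allowed before flipping:
{code}
// Sketch: the two parameterizations discussed above, not a committed patch.
public final class WasteBudgetExample
{
    // Variant 1: 1ms ticks, waste in KB -> rate = 1KB/ms = ~1MB/s sustained,
    // offset = 100 * 1024 ticks = ~100MB of excess allowed.
    static final long OFFSET_V1 = 100 * 1024;

    // Variant 2: 100us ticks, waste in KB -> rate = 1KB/100us = ~10MB/s,
    // offset = 10 * 1024 ticks = ~10MB of excess allowed.
    static final long OFFSET_V2 = 10 * 1024;

    public static void main(String[] args)
    {
        // The "110MB over 10s" example: 10s = 10,000 ticks of allowance at
        // ~1MB/s (~10MB), plus the ~100MB offset, before the counter catches
        // up with the present and we flip to locking.
        long elapsedTicksV1 = 10_000; // 10 seconds in 1ms ticks
        long budgetKb = elapsedTicksV1 + OFFSET_V1;
        System.out.println("variant 1 trips after ~" + budgetKb / 1024 + "MB over 10s");
    }
}
{code}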
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072084#comment-14072084 ] graham sanderson commented on CASSANDRA-7546: - duh - I'm an idiot: your scheme catches the 100MB/s allocation-waste rate without actually having to get there!
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072081#comment-14072081 ] graham sanderson commented on CASSANDRA-7546: - bq. It doesn't look to me like we re-copy the ranges (only the arrays we store them in)
Oops, yeah, you are correct.
{quote} I would rather we didn't increase the amount of memory we use. In 2.1 I'm stricter about this, because in 2.0 we can mitigate it by replacing AtomicReference with a volatile and an AtomicReferenceFieldUpdater. But whatever we do in 2.1 has to be free memory-wise. This means we have 1 integer or 1 reference to play with in the outer class (not the holder), as we can get this for free. We don't need to maintain a size in 2.1 though, so this is easy. We can track the actual amount of memory allocated (since we already do this). {quote}
I'm all for not wasting memory; after all, this is what this patch is about. I'm not sure exactly what "2.1 has to be _free_ memory-wise" means... however, I assume the end result is that you don't want either the Atomic***Columns or the Holder object to grow at all (i.e. by another 8 bytes), and I'm assuming you're calculating space based on compressed-oops object layout (so we may have a chance to fill in a spare 32-bit value somewhere; I'll have to check the two classes in the 2.0 and 2.1 cases). Note the reason I'm confused about "free" is that the Object[]s for the btree are on-heap things and we allocate quite a lot of them. Perhaps by free you mean no increase in memory usage vs today for this change.
bq. get the current time in ms (but from nanoTime since we need monotonicity);
Also slightly confused: nanoTime is not monotonic, but nanoTime minus some static base nanoTime is for all practical purposes, so I assume you mean this. Based on that, I guess we can use Integer.MIN_VALUE as a "no one has wasted work yet" flag.
bq. In 2.0 we multiply the number of updates we had made by lg2(N) (N = current tree size), and multiply this by 100 (approximate size of snaptree nodes) + ~200 per clone
Do you mean individual column attempts? Which clones are you talking about? I have currently moved them outside the loop, which allowed for pre-sharing and for shrinking the locked work later, but this extra int[] is not free (unless we are only talking about retained space vs temporary). I guess we should probably always round up to 1K... that would still be 100,000 CAS fails a second, which is certainly bad. Anyway, I'll double-check the allocation costs in 2.0.x, use an atomic field updater, and make a 2.0.x patch (and see how it behaves). Now "max rate" sounds more like something that should be exposable via config (though since it is an implementation detail that will go away eventually, it doesn't make sense to make it a per-CF thing)... I'll run my test again to see what a good value seems to be. But yeah, if something ever wastes 100MB/s, I think we can mark it as "special". Note, the one other question I have is how big a single Atomic***Columns instance can get - i.e. is it even possible to allocate 100MB into one, or do they turn over too fast?
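For reference, a minimal sketch of the memory-saving swap discussed above - a volatile field plus a static AtomicReferenceFieldUpdater gives the same CAS semantics as an AtomicReference without the extra wrapper object per partition. Class and field names are illustrative, not the actual code:
{code}
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

// Sketch only; stand-in types, not AtomicSortedColumns itself.
final class AtomicColumnsSketch
{
    static final class Holder {} // stand-in for the real Holder

    // must be volatile for the field updater to work
    private volatile Holder ref;

    private static final AtomicReferenceFieldUpdater<AtomicColumnsSketch, Holder> REF_UPDATER =
        AtomicReferenceFieldUpdater.newUpdater(AtomicColumnsSketch.class, Holder.class, "ref");

    boolean publish(Holder expected, Holder next)
    {
        // Identical semantics to AtomicReference.compareAndSet, but the
        // Holder reference lives inline in this object: one less allocation
        // and one less pointer hop per partition.
        return REF_UPDATER.compareAndSet(this, expected, next);
    }
}
{code}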
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071538#comment-14071538 ] Benedict commented on CASSANDRA-7546: - It doesn't look to me like we re-copy the ranges (only the arrays we store them in).
On your patch, I have a couple of minor concerns:
* I would rather we didn't increase the amount of memory we use. In 2.1 I'm stricter about this, because in 2.0 we can mitigate it by replacing AtomicReference with a volatile and an AtomicReferenceFieldUpdater. But whatever we do in 2.1 has to be free memory-wise. This means we have 1 integer or 1 reference to play with in the outer class (not the holder), as we can get this for free. We don't need to maintain a size in 2.1 though, so this is easy. We can track the actual amount of memory allocated (since we already do this).
* I would rather make the condition for upgrading to locks be based on some rate of wasted work (or, since it works just as well, some rate of wasted memory allocations). The current value seems a bit clunky and difficult to tune, and might be no real indication of contention. However, we need to keep this encoded in an integer, and we need to ensure it is free to maintain in the fast case. So I propose the following:
# we decide on a maximum rate of waste (let's say 100MB/s)
# when we first waste work we:
#* get the current time in ms (but from nanoTime since we need monotonicity);
#* subtract from it our max rate (100MB/s) converted to K/s, i.e. 100 * 1024, so we have present-100*1024;
#* set our shared counter state to this value
# whenever we waste work we:
#* calculate how much we wasted\* in KB;
#* add this to our shared counter;
#* if the shared counter has _gone past the present time_ we know we've exceeded our maximum wastage, and we set our counter to Integer.MAX_VALUE, which is the flag to everyone to upgrade to locks;
#* if we see it's too far in the past, we reset it to present-(100*1024)
\* To calculate wasted work, we track the size you are currently tracking in 2.0, and in 2.1 we use the BTree's existing size-delta tracking. In 2.0 we multiply the number of updates we had made by lg2(N) (N = current tree size), and multiply this by 100 (the approximate size of snaptree nodes) + ~200 per clone.
This is the same scheme I used for tracking wasted cycles in SharedExecutorPool (CASSANDRA-4718) and I think it works pretty well, and is succinctly represented in memory.
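A minimal sketch of that counter, assuming the 1KB-per-ms-tick encoding described above (~1MB/s sustained with a 100MB excess allowance), a clock rebased on creation time, and Integer.MIN_VALUE as the "no waste yet" flag suggested later in the thread. Names and the field-updater choice are simplifications, not the committed patch:
{code}
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Sketch only; one int of shared state per partition.
final class WastedWorkTracker
{
    private static final int UNSET = Integer.MIN_VALUE;  // no one has wasted work yet
    private static final int TRIPPED = Integer.MAX_VALUE; // flag: upgrade to locks
    private static final int ALLOWANCE = 100 * 1024;      // 100MB excess, in KB ticks

    private final long creationNanos = System.nanoTime();
    private volatile int wasteState = UNSET;
    private static final AtomicIntegerFieldUpdater<WastedWorkTracker> STATE =
        AtomicIntegerFieldUpdater.newUpdater(WastedWorkTracker.class, "wasteState");

    // ms since creation: rebasing on creation keeps us safely within 32 bits
    private int nowMillis()
    {
        return (int) ((System.nanoTime() - creationNanos) / 1_000_000);
    }

    boolean shouldUseLock()
    {
        return wasteState == TRIPPED;
    }

    // Called each time a CAS fails and the work we just built is thrown away.
    void recordWaste(int wastedKb)
    {
        while (true)
        {
            int cur = wasteState;
            if (cur == TRIPPED)
                return;                               // someone already flipped us
            int present = nowMillis();
            int floor = present - ALLOWANCE;          // oldest useful counter value
            int base = (cur == UNSET || cur < floor) ? floor : cur; // (re)start window
            int next = base + wastedKb;               // each wasted KB = one tick
            if (next >= present)                      // caught up with "now": the
                next = TRIPPED;                       // sustained rate was exceeded
            if (STATE.compareAndSet(this, cur, next))
                return;
        }
    }
}
{code}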
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071330#comment-14071330 ] graham sanderson commented on CASSANDRA-7546: - Note w.r.t. deletionInfo... I'm a bit confused about who owns what. On 2.1 (and I'm not 100% sure of the exact semantics of when you need to use HeapAllocator.instance vs pure heap allocation, since I haven't looked at the 2.1 code much):
{code}
if (inputDeletionInfoCopy == null)
    inputDeletionInfoCopy = cm.deletionInfo().copy(HeapAllocator.instance);

deletionInfo = current.deletionInfo.copy().add(inputDeletionInfoCopy);
updater.allocated(deletionInfo.unsharedHeapSize() - current.deletionInfo.unsharedHeapSize());
{code}
However, current.deletionInfo.copy() is not done with the HeapAllocator, and the passed inputDeletionInfoCopy's ranges are RE-copied (without using HeapAllocator.instance) on some code paths inside the .add() method but not others.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071171#comment-14071171 ] graham sanderson commented on CASSANDRA-7546: - In case anyone is reading them, here is the latest output - note that with the current wasted-work limit of 100 we actually kick in later, except under the higher-contention loads, but by doing a one-time flip we actually do less work overall...
{code}
[junit] --
[junit] 1 THREAD; ELEMENT SIZE 64
[junit]
[junit] Threads = 1 elements = 10 (of size 64) partitions = 1
[junit] original code:
[junit] Duration = 996ms maxConcurrency = 1
[junit] GC for PS Scavenge: 36 ms for 3 collections
[junit] Approx allocation = 563MB vs 8MB; ratio to raw data size = 70.37447428571429
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 765ms maxConcurrency = 1
[junit] GC for PS Scavenge: 38 ms for 3 collections
[junit] Approx allocation = 590MB vs 8MB; ratio to raw data size = 73.67167714285715
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit]
[junit] Threads = 1 elements = 10 (of size 64) partitions = 16
[junit] original code:
[junit] Duration = 496ms maxConcurrency = 1
[junit] GC for PS Scavenge: 20 ms for 2 collections
[junit] Approx allocation = 448MB vs 8MB; ratio to raw data size = 55.95978857142857
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 574ms maxConcurrency = 1
[junit] GC for PS Scavenge: 27 ms for 2 collections
[junit] Approx allocation = 485MB vs 8MB; ratio to raw data size = 60.56426285714286
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit]
[junit] Threads = 1 elements = 10 (of size 64) partitions = 256
[junit] original code:
[junit] Duration = 662ms maxConcurrency = 1
[junit] GC for PS Scavenge: 12 ms for 1 collections
[junit] Approx allocation = 333MB vs 8MB; ratio to raw data size = 41.59998095238095
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 241ms maxConcurrency = 1
[junit] GC for PS Scavenge: 9 ms for 1 collections
[junit] Approx allocation = 349MB vs 8MB; ratio to raw data size = 43.65317619047619
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit]
[junit] Threads = 1 elements = 10 (of size 64) partitions = 1024
[junit] original code:
[junit] Duration = 222ms maxConcurrency = 1
[junit] GC for PS Scavenge: 11 ms for 1 collections
[junit] Approx allocation = 273MB vs 8MB; ratio to raw data size = 34.18085428571428
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 234ms maxConcurrency = 1
[junit] GC for PS Scavenge: 10 ms for 1 collections
[junit] Approx allocation = 286MB vs 8MB; ratio to raw data size = 35.7883064
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit]
[junit] --
[junit] 100 THREADS; ELEMENT SIZE 64
[junit]
[junit] Threads = 100 elements = 10 (of size 64) partitions = 1
[junit] original code:
[junit] Duration = 1383ms maxConcurrency = 100
[junit] GC for PS Scavenge: 108 ms for 29 collections
[junit] Approx allocation = 9525MB vs 8MB; ratio to raw data size = 1189.0213895238096
[junit] loopRatio (closest to 1 best) 16.74471 raw 10/1674471 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 1728ms maxConcurrency = 100
[junit] GC for PS Scavenge: 14 ms for 1 collections
[junit] Approx allocation = 572MB vs 8MB; ratio to raw data size = 71.49758761904762
[junit] loopRatio (closest to 1 best) 1.00011 raw 144/154 counted 0/0 sync 99856/99857 up 0 down 0
[junit]
[junit]
[junit] Threads = 100 elements = 10 (of size 64) partitions = 16
[junit] original code:
[junit] Duration = 223ms maxConcurrency = 100
[junit] GC for PS Scavenge: 24 ms for 2 collections
[junit] Approx allocation = 760MB vs 8MB; ratio to raw data size = 94.87286476190476
[junit] loopRatio (closest to 1 best) 1.88353 raw 10/188353 counted 0/0 syn
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070533#comment-14070533 ] graham sanderson commented on CASSANDRA-7546: - Well, that makes sense; I hadn't checked whether there was a limit on mutator threads - we didn't change it... this probably explains the hard upper bound in my synthetic test (which incidentally does not do the transformation). I agree with you on SnapTreeMap: once I saw that the "essentially" free clone operation has to acquire a lock (or at least wait for no mutations), I surmised there were probably dragons there that might cause all kinds of nastiness, whether that be pain on concurrent updates to a horribly unbalanced tree, or dragging huge amounts of garbage along due to overly lazy copy-on-write (again, I didn't look too closely). BTree looks much better (and probably does less rebalancing, since it has wider nodes, I think), though as discussed it doesn't prevent the underlying race. So, I'll see if I have time to work on this later today, but the plan is... for 2.0.x (just checking): a) move the transformation.apply out of the loop and do it once; b) do a one-way flip flag per AtomicSortedColumns instance, flipped when a cost reaches a certain value (see the sketch below) - I was going to calculate the delta in each mutator thread (probably adding a log-like measure, e.g. using Integer.numberOfLeadingZeros(tree.size()), per failing CAS), though looking (ugh) at SnapTreeMap again, it seems that tree.size() is not a good method to call in the presence of mutations, so I guess Holder can just track the tree size itself; c) given this is possibly a temporary solution, is it worth exposing the "cut-off" value, even undocumented, such that it could be overridden in cassandra.yaml? Note the default should be such that most AtomicSortedColumns instances never get cut off, since they are not heavily contended and large (indicating contended inserts, not updates).
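A minimal sketch of plan (b) above - a one-way flip from optimistic CAS to mutual exclusion once the wasted-work cost crosses a threshold. All names, the AtomicInteger cost counter (the thread discusses cheaper ways to track this), and the log-like cost heuristic are placeholders from the discussion, not the committed patch:
{code}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Sketch only; stand-in Holder, not the actual AtomicSortedColumns code.
final class OneWayFlipSketch
{
    static final class Holder { final int treeSize; Holder(int s) { treeSize = s; } }

    private static final int CUTOFF = 100;  // the "wasted work limit of 100" above
    private final AtomicReference<Holder> ref = new AtomicReference<>(new Holder(0));
    private final AtomicInteger wastedCost = new AtomicInteger();
    private volatile boolean contended;     // one-way flip: never cleared

    void addAll(int columnsAdded)
    {
        while (!contended)
        {
            Holder cur = ref.get();
            Holder next = new Holder(cur.treeSize + columnsAdded); // clone/update
            if (ref.compareAndSet(cur, next))
                return;                      // common un-contended path: done
            // CAS failed: the clone above was wasted work. Charge a log-like
            // cost per failure (numberOfLeadingZeros as a cheap lg2 of size).
            int cost = 32 - Integer.numberOfLeadingZeros(cur.treeSize + 1);
            if (wastedCost.addAndGet(cost) > CUTOFF)
                contended = true;            // flip to the locked path below
        }
        synchronized (this)                  // contended path: serialize updates,
        {                                    // but still publish via CAS so we
            while (true)                     // cannot lose a race with stragglers
            {                                // still in the optimistic loop
                Holder cur = ref.get();
                if (ref.compareAndSet(cur, new Holder(cur.treeSize + columnsAdded)))
                    return;
            }
        }
    }
}
{code}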
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070133#comment-14070133 ] Benedict commented on CASSANDRA-7546: - bq. let me know if you want me to take another stab at the patch
We're always keen for competent newcomers to start contributing to the project; if you've got the time, that would be great, and I can review. If not, I'm happy to make this change.
bq. we probably have hundreds of concurrent mutator threads for them
This should never be the case. By default there are 32 concurrent writers permitted, and this should never be changed to more than a small multiple of the number of physical cores on the machine (unless running batch CL), so if there are hundreds, something is going wrong. Furthermore, it makes very little sense that this problem wouldn't be hit by as many concurrent large modifications: the race condition is the same, but much easier to hit the more work there is being done per concurrent modifier. I decided to take a peek at the SnapTreeMap code, since this didn't make much sense, and I see that there is very different behaviour if we have many clone()s as opposed to many updates (larger updates would necessarily result in a lower incidence/overlap of clone()), as epochs attempt to be allocated. I don't really have time to waste digging any deeper, but it seems possible that this code path results in a great deal more object allocation (and possibly allocations that are not easily collectible) than simply performing many large updates. If this is the case, then again 2.1 will not suffer this problem. This doesn't feel like a satisfactory explanation, and nor does the slightly different possible synchronization behaviour with larger updates (snaptree is littered with synchronized() calls, which might possibly overlap more often with many updates). Either way, I'm happy to introduce the mitigation strategy we've discussed, since it makes sense in and of itself. However, we clearly do not fully understand what is happening in your specific scenario, and I do not want to dig further into snaptree - it's a really ugly contraption!
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069981#comment-14069981 ] graham sanderson commented on CASSANDRA-7546: - My last piece of speculation... these single-partition hint trees are probably getting thousands of nodes big, and we probably have hundreds of concurrent mutator threads for them. It may just be that we are hitting a "sweet" spot of allocation rate such that none of the on-processor threads actually makes sufficient progress to reach its CAS before we end up needing to GC; at that point they must all safepoint, after which (I assume) they get no preferential dibs at running next, so we have a much higher ratio of wastage than even in my synthetic test, where it was largely proportional to the number of cores, not the number of threads. In this nasty case, where we have enough cores to do lots of concurrent work but enough work per core to cause enough allocation to trigger GC before any of them finish the task at hand, you get the worst of both the locking and the spinning worlds. Anyway, let me know if you want me to take another stab at the patch, including doing the one-time allocation outside the loop (or on first pass) - you are more familiar with the code, but it is always good to learn.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069953#comment-14069953 ] graham sanderson commented on CASSANDRA-7546: -
Note the one-line summary is that lots of small inserts seem to cause far more problems than lots of large inserts, presumably because they can happen faster, and anything bounded by their intrinsic size rather than their actual overhead can fit more of them.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069938#comment-14069938 ] graham sanderson commented on CASSANDRA-7546: -
{quote}
However, whether it is one-way or not is somewhat unimportant to me. This flip would only last the lifetime of a memtable, which is not super lengthy (under heavy load probably only a few minutes), and would not have dramatically negative consequences if it got it slightly wrong
{quote}
Cool, that's what I was asking/thinking.
As for the tree size/rebalancing, I have no particular proof... when things go wrong we are hinting massively, so maybe there are hundreds of hint mutation threads, each with its own in-progress rebalance, pinning a lot of nodes across young-generation GCs. That said, the memory allocation rate is truly spectacular even given the excessive hinting, so I have to suspect the spinning (and, as you say, probably some of the in-arena allocation it does too) - though that would also be surprising, since these are hint updates, each touching only a single cell.
Anyway... we can track cost in the Holder, I guess, to avoid any extra atomic operations, and maybe factor in the tree size there too.
Note, as an aside, that we are partly to blame for this issue (best practices to be learned, and ways we can mitigate), but the result is surprising enough (because things go bad at random, and usually when we are inserting hundreds of times less data than we can easily handle) that others might easily get bitten. I would describe everything that I think is going on in the snowballing of problems, but it is a bit of a comedy of errors.
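As a rough illustration of the "track cost in the Holder" idea (a hypothetical shape - the real Holder and any threshold would differ): because the Holder is already replaced via CAS, a wasted-work counter carried inside it piggybacks on the existing atomic operation.
{code}
// Hypothetical sketch: carrying a wasted-work estimate inside the Holder
// so no additional atomic fields are needed - the counter travels with
// the snapshot that the existing CAS already publishes.
final class CostTrackingHolder
{
    final Object tree;     // stand-in for the immutable column tree
    final long wastedOps;  // cumulative wasted work observed so far

    CostTrackingHolder(Object tree, long wastedOps)
    {
        this.tree = tree;
        this.wastedOps = wastedOps;
    }

    // A winning thread that itself burned 'extraWaste' failed attempts
    // folds that into the total as part of the same CAS publication.
    CostTrackingHolder next(Object newTree, long extraWaste)
    {
        return new CostTrackingHolder(newTree, wastedOps + extraWaste);
    }

    // Could also factor in tree size here, as suggested above.
    boolean shouldFallBackToLocking(long threshold)
    {
        return wastedOps > threshold;
    }
}
{code}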
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069896#comment-14069896 ] Benedict commented on CASSANDRA-7546: -
bq. Alternatively if you are saying, let each thread keep working while they still believe they can win
This was my original rationale for the patch I posted; however, I am now much more in favour of:
bq. a one-way switch per Atomic*Columns instance that flips after a number of wasted "operations"?
However, whether it is one-way or not is somewhat unimportant to me. This flip would only last the lifetime of a memtable, which is not super lengthy (under heavy load probably only a few minutes), and would not have dramatically negative consequences if it got it slightly wrong.
However^2, I'm still having a hard time believing rebalancing costs in snap tree can be that high, and further, if that really is the problem, it should not be an issue in 2.1, as the b-tree rebalances with O(lg(N)) allocations. I'd be a little surprised if the snap tree didn't do the same: if there were more than O(lg(N)) allocations, the algorithmic complexity would be > O(lg(N)) also. It's possible that it somehow manages to inter-reference with on-going copies, so that we get a highly complex graph that retains exponentially more garbage the more competing updates there are, but again I would be very surprised if this were the case. Outside of either of these I would expect the garbage generated to all be immediately collectible, so it would have to be the sheer volume alone that overwhelmed the GC; that is certainly possible, but it would entail a _lot_ of hinting, and I'd be surprised if a node could be receiving a large enough quantity. On the other hand, the arena allocations in 2.0 are definitely incapable of being collected, and could be allocated almost as rapidly.
bq. I'm not sure which changes you are talking about back-porting and whether the "at most twice" refers to looping once then locking
In this instance I'm referring to copying the source ColumnFamily into a local variable once after failing the CAS, so that we do not keep allocating arena space. Alternatively, we could just do it upfront in the method, as the only extra cost is an array allocation proportional in size to the input data, which is fairly cheap.
All of this said, I think the behaviour of locking after wasting an excessive number of cycles is still a good one, so I'm comfortable introducing it either way, and it would certainly help with all of the above causes.
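A sketch of the one-way flip under discussion - entirely illustrative: the threshold, the fields, and the use of the object monitor are assumptions, and a real implementation must ensure stragglers on the optimistic path cannot lose updates:
{code}
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Illustrative one-way "inflate to locking" switch: once enough work has
// been wasted, every later modification takes the monitor instead of
// spinning. The flip lasts only as long as this object (roughly a
// memtable's lifetime), so a slightly wrong decision is cheap.
final class OneWayFlipSketch<T>
{
    private static final long WASTED_WORK_THRESHOLD = 10_000; // made up

    private final AtomicReference<T> ref;
    private volatile boolean locked;   // one-way: never reset once set
    private long wastedWork;           // heuristic; benign races tolerated

    OneWayFlipSketch(T initial) { ref = new AtomicReference<>(initial); }

    void modify(UnaryOperator<T> update)
    {
        while (!locked)
        {
            T current = ref.get();
            T replacement = update.apply(current); // clone + merge stand-in
            if (ref.compareAndSet(current, replacement))
                return;
            if (++wastedWork > WASTED_WORK_THRESHOLD)
                locked = true; // flip once; fall through to the monitor
        }
        synchronized (this)
        {
            // Still CAS under the monitor so optimistic stragglers that
            // haven't yet observed the flip cannot be silently overwritten.
            T current;
            T replacement;
            do
            {
                current = ref.get();
                replacement = update.apply(current);
            }
            while (!ref.compareAndSet(current, replacement));
        }
    }
}
{code}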
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069666#comment-14069666 ] graham sanderson commented on CASSANDRA-7546: -
Alternatively, if you are saying: let each thread keep working while it still believes it can win, or while it has something to do that can be reused if it loses; maybe give it one last shot to try again if it loses and hasn't done anything reusable; then make it block... I'm okay with that. (Of course on 2.0.x today, that pretty much boils down to your patch!)
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069504#comment-14069504 ] graham sanderson commented on CASSANDRA-7546: -
{quote}
I do wonder how much of a problem this is in 2.1, though. I wonder if the largest problem with these racy modifications isn't actually the massive amount of memtable arena allocation they incur in 2.0, with all their transformation.apply() calls (which reallocate the mutation on the arena); that is most likely what causes the promotion failures, as those allocations cannot be collected. I wonder if we shouldn't simply backport the logic to allocate these only once, or at most twice (the first time we race). It seems much more likely to me that this is where the pain is being felt.
{quote}
I'm not sure which changes you are talking about back-porting, or whether the "at most twice" refers to looping once then locking. Certainly avoiding any repeated cloning of the cells is good; however, I'm still pretty sure, based on PrintFLSStatistics, that the slabs themselves are not the biggest problem (I suspect SnapTreeMap nodes, combined with the high rebalancing cost of huge trees in the hint case, since the keys arrive almost entirely sorted).
Are you suggesting a one-way switch per Atomic*Columns instance that flips after a number of wasted "operations"? That sounds reasonable... I'd expect that a partition for a table is likely to have high contention or not based on the schema design/use case. I have no idea how long these instances hang around in practice (presumably not insanely long).
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069396#comment-14069396 ] Benedict commented on CASSANDRA-7546: -
My concern with the approach you've outlined is that we're barely a hair's breadth from a lock: as soon as we hit _any_ contention, we inflate to locking behaviour. This is good for large partitions, most likely bad for small ones, and more to the point seems barely worth the complexity over just making it a lock in the first place.
On further consideration, I think I would prefer to drive this lock-inflation behaviour off the size of the aborted changes: if the amount of work we've wasted exceeds some threshold, we decide it's high time all threads were stopped to let us finish. We could in this scenario flip a switch that requires all modifications to acquire the monitor once we hit that threshold; I would be fine with this behaviour, and it would be simple.
I do wonder how much of a problem this is in 2.1, though. I wonder if the largest problem with these racy modifications isn't actually the massive amount of memtable arena allocation they incur in 2.0, with all their transformation.apply() calls (which reallocate the mutation on the arena); that is most likely what causes the promotion failures, as those allocations cannot be collected. I wonder if we shouldn't simply backport the logic to allocate these only once, or at most twice (the first time we race). It seems much more likely to me that this is where the pain is being felt.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069173#comment-14069173 ] graham sanderson commented on CASSANDRA-7546: -
Excellent - I will take a look in the 2.1 branch - I was wondering if there were some sample profiles.
The main problem we have in 2.0.x shows up under relatively heavy sustained write load: we are already allocating memtable slabs along with all the small short-lived objects in the commit log and write path; add to that hinting, which means more memtable slabs, and - because hints go to a single partition - much larger snap trees (whose somewhat contentious lazy copy-on-write may or may not make things worse, I don't know). Under that allocation rate we spill huge numbers of small objects (possibly snap tree nodes) into the tenured gen along with the slabs, which tends to lead to promotion failure and the need for compaction.
I'll have to play around, but I don't think it is easy to capture the effect of excessive allocation of (intended-to-be) temporary objects in a stress test, as opposed to excessive CPU, because the GC copes really well until it doesn't.
Note, my belief is that your new tree in 2.1 probably mitigates the problem quite a bit (no contention in the tree, wider nodes, less rebalancing, etc.), though I suggest we still fix the CAS loop allocation there too.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069065#comment-14069065 ] Benedict commented on CASSANDRA-7546: -
I'll take a look at your patch shortly, but in the meantime it's worth pointing out that cassandra-stress now supports fairly complex CQL inserts, including various sizes of batch update, with fine-grained control over how large a partition to generate and what percentage of that total partition to update at any point. Take a look at the sample stress profiles under the tools hierarchy on latest 2.1.
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069059#comment-14069059 ] graham sanderson commented on CASSANDRA-7546: -
FYI here are the same synthetic test results for 7546.20_2.txt:
{code}
[junit] --
[junit] 1 THREAD; ELEMENT SIZE 64
[junit]
[junit] Threads = 1 elements = 10 (of size 64) partitions = 1
[junit] original code:
[junit] Duration = 993ms maxConcurrency = 1
[junit] GC for PS Scavenge: 34 ms for 3 collections
[junit] Approx allocation = 553MB vs 8MB; ratio to raw data size = 69.13799428571429
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 761ms maxConcurrency = 1
[junit] GC for PS Scavenge: 34 ms for 3 collections
[junit] Approx allocation = 579MB vs 8MB; ratio to raw data size = 72.31675047619048
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit]
[junit] Threads = 1 elements = 10 (of size 64) partitions = 16
[junit] original code:
[junit] Duration = 780ms maxConcurrency = 1
[junit] GC for PS Scavenge: 25 ms for 2 collections
[junit] Approx allocation = 436MB vs 8MB; ratio to raw data size = 54.48992095238095
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 671ms maxConcurrency = 1
[junit] GC for PS Scavenge: 24 ms for 2 collections
[junit] Approx allocation = 477MB vs 8MB; ratio to raw data size = 59.545997142857146
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit]
[junit] Threads = 1 elements = 10 (of size 64) partitions = 256
[junit] original code:
[junit] Duration = 452ms maxConcurrency = 1
[junit] GC for PS Scavenge: 11 ms for 1 collections
[junit] Approx allocation = 321MB vs 8MB; ratio to raw data size = 40.14510761904762
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 460ms maxConcurrency = 1
[junit] GC for PS Scavenge: 10 ms for 1 collections
[junit] Approx allocation = 341MB vs 8MB; ratio to raw data size = 42.63770857142857
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit]
[junit] Threads = 1 elements = 10 (of size 64) partitions = 1024
[junit] original code:
[junit] Duration = 462ms maxConcurrency = 1
[junit] GC for PS Scavenge: 14 ms for 1 collections
[junit] Approx allocation = 264MB vs 8MB; ratio to raw data size = 32.99879142857143
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 543ms maxConcurrency = 1
[junit] GC for PS Scavenge: 14 ms for 1 collections
[junit] Approx allocation = 272MB vs 8MB; ratio to raw data size = 34.047360952380956
[junit] loopRatio (closest to 1 best) 1.0 raw 10/10 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit]
[junit] --
[junit] 100 THREADS; ELEMENT SIZE 64
[junit]
[junit] Threads = 100 elements = 10 (of size 64) partitions = 1
[junit] original code:
[junit] Duration = 2318ms maxConcurrency = 100
[junit] GC for PS Scavenge: 119 ms for 32 collections
[junit] Approx allocation = 10547MB vs 8MB; ratio to raw data size = 1316.62704
[junit] loopRatio (closest to 1 best) 18.35448 raw 10/1835448 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 1315ms maxConcurrency = 100
[junit] GC for PS Scavenge: 14 ms for 1 collections
[junit] Approx allocation = 629MB vs 8MB; ratio to raw data size = 78.62949142857143
[junit] loopRatio (closest to 1 best) 1.11563 raw 13653/13653 counted 0/0 sync 88223/97910 up 0 down 0
[junit]
[junit]
[junit] Threads = 100 elements = 10 (of size 64) partitions = 16
[junit] original code:
[junit] Duration = 215ms maxConcurrency = 100
[junit] GC for PS Scavenge: 23 ms for 2 collections
[junit] Approx allocation = 776MB vs 8MB; ratio to raw data size = 96.92138285714286
[junit] loopRatio (closest to 1 best) 1.95927 raw 10/195927 counted 0/0 sync 0/0 up 0 down 0
[junit]
[junit] modified code:
[junit] Duration = 201ms maxConcurrency = 99
[junit] GC for PS Scavenge: 9 ms for 1 collections
[juni
{code}
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068803#comment-14068803 ] Benedict commented on CASSANDRA-7546: -
bq. I'm not sure my code (whilst not blazingly pretty) is insanely hard to reason about...
I'm not suggesting it is by any means abhorrent, only that we can achieve the desired goal with fewer changes, so unless there's a lot of evidence that the extra complexity is worth it, we should stick with the simpler approach (this also means less pollution of the instruction cache in a very hot part of the codebase, which is a good thing). If you want to benchmark this closely, I would suggest running a stress workload with a fixed number of threads and an increasing number of partitions (from 1 up to more than the number of threads) and seeing how the curve changes.
As to (b): since we only ever acquire the lock when we are contending, it must always be inflated anyway, so this shouldn't be an issue.
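For anyone who wants to reproduce the suggested experiment without touching Cassandra itself, here is a rough, self-contained harness in the same spirit - all parameters are made up; it exercises the same read/clone/CAS shape and reports how many attempts were wasted as the partition count grows:
{code}
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Toy contention-curve harness: fixed thread count, sweep partition count,
// count wasted CAS attempts. Illustrative only - not Cassandra code.
public final class ContentionCurve
{
    public static void main(String[] args) throws InterruptedException
    {
        final int threads = 32;
        final int opsPerThread = 10_000;
        for (int partitions : new int[]{ 1, 2, 4, 8, 16, 32, 64 })
        {
            final AtomicLong wasted = new AtomicLong();
            @SuppressWarnings("unchecked")
            final AtomicReference<NavigableMap<Integer, Integer>>[] parts =
                    new AtomicReference[partitions];
            for (int i = 0; i < partitions; i++)
                parts[i] = new AtomicReference<>(new TreeMap<>());

            final CountDownLatch done = new CountDownLatch(threads);
            for (int t = 0; t < threads; t++)
            {
                final int seed = t;
                new Thread(() -> {
                    for (int i = 0; i < opsPerThread; i++)
                    {
                        AtomicReference<NavigableMap<Integer, Integer>> ref =
                                parts[(seed + i) % parts.length];
                        while (true)
                        {
                            NavigableMap<Integer, Integer> cur = ref.get();
                            NavigableMap<Integer, Integer> copy = new TreeMap<>(cur);
                            copy.put(i % 256, seed); // keep partitions small
                            if (ref.compareAndSet(cur, copy))
                                break;
                            wasted.incrementAndGet(); // one discarded clone
                        }
                    }
                    done.countDown();
                }).start();
            }
            done.await();
            System.out.printf("partitions=%d wastedAttempts=%d%n",
                              partitions, wasted.get());
        }
    }
}
{code}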