[jira] [Resolved] (CASSANDRA-6405) When making heavy use of counters, neighbor nodes occasionally enter spiral of constant memory consumption
[ https://issues.apache.org/jira/browse/CASSANDRA-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksey Yeschenko resolved CASSANDRA-6405.
--
Resolution: Fixed
Fix Version/s: 2.1 beta2
Reproduced In: 1.2.11, 1.1.7, 1.0.12 (was: 1.0.12, 1.1.7, 1.2.11)

CASSANDRA-6506 has been delayed until 3.0, but this issue is now actually resolved in 2.1 by the combination of new memtable code and various counters++ commits (including, but not limited to, part of CASSANDRA-6506 and CASSANDRA-6953).

When making heavy use of counters, neighbor nodes occasionally enter spiral of constant memory consumption
-
Key: CASSANDRA-6405
URL: https://issues.apache.org/jira/browse/CASSANDRA-6405
Project: Cassandra
Issue Type: Bug
Environment: RF of 3, 15 nodes. Sun Java 7 (also occurred in OpenJDK 6, and Sun Java 6). Xmx of 8G. No row cache.
Reporter: Jason Harvey
Fix For: 2.1 beta2
Attachments: threaddump.txt

We're randomly running into an interesting issue on our ring. When making use of counters, we'll occasionally have 3 nodes (always neighbors) suddenly start immediately filling up memory, CMSing, fill up again, repeat. This pattern goes on for 5-20 minutes. Nearly all requests to the nodes time out during this period. Restarting one, two, or all three of the nodes does not resolve the spiral; after a restart the three nodes immediately start hogging up memory again and CMSing constantly.

When the issue resolves itself, all 3 nodes immediately get better. Sometimes it reoccurs in bursts, where it will be trashed for 20 minutes, fine for 5, trashed for 20, and repeat that cycle a few times.

There are no unusual logs provided by cassandra during this period of time, other than recording of the constant dropped read requests and the constant CMS runs. I have analyzed the log files prior to multiple distinct instances of this issue and have found no preceding events which are associated with this issue.

I have verified that our apps are not performing any unusual number or type of requests during this time.

This behaviour occurred on 1.0.12, 1.1.7, and now on 1.2.11.

The way I've narrowed this down to counters is a bit naive. It started happening when we started making use of counter columns, went away after we rolled back use of counter columns. I've repeated this attempted rollout on each version now, and it consistently rears its head every time. I should note this incident does _seem_ to happen more rarely on 1.2.11 compared to the previous versions.

This incident has been consistent across multiple different types of hardware, as well as major kernel version changes (2.6 all the way to 3.2). The OS is operating normally during the event.

I managed to get an hprof dump when the issue was happening in the wild. Something notable in the class instance counts as reported by jhat. Here are the top 5 counts for this one node:
{code}
5967846 instances of class org.apache.cassandra.db.CounterColumn
1247525 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
1247310 instances of class org.apache.cassandra.cache.KeyCacheKey
1246648 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
1237526 instances of class org.apache.cassandra.db.RowIndexEntry
{code}
Is it normal or expected for CounterColumn to have that number of instances?

The data model for how we use counters is as follows: between 50-2 counter columns per key. We currently have around 3 million keys total, but this issue also replicated when we only had a few thousand keys total. Average column count is around 1k, and 90th is 18k. New columns are added regularly, and columns are incremented regularly. No column or key deletions occur. We probably have 1-5k hot keys at any given time, spread across the entire ring. R:W ratio is typically around 50:1. This is the only CF we're using counters on, at this time.

CF details are as follows:
{code}
ColumnFamily: CommentTree
  Key Validation Class: org.apache.cassandra.db.marshal.AsciiType
  Default column value validator: org.apache.cassandra.db.marshal.CounterColumnType
  Cells sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType)
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 0.01
  DC Local Read repair chance: 0.0
  Populate IO Cache on flush: false
  Replicate on write: true
  Caching: KEYS_ONLY
  Bloom Filter FP chance: default
[jira] [Resolved] (CASSANDRA-6405) When making heavy use of counters, neighbor nodes occasionally enter spiral of constant memory consumption
[ https://issues.apache.org/jira/browse/CASSANDRA-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-6405.
---
Resolution: Duplicate
Reproduced In: 1.2.11, 1.1.7, 1.0.12 (was: 1.0.12, 1.1.7, 1.2.11)

Closing as a duplicate of CASSANDRA-6506. There's no reasonable way to fix this in earlier C* versions.

When making heavy use of counters, neighbor nodes occasionally enter spiral of constant memory consumption
-
Key: CASSANDRA-6405
URL: https://issues.apache.org/jira/browse/CASSANDRA-6405
Project: Cassandra
Issue Type: Bug
Environment: RF of 3, 15 nodes. Sun Java 7 (also occurred in OpenJDK 6, and Sun Java 6). Xmx of 8G. No row cache.
Reporter: Jason Harvey
Attachments: threaddump.txt

We're randomly running into an interesting issue on our ring. When making use of counters, we'll occasionally have 3 nodes (always neighbors) suddenly start immediately filling up memory, CMSing, fill up again, repeat. This pattern goes on for 5-20 minutes. Nearly all requests to the nodes time out during this period. Restarting one, two, or all three of the nodes does not resolve the spiral; after a restart the three nodes immediately start hogging up memory again and CMSing constantly.

When the issue resolves itself, all 3 nodes immediately get better. Sometimes it reoccurs in bursts, where it will be trashed for 20 minutes, fine for 5, trashed for 20, and repeat that cycle a few times.

There are no unusual logs provided by cassandra during this period of time, other than recording of the constant dropped read requests and the constant CMS runs. I have analyzed the log files prior to multiple distinct instances of this issue and have found no preceding events which are associated with this issue.

I have verified that our apps are not performing any unusual number or type of requests during this time.

This behaviour occurred on 1.0.12, 1.1.7, and now on 1.2.11.

The way I've narrowed this down to counters is a bit naive. It started happening when we started making use of counter columns, went away after we rolled back use of counter columns. I've repeated this attempted rollout on each version now, and it consistently rears its head every time. I should note this incident does _seem_ to happen more rarely on 1.2.11 compared to the previous versions.

This incident has been consistent across multiple different types of hardware, as well as major kernel version changes (2.6 all the way to 3.2). The OS is operating normally during the event.

I managed to get an hprof dump when the issue was happening in the wild. Something notable in the class instance counts as reported by jhat. Here are the top 5 counts for this one node:
{code}
5967846 instances of class org.apache.cassandra.db.CounterColumn
1247525 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
1247310 instances of class org.apache.cassandra.cache.KeyCacheKey
1246648 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
1237526 instances of class org.apache.cassandra.db.RowIndexEntry
{code}
Is it normal or expected for CounterColumn to have that number of instances?

The data model for how we use counters is as follows: between 50-2 counter columns per key. We currently have around 3 million keys total, but this issue also replicated when we only had a few thousand keys total. Average column count is around 1k, and 90th is 18k. New columns are added regularly, and columns are incremented regularly. No column or key deletions occur. We probably have 1-5k hot keys at any given time, spread across the entire ring. R:W ratio is typically around 50:1. This is the only CF we're using counters on, at this time.

CF details are as follows:
{code}
ColumnFamily: CommentTree
  Key Validation Class: org.apache.cassandra.db.marshal.AsciiType
  Default column value validator: org.apache.cassandra.db.marshal.CounterColumnType
  Cells sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType)
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 0.01
  DC Local Read repair chance: 0.0
  Populate IO Cache on flush: false
  Replicate on write: true
  Caching: KEYS_ONLY
  Bloom Filter FP chance: default
  Built indexes: []
  Compaction Strategy: org.apache.cassandra.db.compaction.LeveledCompactionStrategy
  Compaction Strategy Options:
    sstable_size_in_mb: 160
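
Editorial note: the jhat instance counts quoted in the report can be put in proportion with a few lines of Python. This is only a sketch for reading the histogram. The helper name `parse_instance_counts` and the regular expression are illustrative assumptions, not part of jhat or Cassandra, and the resulting ratio says nothing by itself about the root cause.

```python
import re

# Histogram lines copied verbatim from the jhat output quoted in the report.
JHAT_OUTPUT = """\
5967846 instances of class org.apache.cassandra.db.CounterColumn
1247525 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
1247310 instances of class org.apache.cassandra.cache.KeyCacheKey
1246648 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
1237526 instances of class org.apache.cassandra.db.RowIndexEntry
"""

# Assumed line shape: "<count> instances of class <fully.qualified.Name>".
LINE_RE = re.compile(r"^(\d+) instances of class (\S+)$")


def parse_instance_counts(text):
    """Return {class_name: instance_count} for every matching histogram line."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


counts = parse_instance_counts(JHAT_OUTPUT)
counter_columns = counts["org.apache.cassandra.db.CounterColumn"]
key_cache_keys = counts["org.apache.cassandra.cache.KeyCacheKey"]

# Live CounterColumn objects per key-cache entry in this one heap dump.
ratio = counter_columns / key_cache_keys
print(f"CounterColumn instances per KeyCacheKey: {ratio:.2f}")
```

The four cache-related classes sit within about one percent of each other, so the CounterColumn count is the only figure in the top five that stands apart.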