[jira] [Resolved] (CASSANDRA-6405) When making heavy use of counters, neighbor nodes occasionally enter spiral of constant memory consumption

2014-04-10 Thread Aleksey Yeschenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko resolved CASSANDRA-6405.
--

   Resolution: Fixed
Fix Version/s: 2.1 beta2
Reproduced In: 1.2.11, 1.1.7, 1.0.12  (was: 1.0.12, 1.1.7, 1.2.11)

CASSANDRA-6506 has been delayed until 3.0, but this issue is now actually 
resolved in 2.1 by the combination of new memtable code and various counters++ 
commits (including, but not limited to, part of CASSANDRA-6506 and 
CASSANDRA-6953).

 When making heavy use of counters, neighbor nodes occasionally enter spiral 
 of constant memory consumption
 -

 Key: CASSANDRA-6405
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6405
 Project: Cassandra
  Issue Type: Bug
 Environment: RF of 3, 15 nodes.
 Sun Java 7 (also occurred in OpenJDK 6, and Sun Java 6).
 Xmx of 8G.
 No row cache.
Reporter: Jason Harvey
 Fix For: 2.1 beta2

 Attachments: threaddump.txt


 We're randomly running into an interesting issue on our ring. When making use 
 of counters, we'll occasionally have 3 nodes (always neighbors) suddenly 
 start filling up memory, running CMS, filling up again, and repeating. This 
 pattern goes on for 5-20 minutes. Nearly all requests to the nodes time out 
 during this period. Restarting one, two, or all three of the nodes does not 
 resolve the spiral; after a restart the three nodes immediately start hogging 
 up memory again and CMSing constantly.
 When the issue resolves itself, all 3 nodes immediately get better. Sometimes 
 it reoccurs in bursts, where it will be trashed for 20 minutes, fine for 5, 
 trashed for 20, and repeat that cycle a few times.
 There are no unusual logs provided by cassandra during this period of time, 
 other than recording of the constant dropped read requests and the constant 
 CMS runs. I have analyzed the log files prior to multiple distinct instances 
 of this issue and have found no preceding events which are associated with 
 this issue.
 I have verified that our apps are not performing any unusual number or type 
 of requests during this time.
 This behaviour occurred on 1.0.12, 1.1.7, and now on 1.2.11.
 The way I've narrowed this down to counters is a bit naive. It started 
 happening when we started making use of counter columns, went away after we 
 rolled back use of counter columns. I've repeated this attempted rollout on 
 each version now, and it consistently rears its head every time. I should 
 note this incident does _seem_ to happen more rarely on 1.2.11 compared to 
 the previous versions.
 This incident has been consistent across multiple different types of 
 hardware, as well as major kernel version changes (2.6 all the way to 3.2). 
 The OS is operating normally during the event.
 I managed to get an hprof dump when the issue was happening in the wild. 
 The class instance counts reported by jhat show something notable. Here are 
 the top 5 counts for this one node:
 {code}
 5967846 instances of class org.apache.cassandra.db.CounterColumn
 1247525 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
 1247310 instances of class org.apache.cassandra.cache.KeyCacheKey
 1246648 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
 1237526 instances of class org.apache.cassandra.db.RowIndexEntry
 {code}
 Is it normal or expected for CounterColumn to have that number of instances?
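
To put the histogram above in proportion, here is a minimal Python sketch (an illustration, not part of the original report or of any Cassandra tooling) that parses the "N instances of class X" lines and compares the CounterColumn count with the KeyCacheKey count. The helper name `parse_histogram` and the choice of KeyCacheKey as the baseline are assumptions made for this example; the ratio on its own does not establish a root cause.

```python
import re

# Histogram lines copied from the jhat output quoted above,
# with the mail-wrapped lines rejoined.
JHAT_OUTPUT = """\
5967846 instances of class org.apache.cassandra.db.CounterColumn
1247525 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
1247310 instances of class org.apache.cassandra.cache.KeyCacheKey
1246648 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
1237526 instances of class org.apache.cassandra.db.RowIndexEntry
"""

# Matches one histogram line: a count, the fixed phrase, then the class name.
LINE_RE = re.compile(r"^\s*(\d+) instances of class (\S+)\s*$")


def parse_histogram(text):
    """Return {class_name: instance_count} for every matching line (hypothetical helper)."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


counts = parse_histogram(JHAT_OUTPUT)
counter_columns = counts["org.apache.cassandra.db.CounterColumn"]
key_cache_keys = counts["org.apache.cassandra.cache.KeyCacheKey"]

# Assumption: KeyCacheKey is used only as a convenient baseline for comparison.
ratio = counter_columns / key_cache_keys
print(f"CounterColumn instances per KeyCacheKey: {ratio:.2f}")
```

On this dump the four cache-related classes sit at roughly 1.24 million instances each, while CounterColumn is several times larger, which is what makes it stand out.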
 The data model for how we use counters is as follows: between 50-2 
 counter columns per key. We currently have around 3 million keys total, but 
 this issue also replicated when we only had a few thousand keys total. 
 Average column count is around 1k, and 90th is 18k. New columns are added 
 regularly, and columns are incremented regularly. No column or key deletions 
 occur. We probably have 1-5k hot keys at any given time, spread across the 
 entire ring. R:W ratio is typically around 50:1. This is the only CF we're 
 using counters on, at this time. CF details are as follows:
 {code}
 ColumnFamily: CommentTree
   Key Validation Class: org.apache.cassandra.db.marshal.AsciiType
   Default column value validator: org.apache.cassandra.db.marshal.CounterColumnType
   Cells sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType)
   GC grace seconds: 864000
   Compaction min/max thresholds: 4/32
   Read repair chance: 0.01
   DC Local Read repair chance: 0.0
   Populate IO Cache on flush: false
   Replicate on write: true
   Caching: KEYS_ONLY
   Bloom Filter FP chance: default
   

[jira] [Resolved] (CASSANDRA-6405) When making heavy use of counters, neighbor nodes occasionally enter spiral of constant memory consumption

2014-02-21 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis resolved CASSANDRA-6405.
---

   Resolution: Duplicate
Reproduced In: 1.2.11, 1.1.7, 1.0.12  (was: 1.0.12, 1.1.7, 1.2.11)

Closing as a duplicate of CASSANDRA-6506.  There's no reasonable way to fix 
this in earlier C* versions.

 When making heavy use of counters, neighbor nodes occasionally enter spiral 
 of constant memory consumption
 -

 Key: CASSANDRA-6405
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6405
 Project: Cassandra
  Issue Type: Bug
 Environment: RF of 3, 15 nodes.
 Sun Java 7 (also occurred in OpenJDK 6, and Sun Java 6).
 Xmx of 8G.
 No row cache.
Reporter: Jason Harvey
 Attachments: threaddump.txt


 We're randomly running into an interesting issue on our ring. When making use 
 of counters, we'll occasionally have 3 nodes (always neighbors) suddenly 
 start filling up memory, running CMS, filling up again, and repeating. This 
 pattern goes on for 5-20 minutes. Nearly all requests to the nodes time out 
 during this period. Restarting one, two, or all three of the nodes does not 
 resolve the spiral; after a restart the three nodes immediately start hogging 
 up memory again and CMSing constantly.
 When the issue resolves itself, all 3 nodes immediately get better. Sometimes 
 it reoccurs in bursts, where it will be trashed for 20 minutes, fine for 5, 
 trashed for 20, and repeat that cycle a few times.
 There are no unusual logs provided by cassandra during this period of time, 
 other than recording of the constant dropped read requests and the constant 
 CMS runs. I have analyzed the log files prior to multiple distinct instances 
 of this issue and have found no preceding events which are associated with 
 this issue.
 I have verified that our apps are not performing any unusual number or type 
 of requests during this time.
 This behaviour occurred on 1.0.12, 1.1.7, and now on 1.2.11.
 The way I've narrowed this down to counters is a bit naive. It started 
 happening when we started making use of counter columns, went away after we 
 rolled back use of counter columns. I've repeated this attempted rollout on 
 each version now, and it consistently rears its head every time. I should 
 note this incident does _seem_ to happen more rarely on 1.2.11 compared to 
 the previous versions.
 This incident has been consistent across multiple different types of 
 hardware, as well as major kernel version changes (2.6 all the way to 3.2). 
 The OS is operating normally during the event.
 I managed to get an hprof dump when the issue was happening in the wild. 
 The class instance counts reported by jhat show something notable. Here are 
 the top 5 counts for this one node:
 {code}
 5967846 instances of class org.apache.cassandra.db.CounterColumn
 1247525 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
 1247310 instances of class org.apache.cassandra.cache.KeyCacheKey
 1246648 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
 1237526 instances of class org.apache.cassandra.db.RowIndexEntry
 {code}
 Is it normal or expected for CounterColumn to have that number of instances?
 The data model for how we use counters is as follows: between 50-2 
 counter columns per key. We currently have around 3 million keys total, but 
 this issue also replicated when we only had a few thousand keys total. 
 Average column count is around 1k, and 90th is 18k. New columns are added 
 regularly, and columns are incremented regularly. No column or key deletions 
 occur. We probably have 1-5k hot keys at any given time, spread across the 
 entire ring. R:W ratio is typically around 50:1. This is the only CF we're 
 using counters on, at this time. CF details are as follows:
 {code}
 ColumnFamily: CommentTree
   Key Validation Class: org.apache.cassandra.db.marshal.AsciiType
   Default column value validator: org.apache.cassandra.db.marshal.CounterColumnType
   Cells sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType)
   GC grace seconds: 864000
   Compaction min/max thresholds: 4/32
   Read repair chance: 0.01
   DC Local Read repair chance: 0.0
   Populate IO Cache on flush: false
   Replicate on write: true
   Caching: KEYS_ONLY
   Bloom Filter FP chance: default
   Built indexes: []
   Compaction Strategy: org.apache.cassandra.db.compaction.LeveledCompactionStrategy
   Compaction Strategy Options:
 sstable_size_in_mb: 160