Sounds like flushing due to memory consumption. 

The flush log messages include the number of ops, so you can see if this node 
was processing more mutations than the others. Check whether there was more 
(serialised) data being written or more operations being processed. 
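
Something like this will pull the op counts out so you can compare the nodes 
side by side. The Memtable flush line format differs a little between versions, 
so treat the sed as a sketch and adjust it to match what is actually in your 
system.log:

# run on each node (or wrap it in your alldb.sh), counting the 02-08 flushes
# for the msgs CF and summing the ops reported on each "Enqueuing flush of
# Memtable-msgs@..." line (which normally ends "... serialized/live bytes, NNN ops)")
grep "Enqueuing flush of Memtable-msgs" /var/log/cassandra/system.log \
  | grep 02-08 \
  | sed 's/.*, \([0-9]*\) ops.*/\1/' \
  | awk '{ops += $1; flushes++} END {print flushes, "flushes,", ops, "ops"}'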

Also, just for fun, check that the JVM and yaml settings are as expected. 
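
Something like this makes the comparison quick. The yaml path below is just the 
packaged default and the host names are the ones from your listing, so substitute 
your own:

# see which JVM options the process is actually running with
ps -ef | grep [C]assandraDaemon | tr ' ' '\n' | grep -E '^-X|^-D'

# and diff the yaml on the busy node against a quiet one
diff <(ssh db-1a-1 cat /etc/cassandra/cassandra.yaml) \
     <(ssh db-1c-1 cat /etc/cassandra/cassandra.yaml)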

Cheers 


-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/02/2013, at 6:29 AM, Mike <mthero...@yahoo.com> wrote:

> Hello,
> 
> We just hit a very odd issue in our Cassandra cluster.  We are running 
> Cassandra 1.1.2 in a 6 node cluster.  We use a replication factor of 3, and 
> all operations utilize LOCAL_QUORUM consistency.
> 
> We noticed a large performance hit in our application's maintenance 
> activities and I've been investigating.  I discovered a node in the cluster 
> that was flushing a memtable like crazy.  It was flushing every 2-3 minutes, 
> and has been apparently doing this for days. Typically, during this time of 
> day, a flush would happen every 30 minutes or so.
> 
> alldb.sh "cat /var/log/cassandra/system.log | grep \"flushing high-traffic 
> column family CFS(Keyspace='open', ColumnFamily='msgs')\" | grep 02-08 | wc 
> -l"
> [1] 18:41:04 [SUCCESS] db-1c-1
> 59
> [2] 18:41:05 [SUCCESS] db-1c-2
> 48
> [3] 18:41:05 [SUCCESS] db-1a-1
> 1206
> [4] 18:41:05 [SUCCESS] db-1d-2
> 54
> [5] 18:41:05 [SUCCESS] db-1a-2
> 56
> [6] 18:41:05 [SUCCESS] db-1d-1
> 52
> 
> 
> I restarted the database node, and, at least for now, the problem appears to 
> have stopped.
> 
> There are a number of things that don't make sense here.  We use a 
> replication factor of 3, so if this was being caused by our application, I 
> would have expected 3 nodes in the cluster to have issues.  Also, I would 
> have expected the issue to continue once the node restarted.
> 
> Another point of interest, and I'm wondering if it's exposed a bug, is that 
> this node was recently converted to use ephemeral storage on EC2 and was 
> restored from a snapshot.  After the restore, a nodetool repair was run. 
> However, the repair was going to run into some heavy activity for our 
> application, so we canceled that validation compaction (2 of the 3 
> anti-entropy sessions had completed).  The spin appears to have started at 
> the start of the second session.
> 
> Any hints?
> 
> -Mike
