Hi all,

We are using C* 1.2.4 with vnodes and SSDs. We have recently seen behavior where 3 of our nodes lock up under high load in what appears to be a GC spiral, while the rest of the cluster (7 nodes total) appears fine. When I run tpstats, I see the following (assuming tpstats returns at all), and top shows Cassandra pegged at 2000% CPU. Obviously we have a large backlog of reads (ReadStage shows 256 active and 381 pending). In the past I could explain this by unexpectedly wide rows, but we have since handled that.

When the cluster starts to melt down like this, it's hard to get visibility into what's going on and what triggered the issue, as everything starts to pile on. OpsCenter becomes unusable, and because the affected nodes are under GC pressure, getting any data via nodetool or JMX is also difficult.

What do people do to handle these situations? We are going to start graphing reads and writes per second per column family to Ganglia in the hope that it helps.
Thanks

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                       256       381     1245117434         0                 0
RequestResponseStage              0         0     1161495947         0                 0
MutationStage                     8         8      481721887         0                 0
ReadRepairStage                   0         0       85770600         0                 0
ReplicateOnWriteStage             0         0       21896804         0                 0
GossipStage                       0         0        1546196         0                 0
AntiEntropyStage                  0         0           5009         0                 0
MigrationStage                    0         0           1082         0                 0
MemtablePostFlusher               0         0          10178         0                 0
FlushWriter                       0         0           6081         0              2075
MiscStage                         0         0             57         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0              6         0                 0
HintedHandoff                     1         1            246         0                 0

Message type           Dropped
RANGE_SLICE                482
READ_REPAIR                  0
BINARY                       0
READ                    515762
MUTATION                    39
_TRACE                       0
REQUEST_RESPONSE            29
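
As a rough illustration of how tpstats output like the above could be turned into numbers for graphing or alerting, here is a minimal Python sketch. It is not part of our setup; the function names, the SAMPLE excerpt layout, and the pending threshold of 100 are assumptions made for the example, and the parser only assumes the column layout shown above (a pool name followed by five integers, or a message type followed by one integer).

```python
import re

# A pool row is a name followed by exactly five integer columns.
POOL_ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")
# A dropped-message row is a name followed by exactly one integer.
DROPPED_ROW = re.compile(r"^(\S+)\s+(\d+)$")

POOL_FIELDS = ("active", "pending", "completed", "blocked", "all_time_blocked")

# Excerpt of the output posted above, used only to exercise the parser.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                       256       381     1245117434         0                 0
MutationStage                     8         8      481721887         0                 0
FlushWriter                       0         0           6081         0              2075

Message type           Dropped
RANGE_SLICE                482
READ                    515762
MUTATION                    39
"""


def parse_tpstats(text):
    """Return (pools, dropped) dictionaries parsed from tpstats-style text."""
    pools, dropped = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        match = POOL_ROW.match(line)
        if match:
            values = [int(v) for v in match.groups()[1:]]
            pools[match.group(1)] = dict(zip(POOL_FIELDS, values))
            continue
        match = DROPPED_ROW.match(line)
        if match:
            dropped[match.group(1)] = int(match.group(2))
        # Header and blank lines match neither pattern and are skipped.
    return pools, dropped


def backlogged(pools, pending_threshold=100):
    """Names of pools whose pending count meets the (hypothetical) threshold."""
    return sorted(
        name for name, stats in pools.items()
        if stats["pending"] >= pending_threshold
    )


if __name__ == "__main__":
    pools, dropped = parse_tpstats(SAMPLE)
    print("backlogged pools:", backlogged(pools))
    print("dropped READ messages:", dropped.get("READ", 0))
```

In practice the text would come from running nodetool tpstats on a schedule rather than from a hard-coded sample, and the parsed counters could be pushed to whatever graphing system is in use. Because the Dropped and All time blocked counters are cumulative, graphing the difference between successive samples is likely more useful than graphing the raw values.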