Hi all,
We are using C* 1.2.4 with vnodes and SSDs. Recently we have seen 3 of our
nodes lock up under high load in what appears to be a GC spiral, while the
rest of the cluster (7 total nodes) appears fine. When I run nodetool tpstats
on an affected node, I see the following (assuming tpstats returns at all),
and top shows cassandra pegged at 2000% CPU. Obviously we have a large number
of pending and dropped reads. In the past I could explain this by
unexpectedly wide rows, but we have since handled that.

When the cluster starts to melt down like this, it's hard to get visibility
into what's going on and what triggered the issue, because everything starts
to pile on. OpsCenter becomes unusable, and because the affected nodes are
under GC pressure, getting any data via nodetool or JMX is also difficult.

What do people do to handle these situations? We are going to start graphing
reads/sec and writes/sec per CF to Ganglia in the hope that it helps.
Thanks
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                       256       381     1245117434         0                 0
RequestResponseStage              0         0     1161495947         0                 0
MutationStage                     8         8      481721887         0                 0
ReadRepairStage                   0         0       85770600         0                 0
ReplicateOnWriteStage             0         0       21896804         0                 0
GossipStage                       0         0        1546196         0                 0
AntiEntropyStage                  0         0           5009         0                 0
MigrationStage                    0         0           1082         0                 0
MemtablePostFlusher               0         0          10178         0                 0
FlushWriter                       0         0           6081         0              2075
MiscStage                         0         0             57         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0              6         0                 0
HintedHandoff                     1         1            246         0                 0

Message type           Dropped
RANGE_SLICE                482
READ_REPAIR                  0
BINARY                       0
READ                    515762
MUTATION                    39
_TRACE                       0
REQUEST_RESPONSE            29
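
For the graphing idea, here is a rough sketch (not something we run today) of
turning tpstats output into Ganglia metrics. It assumes the unwrapped,
one-row-per-line output that nodetool prints, and that the stock gmetric
command-line tool is available on the node. The function names and the
"cassandra.tpstats." metric naming are made up for illustration. It covers
thread pool backlog and dropped messages only; per-CF read/write rates would
need another source such as cfstats. The script only builds the gmetric
command lines and leaves running them to the caller.

```python
import re

# A pool row is a name followed by five integer columns:
# Active, Pending, Completed, Blocked, All time blocked.
POOL_ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")
# A dropped-message row is a message type followed by one integer.
DROPPED_ROW = re.compile(r"^(\S+)\s+(\d+)$")

POOL_COLUMNS = ("active", "pending", "completed", "blocked", "all_time_blocked")


def parse_tpstats(text):
    """Return (pools, dropped) dicts parsed from tpstats text.

    Header lines and anything else that matches neither row shape are skipped.
    """
    pools, dropped = {}, {}
    for line in text.splitlines():
        line = line.strip()
        match = POOL_ROW.match(line)
        if match:
            values = [int(v) for v in match.groups()[1:]]
            pools[match.group(1)] = dict(zip(POOL_COLUMNS, values))
            continue
        match = DROPPED_ROW.match(line)
        if match:
            dropped[match.group(1)] = int(match.group(2))
    return pools, dropped


def gmetric_commands(pools, dropped, prefix="cassandra"):
    """Build gmetric argument lists; the metric names are hypothetical."""
    commands = []
    for pool, stats in sorted(pools.items()):
        for column in ("active", "pending", "all_time_blocked"):
            commands.append([
                "gmetric",
                "--name=%s.tpstats.%s.%s" % (prefix, pool, column),
                "--value=%d" % stats[column],
                "--type=uint32",
                "--units=tasks",
            ])
    for message, count in sorted(dropped.items()):
        commands.append([
            "gmetric",
            "--name=%s.dropped.%s" % (prefix, message),
            "--value=%d" % count,
            "--type=uint32",
            "--units=messages",
        ])
    return commands


# A few rows copied from the output above, used as sample input.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                       256       381     1245117434         0                 0
FlushWriter                       0         0           6081         0              2075

Message type           Dropped
READ                    515762
MUTATION                    39
"""

pools, dropped = parse_tpstats(SAMPLE)
for command in gmetric_commands(pools, dropped):
    print(" ".join(command))
```

In practice the sample text would be replaced by the captured output of
nodetool tpstats, and each argument list handed to something like
subprocess.call from a cron job. Whether nodetool answers at all while a node
is in a GC spiral is of course the open question.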