Nodes get stuck

Keith Wright Tue, 20 Aug 2013 17:34:37 -0700

Hi all,

    We are using C* 1.2.4 with vnodes and SSDs.  Recently, 3 of our nodes have 
locked up under high load in what appears to be a GC spiral, while the rest of 
the cluster (7 total nodes) appears fine.  When I run tpstats, I see the 
following (assuming tpstats returns at all), and top shows cassandra pegged at 
2000%.  Reads are obviously backing up: ReadStage has 381 pending tasks and 
over 500,000 READ messages have been dropped.  In the past I could attribute 
this to unexpectedly wide rows, but we have since dealt with those.  When the 
cluster starts to melt down like this, it's hard to get visibility into what's 
going on and what triggered the issue, as everything starts to pile on.  
OpsCenter becomes unusable, and because the affected nodes are under GC 
pressure, getting any data via nodetool or JMX is also difficult.  What do 
people do to handle these situations?  We are going to start graphing reads 
and writes per second per CF to Ganglia in the hope that it helps.


Thanks

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                       256       381     1245117434         0                 0
RequestResponseStage              0         0     1161495947         0                 0
MutationStage                     8         8      481721887         0                 0
ReadRepairStage                   0         0       85770600         0                 0
ReplicateOnWriteStage             0         0       21896804         0                 0
GossipStage                       0         0        1546196         0                 0
AntiEntropyStage                  0         0           5009         0                 0
MigrationStage                    0         0           1082         0                 0
MemtablePostFlusher               0         0          10178         0                 0
FlushWriter                       0         0           6081         0              2075
MiscStage                         0         0             57         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0              6         0                 0
HintedHandoff                     1         1            246         0                 0

Message type           Dropped
RANGE_SLICE                482
READ_REPAIR                  0
BINARY                       0
READ                    515762
MUTATION                    39
_TRACE                       0
REQUEST_RESPONSE            29
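
For anyone who wants to turn output like the above into numbers that can be 
graphed or alerted on, here is a minimal Python sketch.  It assumes the tpstats 
layout shown in this message (five numeric columns per pool, one count per 
dropped message type).  The names parse_tpstats and find_trouble and the 
pending_limit threshold are made up for illustration; they are not part of 
nodetool or Cassandra, and the threshold is an arbitrary starting point.

```python
import re

POOL_FIELDS = ("active", "pending", "completed", "blocked", "all_time_blocked")

# A pool row is a name followed by five integers. \s+ also matches newlines,
# so rows that a mail client wrapped onto a second line still parse.
POOL_ROW = re.compile(
    r"^(\w+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$", re.MULTILINE
)
# A dropped-message row is a name followed by exactly one integer on one line.
DROPPED_ROW = re.compile(r"^(\w+)[ \t]+(\d+)[ \t]*$", re.MULTILINE)


def parse_tpstats(text):
    """Return (pools, dropped) dictionaries parsed from tpstats text."""
    pools = {}
    for m in POOL_ROW.finditer(text):
        values = (int(v) for v in m.groups()[1:])
        pools[m.group(1)] = dict(zip(POOL_FIELDS, values))
    dropped = {m.group(1): int(m.group(2)) for m in DROPPED_ROW.finditer(text)}
    return pools, dropped


def find_trouble(pools, dropped, pending_limit=100):
    """List human-readable warnings; pending_limit is an assumed threshold."""
    alerts = []
    for name, stats in pools.items():
        if stats["pending"] > pending_limit:
            alerts.append(f"{name}: {stats['pending']} pending")
        if stats["all_time_blocked"] > 0:
            alerts.append(f"{name}: {stats['all_time_blocked']} all-time blocked")
    for message_type, count in dropped.items():
        if count > 0:
            alerts.append(f"{message_type}: {count} dropped")
    return alerts


# A shortened copy of the output above, used as sample input.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                       256       381     1245117434         0                 0
MutationStage                     8         8      481721887         0                 0
FlushWriter                       0         0           6081         0              2075

Message type           Dropped
RANGE_SLICE                482
READ                    515762
MUTATION                    39
"""

if __name__ == "__main__":
    sample_pools, sample_dropped = parse_tpstats(SAMPLE)
    for line in find_trouble(sample_pools, sample_dropped):
        print(line)
```

The dropped counts and the Completed column are cumulative since the node 
started, so for graphing it is probably more useful to sample periodically and 
plot the difference between consecutive samples than to plot the raw values.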
