I would first check whether a time synchronization issue among the nodes triggered and/or perpetuated the event.
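A quick way to eyeball that is to compare each node's clock against one reference box. A rough sketch (Python, assuming passwordless SSH from wherever you run it; the hostnames are placeholders for your five nodes):

    #!/usr/bin/env python
    # Rough clock-skew check: compare each node's system clock to this box.
    # Assumes passwordless SSH; the hostnames below are placeholders.
    import subprocess
    import time

    NODES = [
        "cass-usw1-a", "cass-usw1-b",   # AWS US-West-1
        "cass-usw2-a", "cass-usw2-b",   # AWS US-West-2
        "cass-fremont",                 # Linode Fremont
    ]

    for host in NODES:
        t0 = time.time()
        out = subprocess.check_output(["ssh", host, "date", "+%s.%N"])
        t1 = time.time()
        remote = float(out.decode().strip())
        midpoint = (t0 + t1) / 2.0      # rough network-delay compensation
        print("%-15s offset: %+.3f s" % (host, remote - midpoint))

Anything much beyond a few hundred milliseconds between regions would make me look harder at NTP on those boxes; checking ntpq -p / ntpd health on each node directly is the more standard route if you already have that in place.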
ml

On Wed, Jun 4, 2014 at 3:12 AM, Arup Chakrabarti <[email protected]> wrote:

> Hello. We had some major latency problems yesterday with our 5 node
> cassandra cluster. Wanted to get some feedback on where we could start to
> look to figure out what was causing the issue. If there is more info I
> should provide, please let me know.
>
> Here are the basics of the cluster:
> Clients: Hector and Cassie
> Size: 5 nodes (2 in AWS US-West-1, 2 in AWS US-West-2, 1 in Linode Fremont)
> Replication Factor: 5
> Quorum Reads and Writes enabled
> Read Repair set to true
> Cassandra Version: 1.0.12
>
> We started experiencing catastrophic latency from our app servers. We
> believed at the time this was due to compactions running, and the clients
> were not re-routing appropriately, so we disabled thrift on a single node
> that had high load. This did not resolve the issue. After that, we stopped
> gossip on the same node that had high load on it, again this did not
> resolve anything. We then took down gossip on another node (leaving 3/5 up)
> and that fixed the latency from the application side. For a period of ~4
> hours, every time we would try to bring up a fourth node, the app would see
> the latency again. We then rotated the three nodes that were up to make
> sure it was not a networking event related to a single region/provider and
> we kept seeing the same problem: 3 nodes showed no latency problem, 4 or 5
> nodes would. After the ~4 hours, we brought the cluster up to 5 nodes and
> everything was fine.
>
> We currently have some ideas on what caused this behavior, but has anyone
> else seen this type of problem where a full cluster causes problems, but
> removing nodes fixes it? Any input on what to look for in our logs to
> understand the issue?
>
> Thanks
>
> Arup
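One other thing worth keeping in mind while you dig through the logs: with a replication factor of 5, QUORUM is floor(5/2) + 1 = 3 replicas, so the three nodes you left up were exactly the minimum needed for your quorum reads and writes to keep succeeding. That part of the behaviour is expected; the puzzle is why bringing the 4th and 5th replicas back into the request path made latency worse rather than better, which is where clock skew (or a node that gossip reports as up but that responds slowly) is worth ruling out first.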
