Hi guys,

I observed some odd behaviour with our Cassandra cluster the other day
while doing a maintenance operation, and I was wondering if anyone could
provide some insight.

Initially, I started a node up to join the cluster. That node appeared to
be having trouble joining due to some SSTable corruption it encountered.
Since it was still in the early stages and I had never seen this failure
before, I decided to take it out of commission and just try again. However,
since the node was in a bad state, I issued a "nodetool removenode
<host id>" on a peer rather than a "nodetool decommission" on the node
itself.
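
(Concretely, the two options were, with the host id taken from the Host ID
column of "nodetool status":

    nodetool removenode <host id>    # run on a live peer
    nodetool decommission            # run on the node that is leaving

and I went with removenode since the joining node was already broken.)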

The removenode command hung indefinitely - my guess is that this is related
to https://issues.apache.org/jira/browse/CASSANDRA-6542. We are using
2.1.11.
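
(For reference, as far as I know the removal's progress can only be watched
with

    nodetool removenode status

run against the node that issued the removal.)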

While this was happening, the driver in the application started logging
error messages about not being able to reach a quorum of 4. This was
mysterious to me, as none of my keyspaces have an RF > 3; a quorum of 4
implies an RF of 6 or 7.
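
(For the record, the replication settings can be double-checked from cqlsh,
e.g.

    DESCRIBE KEYSPACE my_keyspace;

or, on 2.1,

    SELECT keyspace_name, strategy_options FROM system.schema_keyspaces;

where "my_keyspace" is just a placeholder.)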

I eventually forced that node out of the ring with "nodetool removenode
force". This seemed to mostly fix the issue, though the load spike was
apparently enough for some of the machines' JVMs to accumulate a lot of
garbage very quickly and, while trying to clean it up, pause long enough to
spit out a ton of "Not marking nodes down due to local pause of ..."
messages. Some of these nodes seemed unresponsive to their peers, which
marked them DOWN (as indicated by "nodetool status" and the Cassandra log
files on those machines), further exacerbating the situation on the nodes
that were still up.
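
(The pause messages I am referring to are the ones you can find with
something like

    grep "Not marking nodes down due to local pause" /var/log/cassandra/system.log

where the log path obviously depends on your install.)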

I guess my question is two-fold. First, can anyone provide some insight
into what may have happened? Second, what do you consider good practices
when dealing with such issues? Any advice is greatly appreciated!

Thanks,
Rutvij
