Re: Incorrect quorum count in driver error logs

2017-06-26 Thread Rutvij Bhatt
Yes.

On Mon, Jun 26, 2017 at 5:45 PM Hannu Kröger wrote:

> Just to be sure: you have only one datacenter configured in Cassandra?
>
> Hannu


Re: Incorrect quorum count in driver error logs

2017-06-26 Thread Hannu Kröger
Just to be sure: you have only one datacenter configured in Cassandra?

Hannu



Incorrect quorum count in driver error logs

2017-06-26 Thread Rutvij Bhatt
Hi guys,

I observed some odd behaviour with our Cassandra cluster the other day
while doing a maintenance operation and was wondering if anyone would be
able to provide some insight.

Initially, I started a node up to join the cluster. That node appeared to
be having issues joining due to some SSTable corruption it encountered.
Since it was still in the early stages and I had never seen this failure
before, I decided to take it out of commission and just try again. However,
since it was in a bad state, I decided to issue a "nodetool removenode
" on a peer rather than a "nodetool decommission" on the node
itself.

The removenode command hung indefinitely - my guess is that this is related
to https://issues.apache.org/jira/browse/CASSANDRA-6542. We are using
2.1.11.

While this was happening, the driver in the application started logging
error messages about not being able to reach a quorum of 4. This, to me,
was mysterious as none of my keyspaces have an RF > 3. That quorum count in
the error implied an RF of 6 or 7.
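
For reference, the arithmetic behind that inference can be sketched as
follows. This is only an illustration of the usual quorum formula,
floor(RF / 2) + 1; the function names are made up for this example and are
not part of any driver or Cassandra API.

```python
from typing import List


def quorum(replication_factor: int) -> int:
    """Replicas needed for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def rfs_for_quorum(required: int, max_rf: int = 10) -> List[int]:
    """Replication factors (1..max_rf) whose quorum equals `required`."""
    return [rf for rf in range(1, max_rf + 1) if quorum(rf) == required]


if __name__ == "__main__":
    # With RF = 3, as in our keyspaces, a quorum is 2 replicas.
    print(quorum(3))
    # A reported quorum of 4 corresponds to an RF of 6 or 7.
    print(rfs_for_quorum(4))
```

So a keyspace with RF = 3 should only ever ask for 2 replicas at QUORUM,
which is why a required count of 4 looked wrong to me.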

I eventually forced that node out of the ring with "nodetool removenode
force". This seemed to mostly fix the issue, though there seems to have
been enough of a load spike to make some of the machines' JVMs accumulate
garbage very quickly and, while trying to collect it, log a ton of "Not
marking nodes down due to local pause of ..." messages. Some of these nodes
seemed unresponsive to their peers, which marked them DOWN (as indicated by
"nodetool status" and the Cassandra log file on those machines), further
exacerbating the situation on the nodes that were still up.

I guess my question is two-fold. First, can anyone provide some insight
into what may have happened? Second, what do you consider good practices
when dealing with such issues? Any advice is greatly appreciated!

Thanks,
Rutvij