Hello all,

I’ve been doing more analysis and I have a few questions:


  1.  We observed that most of the requests are blocked on the NTR queue. I 
increased the queue size from 128 (default) to 1024, and this time the system 
does recover automatically (latencies go back to normal) without removing the 
node from the cluster (see the sketch after this list).
  2.  Is there a way to fail NTR requests fast when the queue is full, rather 
than having them block on the NTR queue?
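
For reference, here is a minimal sketch of how the queue bound was raised, 
assuming the limit in question is the cassandra.max_queued_native_transport_requests 
JVM property (default 128) set via cassandra-env.sh; please correct me if that 
is not the right knob:

  # cassandra-env.sh (sketch; property name and location are assumptions)
  JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=1024"

A node restart is needed for the new bound to take effect.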

Thanks,
Pratik
From: "Agrawal, Pratik" <paagr...@amazon.com>
Date: Monday, December 3, 2018 at 11:55 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>, Marc Selwan 
<marc.sel...@datastax.com>
Cc: Jeff Jirsa <jji...@gmail.com>, Ben Slater <ben.sla...@instaclustr.com>
Subject: Re: Cassandra single unreachable node causing total cluster outage

Hello,


  1.  Cassandra latencies spiked to 5-6 times normal (both reads and writes). 
The latencies were in the high single-digit seconds.
  2.  As I said in my previous email, we don’t bound the NTR threads and queue; 
the nodes’ NTR queues started piling up and requests started getting blocked. 
8 (mainly 4) out of 18 nodes in the cluster had NTR requests blocked.
  3.  As a result of 1) and 2), Cassandra system resources spiked (CPU, IO, 
system load, # SSTables (10x, 250 -> 2500), memtable switch count, pending 
compactions, etc.).
  4.  One interesting thing we observed was that read calls with quorum 
consistency were not having any issues (beyond high latencies and requests 
backing up), while read calls with serial consistency were consistently failing 
on the client side due to C* timeouts.
  5.  We used the nodetool removenode command to remove the node from the 
cluster (command sketch below). The node wasn’t reachable (IP down).
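
For completeness, the removal was along these lines; a sketch only, with the 
host ID being a placeholder looked up from nodetool status:

  # find the Host ID of the DN node, then remove it
  nodetool status
  nodetool removenode <host-id-of-dead-node>
  # if the removal stalls, progress can be checked or the removal forced
  nodetool removenode status
  nodetool removenode force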

One thing we don’t understand is that as soon as we remove the dead node from 
the cluster, the system recovers within a minute or two. My main question: is 
there a bug in C* where serial-consistency calls get blocked on some resource 
tied to the dead node, and that resource is only released once the dead node is 
removed from the cluster, OR are we hitting some limit here?

Also, as the cluster size increases, the impact of the dead node on 
serial-consistency reads decreases (latencies spike for a minute or two and 
then the system recovers automatically).
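
To be precise about what I mean by a serial-consistency read: a read issued at 
SERIAL consistency, which goes through the Paxos/LWT read path rather than a 
plain quorum read. A rough cqlsh illustration (a sketch; the keyspace, table 
and key are placeholders, and I’m assuming cqlsh accepts SERIAL as a read 
consistency level here):

  CONSISTENCY SERIAL;
  SELECT * FROM my_ks.my_table WHERE id = 42;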

Any pointers?

Thanks,
Pratik

From: Marc Selwan <marc.sel...@datastax.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, December 3, 2018 at 1:09 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Cc: Jeff Jirsa <jji...@gmail.com>, Ben Slater <ben.sla...@instaclustr.com>
Subject: Re: Cassandra single unreachable node causing total cluster outage

Ben's question is a good one - What are the exact symptoms you're experiencing? 
Is it latency spikes? Nodes flapping? That'll help us figure out where to look.

When you removed the down node, which command did you use?

Best,
Marc

On Sun, Dec 2, 2018 at 1:36 PM Agrawal, Pratik <paagr...@amazon.com.invalid> 
wrote:
One other thing I forgot to add:

native_transport_max_threads: 128

We have commented this setting out; should we bound it? I am planning to 
experiment with bounding this setting.
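
If we do bound it explicitly, the change would just be uncommenting the setting 
in cassandra.yaml; a minimal sketch (128 is only a starting point for the 
experiment, not a recommendation):

  # cassandra.yaml (sketch)
  # cap the number of native transport (CQL) request threads explicitly
  native_transport_max_threads: 128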

Thanks,
Pratik

From: "Agrawal, Pratik" <paagr...@amazon.com<mailto:paagr...@amazon.com>>
Date: Sunday, December 2, 2018 at 4:33 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>, Jeff Jirsa 
<jji...@gmail.com<mailto:jji...@gmail.com>>, Ben Slater 
<ben.sla...@instaclustr.com<mailto:ben.sla...@instaclustr.com>>

Subject: Re: Cassandra single unreachable node causing total cluster outage

I looked into some of the logs and saw that, at the time of the event, native 
transport requests started getting blocked.

e.g.

 [INFO] org.apache.cassandra.utils.StatusLogger: Native-Transport-Requests      
 128       133       51795821        16             19114
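
My reading of that line, assuming the usual StatusLogger column order for a 
thread pool (Active / Pending / Completed / Blocked / All time blocked), is:

  Pool Name                    Active   Pending    Completed   Blocked   All Time Blocked
  Native-Transport-Requests       128       133     51795821        16              19114

i.e. all 128 request threads appear busy, with requests pending and the blocked 
counts climbing.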

The number of blocked requests kept increasing over a period of about 5 minutes 
and then held steady.

As soon as we remove the dead node from the cluster, things recover pretty 
quickly and the cluster becomes stable.

Any pointers on what to look at to debug why requests are getting blocked when 
a node goes down?
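
For anyone following along, the blocked counts can also be checked directly 
from the command line rather than waiting for StatusLogger output; a minimal 
sketch:

  # thread pool stats on the affected node, including Native-Transport-Requests
  nodetool tpstats
  # poll just the NTR line every few seconds
  while true; do nodetool tpstats | grep -E 'Pool Name|Native-Transport-Requests'; sleep 5; done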

Also, one other thing to note: we reproduced this scenario in our test 
environment, and as we scale the cluster up it automatically recovers in a 
matter of minutes without removing the node from the cluster. It seems like we 
are reaching some vertical scalability limit (maybe because of our 
configuration).


Thanks,
Pratik
From: Jeff Jirsa <jji...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, November 27, 2018 at 9:37 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Cassandra single unreachable node causing total cluster outage

Could also be the app not detecting that the host is down, so it keeps trying 
to use it as a coordinator.

--
Jeff Jirsa


On Nov 27, 2018, at 6:33 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:
In what way does the cluster become unstable (i.e., more specifically, what are 
the symptoms)? My first thought would be the loss of the node causing the other 
nodes to become overloaded, but that doesn’t seem to fit with your point 2.

Cheers
Ben

---

Ben Slater
Chief Product Officer


Read our latest technical blog posts here: https://www.instaclustr.com/blog/



On Tue, 27 Nov 2018 at 16:32, Agrawal, Pratik <paagr...@amazon.com.invalid> wrote:
Hello all,

Setup:

18-node Cassandra cluster, Cassandra version 2.2.8
Amazon EC2 c3.2xlarge machines.
Replication factor of 3 (in 3 different AZs; a keyspace sketch is below).
Reads and writes use QUORUM.
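
For context, the keyspaces are defined roughly along these lines; a sketch 
only, with the keyspace and datacenter names as placeholders (assuming an EC2 
snitch, where the region is the DC and the AZs map to racks):

  -- sketch; keyspace and DC names are placeholders
  CREATE KEYSPACE app_ks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3};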

Use case:


  1.  Short-lived data with heavy updates (I know we are abusing Cassandra 
here), with a gc grace period of 15 minutes (I know it sounds ridiculous). 
Leveled compaction strategy.
  2.  Time-series data, no updates, short-lived (1 hr). TTLed out, using the 
date-tiered compaction strategy (a rough table sketch follows below).
  3.  Time-series data, no updates, long-lived (7 days). TTLed out, using the 
date-tiered compaction strategy.

Overall high read and write throughput (~100,000 operations/second).
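
To make the time-series use cases concrete, the tables look roughly like this 
(a sketch; the table and column names, and the exact options, are illustrative 
rather than our real schema):

  -- sketch for use case 2: short-lived time series, 1 hour TTL, date-tiered compaction
  CREATE TABLE app_ks.events_1h (
      key    text,
      ts     timestamp,
      value  blob,
      PRIMARY KEY (key, ts)
  ) WITH compaction = {'class': 'DateTieredCompactionStrategy'}
    AND default_time_to_live = 3600;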

Problem:

  1.  The EC2 machine becomes unreachable (we reproduced the issue by taking 
down the network card), and the entire cluster becomes unstable until the down 
node is removed from the cluster. The node is shown as DN in nodetool status 
(example output after this list). Our understanding was that a single node down 
in one AZ should not impact other nodes. We are unable to understand why a 
single node going down causes the entire cluster to become unstable. Is there 
any open bug around this?
  2.  We tried another experiment of killing the Cassandra process; in this 
case we only see a blip in latencies, and all the other nodes stay healthy and 
responsive (as expected).
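
For reference, during the outage the dead node shows up like this in nodetool 
status (a sketch with made-up addresses and placeholder host IDs; exact column 
layout is from memory and may differ):

  Datacenter: us-east
  ===================
  Status=Up/Down
  |/ State=Normal/Leaving/Joining/Moving
  --  Address     Load   Tokens  Owns  Host ID                 Rack
  UN  10.0.1.11   ...    256     ...   <host-id>               1a
  DN  10.0.2.12   ...    256     ...   <host-id-of-dead-node>  1b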

Any thoughts/comments on what could be the issue here?

Thanks,
Pratik



--
Marc Selwan | DataStax | Product Management | (925) 413-7079

