Ben's question is a good one: what are the exact symptoms you're
experiencing? Is it latency spikes? Nodes flapping? That'll help us figure
out where to look.

When you removed the down node, which command did you use?
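
For reference, the options behave quite differently; a rough sketch (the host
ID below is a placeholder, not from your cluster):

    # Find the host ID of the DN node
    nodetool status

    # If the node is dead for good, re-stream its data from the remaining
    # replicas and take it out of the ring:
    nodetool removenode <host-id-of-down-node>

    # nodetool decommission only works when run on a live node, and
    # nodetool assassinate (2.2+) drops the node from gossip without
    # re-streaming, so it is a last resort.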

Best,
Marc

On Sun, Dec 2, 2018 at 1:36 PM Agrawal, Pratik <paagr...@amazon.com.invalid>
wrote:

> One other thing I forgot to add:
>
>
>
> native_transport_max_threads: 128
>
>
>
> We have this setting commented out - should we bound it? I am planning to
> experiment with bounding it.
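>
> A rough way to see whether that pool is actually saturating before we bound
> it (paths below are assumptions for a typical package install):
>
>     grep -n 'native_transport_max_threads' /etc/cassandra/cassandra.yaml
>     # If the line is commented out, the shipped 2.2 default of 128 applies,
>     # so uncommenting it with the same value should not change behaviour.
>     nodetool tpstats | grep -i 'Native-Transport-Requests'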
>
>
>
> Thanks,
> Pratik
>
>
>
> *From: *"Agrawal, Pratik" <paagr...@amazon.com>
> *Date: *Sunday, December 2, 2018 at 4:33 PM
> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>, Jeff Jirsa
> <jji...@gmail.com>, Ben Slater <ben.sla...@instaclustr.com>
>
>
> *Subject: *Re: Cassandra single unreachable node causing total cluster
> outage
>
>
>
> I looked into some of the logs and saw that, at the time of the event,
> native transport requests started getting blocked.
>
>
>
> e.g.
>
> [INFO] org.apache.cassandra.utils.StatusLogger:
> Native-Transport-Requests   128   133   51795821   16   19114
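>
> (If I am reading the StatusLogger columns right for 2.2, those figures are
> Active / Pending / Completed / Blocked / All Time Blocked, i.e. 128 active,
> 133 pending and 16 currently blocked.) A crude way we watch that pool live
> on a node (assuming nodetool is on the path):
>
>     watch -n 5 "nodetool tpstats | grep -i 'Native-Transport-Requests'"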
>
>
>
> The number of blocked requests kept increasing for about 5 minutes and then
> held steady.
>
>
>
> As soon as we remove the dead node from the cluster, things recover pretty
> quickly and the cluster becomes stable again.
>
>
>
> Any pointers on what to look at to debug why requests are getting blocked
> when a node goes down?
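>
> For context, this is roughly what we have been checking on the healthy nodes
> while the dead node is still in the ring (all standard 2.2 nodetool commands):
>
>     nodetool tpstats      # blocked Native-Transport-Requests and backed-up stages
>     nodetool netstats     # pending command/response messages and any streaming
>     nodetool gossipinfo   # how the rest of the ring currently sees the down endpoint
>     nodetool cfstats      # per-table latencies (tablestats in newer releases)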
>
>
>
> Also, one other thing to note: we reproduced this scenario in our test
> environment, and as we scale the cluster up it recovers automatically in a
> matter of minutes without removing the node from the cluster. It seems like
> we are hitting some vertical scalability limit (maybe because of our
> configuration).
>
>
>
>
>
> Thanks,
>
> Pratik
>
> *From: *Jeff Jirsa <jji...@gmail.com>
> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Date: *Tuesday, November 27, 2018 at 9:37 PM
> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Subject: *Re: Cassandra single unreachable node causing total cluster
> outage
>
>
>
> Could also be the app not detecting that the host is down and continuing to
> try to use it as a coordinator.
>
>
>
> --
>
> Jeff Jirsa
>
>
>
>
> On Nov 27, 2018, at 6:33 PM, Ben Slater <ben.sla...@instaclustr.com>
> wrote:
>
> In what way does the cluster become unstable (i.e. more specifically, what
> are the symptoms)? My first thought would be the loss of the node causing
> the other nodes to become overloaded, but that doesn't seem to fit with your
> point 2.
>
>
>
> Cheers
>
> Ben
>
> ---
>
> *Ben Slater*
> *Chief Product Officer*
>
>
>
>
>
>
>
> On Tue, 27 Nov 2018 at 16:32, Agrawal, Pratik <paagr...@amazon.com.invalid>
> wrote:
>
> Hello all,
>
>
>
> *Setup:*
>
>
>
> 18-node Cassandra cluster, version 2.2.8.
>
> Amazon c3.2xlarge machines.
>
> Replication factor of 3 (across 3 different AZs).
>
> Reads and writes at QUORUM.
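>
> For completeness, the keyspace definition is roughly the following (keyspace
> and DC names here are made up; the real DC name depends on the snitch):
>
>     cqlsh -e "CREATE KEYSPACE demo WITH replication =
>       {'class': 'NetworkTopologyStrategy', 'us-east': 3};"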
>
>
>
> *Use case:*
>
>
>
>    1. Short-lived data with heavy updates (I know we are abusing Cassandra
>    here) and a gc_grace_seconds of 15 minutes (I know it sounds ridiculous),
>    using the leveled compaction strategy. A rough CQL sketch follows this
>    list.
>    2. Time-series data, no updates, short-lived (1 hr), TTLed out using the
>    date-tiered compaction strategy.
>    3. Time-series data, no updates, long-lived (7 days), TTLed out using the
>    date-tiered compaction strategy.
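>
> A rough sketch of what the use case 1 table looks like (keyspace, table and
> column names are made up, not our real schema; gc_grace_seconds 900 = 15 min):
>
>     cqlsh -e "CREATE TABLE demo.short_lived_state (
>         id      text PRIMARY KEY,
>         payload text
>     ) WITH gc_grace_seconds = 900
>       AND compaction = {'class': 'LeveledCompactionStrategy'};"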
>
>
>
> Overall high read and write throughput (~100,000 operations/second).
>
>
>
> *Problem:*
>
>    1. The EC2 machine becomes unreachable (we reproduced the issue by
>    taking down the network card) and the entire cluster becomes unstable
>    until the down node is removed from the cluster. The node shows as DN in
>    nodetool status. Our understanding was that a single node down in one AZ
>    should not impact other nodes, so we are unable to understand why a
>    single node going down causes the entire cluster to become unstable. Is
>    there any open bug around this? (A rough reproduction sketch follows this
>    list.)
>    2. We tried another experiment by killing the Cassandra process; in this
>    case we only see a blip in latencies and all the other nodes stay healthy
>    and responsive (as expected).
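>
> For anyone wanting to reproduce the two failure modes, this is roughly what
> we do on the test node (the iptables rules are only an approximation of a
> dead network card; adjust interface and ports to your setup):
>
>     # Case 1: node unreachable while the process stays up
>     sudo iptables -A INPUT -p tcp --dport 7000 -j DROP   # internode messaging
>     sudo iptables -A INPUT -p tcp --dport 9042 -j DROP   # native protocol clients
>
>     # Case 2: clean process death - peers see TCP resets and mark it down quickly
>     sudo kill -9 $(pgrep -f CassandraDaemon)
>
>     # In both cases, from another node the victim shows as DN:
>     nodetool status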
>
>
>
> Any thoughts/comments on what could be the issue here?
>
>
>
> Thanks,
> Pratik
>
>
>
>
>
>
>
> --
Marc Selwan | DataStax | Product Management | (925) 413-7079
