It's possible the queries you're normally running are served out of the page cache, and during the latency spike you're hitting your disks. If you're using read-ahead, you might be hitting a throughput limit on the disks.
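
A quick way to sanity-check that during a spike (the device name below is just an example, use whatever actually backs your Cassandra data directory):

    # current readahead setting for the data volume, in 512-byte sectors
    sudo blockdev --getra /dev/nvme1n1

    # per-second extended disk stats; watch rMB/s and %util to see whether
    # reads are pushing up against the volume's throughput ceiling
    iostat -xm 1 30

gp2 also has a per-volume throughput cap (on the order of 250 MB/s), so heavy readahead can saturate the volume well before you run out of IOPS.
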
I've got some numbers and graphs I can share later when I'm not on my phone.

Jon

On Wed, Oct 16, 2019, 3:03 AM Alain RODRIGUEZ <arodr...@gmail.com> wrote:

> Hello Bill,
>
> I think it might be worth focusing the effort on diagnosing the issue properly in the first place, so I'll try to guide you through this.
>
> First, some comments on your environment:
>
>> AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zones in each region.
>> No of Nodes: 24
>> Data Centers: 4 (6 nodes in each data center, 2 OLTP data centers for APIs and 2 OLAP data centers for Analytics and Batch loads)
>> Instance Types: r5.8xlarge
>> Average Node Size: 182 GB
>> Work Load: Read heavy
>
> When I read this, I think "tune the garbage collection properly!". Do you see GC pauses being a problem? The easiest way to interpret GC logs is probably to upload them to https://gceasy.io. Ensure that the 'GC throughput' reported there is at least around 95% (ideally 98% or more); that would mean each node spends no more than 2 to 5% of its time in stop-the-world GC pauses. If that's a problem, we can help you set the GC options up a bit better than they probably are now. This post would then be a good starting point: https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>
>> Read TPS: 22k
>> Cassandra version: 3.0.15
>
> Reading this, I'd recommend an upgrade to the latest 3.0.x (3.0.18 at the time of writing) or, personal preference, 3.11.4. There have been a bunch of fixes since 3.0.15, and maybe you are hitting something that was already fixed. Check the changes there and see if any of them could do some good for your use case: https://github.com/apache/cassandra/blob/cassandra-3.0.18/CHANGES.txt#L1-L124
>
>> Java Version: JDK 181.
>> EBS Volumes: GP2 with 1TB 3000 iops.
>
> GP2 IOPS depend on the disk size. If you find at any time that the disks are not keeping up, a good way out could be to increase the disk size (despite the small dataset) to actually increase the disk IOPS and throughput. Now, this did not change recently and it was working for you until now, so you probably don't have to increase the disk size right away. Just be aware that 1 TB GP2 volumes are quite slow.
>
> About the issue:
>
>> our p99 latency in our AWS us-east-1 region OLTP data center suddenly starts rising from 2 ms to 200 ms. It starts with one node where we see the 99th percentile Read Request latency in DataStax OpsCenter start increasing. And it spreads immediately to all the other nodes in the data center.
>
> Well, this sounds quite bad. The first two things that come to my mind here are:
> - Are you reading tombstones? (Check the logs for tombstone warnings, trace a few queries.)
> - Are you reading a huge partition? (Check the max partition size, compare it to the mean, and ensure it stays below 1 GB, or ideally even 100 MB.)
>
> An inefficient read, for the reasons above or others, would not necessarily impact the nodes' resources but could definitely destroy performance for that query and for the following ones due to request congestion.
>
> To get a sense of the current tombstone level you can look at:
> - the logs (grep for "tombstone")
> - sstablemetadata, which gives you the percentage of droppable tombstones. This is only an estimate of the space that could be freed; it gives no information on whether tombstones are actually being read and hurting performance, but it gives you an idea of the tombstones generated by the workload.
> - query traces: either trace a manual query from cqlsh with 'TRACING ON;' and then send queries similar to the production ones, or use 'nodetool settraceprobability X' directly. /!\ Ensure X is really low to start with, like 0.001 or 0.0001; we probably don't need many traced queries to understand what happened, and tracing can inflict a big performance penalty on the Cassandra servers (each traced query induces a bunch of extra queries to persist the trace in the system_traces keyspace).
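>
> If you go the settraceprobability route, the whole sequence can stay small. A minimal sketch, run on one of the affected nodes (the 0.001 rate and the LIMIT are only examples, and the session id is a placeholder):
>
>     nodetool settraceprobability 0.001
>     # let it run for a few minutes while latency is high, then look at the
>     # recorded sessions and their durations (in microseconds):
>     cqlsh -e "SELECT session_id, duration, started_at FROM system_traces.sessions LIMIT 50;"
>     # drill into one suspicious session:
>     cqlsh -e "SELECT source_elapsed, activity FROM system_traces.events WHERE session_id = <session_id from above>;"
>     # and switch tracing back off:
>     nodetool settraceprobability 0
>
> In the per-session events, messages along the lines of "Read X live rows and Y tombstone cells", or a read touching many SSTables, are usually the smoking gun.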
>
>> We do not see any Read request timeouts or Exceptions in our API Splunk logs, only p99 and average latency go up suddenly.
>
> What value do you use for timeouts? Also, do you see any other exceptions/timeouts anywhere else than for reads?
> What are the results of:
>
> - nodetool tablestats (I think this would gather what you need to check --> nodetool tablestats | grep -e Keyspace -e Table: -e latency -e partition -e tombstones)
> - watch -d nodetool tpstats (here, look for any pending threads constantly higher than 0, and for any blocked or dropped threads)
>
>> We have investigated CPU usage, Disk I/O, Memory usage and Network parameters for the nodes during this period and we are not experiencing any sudden surge in these parameters.
>
> If the resources are fine, there is a bottleneck within Cassandra that we need to find. The commands above aim at exactly that: finding Cassandra's bottleneck, assuming the machines can handle more load.
>
>> We set up a client using WhiteListPolicy to send queries to each of the 6 nodes to understand which one is bad, but we see all of them responding with very high latency. It doesn't happen during our peak traffic period sometime in the night.
>
> This brings something else to my mind. The fact that latency goes down when traffic goes up may simply mean that the queries sent during the traffic spike are really efficient; even if some other queries remain slow (including during peak hours), having that many efficient/quick requests lowers the average/pXX latencies. Does the max latency change over time?
>
> You can try to get a sense of this with:
>
> - nodetool proxyhistograms
> - nodetool tablehistograms <keyspace> <table> # for the table-level stats
>
>> We checked the system.log files on our nodes, took a thread dump and checked for any rogue processes running on the nodes that might be stealing CPU, but we were able to find nothing.
>
> From what I read/understand, the resources are fine, so I would put these searches aside for now. For the log files, I like to use:
>
> - tail -fn 100 /var/log/cassandra/system.log # see the current logs (if you are having the issue NOW)
> - grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log # to check what happened and what went wrong
>
> For now I can't think of anything else. I hope some of these ideas will help you diagnose the problem; once it is diagnosed, we should be able to reason about how we can fix it.
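>
> One more concrete example, for the sstablemetadata check mentioned earlier: a rough loop over one table's SSTables could look like this (the path is only an example, adjust it to your data directory, keyspace and table):
>
>     for f in /var/lib/cassandra/data/<keyspace>/<table>-*/*-Data.db; do
>         echo "== $f"
>         sstablemetadata "$f" | grep -i droppable
>     done
>
> It only reads SSTable metadata, but I would still run it outside the latency window, and keep in mind that the droppable tombstones figure is an estimate, not proof that your reads are hitting those tombstones.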
>
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Tue, Oct 15, 2019 at 17:26, Reid Pinchback <rpinchb...@tripadvisor.com> wrote:
>
>> I'd look to see if you have compactions fronting the p99s. If so, then go back to looking at the I/O. Disbelieve any metrics not captured at a high resolution for a time window around the compactions, like 100 ms. You could be hitting I/O stalls where reads are blocked by the flushing of writes. It's short-lived when it happens, and per-minute metrics won't provide breadcrumbs.
>>
>> From: Bill Walters <billwalter...@gmail.com>
>> Date: Monday, October 14, 2019 at 7:10 PM
>> To: <user@cassandra.apache.org>
>> Subject: Elevated response times from all nodes in a data center at the same time.
>>
>> Hi Everyone,
>>
>> Need some suggestions regarding a peculiar issue we started facing in our production cluster over the last couple of days.
>>
>> Here are our production environment details.
>>
>> AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zones in each region.
>> No of Nodes: 24
>> Data Centers: 4 (6 nodes in each data center, 2 OLTP data centers for APIs and 2 OLAP data centers for Analytics and Batch loads)
>> Instance Types: r5.8xlarge
>> Average Node Size: 182 GB
>> Work Load: Read heavy
>> Read TPS: 22k
>> Cassandra version: 3.0.15
>> Java Version: JDK 181.
>> EBS Volumes: GP2 with 1TB 3000 iops.
>>
>> 1. We have been running in production for more than one year and our experience with Cassandra has been great. We have experienced little hiccups here and there, but nothing severe.
>>
>> 2. But recently, for the past couple of days, we see a behavior where the p99 latency in our AWS us-east-1 region OLTP data center suddenly starts rising from 2 ms to 200 ms. It starts with one node where we see the 99th percentile Read Request latency in DataStax OpsCenter start increasing. And it spreads immediately to all the other nodes in the data center.
>>
>> 3. We do not see any Read request timeouts or Exceptions in our API Splunk logs, only p99 and average latency go up suddenly.
>>
>> 4. We have investigated CPU usage, Disk I/O, Memory usage and Network parameters for the nodes during this period and we are not experiencing any sudden surge in these parameters.
>>
>> 5. We set up a client using WhiteListPolicy to send queries to each of the 6 nodes to understand which one is bad, but we see all of them responding with very high latency. It doesn't happen during our peak traffic period sometime in the night.
>>
>> 6. We checked the system.log files on our nodes, took a thread dump and checked for any rogue processes running on the nodes that might be stealing CPU, but we were able to find nothing.
>>
>> 7. We even checked the write requests coming in during this time and we do not see any large batch operations happening.
>>
>> 8. Initially we tried restarting the nodes to see if the issue could be mitigated, but it kept happening, and we had to fail over API traffic to the us-west-2 region OLTP data center. After a couple of hours we failed back and everything seems to be working.
>>
>> We are baffled by this behavior; the only correlation we find is the "Native requests pending" in our task queues when this happens.
>>
>> Please let us know your suggestions on how to debug this issue.
>> Has anyone experienced an issue like this before? (We had issues where one node starts acting badly due to bad EBS volume I/O read and write times, but all nodes experiencing an issue at the same time is very peculiar.)
>>
>> Thank You,
>> Bill Walters.