It's possible the queries you're normally running are served out of page
cache, and during the latency spike you're hitting your disks. If you're
using readahead, you might be hitting a throughput limit on the disks.
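
A quick way to check (the device name below is just an example; on r5
instances EBS volumes usually show up as nvme devices):

    # readahead in 512-byte sectors; a large value multiplies the bytes read
    # per random read and can saturate EBS throughput
    sudo blockdev --getra /dev/nvme1n1

    # per-device throughput, IOPS and latency at 1-second resolution
    iostat -xmd 1 nvme1n1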

I've got some numbers and graphs I can share later when I'm not on my
phone.

Jon

On Wed, Oct 16, 2019, 3:03 AM Alain RODRIGUEZ <arodr...@gmail.com> wrote:

> Hello Bill,
>
> I think it might be worth focusing the effort on diagnosing the issue
> properly in the first place, so I'll try to guide you through this.
>
> First some comments on your environment:
>
> AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zone in
>> each region.
>> No of Nodes: 24
>> Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for
>> APIs and 2 OLAP Data centers for Analytics and Batch loads)
>> Instance Types: r5.8x Large
>> Average Node Size: 182 GB
>> Work Load: Read heavy
>>
>
> When I read this, I think 'Tune the garbage collection properly!'. Do you
> see GC pauses being a problem? The easiest way to interpret GC logs is
> probably to upload them to https://gceasy.io. Check there that the 'GC
> throughput' is at least around 95% (ideally 98+%); that would mean each node
> stops for stop-the-world GCs no more than 2 to 5% of the time. If GC pauses
> turn out to be a problem, we can help you set the GC options a bit nicer than
> what they probably are now. This post would then be a good starting point:
> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
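>
> If you want a quick sanity check before uploading anything, a rough sketch
> like this sums the stop-the-world time per log file. It assumes JDK 8 GC logs
> with '-XX:+PrintGCApplicationStoppedTime' enabled and written to
> /var/log/cassandra/gc.log; the path and options may differ in your setup:
>
>     for f in /var/log/cassandra/gc.log*; do
>       awk '/application threads were stopped/ {
>              for (i = 1; i <= NF; i++) if ($i == "stopped:") sum += $(i + 1)
>            }
>            END { printf "%s: %.1f s stopped in total\n", FILENAME, sum + 0 }' "$f"
>     done
>
> Comparing that against the wall-clock time the log covers gives you roughly
> the same throughput number gceasy reports.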
>
>
>> Read TPS: 22k
>> Cassandra version: 3.0.15
>>
>
> Reading this, I'd recommend an upgrade to the latest 3.0 release (3.0.18 at
> the time of writing) or, personal preference, 3.11.4. There have been a bunch
> of fixes since 3.0.15, so you may be hitting something that was already
> fixed. Check the changes there and see if any of them could do some good for
> your use case:
> https://github.com/apache/cassandra/blob/cassandra-3.0.18/CHANGES.txt#L1-L124
>
> Java Version: JDK 181.
>> EBS Volumes: GP2 with 1TB 3000 iops.
>>
>
> The GP2 IOPS depend on the disk size. If you find out at any time that the
> disks are not keeping up, a good way out could be to increase the disk size
> (despite the small dataset) to increase the volume's IOPS and throughput.
> That said, this did not change recently and it was working for you until now,
> so you probably don't have to increase the disk size right away. Just be
> aware that 1 TB GP2 volumes are quite slow.
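>
> To confirm whether the gp2 volume itself is the limit, you can pull the EBS
> CloudWatch metrics for that volume and compare the achieved IOPS against the
> 3000 baseline. A sketch (the volume id and time window are placeholders):
>
>     # total read ops per 60 s period; divide by 60 for average read IOPS
>     aws cloudwatch get-metric-statistics --namespace AWS/EBS \
>       --metric-name VolumeReadOps \
>       --dimensions Name=VolumeId,Value=vol-0abc1234567890def \
>       --statistics Sum --period 60 \
>       --start-time 2019-10-16T02:00:00Z --end-time 2019-10-16T04:00:00Z
>
> VolumeWriteOps and VolumeQueueLength are worth the same treatment, and
> watching the device with iostat during a spike tells the same story from the
> node's side.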
>
> About the issue:
>
> our p99 latency in our AWS us-east-1 region OLTP data center suddenly
>> starts rising from 2 ms to 200 ms. It starts with one node, where we see the
>> 99th percentile read request latency in DataStax OpsCenter start increasing,
>> and it spreads immediately to the other nodes in the data center.
>>
>
> Well, this sounds quite bad. The first 2 things coming to my mind here are:
> - Are you reading tombstones? (check the logs for tombstone warnings, trace a
> few queries)
> - Are you reading a huge partition? (check the max partition size, compare it
> to the mean, and make sure it stays below 1 GB, or ideally even below 100 MB)
>
> An inefficient read, for the reasons above or other reasons, would not
> necessarily impact the nodes' resources but could definitely destroy
> performance for that query and the following ones due to request congestion.
>
> To get a sense of the current tombstone level you can look at:
> - The logs (grep for 'tombstone').
> - sstablemetadata, which gives you the % of droppable tombstones. This is an
> estimate of the space that could be freed; it gives no information on whether
> tombstones are actually being read and hurting performance, but it does give
> you an idea of how many tombstones the workflow generates.
> - Traced queries (see the example just below): either trace a manual query
> from cqlsh with 'TRACING ON;' and then send queries similar to the production
> ones, or use 'nodetool settraceprobability X' directly. /!\ Make sure X is
> really low to start with, like 0.001 or maybe 0.0001; we probably don't need
> many queries to understand what happened, and tracing can inflict a big
> performance penalty on the Cassandra servers (each traced query induces a
> bunch of extra queries to persist the trace in the system_traces keyspace).
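>
> For example (keyspace, table and key below are just placeholders):
>
>     -- from cqlsh on one of the slow nodes, trace a production-like query
>     TRACING ON;
>     SELECT * FROM my_keyspace.my_table WHERE id = 'some_id';
>     TRACING OFF;
>
>     # or sample a tiny fraction of live traffic on a single node
>     nodetool settraceprobability 0.001
>     # ...wait a few minutes during the slow period, then turn it off again
>     nodetool settraceprobability 0
>
> The sampled traces end up in the system_traces.sessions and
> system_traces.events tables. In the trace output, look for lines reporting
> tombstone cells read and for steps with unusually large elapsed times.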
>
> We do not see any read request timeouts or exceptions in our API Splunk
>> logs; only p99 and average latency go up suddenly.
>>
>
> What value do you use for timeouts? Also, do you see any other
> exceptions/timeouts anywhere else than for reads?
> What are the results of:
>
> - nodetool tablestats (I think this would gather what you need to check
> --> nodetool tablestats | grep -e Keyspace -e Table: -e latency -e
> partition -e tombstones)
> - watch -d nodetool tpstats (here look for any pending threads constantly
> higher than 0, and for any blocked or dropped threads; a filtered variant is
> sketched just below)
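>
> To keep an eye on just the most relevant pools (pool names as they appear in
> 3.0), something like this could do:
>
>     watch -d "nodetool tpstats | egrep 'Pool Name|ReadStage|MutationStage|Native-Transport-Requests'"
>
> A Pending column that stays above 0 for Native-Transport-Requests would match
> the 'Native requests pending' correlation you mention further down.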
>
> We have investigated CPU level usage, Disk I/O, Memory usage and Network
>> parameters for the nodes during this period and we are not experiencing any
>> sudden surge in these parameters.
>
>
> If the resources are fine, there is a bottleneck within Cassandra that we
> need to find. The commands above aim at exactly that: finding C*'s
> bottleneck, assuming the machines can handle more load.
>
> We set up a client using WhiteListPolicy to send queries to each of the 6
>> nodes to understand which one is bad, but we see all of them responding
>> with very high latency. It doesn't happen during our peak traffic period
>> sometime in the night.
>
>
> This brings something else to my mind. The fact that latency goes down
> when traffic increases can simply mean that the queries sent during the
> spike are very efficient: even if some other queries remain slow (including
> during peak hours), having that many 'efficient/quick' requests lowers the
> average/pXX latencies. Does the max latency change over time?
>
> You can here try to get a sense of this with:
>
> - nodetool proxyhistograms
> - nodetool tablehistograms <keyspace> <table> # For the table level stats
>
> We checked the system.log files on our nodes, took a thread dump and
>> checked for any rogue processes running on the nodes that might be stealing
>> CPU, but we found nothing.
>
>
> From what I read/understand, resources are fine, so I would put these
> searches aside for now. For the log files, I like to use:
>
> - tail -fn 100 /var/log/cassandra/system.log #See current logs (if you are
> having the issues NOW)
> - grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log # to check what
> happened and what went wrong
>
> For now I can't think of anything else. I hope some of these ideas will
> help you diagnose the problem. Once it is diagnosed, we should be able to
> reason about how to fix it.
>
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Tue, Oct 15, 2019 at 17:26, Reid Pinchback <rpinchb...@tripadvisor.com>
> wrote:
>
>> I’d look to see if you have compactions fronting the p99s.  If so, then
>> go back to looking at the I/O.  Disbelieve any metrics not captured at a
>> high resolution, like 100ms, for a time window around the compactions.  You
>> could be hitting I/O stalls where reads are blocked by the flushing of
>> writes.  It’s short-lived when it happens, and per-minute metrics won’t
>> provide breadcrumbs.
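>>
>> A rough way to catch that (the device name is only an example) is to leave
>> something like this running on one node and line it up afterwards with
>> 'nodetool compactionstats' output and the p99 spikes:
>>
>>     # timestamped extended device stats every second, appended to a file;
>>     # iostat won't go below 1s, but that's still far better than per-minute
>>     iostat -xmdt 1 nvme1n1 >> /var/tmp/iostat-nvme1n1.log
>>
>> Short bursts of high write await / queue depth that line up with memtable
>> flushes or compactions would support the I/O-stall theory.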
>>
>>
>>
>> *From: *Bill Walters <billwalter...@gmail.com>
>> *Date: *Monday, October 14, 2019 at 7:10 PM
>> *To: *<user@cassandra.apache.org>
>> *Subject: *Elevated response times from all nodes in a data center at
>> the same time.
>>
>>
>>
>> Hi Everyone,
>>
>>
>>
>> Need some suggestions regarding a peculiar issue we started facing in our
>> production cluster for the last couple of days.
>>
>>
>>
>> Here are our Production environment details.
>>
>>
>>
>> AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zone
>> in each region.
>>
>> No of Nodes: 24
>>
>> Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for
>> APIs and 2 OLAP Data centers for Analytics and Batch loads)
>>
>> Instance Types: r5.8x Large
>>
>> Average Node Size: 182 GB
>>
>> Work Load: Read heavy
>>
>> Read TPS: 22k
>>
>> Cassandra version: 3.0.15
>>
>> Java Version: JDK 181.
>>
>> EBS Volumes: GP2 with 1TB 3000 iops.
>>
>>
>>
>> 1. We have been running in production for more than one year and our
>> experience with Cassandra has been great. We have experienced little hiccups
>> here and there, but nothing severe.
>>
>>
>>
>> 2. But recently, for the past couple of days, we see a behavior where our
>> p99 latency in our AWS us-east-1 region OLTP data center suddenly starts
>> rising from 2 ms to 200 ms. It starts with one node, where we see the 99th
>> percentile read request latency in DataStax OpsCenter start increasing, and
>> it spreads immediately to the other nodes in the data center.
>>
>>
>>
>> 3. We do not see any read request timeouts or exceptions in our API Splunk
>> logs; only p99 and average latency go up suddenly.
>>
>>
>>
>> 4. We have investigated CPU level usage, Disk I/O, Memory usage and
>> Network parameters for the nodes during this period and we are not
>> experiencing any sudden surge in these parameters.
>>
>>
>>
>> 5. We set up a client using WhiteListPolicy to send queries to each of the 6
>> nodes to understand which one is bad, but we see all of them responding
>> with very high latency. It doesn't happen during our peak traffic period
>> sometime in the night.
>>
>>
>>
>> 6. We checked the system.log files on our nodes, took a thread dump and
>> checked for any rogue processes running on the nodes that might be stealing
>> CPU, but we found nothing.
>>
>>
>>
>> 7. We even checked the write requests coming in during this time and we do
>> not see any large batch operations happening.
>>
>>
>>
>> 8. Initially we tried restarting the nodes to see if the issue could be
>> mitigated, but it kept happening, and we had to fail over API traffic to the
>> us-west-2 region OLTP data center. After a couple of hours we failed back
>> and everything seems to be working.
>>
>>
>>
>> We are baffled by this behavior; the only correlation we find is the "Native
>> requests pending" in our task queues when this happens.
>>
>>
>>
>> Please let us know your suggestions on how to debug this issue. Has anyone
>> experienced an issue like this before? (We have had issues where one node
>> starts acting badly due to bad EBS volume I/O read and write times, but all
>> nodes experiencing an issue at the same time is very peculiar.)
>>
>>
>>
>> Thank You,
>>
>> Bill Walters.
>>
>
