It's possible the queries you're normally running are served out of the page cache, and during the latency spike you're hitting your disks. If you're using read-ahead, you might be hitting a throughput limit on the disks.
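
A quick way to sanity-check that during a spike (the device name below is just an example, use whatever actually backs your Cassandra data directory):

    # current readahead setting for the data volume, in 512-byte sectors
    sudo blockdev --getra /dev/nvme1n1

    # per-second extended disk stats; watch rMB/s and %util to see whether
    # reads are pushing up against the volume's throughput ceiling
    iostat -xm 1 30

gp2 also has a per-volume throughput cap (on the order of 250 MB/s), so heavy readahead can saturate the volume well before you run out of IOPS.
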
I've got some numbers and graphs I can share later when I'm not on my phone.

Jon

On Wed, Oct 16, 2019, 3:03 AM Alain RODRIGUEZ <arodr...@gmail.com> wrote:

> Hello Bill,
>
> I think it might be worth focusing the effort on diagnosing the issue properly in the first place, so I'll try to guide you through this.
>
> First, some comments on your environment:
>
>> AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zones in each region.
>> No of Nodes: 24
>> Data Centers: 4 (6 nodes in each data center, 2 OLTP data centers for APIs and 2 OLAP data centers for Analytics and Batch loads)
>> Instance Types: r5.8xlarge
>> Average Node Size: 182 GB
>> Work Load: Read heavy
>
> When I read this, I think "tune the garbage collection properly!". Do you see GC pauses being a problem? The easiest way to interpret GC logs is probably to upload them to https://gceasy.io. Ensure that the 'GC throughput' reported there is at least around 95% (ideally 98% or more); that would mean each node spends no more than 2 to 5% of its time in stop-the-world GC pauses. If that's a problem, we can help you set the GC options up a bit better than they probably are now. This post would then be a good starting point: https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>
>> Read TPS: 22k
>> Cassandra version: 3.0.15
>
> Reading this, I'd recommend an upgrade to the latest 3.0.x (3.0.18 at the time of writing) or, personal preference, 3.11.4. There have been a bunch of fixes since 3.0.15, and maybe you are hitting something that was already fixed. Check the changes there and see if any of them could do some good for your use case: https://github.com/apache/cassandra/blob/cassandra-3.0.18/CHANGES.txt#L1-L124
>
>> Java Version: JDK 181.
>> EBS Volumes: GP2 with 1TB 3000 iops.
>
> GP2 IOPS depend on the disk size. If you find at any time that the disks are not keeping up, a good way out could be to increase the disk size (despite the small dataset) to actually increase the disk IOPS and throughput. Now, this did not change recently and it was working for you until now, so you probably don't have to increase the disk size right away. Just be aware that 1 TB GP2 volumes are quite slow.
>
> About the issue:
>
>> our p99 latency in our AWS us-east-1 region OLTP data center suddenly starts rising from 2 ms to 200 ms. It starts with one node where we see the 99th percentile Read Request latency in DataStax OpsCenter start increasing. And it spreads immediately to all the other nodes in the data center.
>
> Well, this sounds quite bad. The first two things that come to my mind here are:
> - Are you reading tombstones? (Check the logs for tombstone warnings, trace a few queries.)
> - Are you reading a huge partition? (Check the max partition size, compare it to the mean, and ensure it stays below 1 GB, or ideally even 100 MB.)
>
> An inefficient read, for the reasons above or others, would not necessarily impact the nodes' resources but could definitely destroy performance for that query and for the following ones due to request congestion.
>
> To get a sense of the current tombstone level you can look at:
> - the logs (grep for "tombstone")
> - sstablemetadata, which gives you the percentage of droppable tombstones. This is only an estimate of the space that could be freed; it gives no information on whether tombstones are actually being read and hurting performance, but it gives you an idea of the tombstones generated by the workload.
> - query traces: either trace a manual query from cqlsh with 'TRACING ON;' and then send queries similar to the production ones, or use 'nodetool settraceprobability X' directly. /!\ Ensure X is really low to start with, like 0.001 or 0.0001; we probably don't need many traced queries to understand what happened, and tracing can inflict a big performance penalty on the Cassandra servers (each traced query induces a bunch of extra queries to persist the trace in the system_traces keyspace).
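>
> If you go the settraceprobability route, the whole sequence can stay small. A minimal sketch, run on one of the affected nodes (the 0.001 rate and the LIMIT are only examples, and the session id is a placeholder):
>
>     nodetool settraceprobability 0.001
>     # let it run for a few minutes while latency is high, then look at the
>     # recorded sessions and their durations (in microseconds):
>     cqlsh -e "SELECT session_id, duration, started_at FROM system_traces.sessions LIMIT 50;"
>     # drill into one suspicious session:
>     cqlsh -e "SELECT source_elapsed, activity FROM system_traces.events WHERE session_id = <session_id from above>;"
>     # and switch tracing back off:
>     nodetool settraceprobability 0
>
> In the per-session events, messages along the lines of "Read X live rows and Y tombstone cells", or a read touching many SSTables, are usually the smoking gun.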
>
>> We do not see any Read request timeouts or Exceptions in our API Splunk logs, only p99 and average latency go up suddenly.
>
> What value do you use for timeouts? Also, do you see any other exceptions/timeouts anywhere else than for reads?
> What are the results of:
>
> - nodetool tablestats (I think this would gather what you need to check --> nodetool tablestats | grep -e Keyspace -e Table: -e latency -e partition -e tombstones)
> - watch -d nodetool tpstats (here, look for any pending threads constantly higher than 0, and for any blocked or dropped threads)
>
>> We have investigated CPU usage, Disk I/O, Memory usage and Network parameters for the nodes during this period and we are not experiencing any sudden surge in these parameters.
>
> If the resources are fine, there is a bottleneck within Cassandra that we need to find. The commands above aim at exactly that: finding Cassandra's bottleneck, assuming the machines can handle more load.
>
>> We set up a client using WhiteListPolicy to send queries to each of the 6 nodes to understand which one is bad, but we see all of them responding with very high latency. It doesn't happen during our peak traffic period sometime in the night.
>
> This brings something else to my mind. The fact that latency goes down when traffic goes up may simply mean that the queries sent during the traffic spike are really efficient; even if some other queries remain slow (including during peak hours), having that many efficient/quick requests lowers the average/pXX latencies. Does the max latency change over time?
>
> You can try to get a sense of this with:
>
> - nodetool proxyhistograms
> - nodetool tablehistograms <keyspace> <table> # for the table-level stats
>
>> We checked the system.log files on our nodes, took a thread dump and checked for any rogue processes running on the nodes that might be stealing CPU, but we were able to find nothing.
>
> From what I read/understand, the resources are fine, so I would put these searches aside for now. For the log files, I like to use:
>
> - tail -fn 100 /var/log/cassandra/system.log # see the current logs (if you are having the issue NOW)
> - grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log # to check what happened and what went wrong
>
> For now I can't think of anything else. I hope some of these ideas will help you diagnose the problem; once it is diagnosed, we should be able to reason about how we can fix it.
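>
> One more concrete example, for the sstablemetadata check mentioned earlier: a rough loop over one table's SSTables could look like this (the path is only an example, adjust it to your data directory, keyspace and table):
>
>     for f in /var/lib/cassandra/data/<keyspace>/<table>-*/*-Data.db; do
>         echo "== $f"
>         sstablemetadata "$f" | grep -i droppable
>     done
>
> It only reads SSTable metadata, but I would still run it outside the latency window, and keep in mind that the droppable tombstones figure is an estimate, not proof that your reads are hitting those tombstones.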
>
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Tue, Oct 15, 2019 at 17:26, Reid Pinchback <rpinchb...@tripadvisor.com> wrote:
>
>> I'd look to see if you have compactions fronting the p99s. If so, then go back to looking at the I/O. Disbelieve any metrics not captured at a high resolution for a time window around the compactions, like 100 ms. You could be hitting I/O stalls where reads are blocked by the flushing of writes. It's short-lived when it happens, and per-minute metrics won't provide breadcrumbs.
>>
>> From: Bill Walters <billwalter...@gmail.com>
>> Date: Monday, October 14, 2019 at 7:10 PM
>> To: <user@cassandra.apache.org>
>> Subject: Elevated response times from all nodes in a data center at the same time.
>>
>> Hi Everyone,
>>
>> Need some suggestions regarding a peculiar issue we started facing in our production cluster over the last couple of days.
>>
>> Here are our production environment details.
>>
>> AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zones in each region.
>> No of Nodes: 24
>> Data Centers: 4 (6 nodes in each data center, 2 OLTP data centers for APIs and 2 OLAP data centers for Analytics and Batch loads)
>> Instance Types: r5.8xlarge
>> Average Node Size: 182 GB
>> Work Load: Read heavy
>> Read TPS: 22k
>> Cassandra version: 3.0.15
>> Java Version: JDK 181.
>> EBS Volumes: GP2 with 1TB 3000 iops.
>>
>> 1. We have been running in production for more than one year and our experience with Cassandra has been great. We have experienced little hiccups here and there, but nothing severe.
>>
>> 2. But recently, for the past couple of days, we see a behavior where the p99 latency in our AWS us-east-1 region OLTP data center suddenly starts rising from 2 ms to 200 ms. It starts with one node where we see the 99th percentile Read Request latency in DataStax OpsCenter start increasing. And it spreads immediately to all the other nodes in the data center.
>>
>> 3. We do not see any Read request timeouts or Exceptions in our API Splunk logs, only p99 and average latency go up suddenly.
>>
>> 4. We have investigated CPU usage, Disk I/O, Memory usage and Network parameters for the nodes during this period and we are not experiencing any sudden surge in these parameters.
>>
>> 5. We set up a client using WhiteListPolicy to send queries to each of the 6 nodes to understand which one is bad, but we see all of them responding with very high latency. It doesn't happen during our peak traffic period sometime in the night.
>>
>> 6. We checked the system.log files on our nodes, took a thread dump and checked for any rogue processes running on the nodes that might be stealing CPU, but we were able to find nothing.
>>
>> 7. We even checked the write requests coming in during this time and we do not see any large batch operations happening.
>>
>> 8. Initially we tried restarting the nodes to see if the issue could be mitigated, but it kept happening, and we had to fail over API traffic to the us-west-2 region OLTP data center. After a couple of hours we failed back and everything seems to be working.
>>
>> We are baffled by this behavior; the only correlation we find is the "Native requests pending" in our task queues when this happens.
>>
>> Please let us know your suggestions on how to debug this issue.
>> Has anyone experienced an issue like this before? (We had issues where one node starts acting badly due to bad EBS volume I/O read and write times, but all nodes experiencing an issue at the same time is very peculiar.)
>>
>> Thank You,
>> Bill Walters.