Re: Elevated response times from all nodes in a data center at the same time.

Reid Pinchback Wed, 16 Oct 2019 10:16:57 -0700

Something else came to mind.

You’re on AWS.  You always have to keep noisy neighbor problems in the back of 
the mind when you aren’t running on bare metal.

Basically, either your usage pattern during these incidents is unchanged… or it 
is not unchanged.  If it is unchanged, and the problem happens, then something 
outside your usage is causing the problem.  Hence the noisy neighbor suspicion. 
 If the usage has changed during that time period, then all bets are off and 
you need to tune your cluster to be happy under more than one usage profile.

From: Jon Haddad <j...@jonhaddad.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, October 16, 2019 at 11:36 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Cc: Bill Walters <billwalter...@gmail.com>
Subject: Re: Elevated response times from all nodes in a data center at the 
same time.

Message from External Sender
It's possible the queries you're normally running are served out of page cache, 
and during the latency spike you're hitting your disks. If you're using read 
ahead you might be hitting a throughput limit on the disks.

I've got some numbers and graphs I can share later when I'm not on my phone.

Jon

On Wed, Oct 16, 2019, 3:03 AM Alain RODRIGUEZ 
<arodr...@gmail.com<mailto:arodr...@gmail.com>> wrote:
Hello Bill,

I think it might be worth it to focus the effort on diagnosing the issue 
properly in the first place, thus I'll try to guide you through this.

First some comments on your environment:

AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zone in each 
region.
No of Nodes: 24
Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for APIs and 
2 OLAP Data centers for Analytics and Batch loads)
Instance Types: r5.8x Large
Average Node Size: 182 GB
Work Load: Read heavy

When I read this, I think 'Tune the garbage collection properly!'. do you see 
GC pauses being a problem? The easiest way to interpret GC logs is probably to 
upload them there: 
https://gceasy.io<https://urldefense.proofpoint.com/v2/url?u=https-3A__gceasy.io&d=DwMFaQ&c=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA&r=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc&m=H2hSujsdARbPvmkMdQJPZ29Ha6qZZGndZxV4mz60j7g&s=E4fnrwsKmo92Cdh31ZAXdfZ5oqYTaPb8Y1jkx4Msz3Y&e=>.
 Ensure there that the 'GC throughput' is at least around 95+% (ideally 98+%). 
This would mean that 2 to 5% of the time each node stops to perform stop the 
world GCs. If that's a thing, we can help you setting GC options a bit nicer 
than what it is currently probably. That post would then probably be a good 
starting point: 
https://thelastpickle.com/blog/2018/04/11/gc-tuning.html<https://urldefense.proofpoint.com/v2/url?u=https-3A__thelastpickle.com_blog_2018_04_11_gc-2Dtuning.html&d=DwMFaQ&c=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA&r=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc&m=H2hSujsdARbPvmkMdQJPZ29Ha6qZZGndZxV4mz60j7g&s=iB2ahtsSVaS3FYtzoStWHEE7k3y5XWtwMHOWzk7zP0k&e=>

Read TPS: 22k
Cassandra version: 3.0.15

Reading this, I'd recommend an upgrade to the 3.0.latest (3.0.18 at the time 
being)  or (personal preference) 3.11.4. There was a bunch of fixes, maybe are 
you hitting something that was fixed already, check changes there, see if any 
change could do some good for your use case: 
https://github.com/apache/cassandra/blob/cassandra-3.0.18/CHANGES.txt#L1-L124<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_cassandra_blob_cassandra-2D3.0.18_CHANGES.txt-23L1-2DL124&d=DwMFaQ&c=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA&r=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc&m=H2hSujsdARbPvmkMdQJPZ29Ha6qZZGndZxV4mz60j7g&s=ffsBZLksIBOhVH4DE1SKKU51kQThXrbCEV6-STcLAiE&e=>

Java Version: JDK 181.
EBS Volumes: GP2 with 1TB 3000 iops.

The GP2 IOPS depends on the dis size. If you find out at anytime that disks are 
not keeping up, a good way out could be to increase the disk size (despite the 
small dataset) to actually increase the disk. IOPS & throughput. Now this did 
not recently change, and it was working for you until now, thus you don't have 
to Increase the disk size now probably. Just be aware that GP2 with 1 TB are 
quite slow.

About the issue:

our p99 latency in our AWS us-east-1 region OLTP data center, suddenly starts 
rising from 2 ms to 200 ms. It starts with one node where we see the 99th 
percentile Read Request latency in Datastax Opscenter starts increasing. And it 
spreads immediately, to all other 6 nodes in the data center.

Well, this sounds quite bad. The first 2 things coming to my mind here are:
- Are you reading tombstones? (check logs for tombstones, trace a few queries)
- Are you reading a huge partition? (check the max partition size, compare it 
to the mean and ensure it is remaining below 1 GB (or even 100 MB ideally).

An inefficient read, for the reasons above or other reasons, would not 
necessarily impact nodes' resources but could definitely destroy performances 
for this query and the following ones due to the 'requests congestion'.

To try to make a sense of the current tombstones level you can look at:
- logs (grep tombstone)
- sstablemetadata gives you the % of droppable tombstones. This is an estimate 
and of the space that could be freed, it gives no information on whether 
tombstones are being read and can affect performances or not, yet it gives you 
an idea of the tombstones that can be generated in the workflow
- Trace queries: either trace a manual query from cqlsh with 'TRACING ON;' then 
sending queries similar to prod ones. Or directly using 'nodetool 
settraceprobability X', /!\ ensure X is really low to start with - like 0.001 
or 0.0001 maybe, we probably don't need many queries to understand what 
happened and tracing might inflicts a big penalty to Cassandra servers in terms 
of performances (each of the traced queries will induce a bunch of queries to 
actually persist the trace In system_traces key space.

We do not see any Read request timeouts or Exception in the our API Splunk logs 
only p99 and average latency go up suddenly.

What's the value you use for timeouts? Also, any other exception/timeout, 
somewhere else than for reads?
What are the result of:

- nodetool tablestats (I think this would gather what you need to check --> 
nodetool tablestats | grep -e Keyspace -e Table: -e latency -e partition -e 
tombstones)
- watch -d nodetool tpstats (here look at any pending threads constantly higher 
than 0, any blocked or dropped threads)

We have investigated CPU level usage, Disk I/O, Memory usage and Network 
parameters for the nodes during this period and we are not experiencing any 
sudden surge in these parameters.

If the resources are fine, there is a bottleneck within Cassandra, that we need 
to find, the commands above aim at that, finding C*'s bottleneck, assuming 
machines can handle more load.

We setup client using WhiteListPolicy to send queries to each of the 6 nodes to 
understand which one is bad, but we see all of them responding with very high 
latency. It doesn't happen during our peak traffic period sometime in the night.

 This brings something else to my mind. The fact latency goes lower when there 
is a traffic increase, it can perfectly mean that each of the queries sent 
during the spike are really efficient, and despite you might still have some 
other queries being slow (even during peak hours), having that many 
'efficient/quick' requests, lowers the average/pXX latencies. Does the max 
latency change over time?

You can here try to get a sense of this with:

- nodetool proxyhistograms
- nodetool tablehistograms <keyspace> <table> # For the table level stats

We checked the system.log files on our nodes, took a thread dump and checked 
for any rouge processes running on the nodes which is stealing CPU but we are 
able to find nothing.

From what I read/understand, resource are fine, I would put these searches 
aside for now. About the log file, I like to use:

- tail -fn 100 /var/log/cassandra/system.log #See current logs (if you are 
having the issues NOW)
- grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log # to check what 
happened and was wrong

For now I can't think about anything else, I hope some of those ideas will help 
you diagnose the problem. Once it is diagnosed, we should be able to reason 
about how we can fix it.

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com<mailto:al...@thelastpickle.com>
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com<https://urldefense.proofpoint.com/v2/url?u=http-3A__www.thelastpickle.com&d=DwMFaQ&c=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA&r=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc&m=H2hSujsdARbPvmkMdQJPZ29Ha6qZZGndZxV4mz60j7g&s=kapaVkL0EZEzPVTpVk0GNWZFxxsL7aWHrZU-HJvK45I&e=>

Le mar. 15 oct. 2019 à 17:26, Reid Pinchback 
<rpinchb...@tripadvisor.com<mailto:rpinchb...@tripadvisor.com>> a écrit :
I’d look to see if you have compactions fronting the p99’s.  If so, then go 
back to looking at the I/O.  Disbelieve any metrics not captured at a high 
resolution for a time window around the compactions, like 100ms.  You could be 
hitting I/O stalls where reads are blocked by the flushing of writes.  It’s 
short-lived when it happens, and per-minute metrics won’t provide breadcrumbs.

From: Bill Walters <billwalter...@gmail.com<mailto:billwalter...@gmail.com>>
Date: Monday, October 14, 2019 at 7:10 PM
To: <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Elevated response times from all nodes in a data center at the same 
time.

Hi Everyone,

Need some suggestions regarding a peculiar issue we started facing in our 
production cluster for the last couple of days.

Here are our Production environment details.

AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zone in each 
region.
No of Nodes: 24
Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for APIs and 
2 OLAP Data centers for Analytics and Batch loads)
Instance Types: r5.8x Large
Average Node Size: 182 GB
Work Load: Read heavy
Read TPS: 22k
Cassandra version: 3.0.15
Java Version: JDK 181.
EBS Volumes: GP2 with 1TB 3000 iops.

1. We have been running in production for more than one year and our experience 
with Cassandra is great. Experienced little hiccups here and there but nothing 
severe.

2. But recently for the past couple of days we see a behavior where our p99 
latency in our AWS us-east-1 region OLTP data center, suddenly starts rising 
from 2 ms to 200 ms. It starts with one node where we see the 99th percentile 
Read Request latency in Datastax Opscenter starts increasing. And it spreads 
immediately, to all other 6 nodes in the data center.

3. We do not see any Read request timeouts or Exception in the our API Splunk 
logs only p99 and average latency go up suddenly.

4. We have investigated CPU level usage, Disk I/O, Memory usage and Network 
parameters for the nodes during this period and we are not experiencing any 
sudden surge in these parameters.

5. We setup client using WhiteListPolicy to send queries to each of the 6 nodes 
to understand which one is bad, but we see all of them responding with very 
high latency. It doesn't happen during our peak traffic period sometime in the 
night.

6. We checked the system.log files on our nodes, took a thread dump and checked 
for any rouge processes running on the nodes which is stealing CPU but we are 
able to find nothing.

7. We even checked our the write requests coming in during this time and we do 
not see any large batch operations happening.

8. Initially we tried restarting the nodes to see if the issue can be mitigated 
but it kept happening, and we had to fail over API traffic to us-west-2 region 
OLTP data center. After a couple of hours we failed back and everything seems 
to be working.

We are baffled by this behavior, only correlation we find is the "Native 
requests pending" in our Task queues when this happens.

Please let us know your suggestions on how to debug this issue. Has anyone 
experienced an issue like this before.(We had issues where one node starts 
acting bad due to bad EBS volume I/O read and write time, but all nodes 
experiencing an issue at same time is very peculiar)

Thank You,
Bill Walters.

Re: Elevated response times from all nodes in a data center at the same time.

Reply via email to