Hello,

We are running the following setup on AWS EC2:

Host system (AWS AMI): Ubuntu 14.04.4 LTS,
Linux <hostname> 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5
08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

The Cassandra process runs inside a Docker container.
The Docker image is based on Ubuntu 18.04.1 LTS.

Apache Cassandra 3.0.17, installed from .deb packages.

$ java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1ubuntu0.18.04.1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

We have a total of 36 nodes.  All are r4.large instances with 2 vCPUs
and ~15 GB RAM each.
On each instance we have:
- a 2TB gp2 SSD EBS volume for data and commit log,
- an 8GB gp2 SSD EBS volume for the system (root volume).

Non-default settings in cassandra.yaml:
num_tokens: 16
memtable_flush_writers: 1
concurrent_compactors: 1
endpoint_snitch: Ec2Snitch

JVM heap/stack size options: -Xms8G -Xmx8G -Xmn800M -Xss256k
Garbage collection: CMS with default settings.
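
For completeness, the effective JVM flags can be dumped from the running
process roughly like this (the pgrep pattern matches our setup and may
need adjusting):

$ ps -o args= -p "$(pgrep -f CassandraDaemon)" | tr ' ' '\n' | grep -E '^-X'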

We repair once a week using Cassandra Reaper: parallel, intensity 1, 64
segments per node.  The issue also happens outside of repair time.

The symptoms:
============

Sporadically a node becomes unavailable for a period of time ranging from
a few minutes to a few hours.  According to our analysis, and as pointed
out by the AWS support team, the unavailability is caused by exceptionally
high read bandwidth on the *root* EBS volume.  I repeat, on the root
volume, *not* on the data/commitlog volume.  Basically, the amount of IO
exceeds the instance's EBS bandwidth (~52MB/s) and all other network
communication becomes impossible.

The root volume contains the operating system, the Docker container with
OpenJDK and the Cassandra binaries, and the logs.
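
When an affected instance is still reachable, the per-device read
bandwidth split can be seen with a simple iostat snapshot like the one
below (the device names are just examples from our setup):

$ iostat -xm 60 2 | grep -E 'Device|xvda|xvdf'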

Most of the time, when this happens it is too late to SSH into the
instance to troubleshoot: it becomes completely unreachable within a very
short period of time.
Rebooting the affected instance brings it back to life.

Starting from the middle of last week we have seen this problem repeatedly,
1-3 times a day, affecting different instances in a seemingly random
fashion.  Most of the time it affects only one instance, but we've had one
incident where 9 nodes (3 in each of the 3 Availability Zones) were down
at the same time due to this exact issue.

We actually had the same issue on this cluster around 3 months ago (early
to mid September 2018).  At that time we were running on m4.xlarge
instances (4 vCPUs, 16GB RAM).

As a mitigation measure we migrated to r4.2xlarge instances.  We then saw
no issues for a few weeks, so we scaled down twice: first to r4.xlarge and
then to r4.large.  The last migration was completed before Nov 13th, and
no changes to the cluster or application have been made since then.

Now, after a few weeks, the issue has appeared again...

When we are not fast enough to react and reboot the affected instance, we
see that ultimately the Linux OOM killer kicks in and kills the java
process running Cassandra.  After that the instance becomes available
almost immediately.  This allows us to rule out the other processes
running in the background as potential offenders.

We routinely observe Memory.HeapMemoryUsage.used between 1GB and 6GB
and Memory.NonHeapMemoryUsage.used below 100MB, as reported by JMX (via
Jolokia).  At the same time, Committed_AS on each host is constantly
around 11-12GB, as reported by atop(1) and Prometheus.
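
For reference, the same numbers can be pulled ad hoc roughly like this
(the Jolokia port below is just the agent default; adjust as needed):

$ curl -s http://localhost:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage
$ curl -s http://localhost:8778/jolokia/read/java.lang:type=Memory/NonHeapMemoryUsage
$ grep Committed_AS /proc/meminfo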

We are running atop with a sampling interval of 60 seconds.  After the
fact we observe that the java process is the one responsible for most of
the disk activity during the unavailability period.  We also see kswapd0
high on the list from time to time; it always has 0K reads, but non-zero
write bandwidth.  There is no swap space defined on these instances, so it
is not really clear why kswapd0 appears near the top of the list at all
(measurement error?).
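
The atop data comes from the standard raw logs, recorded and replayed
roughly like this (paths and timestamps are only illustrative):

$ atop -w /var/log/atop/atop_20181203 60        # record, 60-second samples
$ atop -r /var/log/atop/atop_20181203 -b 14:00  # replay around an incident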

We also attempted to troubleshoot by running jstack, jmap and pmap against
the Cassandra process in the background every few minutes.  The idea was
to compare dumps taken before and during the unavailability, but that
hasn't led to any findings so far.  Ultimately we had to stop doing this
once we saw that jmap can itself become stuck, burning CPU cycles.  The
jmap output was not that useful anyway, and we fear that jstack might
exhibit the same behavior.  So, to avoid making the issue worse than it
currently is, we disabled this debug sampling.
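
For completeness, the sampling we had in place was roughly along these
lines (the exact paths, interval and pgrep pattern here are illustrative):

mkdir -p /var/tmp/diag
while true; do
    ts=$(date +%Y%m%dT%H%M%S)
    pid=$(pgrep -f CassandraDaemon)
    jstack "$pid"      > /var/tmp/diag/jstack.$ts 2>&1
    jmap -histo "$pid" > /var/tmp/diag/jmap.$ts   2>&1
    pmap -x "$pid"     > /var/tmp/diag/pmap.$ts   2>&1
    sleep 300
done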

Now to my questions:

1. Is there anything in Cassandra or in the JVM that could explain
suddenly reading from the non-data volume at such a high rate, for
prolonged periods of time, as described above?

2. Why does JVM heap utilization never reach the 8GB that we provide to
it?

3. Why is the committed virtual memory so much bigger than the sum of the
heap and off-heap memory reported by JMX?  To what can this difference be
attributed?  I've just visited a node at random and collected the "off
heap memory used" numbers reported by nodetool cfstats (see the example
command after these questions), and still I see only 2.6GB in total, while
Committed_AS is ~12GB.  Is there a more direct way to monitor the JVM's
off-heap memory usage?

4. The only Jira issue related to Linux OOM that we've found is this one:
https://issues.apache.org/jira/browse/CASSANDRA-13931  It might be related
to our OOM, but it still doesn't explain the unexpected IO anomalies.
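
Regarding question 3, the 2.6GB total was obtained by summing the
per-table cfstats numbers, roughly like this (the awk expression assumes
the values are reported in bytes):

$ nodetool cfstats | awk -F': ' \
      '/Off heap memory used .total./ {s += $2} END {print s/1024/1024/1024 " GB"}'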

I would really appreciate any hints / pointers / insights!  The more I
learn about this issue, the less I understand it...

Regards,
--
Alex
