Hello,

As part of the final stages of our 2.2 --> 3.11 upgrades, one of our clusters (on AWS, 18 nodes, m4.2xlarge) has given us some post-upgrade trouble. We started getting spikes of Cassandra read and write timeouts even though the overall traffic volumes were unchanged. As part of the upgrade process there was also a TWCS table where we used a facade implementation to change the namespace of the compaction class, but that table has very low query volume.
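For reference, the end state of that compaction change was roughly the statement below; the keyspace/table names and window settings here are placeholders rather than our real schema:

    ALTER TABLE my_keyspace.my_twcs_table
      WITH compaction = {
        'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
      };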
The DigestMismatchException error messages (based on sampling the hash keys and finding which tables have partitions for those keys) seem to be occurring on the heaviest-volume table (approximately 4,000 reads and 1,600 writes per second per node), and that table has medium-width rows with about 10-40 column keys (or at least the digest-mismatch partitions have that kind of width). The keyspace is RF3 using NetworkTopology, and the CL is QUORUM for both reads and writes. We have seen the DigestMismatchException errors on all 3 of the production clusters we have upgraded (all of them single DC, in the us-east-1 / eu-west-1 / ap-northeast-2 AWS regions), and in all three cases those errors were not present on either the 2.1.x or 2.2.x versions of Cassandra.

Does anyone know of changes from 2.2 to 3.11 that would produce additional timeout problems, such as heavier blocking read-repair logic?

Also, we ran repairs (via Reaper v1.4.8, which is much nicer in 3.11 than in 2.1) on all of the tables and across all of the nodes, and our timeouts seem to have disappeared, but we continue to see a rapid stream of digest mismatch exceptions, so much so that our Cassandra debug logs are rolling over every 15 minutes. There is a mailing list post from 2018 indicating that some DigestMismatchException messages are natural if you are reading while writing, but the sheer volume we are getting is very concerning:
 - https://www.mail-archive.com/user@cassandra.apache.org/msg56078.html

Is that level of DigestMismatchException unusual? Or can that volume of mismatches appear simply because semi-wide rows require a lot of resolution, i.e. flurries of QUORUM reads/writes (RF3) on recent partitions have a decent chance that the replicas being read are not yet fully in sync? Does the digest mismatch get debug-logged on every chance read repair (see also the P.S. below)? And why are these DigestMismatchException errors only occurring once the upgrade to 3.11 has happened?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sample DigestMismatchException error message:

DEBUG [ReadRepairStage:13] 2019-11-22 01:38:14,448 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-6492169518344121155, 66306139353831322d323064382d313037322d663965632d636565663165326563303965) (be2c0feaa60d99c388f9d273fdc360f7 vs 09eaded2d69cf2dd49718076edf56b36)
        at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) [apache-cassandra-3.11.4.jar:3.11.4]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]

Cluster(s) setup:
* AWS region: eu-west-1:
  - Nodes: 18
  - single DC
  - keyspace: RF3 using NetworkTopology
* AWS region: us-east-1:
  - Nodes: 20
  - single DC
  - keyspace: RF3 using NetworkTopology
* AWS region: ap-northeast-2:
  - Nodes: 30
  - single DC
  - keyspace: RF3 using NetworkTopology

Thanks for any insight into this issue.

--
Colleen Velo
email: cmv...@gmail.com
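P.S. In case it is relevant to the read-repair questions above, this is the kind of cqlsh query we are using to double-check the per-table read-repair knobs on the hot table; the keyspace and table names below are placeholders rather than our real schema:

    SELECT read_repair_chance, dclocal_read_repair_chance, speculative_retry
    FROM system_schema.tables
    WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_hot_table';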