Hello,

As part of the final stages of our 2.2 --> 3.11 upgrades, one of our clusters (on AWS, 18 nodes, m4.2xlarge) has given us some post-upgrade trouble. We started getting spikes of Cassandra read and write timeouts even though the overall traffic volumes were unchanged. As part of the upgrade process there was also a TWCS table where we used a facade implementation to change the namespace of the compaction class, but that table has very low query volume.
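For reference, the end state of that compaction change was roughly the statement below; the keyspace/table names and window settings here are placeholders rather than our real schema:

    ALTER TABLE my_keyspace.my_twcs_table
      WITH compaction = {
        'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
      };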
The DigestMismatchException error messages (based on sampling the hash keys and finding which tables have partitions for those keys) seem to be occurring on the heaviest-volume table (approximately 4,000 reads and 1,600 writes per second per node), and that table has medium-width rows with about 10-40 column keys (or at least the digest-mismatch partitions have that kind of width). The keyspace is RF3 using NetworkTopology, and the CL is QUORUM for both reads and writes. We have seen the DigestMismatchException errors on all 3 of the production clusters we have upgraded (all of them single DC, in the us-east-1 / eu-west-1 / ap-northeast-2 AWS regions), and in all three cases those errors were not present on either the 2.1.x or 2.2.x versions of Cassandra.

Does anyone know of changes from 2.2 to 3.11 that would produce additional timeout problems, such as heavier blocking read-repair logic?

Also, we ran repairs (via Reaper v1.4.8, which is much nicer in 3.11 than in 2.1) on all of the tables and across all of the nodes, and our timeouts seem to have disappeared, but we continue to see a rapid stream of digest mismatch exceptions, so much so that our Cassandra debug logs are rolling over every 15 minutes. There is a mailing list post from 2018 indicating that some DigestMismatchException messages are natural if you are reading while writing, but the sheer volume we are getting is very concerning:
 - https://www.mail-archive.com/user@cassandra.apache.org/msg56078.html

Is that level of DigestMismatchException unusual? Or can that volume of mismatches appear simply because semi-wide rows require a lot of resolution, i.e. flurries of QUORUM reads/writes (RF3) on recent partitions have a decent chance that the replicas being read are not yet fully in sync? Does the digest mismatch get debug-logged on every chance read repair (see also the P.S. below)? And why are these DigestMismatchException errors only occurring once the upgrade to 3.11 has happened?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sample DigestMismatchException error message:

DEBUG [ReadRepairStage:13] 2019-11-22 01:38:14,448 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-6492169518344121155, 66306139353831322d323064382d313037322d663965632d636565663165326563303965) (be2c0feaa60d99c388f9d273fdc360f7 vs 09eaded2d69cf2dd49718076edf56b36)
        at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) [apache-cassandra-3.11.4.jar:3.11.4]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]

Cluster(s) setup:
* AWS region: eu-west-1:
  - Nodes: 18
  - single DC
  - keyspace: RF3 using NetworkTopology
* AWS region: us-east-1:
  - Nodes: 20
  - single DC
  - keyspace: RF3 using NetworkTopology
* AWS region: ap-northeast-2:
  - Nodes: 30
  - single DC
  - keyspace: RF3 using NetworkTopology

Thanks for any insight into this issue.

--
Colleen Velo
email: cmv...@gmail.com
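P.S. In case it is relevant to the read-repair questions above, this is the kind of cqlsh query we are using to double-check the per-table read-repair knobs on the hot table; the keyspace and table names below are placeholders rather than our real schema:

    SELECT read_repair_chance, dclocal_read_repair_chance, speculative_retry
    FROM system_schema.tables
    WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_hot_table';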