Thanks for the responses. @Prem yes this is after the entire cluster is on 3.11, but no I did not run upgradesstables yet.
@Thomas no I don't see any major GC going on. @Jeff yeah it's fully upgraded. I decided to shut the whole thing down and bring it back (thankfully this cluster is not serving live traffic). The nodes seemed okay for an hour or two, but I see the issue again, without me bouncing any nodes. This time it's ReadStage that's building up, and the exception I'm seeing in the logs is: DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242 - Digest mismatch: org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(6150926370328526396, 696a6374652e6f7267) (2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025) at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.11.0.jar:3.11.0] at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.11.0.jar:3.11.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_71] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_71] at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) [apache-cassandra-3.11.0.jar:3.11.0] at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_71] Do you think running upgradesstables would help? Or relocatesstables? I presumed it shouldn't be necessary for Cassandra to function, just an optimization. On Thu, Sep 28, 2017 at 12:49 PM, Steinmaurer, Thomas < thomas.steinmau...@dynatrace.com> wrote: > Dan, > > > > do you see any major GC? We have been hit by the following memory leak in > our loadtest environment with 3.11.0. > > https://issues.apache.org/jira/browse/CASSANDRA-13754 > > > > So, depending on the heap size and uptime, you might get into heap > troubles. > > > > Thomas > > > > *From:* Dan Kinder [mailto:dkin...@turnitin.com] > *Sent:* Donnerstag, 28. September 2017 18:20 > *To:* user@cassandra.apache.org > *Subject:* > > > > Hi, > > I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the > following. The cluster does function, for a while, but then some stages > begin to back up and the node does not recover and does not drain the > tasks, even under no load. This happens both to MutationStage and > GossipStage. > > I do see the following exception happen in the logs: > > > > ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 > CassandraDaemon.java:228 - Exception in thread > Thread[ReadRepairStage:2328,5,main] > > org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out > - received only 1 responses. > > at org.apache.cassandra.service.DataResolver$ > RepairMergeListener.close(DataResolver.java:171) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at org.apache.cassandra.db.partitions. > UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ~[na:1.8.0_91] > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > ~[na:1.8.0_91] > > at org.apache.cassandra.concurrent.NamedThreadFactory. > lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91] > > > > But it's hard to correlate precisely with things going bad. It is also > very strange to me since I have both read_repair_chance and > dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is > confusing why ReadRepairStage would err. > > Anyone have thoughts on this? It's pretty muddling, and causes nodes to > lock up. Once it happens Cassandra can't even shut down, I have to kill -9. > If I can't find a resolution I'm going to need to downgrade and restore to > backup... > > The only issue I found that looked similar is https://issues.apache.org/ > jira/browse/CASSANDRA-12689 but that appears to be fixed by 3.10. > > > > $ nodetool tpstats > > Pool Name Active Pending Completed > Blocked All time blocked > > ReadStage 0 0 582103 0 > 0 > > MiscStage 0 0 0 0 > 0 > > CompactionExecutor 11 11 2868 0 > 0 > > MutationStage 32 4593678 55057393 0 > 0 > > GossipStage 1 2818 371487 0 > 0 > > RequestResponseStage 0 0 4345522 0 > 0 > > ReadRepairStage 0 0 151473 0 > 0 > > CounterMutationStage 0 0 0 0 > 0 > > MemtableFlushWriter 1 81 76 0 > 0 > > MemtablePostFlush 1 382 139 0 > 0 > > ValidationExecutor 0 0 0 0 > 0 > > ViewMutationStage 0 0 0 0 > 0 > > CacheCleanupExecutor 0 0 0 0 > 0 > > PerDiskMemtableFlushWriter_10 0 0 69 0 > 0 > > PerDiskMemtableFlushWriter_11 0 0 69 0 > 0 > > MemtableReclaimMemory 0 0 81 0 > 0 > > PendingRangeCalculator 0 0 32 0 > 0 > > SecondaryIndexManagement 0 0 0 0 > 0 > > HintsDispatcher 0 0 596 0 > 0 > > PerDiskMemtableFlushWriter_1 0 0 69 0 > 0 > > Native-Transport-Requests 11 0 4547746 > 0 67 > > PerDiskMemtableFlushWriter_2 0 0 69 0 > 0 > > MigrationStage 1 1545 586 0 > 0 > > PerDiskMemtableFlushWriter_0 0 0 80 0 > 0 > > Sampler 0 0 0 0 > 0 > > PerDiskMemtableFlushWriter_5 0 0 69 0 > 0 > > InternalResponseStage 0 0 45432 0 > 0 > > PerDiskMemtableFlushWriter_6 0 0 69 0 > 0 > > PerDiskMemtableFlushWriter_3 0 0 69 0 > 0 > > PerDiskMemtableFlushWriter_4 0 0 69 0 > 0 > > PerDiskMemtableFlushWriter_9 0 0 69 0 > 0 > > AntiEntropyStage 0 0 0 0 > 0 > > PerDiskMemtableFlushWriter_7 0 0 69 0 > 0 > > PerDiskMemtableFlushWriter_8 0 0 69 0 > 0 > > > > Message type Dropped > > READ 0 > > RANGE_SLICE 0 > > _TRACE 0 > > HINT 0 > > MUTATION 0 > > COUNTER_MUTATION 0 > > BATCH_STORE 0 > > BATCH_REMOVE 0 > > REQUEST_RESPONSE 0 > > PAGED_RANGE 0 > > READ_REPAIR 0 > > > > -dan > The contents of this e-mail are intended for the named addressee only. It > contains information that may be confidential. Unless you are the named > addressee or an authorized designee, you may not copy or use it, or > disclose it to anyone else. If you received it in error please notify us > immediately and then destroy it. Dynatrace Austria GmbH (registration > number FN 91482h) is a company registered in Linz whose registered office > is at 4040 Linz, Austria, Freistädterstraße 313 > <https://maps.google.com/?q=4040+Linz,+Austria,+Freist%C3%A4dterstra%C3%9Fe+313&entry=gmail&source=g> > -- Dan Kinder Principal Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com