Hi, we're on cassandra 3.11.2, and we're having some issues with repairs. They take ages to complete, and some time ago the incremental repair stopped working - that is, SSTables are not being marked as repaired, even though the repair reports success.
Running a full or incremental repair does not make any difference. Here's a log of a typical repair (omitted a lot of 'Maximum memory usage' messages): INFO [Repair-Task-12] 2018-06-04 06:29:50,396 RepairRunnable.java:139 - Starting repair command #11 (af1aefc0-67c0-11e8-b07c-c365701888e8), repairing keyspace prod with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [event], dataCenters: [DC1], hosts: [], # of ranges: 1280, pull repair: false) INFO [Repair-Task-12] 2018-06-04 06:29:51,497 RepairSession.java:228 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] new session: will sync /14.0.53.234, /14.0.52.115 on range [...] for asm_log.[event] INFO [Repair#11:1] 2018-06-04 06:29:51,776 RepairJob.java:169 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Requesting merkle trees for event (to [/14.0.52.115, /14.0.53.234]) INFO [ValidationExecutor:10] 2018-06-04 06:31:13,859 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-06-04 06:32:01,385 NoSpamLogger.java:94 - Out of 14 commit log syncs over the past 134.02s with average duration of 34.90ms, 2 have exceeded the configured commit interval by an average of 60.66ms ... INFO [ValidationExecutor:10] 2018-06-04 13:31:19,011 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB INFO [AntiEntropyStage:1] 2018-06-04 13:37:17,357 RepairSession.java:180 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event from /14.0.52.115 INFO [ValidationExecutor:10] 2018-06-04 13:46:19,281 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB INFO [IndexSummaryManager:1] 2018-06-04 13:57:18,772 IndexSummaryRedistribution.java:76 - Redistributing index summaries INFO [AntiEntropyStage:1] 2018-06-04 13:58:21,971 RepairSession.java:180 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event from /14.0.53.234 INFO [RepairJobTask:4] 2018-06-04 13:58:39,780 SyncTask.java:73 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Endpoints /14.0.52.115 and /14.0.53.234 have 15406 range(s) out of sync for event INFO [RepairJobTask:4] 2018-06-04 13:58:39,781 LocalSyncTask.java:71 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Performing streaming repair of 15406 ranges with /14.0.52.115 INFO [RepairJobTask:4] 2018-06-04 13:59:49,075 StreamResultFuture.java:90 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8] Executing streaming plan for Repair INFO [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,076 StreamSession.java:266 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8] Starting streaming to /14.0.52.115 INFO [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,089 StreamCoordinator.java:264 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8, ID#0] Beginning stream session with /14.0.52.115 INFO [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:01:14,423 StreamResultFuture.java:173 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8 ID#0] Prepare completed. Receiving 321 files(6.238GiB), sending 318 files(6.209GiB) WARN [Service Thread] 2018-06-04 14:12:15,578 GCInspector.java:282 - ConcurrentMarkSweep GC in 4095ms. CMS Old Gen: 4086661264 -> 1107272664; Par Eden Space: 503316480 -> 0; Par Survivor Space: 21541464 -> 0 ... WARN [GossipTasks:1] 2018-06-04 14:12:15,677 FailureDetector.java:288 - Not marking nodes down due to local pause of 5123793157 > 5000000000 INFO [ScheduledTasks:1] 2018-06-04 14:12:20,611 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log) INFO [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,188 StreamResultFuture.java:187 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8] Session with /14.0.52.115 is complete INFO [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,190 StreamResultFuture.java:219 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8] All sessions completed INFO [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,190 LocalSyncTask.java:121 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Sync complete using session afc2ef90-67c0-11e8-b07c-c365701888e8 between /14.0.52.115 and /14.0.53.234 on event INFO [RepairJobTask:5] 2018-06-04 14:14:29,191 RepairJob.java:143 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] event is fully synced INFO [RepairJobTask:5] 2018-06-04 14:14:29,193 RepairSession.java:270 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Session completed successfully Tablestats: SSTable count: 714 Space used (live): 489416489322 Space used (total): 489416489322 Space used by snapshots (total): 0 Off heap memory used (total): 311166982 SSTable Compression Ratio: 0.28468835622230676 Number of partitions (estimate): 22511321 Memtable cell count: 6635 Memtable data size: 3236775 Memtable off heap memory used: 0 Memtable switch count: 1649 ... Percent repaired: 85.62 Bloom filter false positives: 36849 Bloom filter false ratio: 0.00024 Bloom filter space used: 87092784 Bloom filter off heap memory used: 87061568 Index summary off heap memory used: 15474806 Compression metadata off heap memory used: 208630608 Compacted partition minimum bytes: 61 Compacted partition maximum bytes: 107964792 Compacted partition mean bytes: 35262 Average live cells per slice (last five minutes): 301.46281597264624 Maximum live cells per slice (last five minutes): 3311 Average tombstones per slice (last five minutes): 1.0 Maximum tombstones per slice (last five minutes): 1 Dropped Mutations: 0 Histograms: Percentile SSTables Write Latency Read Latency Partition Size Cell Count (micros) (micros) (bytes) 50% 2.00 24.60 3379.39 3311 310 75% 2.00 29.52 8409.01 51012 5722 95% 2.00 35.43 30130.99 105778 9887 98% 2.00 42.51 43388.63 263210 35425 99% 2.00 42.51 62479.63 263210 35425 Min 0.00 9.89 182.79 61 0 Max 3.00 263.21 155469.30 107964792 10090808 Thanks for any help! Martin --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For additional commands, e-mail: user-h...@cassandra.apache.org