In the previous message, I pasted source code from Cassandra 2.2.8 by mistake. I have re-checked against the 2.2.11 source: these lines are the same.
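
To illustrate the kind of wait shown at RepairJob.java:167 in the quoted trace: Guava's Futures.getUnchecked waits uninterruptibly and without a timeout, so if the awaited future is never completed, the calling thread stays parked. Below is a minimal sketch of a bounded wait, for comparison only. It is not Cassandra code and not a proposed patch; it uses the JDK's CompletableFuture as a stand-in for Guava's ListenableFuture, and the class name BoundedWaitSketch, the method awaitOrGiveUp and the 200 ms timeout are hypothetical names and values chosen for illustration.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWaitSketch {

    /**
     * Waits for the future for at most timeoutMillis milliseconds.
     * Returns true if it completed normally within that time, and
     * false on timeout, failure, cancellation or interruption.
     * (Hypothetical helper, not part of Cassandra or Guava.)
     */
    public static boolean awaitOrGiveUp(CompletableFuture<?> future, long timeoutMillis) {
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Nobody completed the future in time: cancel it instead of parking forever.
            future.cancel(true);
            return false;
        } catch (ExecutionException | CancellationException e) {
            // The awaited task failed or had already been cancelled.
            return false;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a validation whose response never arrives.
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        System.out.println(awaitOrGiveUp(neverCompleted, 200)); // prints false

        // Stand-in for a validation that has already finished.
        CompletableFuture<String> done = CompletableFuture.completedFuture("trees");
        System.out.println(awaitOrGiveUp(done, 200)); // prints true
    }
}
```

With an unbounded get() in place of the timed one, the first call in main would never return, which matches the WAITING state of the Repair threads in the trace.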

2018-06-21 2:49 GMT+05:00 Dmitry Simonov <dimmobor...@gmail.com>:

> Hello!
>
> Using Cassandra 2.2.11, I observe behaviour, that is very similar to
> https://issues.apache.org/jira/browse/CASSANDRA-12860
>
> Steps to reproduce:
> 1. Set up a cluster: ccm create five -v 2.2.11 && ccm populate -n 5 --vnodes && ccm start
> 2. Import some keyspace into it (approx 50 Mb of data)
> 3. Start repair on one node: ccm node2 nodetool repair KEYSPACE
> 4. While repair is still running, disconnect node3: sudo iptables -I INPUT -p tcp -d 127.0.0.3 -j DROP
> 5. This repair hangs.
> 6. Restore network connectivity
> 7. Repair is still hanging.
> 8. Following repairs will also hang.
>
> In tpstats I see tasks that make no progress:
>
> $ for i in {1..5}; do echo node$i; ccm node$i nodetool tpstats | grep "Repair#"; done
> node1
> Repair#1    1    2255    1       0    0
> node2
> Repair#1    1    2335    26      0    0
> node3
> node4
> Repair#3    1    147     2175    0    0
> node5
> Repair#1    1    2335    17      0    0
>
> In jconsole I see that Repair threads are blocked here:
>
> Name: Repair#1:1
> State: WAITING on com.google.common.util.concurrent.AbstractFuture$Sync@73c5ab7e
> Total blocked: 0 Total waited: 242
>
> Stack trace:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1371)
> org.apache.cassandra.repair.RepairJob.run(RepairJob.java:167)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>
>
> According to the source code, they are waiting for validations to complete:
>
> # ./apache-cassandra-2.2.8-src/src/java/org/apache/cassandra/repair/RepairJob.java
>  74     public void run()
>  75     {
> ...
> 166         // Wait for validation to complete
> 167         Futures.getUnchecked(validations);
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-11824 says that problem
> was fixed in 2.2.7, but I use 2.2.11.
>
> Restart of all Cassandra nodes that have hanging tasks (one-by-one) allows
> these tasks to disappear from tpstats. After that repairs work well (until
> next network problem).
>
> I also suppose that long GC times on one node (as well as network issues)
> during repair may also lead to the same problem.
>
> Is it a known issue?
>
> --
> Best Regards,
> Dmitry Simonov
>

--
Best Regards,
Dmitry Simonov