Re: Network problems during repair make it hang on "Wait for validation to complete"

Dmitry Simonov Thu, 21 Jun 2018 00:41:06 -0700

In my previous message I pasted source code from Cassandra 2.2.8 by
mistake.
I have re-checked against the 2.2.11 source, and the lines in question are
the same.
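
For reference, the tpstats check from the quoted message below can be wrapped
in a small helper. This is only a sketch: the function name busy_repair_pools
is made up for illustration, and it assumes the tpstats column order seen on
2.2.x (Pool Name, Active, Pending, Completed, Blocked, All time blocked).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `nodetool tpstats` output on stdin and prints
# every Repair# pool that still has active or pending tasks.
# Assumed column order: Pool Name, Active, Pending, Completed, Blocked,
# All time blocked.
busy_repair_pools() {
    awk '$1 ~ /^Repair#/ && ($2 > 0 || $3 > 0) {
        printf "%s active=%s pending=%s completed=%s\n", $1, $2, $3, $4
    }'
}

# Usage against a ccm cluster (node names as in the quoted message):
#   for i in {1..5}; do
#       echo "node$i"
#       ccm "node$i" nodetool tpstats | busy_repair_pools
#   done
```

A pool that is merely busy also shows up here, so the check has to be run
twice, a few minutes apart: if a pool is listed both times with the same
completed count, it is not making progress.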


2018-06-21 2:49 GMT+05:00 Dmitry Simonov <dimmobor...@gmail.com>:

> Hello!
>
> Using Cassandra 2.2.11, I observe behaviour that is very similar to
> https://issues.apache.org/jira/browse/CASSANDRA-12860
>
> Steps to reproduce:
> 1. Set up a cluster: ccm create five -v 2.2.11 && ccm populate -n 5 --vnodes && ccm start
> 2. Import a keyspace into it (approx. 50 MB of data)
> 3. Start repair on one node: ccm node2 nodetool repair KEYSPACE
> 4. While the repair is still running, disconnect node3: sudo iptables -I INPUT -p tcp -d 127.0.0.3 -j DROP
> 5. The repair hangs.
> 6. Restore network connectivity.
> 7. The repair is still hanging.
> 8. Subsequent repairs also hang.
>
> In tpstats I see tasks that make no progress:
>
> $ for i in {1..5}; do echo node$i; ccm node$i nodetool tpstats | grep "Repair#"; done
> node1
> Repair#1                          1      2255              1         0                 0
> node2
> Repair#1                          1      2335             26         0                 0
> node3
> node4
> Repair#3                          1       147           2175         0                 0
> node5
> Repair#1                          1      2335             17         0                 0
>
> In jconsole I see that Repair threads are blocked here:
>
> Name: Repair#1:1
> State: WAITING on com.google.common.util.concurrent.AbstractFuture$Sync@73c5ab7e
> Total blocked: 0  Total waited: 242
>
> Stack trace:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1371)
> org.apache.cassandra.repair.RepairJob.run(RepairJob.java:167)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>
>
> According to the source code, they are waiting for validations to complete:
>
> # ./apache-cassandra-2.2.8-src/src/java/org/apache/cassandra/repair/RepairJob.java
>  74     public void run()
>  75     {
> ...
> 166         // Wait for validation to complete
> 167         Futures.getUnchecked(validations);
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-11824 says that the problem
> was fixed in 2.2.7, but I am using 2.2.11.
>
> Restarting every Cassandra node that has hanging tasks (one by one) clears
> these tasks from tpstats. After that, repairs work well until the next
> network problem.
>
> I also suspect that long GC pauses on one node during repair, like network
> issues, may lead to the same problem.
>
> Is this a known issue?
>
> --
> Best Regards,
> Dmitry Simonov
>



-- 
Best Regards,
Dmitry Simonov
