Re: Hanging repairs in Cassandra

2022-01-30 Thread Bowen Song
The default is set to 2 hours for good reasons. First, only stateful firewalls and NAT gateways care about timing out idle TCP connections. There are no such devices in the core Internet routing infrastructure. Those devices are only found very close to the endpoints on both sides of the TCP
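[For context, the Linux kernel parameters under discussion can be inspected and lowered as sketched below; the 300-second value is an illustrative assumption chosen only to stay under a hypothetical 10-minute firewall idle timeout, not a recommendation made in the thread:

    # Show the current keepalive settings (defaults are typically 7200 / 75 / 9)
    sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

    # Start probing after 5 minutes of idle time, well before a 10-minute
    # firewall/NAT idle timeout would silently drop the connection
    sudo sysctl -w net.ipv4.tcp_keepalive_time=300

    # Persist the change across reboots
    echo 'net.ipv4.tcp_keepalive_time = 300' | sudo tee /etc/sysctl.d/99-tcp-keepalive.conf
    sudo sysctl --system

Note that keepalive probes only apply to sockets that enable SO_KEEPALIVE; lowering the kernel value does not affect connections that never request keepalive.]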

Re: Hanging repairs in Cassandra

2022-01-30 Thread Troels Arvin
Hello, Manish Khandelwal wrote: "The issue was *tcp_keepalive_time* has the default value (7200 seconds). So once the idle connection is broken by the firewall, the application (Cassandra node) was getting notified very late. Thus we were seeing one node sending merkle

Re: Hanging repairs in Cassandra

2022-01-25 Thread Bowen Song
TBH, 10 minutes is pretty low. That's more suitable for a web server than a database server. If it's easy to do, you may prefer to increase that on the firewall instead of tuning Cassandra. Cassandra won't be the only thing affected by it, and you may just save yourself some debugging time in

Re: Hanging repairs in Cassandra

2022-01-24 Thread manish khandelwal
The TCP aging value is 10 mins. So with 7200 seconds for tcp_keepalive_time, the node was going unresponsive. Is the TCP aging value too low, or is it reasonable? On Mon, Jan 24, 2022 at 11:32 PM Bowen Song wrote: > Is reconfiguring your firewall an option? A stateful firewall really > shouldn't remove a TCP

Re: Hanging repairs in Cassandra

2022-01-24 Thread Bowen Song
Is reconfiguring your firewall an option? A stateful firewall really shouldn't remove a TCP connection in such a short time, unless the number of connections is very large and they are generally short lived (which is often seen on web servers). On 24/01/2022 13:03, manish khandelwal wrote: Hi All Thanks

Re: Hanging repairs in Cassandra

2022-01-24 Thread manish khandelwal
Hi All, thanks for the suggestions. The issue was that *tcp_keepalive_time* has the default value (7200 seconds). So once the idle connection was broken by the firewall, the application (Cassandra node) was getting notified very late. Thus we were seeing one node sending its merkle tree and the other not

Re: Hanging repairs in Cassandra

2022-01-21 Thread C. Scott Andreas
Hi Manish, I understand this answer is non-specific and might not be the most helpful, but figured I’d mention — Cassandra 3.11.2 is nearly four years old and a large number of bugs in repair and other subsystems have been resolved in the time since. I’d recommend upgrading to the latest

Re: Hanging repairs in Cassandra

2022-01-21 Thread manish khandelwal
Hi All, after going through the system.logs, I still see that sometimes the merkle tree is not received from the remote DC nodes. Local DC nodes respond back as soon as they send. But in the case of remote DCs, it happens that one or two of them do not respond. There is a considerable time lag (15-16 minutes)

Re: Hanging repairs in Cassandra

2022-01-19 Thread manish khandelwal
We use nodetool repair -pr -full. We have scheduled these to run automatically. For us, too, it has been seamless on most of the clusters. This particular node is misbehaving for reasons unknown to me. As per your suggestion, I am going through the system.logs to find that unknown cause. Will keep you posted if
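[For illustration only, an automated schedule of the kind described above might look like the crontab entry below; the timing, staggering, and log path are hypothetical, as the thread does not describe the actual scheduling setup:

    # Full primary-range repair every Sunday at 02:00 on this node; in practice
    # each node's entry would be staggered so repairs do not overlap
    0 2 * * 0  /usr/bin/nodetool repair -pr -full >> /var/log/cassandra/repair-cron.log 2>&1]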

Re: Hanging repairs in Cassandra

2022-01-19 Thread Bowen Song
May I ask how you run the repair? Is it manually via the nodetool command line tool, or via a tool or script, such as Cassandra Reaper? If you are running the repairs manually, would you mind giving Cassandra Reaper a try? I have a fairly large cluster under my management, and last time I tried

Re: Hanging repairs in Cassandra

2022-01-18 Thread manish khandelwal
Agree with you on that. Just wanted to highlight that I am experiencing the same behavior. Regards Manish On Tue, Jan 18, 2022, 22:50 Bowen Song wrote: > The link was related to Cassandra 1.2, and it was 9 years ago. Cassandra > was full of bugs at that time, and it has improved a lot since

Re: Hanging repairs in Cassandra

2022-01-18 Thread Bowen Song
The link was related to Cassandra 1.2, and it was 9 years ago. Cassandra was full of bugs at that time, and it has improved a lot since then. For that reason, I would rather not compare the issue you have with a 9-year-old issue someone else had. On 18/01/2022 16:11, manish khandelwal

Re: Hanging repairs in Cassandra

2022-01-18 Thread manish khandelwal
I am not sure what is happening, but it has happened thrice. What happens is that the merkle trees are not received from the nodes of the other data center. I am getting an issue along similar lines to the one mentioned here

Re: Hanging repairs in Cassandra

2022-01-18 Thread Bowen Song
Keep reading the log on the initiator and on the node sending the merkle tree; does anything follow that? FYI, not all log entries have the repair ID in them, therefore please read the relevant logs in chronological order without filtering (e.g. "grep") on the repair ID. I'm sceptical that a network issue is
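[As a sketch of the read-in-chronological-order suggestion (the path and time window are assumptions based on the log excerpt quoted elsewhere in the thread), one could extract everything logged around the repair instead of grepping for the repair ID:

    # Print all log lines between 03:32 and 03:50 on 2022-01-14, so that messages
    # without the repair ID (GC pauses, gossip, streaming) are not filtered out;
    # if no line matches the end pattern, sed prints through to the end of the file
    sed -n '/2022-01-14 03:32/,/2022-01-14 03:50/p' /var/log/cassandra/system.log]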

Re: Hanging repairs in Cassandra

2022-01-18 Thread manish khandelwal
In the system logs, on the node where the repair was initiated, I see that the node has requested merkle trees from all nodes, including itself:

INFO [Repair#3:1] 2022-01-14 03:32:18,805 RepairJob.java:172 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Requesting merkle trees for tablename (to

Re: Hanging repairs in Cassandra

2022-01-18 Thread Bowen Song
The entry in the debug.log is not specific to a repair session, and it could also be caused by reasons other than a network connectivity issue, such as long STW GC pauses. I usually don't start troubleshooting an issue from the debug log, as it can be rather noisy. The system.log is a better