The default is set to 2 hours for good reasons.
First, only stateful firewalls and NAT gateways care about timing out idle
TCP connections. There are no such devices in the core Internet routing
infrastructure. Those devices are only found very close to the endpoints
on both sides of the TCP
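For reference, the kernel knobs being discussed can be inspected like this
(a minimal sketch; the values in the comments are the usual Linux defaults):

    sysctl net.ipv4.tcp_keepalive_time     # 7200 s (2 hours) of idle before the first probe
    sysctl net.ipv4.tcp_keepalive_intvl    # 75 s between probes
    sysctl net.ipv4.tcp_keepalive_probes   # 9 unanswered probes before the socket is closed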
Hello,
Manish Khandelwal wrote:
The issue was that *tcp_keepalive_time* has the default value
(7200 seconds). So once the idle connection is broken by the
firewall, the application (Cassandra node) was getting notified
very late. Thus we were seeing one node sending the merkle
TBH, 10 minutes is pretty low. That's more suitable for a web server
than a database server. If it's easy to do, you may prefer to increase
that on the firewall instead of tuning Cassandra. Cassandra won't be the
only thing affected by it, and you may just save yourself some debugging
time in
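If the firewall happens to be a Linux netfilter/conntrack based one (an
assumption, not something stated in this thread), the idle timeout for
established TCP flows is a single sysctl, e.g.:

    # Assumption: Linux conntrack-based firewall; the default is 432000 s (5 days)
    sysctl net.netfilter.nf_conntrack_tcp_timeout_established
    sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=28800   # e.g. 8 hours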
The TCP aging value is 10 mins. So with 7200 seconds for tcp_keepalive_time
the node was going unresponsive. Is the TCP aging value too low, or is it about right?
On Mon, Jan 24, 2022 at 11:32 PM Bowen Song wrote:
> Is reconfiguring your firewall an option? A stateful firewall really
> shouldn't remove a TCP
Is reconfiguring your firewall an option? A stateful firewall really
shouldn't remove a TCP connection in such a short time, unless the number
of connections is very large and they are generally short lived (which is
often the case with web servers).
On 24/01/2022 13:03, manish khandelwal wrote:
Hi All
Thanks
Hi All
Thanks for the suggestions. The issue was that *tcp_keepalive_time* has the
default value (7200 seconds). So once the idle connection is broken by the
firewall, the application (Cassandra node) was getting notified very late.
Thus we were seeing one node sending the merkle tree and the other not
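To illustrate the fix being described, lowering the keepalive timers well
below the firewall's 10-minute idle cut-off might look like this (example
values, not taken from the thread; these timers only apply to sockets that
have SO_KEEPALIVE enabled):

    # Example values only: first probe after 5 minutes of idle, well under a 10 minute firewall timeout
    cat <<'EOF' | sudo tee /etc/sysctl.d/99-tcp-keepalive.conf
    net.ipv4.tcp_keepalive_time = 300
    net.ipv4.tcp_keepalive_intvl = 60
    net.ipv4.tcp_keepalive_probes = 5
    EOF
    sudo sysctl --system   # apply without rebooting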
Hi Manish,
I understand this answer is non-specific and might not be the most helpful, but
figured I’d mention — Cassandra 3.11.2 is nearly four years old and a large
number of bugs in repair and other subsystems have been resolved in the time
since.
I’d recommend upgrading to the latest
Hi All
After going through the system.log files, I still see that sometimes the merkle
tree is not received from remote DC nodes. Local DC nodes respond as soon
as they are asked. But in the case of the remote DCs, it happens that one or
two do not respond.
There is a considerable time lag (15-16 minutes)
We use nodetool repair -pr -full. We have scheduled these to run
automatically. For us too, it has been seamless on most of the clusters.
This particular node is misbehaving for reasons unknown to me. As per your
suggestion, I am going through the system.log to find that unknown. Will keep you
posted if
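For context, a scheduled run of that kind could be as simple as a cron entry
like the one below (hypothetical: the keyspace name and log path are made up):

    # Hypothetical weekly full primary-range repair of one keyspace
    0 3 * * 0  nodetool repair -pr -full my_keyspace >> /var/log/cassandra/repair.log 2>&1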
May I ask how you run the repair? Is it manually via the nodetool
command line tool, or via a tool or script, such as Cassandra Reaper? If you
are running the repairs manually, would you mind giving Cassandra Reaper a
try?
I have a fairly large cluster under my management, and last time I tried
Agree with you on that. Just wanted to highlight that I am experiencing the
same behavior.
Regards
Manish
On Tue, Jan 18, 2022, 22:50 Bowen Song wrote:
> The link was related to Cassandra 1.2, and it was 9 years ago. Cassandra
> was full of bugs at that time, and it has improved a lot since
The link was related to Cassandra 1.2, and it was 9 years ago. Cassandra
was full of bugs at that time, and it has improved a lot since then. For
that reason, I would rather not compare the issue you have with a
9-year-old issue someone else had.
On 18/01/2022 16:11, manish khandelwal
I am not sure what is happening, but it has happened three times. What happens
is that merkle trees are not received from the nodes of the other data center.
I am getting an issue along similar lines as the one mentioned here
Keep reading the log on the initiator and the node sending the merkle
tree, does anything follow that? FYI, not all log lines have the repair ID in
them, therefore please read the relevant logs in chronological order
without filtering (e.g. "grep") on the repair ID.
I'm sceptical a network issue is
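One way to follow that advice without filtering everything through grep:
locate the repair ID once, then read the surrounding lines unfiltered
(the log path below is the usual default and is assumed):

    grep -n '6e3385e0-74d1-11ec-8e66-9f084ace9968' /var/log/cassandra/system.log   # find where it starts
    less +/6e3385e0 /var/log/cassandra/system.log   # then read the surrounding lines without any filter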
In the system logs, on the node where the repair was initiated, I see that the
node has requested merkle trees from all nodes including itself:
INFO [Repair#3:1] 2022-01-14 03:32:18,805 RepairJob.java:172 - [repair
#6e3385e0-74d1-11ec-8e66-9f084ace9968] Requesting merkle trees for
tablename (to
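A quick sanity check on such a session is to compare how many merkle trees
were requested with how many came back (a sketch; it assumes the usual 3.11
wording "Received merkle tree for ... from ..." and the default log path):

    ID=6e3385e0-74d1-11ec-8e66-9f084ace9968
    grep "$ID" /var/log/cassandra/system.log | grep -c 'Requesting merkle trees'
    grep "$ID" /var/log/cassandra/system.log | grep -c 'Received merkle tree'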
The entry in the debug.log is not specific to a repair session, and it
could also be caused by reasons other than a network connectivity issue,
such as long STW GC pauses. I usually don't start troubleshooting an
issue from the debug log, as it can be rather noisy. The system.log is a
better
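Long stop-the-world pauses of the kind mentioned above do show up in
system.log via GCInspector, so a quick check around the repair window might
be (log path assumed):

    # Any long GC pauses reported around the time the merkle tree went missing?
    grep 'GCInspector' /var/log/cassandra/system.log | tail -n 50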