Hey,
We have a 3.11.9 cluster (recently upgraded from 2.1.14), and after the
upgrade we have an issue when we remove a node.

The moment I run the removenode command, 3 servers in the same DC start to
build up a huge number of pending native-transport-requests (reaching around
1M) and clients have issues because of that. We are using vnodes (32), so I
don't see why 3 servers would be busier than the others (RF is 3, but I
don't see why that would be related).
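
For reference, the pending counts above are what I see on the affected
nodes with something like:

    # pending Native-Transport-Requests on each affected node
    nodetool tpstats | grep -i native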

Each node has a few TB of data, and in the past we were able to remove a
node in ~half a day. Today what happens is that in the first 1-2 hours we
have these issues on some nodes, then things go quiet: the remove is still
running and clients are OK. A few hours later the same issue is back (on
the same problematic nodes) and clients have issues again, which eventually
leads us to run removenode force.

Reducing the stream throughput and the number of compactors (roughly as
shown below) has helped mitigate the issues a bit, but we still see pending
native-transport-requests climbing to insane numbers and clients suffering,
eventually forcing us to run removenode force. Any ideas?
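
The mitigation was along these lines (the exact values are just what we
tried, not a recommendation):

    # throttle streaming during the removenode (MB/s)
    nodetool setstreamthroughput 25

    # plus fewer compactors in cassandra.yaml, e.g.
    concurrent_compactors: 2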

I saw that since 3.11.6 there is a
native_transport_max_concurrent_requests_in_bytes parameter. I am looking
into setting it; perhaps this will keep the number of pending tasks from
getting so high.
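
Something like this is what I have in mind (the values are made up for
illustration; if I understand the defaults correctly they are 1/10th of the
heap globally and 1/40th per client IP):

    # cassandra.yaml - cap memory held by in-flight native transport requests
    native_transport_max_concurrent_requests_in_bytes: 104857600         # ~100 MB total
    native_transport_max_concurrent_requests_in_bytes_per_ip: 26214400   # ~25 MB per client IP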

Gil
