Hey,

We have a 3.11.9 cluster (recently upgraded from 2.1.14), and since the upgrade we have been having an issue when removing a node.
The moment I run the removenode command, 3 servers in the same DC start to accumulate a high number of pending native-transport-requests (reaching around 1M), and clients have issues because of that. We are using vnodes (32), so I don't see why 3 servers would be busier than the others (RF is 3, but I don't see why that would be related).

Each node holds a few TB of data, and in the past we were able to remove a node in about half a day. Now, in the first 1-2 hours we see these issues on those nodes, then things go quiet, the remove keeps running and clients are OK; a few hours later the same issue is back (on the same problematic nodes), clients suffer again, and that forces us to run removenode force.

Reducing stream throughput and the number of concurrent compactors has helped mitigate things a bit, but pending native-transport-requests still climb to insane numbers, clients suffer, and we eventually end up running removenode force anyway.

Any ideas? I saw that since 3.11.6 there is a parameter, native_transport_max_concurrent_requests_in_bytes; I'm looking into setting it, hoping it will keep the number of pending tasks from getting so high (rough sketch of what I mean below).

Gil
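P.S. In case it helps clarify what I'm considering: this is roughly the cassandra.yaml change I had in mind. It's only a sketch; the byte values are guesses I would still have to tune for our heap, and the per-IP variant is just a related option I noticed next to it, not something I'm sure we need.

    # Cap the total bytes held by in-flight native-transport (CQL) requests,
    # so new requests back-pressure clients instead of the pending queue
    # growing unbounded. 100 MB here is a guess, not a recommendation.
    native_transport_max_concurrent_requests_in_bytes: 104857600

    # Companion setting (if I'm reading the yaml comments right): the same
    # kind of cap, but applied per client IP. 10 MB is again just a guess.
    native_transport_max_concurrent_requests_in_bytes_per_ip: 10485760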