May I ask why you scale your Cassandra cluster vertically instead of
horizontally, as recommended?
I'm asking because I have dealt with a vertically scaled cluster before.
They had a query performance issue and blamed it on the hardware not
being strong enough. Scaling vertically did help improve the query
performance for a while, but it turned out the root cause was bad data
modelling, and things gradually got worse as the data size kept growing.
Eventually they hit the ceiling of what money can realistically
buy - 256 GB RAM and 16 cores of 3.x GHz CPU per server in their case.
Is that your case too? More RAM, more cores and a higher CPU frequency
to help "fix" the performance issue? I really hope not.
On 11/03/2021 09:57, Gil Ganz wrote:
Yes. 192 GB.
On Thu, Mar 11, 2021 at 10:29 AM Kane Wilson <k...@raft.so> wrote:
That is a very large heap. I presume you are using G1GC? How much
memory do your servers have?
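For reference, a large heap with G1GC on 3.11 would normally be set up
in jvm.options along these lines (just a sketch; the heap size below is
an example, not a recommendation for your hardware):

    # conf/jvm.options (sketch only; sizes are examples)
    -Xms60G
    -Xmx60G
    # G1 section of the stock file, uncommented:
    -XX:+UseG1GC
    -XX:G1RSetUpdatingPauseTimePercent=5
    -XX:MaxGCPauseMillis=500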
raft.so - Cassandra consulting, support, managed services
On Thu., 11 Mar. 2021, 18:29 Gil Ganz, <gilg...@gmail.com
<mailto:gilg...@gmail.com>> wrote:
I always prefer to do a decommission, but the issue here is that these
servers are on-prem and disks die from time to time. It's a very large
cluster, spread across multiple datacenters around the world, so it can
take some time before we have a replacement, which is why we usually
need to run removenode in such cases.
Other than that there are no issues in the cluster and the load is
reasonable. When this issue happens, following a removenode, this huge
number of pending NTR is what I see, and the weird thing is that it's
only on some nodes.
I have been running with a very small
native_transport_max_concurrent_requests_in_bytes setting on some nodes
for a few days now (a few MB, compared to the default of 0.8 of a 60 GB
heap). It looks like it's good enough for the app, so I will roll it
out to the entire dc and test the removal again.
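For concreteness, the override I'm testing looks roughly like this in
cassandra.yaml (the value below is only an illustration of "a few MB",
not the exact number we settled on):

    # cassandra.yaml - cap on in-flight native transport request payloads
    # (available since 3.11.6; illustrative value, ~10 MB)
    native_transport_max_concurrent_requests_in_bytes: 10485760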
On Tue, Mar 9, 2021 at 10:51 AM Kane Wilson <k...@raft.so> wrote:
It's unlikely to help in this case, but you should be using nodetool
decommission on the node you want to remove, rather than removenode
from another node (and definitely don't force the removal).
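Roughly, the distinction is (the host ID below is a placeholder taken
from nodetool status):

    # Run on the node that is leaving; it streams its own data out:
    nodetool decommission

    # Only if the node is already dead/unrecoverable, run from a live node:
    nodetool removenode <host-id>

    # 'nodetool removenode force' skips the remaining re-streaming and
    # should be a last resort.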
native_transport_max_concurrent_requests_in_bytes defaults to 10% of
the heap, which, depending on your configuration, could result in a
smaller number of concurrent requests than before. It's worth setting
it higher to see if the issue is related. Is this the only issue you
see on the cluster? I assume the load on the cluster is still
low/reasonable and the only symptom you're seeing is the increased
number of pending NTR requests?
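If it helps, the pending NTR count is easy to watch per node:

    # Pending column of Native-Transport-Requests in the thread pool stats
    nodetool tpstats | grep -i native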
raft.so <https://raft.so> - Cassandra consulting, support,
and managed services
On Mon, Mar 8, 2021 at 10:47 PM Gil Ganz
<gilg...@gmail.com <mailto:gilg...@gmail.com>> wrote:
Hey,
We have a 3.11.9 cluster (recently upgraded from
2.1.14), and after the upgrade we have an issue when
we remove a node.
The moment I run the removenode command, 3 servers in the same dc
start to build up a high amount of pending native-transport-requests
(reaching around 1M), and clients have issues because of it. We are
using vnodes (32), so I don't see why 3 servers would be busier than
the others (RF is 3, but I don't see why that would be related).
Each node has a few TB of data, and in the past we were able to
remove a node in about half a day. What happens today is that in the
first 1-2 hours we have these issues on some nodes, then things go
quiet, the removal is still running and clients are ok, and a few
hours later the same issue is back (on the same problematic nodes) and
clients have issues again, leading us to run removenode force.
Reducing the stream throughput and the number of compactors has
helped mitigate the issue a bit, but we still see pending
native-transport requests reaching insane numbers and clients
suffering, eventually forcing us to run removenode force. Any ideas?
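For context, the knobs we turned were roughly these (the values are
illustrative, not the exact settings we ran with):

    # Throttle streaming during the removenode (Mb/s; illustrative value)
    nodetool setstreamthroughput 50

    # Fewer compaction threads via cassandra.yaml (illustrative value)
    concurrent_compactors: 2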
I saw that since 3.11.6 there is a parameter,
native_transport_max_concurrent_requests_in_bytes; I'm looking into
setting it, as perhaps it will keep the number of pending tasks from
getting so high.
Gil