I would pay attention to dirty page writeback (the background writer) activity at the OS level. 
If you see that it isn't keeping up with flushing changes to disk, then you'll 
be in an even worse situation as you increase the JVM heap size, because the extra 
heap comes directly out of the memory available for the page cache.  When Linux 
can't flush to disk fast enough, it can manifest as malloc failures (although if your C* is 
configured to have the JVM pre-touch all of its memory allocations at startup, 
via -XX:+AlwaysPreTouch, that shouldn't happen… I don't know if C* versions as 
old as yours do that, current ones definitely are configured that way).
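
(If it's useful, here is a minimal sketch of what I mean by watching writeback at the 
OS level — just poll the Dirty and Writeback counters in /proc/meminfo and see whether 
the dirty backlog keeps climbing faster than the kernel drains it. The 5-second interval 
is an arbitrary choice of mine; this is only an illustration, not a monitoring tool.)

#!/usr/bin/env python3
# Poll /proc/meminfo (Linux) and print the dirty-page backlog.
import time

def meminfo_kb():
    """Return /proc/meminfo as a dict of {field: kB}."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, rest = line.split(":", 1)
            values[name] = int(rest.strip().split()[0])  # values are reported in kB
    return values

while True:
    info = meminfo_kb()
    print("Dirty: %.0f MB   Writeback: %.0f MB"
          % (info["Dirty"] / 1024, info["Writeback"] / 1024))
    # A steadily growing Dirty figure means flushing is not keeping up.
    time.sleep(5)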

If you get stuck, you may want to consider upgrading to something recent in the 
3.11 line, 3.11.5 or newer.  A setting for controlling Merkle-tree height 
was back-ported from the work on C* version 4, and it lets you dial back some of 
the memory pressure during repairs, trading memory-related performance for 
network-related performance.  Networks are faster these days, so it can be a 
reasonable tradeoff to consider. We used to periodically knock over C* nodes 
during repairs until we incorporated a patch for that issue.
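
To make the memory-vs-network tradeoff concrete, here is a rough back-of-the-envelope 
sketch (the per-leaf byte cost and the 100 GB table size are assumptions I picked for 
illustration, not measured numbers): a Merkle tree of depth d has 2^d leaf ranges, so 
lowering the maximum depth shrinks the tree's heap footprint geometrically, while each 
leaf then covers a wider slice of the table and a single mismatched leaf causes more 
data to be re-streamed.

# Back-of-the-envelope look at the Merkle-tree depth tradeoff in repairs.
# Assumptions for illustration only: ~50 bytes of heap per leaf, and a
# 100 GB table whose token range is split evenly across the leaves.

TABLE_SIZE_GB = 100        # assumed table size
BYTES_PER_LEAF = 50        # assumed heap cost per leaf hash/range

for depth in (15, 18, 20):
    leaves = 2 ** depth
    heap_mb = leaves * BYTES_PER_LEAF / (1024 * 1024)
    mb_per_mismatch = TABLE_SIZE_GB * 1024 / leaves
    print("depth=%2d  leaves=%9d  heap per tree ~%6.1f MB  "
          "~%.2f MB re-streamed per mismatched leaf"
          % (depth, leaves, heap_mb, mb_per_mismatch))

Keep in mind that a repair session compares one tree per replica (per table and range), 
so the per-tree figure multiplies in practice. If I recall correctly, the back-ported 
knob is repair_session_max_tree_depth in cassandra.yaml, but check the 3.11.5 release 
notes to confirm.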

From: Ben G <guobin.em...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, April 16, 2020 at 3:32 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Cassandra node JVM hang during node repair a table with 
materialized view

Thanks a lot. We are working on removing the views and controlling the partition sizes.  
I hope the improvements help us.

Best regards

Gb

On Thu, Apr 16, 2020 at 2:08 PM, Erick Ramirez <erick.rami...@datastax.com> wrote:
The GC collector is G1.  I repaired the node again after scaling up, and the JVM issue 
reproduced.  Can I increase the heap to 40 GB on a 64 GB VM?

I wouldn't recommend going beyond 31GB on G1. It will be diminishing returns as 
I mentioned before.

Do you think the issue is related to the materialized views or the big partitions?

Yes, materialised views are problematic and I don't recommend them for 
production since they're still experimental. But if I were to guess, I'd say 
your problem is more an issue with large partitions and too many tombstones 
both putting pressure on the heap.

The thing is, if you can't bootstrap because you're running into 
TombstoneOverwhelmingException (I'm guessing), I can't see how you wouldn't run 
into it with repairs. In any case, try running repairs on the smaller tables 
first and work through the remaining tables one by one. But building a node back up 
with repairs is a much more expensive exercise than just a plain old bootstrap. I get 
that you're in a tough spot right now, so good luck!


--

Thanks
Guo Bin
