Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

Reid Pinchback Fri, 01 Nov 2019 07:56:59 -0700

Hi Sergio,

I’m not enough of a network wonk to make definitive statements on 
network configuration; finding your in-company network expert is going to be 
a lot more productive.  I’ve forgotten whether you are on-prem or in AWS; if 
in AWS, replace “your network wonk” with “your AWS support contact”, assuming 
you’re paying for support.  I will make two more concrete observations 
though, and you can run these notions down as appropriate.

When C* starts up, see if the logs contain a warning about jemalloc not being 
detected.  That’s something we missed in our 3.11.4 setup and is on my todo 
list to circle back around to evaluate later.  JVMs have some rather 
complicated memory management that relates to efficient allocation of memory to 
threads (this isn’t strictly a JVM thing, but JVMs definitely care).  If you 
have high connection counts, I can see that likely mattering to you.  Also, as 
part of that, the memory arena setting of 4 that is Cassandra’s default may not 
be the right one for you.  The more concurrency you have, the more that number 
may need to bump up to avoid contention on memory allocations.  We haven’t 
played with it because our simultaneous connection counts are modest.  Note 
that Cassandra can create a lot of threads but many of them have low activity 
so I think it’s more about how many are actually active.  Large connection 
counts will move the needle up on you and may motivate tuning the arena count.
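
A rough sketch of that log check, for anyone who wants to script it. It only looks for the word "jemalloc" in whatever log lines you feed it, because the exact wording of the warning is not quoted in this thread; the function name is made up for illustration, and the log path in the usage comment is an assumption, so substitute your own.

```python
from typing import Iterable, List


def find_jemalloc_lines(log_lines: Iterable[str]) -> List[str]:
    """Return every log line that mentions jemalloc, case-insensitively."""
    return [line.rstrip("\n") for line in log_lines if "jemalloc" in line.lower()]


# Usage (the path is an assumption; point it at your node's startup log):
# with open("/var/log/cassandra/system.log", errors="replace") as fh:
#     for hit in find_jemalloc_lines(fh):
#         print(hit)
```

An empty result does not prove jemalloc is loaded; it only means nothing in that file mentions it, so check that the file actually covers a startup.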

When talking to your network person, I’d see what they think about C*’s 
defaults on TCP_NODELAY vs delayed ACKs.  The Datastax docs say that the 
TCP_NODELAY default setting is false in C*, but I looked in the 3.11.4 source 
and the default is coded as true.  It’s only via the config file samples that 
bounce around that it typically gets set to false.  There are times where Nagle 
and delayed ACKs don’t play well together and induce stalls.  I’m not the 
person to help you investigate that because it gets a bit gnarly on the details 
(for example, a refinement to the Nagle algorithm was proposed in the 1990s 
that exists in some OSes and can make my comments here moot).  Somebody who 
lives this stuff will be a more definitive source, but you are welcome to 
copy-paste my thoughts to them for context.
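
For context on what the option itself does, here is a minimal sketch that toggles TCP_NODELAY on a plain socket and reads it back. It illustrates the generic socket option only, not how C* applies it internally; the helper names are invented for the example.

```python
import socket


def nodelay_enabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY (Nagle disabled) is set on a TCP socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


def set_nodelay(sock: socket.socket, enabled: bool) -> None:
    """Turn TCP_NODELAY on or off; on means small writes are not held back by Nagle."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if enabled else 0)


# An unconnected TCP socket is enough to show the option being toggled.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    before = nodelay_enabled(s)  # Nagle is usually on by default, so typically False
    set_nodelay(s, True)
    after = nodelay_enabled(s)
    print(f"TCP_NODELAY before={before} after={after}")
```

Whether delayed ACKs on the peer interact badly with Nagle is a separate, OS-level question that this snippet cannot answer.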

R

From: Sergio <lapostadiser...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, October 30, 2019 at 5:56 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Cassandra 3.11.4 Node the load starts to increase after few 
minutes to 40 on 4 CPU machine

Hi Reid,

I no longer have this load problem.
I solved it by changing the Cassandra driver configuration.
Now my cluster is pretty stable and I don't have machines with crazy CPU load.
The only thing I still need to investigate, though it is not urgent, is the 
number of ESTABLISHED TCP connections. I see one node with 7K ESTABLISHED TCP 
connections while the others have around 4-6K connections open. So the 
newest nodes added to the cluster have a higher number of ESTABLISHED TCP 
connections.
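
One way to get comparable numbers from each node is to count ESTABLISHED sockets per local port straight from /proc/net/tcp. This is only a sketch and assumes a Linux host; the function name is hypothetical, and if the JVM is listening on an IPv6 socket the same parsing would have to be pointed at /proc/net/tcp6 instead.

```python
from collections import Counter
from typing import Iterable

ESTABLISHED = "01"  # TCP_ESTABLISHED as encoded in the "st" column of /proc/net/tcp


def established_by_local_port(proc_net_tcp_lines: Iterable[str]) -> Counter:
    """Count ESTABLISHED sockets per local port from /proc/net/tcp-style lines."""
    counts: Counter = Counter()
    for line in proc_net_tcp_lines:
        fields = line.split()
        # Skip the header row and anything too short to be a socket entry.
        if len(fields) < 4 or ":" not in fields[1]:
            continue
        if fields[3] != ESTABLISHED:
            continue
        # The local address looks like HEXADDR:HEXPORT; keep only the port.
        port_hex = fields[1].rsplit(":", 1)[1]
        counts[int(port_hex, 16)] += 1
    return counts


# Usage on a node (9042 is the port named in the kernel log line in this thread):
# with open("/proc/net/tcp") as fh:
#     print(established_by_local_port(fh)[9042])
```

Running the same count on every node at the same moment makes it easier to tell whether the newer nodes really hold more client connections or the sample was just taken at a busy time.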

default['cassandra']['sysctl'] = {
  'net.ipv4.tcp_keepalive_time' => 60,
  'net.ipv4.tcp_keepalive_probes' => 3,
  'net.ipv4.tcp_keepalive_intvl' => 10,
  'net.core.rmem_max' => 16777216,
  'net.core.wmem_max' => 16777216,
  'net.core.rmem_default' => 16777216,
  'net.core.wmem_default' => 16777216,
  'net.core.optmem_max' => 40960,
  'net.ipv4.tcp_rmem' => '4096 87380 16777216',
  'net.ipv4.tcp_wmem' => '4096 65536 16777216',
  'net.ipv4.ip_local_port_range' => '10000 65535',
  'net.ipv4.tcp_window_scaling' => 1,
  'net.core.netdev_max_backlog' => 2500,
  'net.core.somaxconn' => 65000,
  'vm.max_map_count' => 1048575,
  'vm.swappiness' => 0
}

These are my tweaked values; I used the values recommended by DataStax.
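
Since the newer nodes behave differently, it may be worth confirming that these settings are actually live on every node. Below is a small sketch that compares a desired set of values against /proc/sys; it assumes Linux, and the function names are made up for the example.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple, Union

Value = Union[int, str]


def sysctl_path(key: str) -> Path:
    """Map a sysctl key such as 'net.core.somaxconn' to its /proc/sys file."""
    return Path("/proc/sys") / key.replace(".", "/")


def read_sysctl(key: str) -> str:
    """Read the live value of a sysctl key from /proc/sys."""
    return sysctl_path(key).read_text()


def sysctl_drift(desired: Dict[str, Value],
                 reader: Callable[[str], str] = read_sysctl) -> Dict[str, Tuple[str, str]]:
    """Return {key: (desired, actual)} for every key whose live value differs."""
    drift = {}
    for key, want in desired.items():
        # The kernel separates multi-value settings with tabs, so compare token lists.
        want_tokens = str(want).split()
        have_tokens = reader(key).split()
        if want_tokens != have_tokens:
            drift[key] = (" ".join(want_tokens), " ".join(have_tokens))
    return drift


# Usage with two of the values listed above:
# print(sysctl_drift({"net.core.somaxconn": 65000,
#                     "net.ipv4.tcp_rmem": "4096 87380 16777216"}))
```

An empty dictionary means the listed keys match; anything else shows the intended value next to what the kernel currently reports.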

Do you have something different?

Best,
Sergio

On Wed, Oct 30, 2019 at 13:27, Reid Pinchback 
<rpinchb...@tripadvisor.com> wrote:
Oh nvm, didn't see the later msg about just posting what your fix was.

R

On 10/30/19, 4:24 PM, "Reid Pinchback" 
<rpinchb...@tripadvisor.com> wrote:

    Hi Sergio,

    Assuming nobody is actually mounting a SYN flood attack, this sounds 
    like you're either being hammered with connection requests in very short 
    periods of time, or your TCP backlog tuning is off.  At least, that's where 
    I'd start looking.  If you take that log message and google it ("Possible SYN 
    flooding... Sending cookies") you'll find explanations.  Or just google "TCP 
    backlog tuning".
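
    The kernel message also says to check the SNMP counters. A minimal sketch of reading the SYN-related ones from /proc/net/netstat follows. It assumes a Linux host, the function names are invented for illustration, and the list of counter names is my own pick of ones commonly tied to SYN cookies and listen-queue pressure; your kernel may expose a different set.

```python
from typing import Dict, Iterable

# Counters commonly associated with SYN cookies and accept-queue pressure on Linux.
SYN_COUNTERS = ("SyncookiesSent", "SyncookiesRecv", "SyncookiesFailed",
                "ListenOverflows", "ListenDrops")


def parse_tcpext(netstat_lines: Iterable[str]) -> Dict[str, int]:
    """Parse the TcpExt name/value line pair from /proc/net/netstat-style text."""
    names = None
    for line in netstat_lines:
        if not line.startswith("TcpExt:"):
            continue
        fields = line.split()[1:]
        if names is None:
            names = fields  # the first TcpExt line holds the counter names
        else:
            return dict(zip(names, map(int, fields)))  # the second holds the values
    return {}


def syn_pressure(netstat_lines: Iterable[str]) -> Dict[str, int]:
    """Pick out the SYN-related counters, skipping any this kernel does not expose."""
    counters = parse_tcpext(netstat_lines)
    return {name: counters[name] for name in SYN_COUNTERS if name in counters}


# Usage on a node:
# with open("/proc/net/netstat") as fh:
#     print(syn_pressure(fh))
```

    These counters are cumulative since boot, so take two readings a little apart and look at the difference rather than the absolute numbers.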

    R

    On 10/30/19, 3:29 PM, "Sergio Bilello" 
<lapostadiser...@gmail.com> wrote:

        >
        >Oct 17 00:23:03 prod-personalization-live-data-cassandra-08 kernel: 
TCP: request_sock_TCP: Possible SYN flooding on port 9042. Sending cookies. 
Check SNMP counters.

