Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

Sergio Fri, 01 Nov 2019 14:48:17 -0700

Hi Reid,

Thank you for your extensive response. I don't think we have such a
person, and in any case, even though I am a software engineer, I am curious
to dig into the problem and understand the cause. The only
observation I have right now is that the same cluster holds 2
keyspaces and 3 datacenters.
Only the Cassandra nodes that serve one particular datacenter and
keyspace have thousands of established TCP connections, and I see these
connections only from some clients.
We have 2 kinds of clients, built with 2 different
approaches: one uses Spring Cassandra Reactive and the other uses the Java
Cassandra driver without any wrapper.
I don't know much about the latter, since I didn't write that code.
One note I want to share: I asked for LatencyAwarePolicy to be added to the
Java Cassandra driver, and this tremendously decreased the CPU load on any
new Cassandra node joining the cluster. I suspect there may still be
some driver configuration that is not correct.
I will verify my theory and share the results later for
interested readers, or to help anyone who runs into the same bizarre
behavior.
However, even with thousands of open connections, the load is below 3 on a 4
CPU machine and the latency is good.
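
A minimal sketch, not taken from the cluster itself, of one way to see which clients hold those connections. It counts ESTABLISHED connections per peer address on the native transport port 9042, assuming input in the column layout printed by `ss -tn` (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port). The function name `connections_per_client` and every address in the sample are hypothetical.

```python
from collections import Counter


def connections_per_client(ss_output, port=9042):
    """Count ESTABLISHED connections per peer IP for one local port.

    Assumes `ss -tn` style rows: State Recv-Q Send-Q Local:Port Peer:Port.
    """
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not an established socket.
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        local, peer = fields[3], fields[4]
        # Keep only connections that terminate on the requested local port.
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        counts[peer.rsplit(":", 1)[0]] += 1
    return counts


# Illustrative sample only; these addresses are made up.
sample = """\
State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
ESTAB  0       0       10.0.0.5:9042       10.0.1.17:51234
ESTAB  0       0       10.0.0.5:9042       10.0.1.17:51236
ESTAB  0       0       10.0.0.5:9042       10.0.2.40:40112
ESTAB  0       0       10.0.0.5:7000       10.0.0.6:38000
"""

for client, total in connections_per_client(sample).most_common():
    print(client, total)
```

On a real node the same function could be fed the captured output of `ss -tn`; a client that dominates the count would be the first place to check connection pool settings.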



Thanks and have a great weekend
Sergio




On Fri, Nov 1, 2019 at 07:56, Reid Pinchback <
rpinchb...@tripadvisor.com> wrote:

> Hi Sergio,
>
>
>
> I’m definitely not enough of a network wonk to make definitive statements
> on network configuration; finding your in-company network expert is
> definitely going to be a lot more productive.  I’ve forgotten if you are
> on-prem or in AWS, so if in AWS replace “your network wonk” with “your AWS
> support contact” if you’re paying for support.  I will make two more
> concrete observations though, and you can run these notions down as
> appropriate.
>
>
>
> When C* starts up, see if the logs contain a warning about jemalloc not
> being detected.  That’s something we missed in our 3.11.4 setup and is on
> my todo list to circle back around to evaluate later.  JVMs have some
> rather complicated memory management that relates to efficient allocation
> of memory to threads (this isn’t strictly a JVM thing, but JVMs definitely
> care).  If you have high connection counts, I can see that likely mattering
> to you.  Also, as part of that, the memory arena setting of 4 that is
> Cassandra’s default may not be the right one for you.  The more concurrency
> you have, the more that number may need to bump up to avoid contention on
> memory allocations.  We haven’t played with it because our simultaneous
> connection counts are modest.  Note that Cassandra can create a lot of
> threads but many of them have low activity so I think it’s more about how
> many are actually active.  Large connection counts will move the needle up
> on you and may motivate tuning the arena count.
>
>
>
> When talking to your network person, I’d see what they think about C*’s
> defaults on TCP_NODELAY vs delayed ACKs.  The Datastax docs say that the
> TCP_NODELAY default setting is false in C*, but I looked in the 3.11.4
> source and the default is coded as true.  It’s only via the config file
> samples that bounce around that it typically gets set to false.  There are
> times where Nagle and delayed ACKs don’t play well together and induce
> stalls.  I’m not the person to help you investigate that because it gets a
> bit gnarly on the details (for example, a refinement to the Nagle algorithm
> was proposed in the 1990’s that exists in some OS’s and can make my
> comments here moot).  Somebody who lives this stuff will be a more
> definitive source, but you are welcome to copy-paste my thoughts to them
> for context.
>
>
>
> R
>
>
>
> *From: *Sergio <lapostadiser...@gmail.com>
> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Date: *Wednesday, October 30, 2019 at 5:56 PM
> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Subject: *Re: Cassandra 3.11.4 Node the load starts to increase after
> few minutes to 40 on 4 CPU machine
>
>
>
> Hi Reid,
>
> I no longer have this load problem.
> I solved it by changing the Cassandra driver configuration.
> Now my cluster is pretty stable and I don't have machines with crazy CPU
> load.
> The only thing I still need to investigate, though it is not urgent, is the
> number of ESTABLISHED TCP connections. I see one node with 7K ESTABLISHED
> TCP connections, while the others have around 4-6K open connections. So
> the newest nodes added to the cluster have a higher number of ESTABLISHED
> TCP connections.
>
> default['cassandra']['sysctl'] = {
>   'net.ipv4.tcp_keepalive_time' => 60,
>   'net.ipv4.tcp_keepalive_probes' => 3,
>   'net.ipv4.tcp_keepalive_intvl' => 10,
>   'net.core.rmem_max' => 16777216,
>   'net.core.wmem_max' => 16777216,
>   'net.core.rmem_default' => 16777216,
>   'net.core.wmem_default' => 16777216,
>   'net.core.optmem_max' => 40960,
>   'net.ipv4.tcp_rmem' => '4096 87380 16777216',
>   'net.ipv4.tcp_wmem' => '4096 65536 16777216',
>   'net.ipv4.ip_local_port_range' => '10000 65535',
>   'net.ipv4.tcp_window_scaling' => 1,
>   'net.core.netdev_max_backlog' => 2500,
>   'net.core.somaxconn' => 65000,
>   'vm.max_map_count' => 1048575,
>   'vm.swappiness' => 0
> }
>
> These are my tweaked values; I used the values recommended by DataStax.
>
> Do you have something different?
>
> Best,
> Sergio
>
>
>
> On Wed, Oct 30, 2019 at 13:27, Reid Pinchback <
> rpinchb...@tripadvisor.com> wrote:
>
> Oh nvm, didn't see the later msg about just posting what your fix was.
>
> R
>
>
> On 10/30/19, 4:24 PM, "Reid Pinchback" <rpinchb...@tripadvisor.com> wrote:
>
>     Hi Sergio,
>
>     Assuming nobody is actually mounting a SYN flood attack, then this
> sounds like you're either being hammered with connection requests in very
> short periods of time, or your TCP backlog tuning is off.   At least,
> that's where I'd start looking.  If you take that log message and google it
> ("Possible SYN flooding... Sending cookies") you'll find explanations.  Or
> just google "TCP backlog tuning".
>
>     R
>
>
>     On 10/30/19, 3:29 PM, "Sergio Bilello" <lapostadiser...@gmail.com>
> wrote:
>
>         >
>         >Oct 17 00:23:03 prod-personalization-live-data-cassandra-08
> kernel: TCP: request_sock_TCP: Possible SYN flooding on port 9042. Sending
> cookies. Check SNMP counters.
>
>
>
>
>
>

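The kernel message quoted in this thread ("Possible SYN flooding on port 9042. Sending cookies. Check SNMP counters.") points at counters that Linux exposes under the `TcpExt` section of `/proc/net/netstat`, among them `SyncookiesSent`, `ListenOverflows` and `ListenDrops`. Below is a minimal sketch of reading them, assuming the usual layout of a header line of names followed by a line of values. The function name `tcpext_counters` is hypothetical and the sample numbers are invented for illustration, not measurements from this cluster.

```python
def tcpext_counters(netstat_text):
    """Return the TcpExt counters from /proc/net/netstat style text as a dict.

    The file comes in pairs of lines: "TcpExt: name name ..." followed by
    "TcpExt: value value ...".
    """
    lines = netstat_text.strip().splitlines()
    counters = {}
    # Walk the text two lines at a time: names first, then values.
    for header, values in zip(lines[::2], lines[1::2]):
        if not header.startswith("TcpExt:"):
            continue
        names = header.split()[1:]
        numbers = [int(v) for v in values.split()[1:]]
        counters.update(zip(names, numbers))
    return counters


# Invented sample, trimmed to a handful of fields.
sample = """\
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed ListenOverflows ListenDrops
TcpExt: 1200 1150 3 845 845
IpExt: InNoRoutes InTruncatedPkts
IpExt: 0 0
"""

stats = tcpext_counters(sample)
for name in ("SyncookiesSent", "ListenOverflows", "ListenDrops"):
    print(name, stats.get(name, 0))
```

On a node, the input would be the contents of `/proc/net/netstat`. Values that keep growing between two reads suggest the accept queue is overflowing, which is where the backlog settings discussed in the thread (such as `net.core.somaxconn`) come into play.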