Hi, I kept your full answer quoted in the history below, to keep the list informed.
My answers are inline, down below.

On Mon, 11 Oct 2021 11:33:12 +0200
damiano giuliani <damianogiulian...@gmail.com> wrote:

> ehy guys sorry for being late, was busy during the WE
>
> here i am:
>
> > Did you see the swap activity (in/out, not just swap occupation) happen
> > in the same time the member was lost on corosync side?
> > Did you check corosync or some of its libs were indeed in swap?
>
> no, and i don't know how to do it. i just noticed the swap occupation,
> which suggested to me (and my colleague) to find out if it could cause
> some trouble.
>
> > First, corosync now sits on a lot of memory because of knet. Did you
> > try to switch back to udpu which is using way less memory?
>
> No, i haven't moved to udpu, can't stop processes at all.
>
> > "Could not lock memory of service to avoid page faults"
>
> grep -rn 'Could not lock memory of service to avoid page faults' /var/log/*
> returns nothing

This message should appear on corosync startup. Make sure the logs hadn't
been rotated to a blackhole in the meantime...

> > On my side, mlock is unlimited in the ulimit settings. Check the values
> > in /proc/$(coro PID)/limits (be careful with the ulimit command, check
> > the proc itself).
>
> cat /proc/101350/limits
> Limit                     Soft Limit           Hard Limit           Units
> Max cpu time              unlimited            unlimited            seconds
> Max file size             unlimited            unlimited            bytes
> Max data size             unlimited            unlimited            bytes
> Max stack size            8388608              unlimited            bytes
> Max core file size        0                    unlimited            bytes
> Max resident set          unlimited            unlimited            bytes
> Max processes             770868               770868               processes
> Max open files            1024                 4096                 files
> Max locked memory         unlimited            unlimited            bytes
> Max address space         unlimited            unlimited            bytes
> Max file locks            unlimited            unlimited            locks
> Max pending signals       770868               770868               signals
> Max msgqueue size         819200               819200               bytes
> Max nice priority         0                    0
> Max realtime priority     0                    0
> Max realtime timeout      unlimited            unlimited            us

Ah... That's the first thing I change.
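Answering the "i don't know how to do it" above: a quick sketch of how swap *activity* (as opposed to mere swap occupation) and per-process swap usage can be checked on Linux. The PID defaults to the current shell purely for illustration; on a cluster node you would point it at the corosync PID (101350 in the output quoted above).

```shell
#!/bin/sh
# Is the machine actively swapping? Watch the si/so columns of vmstat
# (pages swapped in/out per second); non-zero bursts mean real swap
# activity, not just old pages parked in swap.
if command -v vmstat >/dev/null; then vmstat 1 2; fi

# Is the process itself (partly) in swap? VmSwap in /proc/<pid>/status
# gives the total; summing the Swap: fields of /proc/<pid>/smaps gives
# the same figure broken down per mapping.
pid=${1:-$$}   # defaults to this shell; use the corosync PID on a real node
grep VmSwap "/proc/$pid/status"
awk '/^Swap:/ {s += $2} END {print s " kB of process memory in swap"}' "/proc/$pid/smaps"
```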
> > In SLES, that is defaulted to 10s and so far I have never seen an
> > environment that is stable enough for the default 1s timeout.
>
> old versions have a 10s default
> you are not going to fix the problem this way, a 1s timeout for a bonded
> network and overkill hardware is an enormous time.
>
> hostnamectl | grep Kernel
>   Kernel: Linux 3.10.0-1160.6.1.el7.x86_64
> [root@ltaoperdbs03 ~]# cat /etc/os-release
> NAME="CentOS Linux"
> VERSION="7 (Core)"
>
> > Indeed. But it's an arbitrage between swapping process mem or freeing
> > mem by removing data from cache. For database servers, it is advised to
> > use a lower value for swappiness anyway, around 5-10, as a swapped
> > process means longer queries, longer data in caches, piling sessions,
> > etc.
>
> totally agree, for a db server swappiness has to be 5-10.
>
> > kernel?
> > What are your settings for vm.dirty_* ?
>
> hostnamectl | grep Kernel
>   Kernel: Linux 3.10.0-1160.6.1.el7.x86_64
> [root@ltaoperdbs03 ~]# cat /etc/os-release
> NAME="CentOS Linux"
> VERSION="7 (Core)"
>
> sysctl -a | grep dirty
> vm.dirty_background_bytes = 0
> vm.dirty_background_ratio = 10

Considering your 256GB of physical memory, this means you can dirty up to
~25GB of pages in cache before the kernel starts writing them to storage in
the background. You might want to trigger these background, lighter syncs
much before hitting this limit.

> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 20

This is 20% of your 256GB of physical memory. Past this limit, writes have
to go to disk, directly. Considering the time it takes to write to SSD
compared to memory, and the amount of data to sync in the background as
well (up to ~52GB), this could be very painful.

> vm.dirty_writeback_centisecs = 500
>
> > Do you have a proof that swap was the problem?
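As an aside, the percentage-to-bytes arithmetic above is easy to check from a shell. A small sketch: the 10/20 ratios are taken from the sysctl output quoted above, and the commented `sysctl -w` values at the end are illustrative assumptions for large-memory hosts, not a recommendation.

```shell
#!/bin/sh
# Translate vm.dirty_background_ratio / vm.dirty_ratio percentages into
# absolute sizes for this host's RAM (MemTotal is reported in kB).
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
bg_ratio=10    # vm.dirty_background_ratio, from the output above
hard_ratio=20  # vm.dirty_ratio, from the output above

echo "background writeback starts at: $((mem_kb * bg_ratio / 100 / 1024 / 1024)) GiB of dirty pages"
echo "writes become synchronous at:   $((mem_kb * hard_ratio / 100 / 1024 / 1024)) GiB of dirty pages"

# One common mitigation is to cap the limits in bytes instead of percent
# (setting a *_bytes sysctl zeroes its matching *_ratio). Example values
# only, to be tuned per workload:
#   sysctl -w vm.dirty_background_bytes=$((256 * 1024 * 1024))  # 256 MiB
#   sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024))            # 1 GiB
```

With 256GB of RAM (268435456 kB), this prints 25 GiB for the background threshold and 51 GiB for the hard limit, matching the figures discussed above.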
>
> not at all, but after switching swappiness to 10, the cluster hasn't
> suddenly swapped anymore for a month.

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/