Jan and Tomas,

Even though it no longer appears that I have the 99% busy condition I reported on Friday, I went ahead and generated a core dump, as Jan suggested in his earlier reply:
"If you still have the pcsd/ruby process in that state, it might be worth dumping a core for further off-line examination. Assuming you have enough space to store it (in order of gigabytes, it seems) and gdb installed, you can do it like: gcore -o pcsd.core 23277 I have no idea how far the support for Ruby interpretation in gdb goes (Python is quite well supported in terms of high level debugging), but could be enough for figuring out what's going on. If you are confident enough your cluster configuration does not contain anything too confidential, it would perhaps be best if you shared this core file in a compressed form privately with tojeline at redhat. Otherwise, you can use gdb itself to look around the call stack in the core file, strings utility to guess if there's excessive accumulation of particular strings, and similar analyses, some of which are applicable also on live process, and some would be usable only on live process (like strace)." There is clearly still a very high occurrence of "High CIB load detected" in the last 12 hours ... [root@zs95kj cluster]# grep "High CIB load detected" /var/log/messages |wc -l 3940 [root@zs95kj cluster]# date;grep "High CIB load detected" /var/log/messages |tail Mon Feb 6 12:10:30 EST 2017 Feb 6 12:06:00 zs95kj crmd[27233]: notice: High CIB load detected: 0.993333 Feb 6 12:06:30 zs95kj crmd[27233]: notice: High CIB load detected: 0.989333 Feb 6 12:07:00 zs95kj crmd[27233]: notice: High CIB load detected: 0.994667 Feb 6 12:07:30 zs95kj crmd[27233]: notice: High CIB load detected: 0.994000 Feb 6 12:08:00 zs95kj crmd[27233]: notice: High CIB load detected: 0.993333 Feb 6 12:08:30 zs95kj crmd[27233]: notice: High CIB load detected: 0.994333 Feb 6 12:09:00 zs95kj crmd[27233]: notice: High CIB load detected: 0.997333 Feb 6 12:09:30 zs95kj crmd[27233]: notice: High CIB load detected: 0.993000 Feb 6 12:10:00 zs95kj crmd[27233]: notice: High CIB load detected: 0.990000 Feb 6 12:10:30 zs95kj crmd[27233]: notice: High CIB load detected: 0.992333 ... but, I don't currently see 99% CPU in the top output for the ruby process. The cib process does hit 99% : top - 16:15:15 up 5 days, 6:35, 2 users, load average: 3.05, 2.81, 2.76 Tasks: 1185 total, 3 running, 1182 sleeping, 0 stopped, 0 zombie %Cpu(s): 16.0 us, 5.0 sy, 0.0 ni, 78.4 id, 0.0 wa, 0.1 hi, 0.4 si, 0.1 st KiB Mem : 65795900 total, 3073336 free, 49808324 used, 12914240 buff/cache KiB Swap: 10485657+total, 99452000 free, 5404572 used. 13466516 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME + COMMAND 27225 haclust+ 20 0 110060 71608 19412 R 99.7 0.1 6339:35 cib 23277 root 20 0 14.151g 9.626g 123696 S 19.8 15.3 938:17.53 ruby Also, commands such as 'pcs status pcsd' are returning very quickly now (as opposed to the 4-5 duration I reported earlier). Anyway, I created the core on the cluster node which was showing the hightest amount of accumulated run time on the pcsd / ruby process (PID 23277): [root@zs95kj pcsd_hang_zeus]# date;ps -ef |grep pcsd Mon Feb 6 09:08:19 EST 2017 root 23253 1 0 Feb01 ? 00:00:00 /bin/sh /usr/lib/pcsd/pcsd start root 23276 23253 0 Feb01 ? 00:00:00 /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb root 23277 23276 12 Feb01 ? 15:10:10 /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb I don't know if it matters, but I had to run the core dump twice on the 23277 process because I ran out of space on the first one. I ended up deleting the first one. 
I don't know if it matters, but I had to run the core dump twice on the 23277 process because I ran out of space on the first attempt; I ended up deleting the first core. To get the second one:

[root@zs95kj pcsd_hang_zeus]# date;gcore -o /guestimages/nfs1/pcsd.core2 23277
Mon Feb 6 09:09:51 EST 2017
[New LWP 43430]
[New LWP 24440]
[New LWP 23403]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x000003ffa91f42d4 in ?? () from /lib64/libc.so.6
warning: target file /proc/23277/cmdline contained unexpected null characters
Saved corefile /guestimages/nfs1/pcsd.core2.23277
You have new mail in /var/spool/mail/root
[root@zs95kj pcsd_hang_zeus]#

I compressed it and uploaded it to ibm.box.com, along with the recent pcsd and messages logs from my node. I will send the box link to Tomas in a separate email thread.

For the heck of it, I also got a core dump of the cib process and uploaded that too:

[root@zs95kj ~]# date;gcore -o /guestimages/nfs1/zs95kj_cib_core_Feb6 27225
Mon Feb 6 16:21:03 EST 2017
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x000003ffa8a885aa in BZ2_compressBlock () from /lib64/libbz2.so.1
Saved corefile /guestimages/nfs1/zs95kj_cib_core_Feb6.27225
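In case Tomas or anyone else wants to poke at the pcsd core off-line, I assume the gdb / strings approach Jan described would go roughly like this (just a sketch; I haven't tried to make sense of the Ruby stacks myself):

  # open the core against the interpreter binary and dump every thread's call stack
  gdb -q /usr/bin/ruby /guestimages/nfs1/pcsd.core2.23277
  (gdb) info threads
  (gdb) thread apply all bt

  # rough check for excessive accumulation of particular strings in the core
  strings /guestimages/nfs1/pcsd.core2.23277 | sort | uniq -c | sort -rn | head -20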
In the pcsd log, I see "pcs status nodes corosync" requests approximately once every 64 seconds:

[root@zs95kj cluster]# grep "pcs status nodes corosync" /var/log/pcsd/pcsd.log |wc -l
1861

[root@zs95kj ~]# grep "pcs status nodes corosync" /var/log/pcsd/pcsd.log | tail
I, [2017-02-06T16:18:53.468885 #23277]  INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:19:57.139796 #23277]  INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:21:00.833517 #23277]  INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:22:04.151613 #23277]  INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:23:07.803426 #23277]  INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:24:11.360107 #23277]  INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:25:14.828617 #23277]  INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:26:18.553637 #23277]  INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:27:22.363723 #23277]  INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:28:26.330005 #23277]  INFO -- : Running: /usr/sbin/pcs status nodes corosync

A typical, repeating set of entries in pcsd.log looks like this:

I, [2017-02-06T16:29:26.367339 #23277]  INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2017-02-06T16:29:26.367458 #23277]  INFO -- : CIB USER: hacluster, groups:
I, [2017-02-06T16:29:27.438925 #23277]  INFO -- : Return Value: 0
::ffff:10.20.93.13 - - [06/Feb/2017:16:29:27 -0500] "GET /remote/get_configs HTTP/1.1" 200 1077 1.0768
::ffff:10.20.93.13 - - [06/Feb/2017:16:29:27 -0500] "GET /remote/get_configs HTTP/1.1" 200 1077 1.0770
zs93KLpcs1.pokprv.stglabs.ibm.com - - [06/Feb/2017:16:29:26 EST] "GET /remote/get_configs HTTP/1.1" 200 1077 - -> /remote/get_configs
I, [2017-02-06T16:29:28.816100 #23277]  INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2017-02-06T16:29:28.816288 #23277]  INFO -- : CIB USER: hacluster, groups:
I, [2017-02-06T16:29:29.890788 #23277]  INFO -- : Return Value: 0
I, [2017-02-06T16:29:29.890958 #23277]  INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2017-02-06T16:29:29.890990 #23277]  INFO -- : CIB USER: hacluster, groups:
I, [2017-02-06T16:29:31.082885 #23277]  INFO -- : Return Value: 0
I, [2017-02-06T16:29:31.084041 #23277]  INFO -- : SRWT Node: zs90kppcs1 Request: get_configs
I, [2017-02-06T16:29:31.085012 #23277]  INFO -- : SRWT Node: zs93kjpcs1 Request: get_configs
I, [2017-02-06T16:29:31.086098 #23277]  INFO -- : SRWT Node: zs93KLpcs1 Request: get_configs
I, [2017-02-06T16:29:31.088295 #23277]  INFO -- : SRWT Node: zs95kjpcs1 Request: get_configs
I, [2017-02-06T16:29:31.088515 #23277]  INFO -- : SRWT Node: zs95KLpcs1 Request: get_configs
I, [2017-02-06T16:29:31.157357 #23277]  INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2017-02-06T16:29:31.157480 #23277]  INFO -- : CIB USER: hacluster, groups:
I, [2017-02-06T16:29:32.271072 #23277]  INFO -- : Return Value: 0
::ffff:10.20.93.12 - - [06/Feb/2017:16:29:32 -0500] "GET /remote/get_configs HTTP/1.1" 200 1077 1.1190
::ffff:10.20.93.12 - - [06/Feb/2017:16:29:32 -0500] "GET /remote/get_configs HTTP/1.1" 200 1077 1.1193
zs95kjpcs1.pokprv.stglabs.ibm.com - - [06/Feb/2017:16:29:31 EST] "GET /remote/get_configs HTTP/1.1" 200 1077 - -> /remote/get_configs

Again, the pcsd query commands are returning quickly now, so all this data may not be of much use anymore. Comments / suggestions welcome.

Thanks...

Scott Greenlese ... IBM KVM on System Z Solutions Test, Poughkeepsie, N.Y.
INTERNET: [email protected]


From: Tomas Jelinek <[email protected]>
To: [email protected]
Date: 02/06/2017 04:55 AM
Subject: Re: [ClusterLabs] pcsd 99% CPU

On 3.2.2017 at 22:08, Scott Greenlese wrote:
> Hi all..
>
> Over the past few days, I noticed that pcsd and ruby process is pegged
> at 99% CPU, and commands such as
> pcs status pcsd take up to 5 minutes to complete. On all active cluster
> nodes, top shows:
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM      TIME+ COMMAND
> 27225 haclust+  20   0  116324  91600  23136 R  99.3  0.1    1943:40 cib
> 23277 root      20   0 12.868g 8.176g   8460 S  99.7 13.0  407:44.18 ruby
>
> The system log indicates High CIB load detected over the past 2 days:
>
> [root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Feb 3" |wc -l
> 1655
> [root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Feb 2" |wc -l
> 1658
> [root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Feb 1" |wc -l
> 147
> [root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Jan 31" |wc -l
> 444
> [root@zs95kj ~]# grep "High CIB load detected" /var/log/messages |grep "Jan 30" |wc -l
> 352
>
> The first entries logged on Feb 2 started around 8:42am ...
>
> Feb 2 08:42:12 zs95kj crmd[27233]: notice: High CIB load detected: 0.974333
>
> This happens to coincide with the time that I had caused a node fence
> (off) action by creating an iface-bridge resource and specifying
> a non-existent vlan slave interface (reported to the group yesterday in
> a separate email thread). It also happened to cause me to lose
> quorum in the cluster, because 2 of my 5 cluster nodes were already
> offline.
>
> My cluster currently has just over 200 VirtualDomain resources to
> manage, plus one iface-bridge resource and one iface-vlan resource,
> both of which are currently configured properly and operational.
>
> I would appreciate some guidance on how to proceed with debugging this
> issue. I have not taken any recovery actions yet.

Checking /var/log/pcsd/pcsd.log to see what pcsd is actually doing might
be a good start. What pcsd version do you have?

> I considered stopping the cluster, recycling pcsd.service on all nodes,
> and restarting the cluster... and also rebooting the nodes, if
> necessary. But I didn't want to clear it yet in case there's anything I
> can capture while in this state.
Restarting just pcsd might be enough.

Tomas

> Thanks..
>
> Scott Greenlese ... KVM on System Z - Solutions Test, Poughkeepsie, N.Y.
> INTERNET: [email protected]
_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
