Okay, this is happening again on a couple of servers right now, and I'm happy
to let it spin and dig into it further. I'm not at all experienced with this
kind of debugging, though, so I'll need explicit instructions on what to do
beyond what I've documented here.
I don't see anything of note in pcsd.log. It appears to be just normal activity
logged by the master process, which is not the runaway one. Here's a snippet:
10.124.167.177 - - [23/May/2018:15:56:34 +0000] "GET /remote/get_configs HTTP/1.1" 200 553 0.0145
10.124.167.177 - - [23/May/2018:15:56:34 +0000] "GET /remote/get_configs HTTP/1.1" 200 553 0.0147
10.124.167.177 - - [23/May/2018:15:56:34 UTC] "GET /remote/get_configs HTTP/1.1" 200 553
- -> /remote/get_configs
I, [2018-05-23T15:56:37.972682 #1378] INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2018-05-23T15:56:37.972805 #1378] INFO -- : CIB USER: hacluster, groups:
I, [2018-05-23T15:56:37.982066 #1378] INFO -- : Return Value: 0
10.124.167.176 - - [23/May/2018:15:56:37 +0000] "GET /remote/get_configs HTTP/1.1" 200 553 0.0107
10.124.167.176 - - [23/May/2018:15:56:37 +0000] "GET /remote/get_configs HTTP/1.1" 200 553 0.0108
10.124.167.176 - - [23/May/2018:15:56:37 UTC] "GET /remote/get_configs HTTP/1.1" 200 553
- -> /remote/get_configs
I, [2018-05-23T15:57:10.648134 #1378] INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2018-05-23T15:57:10.648276 #1378] INFO -- : CIB USER: hacluster, groups:
I, [2018-05-23T15:57:10.660617 #1378] INFO -- : Return Value: 0
10.124.167.178 - - [23/May/2018:15:57:10 +0000] "GET /remote/get_configs HTTP/1.1" 200 553 0.0140
10.124.167.178 - - [23/May/2018:15:57:10 +0000] "GET /remote/get_configs HTTP/1.1" 200 553 0.0141
10.124.167.178 - - [23/May/2018:15:57:10 UTC] "GET /remote/get_configs HTTP/1.1" 200 553
- -> /remote/get_configs
I ran `strace -p <pid>`, and the screen filled with the following line
repeating as fast as my terminal can render:
sched_yield() = 0
sched_yield() = 0
sched_yield() = 0
I redirected this into a file for about 1 second and it filled with about
20,000 of those lines.
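For anyone who wants to summarize a capture like that rather than eyeball it,
here is a minimal sketch in Python. It assumes the strace output was saved to a
plain text file with one call per line; the function name tally_syscalls is
hypothetical and only for illustration.

```python
import re
from collections import Counter

# Matches the syscall name at the start of a strace line, for example
# "sched_yield() = 0", with an optional "[pid 1234] " prefix that strace
# adds when following multiple threads or processes.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")


def tally_syscalls(lines):
    """Count how many times each syscall name appears in strace output lines."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Stand-in for the lines of a saved capture file.
    sample = ["sched_yield() = 0"] * 3
    for name, count in tally_syscalls(sample).most_common():
        print(f"{name}: {count}")
```

Dividing the count for sched_yield by the capture duration in seconds gives a
rough calls-per-second figure, which is easier to compare between servers than
a raw line count.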
I installed ltrace, but didn't really know how to use it...
`ltrace -p <pid>` didn't output anything.
`ltrace -p <pid> -S` showed something similar to strace:
SYS_sched_yield(0x7f0ebc3f5c40, 0x7f0ebc3f5c40, 0, 0x7273752f3a6e6962) = 0
SYS_sched_yield(0x7f0ebc3f5c40, 0x7f0ebc3f5c40, 0, 0x7273752f3a6e6962) = 0
SYS_sched_yield(0x7f0ebc3f5c40, 0x7f0ebc3f5c40, 0, 0x7273752f3a6e6962) = 0
I next enabled debugging in /etc/default/pcsd and issued a `systemctl restart
pcsd`. Unfortunately, that killed the runaway child process.
However, I found another server where it's also happening again. Debugging is
not enabled there, but is there anything else I can do while the process is
still running?
Here are the pcsd processes:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 6103 0.0 0.3 1076744 59972 ? Ssl Apr06 67:17 /usr/bin/ruby -C/var/lib/pcsd -I/usr/share/pcsd -- /usr/share/pcsd/ssl.rb & > /dev/null &
root 24923 99.8 0.3 1076744 52744 ? Rl May19 5556:31 \_ /usr/bin/ruby -C/var/lib/pcsd -I/usr/share/pcsd -- /usr/share/pcsd/ssl.rb & > /dev/null &
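To pick out the runaway child on other servers without reading the listing by
hand, something like the following rough Python sketch could work. It assumes
the listing uses the same eleven columns as above with each process on a single
unwrapped line; the function name busy_processes and the 90% threshold are
hypothetical choices for illustration.

```python
def busy_processes(ps_text, threshold=90.0):
    """Return (pid, cpu_percent, command) for rows at or above the threshold."""
    rows = []
    for line in ps_text.splitlines():
        # Split into the ten leading columns plus the full command line.
        fields = line.split(None, 10)
        # Skip the header and anything that does not look like a full row.
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        try:
            cpu = float(fields[2])
        except ValueError:
            continue
        if cpu >= threshold:
            rows.append((int(fields[1]), cpu, fields[10]))
    return rows
```

Fed the listing above, this should return only the row for PID 24923, since the
master process at 0.0% CPU falls below the threshold.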
I don't have gcore installed and don't know which package might provide it. I
also don't have experience with gdb but am happy to try anything suggested to
help figure out what's going on.
The pcs version is 0.9.149, as packaged by Debian and inherited by Ubuntu.
Regards,
--
Casey