On 03/08/2017 07:13 PM, Jeffrey Westgate wrote:
> yes - at least I think this is all the packages. (What I did was run a yum
> update -y, for the most part - had to do pacemaker separately -- had to stop
> it, update it, start it.)
>
> now, is it possible I'm missing a needed package after the update... but
> dependencies should have handled that....?
>
> [root@resolver-lb3 log]# yum list resource-agents\* ccs\* pcs\* cman\* keepalive\* corosync\* pacemaker\*
> Loaded plugins: fastestmirror, refresh-packagekit
> Loading mirror speeds from cached hostfile
>  * epel: fedora-epel.mirror.lstn.net
>  * sl: ftp.scientificlinux.org
>  * sl-security: ftp.scientificlinux.org
> Installed Packages
> ccs.x86_64                        0.16.2-75.el6_6.1        installed
> cman.x86_64                       3.0.12.1-59.el6          @sl
> corosync.x86_64                   1.4.1-17.el6             @sl
> corosynclib.x86_64                1.4.1-17.el6             @sl
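
The listing above puts corosync and corosynclib at 1.4.1-17.el6. One rough way to see how that compares with another build is sketched here. It assumes GNU `sort -V` is available, and `sort -V` only approximates RPM's own version ordering, so treat the result as a hint. The helper name `version_older` is hypothetical; the reference value 1.4.7-5 is the corosync build mentioned just below as shipping with RHEL-6.8.

```shell
#!/bin/sh
# version_older A B: succeed when A sorts strictly before B under GNU sort -V.
# Note: sort -V approximates, but is not identical to, RPM's version ordering.
version_older() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

installed="1.4.1-17"   # corosync version-release from the yum listing
reference="1.4.7-5"    # corosync build the reply associates with RHEL-6.8

if version_older "$installed" "$reference"; then
    echo "corosync $installed is older than $reference"
else
    echo "corosync $installed is not older than $reference"
fi
```

On a live node the `installed` value could be filled in from `rpm -q --qf '%{VERSION}-%{RELEASE}\n' corosync` instead of being hard-coded.
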
Looks like your corosync is ancient and in particular it seems to be out of
sync with pacemaker. Pacemaker looks like the version released with RHEL-6.8
but corosync there is 1.4.7-5 and you have 1.4.1-17.

> keepalived.x86_64                 1.2.7-3.el6              @sl
> pacemaker.x86_64                  1.1.14-8.el6_8.2         @sl-security
> pacemaker-cli.x86_64              1.1.14-8.el6_8.2         @sl-security
> pacemaker-cluster-libs.x86_64     1.1.14-8.el6_8.2         @sl-security
> pacemaker-libs.x86_64             1.1.14-8.el6_8.2         @sl-security
> pcs.x86_64                        0.9.139-9.el6_7.1        installed
> resource-agents.x86_64            3.9.2-40.el6             @sl
> Available Packages
> corosynclib.i686                  1.4.1-17.el6             sl
> corosynclib-devel.i686            1.4.1-17.el6             sl
> corosynclib-devel.x86_64          1.4.1-17.el6             sl
> pacemaker-cluster-libs.i686       1.1.14-8.el6_8.2         sl-security
> pacemaker-cts.x86_64              1.1.14-8.el6_8.2         sl-security
> pacemaker-doc.x86_64              1.1.14-8.el6_8.2         sl-security
> pacemaker-libs.i686               1.1.14-8.el6_8.2         sl-security
> pacemaker-libs-devel.i686         1.1.14-8.el6_8.2         sl-security
> pacemaker-libs-devel.x86_64       1.1.14-8.el6_8.2         sl-security
> pacemaker-remote.x86_64           1.1.14-8.el6_8.2         sl-security
> pcs.noarch                        0.9.90-2.el6             sl
> resource-agents-sap.x86_64        3.9.2-40.el6             sl
> ________________________________________
>
> ------------------------------
>
> Message: 2
> Date: Wed, 8 Mar 2017 10:40:49 -0600
> From: Ken Gaillot <[email protected]>
> To: [email protected]
> Subject: Re: [ClusterLabs] Antw: Re: Never join a list without a problem...
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=windows-1252
>
> On 03/08/2017 09:58 AM, Jeffrey Westgate wrote:
>> Ok.
>>
>> Been running monit for a few days, and atop (running a script to capture an
>> atop output every 10 seconds for an hour, rotate the log, and do it again;
>> runs from midnight to midnight, changes the date, and does it again). I
>> correlate between the atop logs, nagios alerts, and monit, to try to find a
>> trigger. Like trying to find a particular snowflake in Alaska in January.
>>
>> Have had a handful of episodes with all the monitors running. We have
>> determined nothing. Nothing significantly changes from normal/regular to
>> high host load.
>>
>> It's a VMWare/ESXi-hosted VM, so we moved it to a different host and
>> different datastore (so, effectively new CPU, memory, nic, disk, video...
>> basically all "new" hardware. still have episodes.
>>
>> Was running the "VMWare provided" vmtools. removed and replaced with
>> open-vm-tools this morning. just had another episode.
>>
>> was running atop interactively when the episode started - the only thing
>> that seems to change is the hostload goes up. momentary spike in "avio" for
>> the disk -- all the way up to 25 msecs. lasted for one ten-second slice from
>> atop.
>>
>> no zombies, no wait, no spike in network, transport, mem use, disk
>> reads/writes... nothing I can see (and by I, I mean "we" as we have three
>> people looking)
>>
>> I've got other boxes running the same OS - updated them at the same time, so
>> patch level is all same. No similar issues. The only thing I have
>> different is these two are running pacemaker, corosync, keepalived. maybe
>> when they were updated, they need a library I don't have?
>>
>> running /usr/sbin/iotop -obtqqq > /var/log/iotop.log -- no red flags
>> there. so - not OS, not IO, not hardware (virtual as it is...) ... only
>> leaves software.
>>
>> Maybe pacemaker is just incompatible with:
>>
>> Scientific Linux release 6.5 (Carbon)
>> kernel 2.6.32-642.15.1.el6.x86_64
>>
>> ??
>
> That does sound bizarre. I haven't tried 6.5 in a while, but it's
> certainly compatible with the current 6.8.
>
> IIRC, you updated to the 6.8 pacemaker packages ... Did you also update
> the OS and/or other cluster-related packages to 6.8?
>
>> At this point it's more of a curiosity than an out and out problem, as
>> performance does not seem to be impacted noticeably. Packet-in, packet-out
>> seems unperturbed. Same cannot be said for us administrators...
>>
>> ________________________________________
>> From: [email protected] [[email protected]]
>> Sent: Friday, March 03, 2017 7:27 AM
>> To: [email protected]
>> Subject: Users Digest, Vol 26, Issue 10
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Fri, 3 Mar 2017 13:27:25 +0000
>> From: Jeffrey Westgate <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Subject: [ClusterLabs] Antw: Re: Never join a list without a problem...
>> Message-ID:
>> <a36b14fa9aa67f4e836c0ee59dea89c4015b214...@cm-sas-mbx-07.sas.arkgov.net>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> Appreciate the offer - not familiar with monit.
>>
>> Going to try running atop through logrotate for the day, keep 12, rotate
>> hourly (to control space utilization) and see if I can catch anything that
>> way. My biggest issue is we've not caught it as it starts, so we don't ever
>> see anything amiss.
>>
>> If this doesn't work, then I will likely take you up on how to script monit
>> to catch something.
>>
>> Thanks --
>>
>> Jeff
>> ________________________________________
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Thu, 2 Mar 2017 16:32:02 +0000
>> From: Jeffrey Westgate <[email protected]>
>> To: Adam Spiers <[email protected]>, "Cluster Labs - All topics related
>> to open-source clustering welcomed" <[email protected]>
>> Subject: Re: [ClusterLabs] Never join a list without a problem...
>> Message-ID:
>> <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Since we have both pieces of the load-balanced cluster doing the same thing
>> - for still-as-yet unidentified reasons - we've put atop on one and sysdig
>> on the other. Running atop at 10 second slices, hoping it will catch
>> something. While configuring it yesterday, that server went into its
>> 'episode', but there was nothing in the atop log to show anything. Nothing
>> else changed except the cpu load average. No increase in any other
>> parameter.
>>
>> frustrating.
>>
>>
>> ________________________________________
>> From: Adam Spiers [[email protected]]
>> Sent: Wednesday, March 01, 2017 5:33 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> Cc: Jeffrey Westgate
>> Subject: Re: [ClusterLabs] Never join a list without a problem...
>>
>> Ferenc Wágner <[email protected]> wrote:
>>> Jeffrey Westgate <[email protected]> writes:
>>>
>>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>>> longer, and we cannot set a clock by it - while the machine is 95%
>>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>>> 60%. It takes about 20 minutes to peak, and another 30 to 45 minutes
>>>> to come back down to baseline, which is mostly 0.00. (attached
>>>> hostload.pdf) This happens to both machines, randomly, and is
>>>> concerning, as we'd like to find what's causing it and resolve it.
>>>
>>> Try running atop (http://www.atoptool.nl/). It collects and logs
>>> process accounting info, allowing you to step back in time and check
>>> resource usage in the past.
>>
>> Nice, I didn't know atop could also log the collected data for future
>> analysis.
>>
>> If you want to capture even more detail, sysdig is superb:
>>
>> http://www.sysdig.org/
>>
>> ------------------------------
>>
>> Message: 5
>> Date: Fri, 03 Mar 2017 08:04:22 +0100
>> From: "Ulrich Windl" <[email protected]>
>> To: <[email protected]>
>> Subject: [ClusterLabs] Antw: Re: Never join a list without a problem...
>> Message-ID: <[email protected]>
>> Content-Type: text/plain; charset=UTF-8
>>
>> >>> Jeffrey Westgate <[email protected]> wrote on 02.03.2017 at 17:32
>> in message
>> <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>:
>>> Since we have both pieces of the load-balanced cluster doing the same thing -
>>> for still-as-yet unidentified reasons - we've put atop on one and sysdig on
>>> the other. Running atop at 10 second slices, hoping it will catch something.
>>> While configuring it yesterday, that server went into its 'episode', but
>>> there was nothing in the atop log to show anything. Nothing else changed
>>> except the cpu load average. No increase in any other parameter.
>>>
>>> frustrating.
>>
>> Hi!
>>
>> You could try the monit-approach (I could provide an RPM with a
>> "recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it).
>>
>> The part that monitors unusual load looks like this here:
>>   check system host.domain.org
>>     if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh"
>>     if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh"
>>     if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh"
>>     if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>>     if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>>     if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
>>     if cpu usage > 99% for 15 cycles then alert
>>     if cpu usage (user) > 90% for 30 cycles then alert
>>     if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>>     if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"
>>     group local
>> ### all numbers are a matter of taste ;-)
>>
>> And my script (in lack of better ideas) looks like this:
>> #!/bin/sh
>> {
>>     echo "========== $(/bin/date) =========="
>>     /usr/bin/mpstat
>>     echo "---"
>>     /usr/bin/vmstat
>>     echo "---"
>>     /usr/bin/top -b -n 1 -Hi
>> } >> /var/log/monit/top.log
>>
>> Regards,
>> Ulrich
>
_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
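
A possible refinement of the log-top.sh idea quoted above, offered as a sketch and not a tested fix: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones (state R), which is one way the load can climb while top shows the CPU mostly idle. The script below records only those tasks when the 1-minute load crosses a threshold. The names `load_above`, `snapshot`, `THRESHOLD` and `LOG`, and the default log path, are hypothetical and do not come from the thread.

```shell
#!/bin/sh
# Sketch: log R/D-state tasks when the 1-minute load average is high.
# THRESHOLD and LOG are illustrative defaults, not values from the thread.
THRESHOLD="${THRESHOLD:-2}"
LOG="${LOG:-/tmp/load-snapshot.log}"

# load_above LOAD LIMIT: succeed when LOAD is numerically greater than LIMIT.
load_above() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(l + 0 > t + 0) }'
}

# snapshot: print a timestamp, the raw load figures, and every task whose
# state starts with R (runnable) or D (uninterruptible sleep).
snapshot() {
    echo "========== $(date) =========="
    cat /proc/loadavg
    ps -eo stat,pid,comm | awk 'NR == 1 || $1 ~ /^[RD]/'
}

# Only act where /proc/loadavg exists (Linux).
if [ -r /proc/loadavg ]; then
    load1=$(cut -d ' ' -f 1 /proc/loadavg)
    if load_above "$load1" "$THRESHOLD"; then
        snapshot >> "$LOG"
    fi
fi
```

Run from cron every minute, or from a monit `exec` action, this would leave a short list of candidate tasks for each episode instead of a full top dump. Whether any D-state task shows up during the episodes described here is an open question.
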
