Just for grins and giggles (I need some of both right now) I just updated to SL6.8.
We'll see what's what now. That's EVERYTHING changed.

________________________________________
From: Jeffrey Westgate
Sent: Wednesday, March 08, 2017 12:13 PM
To: users@clusterlabs.org
Subject: Re: Antw: Re: Never join a list without a problem...

yes - at least I think this is all the packages. (What I did was run a
yum update -y, for the most part - had to do pacemaker separately -- had to
stop it, update it, start it.)

now, is it possible I'm missing a needed package after the update... but
dependencies should have handled that....?

[root@resolver-lb3 log]# yum list resource-agents\* ccs\* pcs\* cman\* keepalive\* corosync\* pacemaker\*
Loaded plugins: fastestmirror, refresh-packagekit
Loading mirror speeds from cached hostfile
 * epel: fedora-epel.mirror.lstn.net
 * sl: ftp.scientificlinux.org
 * sl-security: ftp.scientificlinux.org
Installed Packages
ccs.x86_64                       0.16.2-75.el6_6.1    installed
cman.x86_64                      3.0.12.1-59.el6      @sl
corosync.x86_64                  1.4.1-17.el6         @sl
corosynclib.x86_64               1.4.1-17.el6         @sl
keepalived.x86_64                1.2.7-3.el6          @sl
pacemaker.x86_64                 1.1.14-8.el6_8.2     @sl-security
pacemaker-cli.x86_64             1.1.14-8.el6_8.2     @sl-security
pacemaker-cluster-libs.x86_64    1.1.14-8.el6_8.2     @sl-security
pacemaker-libs.x86_64            1.1.14-8.el6_8.2     @sl-security
pcs.x86_64                       0.9.139-9.el6_7.1    installed
resource-agents.x86_64           3.9.2-40.el6         @sl
Available Packages
corosynclib.i686                 1.4.1-17.el6         sl
corosynclib-devel.i686           1.4.1-17.el6         sl
corosynclib-devel.x86_64         1.4.1-17.el6         sl
pacemaker-cluster-libs.i686      1.1.14-8.el6_8.2     sl-security
pacemaker-cts.x86_64             1.1.14-8.el6_8.2     sl-security
pacemaker-doc.x86_64             1.1.14-8.el6_8.2     sl-security
pacemaker-libs.i686              1.1.14-8.el6_8.2     sl-security
pacemaker-libs-devel.i686        1.1.14-8.el6_8.2     sl-security
pacemaker-libs-devel.x86_64      1.1.14-8.el6_8.2     sl-security
pacemaker-remote.x86_64          1.1.14-8.el6_8.2     sl-security
pcs.noarch                       0.9.90-2.el6         sl
resource-agents-sap.x86_64       3.9.2-40.el6         sl
________________________________________

------------------------------

Message: 2
Date: Wed, 8 Mar 2017 10:40:49 -0600
From: Ken Gaillot <kgail...@redhat.com>
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Antw: Re: Never join a list without a problem...
Message-ID: <408c0af6-3831-5e7a-f1dd-37dcbfb0f...@redhat.com>
Content-Type: text/plain; charset=windows-1252

On 03/08/2017 09:58 AM, Jeffrey Westgate wrote:
> Ok.
>
> Been running monit for a few days, and atop (running a script to capture an
> atop output every 10 seconds for an hour, rotate the log, and do it again;
> runs from midnight to midnight, changes the date, and does it again). I
> correlate between the atop logs, nagios alerts, and monit, to try to find a
> trigger. Like trying to find a particular snowflake in Alaska in January.
>
> Have had a handful of episodes with all the monitors running. We have
> determined nothing. Nothing significantly changes from normal/regular to high
> host load.
>
> It's a VMWare/ESXi-hosted VM, so we moved it to a different host and
> different datastore (so, effectively new CPU, memory, nic, disk, video...
> basically all "new" hardware. still have episodes.
>
> Was running the "VMWare provided" vmtools. removed and replaced with
> open-vm-tools this morning. just had another episode.
>
> was running atop interactively when the episode started - the only thing that
> seems to change is the hostload goes up. momentary spike in "avio" for the
> disk -- all the way up to 25 msecs. lasted for one ten-second slice from atop.
>
> no zombies, no wait, no spike in network, transport, mem use, disk
> reads/writes... nothing I can see (and by I, I mean "we" as we have three
> people looking)
>
> I've got other boxes running the same OS - updated them at the same time, so
> patch level is all same. No similar issues. The only thing I have different
> is these two are running pacemaker, corosync, keepalived. maybe when they
> were updated, they need a library I don't have?
>
> running /usr/sbin/iotop -obtqqq > /var/log/iotop.log -- no red flags
> there.
> so - not OS, not IO, not hardware (virtual as it is...) ... only
> leaves software.
>
> Maybe pacemaker is just incompatible with:
>
> Scientific Linux release 6.5 (Carbon)
> kernel 2.6.32-642.15.1.el6.x86_64
>
> ??

That does sound bizarre. I haven't tried 6.5 in a while, but it's
certainly compatible with the current 6.8.

IIRC, you updated to the 6.8 pacemaker packages ... Did you also update
the OS and/or other cluster-related packages to 6.8?

> At this point it's more of a curiosity than an out and out problem, as
> performance does not seem to be impacted noticeably. Packet-in, packet-out
> seems unperturbed. Same cannot be said for us administrators...
>
> ________________________________________
> From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org]
> Sent: Friday, March 03, 2017 7:27 AM
> To: users@clusterlabs.org
> Subject: Users Digest, Vol 26, Issue 10
>
> Today's Topics:
>
>    1. Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error
>       retrying (Ulrich Windl)
>    2. Re: Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join
>       error retrying (emmanuel segura)
>    3. Antw: Re: Never join a list without a problem...
>       (Jeffrey Westgate)
>
> ----------------------------------------------------------------------
>
> ------------------------------
>
> Message: 3
> Date: Fri, 3 Mar 2017 13:27:25 +0000
> From: Jeffrey Westgate <jeffrey.westg...@arkansas.gov>
> To: "users@clusterlabs.org" <users@clusterlabs.org>
> Subject: [ClusterLabs] Antw: Re: Never join a list without a
>         problem...
> Message-ID:
>         <a36b14fa9aa67f4e836c0ee59dea89c4015b214...@cm-sas-mbx-07.sas.arkgov.net>
> Content-Type: text/plain; charset="us-ascii"
>
> Appreciate the offer - not familiar with monit.
>
> Going to try running atop through logrotate for the day, keep 12, rotate
> hourly (to control space utilization) and see if I can catch anything that
> way. My biggest issue is we've not caught it as it starts, so we don't ever
> see anything amiss.
>
> If this doesn't work, then I will likely take you up on how to script monit
> to catch something.
>
> Thanks --
>
> Jeff
> ________________________________________
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 2 Mar 2017 16:32:02 +0000
> From: Jeffrey Westgate <jeffrey.westg...@arkansas.gov>
> To: Adam Spiers <aspi...@suse.com>, "Cluster Labs - All topics related
>         to open-source clustering welcomed" <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Never join a list without a problem...
> Message-ID:
>         <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Since we have both pieces of the load-balanced cluster doing the same thing -
> for still-as-yet unidentified reasons - we've put atop on one and sysdig on
> the other. Running atop at 10 second slices, hoping it will catch something.
> While configuring it yesterday, that server went into its 'episode', but
> there was nothing in the atop log to show anything. Nothing else changed
> except the cpu load average. No increase in any other parameter.
>
> frustrating.
>
>
> ________________________________________
> From: Adam Spiers [aspi...@suse.com]
> Sent: Wednesday, March 01, 2017 5:33 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Cc: Jeffrey Westgate
> Subject: Re: [ClusterLabs] Never join a list without a problem...
>
> Ferenc Wágner <wf...@niif.hu> wrote:
>> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> writes:
>>
>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>> longer, and we cannot set a clock by it - while the machine is 95%
>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>> 60%. It takes about 20 minutes to peak, and another 30 to 45 minutes
>>> to come back down to baseline, which is mostly 0.00. (attached
>>> hostload.pdf) This happens to both machines, randomly, and is
>>> concerning, as we'd like to find what's causing it and resolve it.
>>
>> Try running atop (http://www.atoptool.nl/). It collects and logs
>> process accounting info, allowing you to step back in time and check
>> resource usage in the past.
>
> Nice, I didn't know atop could also log the collected data for future
> analysis.
>
> If you want to capture even more detail, sysdig is superb:
>
> http://www.sysdig.org/
>
> ------------------------------
>
> Message: 5
> Date: Fri, 03 Mar 2017 08:04:22 +0100
> From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
> To: <users@clusterlabs.org>
> Subject: [ClusterLabs] Antw: Re: Never join a list without a
>         problem...
> Message-ID: <58b91576020000a100024...@gwsmtp1.uni-regensburg.de>
> Content-Type: text/plain; charset=UTF-8
>
>>>> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> wrote on 02.03.2017 at
> 17:32 in message
> <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>:
>> Since we have both pieces of the load-balanced cluster doing the same thing -
>> for still-as-yet unidentified reasons - we've put atop on one and sysdig on
>> the other. Running atop at 10 second slices, hoping it will catch something.
>> While configuring it yesterday, that server went into its 'episode', but
>> there was nothing in the atop log to show anything. Nothing else changed
>> except the cpu load average. No increase in any other parameter.
>>
>> frustrating.
>
> Hi!
>
> You could try the monit-approach (I could provide an RPM with a
> "recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it).
>
> The part that monitors unusual load looks like this here:
>   check system host.domain.org
>     if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh"
>     if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh"
>     if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh"
>     if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>     if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>     if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
>     if cpu usage > 99% for 15 cycles then alert
>     if cpu usage (user) > 90% for 30 cycles then alert
>     if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>     if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"
>     group local
>   ### all numbers are a matter of taste ;-)
>
> And my script (for lack of better ideas) looks like this:
>   #!/bin/sh
>   {
>       echo "========== $(/bin/date) =========="
>       /usr/bin/mpstat
>       echo "---"
>       /usr/bin/vmstat
>       echo "---"
>       /usr/bin/top -b -n 1 -Hi
>   } >> /var/log/monit/top.log
>
> Regards,
> Ulrich
>
>>
>>
>> ________________________________________
>> From: Adam Spiers [aspi...@suse.com]
>> Sent: Wednesday, March 01, 2017 5:33 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> Cc: Jeffrey Westgate
>> Subject: Re: [ClusterLabs] Never join a list without a problem...
>>
>> Ferenc Wágner <wf...@niif.hu> wrote:
>>> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> writes:
>>>
>>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>>> longer, and we cannot set a clock by it - while the machine is 95%
>>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>>> 60%. It takes about 20 minutes to peak, and another 30 to 45 minutes
>>>> to come back down to baseline, which is mostly 0.00. (attached
>>>> hostload.pdf) This happens to both machines, randomly, and is
>>>> concerning, as we'd like to find what's causing it and resolve it.
>>>
>>> Try running atop (http://www.atoptool.nl/). It collects and logs
>>> process accounting info, allowing you to step back in time and check
>>> resource usage in the past.
>>
>> Nice, I didn't know atop could also log the collected data for future
>> analysis.
>>
>> If you want to capture even more detail, sysdig is superb:
>>
>> http://www.sysdig.org/
>>
>> _______________________________________________
>> Users mailing list: Users@clusterlabs.org
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
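
For hosts where monit is not available, the three loadavg rules from Ulrich's quoted monit configuration (1min > 8, 5min > 4, 15min > 2) could be approximated with a small shell function. This is only a sketch under assumptions: the function name check_load is hypothetical and not from the thread, the thresholds are simply the ones quoted above (which Ulrich calls "a matter of taste"), and the other monit rules (memory, swap, cpu) are not covered.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): reports which of the three
# load-average thresholds quoted in the monit rules a sample exceeds.

# check_load ONE FIVE FIFTEEN
# Arguments are the 1, 5 and 15 minute load averages.
# Prints the names of the exceeded thresholds, space separated, and
# prints an empty line when none is exceeded.
check_load() {
    awk -v one="$1" -v five="$2" -v fifteen="$3" 'BEGIN {
        out = ""
        if (one + 0 > 8)     out = out " 1min"
        if (five + 0 > 4)    out = out " 5min"
        if (fifteen + 0 > 2) out = out " 15min"
        sub(/^ /, "", out)   # drop the leading separator
        print out
    }'
}

# Possible use on Linux (left commented out; the log-top.sh path is the
# one from the quoted monit rules):
#   read one five fifteen rest < /proc/loadavg
#   if [ -n "$(check_load "$one" "$five" "$fifteen")" ]; then
#       /var/lib/monit/log-top.sh
#   fi
```

Run from cron every minute, a check like this would call the same log-top.sh snapshot script whenever one of the load thresholds is crossed, which is roughly what the monit loadavg rules do.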