Yes, at least I think these are all the packages. (For the most part I just
ran yum update -y; pacemaker I had to do separately: stop it, update it,
start it.)
Is it possible I'm missing a needed package after the update? Dependencies
should have handled that, though...?
[root@resolver-lb3 log]# yum list resource-agents\* ccs\* pcs\* cman\* keepalive\* corosync\* pacemaker\*
Loaded plugins: fastestmirror, refresh-packagekit
Loading mirror speeds from cached hostfile
* epel: fedora-epel.mirror.lstn.net
* sl: ftp.scientificlinux.org
* sl-security: ftp.scientificlinux.org
Installed Packages
ccs.x86_64                      0.16.2-75.el6_6.1   installed
cman.x86_64                     3.0.12.1-59.el6     @sl
corosync.x86_64                 1.4.1-17.el6        @sl
corosynclib.x86_64              1.4.1-17.el6        @sl
keepalived.x86_64               1.2.7-3.el6         @sl
pacemaker.x86_64                1.1.14-8.el6_8.2    @sl-security
pacemaker-cli.x86_64            1.1.14-8.el6_8.2    @sl-security
pacemaker-cluster-libs.x86_64   1.1.14-8.el6_8.2    @sl-security
pacemaker-libs.x86_64           1.1.14-8.el6_8.2    @sl-security
pcs.x86_64                      0.9.139-9.el6_7.1   installed
resource-agents.x86_64          3.9.2-40.el6        @sl
Available Packages
corosynclib.i686                1.4.1-17.el6        sl
corosynclib-devel.i686          1.4.1-17.el6        sl
corosynclib-devel.x86_64        1.4.1-17.el6        sl
pacemaker-cluster-libs.i686     1.1.14-8.el6_8.2    sl-security
pacemaker-cts.x86_64            1.1.14-8.el6_8.2    sl-security
pacemaker-doc.x86_64            1.1.14-8.el6_8.2    sl-security
pacemaker-libs.i686             1.1.14-8.el6_8.2    sl-security
pacemaker-libs-devel.i686       1.1.14-8.el6_8.2    sl-security
pacemaker-libs-devel.x86_64     1.1.14-8.el6_8.2    sl-security
pacemaker-remote.x86_64         1.1.14-8.el6_8.2    sl-security
pcs.noarch                      0.9.90-2.el6        sl
resource-agents-sap.x86_64      3.9.2-40.el6        sl
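
On the question of a missing library after the update: a minimal sketch,
assuming ldd is available and that the daemons live at the paths shown (those
paths and the function names are assumptions, not confirmed anywhere in this
thread). It lists any shared library the dynamic linker cannot resolve for
each binary.

```shell
#!/bin/sh
# Sketch: report shared libraries that the dynamic linker cannot resolve.
# The binary paths at the bottom are assumptions; adjust to the local install.

# Read ldd output on stdin and print only the unresolved library names.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Run ldd over each binary given as an argument and label any misses.
# Binaries that do not exist or are not executable are skipped silently.
check_binaries() {
    for bin in "$@"; do
        [ -x "$bin" ] || continue
        ldd "$bin" 2>/dev/null | missing_libs | sed "s|^|$bin: |"
    done
}

check_binaries /usr/sbin/pacemakerd /usr/sbin/corosync /usr/sbin/keepalived
```

No output means every library resolved for the binaries that were found.
Running rpm -V against the same packages is a complementary check, since it
also reports unsatisfied dependencies.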
________________________________________
------------------------------
Message: 2
Date: Wed, 8 Mar 2017 10:40:49 -0600
From: Ken Gaillot <[email protected]>
To: [email protected]
Subject: Re: [ClusterLabs] Antw: Re: Never join a list without a
problem...
Message-ID: <[email protected]>
Content-Type: text/plain; charset=windows-1252
On 03/08/2017 09:58 AM, Jeffrey Westgate wrote:
> Ok.
>
> Been running monit for a few days, and atop (running a script to capture an
> atop output every 10 seconds for an hour, rotate the log, and do it again;
> runs from midnight to midnight, changes the date, and does it again). I
> correlate between the atop logs, nagios alerts, and monit, to try to find a
> trigger. Like trying to find a particular snowflake in Alaska in January.
>
> Have had a handful of episodes with all the monitors running. We have
> determined nothing. Nothing significantly changes from normal/regular to high
> host load.
>
> It's a VMWare/ESXi-hosted VM, so we moved it to a different host and
> different datastore (so, effectively new CPU, memory, NIC, disk, video...
> basically all "new" hardware). Still have episodes.
>
> Was running the "VMWare provided" vmtools. removed and replaced with
> open-vm-tools this morning. just had another episode.
>
> was running atop interactively when the episode started - the only thing that
> seems to change is the hostload goes up. momentary spike in "avio" for the
> disk -- all the way up to 25 msecs. lasted for one ten-second slice from atop.
>
> no zombies, no wait, no spike in network, transport, mem use, disk
> reads/writes... nothing I can see (and by I, I mean "we" as we have three
> people looking)
>
> I've got other boxes running the same OS - updated them at the same time, so
> patch level is all same. No similar issues. The only thing I have different
> is these two are running pacemaker, corosync, keepalived. maybe when they
> were updated, they need a library I don't have?
>
> running /usr/sbin/iotop -obtqqq > /var/log/iotop.log -- no red flags
> there. so - not OS, not IO, not hardware (virtual as it is...) ... only
> leaves software.
>
> Maybe pacemaker is just incompatible with:
>
> Scientific Linux release 6.5 (Carbon)
> kernel 2.6.32-642.15.1.el6.x86_64
>
> ??
That does sound bizarre. I haven't tried 6.5 in a while, but it's
certainly compatible with the current 6.8.
IIRC, you updated to the 6.8 pacemaker packages ... Did you also update
the OS and/or other cluster-related packages to 6.8?
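
One rough way to answer that is to count the cluster-related packages per
dist tag. A minimal sketch, assuming rpm is present; the function name
dist_tag_summary and the package globs are illustrative choices, not
something from this thread.

```shell
#!/bin/sh
# Sketch: count installed packages per dist tag (el6, el6_6, el6_8, ...).

# Read "name version-release" lines on stdin; print "count tag" per tag.
dist_tag_summary() {
    sed -n 's/.*\.\(el[0-9][0-9_]*\).*/\1/p' | sort | uniq -c | awk '{ print $1, $2 }'
}

# Only query the RPM database when rpm is available on this machine.
if command -v rpm >/dev/null 2>&1; then
    rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE}\n' \
        'pacemaker*' 'corosync*' 'cman*' 'ccs*' 'pcs*' \
        'resource-agents*' 'keepalived*' \
        | dist_tag_summary
fi
```

A mix of tags such as el6_8 next to el6_6 suggests packages from different
update streams. A plain el6 tag does not say which minor release a package
came from, so this only gives a rough picture.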
> At this point it's more of a curiosity than an out and out problem, as
> performance does not seem to be impacted noticeably. Packet-in, packet-out
> seems unperturbed. Same cannot be said for us administrators...
>
>
>
>
> ________________________________________
> From: [email protected] [[email protected]]
> Sent: Friday, March 03, 2017 7:27 AM
> To: [email protected]
> Subject: Users Digest, Vol 26, Issue 10
>
> Today's Topics:
>
> 1. Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error
> retrying (Ulrich Windl)
> 2. Re: Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join
> error retrying (emmanuel segura)
> 3. Antw: Re: Never join a list without a problem...
> (Jeffrey Westgate)
>
>
> ----------------------------------------------------------------------
>
> ------------------------------
>
> Message: 3
> Date: Fri, 3 Mar 2017 13:27:25 +0000
> From: Jeffrey Westgate <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: [ClusterLabs] Antw: Re: Never join a list without a
> problem...
> Message-ID:
>
> <a36b14fa9aa67f4e836c0ee59dea89c4015b214...@cm-sas-mbx-07.sas.arkgov.net>
>
> Content-Type: text/plain; charset="us-ascii"
>
> Appreciate the offer - not familiar with monit.
>
> Going to try running atop through logrotate for the day, keep 12, rotate
> hourly (to control space utilization) and see if I can catch anything that
> way. My biggest issue is we've not caught it as it starts, so we don't ever
> see anything amiss.
>
> If this doesn't work, then I will likely take you up on how to script monit
> to catch something.
>
> Thanks --
>
> Jeff
> ________________________________________
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 2 Mar 2017 16:32:02 +0000
> From: Jeffrey Westgate <[email protected]>
> To: Adam Spiers <[email protected]>, "Cluster Labs - All topics related
> to open-source clustering welcomed" <[email protected]>
> Subject: Re: [ClusterLabs] Never join a list without a problem...
> Message-ID:
>
> <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>
>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Since we have both pieces of the load-balanced cluster doing the same thing -
> for still-as-yet unidentified reasons - we've put atop on one and sysdig on
> the other. Running atop at 10 second slices, hoping it will catch something.
> While configuring it yesterday, that server went into its 'episode', but
> there was nothing in the atop log to show anything. Nothing else changed
> except the cpu load average. No increase in any other parameter.
>
> frustrating.
>
>
> ________________________________________
> From: Adam Spiers [[email protected]]
> Sent: Wednesday, March 01, 2017 5:33 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Cc: Jeffrey Westgate
> Subject: Re: [ClusterLabs] Never join a list without a problem...
>
> Ferenc Wágner <[email protected]> wrote:
>> Jeffrey Westgate <[email protected]> writes:
>>
>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>> longer, and we cannot set a clock by it - while the machine is 95%
>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>> 60%. It takes about 20 minutes to peak, and another 30 to 45 minutes
>>> to come back down to baseline, which is mostly 0.00. (attached
>>> hostload.pdf) This happens to both machines, randomly, and is
>>> concerning, as we'd like to find what's causing it and resolve it.
>>
>> Try running atop (http://www.atoptool.nl/). It collects and logs
>> process accounting info, allowing you to step back in time and check
>> resource usage in the past.
>
> Nice, I didn't know atop could also log the collected data for future
> analysis.
>
> If you want to capture even more detail, sysdig is superb:
>
> http://www.sysdig.org/
>
> ------------------------------
>
> Message: 5
> Date: Fri, 03 Mar 2017 08:04:22 +0100
> From: "Ulrich Windl" <[email protected]>
> To: <[email protected]>
> Subject: [ClusterLabs] Antw: Re: Never join a list without a
> problem...
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=UTF-8
>
>>>> Jeffrey Westgate <[email protected]> wrote on 02.03.2017 at 17:32
> in message
> <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>:
>> Since we have both pieces of the load-balanced cluster doing the same thing -
>> for still-as-yet unidentified reasons - we've put atop on one and sysdig on
>> the other. Running atop at 10 second slices, hoping it will catch something.
>> While configuring it yesterday, that server went into its 'episode', but
>> there was nothing in the atop log to show anything. Nothing else changed
>> except the cpu load average. No increase in any other parameter.
>>
>> frustrating.
>
> Hi!
>
> You could try the monit-approach (I could provide an RPM with a
> "recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it).
>
> The part that monitors unusual load looks like this here:
> check system host.domain.org
>     if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh"
>     if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh"
>     if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh"
>     if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>     if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>     if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
>     if cpu usage > 99% for 15 cycles then alert
>     if cpu usage (user) > 90% for 30 cycles then alert
>     if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>     if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"
>     group local
> ### all numbers are a matter of taste ;-)
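> And my script (for lack of better ideas) looks like this:
> #!/bin/sh
> # Append a timestamped CPU, memory and process snapshot to the log.
> {
>     echo "========== $(/bin/date) =========="
>     /usr/bin/mpstat
>     echo "---"
>     /usr/bin/vmstat
>     echo "---"
>     # batch mode, one iteration, show threads, skip idle tasks
>     /usr/bin/top -b -n 1 -Hi
> } >> /var/log/monit/top.log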
>
> Regards,
> Ulrich
>
>> _______________________________________________
>> Users mailing list: [email protected]
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
>