Re: [ClusterLabs] Antw: Re: Never join a list without a problem...

Klaus Wenninger Thu, 09 Mar 2017 09:50:02 -0800

On 03/08/2017 07:13 PM, Jeffrey Westgate wrote:
> yes - at least I think this is all the packages.  (What I did was run a yum 
> update -y, for the most part - had to do pacemaker separately -- had to stop 
> it, update it, start it.)
>
> now, is it possible I'm missing a needed package after the update... but 
> dependencies should have handled that....?
>
> [root@resolver-lb3 log]# yum list resource-agents\* ccs\* pcs\* cman\* 
> keepalive\* corosync\* pacemaker\*
> Loaded plugins: fastestmirror, refresh-packagekit
> Loading mirror speeds from cached hostfile
>  * epel: fedora-epel.mirror.lstn.net
>  * sl: ftp.scientificlinux.org
>  * sl-security: ftp.scientificlinux.org
> Installed Packages
> ccs.x86_64                      0.16.2-75.el6_6.1     installed
> cman.x86_64                     3.0.12.1-59.el6       @sl
> corosync.x86_64                 1.4.1-17.el6          @sl
> corosynclib.x86_64              1.4.1-17.el6          @sl


Looks like your corosync is ancient and, in particular, it seems to be out
of sync with pacemaker. Pacemaker looks like the version released with
RHEL-6.8, but corosync there is 1.4.7-5 and you have 1.4.1-17.
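
A quick way to check for this kind of mismatch without reading the yum
listing by eye is sketched below. It is only a sketch: the helper name
version_lt is made up for illustration, it assumes GNU "sort -V" is
available, and "sort -V" ordering approximates RPM's own version comparison
rather than reproducing it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version $1 sorts strictly before $2.
# Relies on GNU "sort -V" (version sort), which approximates but is not
# identical to RPM's own version comparison.
version_lt() {
    [ "$1" = "$2" ] && return 1
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$1" ]
}

# Versions taken from this thread: the installed corosync versus the
# one mentioned for RHEL-6.8.
if version_lt "1.4.1-17" "1.4.7-5"; then
    echo "corosync is older than the 6.8 version"
fi
```

On a live system the first argument could come from something like
rpm -q --qf '%{VERSION}-%{RELEASE}\n' corosync instead of a literal string.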

> keepalived.x86_64               1.2.7-3.el6           @sl
> pacemaker.x86_64                1.1.14-8.el6_8.2      @sl-security
> pacemaker-cli.x86_64            1.1.14-8.el6_8.2      @sl-security
> pacemaker-cluster-libs.x86_64   1.1.14-8.el6_8.2      @sl-security
> pacemaker-libs.x86_64           1.1.14-8.el6_8.2      @sl-security
> pcs.x86_64                      0.9.139-9.el6_7.1     installed
> resource-agents.x86_64          3.9.2-40.el6          @sl
> Available Packages
> corosynclib.i686                1.4.1-17.el6          sl
> corosynclib-devel.i686          1.4.1-17.el6          sl
> corosynclib-devel.x86_64        1.4.1-17.el6          sl
> pacemaker-cluster-libs.i686     1.1.14-8.el6_8.2      sl-security
> pacemaker-cts.x86_64            1.1.14-8.el6_8.2      sl-security
> pacemaker-doc.x86_64            1.1.14-8.el6_8.2      sl-security
> pacemaker-libs.i686             1.1.14-8.el6_8.2      sl-security
> pacemaker-libs-devel.i686       1.1.14-8.el6_8.2      sl-security
> pacemaker-libs-devel.x86_64     1.1.14-8.el6_8.2      sl-security
> pacemaker-remote.x86_64         1.1.14-8.el6_8.2      sl-security
> pcs.noarch                      0.9.90-2.el6          sl
> resource-agents-sap.x86_64      3.9.2-40.el6          sl
> ________________________________________
>
> ------------------------------
>
> Message: 2
> Date: Wed, 8 Mar 2017 10:40:49 -0600
> From: Ken Gaillot <[email protected]>
> To: [email protected]
> Subject: Re: [ClusterLabs] Antw: Re: Never join a list without a
>         problem...
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=windows-1252
>
> On 03/08/2017 09:58 AM, Jeffrey Westgate wrote:
>> Ok.
>>
>> Been running monit for a few days, and atop (running a script to capture an 
>> atop output every 10 seconds for an hour, rotate the log, and do it again; 
>> runs from midnight to midnight, changes the date, and does it again).  I 
>> correlate between the atop logs, nagios alerts, and monit, to try to find a 
>> trigger.  Like trying to find a particular snowflake in Alaska in January.
>>
>> Have had a handful of episodes with all the monitors running.  We have 
>> determined nothing. Nothing significantly changes from normal/regular to 
>> high host load.
>>
>> It's a VMWare/ESXi-hosted VM, so we moved it to a different host and 
>> different datastore (so, effectively new CPU, memory, nic, disk, video... 
>> basically all "new" hardware).  Still have episodes.
>>
>> Was running the "VMWare provided" vmtools.  removed and replaced with 
>> open-vm-tools this morning.  just had another episode.
>>
>> was running atop interactively when the episode started - the only thing 
>> that seems to change is the hostload goes up.  momentary spike in "avio" for 
>> the disk -- all the way up to 25 msecs. lasted for one ten-second slice from 
>> atop.
>>
>> no zombies, no wait, no spike in network, transport, mem use, disk 
>> reads/writes... nothing I can see (and by I, I mean "we" as we have three 
>> people looking)
>>
>> I've got other boxes running the same OS - updated them at the same time, so 
>> patch level is all same.  No similar issues.  The only thing I have 
>> different is these two are running pacemaker, corosync, keepalived.  maybe 
>> when they were updated, they need a library I don't have?
>>
>> running     /usr/sbin/iotop -obtqqq > /var/log/iotop.log -- no red flags 
>> there.  so - not OS, not IO, not hardware (virtual as it is...) ... only 
>> leaves software.
>>
>> Maybe pacemaker is just incompatible with:
>>
>> Scientific Linux release 6.5 (Carbon)
>> kernel  2.6.32-642.15.1.el6.x86_64
>>
>> ??
> That does sound bizarre. I haven't tried 6.5 in a while, but it's
> certainly compatible with the current 6.8.
>
> IIRC, you updated to the 6.8 pacemaker packages ... Did you also update
> the OS and/or other cluster-related packages to 6.8?
>
>> At this point it's more of a curiosity than an out and out problem, as 
>> performance does not seem to be impacted noticeably.  Packet-in, packet-out 
>> seems unperturbed. Same cannot be said for us administrators...
>>
>>
>>
>>
>> ________________________________________
>> From: [email protected] [[email protected]]
>> Sent: Friday, March 03, 2017 7:27 AM
>> To: [email protected]
>> Subject: Users Digest, Vol 26, Issue 10
>>
>>
>> Today's Topics:
>>
>>    1. Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error
>>       retrying (Ulrich Windl)
>>    2. Re: Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join
>>       error retrying (emmanuel segura)
>>    3. Antw: Re:  Never join a list without a problem...
>>       (Jeffrey Westgate)
>>
>>
>> ----------------------------------------------------------------------
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Fri, 3 Mar 2017 13:27:25 +0000
>> From: Jeffrey Westgate <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Subject: [ClusterLabs] Antw: Re:  Never join a list without a
>>         problem...
>> Message-ID:
>>         
>> <a36b14fa9aa67f4e836c0ee59dea89c4015b214...@cm-sas-mbx-07.sas.arkgov.net>
>>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> Appreciate the offer - not familiar with monit.
>>
>> Going to try running atop through logrotate for the day, keep 12, rotate 
>> hourly (to control space utilization) and see if I can catch anything that 
>> way.  My biggest issue is we've not caught it as it starts, so we don't ever 
>> see anything amiss.
>>
>> If this doesn't work, then I will likely take you up on how to script monit 
>> to catch something.
>>
>> Thanks --
>>
>> Jeff
>> ________________________________________
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Thu, 2 Mar 2017 16:32:02 +0000
>> From: Jeffrey Westgate <[email protected]>
>> To: Adam Spiers <[email protected]>, "Cluster Labs - All topics related
>>         to      open-source clustering welcomed" <[email protected]>
>> Subject: Re: [ClusterLabs] Never join a list without a problem...
>> Message-ID:
>>         
>> <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>
>>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Since we have both pieces of the load-balanced cluster doing the same thing 
>> - for still-as-yet unidentified reasons - we've put atop on one and sysdig 
>> on the other.  Running atop at 10 second slices, hoping it will catch 
>> something.  While configuring it yesterday, that server went into its 
>> 'episode', but there was nothing in the atop log to show anything.  Nothing 
>> else changed except the cpu load average.  No increase in any other 
>> parameter.
>>
>> frustrating.
>>
>>
>> ________________________________________
>> From: Adam Spiers [[email protected]]
>> Sent: Wednesday, March 01, 2017 5:33 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> Cc: Jeffrey Westgate
>> Subject: Re: [ClusterLabs] Never join a list without a problem...
>>
>> Ferenc Wágner <[email protected]> wrote:
>>> Jeffrey Westgate <[email protected]> writes:
>>>
>>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>>> longer, and we cannot set a clock by it - while the machine is 95%
>>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>>>> to come back down to baseline, which is mostly 0.00.  (attached
>>>> hostload.pdf) This happens to both machines, randomly, and is
>>>> concerning, as we'd like to find what's causing it and resolve it.
>>> Try running atop (http://www.atoptool.nl/).  It collects and logs
>>> process accounting info, allowing you to step back in time and check
>>> resource usage in the past.
>> Nice, I didn't know atop could also log the collected data for future
>> analysis.
>>
>> If you want to capture even more detail, sysdig is superb:
>>
>>     http://www.sysdig.org/
>>
>> ------------------------------
>>
>> Message: 5
>> Date: Fri, 03 Mar 2017 08:04:22 +0100
>> From: "Ulrich Windl" <[email protected]>
>> To: <[email protected]>
>> Subject: [ClusterLabs] Antw: Re:  Never join a list without a
>>         problem...
>> Message-ID: <[email protected]>
>> Content-Type: text/plain; charset=UTF-8
>>
>>>>> Jeffrey Westgate <[email protected]> wrote on 02.03.2017 at 17:32 in message
>> <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>:
>>> Since we have both pieces of the load-balanced cluster doing the same thing -
>>> for still-as-yet unidentified reasons - we've put atop on one and sysdig on the
>>> other.  Running atop at 10 second slices, hoping it will catch something.
>>> While configuring it yesterday, that server went into its 'episode', but
>>> there was nothing in the atop log to show anything.  Nothing else changed
>>> except the cpu load average.  No increase in any other parameter.
>>>
>>> frustrating.
>> Hi!
>>
>> You could try the monit-approach (I could provide an RPM with a
>> "recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it).
>>
>> The part that monitors unusual load looks like this here:
>>   check system host.domain.org
>>     if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh"
>>     if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh"
>>     if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh"
>>     if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>>     if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>>     if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
>>     if cpu usage > 99% for 15 cycles then alert
>>     if cpu usage (user) > 90% for 30 cycles then alert
>>     if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>>     if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"
>>     group local
>> ### all numbers are a matter of taste ;-)
>> And my script (for lack of better ideas) looks like this:
>> #!/bin/sh
>> {
>>     echo "========== $(/bin/date) =========="
>>     /usr/bin/mpstat
>>     echo "---"
>>     /usr/bin/vmstat
>>     echo "---"
>>     /usr/bin/top -b -n 1 -Hi
>> } >> /var/log/monit/top.log
>>
>> Regards,
>> Ulrich
>>
>>>
>>> ________________________________________
>>> From: Adam Spiers [[email protected]]
>>> Sent: Wednesday, March 01, 2017 5:33 AM
>>> To: Cluster Labs - All topics related to open-source clustering welcomed
>>> Cc: Jeffrey Westgate
>>> Subject: Re: [ClusterLabs] Never join a list without a problem...
>>>
>>> Ferenc Wágner <[email protected]> wrote:
>>>> Jeffrey Westgate <[email protected]> writes:
>>>>
>>>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>>>> longer, and we cannot set a clock by it - while the machine is 95%
>>>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>>>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>>>>> to come back down to baseline, which is mostly 0.00.  (attached
>>>>> hostload.pdf) This happens to both machines, randomly, and is
>>>>> concerning, as we'd like to find what's causing it and resolve it.
>>>> Try running atop (http://www.atoptool.nl/).  It collects and logs
>>>> process accounting info, allowing you to step back in time and check
>>>> resource usage in the past.
>>> Nice, I didn't know atop could also log the collected data for future
>>> analysis.
>>>
>>> If you want to capture even more detail, sysdig is superb:
>>>
>>>     http://www.sysdig.org/
>>>
>>> _______________________________________________
>>> Users mailing list: [email protected]
>>> http://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
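
For anyone without monit at hand, the 1-minute load threshold from the monit
rules quoted above can be approximated in plain shell. This is a minimal
sketch under assumptions: the helper name load_exceeds is made up for
illustration, the system is assumed to be Linux with /proc/loadavg, and the
threshold of 8 is simply copied from the quoted rule.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the 1-minute load average in a
# /proc/loadavg-style line ($1) is greater than the threshold ($2).
load_exceeds() {
    printf '%s\n' "$1" | awk -v limit="$2" '{ exit !($1 > limit) }'
}

# Assumption: Linux, where /proc/loadavg holds the load averages.
# The threshold 8 mirrors the "loadavg (1min) > 8" rule quoted above.
if [ -r /proc/loadavg ] && load_exceeds "$(cat /proc/loadavg)" 8; then
    echo "1min load above 8 at $(date)"
fi
```

Run from cron or a loop, the echo could be swapped for a call to a logging
script such as the log-top.sh quoted above.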



