Re: [ClusterLabs] Antw: Re: Never join a list without a problem...

Jeffrey Westgate Wed, 08 Mar 2017 13:19:21 -0800

Just for grins and giggles (I need some of both right now) I just updated to 
SL6.8.


We'll see what's what now.  That's EVERYTHING changed.


________________________________________
From: Jeffrey Westgate
Sent: Wednesday, March 08, 2017 12:13 PM
To: [email protected]
Subject: Re: Antw: Re: Never join a list without a problem...

Yes, at least I think these are all the packages.  (For the most part I just 
ran yum update -y; pacemaker had to be done separately: stop it, update it, 
start it.)

Now, is it possible I'm missing a needed package after the update?  
Dependencies should have handled that, though.

[root@resolver-lb3 log]# yum list resource-agents\* ccs\* pcs\* cman\* keepalive\* corosync\* pacemaker\*
Loaded plugins: fastestmirror, refresh-packagekit
Loading mirror speeds from cached hostfile
 * epel: fedora-epel.mirror.lstn.net
 * sl: ftp.scientificlinux.org
 * sl-security: ftp.scientificlinux.org
Installed Packages
ccs.x86_64                      0.16.2-75.el6_6.1   installed
cman.x86_64                     3.0.12.1-59.el6     @sl
corosync.x86_64                 1.4.1-17.el6        @sl
corosynclib.x86_64              1.4.1-17.el6        @sl
keepalived.x86_64               1.2.7-3.el6         @sl
pacemaker.x86_64                1.1.14-8.el6_8.2    @sl-security
pacemaker-cli.x86_64            1.1.14-8.el6_8.2    @sl-security
pacemaker-cluster-libs.x86_64   1.1.14-8.el6_8.2    @sl-security
pacemaker-libs.x86_64           1.1.14-8.el6_8.2    @sl-security
pcs.x86_64                      0.9.139-9.el6_7.1   installed
resource-agents.x86_64          3.9.2-40.el6        @sl
Available Packages
corosynclib.i686                1.4.1-17.el6        sl
corosynclib-devel.i686          1.4.1-17.el6        sl
corosynclib-devel.x86_64        1.4.1-17.el6        sl
pacemaker-cluster-libs.i686     1.1.14-8.el6_8.2    sl-security
pacemaker-cts.x86_64            1.1.14-8.el6_8.2    sl-security
pacemaker-doc.x86_64            1.1.14-8.el6_8.2    sl-security
pacemaker-libs.i686             1.1.14-8.el6_8.2    sl-security
pacemaker-libs-devel.i686       1.1.14-8.el6_8.2    sl-security
pacemaker-libs-devel.x86_64     1.1.14-8.el6_8.2    sl-security
pacemaker-remote.x86_64         1.1.14-8.el6_8.2    sl-security
pcs.noarch                      0.9.90-2.el6        sl
resource-agents-sap.x86_64      3.9.2-40.el6        sl
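
The listing above mixes several release tags (el6, el6_6, el6_7, el6_8), which 
is what the question about a partial update is getting at.  A minimal sketch 
for pulling the tag out of each package version follows.  The helper name 
dist_tags and the "name version-release" input format are assumptions made 
for illustration; they are not part of any tool mentioned in this thread.

```sh
#!/bin/sh
# Read "name version-release" lines on stdin and print "name dist-tag".
# The dist tag is the first "elN" or "elN_M" substring found in the release.
dist_tags() {
    awk '{
        tag = "unknown"
        if (match($2, /el[0-9]+(_[0-9]+)?/))
            tag = substr($2, RSTART, RLENGTH)
        print $1, tag
    }'
}

# Possible use on an RPM-based host (not executed here):
#   rpm -q --qf '%{NAME} %{VERSION}-%{RELEASE}\n' pacemaker corosync cman pcs ccs | dist_tags
```

Packages that report an older tag than the rest are not necessarily wrong, 
since not every package is rebuilt for each point release, but they are the 
first ones worth a second look.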
________________________________________

------------------------------

Message: 2
Date: Wed, 8 Mar 2017 10:40:49 -0600
From: Ken Gaillot <[email protected]>
To: [email protected]
Subject: Re: [ClusterLabs] Antw: Re: Never join a list without a
        problem...
Message-ID: <[email protected]>
Content-Type: text/plain; charset=windows-1252

On 03/08/2017 09:58 AM, Jeffrey Westgate wrote:
> Ok.
>
> Been running monit for a few days, and atop (running a script to capture an 
> atop output every 10 seconds for an hour, rotate the log, and do it again; 
> runs from midnight to midnight, changes the date, and does it again).  I 
> correlate between the atop logs, nagios alerts, and monit, to try to find a 
> trigger.  Like trying to find a particular snowflake in Alaska in January.
>
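
The capture loop described above could look roughly like the sketch below.  
This is not the script actually in use: the log directory, the file naming 
and the helper names (samples_per_hour, raw_file_name, capture_hour) are 
assumptions.  The only atop behaviour relied on is writing raw samples with 
-w followed by an interval and a sample count.

```sh
#!/bin/sh
LOGDIR=/var/log/atop-capture     # hypothetical location
INTERVAL=10                      # seconds per sample, as in the thread

# Number of samples that fit in one hour at the given interval.
samples_per_hour() {
    echo $((3600 / $1))
}

# Raw file name for a given date (YYYYMMDD) and hour (HH).
raw_file_name() {
    echo "$LOGDIR/atop_${1}_${2}.raw"
}

# Record one hour of samples into its own raw file.
capture_hour() {
    atop -w "$(raw_file_name "$(date +%Y%m%d)" "$(date +%H)")" \
        "$INTERVAL" "$(samples_per_hour "$INTERVAL")"
}

# To run continuously, one file per hour (not executed here):
#   while :; do capture_hour; done
```

A raw file written this way can be replayed later with atop -r and the file 
name, which makes it easier to line up a given ten-second slice with a nagios 
or monit alert.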
> Have had a handful of episodes with all the monitors running.  We have 
> determined nothing. Nothing significantly changes from normal/regular to high 
> host load.
>
> It's a VMWare/ESXi-hosted VM, so we moved it to a different host and 
> different datastore (so, effectively new CPU, memory, NIC, disk, video... 
> basically all "new" hardware).  Still have episodes.
>
> Was running the "VMWare provided" vmtools.  removed and replaced with 
> open-vm-tools this morning.  just had another episode.
>
> was running atop interactively when the episode started - the only thing that 
> seems to change is the hostload goes up.  momentary spike in "avio" for the 
> disk -- all the way up to 25 msecs. lasted for one ten-second slice from atop.
>
> no zombies, no wait, no spike in network, transport, mem use, disk 
> reads/writes... nothing I can see (and by I, I mean "we" as we have three 
> people looking)
>
> I've got other boxes running the same OS - updated them at the same time, so 
> patch level is all same.  No similar issues.  The only thing I have different 
> is these two are running pacemaker, corosync, keepalived.  maybe when they 
> were updated, they need a library I don't have?
>
> running     /usr/sbin/iotop -obtqqq > /var/log/iotop.log -- no red flags 
> there.  so - not OS, not IO, not hardware (virtual as it is...) ... only 
> leaves software.
>
> Maybe pacemaker is just incompatible with:
>
> Scientific Linux release 6.5 (Carbon)
> kernel  2.6.32-642.15.1.el6.x86_64
>
> ??

That does sound bizarre. I haven't tried 6.5 in a while, but it's
certainly compatible with the current 6.8.

IIRC, you updated to the 6.8 pacemaker packages ... Did you also update
the OS and/or other cluster-related packages to 6.8?

> At this point it's more of a curiosity than an out and out problem, as 
> performance does not seem to be impacted noticeably.  Packet-in, packet-out 
> seems unperturbed. Same cannot be said for us administrators...
>
>
>
>
> ________________________________________
> From: [email protected] [[email protected]]
> Sent: Friday, March 03, 2017 7:27 AM
> To: [email protected]
> Subject: Users Digest, Vol 26, Issue 10
>
>
> Today's Topics:
>
>    1. Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error
>       retrying (Ulrich Windl)
>    2. Re: Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join
>       error retrying (emmanuel segura)
>    3. Antw: Re:  Never join a list without a problem...
>       (Jeffrey Westgate)
>
>
> ----------------------------------------------------------------------
>
> ------------------------------
>
> Message: 3
> Date: Fri, 3 Mar 2017 13:27:25 +0000
> From: Jeffrey Westgate <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: [ClusterLabs] Antw: Re:  Never join a list without a
>         problem...
> Message-ID: <a36b14fa9aa67f4e836c0ee59dea89c4015b214...@cm-sas-mbx-07.sas.arkgov.net>
>
> Content-Type: text/plain; charset="us-ascii"
>
> Appreciate the offer - not familiar with monit.
>
> Going to try running atop through logrotate for the day, keep 12, rotate 
> hourly (to control space utilization) and see if I can catch anything that 
> way.  My biggest issue is we've not caught it as it starts, so we don't ever 
> see anything amiss.
>
> If this doesn't work, then I will likely take you up on how to script monit 
> to catch something.
>
> Thanks --
>
> Jeff
> ________________________________________

>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 2 Mar 2017 16:32:02 +0000
> From: Jeffrey Westgate <[email protected]>
> To: Adam Spiers <[email protected]>, "Cluster Labs - All topics related
>         to      open-source clustering welcomed" <[email protected]>
> Subject: Re: [ClusterLabs] Never join a list without a problem...
> Message-ID: <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>
>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Since we have both pieces of the load-balanced cluster doing the same thing - 
> for still-as-yet unidentified reasons - we've put atop on one and sysdig on 
> the other.  Running atop at 10 second slices, hoping it will catch something. 
> While configuring it yesterday, that server went into its 'episode', but 
> there was nothing in the atop log to show anything.  Nothing else changed 
> except the cpu load average.  No increase in any other parameter.
>
> frustrating.
>
>
> ________________________________________
> From: Adam Spiers [[email protected]]
> Sent: Wednesday, March 01, 2017 5:33 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Cc: Jeffrey Westgate
> Subject: Re: [ClusterLabs] Never join a list without a problem...
>
> Ferenc Wágner <[email protected]> wrote:
>> Jeffrey Westgate <[email protected]> writes:
>>
>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>> longer, and we cannot set a clock by it - while the machine is 95%
>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>>> to come back down to baseline, which is mostly 0.00.  (attached
>>> hostload.pdf) This happens to both machines, randomly, and is
>>> concerning, as we'd like to find what's causing it and resolve it.
>>
>> Try running atop (http://www.atoptool.nl/).  It collects and logs
>> process accounting info, allowing you to step back in time and check
>> resource usage in the past.
>
> Nice, I didn't know atop could also log the collected data for future
> analysis.
>
> If you want to capture even more detail, sysdig is superb:
>
>     http://www.sysdig.org/
>
> ------------------------------
>
> Message: 5
> Date: Fri, 03 Mar 2017 08:04:22 +0100
> From: "Ulrich Windl" <[email protected]>
> To: <[email protected]>
> Subject: [ClusterLabs] Antw: Re:  Never join a list without a
>         problem...
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=UTF-8
>
>>>> Jeffrey Westgate <[email protected]> wrote on 02.03.2017 at 17:32
> in message
> <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>:
>> Since we have both pieces of the load-balanced cluster doing the same thing -
>> for still-as-yet unidentified reasons - we've put atop on one and sysdig on the
>> other.  Running atop at 10 second slices, hoping it will catch something.
>> While configuring it yesterday, that server went into its 'episode', but
>> there was nothing in the atop log to show anything.  Nothing else changed
>> except the cpu load average.  No increase in any other parameter.
>>
>> frustrating.
>
> Hi!
>
> You could try the monit-approach (I could provide an RPM with a
> "recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it).
>
> The part that monitors unusual load looks like this here:
>   check system host.domain.org
>     if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh"
>     if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh"
>     if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh"
>     if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>     if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>     if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
>     if cpu usage > 99% for 15 cycles then alert
>     if cpu usage (user) > 90% for 30 cycles then alert
>     if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>     if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"
>     group local
> ### all numbers are a matter of taste ;-)
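
For hosts without monit, the three load-average rules above can be 
approximated in plain shell.  A minimal sketch, assuming a /proc/loadavg 
style input line; the helper name load_state is hypothetical and the 
thresholds 8, 4 and 2 are simply the ones from the monit stanza.

```sh
#!/bin/sh
# Print "high" if any of the 1, 5 or 15 minute load averages in the given
# /proc/loadavg-style line exceeds its threshold, otherwise print "ok".
# Arguments: line, 1min threshold, 5min threshold, 15min threshold.
load_state() {
    echo "$1" | awk -v t1="$2" -v t5="$3" -v t15="$4" \
        '{ print (($1 > t1 || $2 > t5 || $3 > t15) ? "high" : "ok") }'
}

# Possible wiring from cron or a loop (not executed here):
#   if [ "$(load_state "$(cat /proc/loadavg)" 8 4 2)" = "high" ]; then
#       /var/lib/monit/log-top.sh
#   fi
```

Unlike monit, this has no notion of "for N cycles", so it only covers the 
rules that fire on a single reading.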
> And my script (in lack of better ideas) looks like this:
> #!/bin/sh
> {
>     echo "========== $(/bin/date) =========="
>     /usr/bin/mpstat
>     echo "---"
>     /usr/bin/vmstat
>     echo "---"
>     /usr/bin/top -b -n 1 -Hi
> } >> /var/log/monit/top.log
>
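
Since the episodes show a rising load average without matching CPU use, it 
may help to log which tasks are actually counted.  On Linux the load average 
includes tasks in uninterruptible sleep (state D) as well as runnable ones 
(state R).  A minimal sketch that could be appended to the script above; the 
helper name busy_tasks is hypothetical.

```sh
#!/bin/sh
# Keep only lines whose first field (process state) starts with R or D.
# Expected input: "STATE PID COMMAND" lines, e.g. from ps -eo state,pid,comm.
busy_tasks() {
    awk '$1 ~ /^[RD]/'
}

# Possible use inside the logging script (not executed here):
#   ps -eo state,pid,comm | busy_tasks >> /var/log/monit/top.log
```

The ps header line starts with "S" and is therefore dropped by the filter, so 
the log contains only the tasks contributing to the load figure at that 
moment.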
> Regards,
> Ulrich
>

_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org