After I posted that, we updated from SL6.5 to SL6.8.  What was listed below was the
latest available for 6.5.  Under 6.8 it's:

ccs.x86_64                       0.16.2-86.el6        @sl
cman.x86_64                      3.0.12.1-78.el6      @sl
corosync.x86_64                  1.4.7-5.el6          @sl
corosynclib.x86_64               1.4.7-5.el6          @sl
keepalived.x86_64                1.2.13-5.el6_6       @sl
pacemaker.x86_64                 1.1.14-8.el6_8.2     @sl-security/6.5
pacemaker-cli.x86_64             1.1.14-8.el6_8.2     @sl-security/6.5
pacemaker-cluster-libs.x86_64    1.1.14-8.el6_8.2     @sl-security/6.5
pacemaker-libs.x86_64            1.1.14-8.el6_8.2     @sl-security/6.5
pcs.x86_64                       0.9.148-7.el6        @sl
resource-agents.x86_64           3.9.5-34.el6         @sl
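
A quick way to confirm the whole stack now comes from matching el6/el6_8 builds
(just a read-only rpm query -- a minimal sketch, adjust the package list to taste):

    # Print name-version-release-arch for the cluster stack in one pass;
    # corosync and pacemaker should both be at their 6.8 levels now.
    for pkg in corosync corosynclib pacemaker pacemaker-cli \
               pacemaker-cluster-libs pacemaker-libs pcs cman ccs \
               resource-agents keepalived; do
        rpm -q --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' "$pkg"
    done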

And while we have still had a few episodes, they are lower in intensity and shorter
in duration.  Still not satisfied.

I was running 'perf top' yesterday during a high host-load episode -- I had started
it before the episode began, so I watched ALL of it -- and absolutely nothing
changed from 'normal' running.  Nothing shot to the top, nothing consumed any more
CPU.  Load climbed to 20, from 0.01, and the highest-listed entry in the 'overhead'
column was the same as it has been since the update: the kernel symbol
"__do_page_fault", averaging between 4 and 5% overhead.

It never rose above 7% during the whole episode, and the profile was basically
identical to when the host load was 0.01.
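
For the next episode, rather than watching interactively, something like this
minimal sketch could record a system-wide profile for offline comparison against a
quiet baseline (the 60-second window and the /var/tmp paths are arbitrary choices,
not anything we actually have in place):

    #!/bin/sh
    # Record a system-wide, call-graph profile for 60 seconds, then dump a
    # text report that can be diffed against one taken while load is ~0.01.
    ts=$(date +%Y%m%d-%H%M%S)
    perf record -a -g -o /var/tmp/perf-$ts.data -- sleep 60
    perf report -i /var/tmp/perf-$ts.data --stdio > /var/tmp/perf-$ts.txt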


________________________________________
From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org]
Sent: Friday, March 10, 2017 5:00 AM
To: users@clusterlabs.org
Subject: Users Digest, Vol 26, Issue 23

Send Users mailing list submissions to
        users@clusterlabs.org

To subscribe or unsubscribe via the World Wide Web, visit
        http://lists.clusterlabs.org/mailman/listinfo/users
or, via email, send a message with subject or body 'help' to
        users-requ...@clusterlabs.org

You can reach the person managing the list at
        users-ow...@clusterlabs.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Users digest..."


Today's Topics:

   1. Re: Antw: Re: Never join a list without a problem...
      (Klaus Wenninger)


----------------------------------------------------------------------

Message: 1
Date: Thu, 9 Mar 2017 18:45:56 +0100
From: Klaus Wenninger <kwenn...@redhat.com>
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Antw: Re: Never join a list without a
        problem...
Message-ID: <67fce3dd-5deb-b48f-466b-4d2ec9b0a...@redhat.com>
Content-Type: text/plain; charset=windows-1252

On 03/08/2017 07:13 PM, Jeffrey Westgate wrote:
> yes - at least I think this is all the packages.  (What I did was run a yum 
> update -y, for the most part - had to do pacemaker separately -- had to stop 
> it, update it, start it.)
>
> now, is it possible I'm missing a needed package after the update... but 
> dependencies should have handled that....?
>
> [root@resolver-lb3 log]# yum list resource-agents\* ccs\* pcs\* cman\* 
> keepalive\* corosync\* pacemaker\*
> Loaded plugins: fastestmirror, refresh-packagekit
> Loading mirror speeds from cached hostfile
>  * epel: fedora-epel.mirror.lstn.net
>  * sl: ftp.scientificlinux.org
>  * sl-security: ftp.scientificlinux.org
> Installed Packages
> ccs.x86_64                       0.16.2-75.el6_6.1     installed
> cman.x86_64                      3.0.12.1-59.el6       @sl
> corosync.x86_64                  1.4.1-17.el6          @sl
> corosynclib.x86_64               1.4.1-17.el6          @sl

Looks like your corosync is ancient and, in particular, it seems to be out of sync
with pacemaker. Pacemaker looks like the version released with RHEL-6.8, but
corosync there is 1.4.7-5 and you have 1.4.1-17.
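
For reference, a minimal sketch of bringing corosync in line (assuming the stock
SL 6.8 repositories, one node at a time; on a cman-based stack cman owns corosync,
so it is cman that gets restarted):

    # Take the node out of the cluster, update corosync to the 6.8 build,
    # and bring the stack back up.
    service pacemaker stop
    service cman stop
    yum update -y corosync corosynclib
    service cman start
    service pacemaker start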

>
> keepalived.x86_64                1.2.7-3.el6           @sl
> pacemaker.x86_64                 1.1.14-8.el6_8.2      @sl-security
> pacemaker-cli.x86_64             1.1.14-8.el6_8.2      @sl-security
> pacemaker-cluster-libs.x86_64    1.1.14-8.el6_8.2      @sl-security
> pacemaker-libs.x86_64            1.1.14-8.el6_8.2      @sl-security
> pcs.x86_64                       0.9.139-9.el6_7.1     installed
> resource-agents.x86_64           3.9.2-40.el6          @sl
> Available Packages
> corosynclib.i686                 1.4.1-17.el6          sl
> corosynclib-devel.i686           1.4.1-17.el6          sl
> corosynclib-devel.x86_64         1.4.1-17.el6          sl
> pacemaker-cluster-libs.i686      1.1.14-8.el6_8.2      sl-security
> pacemaker-cts.x86_64             1.1.14-8.el6_8.2      sl-security
> pacemaker-doc.x86_64             1.1.14-8.el6_8.2      sl-security
> pacemaker-libs.i686              1.1.14-8.el6_8.2      sl-security
> pacemaker-libs-devel.i686        1.1.14-8.el6_8.2      sl-security
> pacemaker-libs-devel.x86_64      1.1.14-8.el6_8.2      sl-security
> pacemaker-remote.x86_64          1.1.14-8.el6_8.2      sl-security
> pcs.noarch                       0.9.90-2.el6          sl
> resource-agents-sap.x86_64       3.9.2-40.el6          sl
> ________________________________________
>
> ------------------------------
>
> Message: 2
> Date: Wed, 8 Mar 2017 10:40:49 -0600
> From: Ken Gaillot <kgail...@redhat.com>
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Antw: Re: Never join a list without a
>         problem...
> Message-ID: <408c0af6-3831-5e7a-f1dd-37dcbfb0f...@redhat.com>
> Content-Type: text/plain; charset=windows-1252
>
> On 03/08/2017 09:58 AM, Jeffrey Westgate wrote:
>> Ok.
>>
>> Been running monit for a few days, and atop (running a script to capture an 
>> atop output every 10 seconds for an hour, rotate the log, and do it again; 
>> runs from midnight to midnight, changes the date, and does it again).  I 
>> correlate between the atop logs, nagios alerts, and monit, to try to find a 
>> trigger.  Like trying to find a particular snowflake in Alaska in January.
>>
>> Have had a handful of episodes with all the monitors running.  We have 
>> determined nothing. Nothing significantly changes from normal/regular to 
>> high host load.
>>
>> It's a VMware/ESXi-hosted VM, so we moved it to a different host and
>> different datastore (so, effectively new CPU, memory, NIC, disk, video...
>> basically all "new" hardware).  Still have episodes.
>>
>> Was running the "VMWare provided" vmtools.  removed and replaced with 
>> open-vm-tools this morning.  just had another episode.
>>
>> was running atop interactively when the episode started - the only thing 
>> that seems to change is the hostload goes up.  momentary spike in "avio" for 
>> the disk -- all the way up to 25 msecs. lasted for one ten-second slice from 
>> atop.
>>
>> no zombies, no wait, no spike in network, transport, mem use, disk 
>> reads/writes... nothing I can see (and by I, I mean "we" as we have three 
>> people looking)
>>
>> I've got other boxes running the same OS - updated them at the same time, so 
>> patch level is all same.  No similar issues.  The only thing I have 
>> different is these two are running pacemaker, corosync, keepalived.  maybe 
>> when they were updated, they need a library I don't have?
>>
>> running     /usr/sbin/iotop -obtqqq > /var/log/iotop.log -- no red flags 
>> there.  so - not OS, not IO, not hardware (virtual as it is...) ... only 
>> leaves software.
>>
>> Maybe pacemaker is just incompatible with:
>>
>> Scientific Linux release 6.5 (Carbon)
>> kernel  2.6.32-642.15.1.el6.x86_64
>>
>> ??
> That does sound bizarre. I haven't tried 6.5 in a while, but it's
> certainly compatible with the current 6.8.
>
> IIRC, you updated to the 6.8 pacemaker packages ... Did you also update
> the OS and/or other cluster-related packages to 6.8?
>
>> At this point it's more of a curiosity than an out and out problem, as 
>> performance does not seem to be impacted noticeably.  Packet-in, packet-out 
>> seems unperturbed. Same cannot be said for us administrators...
>>
>>
>>
>>
>> ________________________________________
>> From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org]
>> Sent: Friday, March 03, 2017 7:27 AM
>> To: users@clusterlabs.org
>> Subject: Users Digest, Vol 26, Issue 10
>>
>>
>>
>> Today's Topics:
>>
>>    1. Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error
>>       retrying (Ulrich Windl)
>>    2. Re: Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join
>>       error retrying (emmanuel segura)
>>    3. Antw: Re:  Never join a list without a problem...
>>       (Jeffrey Westgate)
>>
>>
>> ----------------------------------------------------------------------
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Fri, 3 Mar 2017 13:27:25 +0000
>> From: Jeffrey Westgate <jeffrey.westg...@arkansas.gov>
>> To: "users@clusterlabs.org" <users@clusterlabs.org>
>> Subject: [ClusterLabs] Antw: Re:  Never join a list without a
>>         problem...
>> Message-ID:
>>         
>> <a36b14fa9aa67f4e836c0ee59dea89c4015b214...@cm-sas-mbx-07.sas.arkgov.net>
>>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> Appreciate the offer - not familiar with monit.
>>
>> Going to try running atop through logratate for the day, keep 12, rotate 
>> hourly (to control space utilization) and see if I can catch anything that 
>> way.  My biggest issue is we've not caught it as it starts, so we don't ever 
>> see anything amiss.
>>
>> If this doesn't work, then I will likely take you up on how to script monit 
>> to catch something.
>>
>> Thanks --
>>
>> Jeff
>> ________________________________________
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Thu, 2 Mar 2017 16:32:02 +0000
>> From: Jeffrey Westgate <jeffrey.westg...@arkansas.gov>
>> To: Adam Spiers <aspi...@suse.com>, "Cluster Labs - All topics related
>>         to      open-source clustering welcomed" <users@clusterlabs.org>
>> Subject: Re: [ClusterLabs] Never join a list without a problem...
>> Message-ID:
>>         
>> <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>
>>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Since we have both pieces of the load-balanced cluster doing the same thing 
>> - for still-as-yet unidentified reasons - we've put atop on one and sysdig 
>> on the other.  Running atop at 10 second slices, hoping it will catch 
>> something.  While configuring it yesterday, that server went into its
>> 'episode', but there was nothing in the atop log to show anything.  Nothing 
>> else changed except the cpu load average.  No increase in any other 
>> parameter.
>>
>> frustrating.
>>
>>
>> ________________________________________
>> From: Adam Spiers [aspi...@suse.com]
>> Sent: Wednesday, March 01, 2017 5:33 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> Cc: Jeffrey Westgate
>> Subject: Re: [ClusterLabs] Never join a list without a problem...
>>
>> Ferenc Wágner <wf...@niif.hu> wrote:
>>> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> writes:
>>>
>>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>>> longer, and we cannot set a clock by it - while the machine is 95%
>>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>>>> to come back down to baseline, which is mostly 0.00.  (attached
>>>> hostload.pdf) This happens to both machines, randomly, and is
>>>> concerning, as we'd like to find what's causing it and resolve it.
>>> Try running atop (http://www.atoptool.nl/).  It collects and logs
>>> process accounting info, allowing you to step back in time and check
>>> resource usage in the past.
>> Nice, I didn't know atop could also log the collected data for future
>> analysis.
>>
>> If you want to capture even more detail, sysdig is superb:
>>
>>     http://www.sysdig.org/
>>
>> ------------------------------
>>
>> Message: 5
>> Date: Fri, 03 Mar 2017 08:04:22 +0100
>> From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
>> To: <users@clusterlabs.org>
>> Subject: [ClusterLabs] Antw: Re:  Never join a list without a
>>         problem...
>> Message-ID: <58b91576020000a100024...@gwsmtp1.uni-regensburg.de>
>> Content-Type: text/plain; charset=UTF-8
>>
>>>>> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> wrote on 02.03.2017 at 17:32
>>>>> in message <a36b14fa9aa67f4e836c0ee59dea89c4015b212...@cm-sas-mbx-07.sas.arkgov.net>:
>>> Since we have both pieces of the load-balanced cluster doing the same thing -
>>> for still-as-yet unidentified reasons - we've put atop on one and sysdig on the
>>> other.  Running atop at 10 second slices, hoping it will catch something.
>>> While configuring it yesterday, that server went into its 'episode', but
>>> there was nothing in the atop log to show anything.  Nothing else changed
>>> except the cpu load average.  No increase in any other parameter.
>>>
>>> frustrating.
>> Hi!
>>
>> You could try the monit-approach (I could provide an RPM with a
>> "recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it).
>>
>> The part that monitors unusual load looks like this here:
>>   check system host.domain.org
>>     if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh"
>>     if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh"
>>     if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh"
>>     if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>>     if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>>     if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
>>     if cpu usage > 99% for 15 cycles then alert
>>     if cpu usage (user) > 90% for 30 cycles then alert
>>     if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh"
>>     if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"
>>     group local
>> ### all numbers are a matter of taste ;-)
>> And my script (for lack of better ideas) looks like this:
>> #!/bin/sh
>> {
>>     echo "========== $(/bin/date) =========="
>>     /usr/bin/mpstat
>>     echo "---"
>>     /usr/bin/vmstat
>>     echo "---"
>>     /usr/bin/top -b -n 1 -Hi
>> } >> /var/log/monit/top.log
>>
>> Regards,
>> Ulrich
>>
>>>
>>> ________________________________________
>>> From: Adam Spiers [aspi...@suse.com]
>>> Sent: Wednesday, March 01, 2017 5:33 AM
>>> To: Cluster Labs - All topics related to open-source clustering welcomed
>>> Cc: Jeffrey Westgate
>>> Subject: Re: [ClusterLabs] Never join a list without a problem...
>>>
>>> Ferenc Wágner <wf...@niif.hu> wrote:
>>>> Jeffrey Westgate <jeffrey.westg...@arkansas.gov> writes:
>>>>
>>>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>>>> longer, and we cannot set a clock by it - while the machine is 95%
>>>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>>>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>>>>> to come back down to baseline, which is mostly 0.00.  (attached
>>>>> hostload.pdf) This happens to both machines, randomly, and is
>>>>> concerning, as we'd like to find what's causing it and resolve it.
>>>> Try running atop (http://www.atoptool.nl/).  It collects and logs
>>>> process accounting info, allowing you to step back in time and check
>>>> resource usage in the past.
>>> Nice, I didn't know atop could also log the collected data for future
>>> analysis.
>>>
>>> If you want to capture even more detail, sysdig is superb:
>>>
>>>     http://www.sysdig.org/
>>>
>>
>>





------------------------------



End of Users Digest, Vol 26, Issue 23
*************************************

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
