Re: [ClusterLabs] Antw: Re: Never join a list without a problem...

2017-03-10 Thread Kai Dupke
On 03/08/2017 04:58 PM, Jeffrey Westgate wrote:
> It's a VMWare/ESXi-hosted VM, so we moved it to a different host and 
> different datastore (so, effectively new CPU, memory, nic, disk, video... 
> basically all "new" hardware).  still have episodes.

Just curious, do you monitor the host as well? I mean, when the host
reduces CPU assignment, or reduces IO capabilities for a VM, this can
have fancy effects.

So the problem might not be within the guest, but outside it?

regards,
Kai Dupke
Senior Product Manager
SUSE Linux Enterprise 13
-- 
Sell not virtue to purchase wealth, nor liberty to purchase power.
Phone:  +49-(0)5102-9310828 Mail: kdu...@suse.com
Mobile: +49-(0)173-5876766  WWW:  www.suse.com

SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF:Felix Imendörffer,Jane Smithard,Graham Norton,HRB 21284 (AG Nürnberg)

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Never join a list without a problem...

2017-03-09 Thread Klaus Wenninger
On 03/08/2017 07:13 PM, Jeffrey Westgate wrote:
> yes - at least I think this is all the packages.  (What I did was run a yum 
> update -y, for the most part - had to do pacemaker separately -- had to stop 
> it, update it, start it.)
>
> now, is it possible I'm missing a needed package after the update... but 
> dependencies should have handled that?
>
> [root@resolver-lb3 log]# yum list resource-agents\* ccs\* pcs\* cman\* 
> keepalive\* corosync\* pacemaker\*
> Loaded plugins: fastestmirror, refresh-packagekit
> Loading mirror speeds from cached hostfile
>  * epel: fedora-epel.mirror.lstn.net
>  * sl: ftp.scientificlinux.org
>  * sl-security: ftp.scientificlinux.org
> Installed Packages
> ccs.x86_64                      0.16.2-75.el6_6.1    installed
> cman.x86_64                     3.0.12.1-59.el6      @sl
> corosync.x86_64                 1.4.1-17.el6         @sl
> corosynclib.x86_64              1.4.1-17.el6         @sl

Looks like your corosync is ancient, and in particular it seems to be out
of sync with pacemaker. Your pacemaker looks like the version released with
RHEL-6.8, but corosync there is 1.4.7-5 and you have 1.4.1-17.
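
A mismatch like this can also be checked mechanically. The following is only a minimal sketch: it assumes GNU sort with version sorting (-V) is available, the helper name version_older is hypothetical, and sort -V only approximates RPM's own version comparison. The two version strings are the ones from the listing and the remark above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version $1 sorts strictly before version $2.
# Uses GNU coreutils "sort -V"; this approximates, but is not, RPM's comparison.
version_older() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Installed corosync (from the yum listing) vs. the RHEL 6.8 build named above.
if version_older "1.4.1-17" "1.4.7-5"; then
    echo "installed corosync is older than the RHEL 6.8 build"
fi
```

On a live system the first argument could come from something like rpm -q --qf '%{VERSION}-%{RELEASE}' corosync, with the dist tag stripped before comparing.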

>  
> keepalived.x86_64               1.2.7-3.el6          @sl
> pacemaker.x86_64                1.1.14-8.el6_8.2     @sl-security
> pacemaker-cli.x86_64            1.1.14-8.el6_8.2     @sl-security
> pacemaker-cluster-libs.x86_64   1.1.14-8.el6_8.2     @sl-security
> pacemaker-libs.x86_64           1.1.14-8.el6_8.2     @sl-security
> pcs.x86_64                      0.9.139-9.el6_7.1    installed
> resource-agents.x86_64          3.9.2-40.el6         @sl
> Available Packages
> corosynclib.i686                1.4.1-17.el6         sl
> corosynclib-devel.i686          1.4.1-17.el6         sl
> corosynclib-devel.x86_64        1.4.1-17.el6         sl
> pacemaker-cluster-libs.i686     1.1.14-8.el6_8.2     sl-security
> pacemaker-cts.x86_64            1.1.14-8.el6_8.2     sl-security
> pacemaker-doc.x86_64            1.1.14-8.el6_8.2     sl-security
> pacemaker-libs.i686             1.1.14-8.el6_8.2     sl-security
> pacemaker-libs-devel.i686       1.1.14-8.el6_8.2     sl-security
> pacemaker-libs-devel.x86_64     1.1.14-8.el6_8.2     sl-security
> pacemaker-remote.x86_64         1.1.14-8.el6_8.2     sl-security
> pcs.noarch                      0.9.90-2.el6         sl
> resource-agents-sap.x86_64      3.9.2-40.el6         sl
> ________________
>
> ------
>
> Message: 2
> Date: Wed, 8 Mar 2017 10:40:49 -0600
> From: Ken Gaillot 
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Antw: Re: Never join a list without a
> problem...
> Message-ID: <408c0af6-3831-5e7a-f1dd-37dcbfb0f...@redhat.com>
> Content-Type: text/plain; charset=windows-1252
>
> On 03/08/2017 09:58 AM, Jeffrey Westgate wrote:
>> Ok.
>>
>> Been running monit for a few days, and atop (running a script to capture an 
>> atop output every 10 seconds for an hour, rotate the log, and do it again; 
>> runs from midnight to midnight, changes the date, and does it again).  I 
>> correlate between the atop logs, nagios alerts, and monit, to try to find a 
>> trigger.  Like trying to find a particular snowflake in Alaska in January.
>>
>> Have had a handful of episodes with all the monitors running.  We have 
>> determined nothing. Nothing s

Re: [ClusterLabs] Antw: Re: Never join a list without a problem...

2017-03-08 Thread Jeffrey Westgate
Just for grins and giggles (I need some of both right now) I just updated to 
SL6.8.

We'll see what's what now.  That's EVERYTHING changed.



From: Jeffrey Westgate
Sent: Wednesday, March 08, 2017 12:13 PM
To: users@clusterlabs.org
Subject: Re: Antw: Re: Never join a list without a problem...

yes - at least I think this is all the packages.  (What I did was run a yum 
update -y, for the most part - had to do pacemaker separately -- had to stop 
it, update it, start it.)

now, is it possible I'm missing a needed package after the update... but 
dependencies should have handled that?

[root@resolver-lb3 log]# yum list resource-agents\* ccs\* pcs\* cman\* 
keepalive\* corosync\* pacemaker\*
Loaded plugins: fastestmirror, refresh-packagekit
Loading mirror speeds from cached hostfile
 * epel: fedora-epel.mirror.lstn.net
 * sl: ftp.scientificlinux.org
 * sl-security: ftp.scientificlinux.org
Installed Packages
ccs.x86_64                      0.16.2-75.el6_6.1    installed
cman.x86_64                     3.0.12.1-59.el6      @sl
corosync.x86_64                 1.4.1-17.el6         @sl
corosynclib.x86_64              1.4.1-17.el6         @sl
keepalived.x86_64               1.2.7-3.el6          @sl
pacemaker.x86_64                1.1.14-8.el6_8.2     @sl-security
pacemaker-cli.x86_64            1.1.14-8.el6_8.2     @sl-security
pacemaker-cluster-libs.x86_64   1.1.14-8.el6_8.2     @sl-security
pacemaker-libs.x86_64           1.1.14-8.el6_8.2     @sl-security
pcs.x86_64                      0.9.139-9.el6_7.1    installed
resource-agents.x86_64          3.9.2-40.el6         @sl
Available Packages
corosynclib.i686                1.4.1-17.el6         sl
corosynclib-devel.i686          1.4.1-17.el6         sl
corosynclib-devel.x86_64        1.4.1-17.el6         sl
pacemaker-cluster-libs.i686     1.1.14-8.el6_8.2     sl-security
pacemaker-cts.x86_64            1.1.14-8.el6_8.2     sl-security
pacemaker-doc.x86_64            1.1.14-8.el6_8.2     sl-security
pacemaker-libs.i686             1.1.14-8.el6_8.2     sl-security
pacemaker-libs-devel.i686       1.1.14-8.el6_8.2     sl-security
pacemaker-libs-devel.x86_64     1.1.14-8.el6_8.2     sl-security
pacemaker-remote.x86_64         1.1.14-8.el6_8.2     sl-security
pcs.noarch                      0.9.90-2.el6         sl
resource-agents-sap.x86_64      3.9.2-40.el6         sl


--

Message: 2
Date: Wed, 8 Mar 2017 10:40:49 -0600
From: Ken Gaillot 
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Antw: Re: Never join a list without a
problem...
Message-ID: <408c0af6-3831-5e7a-f1dd-37dcbfb0f...@redhat.com>
Content-Type: text/plain; charset=windows-1252

On 03/08/2017 09:58 AM, Jeffrey Westgate wrote:
> Ok.
>
> Been running monit for a few days, and atop (running a script to capture an 
> atop output every 10 seconds for an hour, rotate the log, and do it again; 
> runs from midnight to midnight, changes the date, and does it again).  I 
> correlate between the atop logs, nagios alerts, and monit, to try to find a 
> trigger.  Like trying to find a particular snowflake in Alaska in January.
>
> Have had a handful of episodes with all the monitors running.  We have 
> determined nothing. Nothing significantly changes from normal/regular to high 
> host load.
>
> It's a VMWare/ESXi-hosted VM, so we moved it to a different host and 
> different datastore (so, effectively new CPU, memory, nic, disk, video... 
> basically all "new" hardware.  still have episodes.
>
> Was running the "VMWare provided" vmtools.  removed and replaced with 
> open-vm-tools this morning.  just had anoth

Re: [ClusterLabs] Antw: Re: Never join a list without a problem...

2017-03-08 Thread Jeffrey Westgate
yes - at least I think this is all the packages.  (What I did was run a yum 
update -y, for the most part - had to do pacemaker separately -- had to stop 
it, update it, start it.)

now, is it possible I'm missing a needed package after the update... but 
dependencies should have handled that?

[root@resolver-lb3 log]# yum list resource-agents\* ccs\* pcs\* cman\* 
keepalive\* corosync\* pacemaker\*
Loaded plugins: fastestmirror, refresh-packagekit
Loading mirror speeds from cached hostfile
 * epel: fedora-epel.mirror.lstn.net
 * sl: ftp.scientificlinux.org
 * sl-security: ftp.scientificlinux.org
Installed Packages
ccs.x86_64                      0.16.2-75.el6_6.1    installed
cman.x86_64                     3.0.12.1-59.el6      @sl
corosync.x86_64                 1.4.1-17.el6         @sl
corosynclib.x86_64              1.4.1-17.el6         @sl
keepalived.x86_64               1.2.7-3.el6          @sl
pacemaker.x86_64                1.1.14-8.el6_8.2     @sl-security
pacemaker-cli.x86_64            1.1.14-8.el6_8.2     @sl-security
pacemaker-cluster-libs.x86_64   1.1.14-8.el6_8.2     @sl-security
pacemaker-libs.x86_64           1.1.14-8.el6_8.2     @sl-security
pcs.x86_64                      0.9.139-9.el6_7.1    installed
resource-agents.x86_64          3.9.2-40.el6         @sl
Available Packages
corosynclib.i686                1.4.1-17.el6         sl
corosynclib-devel.i686          1.4.1-17.el6         sl
corosynclib-devel.x86_64        1.4.1-17.el6         sl
pacemaker-cluster-libs.i686     1.1.14-8.el6_8.2     sl-security
pacemaker-cts.x86_64            1.1.14-8.el6_8.2     sl-security
pacemaker-doc.x86_64            1.1.14-8.el6_8.2     sl-security
pacemaker-libs.i686             1.1.14-8.el6_8.2     sl-security
pacemaker-libs-devel.i686       1.1.14-8.el6_8.2     sl-security
pacemaker-libs-devel.x86_64     1.1.14-8.el6_8.2     sl-security
pacemaker-remote.x86_64         1.1.14-8.el6_8.2     sl-security
pcs.noarch                      0.9.90-2.el6         sl
resource-agents-sap.x86_64      3.9.2-40.el6         sl


--

Message: 2
Date: Wed, 8 Mar 2017 10:40:49 -0600
From: Ken Gaillot 
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Antw: Re: Never join a list without a
    problem...
Message-ID: <408c0af6-3831-5e7a-f1dd-37dcbfb0f...@redhat.com>
Content-Type: text/plain; charset=windows-1252

On 03/08/2017 09:58 AM, Jeffrey Westgate wrote:
> Ok.
>
> Been running monit for a few days, and atop (running a script to capture an 
> atop output every 10 seconds for an hour, rotate the log, and do it again; 
> runs from midnight to midnight, changes the date, and does it again).  I 
> correlate between the atop logs, nagios alerts, and monit, to try to find a 
> trigger.  Like trying to find a particular snowflake in Alaska in January.
>
> Have had a handful of episodes with all the monitors running.  We have 
> determined nothing. Nothing significantly changes from normal/regular to high 
> host load.
>
> It's a VMWare/ESXi-hosted VM, so we moved it to a different host and 
> different datastore (so, effectively new CPU, memory, nic, disk, video... 
> basically all "new" hardware.  still have episodes.
>
> Was running the "VMWare provided" vmtools.  removed and replaced with 
> open-vm-tools this morning.  just had another episode.
>
> was running atop interactively when the episode started - the only thing that 
> seems to change is the hostload goes up.  momentary spike in "avio" for the 
> disk -- all the way up to 25 msecs. lasted for

Re: [ClusterLabs] Antw: Re: Never join a list without a problem...

2017-03-08 Thread Ken Gaillot
On 03/08/2017 09:58 AM, Jeffrey Westgate wrote:
> Ok. 
> 
> Been running monit for a few days, and atop (running a script to capture an 
> atop output every 10 seconds for an hour, rotate the log, and do it again; 
> runs from midnight to midnight, changes the date, and does it again).  I 
> correlate between the atop logs, nagios alerts, and monit, to try to find a 
> trigger.  Like trying to find a particular snowflake in Alaska in January.
> 
> Have had a handful of episodes with all the monitors running.  We have 
> determined nothing. Nothing significantly changes from normal/regular to high 
> host load.
> 
> It's a VMWare/ESXi-hosted VM, so we moved it to a different host and 
> different datastore (so, effectively new CPU, memory, nic, disk, video... 
> basically all "new" hardware.  still have episodes.
> 
> Was running the "VMWare provided" vmtools.  removed and replaced with 
> open-vm-tools this morning.  just had another episode.
> 
> was running atop interactively when the episode started - the only thing that 
> seems to change is the hostload goes up.  momentary spike in "avio" for the 
> disk -- all the way up to 25 msecs. lasted for one ten-second slice from atop.
> 
> no zombies, no wait, no spike in network, transport, mem use, disk 
> reads/writes... nothing I can see (and by I, I mean "we" as we have three 
> people looking)
> 
> I've got other boxes running the same OS - updated them at the same time, so 
> patch level is all same.  No similar issues.  The only thing I have different 
> is these two are running pacemaker, corosync, keepalived.  maybe when they 
> were updated, they need a library I don't have? 
> 
> running /usr/sbin/iotop -obtqqq > /var/log/iotop.log -- no red flags 
> there.  so - not OS, not IO, not hardware (virtual as it is...) ... only 
> leaves software.
> 
> Maybe pacemaker is just incompatible with:
> 
> Scientific Linux release 6.5 (Carbon)
> kernel  2.6.32-642.15.1.el6.x86_64
> 
> ??

That does sound bizarre. I haven't tried 6.5 in a while, but it's
certainly compatible with the current 6.8.

IIRC, you updated to the 6.8 pacemaker packages ... Did you also update
the OS and/or other cluster-related packages to 6.8?

> At this point it's more of a curiosity than an out and out problem, as 
> performance does not seem to be impacted noticeably.  Packet-in, packet-out 
> seems unperturbed. Same cannot be said for us administrators...
> 
> 
> 
> 
> 
> From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org]
> Sent: Friday, March 03, 2017 7:27 AM
> To: users@clusterlabs.org
> Subject: Users Digest, Vol 26, Issue 10
> 
> Send Users mailing list submissions to
> users@clusterlabs.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> http://lists.clusterlabs.org/mailman/listinfo/users
> or, via email, send a message with subject or body 'help' to
> users-requ...@clusterlabs.org
> 
> You can reach the person managing the list at
> users-ow...@clusterlabs.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Users digest..."
> 
> 
> Today's Topics:
> 
>1. Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error
>   retrying (Ulrich Windl)
>2. Re: Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join
>   error retrying (emmanuel segura)
>3. Antw: Re:  Never join a list without a problem...
>   (Jeffrey Westgate)
> 
> 
> --
> 
> --
> 
> Message: 3
> Date: Fri, 3 Mar 2017 13:27:25 +
> From: Jeffrey Westgate 
> To: "users@clusterlabs.org" 
> Subject: [ClusterLabs] Antw: Re:  Never join a list without a
> problem...
> Message-ID:
> 
> 
> 
> Content-Type: text/plain; charset="us-ascii"
> 
> Appreciate the offer - not familiar with monit.
> 
> Going to try running atop through logrotate for the day, keep 12, rotate 
> hourly (to control space utilization) and see if I can catch anything that 
> way.  My biggest issue is we've not caught it as it starts, so we don't ever 
> see anything amiss.
> 
> If this doesn't work, then I will likely take you up on how to script monit 
> to catch something.
> 
> Thanks --
> 
> Jeff
> 
> From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org]
> Sent: Friday, March 03, 2017 4:51 AM
> To: users@clusterlabs.org
> Subject: Users 

[ClusterLabs] Antw: Re: Never join a list without a problem...

2017-03-08 Thread Jeffrey Westgate
Ok. 

Been running monit for a few days, and atop (running a script to capture an 
atop output every 10 seconds for an hour, rotate the log, and do it again; runs 
from midnight to midnight, changes the date, and does it again).  I correlate 
between the atop logs, nagios alerts, and monit, to try to find a trigger.  
Like trying to find a particular snowflake in Alaska in January.
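
For anyone wanting to reproduce this kind of capture, here is a rough sketch of one hourly recording run. It is not the script actually used here: the output path, the file naming, and the helper samples_per_file are assumptions, and it relies on atop's raw-file mode (atop -w file interval samples). It only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical helper: samples needed to cover $2 seconds at $1-second intervals.
samples_per_file() {
    echo $(( $2 / $1 ))
}

interval=10                                   # seconds between samples
count=$(samples_per_file "$interval" 3600)    # one hour of samples per file
# Assumed location and naming scheme: one raw file per date and hour.
outfile="/var/log/atop/atop_$(date +%Y%m%d_%H).raw"

# Print the command rather than running it; drop "echo" to record for real.
echo atop -w "$outfile" "$interval" "$count"
```

Started once an hour (from cron, say), this yields one raw file per hour that can be replayed later with atop -r.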

Have had a handful of episodes with all the monitors running.  We have 
determined nothing. Nothing significantly changes from normal/regular to high 
host load.

It's a VMWare/ESXi-hosted VM, so we moved it to a different host and different 
datastore (so, effectively new CPU, memory, nic, disk, video... basically all 
"new" hardware).  still have episodes.

Was running the "VMWare provided" vmtools.  removed and replaced with 
open-vm-tools this morning.  just had another episode.

was running atop interactively when the episode started - the only thing that 
seems to change is the hostload goes up.  momentary spike in "avio" for the 
disk -- all the way up to 25 msecs. lasted for one ten-second slice from atop.

no zombies, no wait, no spike in network, transport, mem use, disk 
reads/writes... nothing I can see (and by I, I mean "we" as we have three 
people looking)

I've got other boxes running the same OS - updated them at the same time, so 
patch level is all same.  No similar issues.  The only thing I have different 
is these two are running pacemaker, corosync, keepalived.  maybe when they were 
updated, they need a library I don't have? 

running /usr/sbin/iotop -obtqqq > /var/log/iotop.log -- no red flags there. 
 so - not OS, not IO, not hardware (virtual as it is...) ... only leaves 
software.

Maybe pacemaker is just incompatible with:

Scientific Linux release 6.5 (Carbon)
kernel  2.6.32-642.15.1.el6.x86_64

??

At this point it's more of a curiosity than an out and out problem, as 
performance does not seem to be impacted noticeably.  Packet-in, packet-out 
seems unperturbed. Same cannot be said for us administrators...





From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org]
Sent: Friday, March 03, 2017 7:27 AM
To: users@clusterlabs.org
Subject: Users Digest, Vol 26, Issue 10

Send Users mailing list submissions to
users@clusterlabs.org

To subscribe or unsubscribe via the World Wide Web, visit
http://lists.clusterlabs.org/mailman/listinfo/users
or, via email, send a message with subject or body 'help' to
users-requ...@clusterlabs.org

You can reach the person managing the list at
users-ow...@clusterlabs.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Users digest..."


Today's Topics:

   1. Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error
  retrying (Ulrich Windl)
   2. Re: Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join
  error retrying (emmanuel segura)
   3. Antw: Re:  Never join a list without a problem...
  (Jeffrey Westgate)


--

--

Message: 3
Date: Fri, 3 Mar 2017 13:27:25 +0000
From: Jeffrey Westgate 
To: "users@clusterlabs.org" 
Subject: [ClusterLabs] Antw: Re:  Never join a list without a
problem...
Message-ID:



Content-Type: text/plain; charset="us-ascii"

Appreciate the offer - not familiar with monit.

Going to try running atop through logrotate for the day, keep 12, rotate hourly 
(to control space utilization) and see if I can catch anything that way.  My 
biggest issue is we've not caught it as it starts, so we don't ever see 
anything amiss.

If this doesn't work, then I will likely take you up on how to script monit to 
catch something.

Thanks --

Jeff

From: users-requ...@clusterlabs.org [users-requ...@clusterlabs.org]
Sent: Friday, March 03, 2017 4:51 AM
To: users@clusterlabs.org
Subject: Users Digest, Vol 26, Issue 9



Today's Topics:

   1. Re: Never join a list without a problem... (Jeffrey Westgate)
   2. Re: PCMK_OCF_DEGRADED (_MASTER): exit codes are mapped to
  PCMK_OCF_UNKNOWN_ERROR (Ken Gaillot)
   3. Re: Cannot clone clvmd resource (Eric Ren)
   4. Re: Cannot clone clvmd resource (Eric Ren)
   5. Antw: Re:  Never join a list withou

[ClusterLabs] Antw: Re: Never join a list without a problem...

2017-03-03 Thread Jeffrey Westgate
de 739312140: node2
>>> node 739312141: node3
>>> primitive admin_addr IPaddr2 \
>>>  params ip=172.17.2.10 \
>>>      op monitor interval=10 timeout=20 \
>>>  meta target-role=Started
>>> primitive p-clvmd ocf:lvm2:clvmd \
>>>  op start timeout=90 interval=0 \
>>>  op stop timeout=100 interval=0 \
>>>  op monitor interval=30 timeout=90
>>> primitive p-dlm ocf:pacemaker:controld \
>>>  op start timeout=90 interval=0 \
>>>  op stop timeout=100 interval=0 \
>>>  op monitor interval=60 timeout=90
>>> primitive stonith-sbd stonith:external/sbd
>>> group g-clvm p-dlm p-clvmd
>>> clone c-clvm g-clvm meta interleave=true
>>> property cib-bootstrap-options: \
>>>  have-watchdog=true \
>>>  dc-version=1.1.13-14.7-6f22ad7 \
>>>  cluster-infrastructure=corosync \
>>>  cluster-name=hacluster \
>>>  stonith-enabled=true \
>>>  placement-strategy=balanced \
>>>  no-quorum-policy=freeze \
>>>  last-lrm-refresh=1488404073
>>> rsc_defaults rsc-options: \
>>>  resource-stickiness=1 \
>>>  migration-threshold=10
>>> op_defaults op-options: \
>>>  timeout=600 \
>>>  record-pending=true
>>>
>>> Thanks in advance for your input
>>>
>>> Cheers
>>>
>>
>>




--

Message: 5
Date: Fri, 03 Mar 2017 08:04:22 +0100
From: "Ulrich Windl" 
To: 
Subject: [ClusterLabs] Antw: Re:  Never join a list without a
problem...
Message-ID: <58b9157602a100024...@gwsmtp1.uni-regensburg.de>
Content-Type: text/plain; charset=UTF-8

>>> Jeffrey Westgate wrote on 02.03.2017 at 17:32:
> Since we have both pieces of the load-balanced cluster doing the same thing -
> for still-as-yet unidentified reasons - we've put atop on one and sysdig on the
> other.  Running atop at 10 second slices, hoping it will catch something.
> While configuring it yesterday, that server went into its 'episode', but
> there was nothing in the atop log to show anything.  Nothing else changed
> except the cpu load average.  No increase in any other parameter.
>
> frustrating.

Hi!

You could try the monit-approach (I could provide an RPM with a
"recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it).

The part that monitors unusual load looks like this here:
  check system host.domain.org
    if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh"
    if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh"
    if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh"
    if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
    if cpu usage > 99% for 15 cycles then alert
    if cpu usage (user) > 90% for 30 cycles then alert
    if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"
    group local
### all numbers are a matter of taste ;-)
And my script (in lack of better ideas) looks like this:
#!/bin/sh
{
echo "== $(/bin/date) =="
/usr/bin/mpstat
echo "---"
/usr/bin/vmstat
echo "---"
/usr/bin/top -b -n 1 -Hi
} >> /var/log/monit/top.log

Regards,
Ulrich

>
>
> 
> From: Adam Spiers [aspi...@suse.com]
> Sent: Wednesday, March 01, 2017 5:33 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Cc: Jeffrey Westgate
> Subject: Re: [ClusterLabs] Never join a list without a problem...
>
> Ferenc Wágner  wrote:
>>Jeffrey Westgate  writes:
>>
>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>> longer, and we cannot set a clock by it - while the machine is 95%
>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>

[ClusterLabs] Antw: Re: Never join a list without a problem...

2017-03-02 Thread Ulrich Windl
>>> Jeffrey Westgate wrote on 02.03.2017 at 17:32:
> Since we have both pieces of the load-balanced cluster doing the same thing -
> for still-as-yet unidentified reasons - we've put atop on one and sysdig on the
> other.  Running atop at 10 second slices, hoping it will catch something.
> While configuring it yesterday, that server went into its 'episode', but
> there was nothing in the atop log to show anything.  Nothing else changed
> except the cpu load average.  No increase in any other parameter.
> 
> frustrating.

Hi!

You could try the monit-approach (I could provide an RPM with a
"recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it).

The part that monitors unusual load looks like this here:
  check system host.domain.org
    if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh"
    if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh"
    if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh"
    if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
    if cpu usage > 99% for 15 cycles then alert
    if cpu usage (user) > 90% for 30 cycles then alert
    if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"
    group local
### all numbers are a matter of taste ;-)
And my script (in lack of better ideas) looks like this:
#!/bin/sh
{
echo "== $(/bin/date) =="
/usr/bin/mpstat
echo "---"
/usr/bin/vmstat
echo "---"
/usr/bin/top -b -n 1 -Hi
} >> /var/log/monit/top.log

Regards,
Ulrich

> 
> 
> 
> From: Adam Spiers [aspi...@suse.com]
> Sent: Wednesday, March 01, 2017 5:33 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Cc: Jeffrey Westgate
> Subject: Re: [ClusterLabs] Never join a list without a problem...
> 
> Ferenc Wágner  wrote:
>>Jeffrey Westgate  writes:
>>
>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>> longer, and we cannot set a clock by it - while the machine is 95%
>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>>> to come back down to baseline, which is mostly 0.00.  (attached
>>> hostload.pdf) This happens to both machines, randomly, and is
>>> concerning, as we'd like to find what's causing it and resolve it.
>>
>>Try running atop (http://www.atoptool.nl/).  It collects and logs
>>process accounting info, allowing you to step back in time and check
>>resource usage in the past.
> 
> Nice, I didn't know atop could also log the collected data for future
> analysis.
> 
> If you want to capture even more detail, sysdig is superb:
> 
> http://www.sysdig.org/ 
> 
> ___
> Users mailing list: Users@clusterlabs.org 
> http://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 






[ClusterLabs] Antw: Re: Never join a list without a problem...

2017-03-01 Thread Ulrich Windl
>>> Kai Dupke wrote on 01.03.2017 at 09:55:
> On 02/27/2017 02:26 PM, Jeffrey Westgate  wrote:
>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes longer, 
> and we cannot set a clock by it - while the machine is 95% idle (or more 
> according to 'top'), the host load shoots up to 50 or 60%.  It takes about 20 
> minutes to peak, and another 30 to 45 minutes to come back down to baseline, 
> which is mostly 0.00.
> 
> So, you have a time window of ~1h where the system is under load, right?
> This is somewhat different to what Ulrich had, but his approach might be
> useful for you, too.
> 
> Something against running some monitoring and capturing the processes,
> process states and load say, every 5 minutes?
> 
> Of course, the peaks might correlate to something in the logs - like
> cron, logins, logrotates or whatever.

The main issue is "expected load" vs. "unexpected load". In my case the system 
was expected to be completely idle at night, so I had set the thresholds rather 
low. Other systems can use different approaches. I hope to hear what caused the 
problem in your case.
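
As an illustration of the threshold idea, the check below compares a load value against a limit chosen per system. It is only a sketch: the helper name load_above and the example threshold are assumptions, and the default reads the 1-minute load average from /proc/loadavg (Linux only).

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the load value $2 is greater than threshold $1.
# When $2 is omitted, the 1-minute load average is read from /proc/loadavg.
load_above() {
    threshold=$1
    load=${2:-$(cut -d ' ' -f 1 /proc/loadavg 2>/dev/null)}
    # awk does the floating-point comparison that plain sh cannot do.
    awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l > t) }'
}

# A system expected to be idle at night could use a low threshold such as 0.5.
if load_above 0.5; then
    echo "$(date): load is above the expected idle level"
fi
```

A busier system would simply pass a higher threshold; the comparison itself stays the same.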

Ulrich






[ClusterLabs] Antw: Re: Never join a list without a problem...

2017-03-01 Thread Ulrich Windl
>>> Jeffrey Westgate wrote on 27.02.2017 at 14:26:
> Thanks, Ken. 
> 
> Our late guru was the admin who set all this up, and it's been rock solid 
> until recent oddities started cropping up.  They still function fine - 
> they've 
> just developed some... quirks.
> 
> I found the solution before I got your reply, which was essentially what we 
> did; update all but pacemaker, reboot, stop pacemaker, update pacemaker, 
> reboot.  That process was necessary because they've been running sooo long, 
> pacemaker would not stop.  it would try, then seemingly stall after several 
> minutes.
> 
> We're good now, up-to-date-wise, and stuck only with the initial issue we 
> were 
> hoping to eliminate by updating/patching EVERYthing.  And we honestly don't 
> know what may be causing it.
> 
> We use Nagios to monitor, and once every 20 to 40 hours - sometimes longer, 
> and we cannot set a clock by it - while the machine is 95% idle (or more 
> according to 'top'), the host load shoots up to 50 or 60%.  It takes about 20 
> minutes to peak, and another 30 to 45 minutes to come back down to baseline, 
> which is mostly 0.00.  (attached hostload.pdf)  This happens to both 
> machines, randomly, and is concerning, as we'd like to find what's causing it 
> and resolve it.

We use SLES11 here, and it took me a really long time to find out what was 
causing nightly load peaks on our servers. It turned out to be the rebuild of 
the manual database (mandb). It didn't show in Nagios load statistics, but in 
monit alerts (on some machines we use both). In monit you can run a script when 
some condition is met. So I constructed a "capture script" to find the guilty 
parties ;-)

However the peaks were so short that it took many runs to find it. Here the 
load was back to normal already, but monit had reported an event like "cpu 
system usage of 30.2% matches resource limit [cpu system usage>20.0%]":

Sat May 11 01:31:13 CEST 2013
top - 01:31:14 up 2 days,  9:31,  0 users,  load average: 0.91, 0.31, 0.15
Tasks: 114 total,   2 running, 112 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1065628k total,  1055292k used,    10336k free,   143708k buffers
Swap:  2097148k total,        0k used,  2097148k free,   578736k cached

  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
 2832 root  20   0  8916 1060  776 R  0  0.1   0:00.00 top
 2910 man   30  10 840 R  0  0.0   0:00.00 mandb

Maybe this helps.

Regards,
Ulrich

> 
> We were hoping "uptime kernel bug", but patching has not helped.  There 
> seems to be no increase in the number of processes running, and the processes 
> running do not take any more cpu time.  They are DNS forwarding resolvers, 
> but there is no correlation between dns requests and load increase - 
> sometimes 
> (like this morning) it rises around 1 AM when the dns load is minimal.
> 
> The oddity is - these are the only two boxes with this issue, and we have a 
> couple dozen at the same OS and level.  Only these two, with this role and 
> this particular package set have the issue.
> 
> --
> Jeff




