Re: [Linux-HA] Heartbeat fail-over Email Alert

2014-09-24 Thread Tom Parker
Hi Lars

Can you provide more details about this resource agent?  The
documentation is a little sparse.  What events will cause an e-mail to
be sent?

Thanks!

Tom

On 24/09/14 06:53 PM, Lars Ellenberg wrote:
 On Tue, Sep 23, 2014 at 04:55:20PM +0530, Atul Yadav wrote:
 Dear Team ,

 In our environment for storage HA, we are using heartbeat method.

 Our Storage HA is working fine with Heartbeat management.

 Now we need your guidance to set up an email alert for when a fail-over
  happens and when it has completed.

 We have already set up SMTP on both servers.
 And we are able to send mail from a terminal window.
 Storage1
 Storage2

 Please guide us.
 What's wrong with the MailTo resource agent?
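For reference, the ocf:heartbeat:MailTo agent simply sends a message when
the resource itself is started or stopped, so grouping it with (or ordering
it after) the storage resources yields a mail on every fail-over. A minimal
sketch, assuming a Pacemaker/crm setup; the resource name, address and
subject are examples:

crm configure primitive mail-alert ocf:heartbeat:MailTo \
        params email="admin@example.com" subject="Storage HA fail-over" \
        op monitor interval="10" timeout="10"
# then append mail-alert to the existing storage group so it starts last
# and stops first, i.e. a mail is sent around each fail-over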


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] SBD flipping between Pacemaker: UNHEALTHY and OK

2014-04-23 Thread Tom Parker
ok.  I have fixed that to be no_path_retry fail but I don't think this
has anything to do with the errors I am seeing. 

They seem to be related to sbd's link with my cluster, not with disk I/O

Tom

On 23/04/14 03:11 AM, emmanuel segura wrote:
 First thing: you are using no_path_retry in the wrong way in your
 multipath config; try reading this:
 http://www.novell.com/documentation/oes2/clus_admin_lx/data/bl9ykz6.html
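For reference, the setting under discussion lives in /etc/multipath.conf.
For a device used by sbd the usual advice is to fail I/O immediately rather
than queue it; a minimal sketch (it can also be set per device instead of
in defaults):

defaults {
        no_path_retry fail    # fail I/O once all paths are lost, do not queue
}

and make sure no "queue_if_no_path" feature overrides it for that device.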


 2014-04-22 20:41 GMT+02:00 Tom Parker tpar...@cbnco.com:

 I have attached the config files to this e-mail.  The sbd dump is below

 [LIVE] qaxen1:~ # sbd -d /dev/mapper/qa-xen-sbd dump
 ==Dumping header on disk /dev/mapper/qa-xen-sbd
 Header version : 2.1
 UUID   : ae835596-3d26-4681-ba40-206b4d51149b
 Number of slots: 255
 Sector size: 512
 Timeout (watchdog) : 45
 Timeout (allocate) : 2
 Timeout (loop) : 1
 Timeout (msgwait)  : 90
 ==Header on disk /dev/mapper/qa-xen-sbd is dumped

 On 22/04/14 02:30 PM, emmanuel segura wrote:
 you are missing the cluster configuration, the sbd configuration and
 the multipath config


 2014-04-22 20:21 GMT+02:00 Tom Parker tpar...@cbnco.com:

 Has anyone seen this?  Do you know what might be causing the flapping?

 Apr 21 22:03:03 qaxen6 sbd: [12962]: info: Watchdog enabled.
 Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Servant starting for device
 /dev/mapper/qa-xen-sbd
 Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Monitoring Pacemaker health
 Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Device /dev/mapper/qa-xen-sbd
 uuid: ae835596-3d26-4681-ba40-206b4d51149b
 Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Legacy plug-in detected, AIS
 quorum check enabled
 Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Waiting to sign in with
 cluster ...
 Apr 21 22:03:04 qaxen6 sbd: [12971]: notice: Using watchdog device:
 /dev/watchdog
 Apr 21 22:03:04 qaxen6 sbd: [12971]: info: Set watchdog timeout to 45
 seconds.
 Apr 21 22:03:04 qaxen6 sbd: [12974]: info: Waiting to sign in with
 cluster ...
 Apr 21 22:03:06 qaxen6 sbd: [12974]: info: We don't have a DC right now.
 Apr 21 22:03:08 qaxen6 sbd: [12974]: WARN: Node state: UNKNOWN
 Apr 21 22:03:09 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:03:09 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 21 22:03:10 qaxen6 sbd: [12974]: WARN: Node state: pending
 Apr 21 22:03:11 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:15:01 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 21 22:15:01 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 21 22:16:37 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:16:37 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 21 22:25:08 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 21 22:25:08 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 21 22:26:44 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:26:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 21 22:39:24 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 21 22:39:24 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 21 22:42:44 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:42:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 01:36:24 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 01:36:24 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 01:36:34 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 01:36:34 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 06:53:15 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 06:53:15 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 06:54:03 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 06:54:03 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 09:57:21 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 09:57:21 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 09:58:12 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 09:58:12 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 10:59:49 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 10:59:49 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 11:00:41 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 11:00:41 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 11:50:55 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 11:50:55 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 11:51:06 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 11:51:06 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 13:09:12 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 13:09:12 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 13:09:35 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 13:09:35 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 13:31:35 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 13:31:35 qaxen6 sbd: [12971]: WARN

Re: [Linux-HA] SBD flipping between Pacemaker: UNHEALTHY and OK

2014-04-23 Thread Tom Parker
SBD has a connection to pacemaker to establish overall cluster health
(the -P flag).  This seems to be where the problem is.  I just don't
know what the problem might be.
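On SLES the -P flag typically comes from SBD_OPTS in /etc/sysconfig/sbd; a
minimal sketch, with the device path taken from the dump earlier in the
thread:

SBD_DEVICE="/dev/mapper/qa-xen-sbd"
SBD_OPTS="-W -P"    # -W: use the hardware watchdog, -P: also check Pacemaker/cluster health

The Pacemaker servant started by -P is what produces the "Monitoring
Pacemaker health" / "Pacemaker health check" messages in the log above.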

On 23/04/14 11:32 AM, emmanuel segura wrote:
 what do you mean by link?


 2014-04-23 15:23 GMT+02:00 Tom Parker tpar...@cbnco.com:

 ok.  I have fixed that to be no_path_retry fail but I don't think this
 has anything to do with the errors I am seeing.

 They seem to be related to sbd's link with my cluster, not with disk I/O

 Tom

 On 23/04/14 03:11 AM, emmanuel segura wrote:
 the first thing, you are using no_path_retry in wrong way in your
 multipath, try to read this
 http://www.novell.com/documentation/oes2/clus_admin_lx/data/bl9ykz6.html


 2014-04-22 20:41 GMT+02:00 Tom Parker tpar...@cbnco.com:

 I have attached the config files to this e-mail.  The sbd dump is below

 [LIVE] qaxen1:~ # sbd -d /dev/mapper/qa-xen-sbd dump
 ==Dumping header on disk /dev/mapper/qa-xen-sbd
 Header version : 2.1
 UUID   : ae835596-3d26-4681-ba40-206b4d51149b
 Number of slots: 255
 Sector size: 512
 Timeout (watchdog) : 45
 Timeout (allocate) : 2
 Timeout (loop) : 1
 Timeout (msgwait)  : 90
 ==Header on disk /dev/mapper/qa-xen-sbd is dumped

 On 22/04/14 02:30 PM, emmanuel segura wrote:
 you are missingo cluster configuration and sbd configuration and
 multipath
 config


 2014-04-22 20:21 GMT+02:00 Tom Parker tpar...@cbnco.com:

 Has anyone seen this?  Do you know what might be causing the flapping?

 Apr 21 22:03:03 qaxen6 sbd: [12962]: info: Watchdog enabled.
 Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Servant starting for device
 /dev/mapper/qa-xen-sbd
 Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Monitoring Pacemaker health
 Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Device
 /dev/mapper/qa-xen-sbd
 uuid: ae835596-3d26-4681-ba40-206b4d51149b
 Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Legacy plug-in detected,
 AIS
 quorum check enabled
 Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Waiting to sign in with
 cluster ...
 Apr 21 22:03:04 qaxen6 sbd: [12971]: notice: Using watchdog device:
 /dev/watchdog
 Apr 21 22:03:04 qaxen6 sbd: [12971]: info: Set watchdog timeout to 45
 seconds.
 Apr 21 22:03:04 qaxen6 sbd: [12974]: info: Waiting to sign in with
 cluster ...
 Apr 21 22:03:06 qaxen6 sbd: [12974]: info: We don't have a DC right
 now.
 Apr 21 22:03:08 qaxen6 sbd: [12974]: WARN: Node state: UNKNOWN
 Apr 21 22:03:09 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:03:09 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 21 22:03:10 qaxen6 sbd: [12974]: WARN: Node state: pending
 Apr 21 22:03:11 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:15:01 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 21 22:15:01 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 21 22:16:37 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:16:37 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 21 22:25:08 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 21 22:25:08 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 21 22:26:44 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:26:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 21 22:39:24 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 21 22:39:24 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 21 22:42:44 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:42:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 01:36:24 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 01:36:24 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 01:36:34 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 01:36:34 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 06:53:15 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 06:53:15 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 06:54:03 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 06:54:03 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 09:57:21 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 09:57:21 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 09:58:12 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 09:58:12 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 10:59:49 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 10:59:49 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 11:00:41 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 11:00:41 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 11:50:55 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 11:50:55 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 11:51:06 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 11:51:06 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 13:09:12 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 13

[Linux-HA] Resource blocked

2014-04-22 Thread Tom Parker
Good morning

I am trying to restart resources on one of my clusters and I am getting
the message

pengine[13397]:   notice: LogActions: Start   domtcot1-qa(qaxen1
- blocked)

How can I find out why this resource is blocked.
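One way to dig into this, as a sketch using the resource name from the log
line above:

# show placement scores and the full transition the policy engine computed
crm_simulate -s -L | grep -i domtcot1-qa
# look for constraints, target-role/unmanaged settings or failed actions
crm configure show | grep -i domtcot1-qa
crm_mon -1rf | grep -i domtcot1-qa

"blocked" usually means an ordering/colocation dependency or a failed stop
is preventing the start.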

Thanks
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] SBD flipping between Pacemaker: UNHEALTHY and OK

2014-04-22 Thread Tom Parker
I have attached the config files to this e-mail.  The sbd dump is below

[LIVE] qaxen1:~ # sbd -d /dev/mapper/qa-xen-sbd dump
==Dumping header on disk /dev/mapper/qa-xen-sbd
Header version : 2.1
UUID   : ae835596-3d26-4681-ba40-206b4d51149b
Number of slots: 255
Sector size: 512
Timeout (watchdog) : 45
Timeout (allocate) : 2
Timeout (loop) : 1
Timeout (msgwait)  : 90
==Header on disk /dev/mapper/qa-xen-sbd is dumped

On 22/04/14 02:30 PM, emmanuel segura wrote:
 you are missing the cluster configuration, the sbd configuration and
 the multipath config


 2014-04-22 20:21 GMT+02:00 Tom Parker tpar...@cbnco.com:

 Has anyone seen this?  Do you know what might be causing the flapping?

 Apr 21 22:03:03 qaxen6 sbd: [12962]: info: Watchdog enabled.
 Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Servant starting for device
 /dev/mapper/qa-xen-sbd
 Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Monitoring Pacemaker health
 Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Device /dev/mapper/qa-xen-sbd
 uuid: ae835596-3d26-4681-ba40-206b4d51149b
 Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Legacy plug-in detected, AIS
 quorum check enabled
 Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Waiting to sign in with
 cluster ...
 Apr 21 22:03:04 qaxen6 sbd: [12971]: notice: Using watchdog device:
 /dev/watchdog
 Apr 21 22:03:04 qaxen6 sbd: [12971]: info: Set watchdog timeout to 45
 seconds.
 Apr 21 22:03:04 qaxen6 sbd: [12974]: info: Waiting to sign in with
 cluster ...
 Apr 21 22:03:06 qaxen6 sbd: [12974]: info: We don't have a DC right now.
 Apr 21 22:03:08 qaxen6 sbd: [12974]: WARN: Node state: UNKNOWN
 Apr 21 22:03:09 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:03:09 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 21 22:03:10 qaxen6 sbd: [12974]: WARN: Node state: pending
 Apr 21 22:03:11 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:15:01 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 21 22:15:01 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 21 22:16:37 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:16:37 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 21 22:25:08 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 21 22:25:08 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 21 22:26:44 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:26:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 21 22:39:24 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 21 22:39:24 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 21 22:42:44 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:42:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 01:36:24 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 01:36:24 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 01:36:34 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 01:36:34 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 06:53:15 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 06:53:15 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 06:54:03 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 06:54:03 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 09:57:21 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 09:57:21 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 09:58:12 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 09:58:12 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 10:59:49 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 10:59:49 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 11:00:41 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 11:00:41 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 11:50:55 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 11:50:55 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 11:51:06 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 11:51:06 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 13:09:12 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 13:09:12 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 13:09:35 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 13:09:35 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 13:31:35 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 13:31:35 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 13:31:44 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 13:31:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 13:32:52 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 22 13:32:52 qaxen6 sbd: [12971]: WARN: Pacemaker health check:
 UNHEALTHY
 Apr 22 13:33:01 qaxen6 sbd: [12974]: info: Node state: online
 Apr 22 13:33:01 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
 Apr 22 13:44:39 qaxen6 sbd: [12974]: WARN

Re: [Linux-HA] /usr/sbin/lrmadmin missing from cluster-glue

2014-01-24 Thread Tom Parker
Thanks Kristoffer.

How is tuning done for lrm now?

Tom
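For what it's worth: the main knob lrmadmin used to expose, the number of
operations the LRM runs concurrently, is nowadays usually set through the
Pacemaker sysconfig environment rather than at runtime; a sketch, where the
value is only an example and the variable name assumes a reasonably recent
Pacemaker:

# /etc/sysconfig/pacemaker
# maximum number of jobs the local resource management daemon runs at once
PCMK_node_action_limit=8

There is also a matching node-action-limit cluster property.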

On 01/24/2014 01:41 AM, Kristoffer Grönlund wrote:
 On Sat, 28 Dec 2013 11:18:44 -0500
 Tom Parker tpar...@cbnco.com wrote:

 Hello

 /usr/sbin/lrmadmin is missing from the latest version of cluster-glue
 in SLES SP3.  Has the program been deprecated or is this an issue in
 the packaging of the RPM?

 Hi,

 I know this is a bit late, but I just discovered this email. Yes,
 lrmadmin has been deprecated since it is incompatible with recent
 versions of pacemaker.

 Thanks

 Tom
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems




___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] /usr/sbin/lrmadmin missing from cluster-glue

2013-12-28 Thread Tom Parker
Hello

/usr/sbin/lrmadmin is missing from the latest version of cluster-glue in
SLES SP3.  Has the program been deprecated or is this an issue in the
packaging of the RPM?

Thanks

Tom
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Xen XL Resource Agent

2013-11-18 Thread Tom Parker
My thoughts were to create a new RA and let admins choose. If you would prefer 
to auto-detect that is an option as well.

Tom

From: Lars Marowsky-Bree
Sent: Monday, November 18, 2013 8:27 AM
To: General Linux-HA mailing list
Reply To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Antw: Xen XL Resource Agent


On 2013-11-15T09:05:53, Tom Parker tpar...@cbnco.com wrote:

 The XL tools are much faster and lighter weight.   I am not sure if they
 report proper codes (I will have to test) but the XM stack has been
 deprecated so at some point I assume it will go away completely.

The Xen RA already supports xen-list and xen-destroy in addition to
the xm tools. Patches to additionally support xl are welcome.

(Auto-detect what is available, and then choose xl -> xen-* -> xm.)

We can't yet drop xm, since not all environments have xl yet.
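A rough sketch of that detection order, using the have_binary helper from
ocf-shellfuncs (the variable name is made up):

# prefer the newest available tool stack: xl, then xen-list/xen-destroy, then xm
if have_binary xl; then
        xentool="xl"
elif have_binary xen-list && have_binary xen-destroy; then
        xentool="xen-list"      # paired with xen-destroy for stop
else
        xentool="xm"
fi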


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Xen XL Resource Agent

2013-11-15 Thread Tom Parker
The XL tools are much faster and lighter weight.   I am not sure if they
report proper codes (I will have to test) but the XM stack has been
deprecated so at some point I assume it will go away completely.

[LIVE] qaxen1:~ # time xl list
Name    ID   Mem VCPUs  State  Time(s)
Domain-0 0 2534712 r- 146383.6

Domain list removed

real    0m0.053s
user    0m0.000s
sys     0m0.008s


[LIVE] qaxen1:~ # time xm list
Name    ID   Mem VCPUs  State  Time(s)
Domain-0 0 2534712 r- 146381.1

Domain list removed

real    0m0.352s
user    0m0.236s
sys     0m0.036s

On 11/15/2013 02:04 AM, Ulrich Windl wrote:
 Tom Parker tpar...@cbnco.com wrote on 14.11.2013 at 19:23 in message
 5285150b.9050...@cbnco.com:
 Hello.

 Now that XM has been deprecated is anyone working on a Xen RA that uses
 the xl tool stack? 
 I wonder whether xl will (as opposed to xm) report proper exit codes if
 operations fail. Otherwise I don't see a reason to change tools. MHO...

 I am willing to do the work but I don't want to duplicate the effort if
 someone else is doing/has already done it.

 Tom
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org 
 http://lists.linux-ha.org/mailman/listinfo/linux-ha 
 See also: http://linux-ha.org/ReportingProblems 

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Xen XL Resource Agent

2013-11-14 Thread Tom Parker
Hello.

Now that XM has been deprecated is anyone working on a Xen RA that uses
the xl tool stack? 

I am willing to do the work but I don't want to duplicate the effort if
someone else is doing/has already done it.

Tom
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] How many primitives, groups can I have

2013-11-11 Thread Tom Parker
You will also have to be careful of the shared memory size between the nodes. I 
had issues with massive cibs. Setting some environment variables fixed the 
issue but the defaults are too small.

From: Digimer
Sent: Monday, November 11, 2013 10:24 AM
To: General Linux-HA mailing list
Reply To: General Linux-HA mailing list
Subject: Re: [Linux-HA] How many primitives, groups can I have


On 11/11/13 07:57, Michael Brookhuis wrote:
 Hi,

 Is there a limit on the number of primitives, etc. you can have?
 What maximum number is recommended based on best practices?

 Are 1500 too many?

 Thanks
 Mima

The cib will be very large, so pushing changes to other nodes will take
time (especially if you have many nodes). I suspect you will run into
corosync timeouts before you hit any coded upper limits. You will likely
have to play with corosync timing values to get that high, assuming your
network is fast enough at all.

But in the end, as I understand it, there is no coded upper limit.
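The corosync timing values in question live in the totem section of
corosync.conf; a sketch with example numbers only:

totem {
        token: 10000                             # ms before a lost token is declared
        token_retransmits_before_loss_const: 10
        consensus: 12000                         # must be larger than token
}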

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] How many primitives, groups can I have

2013-11-11 Thread Tom Parker
I have found my settings.

I needed to set the following in /etc/sysconfig/pacemaker

# Force use of a particular class of IPC connection
# PCMK_ipc_type=shared-mem|socket|posix|sysv
export PCMK_ipc_type=shared-mem

# Specify an IPC buffer size in bytes
# Useful when connecting to really big clusters that exceed the default 20k buffer
# PCMK_ipc_buffer=20480
export PCMK_ipc_buffer=2048

and in my bashrc file (for the crm tools to work properly) I have

[LIVE] qaxen1:~ # cat .bashrc
# Load Pacemaker IPC settings for crm
PACEMAKER_SYSCONFIG=/etc/sysconfig/pacemaker
if [ -f $PACEMAKER_SYSCONFIG ]; then
    . $PACEMAKER_SYSCONFIG
fi


Hope this helps.

On 11/11/2013 03:35 PM, Tom Parker wrote:
 You will also have to be careful of the shared memory size between the nodes. 
 I had issues with massive cibs. Setting some environment variables fixed the 
 issue but the defaults are too small.

 From: Digimer
 Sent: Monday, November 11, 2013 10:24 AM
 To: General Linux-HA mailing list
 Reply To: General Linux-HA mailing list
 Subject: Re: [Linux-HA] How many primitives, groups can I have


 On 11/11/13 07:57, Michael Brookhuis wrote:
 Hi,

 Is there a limit on the number of primitives, etc. you can have?
 What maximum number is recommended based on best practices?

 Are 1500 too many?

 Thanks
 Mima
 The cib will be very large, so pushing changes to other nodes will take
 time (specially if you have many nodes). I suspect you will run into
 corosync timeouts before you hit any coded upper limits. You will likely
 have to play with corosync timing values to get that high, assuming your
 network is fast enough at all.

 But in the end, as I understand it, there is no coded upper limit.

 --
 Digimer
 Papers and Projects: https://alteeve.ca/w/
 What if the cure for cancer is trapped in the mind of a person without
 access to education?
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Xen RA and rebooting

2013-10-21 Thread Tom Parker
Thanks for the feedback.  Dejan, I have some SLES nodes that are running
around 30 pretty heavy VMs, and I found that while it never takes the full
5s, the time it takes to reboot is not constant.  I have a feeling that
this bug in xen-list may take a while to be fixed upstream and trickle
down into the released xen packages, so we may be using this fix for a
while.

The full longdesc now reads:

<longdesc lang="en">
When the guest is rebooting, there is a short interval where the guest
completely disappears from "xm list", which, in turn, will cause the monitor
operation to return a "not running" status.

If a monitor status returns "not running", then test status
again for wait_for_reboot seconds (perhaps it'll show up).

NOTE: This timer increases the amount of time the cluster will
wait before declaring a VM dead and recovering it.
</longdesc>
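Used from the cluster configuration, the new parameter would look something
like this (a sketch; resource name, file and values are examples):

primitive vm-example ocf:heartbeat:Xen \
        params xmfile="/etc/xen/vm/example" wait_for_reboot="10" \
        op monitor interval="30s" timeout="60s" \
        op start timeout="120s" op stop timeout="180s"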

Tom

On 10/21/2013 03:28 AM, Ulrich Windl wrote:
 When the guest is rebooting, there is a short interval where the guest
 completely disappears from xm list, which, in turn, will cause the monitor
 operation to return a not running status. If the guest cannot be found , 
 this
 value will cause some extra delay in the monitor operation to work around the
 problem.

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Xen RA and rebooting

2013-10-21 Thread Tom Parker
Hi Dejan.

How can I revert my commits so that they do not include multiple
things? I will submit one patch with the logging cleanup and then, if
needed, another with my changes to the meta-data.
Tom

On 10/21/2013 09:39 AM, Dejan Muhamedagic wrote:
 Hi Ulrich!

 On Mon, Oct 21, 2013 at 09:28:50AM +0200, Ulrich Windl wrote:
 Hi!

 Basically I think there should be no hard-coded constants whose value depends
 on some performance measurements, like 5s for rebooting a VM.
 It's actually not 5s, but the status is run 5 times. If the load
 is high, my guess is that the Xen tools used by the RA would
 suffer proportionally.

 So I support
 Tom's changes.

 However I noticed:

 +running; apparently, this period lasts only for a second or
 +two

 (missing full stop at end of sentence)
 That's at the end of the comment and, typically, comments end
 with a carriage return (as is here the case).

 Actually I'd rephrase the description:

 When the guest is rebooting, there is a short interval where the guest
 completely disappears from xm list, which, in turn, will cause the monitor
 operation to return a not running status. If the guest cannot be found , 
 this
 value will cause some extra delay in the monitor operation to work around the
 problem.

 (I.e. try to describe the effect, not the implementation)
 That's the code, so the implementation is described. The very
 top of the comment says:

   # If the guest is rebooting, it may completely disappear from the
   # list of defined guests

 I was hoping that that was enough of an explanation. Look for
 a more thorough description of the cause in the changelog. BTW,
 note that this is a _workaround_ and that the thing should
 eventually be fixed in Xen.

 And yes, I appreciate consistent log formats also ;-)
 That's always welcome, of course. It should also go in a
 separate commit.

 Thanks,

 Dejan

 Regards,
 Ulrich

 Tom Parker tpar...@cbnco.com wrote on 18.10.2013 at 19:30 in message
 5261703a.5070...@cbnco.com:
 Hi Dejan.  Sorry to be slow to respond to this.  I have done some
 testing and everything looks good. 

 I spent some time tweaking the RA and I added a parameter called
 wait_for_reboot (default 5s) to allow us to override the reboot sleep
 times (in case it's more than 5 seconds on really loaded hypervisors). 
 I also cleaned up a few log entries to make them consistent in the RA
 and edited your entries for xen status to be a little bit more clear as
 to why we think we should be waiting. 

 I have attached a patch here because I have NO idea how to create a
 branch and pull request.  If there are links to a good place to start I
 may be able to contribute occasionally to some other RAs that I use.

 Please let me know what you think.

 Thanks for your help

 Tom


 On 10/17/2013 06:10 AM, Dejan Muhamedagic wrote:
 On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
 Hi Tom,

 On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
 Some more reading of the source code makes me think the  || [
 $__OCF_ACTION != stop ]; is not needed. 
 Yes, you're right. I'll drop that part of the if statement. Many
 thanks for testing.
 Fixed now. The if statement, which was obviously hard to follow,
 got relegated to the monitor function.  Which makes the
 Xen_Status_with_Retry really stand for what's happening in there ;-)

 Tom, hope you can test again.

 Cheers,

 Dejan

 Cheers,

 Dejan

 Xen_Status_with_Retry() is only called from Stop and Monitor so we only
 need to check if it's a probe.  Everything else should be handled in the
 case statement in the loop.

 Tom

 On 10/16/2013 05:16 PM, Tom Parker wrote:
 Hi.  I think there is an issue with the Updated Xen RA.

 I think there is an issue with the if statement here but I am not sure.
 I may be confused about how bash || works but I don't see my servers
 ever entering the loop on a vm disappearing.

 if ocf_is_probe || [ $__OCF_ACTION != stop ]; then
 return $rc
 fi

 Does this not mean that if we run a monitor operation that is not a
 probe we will have:

 (ocf_is_probe) return false
 (stop != monitor) return true
 (false || true) return true

 which will cause the if statement to return $rc and never enter the
 loop? 
 Xen_Status_with_Retry() {
   local rc cnt=5

   Xen_Status $1
   rc=$?
   if ocf_is_probe || [ $__OCF_ACTION != stop ]; then
 return $rc
   fi
   while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
 case $__OCF_ACTION in
 stop)
   ocf_log debug domain $1 reported as not running, waiting
 $cnt
 seconds ...
   ;;
 monitor)
   ocf_log warn domain $1 reported as not running, but it is
 expected to be running! Retrying for $cnt seconds ...
   ;;
 *) : not reachable
 ;;
 esac
 sleep 1
 Xen_Status $1
 rc=$?
 let cnt=$((cnt-1))
   done
   return $rc
 }



 On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
 Hi Tom,

 On Tue, Oct 15

Re: [Linux-HA] Antw: Re: Xen RA and rebooting

2013-10-18 Thread Tom Parker
Hi Dejan.  Sorry to be slow to respond to this.  I have done some
testing and everything looks good. 

I spent some time tweaking the RA and I added a parameter called
wait_for_reboot (default 5s) to allow us to override the reboot sleep
times (in case it's more than 5 seconds on really loaded hypervisors). 
I also cleaned up a few log entries to make them consistent in the RA
and edited your entries for xen status to be a little bit more clear as
to why we think we should be waiting. 

I have attached a patch here because I have NO idea how to create a
branch and pull request.  If there are links to a good place to start I
may be able to contribute occasionally to some other RAs that I use.
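For the record, the usual GitHub flow is roughly this (a sketch; the fork
URL and branch name are placeholders, the upstream repository is the one
linked earlier in the thread):

git clone https://github.com/<your-fork>/resource-agents.git
cd resource-agents
git checkout -b xen-wait-for-reboot
# edit heartbeat/Xen, then:
git commit -a -m "Xen: add wait_for_reboot parameter"
git push origin xen-wait-for-reboot
# finally open a pull request against ClusterLabs/resource-agents on github.com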

Please let me know what you think.

Thanks for your help

Tom


On 10/17/2013 06:10 AM, Dejan Muhamedagic wrote:
 On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
 Hi Tom,

 On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
 Some more reading of the source code makes me think the  || [
 $__OCF_ACTION != stop ]; is not needed. 
 Yes, you're right. I'll drop that part of the if statement. Many
 thanks for testing.
 Fixed now. The if statement, which was obviously hard to follow,
 got relegated to the monitor function.  Which makes the
 Xen_Status_with_Retry really stand for what's happening in there ;-)

 Tom, hope you can test again.

 Cheers,

 Dejan

 Cheers,

 Dejan

 Xen_Status_with_Retry() is only called from Stop and Monitor so we only
 need to check if it's a probe.  Everything else should be handled in the
 case statement in the loop.

 Tom

 On 10/16/2013 05:16 PM, Tom Parker wrote:
 Hi.  I think there is an issue with the Updated Xen RA.

 I think there is an issue with the if statement here but I am not sure. 
 I may be confused about how bash || works but I don't see my servers
 ever entering the loop on a vm disappearing.

 if ocf_is_probe || [ $__OCF_ACTION != stop ]; then
 return $rc
 fi

 Does this not mean that if we run a monitor operation that is not a
 probe we will have:

 (ocf_is_probe) return false
 (stop != monitor) return true
 (false || true) return true

 which will cause the if statement to return $rc and never enter the loop? 

 Xen_Status_with_Retry() {
   local rc cnt=5

   Xen_Status $1
   rc=$?
   if ocf_is_probe || [ $__OCF_ACTION != stop ]; then
 return $rc
   fi
   while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
 case $__OCF_ACTION in
 stop)
   ocf_log debug domain $1 reported as not running, waiting $cnt
 seconds ...
   ;;
 monitor)
   ocf_log warn domain $1 reported as not running, but it is
 expected to be running! Retrying for $cnt seconds ...
   ;;
 *) : not reachable
 ;;
 esac
 sleep 1
 Xen_Status $1
 rc=$?
 let cnt=$((cnt-1))
   done
   return $rc
 }



 On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
 Hi Tom,

 On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
 Hi Dejan

 Just a quick question.  I cannot see your new log messages being logged
 to syslog

 ocf_log warn domain $1 reported as not running, but it is expected to
 be running! Retrying for $cnt seconds ...

 Do you know where I can set my logging to see warn level messages?  I
 expected to see them in my testing by default but that does not seem to
 be true.
 You should see them by default. But note that these warnings may
 not happen, depending on the circumstances on your host. In my
 experiments they were logged only while the guest was rebooting
 and then just once or maybe twice. If you have recent
 resource-agents and crmsh, you can enable operation tracing (with
 crm resource trace rsc monitor interval).

 Thanks,

 Dejan

 Thanks

 Tom


 On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
 Hi,

 On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
 Hi!

 I thought, I'll never be bitten by this bug, but I actually was! Now 
 I'm
 wondering whether the Xen RA sees the guest if you use pygrub, and 
 pygrub is
 still counting down for actual boot...

 But the reason why I'm writing is that I think I've discovered another 
 bug in
 the RA:

 CRM decided to recover the guest VM v02:
 [...]
 lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 
 14906:
 pid 19516 exited with return code 7
 [...]
  pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started 
 h05)
 [...]
  crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
 prm_xen_v02_stop_0 on h05 (local)
 [...]
 Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
 [...]
 lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 
 14906: pid
 19552 exited with return code 0
 [...]
 crmd: [14906]: info: te_rsc_command: Initiating action 78: start
 prm_xen_v02_start_0 on h05 (local)
 lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
 [...]
 lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: 
 Domain 'v02

Re: [Linux-HA] Antw: Re: Xen RA and rebooting

2013-10-18 Thread Tom Parker
I may have actually created the pull request properly... 

Please let me know and again thanks for your help.

Tom

On 10/18/2013 01:30 PM, Tom Parker wrote:
 Hi Dejan.  Sorry to be slow to respond to this.  I have done some
 testing and everything looks good. 

 I spent some time tweaking the RA and I added a parameter called
 wait_for_reboot (default 5s) to allow us to override the reboot sleep
 times (in case it's more than 5 seconds on really loaded hypervisors). 
 I also cleaned up a few log entries to make them consistent in the RA
 and edited your entries for xen status to be a little bit more clear as
 to why we think we should be waiting. 

 I have attached a patch here because I have NO idea how to create a
 branch and pull request.  If there are links to a good place to start I
 may be able to contribute occasionally to some other RAs that I use.

 Please let me know what you think.

 Thanks for your help

 Tom


 On 10/17/2013 06:10 AM, Dejan Muhamedagic wrote:
 On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
 Hi Tom,

 On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
 Some more reading of the source code makes me think the  || [
 $__OCF_ACTION != stop ]; is not needed. 
 Yes, you're right. I'll drop that part of the if statement. Many
 thanks for testing.
 Fixed now. The if statement, which was obviously hard to follow,
 got relegated to the monitor function.  Which makes the
 Xen_Status_with_Retry really stand for what's happening in there ;-)

 Tom, hope you can test again.

 Cheers,

 Dejan

 Cheers,

 Dejan

 Xen_Status_with_Retry() is only called from Stop and Monitor so we only
 need to check if it's a probe.  Everything else should be handled in the
 case statement in the loop.

 Tom

 On 10/16/2013 05:16 PM, Tom Parker wrote:
 Hi.  I think there is an issue with the Updated Xen RA.

 I think there is an issue with the if statement here but I am not sure. 
 I may be confused about how bash || works but I don't see my servers
 ever entering the loop on a vm disappearing.

 if ocf_is_probe || [ $__OCF_ACTION != stop ]; then
 return $rc
 fi

 Does this not mean that if we run a monitor operation that is not a
 probe we will have:

 (ocf_is_probe) return false
 (stop != monitor) return true
 (false || true) return true

 which will cause the if statement to return $rc and never enter the loop? 

 Xen_Status_with_Retry() {
   local rc cnt=5

   Xen_Status $1
   rc=$?
   if ocf_is_probe || [ $__OCF_ACTION != stop ]; then
 return $rc
   fi
   while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
 case $__OCF_ACTION in
 stop)
   ocf_log debug domain $1 reported as not running, waiting $cnt
 seconds ...
   ;;
 monitor)
   ocf_log warn domain $1 reported as not running, but it is
 expected to be running! Retrying for $cnt seconds ...
   ;;
 *) : not reachable
 ;;
 esac
 sleep 1
 Xen_Status $1
 rc=$?
 let cnt=$((cnt-1))
   done
   return $rc
 }



 On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
 Hi Tom,

 On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
 Hi Dejan

 Just a quick question.  I cannot see your new log messages being logged
 to syslog

 ocf_log warn domain $1 reported as not running, but it is expected to
 be running! Retrying for $cnt seconds ...

 Do you know where I can set my logging to see warn level messages?  I
 expected to see them in my testing by default but that does not seem to
 be true.
 You should see them by default. But note that these warnings may
 not happen, depending on the circumstances on your host. In my
 experiments they were logged only while the guest was rebooting
 and then just once or maybe twice. If you have recent
 resource-agents and crmsh, you can enable operation tracing (with
 crm resource trace rsc monitor interval).

 Thanks,

 Dejan

 Thanks

 Tom


 On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
 Hi,

 On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
 Hi!

 I thought, I'll never be bitten by this bug, but I actually was! Now 
 I'm
 wondering whether the Xen RA sees the guest if you use pygrub, and 
 pygrub is
 still counting down for actual boot...

 But the reason why I'm writing is that I think I've discovered 
 another bug in
 the RA:

 CRM decided to recover the guest VM v02:
 [...]
 lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 
 14906:
 pid 19516 exited with return code 7
 [...]
  pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started 
 h05)
 [...]
  crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
 prm_xen_v02_stop_0 on h05 (local)
 [...]
 Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
 [...]
 lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 
 14906: pid
 19552 exited with return code 0
 [...]
 crmd: [14906]: info: te_rsc_command: Initiating action 78: start

Re: [Linux-HA] Antw: Re: Xen RA and rebooting

2013-10-16 Thread Tom Parker
Hi.  I think there is an issue with the Updated Xen RA.

I think there is an issue with the if statement here but I am not sure. 
I may be confused about how bash || works but I don't see my servers
ever entering the loop on a vm disappearing.

if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
    return $rc
fi

Does this not mean that if we run a monitor operation that is not a
probe we will have:

(ocf_is_probe) return false
(stop != monitor) return true
(false || true) return true

which will cause the if statement to return $rc and never enter the loop? 

Xen_Status_with_Retry() {
  local rc cnt=5

  Xen_Status $1
  rc=$?
  if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
    return $rc
  fi
  while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
    case "$__OCF_ACTION" in
    stop)
      ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
      ;;
    monitor)
      ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
      ;;
    *) : not reachable
      ;;
    esac
    sleep 1
    Xen_Status $1
    rc=$?
    let cnt=$((cnt-1))
  done
  return $rc
}



On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
 Hi Tom,

 On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
 Hi Dejan

 Just a quick question.  I cannot see your new log messages being logged
 to syslog

 ocf_log warn domain $1 reported as not running, but it is expected to
 be running! Retrying for $cnt seconds ...

 Do you know where I can set my logging to see warn level messages?  I
 expected to see them in my testing by default but that does not seem to
 be true.
 You should see them by default. But note that these warnings may
 not happen, depending on the circumstances on your host. In my
 experiments they were logged only while the guest was rebooting
 and then just once or maybe twice. If you have recent
 resource-agents and crmsh, you can enable operation tracing (with
 crm resource trace rsc monitor interval).

 Thanks,

 Dejan

 Thanks

 Tom


 On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
 Hi,

 On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
 Hi!

 I thought, I'll never be bitten by this bug, but I actually was! Now I'm
 wondering whether the Xen RA sees the guest if you use pygrub, and pygrub 
 is
 still counting down for actual boot...

 But the reason why I'm writing is that I think I've discovered another bug 
 in
 the RA:

 CRM decided to recover the guest VM v02:
 [...]
 lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906:
 pid 19516 exited with return code 7
 [...]
  pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
 [...]
  crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
 prm_xen_v02_stop_0 on h05 (local)
 [...]
 Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
 [...]
 lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: 
 pid
 19552 exited with return code 0
 [...]
 crmd: [14906]: info: te_rsc_command: Initiating action 78: start
 prm_xen_v02_start_0 on h05 (local)
 lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
 [...]
 lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 
 'v02'
 already exists with ID '3'
 lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config 
 file
 /etc/xen/vm/v02.
 [...]
 lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: 
 pid
 19686 exited with return code 1
 [...]
 crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0
 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
 crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05
 failed (target: 0 vs. rc: 1): Error
 [...]

 As you can clearly see start failed, because the guest was found up 
 already!
 IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
 Yes, I've seen that. It's basically the same issue, i.e. the
 domain being gone for a while and then reappearing.

 I guess the following test is problematic:
 ---
   xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
   rc=$?
   if [ $rc -ne 0 ]; then
 return $OCF_ERR_GENERIC
 ---
 Here xm create probably fails if the guest is already created...
 It should fail too. Note that this is a race, but the race is
 anyway caused by the strange behaviour of xen. With the recent
 fix (or workaround) in the RA, this shouldn't be happening.

 Thanks,

 Dejan

 Regards,
 Ulrich


 Dejan Muhamedagic deja...@fastmail.fm wrote on 01.10.2013 at 12:24 in
 message 20131001102430.GA4687@walrus.homenet:
 Hi,

 On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
 On 2013-10-01T00:53:15, Tom Parker tpar...@cbnco.com wrote:

 Thanks for paying attention to this issue (not really a bug) as I am
 sure I am not the only one with this issue.  For now I have set all my
 VMs to destroy so

Re: [Linux-HA] Antw: Re: Xen RA and rebooting

2013-10-16 Thread Tom Parker
Some more reading of the source code makes me think the
|| [ "$__OCF_ACTION" != "stop" ] part is not needed.

Xen_Status_with_Retry() is only called from Stop and Monitor so we only
need to check if it's a probe.  Everything else should be handled in the
case statement in the loop.
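In other words the guard could be reduced to just checking for a probe (a
sketch of the simplification being discussed):

if ocf_is_probe; then
    return $rc
fi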

Tom

On 10/16/2013 05:16 PM, Tom Parker wrote:
 Hi.  I think there is an issue with the Updated Xen RA.

 I think there is an issue with the if statement here but I am not sure. 
 I may be confused about how bash || works but I don't see my servers
 ever entering the loop on a vm disappearing.

 if ocf_is_probe || [ $__OCF_ACTION != stop ]; then
 return $rc
 fi

 Does this not mean that if we run a monitor operation that is not a
 probe we will have:

 (ocf_is_probe) return false
 (stop != monitor) return true
 (false || true) return true

 which will cause the if statement to return $rc and never enter the loop? 

 Xen_Status_with_Retry() {
   local rc cnt=5

   Xen_Status $1
   rc=$?
   if ocf_is_probe || [ $__OCF_ACTION != stop ]; then
 return $rc
   fi
   while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
 case $__OCF_ACTION in
 stop)
   ocf_log debug domain $1 reported as not running, waiting $cnt
 seconds ...
   ;;
 monitor)
   ocf_log warn domain $1 reported as not running, but it is
 expected to be running! Retrying for $cnt seconds ...
   ;;
 *) : not reachable
 ;;
 esac
 sleep 1
 Xen_Status $1
 rc=$?
 let cnt=$((cnt-1))
   done
   return $rc
 }



 On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
 Hi Tom,

 On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
 Hi Dejan

 Just a quick question.  I cannot see your new log messages being logged
 to syslog

 ocf_log warn domain $1 reported as not running, but it is expected to
 be running! Retrying for $cnt seconds ...

 Do you know where I can set my logging to see warn level messages?  I
 expected to see them in my testing by default but that does not seem to
 be true.
 You should see them by default. But note that these warnings may
 not happen, depending on the circumstances on your host. In my
 experiments they were logged only while the guest was rebooting
 and then just once or maybe twice. If you have recent
 resource-agents and crmsh, you can enable operation tracing (with
 crm resource trace rsc monitor interval).

 Thanks,

 Dejan

 Thanks

 Tom


 On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
 Hi,

 On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
 Hi!

 I thought, I'll never be bitten by this bug, but I actually was! Now I'm
 wondering whether the Xen RA sees the guest if you use pygrub, and pygrub 
 is
 still counting down for actual boot...

 But the reason why I'm writing is that I think I've discovered another 
 bug in
 the RA:

 CRM decided to recover the guest VM v02:
 [...]
 lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 
 14906:
 pid 19516 exited with return code 7
 [...]
  pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
 [...]
  crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
 prm_xen_v02_stop_0 on h05 (local)
 [...]
 Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
 [...]
 lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: 
 pid
 19552 exited with return code 0
 [...]
 crmd: [14906]: info: te_rsc_command: Initiating action 78: start
 prm_xen_v02_start_0 on h05 (local)
 lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
 [...]
 lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 
 'v02'
 already exists with ID '3'
 lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config 
 file
 /etc/xen/vm/v02.
 [...]
 lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: 
 pid
 19686 exited with return code 1
 [...]
 crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0
 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
 crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on 
 h05
 failed (target: 0 vs. rc: 1): Error
 [...]

 As you can clearly see start failed, because the guest was found up 
 already!
 IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
 Yes, I've seen that. It's basically the same issue, i.e. the
 domain being gone for a while and then reappearing.

 I guess the following test is problematic:
 ---
   xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
   rc=$?
   if [ $rc -ne 0 ]; then
 return $OCF_ERR_GENERIC
 ---
 Here xm create probably fails if the guest is already created...
 It should fail too. Note that this is a race, but the race is
 anyway caused by the strange behaviour of xen. With the recent
 fix (or workaround) in the RA, this shouldn't be happening.

 Thanks,

 Dejan

 Regards,
 Ulrich


 Dejan Muhamedagic deja...@fastmail.fm wrote on 01.10.2013

Re: [Linux-HA] Antw: Re: Xen RA and rebooting

2013-10-15 Thread Tom Parker
Hi Dejan

Just a quick question.  I cannot see your new log messages being logged
to syslog

ocf_log warn "domain $1 reported as not running, but it is expected to
be running! Retrying for $cnt seconds ..."

Do you know where I can set my logging to see warn level messages?  I
expected to see them in my testing by default but that does not seem to
be true.

Thanks

Tom


On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
 Hi,

 On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
 Hi!

 I thought, I'll never be bitten by this bug, but I actually was! Now I'm
 wondering whether the Xen RA sees the guest if you use pygrub, and pygrub is
 still counting down for actual boot...

 But the reason why I'm writing is that I think I've discovered another bug in
 the RA:

 CRM decided to recover the guest VM v02:
 [...]
 lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906:
 pid 19516 exited with return code 7
 [...]
  pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
 [...]
  crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
 prm_xen_v02_stop_0 on h05 (local)
 [...]
 Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
 [...]
 lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid
 19552 exited with return code 0
 [...]
 crmd: [14906]: info: te_rsc_command: Initiating action 78: start
 prm_xen_v02_start_0 on h05 (local)
 lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
 [...]
 lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 
 'v02'
 already exists with ID '3'
 lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file
 /etc/xen/vm/v02.
 [...]
 lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid
 19686 exited with return code 1
 [...]
 crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0
 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
 crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05
 failed (target: 0 vs. rc: 1): Error
 [...]

 As you can clearly see start failed, because the guest was found up 
 already!
 IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
 Yes, I've seen that. It's basically the same issue, i.e. the
 domain being gone for a while and then reappearing.

 I guess the following test is problematic:
 ---
   xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
   rc=$?
   if [ $rc -ne 0 ]; then
 return $OCF_ERR_GENERIC
 ---
 Here xm create probably fails if the guest is already created...
 It should fail too. Note that this is a race, but the race is
 anyway caused by the strange behaviour of xen. With the recent
 fix (or workaround) in the RA, this shouldn't be happening.

 Thanks,

 Dejan

 Regards,
 Ulrich


 Dejan Muhamedagic deja...@fastmail.fm wrote on 01.10.2013 at 12:24 in
 message 20131001102430.GA4687@walrus.homenet:
 Hi,

 On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
 On 2013-10-01T00:53:15, Tom Parker tpar...@cbnco.com wrote:

 Thanks for paying attention to this issue (not really a bug) as I am
 sure I am not the only one with this issue.  For now I have set all my
 VMs to destroy so that the cluster is the only thing managing them but
 this is not super clean as I get failures in my logs that are not really
 failures.
 It is very much a severe bug.

 The Xen RA has gained a workaround for this now, but we're also pushing
 Take a look here:

 https://github.com/ClusterLabs/resource-agents/pull/314 

 Thanks,

 Dejan

 the Xen team (where the real problem is) to investigate and fix.


 Regards,
 Lars

 -- 
 Architect Storage/HA
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
 HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org 
 http://lists.linux-ha.org/mailman/listinfo/linux-ha 
 See also: http://linux-ha.org/ReportingProblems 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org 
 http://lists.linux-ha.org/mailman/listinfo/linux-ha 
 See also: http://linux-ha.org/ReportingProblems 

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Xen RA and rebooting

2013-10-10 Thread Tom Parker
This scares me too.  If the start operation finds a running VM and
fails, my cluster config will automatically try to start the same VM on
the next node it has available.  This scenario almost guarantees
duplicate VMs even if I have on_reboot=destroy set.

Dejan, I am not sure, but I don't think your patch will take care of
this.  In my opinion a start that finds a running domain should return
success (the VM should be started, and it is).
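In OCF terms that is the usual idempotent-start convention; a sketch of what
the check could look like in the RA (function and variable names follow the
existing agent, the body is illustrative only):

Xen_Start() {
    if Xen_Status ${DOMAIN_NAME}; then
        ocf_log info "Xen domain $DOMAIN_NAME already running."
        return $OCF_SUCCESS    # start on an already-running domain is a success
    fi
    # ... fall through to the normal xm/xl create path ...
}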

Tom

On 10/08/2013 07:52 AM, Ulrich Windl wrote:
 Hi!

 I thought, I'll never be bitten by this bug, but I actually was! Now I'm
 wondering whether the Xen RA sees the guest if you use pygrub, and pygrub is
 still counting down for actual boot...

 But the reason why I'm writing is that I think I've discovered another bug in
 the RA:

 CRM decided to recover the guest VM v02:
 [...]
 lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906:
 pid 19516 exited with return code 7
 [...]
  pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
 [...]
  crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
 prm_xen_v02_stop_0 on h05 (local)
 [...]
 Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
 [...]
 lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid
 19552 exited with return code 0
 [...]
 crmd: [14906]: info: te_rsc_command: Initiating action 78: start
 prm_xen_v02_start_0 on h05 (local)
 lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
 [...]
 lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02'
 already exists with ID '3'
 lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file
 /etc/xen/vm/v02.
 [...]
 lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid
 19686 exited with return code 1
 [...]
 crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0
 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
 crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05
 failed (target: 0 vs. rc: 1): Error
 [...]

 As you can clearly see start failed, because the guest was found up already!
 IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).

 I guess the following test is problematic:
 ---
   xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
   rc=$?
   if [ $rc -ne 0 ]; then
    return $OCF_ERR_GENERIC
  fi
  ---
 Here xm create probably fails if the guest is already created...

 Regards,
 Ulrich


 Dejan Muhamedagic deja...@fastmail.fm schrieb am 01.10.2013 um 12:24 in
 Nachricht 20131001102430.GA4687@walrus.homenet:
 Hi,

 On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
 On 2013-10-01T00:53:15, Tom Parker tpar...@cbnco.com wrote:

 Thanks for paying attention to this issue (not really a bug) as I am
 sure I am not the only one with this issue.  For now I have set all my
 VMs to destroy so that the cluster is the only thing managing them but
 this is not super clean as I get failures in my logs that are not really
 failures.
 It is very much a severe bug.

 The Xen RA has gained a workaround for this now, but we're also pushing
 Take a look here:

 https://github.com/ClusterLabs/resource-agents/pull/314 

 Thanks,

 Dejan

 the Xen team (where the real problem is) to investigate and fix.


 Regards,
 Lars

 -- 
 Architect Storage/HA
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
 HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Antw: Re: Xen RA and rebooting

2013-10-10 Thread Tom Parker
I want to test the updated RA.  Does anyone know how I can increase the
loglevel to warn or debug without restarting my cluster?  I am not
seeing any of the new messages in my logs.

Tom

On 10/08/2013 07:52 AM, Ulrich Windl wrote:
 Hi!

 I thought, I'll never be bitten by this bug, but I actually was! Now I'm
 wondering whether the Xen RA sees the guest if you use pygrub, and pygrub is
 still counting down for actual boot...

 But the reason why I'm writing is that I think I've discovered another bug in
 the RA:

 CRM decided to recover the guest VM v02:
 [...]
 lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906:
 pid 19516 exited with return code 7
 [...]
  pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
 [...]
  crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
 prm_xen_v02_stop_0 on h05 (local)
 [...]
 Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
 [...]
 lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid
 19552 exited with return code 0
 [...]
 crmd: [14906]: info: te_rsc_command: Initiating action 78: start
 prm_xen_v02_start_0 on h05 (local)
 lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
 [...]
 lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02'
 already exists with ID '3'
 lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file
 /etc/xen/vm/v02.
 [...]
 lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid
 19686 exited with return code 1
 [...]
 crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0
 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
 crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05
 failed (target: 0 vs. rc: 1): Error
 [...]

 As you can clearly see start failed, because the guest was found up already!
 IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).

 I guess the following test is problematic:
 ---
   xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
   rc=$?
   if [ $rc -ne 0 ]; then
    return $OCF_ERR_GENERIC
  fi
  ---
 Here xm create probably fails if the guest is already created...

 Regards,
 Ulrich


 Dejan Muhamedagic deja...@fastmail.fm schrieb am 01.10.2013 um 12:24 in
 Nachricht 20131001102430.GA4687@walrus.homenet:
 Hi,

 On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
 On 2013-10-01T00:53:15, Tom Parker tpar...@cbnco.com wrote:

 Thanks for paying attention to this issue (not really a bug) as I am
 sure I am not the only one with this issue.  For now I have set all my
 VMs to destroy so that the cluster is the only thing managing them but
 this is not super clean as I get failures in my logs that are not really
 failures.
 It is very much a severe bug.

 The Xen RA has gained a workaround for this now, but we're also pushing
 Take a look here:

 https://github.com/ClusterLabs/resource-agents/pull/314 

 Thanks,

 Dejan

 the Xen team (where the real problem is) to investigate and fix.


 Regards,
 Lars

 -- 
 Architect Storage/HA
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
 HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Antw: Re: Xen RA and rebooting

2013-10-02 Thread Tom Parker
Thanks to everyone who helped on this one.  I really appreciate the
speed that this has been looked at and resolved.  I am kind of surprised
that no one has reported it before.

Lars.  Do you know the bug report number with the Xen guys?  I would
like to watch that as it progresses as well.

Thanks again!

Tom

On 10/01/2013 06:24 AM, Dejan Muhamedagic wrote:
 Hi,

 On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
 On 2013-10-01T00:53:15, Tom Parker tpar...@cbnco.com wrote:

 Thanks for paying attention to this issue (not really a bug) as I am
 sure I am not the only one with this issue.  For now I have set all my
 VMs to destroy so that the cluster is the only thing managing them but
 this is not super clean as I get failures in my logs that are not really
 failures.
 It is very much a severe bug.

 The Xen RA has gained a workaround for this now, but we're also pushing
 Take a look here:

 https://github.com/ClusterLabs/resource-agents/pull/314

 Thanks,

 Dejan

 the Xen team (where the real problem is) to investigate and fix.


 Regards,
 Lars

 -- 
 Architect Storage/HA
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, 
 HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Xen RA and rebooting

2013-09-30 Thread Tom Parker
Hi Ulrich.  You have summed it up exactly.  The chances seem small, but
in the real world (Murphy's Law, I guess) I have hit this many times,
twice to the point where I mangled a production VM into garbage.  The
amount of free memory available on the cluster as a whole also seems to
matter: the more free memory there is, the greater the chance that the
cluster decides to move a dead VM while it is rebooting.

Thanks for paying attention to this issue (not really a bug) as I am
sure I am not the only one with this issue.  For now I have set all my
VMs to destroy so that the cluster is the only thing managing them but
this is not super clean as I get failures in my logs that are not really
failures.
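
(For reference, in an xm-style domU config the destroy behaviour is set with
the lifecycle options shown below.  This is an illustrative sketch only; the
file lives wherever your guest configs do, e.g. /etc/xen/vm/<name>.)

    # lifecycle handling in the guest's config file
    on_poweroff = "destroy"
    on_reboot   = "destroy"
    on_crash    = "destroy"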

Tom


On 09/30/2013 07:56 AM, Ulrich Windl wrote:
 Hi!

 With Xen paravirtualization, when a VM (guest) is rebooted (e.g. via the
 guest's own reboot), the actual VM (which doesn't really exist as a concept in
 paravirtualization) is destroyed for a moment and then recreated (AFAIK).
 That's why xm console does not survive a guest reboot, and that's why an RA
 may see that the guest is gone for a moment before it's recreated.

 A clean fix would be in Xen to keep the guest in xm list during reboot.

 The chances to be hit by the problem are small, but when hit, the 
 consequences are bad.

 Regards,
 Ulrich

 Ferenc Wagner wf...@niif.hu schrieb am 17.09.2013 um 11:38 in Nachricht
 87six37pph@lant.ki.iif.hu:
 Lars Marowsky-Bree l...@suse.com writes:

 The RA thinks the guest is gone, the cluster reacts and schedules it
 to be started (perhaps elsewhere); and then the hypervisor starts it
 locally again *too*.

 I think changing those libvirt settings to destroy could work - the
 cluster will then restart the guest appropriately, not the hypervisor.
 Maybe the RA is just too picky about the reported VM state.  This is one
 of the reasons* I'm using my own RA for managing libvirt virtual
 domains: mine does not care about the fine points, if the domain is
 active in any state, it's running, as far as the RA is concerned, so a
 domain reset is not a cluster event in any case.

 On the other hand, doesn't the recover action after a monitor failure
 consist of a stop action on the original host before the new start, just
 to make sure?  Or maybe I'm confusing things...

 Regards,
 Feri.

 * Another is that mine gets the VM definition as a parameter, not via
   some shared filesystem.
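
 (As a sketch of the lenient monitoring Feri describes above -- purely
 illustrative, not his actual agent.  The use of virsh domstate, the state
 names and the monitor_domain / DOMAIN_NAME names are assumptions made for
 the example.)

     monitor_domain() {
         # Assumption: any state reported for an active domain counts as
         # "running"; only a missing or shut-off domain is treated as stopped.
         state=$(virsh domstate "$DOMAIN_NAME" 2>/dev/null)
         case "$state" in
             running|idle|paused|blocked|"in shutdown")
                 return $OCF_SUCCESS ;;
             *)
                 return $OCF_NOT_RUNNING ;;
         esac
     }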

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Xen RA and rebooting

2013-09-17 Thread Tom Parker

On 09/17/2013 01:13 AM, Vladislav Bogdanov wrote:
 14.09.2013 07:28, Tom Parker wrote:
 Hello All

 Does anyone know of a good way to prevent pacemaker from declaring a vm
 dead if it's rebooted from inside the vm.  It seems to be detecting the
 vm as stopped for the brief moment between shutting down and starting
 up.  Often this causes the cluster to have two copies of the same vm if
 the locks are not set properly (which I have found to be unreliable) one
 that is managed and one that is abandoned.

 If anyone has any suggestions or parameters that I should be tweaking
 that would be appreciated.
 I use following in libvirt VM definitions to prevent this:
   <on_poweroff>destroy</on_poweroff>
   <on_reboot>destroy</on_reboot>
   <on_crash>destroy</on_crash>
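
 (To check what a given guest currently has configured, assuming the guests
 are defined through libvirt, something like the following shows all three
 settings; the domain name is just an example.)

     virsh dumpxml example_vm | grep -E 'on_(poweroff|reboot|crash)'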

 Vladislav
Does this not show as a lot of failed operations?  I guess they will
clean themselves up after the failure expires.


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Xen RA and rebooting

2013-09-17 Thread Tom Parker

On 09/17/2013 04:18 AM, Lars Marowsky-Bree wrote:
 On 2013-09-16T16:36:38, Tom Parker tpar...@cbnco.com wrote:

 Can you kindly file a bug report here so it doesn't get lost
 https://github.com/ClusterLabs/resource-agents/issues ?
 Submitted (Issue *#308)*
 Thanks.

 It definitely leads to data corruption and I think it has to do with the
 way that the locking is not working properly on my lvm partitions. 
 Well, not really an LVM issue. The RA thinks the guest is gone, the
 cluster reacts and schedules it to be started (perhaps elsewhere); and
 then the hypervisor starts it locally again *too*.
I mean the locking of the LVs.  I should not be able to mount the same
LV in two places.  I know I can lock each LV exclusively to a node, but I
am not sure how to tell the RA to do that for me.  At the moment I am
activating a VG with the LVM RA, and that VG is shared across all my
physical machines.  If I do exclusive activation, I think that locks the
VG to a particular node instead of the individual LVs.
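
(For reference, a minimal sketch of exclusive activation with the stock LVM
agent; the resource name, VG name and timeouts below are invented for the
example.  As noted above this locks at the volume-group level, so per-guest
locking would mean something like one small VG per guest.)

    primitive vg_guest01 ocf:heartbeat:LVM \
            params volgrpname="vg_guest01" exclusive="true" \
            op monitor interval="60s" timeout="60s"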

 I think changing those libvirt settings to destroy could work - the
 cluster will then restart the guest appropriately, not the hypervisor.


 Regards,
 Lars


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Clone colocation missing?

2013-09-16 Thread Tom Parker
Now that I have started using resource templates for my VMs (Thanks for
this suggestion Lars!) I have added one simple colocation rule to my cluster

My VMs all use the @CBNXen resource template and I have:

order virtual-machines-after-storage inf: storage-clone CBNXen
colocation virtual-machines-with-storage inf: CBNXen storage-clone

This should take care of all the ordering and colocation needs for my VMs
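
(For anyone following along, a rough sketch of what such a template and a VM
referencing it can look like in the crm shell; the parameter values below are
invented for illustration.)

    rsc_template CBNXen ocf:heartbeat:Xen \
            params shutdown_timeout="300" \
            op start timeout="120s" op stop timeout="360s" \
            op monitor interval="30s" timeout="60s"

    primitive example_vm @CBNXen \
            params xmfile="/etc/xen/vm/example_vm"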

Tom


On 09/14/2013 07:14 AM, Lars Marowsky-Bree wrote:
 On 2013-09-13T17:48:40, Tom Parker tpar...@cbnco.com wrote:

 Hi Feri

 I agree that it should be necessary but for some reason it works well 
 the way it is and everything starts in the correct order.  Maybe 
 someone on the dev list can explain a little bit better why this is 
 working.  It may have something to do with the fact that it's a clone 
 instead of a primitive.
 And luck. Your behaviour is undefined, and will work for most of the
 common cases.

 : versus inf: on the order means that, during a healthy start-up,
 A will be scheduled to start before B. It does not mean that B will need
 to be stopped before A. Or that B shouldn't start if A can't. Typically,
 both are required.

 Since you've got ordering, *normally*, B will start on a node where A is
 running. However, if A can't run on a node for any given reason, B will
 still try to start there without collocation. Typically, you'd want to
 avoid that.

 The issues with the start sequence tend to be mostly harmless - you'll
 just get additional errors for failure cases that might distract you
 from the real cause.

 The stop issue can be more difficult, because it might imply that A
 fails to stop because B is still active, and you'll get stop escalation
 (fencing). However, it might also mean that A enters an escalated stop
 procedure itself (like Filesystem, which will kill -9 processes that are
 still active), and thus implicitly stop B by force. That'll probably
 work, you'll see everything stopping, but it might require additional
 time from B on next start-up to recover from the aborted state.

 e.g., you can be lucky, but you also might turn out not to be. In my
 experience, this means it'll all work just fine during controlled
 testing, and then fail spectacularly under a production load.

 Hence the recommendation to fix the constraints ;-)


 (And, yes, this *does* suggest that we made a mistake in making this so
 easy to misconfigure. But, hey, ordering and colocation are independent
 concepts! The design and abstraction is pure! And I admit that I guess
 I'm to blame for that to a large degree ... If the most common case is
 A, then B + B where A, why isn't there a recommended constraint that
 just does both with no way of misconfiguring that? It's pretty high on
 my list of things to have fixed.)


 Regards,
 Lars


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Xen RA and rebooting

2013-09-16 Thread Tom Parker

On 09/14/2013 07:18 AM, Lars Marowsky-Bree wrote:
 On 2013-09-14T00:28:30, Tom Parker tpar...@cbnco.com wrote:

 Does anyone know of a good way to prevent pacemaker from declaring a vm
 dead if it's rebooted from inside the vm.  It seems to be detecting the
 vm as stopped for the brief moment between shutting down and starting
 up. 
 Hrm. Good question. Because to the monitor, it really looks as if the VM
 is temporarily gone, and it doesn't know ... Perhaps we need to keep
 looking for it for a few seconds.

 Can you kindly file a bug report here so it doesn't get lost
 https://github.com/ClusterLabs/resource-agents/issues ?
Submitted (Issue *#308)*
 Often this causes the cluster to have two copies of the same vm if the
 locks are not set properly (which I have found to be unreliable) one
 that is managed and one that is abandoned.
 *This* however is really, really worrisome and sounds like data
 corruption. How is this happening?
It definitely leads to data corruption and I think it has to do with the
way that the locking is not working properly on my lvm partitions.  It
seems to mostly happen on clusters where I am using LVM slices on an MSA
as shared storage (they don't seem to lock at the LV level) and the
placement-strategy is utilization.  If Xen reboots and the cluster
declares the VM dead, it seems to try to start it on another node that
has more resources instead of the node where it was running.  It doesn't
happen consistently enough for me to detect a pattern, and it seems never
to happen on my QA system, where I can actually cause corruption without
anyone getting mad.  If I can isolate how it happens I will file a bug.


 The work-around right now is to put the VM resource into maintenance
 mode for the reboot, or to reboot it via stop/start of the cluster
 manager.
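
(For anyone looking for concrete commands: one way to take a single VM out of
the cluster's hands for an in-guest reboot is to unmanage it for the duration.
A sketch, with prm_xen_v02 standing in for the VM resource:)

    crm resource unmanage prm_xen_v02
    # ... reboot the guest from inside ...
    crm resource manage prm_xen_v02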


 Regards,
 Lars


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Clone colocation missing? (was: Pacemaker 1.19 cannot manage more than 127 resources)

2013-09-13 Thread Tom Parker
Hi Feri

I agree that it should be necessary but for some reason it works well 
the way it is and everything starts in the correct order.  Maybe 
someone on the dev list can explain a little bit better why this is 
working.  It may have something to do with the fact that it's a clone 
instead of a primitive.

Tom

On Thu 05 Sep 2013 04:48:40 AM EDT, Ferenc Wagner wrote:
 Tom Parker tpar...@cbnco.com writes:

 I have attached my original crm config with 201 primitives to this e-mail.

 Hi,

 Sorry to sidetrack this thread, but I really wonder why you only have
 order constraints for your Xen resources, without any colocation
 constraints.  After all, they can only start after the *local* storage
 clone has started...  For example, you have this:

 primitive abrazotedb ocf:heartbeat:Xen [...]
 order abrazotedb-after-storage-clone : storage-clone abrazotedb

 and I miss (besides the already mentioned inf above) something like:

 colocation abrazotedb-with-storage-clone inf: abratozedb storage-clone

 Or is this really unnecessary for some reason?  Please enlighten me.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Xen RA and rebooting

2013-09-13 Thread Tom Parker
Hello All

Does anyone know of a good way to prevent pacemaker from declaring a vm
dead if it's rebooted from inside the vm.  It seems to be detecting the
vm as stopped for the brief moment between shutting down and starting
up.  Often this causes the cluster to have two copies of the same vm if
the locks are not set properly (which I have found to be unreliable) one
that is managed and one that is abandoned.

If anyone has any suggestions or parameters that I should be tweaking
that would be appreciated.

Tom
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] error: te_connect_stonith: Sign-in failed: triggered a retry

2013-08-29 Thread Tom Parker
Hello

Since my upgrade last night I am also seeing this message in the logs on
my servers.

error: te_connect_stonith: Sign-in failed: triggered a retry

Old mailing lists seem to imply that this is an issue with heartbeat
which I don't think I am running.

My software stack is this at the moment:

cluster-glue-1.0.11-0.15.28
libcorosync4-1.4.5-0.18.15
corosync-1.4.5-0.18.15
pacemaker-mgmt-2.1.2-0.7.40
pacemaker-mgmt-client-2.1.2-0.7.40
pacemaker-1.1.9-0.19.102

Does anyone know where this may be coming from?

Thanks

Tom Parker.

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] error: te_connect_stonith: Sign-in failed: triggered a retry

2013-08-29 Thread Tom Parker
This is happening when I am using the really large CIB, and no, there
doesn't seem to be anything else.  3 of my 6 nodes were showing this error.

Now that I have deleted and recreated my CIB this log message seems to
have gone away.


On 08/29/2013 10:16 PM, Andrew Beekhof wrote:
 On 30/08/2013, at 5:51 AM, Tom Parker tpar...@cbnco.com wrote:

 Hello

 Since my upgrade last night I am also seeing this message in the logs on
 my servers.

 error: te_connect_stonith: Sign-in failed: triggered a retry

 Old mailing lists seem to imply that this is an issue with heartbeat
 which I don't think I am running.

 My software stack is this at the moment:

 cluster-glue-1.0.11-0.15.28
 libcorosync4-1.4.5-0.18.15
 corosync-1.4.5-0.18.15
 pacemaker-mgmt-2.1.2-0.7.40
 pacemaker-mgmt-client-2.1.2-0.7.40
 pacemaker-1.1.9-0.19.102

 Does anyone know where this may be coming from?
 were there any other errors?
 ie. did stonithd crash?

 Thanks

 Tom Parker.


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pacemaker 1.19 cannot manage more than 127 resources

2013-08-29 Thread Tom Parker
My pacemaker config contains the following settings:

LRMD_MAX_CHILDREN=8
export PCMK_ipc_buffer=3172882

This is what I had today to get to 127 Resources defined.  I am not sure
what I should choose for the PCMK_ipc_type.  Do you have any suggestions
for large clusters?

Thanks

Tom

On 08/29/2013 11:19 PM, Andrew Beekhof wrote:
 On 30/08/2013, at 5:49 AM, Tom Parker tpar...@cbnco.com wrote:

 Hello.  Last night I updated my SLES 11 servers to HAE-SP3 which contains
 the following versions of software:

 cluster-glue-1.0.11-0.15.28
 libcorosync4-1.4.5-0.18.15
 corosync-1.4.5-0.18.15
 pacemaker-mgmt-2.1.2-0.7.40
 pacemaker-mgmt-client-2.1.2-0.7.40
 pacemaker-1.1.9-0.19.102

 With the previous versions of openais/corosync I could run over 200
 resources with no problems and with very little lag with the management
 commands (crm_mon, crm configure, etc)

 Today I am unable to configure more than 127 resources.  When I commit
 my 128th resource all the crm commands start to fail (crm_mon just
 hangs) or timeout (ERROR: running cibadmin -Ql: Call cib_query failed
 (-62): Timer expired)

 I have attached my original crm config with 201 primitives to this e-mail.

 If anyone has any ideas as to what may have changed between pacemaker
 versions that would cause this please let me know.  If I can't get this
 solved this week I will have to downgrade to SP2 again.

 Thanks for any information.
 I suspect you've hit an IPC buffer limit.

 Depending on exactly what went into the SUSE builds, you should have the 
 following environment variables (documentation from /etc/sysconfig/pacemaker 
 on RHEL) to play with:

 # Force use of a particular class of IPC connection
 # PCMK_ipc_type=shared-mem|socket|posix|sysv

 # Specify an IPC buffer size in bytes
 # Useful when connecting to really big clusters that exceed the default 20k 
 buffer
 # PCMK_ipc_buffer=20480





___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pacemaker 1.19 cannot manage more than 127 resources

2013-08-29 Thread Tom Parker
Do you know if this has changed significantly from the older versions?  
This cluster was working fine before the upgrade.

On Fri 30 Aug 2013 12:16:35 AM EDT, Andrew Beekhof wrote:

 On 30/08/2013, at 1:42 PM, Tom Parker tpar...@cbnco.com wrote:

 My pacemaker config contains the following settings:

 LRMD_MAX_CHILDREN=8
 export PCMK_ipc_buffer=3172882

 perhaps go higher


 This is what I had today to get to 127 Resources defined.  I am not sure 
 what I should choose for the PCMK_ipc_type.  Do you have any suggestions for 
 large clusters?

 shm is the new upstream default, but it may not have propagated to suse yet.


 Thanks

 Tom

 On 08/29/2013 11:19 PM, Andrew Beekhof wrote:
 On 30/08/2013, at 5:49 AM, Tom Parker tpar...@cbnco.com
  wrote:


 Hello.  Last night I updated my SLES 11 servers to HAE-SP3 which contains
 the following versions of software:

 cluster-glue-1.0.11-0.15.28
 libcorosync4-1.4.5-0.18.15
 corosync-1.4.5-0.18.15
 pacemaker-mgmt-2.1.2-0.7.40
 pacemaker-mgmt-client-2.1.2-0.7.40
 pacemaker-1.1.9-0.19.102

 With the previous versions of openais/corosync I could run over 200
 resources with no problems and with very little lag with the management
 commands (crm_mon, crm configure, etc)

 Today I am unable to configure more than 127 resources.  When I commit
 my 128th resource all the crm commands start to fail (crm_mon just
 hangs) or timeout (ERROR: running cibadmin -Ql: Call cib_query failed
 (-62): Timer expired)

 I have attached my original crm config with 201 primitives to this e-mail.

 If anyone has any ideas as to what may have changed between pacemaker
 versions that would cause this please let me know.  If I can't get this
 solved this week I will have to downgrade to SP2 again.

 Thanks for any information.

 I suspect you've hit an IPC buffer limit.

 Depending on exactly what went into the SUSE builds, you should have the 
 following environment variables (documentation from /etc/sysconfig/pacemaker 
 on RHEL) to play with:

 # Force use of a particular class of IPC connection
 # PCMK_ipc_type=shared-mem|socket|posix|sysv

 # Specify an IPC buffer size in bytes
 # Useful when connecting to really big clusters that exceed the default 20k 
 buffer
 # PCMK_ipc_buffer=20480






___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pacemaker 1.19 cannot manage more than 127 resources

2013-08-29 Thread Tom Parker
Thanks for your help.  I think I have it solved.

The trick is that the crm tools also need to know what the Pacemaker 
IPC buffer size is.  I have set:

/etc/sysconfig/pacemaker
#export LRMD_MAX_CHILDREN=8

# Force use of a particular class of IPC connection
# PCMK_ipc_type=shared-mem|socket|posix|sysv
export PCMK_ipc_type=shared-mem

# Specify an IPC buffer size in bytes
# Useful when connecting to really big clusters that exceed the default 
20k buffer
# PCMK_ipc_buffer=20480
export PCMK_ipc_buffer=2048

and

~/.bashrc
export PCMK_ipc_type=shared-mem
export PCMK_ipc_buffer=2048

And now everything seems to play nicely together.

A 20MB buffer seems huge but I have a TON of virtual machines on this 
cluster.

On Fri 30 Aug 2013 01:00:36 AM EDT, Andrew Beekhof wrote:
 You'd have to ask suse.
 They'd know what the old and new are and therefor the differences between the 
 two.

 On 30/08/2013, at 2:21 PM, Tom Parker tpar...@cbnco.com wrote:

 Do you know if this has changed significantly from the older versions?
 This cluster was working fine before the upgrade.

 On Fri 30 Aug 2013 12:16:35 AM EDT, Andrew Beekhof wrote:

 On 30/08/2013, at 1:42 PM, Tom Parker tpar...@cbnco.com wrote:

 My pacemaker config contains the following settings:

 LRMD_MAX_CHILDREN=8
 export PCMK_ipc_buffer=3172882

 perhaps go higher


 This is what I had today to get to 127 Resources defined.  I am not sure 
 what I should choose for the PCMK_ipc_type.  Do you have any suggestions 
 for large clusters?

 shm is the new upstream default, but it may not have propagated to suse yet.


 Thanks

 Tom

 On 08/29/2013 11:19 PM, Andrew Beekhof wrote:
 On 30/08/2013, at 5:49 AM, Tom Parker tpar...@cbnco.com
 wrote:


 Hello.  Last night I updated my SLES 11 servers to HAE-SP3 which contains
 the following versions of software:

 cluster-glue-1.0.11-0.15.28
 libcorosync4-1.4.5-0.18.15
 corosync-1.4.5-0.18.15
 pacemaker-mgmt-2.1.2-0.7.40
 pacemaker-mgmt-client-2.1.2-0.7.40
 pacemaker-1.1.9-0.19.102

 With the previous versions of openais/corosync I could run over 200
 resources with no problems and with very little lag with the management
 commands (crm_mon, crm configure, etc)

 Today I am unable to configure more than 127 resources.  When I commit
 my 128th resource all the crm commands start to fail (crm_mon just
 hangs) or timeout (ERROR: running cibadmin -Ql: Call cib_query failed
 (-62): Timer expired)

 I have attached my original crm config with 201 primitives to this 
 e-mail.

 If anyone has any ideas as to what may have changed between pacemaker
 versions that would cause this please let me know.  If I can't get this
 solved this week I will have to downgrade to SP2 again.

 Thanks for any information.

 I suspect you've hit an IPC buffer limit.

 Depending on exactly what went into the SUSE builds, you should have the 
 following environment variables (documentation from 
 /etc/sysconfig/pacemaker on RHEL) to play with:

 # Force use of a particular class of IPC connection
 # PCMK_ipc_type=shared-mem|socket|posix|sysv

 # Specify an IPC buffer size in bytes
 # Useful when connecting to really big clusters that exceed the default 
 20k buffer
 # PCMK_ipc_buffer=20480







___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems