Re: [Linux-HA] Heartbeat fail-over Email Alert
Hi Lars

Can you provide more details about this resource agent? The documentation is a little sparse. What events will cause an e-mail to be sent?

Thanks!
Tom

On 24/09/14 06:53 PM, Lars Ellenberg wrote:
On Tue, Sep 23, 2014 at 04:55:20PM +0530, Atul Yadav wrote:
Dear Team,
In our environment we use the heartbeat method for storage HA, and the storage HA is working fine under Heartbeat management. Now we need your guidance to set up an e-mail alert for when a fail-over starts and when it has completed. We have already set up SMTP on both servers (Storage1 and Storage2) and are able to send mail from a terminal window. Please guide us.

What's wrong with the MailTo resource agent?
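For reference, the MailTo agent simply sends a mail when it is itself started or stopped, so placing it last in the group that fails over produces a notification whenever the group moves. A minimal crm sketch in that spirit (the address, subject and the other resource names are illustrative only, not taken from this thread):

    primitive p_notify ocf:heartbeat:MailTo \
        params email="admin@example.com" subject="Storage HA fail-over" \
        op monitor interval="10s" timeout="10s"
    group g_storage p_fs_storage p_ip_storage p_notify

Because a group starts its members in order and stops them in reverse, a MailTo resource at the end of the group mails when the fail-over has completed on the new node and again when the group begins to stop on the old one.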
Re: [Linux-HA] SBD flipping between Pacemaker: UNHEALTHY and OK
OK. I have changed that to no_path_retry fail, but I don't think it has anything to do with the errors I am seeing. They seem to be related to sbd's link with my cluster, not to disk I/O.

Tom

On 23/04/14 03:11 AM, emmanuel segura wrote:
The first thing: you are using no_path_retry the wrong way in your multipath configuration; try reading this: http://www.novell.com/documentation/oes2/clus_admin_lx/data/bl9ykz6.html

2014-04-22 20:41 GMT+02:00 Tom Parker tpar...@cbnco.com:
I have attached the config files to this e-mail. The sbd dump is below. [...]

On 22/04/14 02:30 PM, emmanuel segura wrote:
You are missing the cluster configuration, the sbd configuration and the multipath config.

2014-04-22 20:21 GMT+02:00 Tom Parker tpar...@cbnco.com:
Has anyone seen this? Do you know what might be causing the flapping? [...]
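For reference, the point of the document emmanuel links to is that the SBD device must fail I/O quickly instead of queueing it when all paths are lost, otherwise the sbd servant cannot react within its timeouts. A minimal multipath.conf stanza along those lines (the WWID is a placeholder; the no_path_retry setting is the only real point here):

    multipaths {
        multipath {
            wwid          36000000000000000000000000000000a
            alias         qa-xen-sbd
            no_path_retry fail
        }
    }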
Re: [Linux-HA] SBD flipping between Pacemaker: UNHEALTHY and OK
SBD has a connection to Pacemaker to establish overall cluster health (the -P flag). This seems to be where the problem is; I just don't know what the problem might be.

On 23/04/14 11:32 AM, emmanuel segura wrote:
What do you mean with link?

2014-04-23 15:23 GMT+02:00 Tom Parker tpar...@cbnco.com:
OK. I have changed that to no_path_retry fail, but I don't think it has anything to do with the errors I am seeing. They seem to be related to sbd's link with my cluster, not to disk I/O.

On 23/04/14 03:11 AM, emmanuel segura wrote:
The first thing: you are using no_path_retry the wrong way in your multipath configuration; try reading this: http://www.novell.com/documentation/oes2/clus_admin_lx/data/bl9ykz6.html
[...]
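For reference, the Pacemaker health servant Tom mentions is enabled by passing -P to sbd at startup; on SLES that is normally done in /etc/sysconfig/sbd, roughly like this (device path taken from this thread, options illustrative):

    # /etc/sysconfig/sbd
    SBD_DEVICE="/dev/mapper/qa-xen-sbd"
    # -W: use the hardware watchdog, -P: additionally monitor Pacemaker/cluster health
    SBD_OPTS="-W -P"

With -P active, sbd only self-fences on a lost disk if the cluster partition is also unhealthy, which is why the "Pacemaker health check" messages in the log above matter.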
[Linux-HA] Resource blocked
Good morning

I am trying to restart resources on one of my clusters and I am getting the message:

pengine[13397]: notice: LogActions: Start domtcot1-qa (qaxen1 - blocked)

How can I find out why this resource is blocked?

Thanks
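For reference, a few stock Pacemaker/crmsh commands that usually show why the policy engine marks a start as "blocked" (typically a failed stop on the same or a dependent resource, or an unsatisfied ordering/colocation constraint); the resource name is the one from the message above:

    crm_mon -1rf                        # one-shot status including fail counts and inactive resources
    crm_simulate -sL                    # replay the live CIB and show scores/decisions for each action
    crm configure show domtcot1-qa      # the resource definition as configured (check related constraints too)
    grep domtcot1-qa /var/log/messages  # pengine/crmd lines explaining the blocked action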
Re: [Linux-HA] SBD flipping between Pacemaker: UNHEALTHY and OK
I have attached the config files to this e-mail. The sbd dump is below.

[LIVE] qaxen1:~ # sbd -d /dev/mapper/qa-xen-sbd dump
==Dumping header on disk /dev/mapper/qa-xen-sbd
Header version     : 2.1
UUID               : ae835596-3d26-4681-ba40-206b4d51149b
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 45
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 90
==Header on disk /dev/mapper/qa-xen-sbd is dumped

On 22/04/14 02:30 PM, emmanuel segura wrote:
You are missing the cluster configuration, the sbd configuration and the multipath config.

2014-04-22 20:21 GMT+02:00 Tom Parker tpar...@cbnco.com:
Has anyone seen this? Do you know what might be causing the flapping?

Apr 21 22:03:03 qaxen6 sbd: [12962]: info: Watchdog enabled.
Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Servant starting for device /dev/mapper/qa-xen-sbd
Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Monitoring Pacemaker health
Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Device /dev/mapper/qa-xen-sbd uuid: ae835596-3d26-4681-ba40-206b4d51149b
Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Legacy plug-in detected, AIS quorum check enabled
Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Waiting to sign in with cluster ...
Apr 21 22:03:04 qaxen6 sbd: [12971]: notice: Using watchdog device: /dev/watchdog
Apr 21 22:03:04 qaxen6 sbd: [12971]: info: Set watchdog timeout to 45 seconds.
Apr 21 22:03:04 qaxen6 sbd: [12974]: info: Waiting to sign in with cluster ...
Apr 21 22:03:06 qaxen6 sbd: [12974]: info: We don't have a DC right now.
Apr 21 22:03:08 qaxen6 sbd: [12974]: WARN: Node state: UNKNOWN
Apr 21 22:03:09 qaxen6 sbd: [12974]: info: Node state: online
Apr 21 22:03:09 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 21 22:03:10 qaxen6 sbd: [12974]: WARN: Node state: pending
Apr 21 22:03:11 qaxen6 sbd: [12974]: info: Node state: online
Apr 21 22:15:01 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 21 22:15:01 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
Apr 21 22:16:37 qaxen6 sbd: [12974]: info: Node state: online
Apr 21 22:16:37 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 21 22:25:08 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 21 22:25:08 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
Apr 21 22:26:44 qaxen6 sbd: [12974]: info: Node state: online
Apr 21 22:26:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 21 22:39:24 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 21 22:39:24 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
Apr 21 22:42:44 qaxen6 sbd: [12974]: info: Node state: online
Apr 21 22:42:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 22 01:36:24 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 22 01:36:24 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
Apr 22 01:36:34 qaxen6 sbd: [12974]: info: Node state: online
Apr 22 01:36:34 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 22 06:53:15 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 22 06:53:15 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
Apr 22 06:54:03 qaxen6 sbd: [12974]: info: Node state: online
Apr 22 06:54:03 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 22 09:57:21 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 22 09:57:21 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
Apr 22 09:58:12 qaxen6 sbd: [12974]: info: Node state: online
Apr 22 09:58:12 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 22 10:59:49 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 22 10:59:49 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
Apr 22 11:00:41 qaxen6 sbd: [12974]: info: Node state: online
Apr 22 11:00:41 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 22 11:50:55 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 22 11:50:55 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
Apr 22 11:51:06 qaxen6 sbd: [12974]: info: Node state: online
Apr 22 11:51:06 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 22 13:09:12 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 22 13:09:12 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
Apr 22 13:09:35 qaxen6 sbd: [12974]: info: Node state: online
Apr 22 13:09:35 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 22 13:31:35 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 22 13:31:35 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
Apr 22 13:31:44 qaxen6 sbd: [12974]: info: Node state: online
Apr 22 13:31:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 22 13:32:52 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 22 13:32:52 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
Apr 22 13:33:01 qaxen6 sbd: [12974]: info: Node state: online
Apr 22 13:33:01 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 22 13:44:39 qaxen6 sbd: [12974]: WARN
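For reference, the header dumped above is written once with "sbd create"; the equivalent command for these timeouts would be roughly the following (only run this against a device that is not in use by a live cluster, since it rewrites all slots):

    sbd -d /dev/mapper/qa-xen-sbd -1 45 -4 90 create   # -1 watchdog timeout, -4 msgwait timeout
    sbd -d /dev/mapper/qa-xen-sbd dump                  # verify what was written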
Re: [Linux-HA] /usr/sbin/lrmadmin missing from cluster-glue
Thanks Kristoffer. How is tuning done for lrm now?

Tom

On 01/24/2014 01:41 AM, Kristoffer Grönlund wrote:
On Sat, 28 Dec 2013 11:18:44 -0500 Tom Parker tpar...@cbnco.com wrote:
Hello. /usr/sbin/lrmadmin is missing from the latest version of cluster-glue in SLES SP3. Has the program been deprecated, or is this an issue in the packaging of the RPM?

Hi, I know this is a bit late, but I just discovered this email. Yes, lrmadmin has been deprecated since it is incompatible with recent versions of pacemaker.
[Linux-HA] /usr/sbin/lrmadmin missing from cluster-glue
Hello

/usr/sbin/lrmadmin is missing from the latest version of cluster-glue in SLES SP3. Has the program been deprecated, or is this an issue in the packaging of the RPM?

Thanks
Tom
Re: [Linux-HA] Antw: Xen XL Resource Agent
My thoughts were to create a new RA and let admins choose. If you would prefer to auto-detect, that is an option as well.

Tom

From: Lars Marowsky-Bree
Sent: Monday, November 18, 2013 8:27 AM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Antw: Xen XL Resource Agent

On 2013-11-15T09:05:53, Tom Parker tpar...@cbnco.com wrote:
The XL tools are much faster and lighter weight. I am not sure if they report proper codes (I will have to test), but the XM stack has been deprecated, so at some point I assume it will go away completely.

The Xen RA already supports xen-list and xen-destroy in addition to the xm tools. Patches to additionally support xl are welcome. (Auto-detect what is available, and then choose xl -> xen-* -> xm.) We can't yet drop xm, since not all environments have xl yet.

Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
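For reference, a minimal sketch of the detection order Lars describes, in plain shell; this is illustrative only and not the code that later went into the Xen RA:

    # Prefer xl, fall back to the xen-* helper tools, and finally to xm.
    if command -v xl >/dev/null 2>&1; then
        xentool=xl
    elif command -v xen-list >/dev/null 2>&1; then
        xentool=xen-list        # paired with xen-destroy for stopping guests
    else
        xentool=xm
    fi
    $xentool list               # list domains with whichever stack was found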
Re: [Linux-HA] Antw: Xen XL Resource Agent
The XL tools are much faster and lighter weight. I am not sure if they report proper codes (I will have to test), but the XM stack has been deprecated, so at some point I assume it will go away completely.

[LIVE] qaxen1:~ # time xl list
Name ID Mem VCPUs State Time(s)
Domain-0 0 2534712 r- 146383.6
Domain list removed

real    0m0.053s
user    0m0.000s
sys     0m0.008s

[LIVE] qaxen1:~ # time xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 2534712 r- 146381.1
Domain list removed

real    0m0.352s
user    0m0.236s
sys     0m0.036s

On 11/15/2013 02:04 AM, Ulrich Windl wrote:
Tom Parker tpar...@cbnco.com schrieb am 14.11.2013 um 19:23 in Nachricht 5285150b.9050...@cbnco.com:
Hello. Now that XM has been deprecated, is anyone working on a Xen RA that uses the xl tool stack?

I wonder whether xl will (as opposed to xm) report proper exit codes if operations fail. Otherwise I don't see the reason to change tools. MHO...

I am willing to do the work, but I don't want to duplicate the effort if someone else is doing/has already done it.

Tom
[Linux-HA] Xen XL Resource Agent
Hello

Now that XM has been deprecated, is anyone working on a Xen RA that uses the xl tool stack? I am willing to do the work, but I don't want to duplicate the effort if someone else is doing/has already done it.

Tom
Re: [Linux-HA] How many primitives, groups can I have
You will also have to be careful of the shared memory size between the nodes. I had issues with massive CIBs; setting some environment variables fixed the issue, but the defaults are too small.

From: Digimer
Sent: Monday, November 11, 2013 10:24 AM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] How many primitives, groups can I have

On 11/11/13 07:57, Michael Brookhuis wrote:
Hi, is there a limit on the number of primitives, etc. that you can have? What maximum number is recommended based on best practices? Are 1500 too many?
Thanks
Mima

The CIB will be very large, so pushing changes to other nodes will take time (especially if you have many nodes). I suspect you will run into corosync timeouts before you hit any coded upper limits. You will likely have to play with corosync timing values to get that high, assuming your network is fast enough at all. But in the end, as I understand it, there is no coded upper limit.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [Linux-HA] How many primitives, groups can I have
I have found my settings. I needed to set the following in /etc/sysconfig/pacemaker:

# Force use of a particular class of IPC connection
# PCMK_ipc_type=shared-mem|socket|posix|sysv
export PCMK_ipc_type=shared-mem

# Specify an IPC buffer size in bytes
# Useful when connecting to really big clusters that exceed the default 20k buffer
# PCMK_ipc_buffer=20480
export PCMK_ipc_buffer=2048

and in my .bashrc (so that the crm tools work properly) I have:

[LIVE] qaxen1:~ # cat .bashrc
# Load Pacemaker IPC settings for crm
PACEMAKER_SYSCONFIG=/etc/sysconfig/pacemaker
if [ -f $PACEMAKER_SYSCONFIG ]; then
    . $PACEMAKER_SYSCONFIG
fi

Hope this helps.

On 11/11/2013 03:35 PM, Tom Parker wrote:
You will also have to be careful of the shared memory size between the nodes. I had issues with massive CIBs; setting some environment variables fixed the issue, but the defaults are too small.
[...]
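For reference, one way to choose a PCMK_ipc_buffer value is to measure the live CIB and leave generous headroom above it; a quick check with the standard tools (nothing here is specific to this cluster):

    cibadmin -Q | wc -c    # size of the current CIB in bytes
    # PCMK_ipc_buffer should comfortably exceed this on every node, and the same
    # value must be exported for the command-line tools as well (hence sourcing
    # /etc/sysconfig/pacemaker from the shell as shown above).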
Re: [Linux-HA] Antw: Re: Xen RA and rebooting
Thanks for the feedback.

Dejan, I have some SLES nodes that are running around 30 pretty heavy VMs, and I found that while I never hit 5s, the time it takes to reboot is not a constant. I have a feeling that this bug in xen-list may take a while to be fixed upstream and to trickle down into the released xen packages, so we may be using this fix for a while.

The full longdesc now reads:

<longdesc lang="en">
When the guest is rebooting, there is a short interval where the guest
completely disappears from xm list, which, in turn, will cause the monitor
operation to return a "not running" status. If a monitor returns "not
running", then test the status again for wait_for_reboot seconds (perhaps
it'll show up).
NOTE: This timer increases the amount of time the cluster will wait before
declaring a VM dead and recovering it.
</longdesc>

Tom

On 10/21/2013 03:28 AM, Ulrich Windl wrote:
When the guest is rebooting, there is a short interval where the guest completely disappears from xm list, which, in turn, will cause the monitor operation to return a "not running" status. If the guest cannot be found, this value will cause some extra delay in the monitor operation to work around the problem.
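For reference, with the patched agent the new parameter is set like any other Xen RA parameter; a hypothetical crm snippet (the resource name, config file and the 10-second value are made up for illustration):

    primitive vm_example ocf:heartbeat:Xen \
        params xmfile="/etc/xen/vm/example" wait_for_reboot="10" \
        op monitor interval="30s" timeout="60s" \
        op start timeout="120s" op stop timeout="180s"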
Re: [Linux-HA] Antw: Re: Xen RA and rebooting
Hi Dejan.

How can I revert my commits so that they do not include multiple things? I will submit one patch with the logging cleanup and then, if needed, another with my changes to the meta-data.

Tom

On 10/21/2013 09:39 AM, Dejan Muhamedagic wrote:
Hi Ulrich!

On Mon, Oct 21, 2013 at 09:28:50AM +0200, Ulrich Windl wrote:
Hi! Basically I think there should be no hard-coded constants whose value depends on some performance measurements, like 5s for rebooting a VM.

It's actually not 5s, but the status is run 5 times. If the load is high, my guess is that the Xen tools used by the RA would suffer proportionally. So I support Tom's changes.

However I noticed:
+running; apparently, this period lasts only for a second or
+two
(missing full stop at end of sentence)

That's at the end of the comment and, typically, comments end with a carriage return (as is here the case).

Actually I'd rephrase the description: "When the guest is rebooting, there is a short interval where the guest completely disappears from xm list, which, in turn, will cause the monitor operation to return a not running status. If the guest cannot be found, this value will cause some extra delay in the monitor operation to work around the problem." (I.e. try to describe the effect, not the implementation.)

That's the code, so the implementation is described. The very top of the comment says:
# If the guest is rebooting, it may completely disappear from the
# list of defined guests
I was hoping that that was enough of an explanation. Look for a more thorough description of the cause in the changelog. BTW, note that this is a _workaround_ and that the thing should eventually be fixed in Xen.

And yes, I appreciate consistent log formats also ;-)

That's always welcome, of course. It should also go in a separate commit.

Thanks,
Dejan

Regards,
Ulrich

Tom Parker tpar...@cbnco.com schrieb am 18.10.2013 um 19:30 in Nachricht 5261703a.5070...@cbnco.com:
Hi Dejan. Sorry to be slow to respond to this. I have done some testing and everything looks good. [...]
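For reference, a common way to split a commit that mixes unrelated changes (generic git; the branch name, file path and commit messages are placeholders):

    git checkout -b xen-log-cleanup   # new topic branch holding the mixed commit
    git reset HEAD~1                  # undo the commit but keep its changes in the working tree
    git add -p heartbeat/Xen          # interactively stage only the logging hunks
    git commit -m "Xen: make log messages consistent"
    git add heartbeat/Xen             # stage the remaining meta-data hunks
    git commit -m "Xen: clarify wait_for_reboot description"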
Re: [Linux-HA] Antw: Re: Xen RA and rebooting
Hi Dejan. Sorry to be slow to respond to this.

I have done some testing and everything looks good. I spent some time tweaking the RA and added a parameter called wait_for_reboot (default 5s) to allow us to override the reboot sleep times (in case it's more than 5 seconds on really loaded hypervisors). I also cleaned up a few log entries to make them consistent in the RA, and edited your entries for the xen status check to be a little bit clearer about why we think we should be waiting.

I have attached a patch here because I have NO idea how to create a branch and pull request. If there are links to a good place to start, I may be able to contribute occasionally to some other RAs that I use.

Please let me know what you think. Thanks for your help.

Tom

On 10/17/2013 06:10 AM, Dejan Muhamedagic wrote:
On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
Hi Tom,
On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
Some more reading of the source code makes me think the || [ "$__OCF_ACTION" != stop ] is not needed.

Yes, you're right. I'll drop that part of the if statement. Many thanks for testing.

Fixed now. The if statement, which was obviously hard to follow, got relegated to the monitor function. Which makes Xen_Status_with_Retry really stand for what's happening in there ;-) Tom, hope you can test again.

Cheers,
Dejan
[...]
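For reference, the usual GitHub workflow for turning such a patch into a pull request against ClusterLabs/resource-agents (the fork user and branch name are placeholders):

    git clone git@github.com:YOURUSER/resource-agents.git   # after forking the repository on GitHub
    cd resource-agents
    git checkout -b xen-wait-for-reboot
    # ...apply and commit the Xen RA changes...
    git push origin xen-wait-for-reboot
    # then open a pull request from YOURUSER:xen-wait-for-reboot against ClusterLabs:master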
Re: [Linux-HA] Antw: Re: Xen RA and rebooting
I may have actually created the pull request properly... Please let me know, and again thanks for your help.

Tom

On 10/18/2013 01:30 PM, Tom Parker wrote:
Hi Dejan. Sorry to be slow to respond to this. I have done some testing and everything looks good. I spent some time tweaking the RA and added a parameter called wait_for_reboot (default 5s) to allow us to override the reboot sleep times (in case it's more than 5 seconds on really loaded hypervisors). [...]
Re: [Linux-HA] Antw: Re: Xen RA and rebooting
Hi.

I think there is an issue with the updated Xen RA, specifically with the if statement below, but I am not sure. I may be confused about how bash || works, but I don't see my servers ever entering the loop when a VM disappears.

if ocf_is_probe || [ "$__OCF_ACTION" != stop ]; then
    return $rc
fi

Does this not mean that if we run a monitor operation that is not a probe we will have:

    (ocf_is_probe)        returns false
    ("stop" != "monitor") returns true
    (false || true)       returns true

which will cause the if statement to return $rc and never enter the loop?

Xen_Status_with_Retry() {
    local rc cnt=5

    Xen_Status $1
    rc=$?
    if ocf_is_probe || [ "$__OCF_ACTION" != stop ]; then
        return $rc
    fi
    while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
        case "$__OCF_ACTION" in
        stop)
            ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
            ;;
        monitor)
            ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
            ;;
        *) : not reachable
            ;;
        esac
        sleep 1
        Xen_Status $1
        rc=$?
        let cnt=$((cnt-1))
    done
    return $rc
}

On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
Hi Tom,
On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
Hi Dejan. Just a quick question: I cannot see your new log messages being logged to syslog:
ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
Do you know where I can set my logging to see warn-level messages? I expected to see them in my testing by default, but that does not seem to be true.

You should see them by default. But note that these warnings may not happen, depending on the circumstances on your host. In my experiments they were logged only while the guest was rebooting, and then just once or maybe twice. If you have recent resource-agents and crmsh, you can enable operation tracing (with crm resource trace <rsc> monitor <interval>).

Thanks,
Dejan
[...]
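For reference, the simplest reading of the fix discussed here is to drop the second test so that both stop and monitor fall through to the retry loop, returning early only for probes. A sketch of that idea (not the patch that was eventually merged, which moved the probe check into the monitor function):

    # before: monitor operations never reach the retry loop
    if ocf_is_probe || [ "$__OCF_ACTION" != stop ]; then
        return $rc
    fi

    # after: only probes report the momentary state immediately
    if ocf_is_probe; then
        return $rc
    fi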
Re: [Linux-HA] Antw: Re: Xen RA and rebooting
Some more reading of the source code makes me think the || [ "$__OCF_ACTION" != stop ] is not needed. Xen_Status_with_Retry() is only called from Stop and Monitor, so we only need to check whether it's a probe; everything else should be handled by the case statement in the loop.

Tom

On 10/16/2013 05:16 PM, Tom Parker wrote:
Hi. I think there is an issue with the updated Xen RA, specifically with the if statement below, but I am not sure. I may be confused about how bash || works, but I don't see my servers ever entering the loop when a VM disappears.

if ocf_is_probe || [ "$__OCF_ACTION" != stop ]; then
    return $rc
fi
[...]
Re: [Linux-HA] Antw: Re: Xen RA and rebooting
Hi Dejan

Just a quick question: I cannot see your new log messages being logged to syslog:

ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."

Do you know where I can set my logging to see warn-level messages? I expected to see them in my testing by default, but that does not seem to be true.

Thanks
Tom

On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
Hi,
On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
Hi! I thought I'd never be bitten by this bug, but I actually was! Now I'm wondering whether the Xen RA sees the guest if you use pygrub, and pygrub is still counting down for actual boot... But the reason why I'm writing is that I think I've discovered another bug in the RA: CRM decided to recover the guest VM v02: [...]
As you can clearly see, start failed because the guest was found up already! IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).

Yes, I've seen that. It's basically the same issue, i.e. the domain being gone for a while and then reappearing.

I guess the following test is problematic:
---
xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
rc=$?
if [ $rc -ne 0 ]; then
    return $OCF_ERR_GENERIC
---
Here xm create probably fails if the guest is already created...

It should fail too. Note that this is a race, but the race is anyway caused by the strange behaviour of xen. With the recent fix (or workaround) in the RA, this shouldn't be happening.

Thanks,
Dejan

Regards,
Ulrich
[...]
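For reference, the operation tracing Dejan refers to is a crmsh command; a hypothetical invocation for a Xen resource (the resource name and interval are made up):

    crm resource trace vm_v02 monitor 30      # trace the 30s monitor operation of vm_v02
    crm resource untrace vm_v02 monitor 30    # switch tracing off again when done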
Re: [Linux-HA] Antw: Re: Xen RA and rebooting
This scares me too. If the start operation finds a running vm and fails, my cluster config will automatically try to start the same VM on the next node it has available. This scenario almost guarantees duplicate VMs even if I have the on_reboot=destroy. Dejan, I am not sure but I don't think your patch will take care of this. In my opinion a start that finds a running version should return success (vm should be started and it is.) Tom On 10/08/2013 07:52 AM, Ulrich Windl wrote: Hi! I thought, I'll never be bitten by this bug, but I actually was! Now I'm wondering whether the Xen RA sees the guest if you use pygrub, and pygrub is still counting down for actual boot... But the reason why I'm writing is that I think I've discovered another bug in the RA: CRM decided to recover the guest VM v02: [...] lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906: pid 19516 exited with return code 7 [...] pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05) [...] crmd: [14906]: info: te_rsc_command: Initiating action 5: stop prm_xen_v02_stop_0 on h05 (local) [...] Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped. [...] lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid 19552 exited with return code 0 [...] crmd: [14906]: info: te_rsc_command: Initiating action 78: start prm_xen_v02_start_0 on h05 (local) lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686) [...] lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02' already exists with ID '3' lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file /etc/xen/vm/v02. [...] lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid 19686 exited with return code 1 [...] crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05 failed (target: 0 vs. rc: 1): Error [...] As you can clearly see start failed, because the guest was found up already! IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84). I guess the following test is problematic: --- xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME rc=$? if [ $rc -ne 0 ]; then return $OCF_ERR_GENERIC --- Here xm create probably fails if the guest is already created... Regards, Ulrich Dejan Muhamedagic deja...@fastmail.fm schrieb am 01.10.2013 um 12:24 in Nachricht 20131001102430.GA4687@walrus.homenet: Hi, On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote: On 2013-10-01T00:53:15, Tom Parker tpar...@cbnco.com wrote: Thanks for paying attention to this issue (not really a bug) as I am sure I am not the only one with this issue. For now I have set all my VMs to destroy so that the cluster is the only thing managing them but this is not super clean as I get failures in my logs that are not really failures. It is very much a severe bug. The Xen RA has gained a workaround for this now, but we're also pushing Take a look here: https://github.com/ClusterLabs/resource-agents/pull/314 Thanks, Dejan the Xen team (where the real problem is) to investigate and fix. Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. 
-- Oscar Wilde ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
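For illustration only, here is a minimal sketch of the start behaviour Tom argues for above -- not the actual change made in pull request #314 -- built around the same xm create fragment Ulrich quotes: probe the domain first and report success if it already exists.

  # Sketch: if the domain is already there, treat start as successful instead
  # of letting "xm create" fail with "Domain already exists".
  if xm list "$DOMAIN_NAME" >/dev/null 2>&1; then
      ocf_log info "Xen domain $DOMAIN_NAME already running."
      return $OCF_SUCCESS
  fi
  xm create ${OCF_RESKEY_xmfile} name="$DOMAIN_NAME"
  rc=$?
  if [ $rc -ne 0 ]; then
      return $OCF_ERR_GENERIC
  fi
  return $OCF_SUCCESS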
Re: [Linux-HA] Antw: Re: Xen RA and rebooting
I want to test the updated RA. Does anyone know how I can increase the loglevel to warn or debug without restarting my cluster? I am not seeing any of the new messages in my logs. Tom On 10/08/2013 07:52 AM, Ulrich Windl wrote: Hi! I thought, I'll never be bitten by this bug, but I actually was! Now I'm wondering whether the Xen RA sees the guest if you use pygrub, and pygrub is still counting down for actual boot... But the reason why I'm writing is that I think I've discovered another bug in the RA: CRM decided to recover the guest VM v02: [...] lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906: pid 19516 exited with return code 7 [...] pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05) [...] crmd: [14906]: info: te_rsc_command: Initiating action 5: stop prm_xen_v02_stop_0 on h05 (local) [...] Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped. [...] lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid 19552 exited with return code 0 [...] crmd: [14906]: info: te_rsc_command: Initiating action 78: start prm_xen_v02_start_0 on h05 (local) lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686) [...] lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02' already exists with ID '3' lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file /etc/xen/vm/v02. [...] lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid 19686 exited with return code 1 [...] crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05 failed (target: 0 vs. rc: 1): Error [...] As you can clearly see start failed, because the guest was found up already! IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84). I guess the following test is problematic: --- xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME rc=$? if [ $rc -ne 0 ]; then return $OCF_ERR_GENERIC --- Here xm create probably fails if the guest is already created... Regards, Ulrich Dejan Muhamedagic deja...@fastmail.fm schrieb am 01.10.2013 um 12:24 in Nachricht 20131001102430.GA4687@walrus.homenet: Hi, On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote: On 2013-10-01T00:53:15, Tom Parker tpar...@cbnco.com wrote: Thanks for paying attention to this issue (not really a bug) as I am sure I am not the only one with this issue. For now I have set all my VMs to destroy so that the cluster is the only thing managing them but this is not super clean as I get failures in my logs that are not really failures. It is very much a severe bug. The Xen RA has gained a workaround for this now, but we're also pushing Take a look here: https://github.com/ClusterLabs/resource-agents/pull/314 Thanks, Dejan the Xen team (where the real problem is) to investigate and fix. Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. 
-- Oscar Wilde ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
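On the log level question: one option that avoids a cluster restart is signalling the Pacemaker daemons directly. This is only a sketch and assumes a build where the daemons honour SIGUSR1/SIGUSR2 for runtime log level changes (1.1.x builds generally do); verify on a test node first.

  # SIGUSR1 raises a daemon's log verbosity one step, SIGUSR2 lowers it again.
  pkill -USR1 lrmd     # resource agent output is logged through lrmd
  pkill -USR1 crmd
  # ... run the test, watch the logs ...
  pkill -USR2 lrmd
  pkill -USR2 crmd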
Re: [Linux-HA] Antw: Re: Xen RA and rebooting
Thanks to everyone who helped on this one. I really appreciate the speed with which this has been looked at and resolved. I am kind of surprised that no one has reported it before. Lars, do you know the bug report number with the Xen guys? I would like to watch that as it progresses as well. Thanks again! Tom On 10/01/2013 06:24 AM, Dejan Muhamedagic wrote: Hi, On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote: On 2013-10-01T00:53:15, Tom Parker tpar...@cbnco.com wrote: Thanks for paying attention to this issue (not really a bug) as I am sure I am not the only one with this issue. For now I have set all my VMs to destroy so that the cluster is the only thing managing them but this is not super clean as I get failures in my logs that are not really failures. It is very much a severe bug. The Xen RA has gained a workaround for this now, but we're also pushing Take a look here: https://github.com/ClusterLabs/resource-agents/pull/314 Thanks, Dejan the Xen team (where the real problem is) to investigate and fix. Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. -- Oscar Wilde ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Antw: Re: Xen RA and rebooting
Hi Ulrich. You have summed it up exactly and the chances seem small but in the real world (Murphy's Law I guess) I have hit this many times. Twice to the point where I have mangled a production VM into garbage. The amount of free memory available on the cluster as a whole seems to make a big difference, because there is a much greater chance of the cluster deciding to move a dead VM while it is rebooting. Thanks for paying attention to this issue (not really a bug) as I am sure I am not the only one with this issue. For now I have set all my VMs to destroy so that the cluster is the only thing managing them but this is not super clean as I get failures in my logs that are not really failures. Tom On 09/30/2013 07:56 AM, Ulrich Windl wrote: Hi! With Xen paravirtualization, when a VM (guest) is rebooted (e.g. via the guest's reboot), the actual VM (which doesn't really exist as a concept in paravirtualization) is destroyed for a moment and then is recreated (AFAIK). That's why xm console does not survive a guest reboot, and that's why an RA may see the guest is gone for a moment before it's recreated. A clean fix would be in Xen to keep the guest in xm list during reboot. The chances to be hit by the problem are small, but when hit, the consequences are bad. Regards, Ulrich Ferenc Wagner wf...@niif.hu schrieb am 17.09.2013 um 11:38 in Nachricht 87six37pph@lant.ki.iif.hu: Lars Marowsky-Bree l...@suse.com writes: The RA thinks the guest is gone, the cluster reacts and schedules it to be started (perhaps elsewhere); and then the hypervisor starts it locally again *too*. I think changing those libvirt settings to destroy could work - the cluster will then restart the guest appropriately, not the hypervisor. Maybe the RA is just too picky about the reported VM state. This is one of the reasons* I'm using my own RA for managing libvirt virtual domains: mine does not care about the fine points, if the domain is active in any state, it's running, as far as the RA is concerned, so a domain reset is not a cluster event in any case. On the other hand, doesn't the recover action after a monitor failure consist of a stop action on the original host before the new start, just to make sure? Or maybe I'm confusing things... Regards, Feri. * Another is that mine gets the VM definition as a parameter, not via some shared filesystem. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
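A rough sketch of the tolerant monitor Feri describes -- purely illustrative, not his actual RA -- where any active state libvirt reports for the (transient) domain counts as running, so a guest-initiated reboot is not seen as a stop:

  # Illustrative monitor: ask libvirt for the domain state and only report
  # "not running" when the domain is genuinely gone, shut off or crashed.
  state=$(virsh domstate "$DOMAIN_NAME" 2>/dev/null)
  case "$state" in
      ""|"shut off"|"crashed") return $OCF_NOT_RUNNING ;;
      *)                       return $OCF_SUCCESS ;;
  esac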
Re: [Linux-HA] Xen RA and rebooting
On 09/17/2013 01:13 AM, Vladislav Bogdanov wrote: 14.09.2013 07:28, Tom Parker wrote: Hello All Does anyone know of a good way to prevent pacemaker from declaring a vm dead if it's rebooted from inside the vm? It seems to be detecting the vm as stopped for the brief moment between shutting down and starting up. Often this causes the cluster to have two copies of the same vm if the locks are not set properly (which I have found to be unreliable), one that is managed and one that is abandoned. If anyone has any suggestions or parameters that I should be tweaking that would be appreciated. I use the following in libvirt VM definitions to prevent this: <on_poweroff>destroy</on_poweroff> <on_reboot>destroy</on_reboot> <on_crash>destroy</on_crash> Vladislav Does this not show as a lot of failed operations? I guess they will clean themselves up after the failure expires. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
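For guests managed through plain Xen config files rather than libvirt (the RA in this thread points at files under /etc/xen/vm), the equivalent settings would look roughly like this -- a sketch, not taken from anyone's actual config:

  on_poweroff = "destroy"
  on_reboot   = "destroy"
  on_crash    = "destroy"

With on_reboot set to destroy the hypervisor never recreates the guest on its own, so a reboot from inside the guest shows up as a monitor failure that the cluster then recovers -- the failed operations Tom asks about, which clear once the failure expires or is cleaned up.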
Re: [Linux-HA] Xen RA and rebooting
On 09/17/2013 04:18 AM, Lars Marowsky-Bree wrote: On 2013-09-16T16:36:38, Tom Parker tpar...@cbnco.com wrote: Can you kindly file a bug report here so it doesn't get lost https://github.com/ClusterLabs/resource-agents/issues ? Submitted (Issue #308). Thanks. It definitely leads to data corruption and I think has to do with the way that the locking is not working properly on my lvm partitions. Well, not really an LVM issue. The RA thinks the guest is gone, the cluster reacts and schedules it to be started (perhaps elsewhere); and then the hypervisor starts it locally again *too*. I mean the locking of the LVs. I should not be able to mount the same LV in two places. I know I can lock each LV exclusive to a node but I am not sure how to tell the RA to do that for me. At the moment I am activating a VG with the LVM RA and that is shared across all my physical machines. If I do exclusive activation I think that locks the vg to a particular node instead of the LVs. I think changing those libvirt settings to destroy could work - the cluster will then restart the guest appropriately, not the hypervisor. Regards, Lars ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
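On the exclusive activation point: the ocf:heartbeat:LVM agent does take an exclusive parameter, but as Tom says it applies to the whole volume group, not to individual LVs. A hedged crm sketch (the resource and VG names here are placeholders):

  primitive xen-vg ocf:heartbeat:LVM \
      params volgrpname="xenvg" exclusive="true" \
      op monitor interval="60s" timeout="60s"

Exclusive activation pins the VG to one node, which only helps if every VM on that VG runs on the same host; per-LV locking across nodes would need something like cLVM on top.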
Re: [Linux-HA] Clone colocation missing?
Now that I have started using resource templates for my VMs (Thanks for this suggestion Lars!) I have added one simple colocation rule to my cluster My VMs all use the @CBNXen resource template and I have: order virtual-machines-after-storage inf: storage-clone CBNXen colocation virtual-machines-with-storage inf: CBNXen storage-clone This should take care of all the ordering and colocation needs for my VMs Tom On 09/14/2013 07:14 AM, Lars Marowsky-Bree wrote: On 2013-09-13T17:48:40, Tom Parker tpar...@cbnco.com wrote: Hi Feri I agree that it should be necessary but for some reason it works well the way it is and everything starts in the correct order. Maybe someone on the dev list can explain a little bit better why this is working. It may have something to do with the fact that it's a clone instead of a primitive. And luck. Your behaviour is undefined, and will work for most of the common cases. : versus inf: on the order means that, during a healthy start-up, A will be scheduled to start before B. It does not mean that B will need to be stopped before A. Or that B shouldn't start if A can't. Typically, both are required. Since you've got ordering, *normally*, B will start on a node where A is running. However, if A can't run on a node for any given reason, B will still try to start there without collocation. Typically, you'd want to avoid that. The issues with the start sequence tend to be mostly harmless - you'll just get additional errors for failure cases that might distract you from the real cause. The stop issue can be more difficult, because it might imply that A fails to stop because B is still active, and you'll get stop escalation (fencing). However, it might also mean that A enters an escalated stop procedure itself (like Filesystem, which will kill -9 processes that are still active), and thus implicitly stop B by force. That'll probably work, you'll see everything stopping, but it might require additional time from B on next start-up to recover from the aborted state. e.g., you can be lucky, but you also might turn out not to be. In my experience, this means it'll all work just fine during controlled testing, and then fail spectacularly under a production load. Hence the recommendation to fix the constraints ;-) (And, yes, this *does* suggest that we made a mistake in making this so easy to misconfigure. But, hey, ordering and colocation are independent concepts! The design and abstraction is pure! And I admit that I guess I'm to blame for that to a large degree ... If the most common case is A, then B + B where A, why isn't there a recommended constraint that just does both with no way of misconfiguring that? It's pretty high on my list of things to have fixed.) Regards, Lars ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
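For reference, a rough crm sketch of how the @CBNXen template ties each VM to that single constraint pair (the operation values and the example primitive are placeholders, not Tom's real config):

  rsc_template CBNXen ocf:heartbeat:Xen \
      op monitor interval="30s" timeout="60s" \
      op start timeout="120s" op stop timeout="180s"
  primitive some-vm @CBNXen \
      params xmfile="/etc/xen/vm/some-vm"
  # constraints that name the template id apply to every resource derived from it
  order virtual-machines-after-storage inf: storage-clone CBNXen
  colocation virtual-machines-with-storage inf: CBNXen storage-clone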
Re: [Linux-HA] Xen RA and rebooting
On 09/14/2013 07:18 AM, Lars Marowsky-Bree wrote: On 2013-09-14T00:28:30, Tom Parker tpar...@cbnco.com wrote: Does anyone know of a good way to prevent pacemaker from declaring a vm dead if it's rebooted from inside the vm? It seems to be detecting the vm as stopped for the brief moment between shutting down and starting up. Hrm. Good question. Because to the monitor, it really looks as if the VM is temporarily gone, and it doesn't know ... Perhaps we need to keep looking for it for a few seconds. Can you kindly file a bug report here so it doesn't get lost https://github.com/ClusterLabs/resource-agents/issues ? Submitted (Issue #308) Often this causes the cluster to have two copies of the same vm if the locks are not set properly (which I have found to be unreliable), one that is managed and one that is abandoned. *This* however is really, really worrisome and sounds like data corruption. How is this happening? It definitely leads to data corruption and I think has to do with the way that the locking is not working properly on my lvm partitions. It seems to mostly happen on clusters where I am using lvm slices on an MSA as shared storage (they don't seem to lock at the lv level) and the placement-strategy is utilization. If Xen reboots and the cluster declares the vm as dead it seems to try to start it on another node that has more resources instead of the node where it was running. It doesn't happen consistently enough for me to detect a pattern and seems to never happen on my QA system where I can actually cause corruption without anyone getting mad. If I can isolate how it happens I will file a bug. The work-around right now is to put the VM resource into maintenance mode for the reboot, or to reboot it via stop/start of the cluster manager. Regards, Lars ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
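The workaround Lars mentions, spelled out in crm shell terms (prm_xen_v02 is simply the resource name from the logs quoted elsewhere in this thread, used here as an example):

  # take the VM out of cluster control for the reboot, then hand it back
  crm resource unmanage prm_xen_v02
  # ... reboot the guest from inside ...
  crm resource manage prm_xen_v02
  # or, instead, reboot it through the cluster manager:
  crm resource stop prm_xen_v02
  crm resource start prm_xen_v02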
Re: [Linux-HA] Clone colocation missing? (was: Pacemaker 1.19 cannot manage more than 127 resources)
Hi Feri, I agree that it should be necessary but for some reason it works well the way it is and everything starts in the correct order. Maybe someone on the dev list can explain a little bit better why this is working. It may have something to do with the fact that it's a clone instead of a primitive. Tom On Thu 05 Sep 2013 04:48:40 AM EDT, Ferenc Wagner wrote: Tom Parker tpar...@cbnco.com writes: I have attached my original crm config with 201 primitives to this e-mail. Hi, Sorry to sidetrack this thread, but I really wonder why you only have order constraints for your Xen resources, without any colocation constraints. After all, they can only start after the *local* storage clone has started... For example, you have this: primitive abrazotedb ocf:heartbeat:Xen [...] order abrazotedb-after-storage-clone : storage-clone abrazotedb and I miss (besides the already mentioned inf above) something like: colocation abrazotedb-with-storage-clone inf: abrazotedb storage-clone Or is this really unnecessary for some reason? Please enlighten me. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
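For the record, the constraint pair Feri is suggesting for this example resource would read roughly as follows (same names as above, with both scores made mandatory):

  order abrazotedb-after-storage-clone inf: storage-clone abrazotedb
  colocation abrazotedb-with-storage-clone inf: abrazotedb storage-clone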
[Linux-HA] Xen RA and rebooting
Hello All Does anyone know of a good way to prevent pacemaker from declaring a vm dead if it's rebooted from inside the vm? It seems to be detecting the vm as stopped for the brief moment between shutting down and starting up. Often this causes the cluster to have two copies of the same vm if the locks are not set properly (which I have found to be unreliable), one that is managed and one that is abandoned. If anyone has any suggestions or parameters that I should be tweaking that would be appreciated. Tom ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] error: te_connect_stonith: Sign-in failed: triggered a retry
Hello Since my upgrade last night I am also seeing this message in the logs on my servers. error: te_connect_stonith: Sign-in failed: triggered a retry Old mailing lists seem to imply that this is an issue with heartbeat which I don't think I am running. My software stack is this at the moment: cluster-glue-1.0.11-0.15.28 libcorosync4-1.4.5-0.18.15 corosync-1.4.5-0.18.15 pacemaker-mgmt-2.1.2-0.7.40 pacemaker-mgmt-client-2.1.2-0.7.40 pacemaker-1.1.9-0.19.102 Does anyone know where this may be coming from? Thanks Tom Parker. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] error: te_connect_stonith: Sign-in failed: triggered a retry
This is happening when I am using the really large CIB, and no, there doesn't seem to be anything else. 3 of my 6 nodes were showing this error. Now that I have deleted and recreated my CIB this log message seems to have gone away. On 08/29/2013 10:16 PM, Andrew Beekhof wrote: On 30/08/2013, at 5:51 AM, Tom Parker tpar...@cbnco.com wrote: Hello Since my upgrade last night I am also seeing this message in the logs on my servers. error: te_connect_stonith: Sign-in failed: triggered a retry Old mailing lists seem to imply that this is an issue with heartbeat which I don't think I am running. My software stack is this at the moment: cluster-glue-1.0.11-0.15.28 libcorosync4-1.4.5-0.18.15 corosync-1.4.5-0.18.15 pacemaker-mgmt-2.1.2-0.7.40 pacemaker-mgmt-client-2.1.2-0.7.40 pacemaker-1.1.9-0.19.102 Does anyone know where this may be coming from? were there any other errors? ie. did stonithd crash? Thanks Tom Parker. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
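One quick way to check whether stonithd itself is alive and answering on a node that logs this sign-in error (a sketch; assumes the standard Pacemaker 1.1 command line tools):

  ps -ef | grep '[s]tonithd'          # is the daemon running at all?
  stonith_admin --list-registered     # should return promptly if stonithd is answering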
Re: [Linux-HA] Pacemaker 1.19 cannot manage more than 127 resources
My pacemaker config contains the following settings: LRMD_MAX_CHILDREN=8 export PCMK_ipc_buffer=3172882 This is what I had today to get to 127 Resources defined. I am not sure what I should choose for the PCMK_ipc_type. Do you have any suggestions for large clusters? Thanks Tom On 08/29/2013 11:19 PM, Andrew Beekhof wrote: On 30/08/2013, at 5:49 AM, Tom Parker tpar...@cbnco.com wrote: Hello. Las night I updated my SLES 11 servers to HAE-SP3 which contains the following versions of software: cluster-glue-1.0.11-0.15.28 libcorosync4-1.4.5-0.18.15 corosync-1.4.5-0.18.15 pacemaker-mgmt-2.1.2-0.7.40 pacemaker-mgmt-client-2.1.2-0.7.40 pacemaker-1.1.9-0.19.102 With the previous versions of openais/corosync I could run over 200 resources with no problems and with very little lag with the management commands (crm_mon, crm configure, etc) Today I am unable to configure more than 127 resources. When I commit my 128th resource all the crm commands start to fail (crm_mon just hangs) or timeout (ERROR: running cibadmin -Ql: Call cib_query failed (-62): Timer expired) I have attached my original crm config with 201 primitives to this e-mail. If anyone has any ideas as to what may have changed between pacemaker versions that would cause this please let me know. If I can't get this solved this week I will have to downgrade to SP2 again. Thanks for any information. I suspect you've hit an IPC buffer limit. Depending on exactly what went into the SUSE builds, you should have the following environment variables (documentation from /etc/syconfig/pacemaker on RHEL) to play with: # Force use of a particular class of IPC connection # PCMK_ipc_type=shared-mem|socket|posix|sysv # Specify an IPC buffer size in bytes # Useful when connecting to really big clusters that exceed the default 20k buffer # PCMK_ipc_buffer=20480 ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Pacemaker 1.19 cannot manage more than 127 resources
Do you know if this has changed significantly from the older versions? This cluster was working fine before the upgrade. On Fri 30 Aug 2013 12:16:35 AM EDT, Andrew Beekhof wrote: On 30/08/2013, at 1:42 PM, Tom Parker tpar...@cbnco.com wrote: My pacemaker config contains the following settings: LRMD_MAX_CHILDREN=8 export PCMK_ipc_buffer=3172882 perhaps go higher This is what I had today to get to 127 Resources defined. I am not sure what I should choose for the PCMK_ipc_type. Do you have any suggestions for large clusters? shm is the new upstream default, but it may not have propagated to suse yet. Thanks Tom On 08/29/2013 11:19 PM, Andrew Beekhof wrote: On 30/08/2013, at 5:49 AM, Tom Parker tpar...@cbnco.com wrote: Hello. Las night I updated my SLES 11 servers to HAE-SP3 which contains the following versions of software: cluster-glue-1.0.11-0.15.28 libcorosync4-1.4.5-0.18.15 corosync-1.4.5-0.18.15 pacemaker-mgmt-2.1.2-0.7.40 pacemaker-mgmt-client-2.1.2-0.7.40 pacemaker-1.1.9-0.19.102 With the previous versions of openais/corosync I could run over 200 resources with no problems and with very little lag with the management commands (crm_mon, crm configure, etc) Today I am unable to configure more than 127 resources. When I commit my 128th resource all the crm commands start to fail (crm_mon just hangs) or timeout (ERROR: running cibadmin -Ql: Call cib_query failed (-62): Timer expired) I have attached my original crm config with 201 primitives to this e-mail. If anyone has any ideas as to what may have changed between pacemaker versions that would cause this please let me know. If I can't get this solved this week I will have to downgrade to SP2 again. Thanks for any information. I suspect you've hit an IPC buffer limit. Depending on exactly what went into the SUSE builds, you should have the following environment variables (documentation from /etc/syconfig/pacemaker on RHEL) to play with: # Force use of a particular class of IPC connection # PCMK_ipc_type=shared-mem|socket|posix|sysv # Specify an IPC buffer size in bytes # Useful when connecting to really big clusters that exceed the default 20k buffer # PCMK_ipc_buffer=20480 ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Pacemaker 1.19 cannot manage more than 127 resources
Thanks for your help. I think I have it solved. The trick is that the crm tools also need to know what the Pacemaker IPC buffer size is. I have set: /etc/sysconfig/pacemaker #export LRMD_MAX_CHILDREN=8 # Force use of a particular class of IPC connection # PCMK_ipc_type=shared-mem|socket|posix|sysv export PCMK_ipc_type=shared-mem # Specify an IPC buffer size in bytes # Useful when connecting to really big clusters that exceed the default 20k buffer # PCMK_ipc_buffer=20480 export PCMK_ipc_buffer=2048 and ~/.bashrc export PCMK_ipc_type=shared-mem export PCMK_ipc_buffer=2048 And now everything seems to play nicely together. A 20MB buffer seems huge but I have a TON of virtual machines on this cluster. On Fri 30 Aug 2013 01:00:36 AM EDT, Andrew Beekhof wrote: You'd have to ask suse. They'd know what the old and new are and therefor the differences between the two. On 30/08/2013, at 2:21 PM, Tom Parker tpar...@cbnco.com wrote: Do you know if this has changed significantly from the older versions? This cluster was working fine before the upgrade. On Fri 30 Aug 2013 12:16:35 AM EDT, Andrew Beekhof wrote: On 30/08/2013, at 1:42 PM, Tom Parker tpar...@cbnco.com wrote: My pacemaker config contains the following settings: LRMD_MAX_CHILDREN=8 export PCMK_ipc_buffer=3172882 perhaps go higher This is what I had today to get to 127 Resources defined. I am not sure what I should choose for the PCMK_ipc_type. Do you have any suggestions for large clusters? shm is the new upstream default, but it may not have propagated to suse yet. Thanks Tom On 08/29/2013 11:19 PM, Andrew Beekhof wrote: On 30/08/2013, at 5:49 AM, Tom Parker tpar...@cbnco.com wrote: Hello. Las night I updated my SLES 11 servers to HAE-SP3 which contains the following versions of software: cluster-glue-1.0.11-0.15.28 libcorosync4-1.4.5-0.18.15 corosync-1.4.5-0.18.15 pacemaker-mgmt-2.1.2-0.7.40 pacemaker-mgmt-client-2.1.2-0.7.40 pacemaker-1.1.9-0.19.102 With the previous versions of openais/corosync I could run over 200 resources with no problems and with very little lag with the management commands (crm_mon, crm configure, etc) Today I am unable to configure more than 127 resources. When I commit my 128th resource all the crm commands start to fail (crm_mon just hangs) or timeout (ERROR: running cibadmin -Ql: Call cib_query failed (-62): Timer expired) I have attached my original crm config with 201 primitives to this e-mail. If anyone has any ideas as to what may have changed between pacemaker versions that would cause this please let me know. If I can't get this solved this week I will have to downgrade to SP2 again. Thanks for any information. I suspect you've hit an IPC buffer limit. Depending on exactly what went into the SUSE builds, you should have the following environment variables (documentation from /etc/syconfig/pacemaker on RHEL) to play with: # Force use of a particular class of IPC connection # PCMK_ipc_type=shared-mem|socket|posix|sysv # Specify an IPC buffer size in bytes # Useful when connecting to really big clusters that exceed the default 20k buffer # PCMK_ipc_buffer=20480 ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
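A way to sanity-check the number before settling on it -- a sketch, with an illustrative value rather than the one from this cluster:

  # how big is the CIB? PCMK_ipc_buffer needs to be comfortably above this
  cibadmin -Q > /tmp/cib.xml
  ls -l /tmp/cib.xml
  # then use the same values for the daemons (via /etc/sysconfig/pacemaker on
  # every node) and for the shell running the crm tools:
  export PCMK_ipc_type=shared-mem
  export PCMK_ipc_buffer=20971520    # e.g. 20MB, well above the CIB size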