Re: [Linux-HA] Xen RA and rebooting
Tom Parker tpar...@cbnco.com writes:

> On 09/17/2013 04:18 AM, Lars Marowsky-Bree wrote:
>> On 2013-09-16T16:36:38, Tom Parker tpar...@cbnco.com wrote:
>>
>>> It definitely leads to data corruption and I think it has to do with
>>> the way that the locking is not working properly on my LVM partitions.
>>
>> Well, not really an LVM issue. The RA thinks the guest is gone, the
>> cluster reacts and schedules it to be started (perhaps elsewhere); and
>> then the hypervisor starts it locally again *too*.
>
> I mean the locking of the LVs. I should not be able to mount the same LV
> in two places. I know I can lock each LV exclusive to a node, but I am
> not sure how to tell the RA to do that for me.

CLVM can provide exclusive activation, but that would make live migration
impossible.
-- 
Feri.
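For reference, node-exclusive activation with clvmd looks roughly like this (a sketch with made-up VG/LV names); it is exactly this exclusive lock that rules out live migration, because the destination host cannot activate the LV while the source node still holds it:

    # With clvmd running and the VG marked clustered, activate an LV
    # exclusively on this node; other nodes are then refused activation.
    lvchange -aey /dev/vg_guests/lv_vm1

    # Release the exclusive activation again (e.g. before moving the guest).
    lvchange -an /dev/vg_guests/lv_vm1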
Re: [Linux-HA] Xen RA and rebooting
On 2013-09-16T16:36:38, Tom Parker tpar...@cbnco.com wrote:

>> Can you kindly file a bug report here so it doesn't get lost
>> https://github.com/ClusterLabs/resource-agents/issues ?
>
> Submitted (Issue #308)

Thanks.

> It definitely leads to data corruption and I think it has to do with the
> way that the locking is not working properly on my LVM partitions.

Well, not really an LVM issue. The RA thinks the guest is gone, the
cluster reacts and schedules it to be started (perhaps elsewhere); and
then the hypervisor starts it locally again *too*.

I think changing those libvirt settings to "destroy" could work - the
cluster will then restart the guest appropriately, not the hypervisor.

Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Linux-HA] Xen RA and rebooting
Lars Marowsky-Bree l...@suse.com writes:

> The RA thinks the guest is gone, the cluster reacts and schedules it to
> be started (perhaps elsewhere); and then the hypervisor starts it
> locally again *too*.
>
> I think changing those libvirt settings to "destroy" could work - the
> cluster will then restart the guest appropriately, not the hypervisor.

Maybe the RA is just too picky about the reported VM state. This is one
of the reasons* I'm using my own RA for managing libvirt virtual domains:
mine does not care about the fine points; if the domain is active in any
state, it's running as far as the RA is concerned, so a domain reset is
not a cluster event in any case.

On the other hand, doesn't the recover action after a monitor failure
consist of a stop action on the original host before the new start, just
to make sure? Or maybe I'm confusing things...

Regards,
Feri.

* Another is that mine gets the VM definition as a parameter, not via
some shared filesystem.
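A minimal sketch of the kind of state-tolerant check described here (not Feri's actual agent; the domain-name handling is hypothetical, assuming virsh is available on the host):

    #!/bin/sh
    # Hypothetical monitor fragment: treat the domain as running whenever
    # libvirt reports any active state (running, paused, in shutdown, ...)
    # so that a guest-initiated reboot is not seen as a cluster event.
    DOMAIN_NAME="$1"

    state=$(virsh domstate "$DOMAIN_NAME" 2>/dev/null)
    case "$state" in
        ""|"shut off")
            exit 7   # OCF_NOT_RUNNING: genuinely gone or shut off
            ;;
        *)
            exit 0   # OCF_SUCCESS: any active state counts as running
            ;;
    esac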
Re: [Linux-HA] Xen RA and rebooting
On 2013-09-17T11:38:34, Ferenc Wagner wf...@niif.hu wrote:

> On the other hand, doesn't the recover action after a monitor failure
> consist of a stop action on the original host before the new start,
> just to make sure? Or maybe I'm confusing things...

Yes, it would - but it seems there's a brief period during reboot where
the guest is shown as gone/cleanly stopped, and then the stop action will
just see the very same. Actually, that strikes me as a problem with
Xen/libvirt's reporting.

Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
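One hedged way to paper over that brief window (not what the Xen RA currently does; the retry count and sleep are arbitrary) would be to re-check the domain a few times before believing it is really gone:

    #!/bin/sh
    # Hypothetical status helper: poll a few times before reporting the
    # domain as stopped, so the short gap between guest shutdown and
    # restart does not immediately look like a failure.
    DOMAIN_NAME="$1"

    for attempt in 1 2 3; do
        if virsh domstate "$DOMAIN_NAME" 2>/dev/null | grep -qv "shut off"; then
            exit 0   # OCF_SUCCESS: domain is (or is again) active
        fi
        sleep 2      # arbitrary grace period before re-checking
    done
    exit 7           # OCF_NOT_RUNNING: still gone after all retries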
Re: [Linux-HA] Xen RA and rebooting
Lars Marowsky-Bree l...@suse.com writes:

> On 2013-09-17T11:38:34, Ferenc Wagner wf...@niif.hu wrote:
>
>> On the other hand, doesn't the recover action after a monitor failure
>> consist of a stop action on the original host before the new start,
>> just to make sure? Or maybe I'm confusing things...
>
> Yes, it would - but it seems there's a brief period during reboot where
> the guest is shown as gone/cleanly stopped, and then the stop action
> will just see the very same. Actually, that strikes me as a problem with
> Xen/libvirt's reporting.

Absolutely. KVM under libvirt does not exhibit such behaviour on our
systems, and I find this most natural and correct.
-- 
Regards,
Feri.
Re: [Linux-HA] Xen RA and rebooting
On 09/17/2013 01:13 AM, Vladislav Bogdanov wrote:
> 14.09.2013 07:28, Tom Parker wrote:
>> Hello All
>>
>> Does anyone know of a good way to prevent pacemaker from declaring a VM
>> dead if it's rebooted from inside the VM? It seems to be detecting the
>> VM as stopped for the brief moment between shutting down and starting
>> up.
>>
>> Often this causes the cluster to have two copies of the same VM if the
>> locks are not set properly (which I have found to be unreliable): one
>> that is managed and one that is abandoned.
>>
>> If anyone has any suggestions or parameters that I should be tweaking,
>> that would be appreciated.
>
> I use the following in libvirt VM definitions to prevent this:
>
>   <on_poweroff>destroy</on_poweroff>
>   <on_reboot>destroy</on_reboot>
>   <on_crash>destroy</on_crash>
>
> Vladislav

Does this not show up as a lot of failed operations? I guess they will
clean themselves up after the failure expires.
Re: [Linux-HA] Xen RA and rebooting
17.09.2013 20:51, Tom Parker wrote:
> On 09/17/2013 01:13 AM, Vladislav Bogdanov wrote:
>> 14.09.2013 07:28, Tom Parker wrote:
>>> Hello All
>>>
>>> Does anyone know of a good way to prevent pacemaker from declaring a
>>> VM dead if it's rebooted from inside the VM? It seems to be detecting
>>> the VM as stopped for the brief moment between shutting down and
>>> starting up.
>>>
>>> Often this causes the cluster to have two copies of the same VM if the
>>> locks are not set properly (which I have found to be unreliable): one
>>> that is managed and one that is abandoned.
>>>
>>> If anyone has any suggestions or parameters that I should be tweaking,
>>> that would be appreciated.
>>
>> I use the following in libvirt VM definitions to prevent this:
>>
>>   <on_poweroff>destroy</on_poweroff>
>>   <on_reboot>destroy</on_reboot>
>>   <on_crash>destroy</on_crash>
>>
>> Vladislav
>
> Does this not show up as a lot of failed operations? I guess they will
> clean themselves up after the failure expires.

Exactly. And this is much better than data corruption.
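If those expected failures are a nuisance, Pacemaker's failure-timeout meta attribute can expire them automatically; a hedged crmsh example with a hypothetical resource name:

    # Let failures on the VM resource expire after ten minutes, so the
    # "failed" monitor results caused by guest reboots clean themselves up.
    crm resource meta vm_example set failure-timeout 600

Note that expired failures are only cleared when the cluster re-evaluates its state, so the effective delay also depends on cluster-recheck-interval.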
Re: [Linux-HA] Xen RA and rebooting
On 09/17/2013 04:18 AM, Lars Marowsky-Bree wrote:
> On 2013-09-16T16:36:38, Tom Parker tpar...@cbnco.com wrote:
>
>>> Can you kindly file a bug report here so it doesn't get lost
>>> https://github.com/ClusterLabs/resource-agents/issues ?
>>
>> Submitted (Issue #308)
>
> Thanks.
>
>> It definitely leads to data corruption and I think it has to do with
>> the way that the locking is not working properly on my LVM partitions.
>
> Well, not really an LVM issue. The RA thinks the guest is gone, the
> cluster reacts and schedules it to be started (perhaps elsewhere); and
> then the hypervisor starts it locally again *too*.

I mean the locking of the LVs. I should not be able to mount the same LV
in two places. I know I can lock each LV exclusive to a node, but I am
not sure how to tell the RA to do that for me.

At the moment I am activating a VG with the LVM RA, and that VG is shared
across all my physical machines. If I do exclusive activation, I think
that locks the VG to a particular node instead of the individual LVs.

> I think changing those libvirt settings to "destroy" could work - the
> cluster will then restart the guest appropriately, not the hypervisor.
>
> Regards,
>     Lars
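One way around the VG-level granularity (a sketch with made-up names, assuming one volume group per guest) is to give each VM its own VG, activate it exclusively with the LVM RA, and tie it to the VM resource:

    # Hypothetical layout: a dedicated VG per guest, activated exclusively
    # on whichever node runs that guest, and started before the VM itself.
    crm configure primitive vg_vm1 ocf:heartbeat:LVM \
        params volgrpname="vg_vm1" exclusive="true" \
        op monitor interval="60s"
    crm configure colocation vm1_with_vg inf: vm_vm1 vg_vm1
    crm configure order vg_before_vm1 inf: vg_vm1 vm_vm1

Here vm_vm1 stands in for the existing VM resource; as with CLVM above, exclusive activation rules out live migration of that guest.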
Re: [Linux-HA] Xen RA and rebooting
On 09/14/2013 07:18 AM, Lars Marowsky-Bree wrote:
> On 2013-09-14T00:28:30, Tom Parker tpar...@cbnco.com wrote:
>
>> Does anyone know of a good way to prevent pacemaker from declaring a VM
>> dead if it's rebooted from inside the VM? It seems to be detecting the
>> VM as stopped for the brief moment between shutting down and starting
>> up.
>
> Hrm. Good question. Because to the monitor, it really looks as if the VM
> is temporarily gone, and it doesn't know ... Perhaps we need to keep
> looking for it for a few seconds.
>
> Can you kindly file a bug report here so it doesn't get lost
> https://github.com/ClusterLabs/resource-agents/issues ?

Submitted (Issue #308)

>> Often this causes the cluster to have two copies of the same VM if the
>> locks are not set properly (which I have found to be unreliable): one
>> that is managed and one that is abandoned.
>
> *This* however is really, really worrisome and sounds like data
> corruption. How is this happening?

It definitely leads to data corruption and I think it has to do with the
way that the locking is not working properly on my LVM partitions. It
seems to mostly happen on clusters where I am using LVM slices on an MSA
as shared storage (they don't seem to lock at the LV level) and where the
placement-strategy is utilization. If a Xen guest reboots and the cluster
declares the VM dead, the cluster seems to try to start it on another
node that has more resources instead of the node where it was running.

It doesn't happen consistently enough for me to detect a pattern, and it
never seems to happen on my QA system, where I can actually cause
corruption without anyone getting mad. If I can isolate how it happens I
will file a bug.

> The work-around right now is to put the VM resource into maintenance
> mode for the reboot, or to reboot it via stop/start of the cluster
> manager.
>
> Regards,
>     Lars
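A hedged sketch of that workaround with crmsh (vm_example is a hypothetical resource name):

    # Option 1: take the VM resource out of cluster management for the
    # duration of the guest-initiated reboot, then hand it back.
    crm resource unmanage vm_example
    # ... reboot from inside the guest ...
    crm resource manage vm_example

    # Option 2: let the cluster manager itself do the reboot as a
    # stop/start cycle instead of rebooting from inside the guest.
    crm resource stop vm_example
    crm resource start vm_example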
Re: [Linux-HA] Xen RA and rebooting
14.09.2013 07:28, Tom Parker wrote:
> Hello All
>
> Does anyone know of a good way to prevent pacemaker from declaring a VM
> dead if it's rebooted from inside the VM? It seems to be detecting the
> VM as stopped for the brief moment between shutting down and starting
> up.
>
> Often this causes the cluster to have two copies of the same VM if the
> locks are not set properly (which I have found to be unreliable): one
> that is managed and one that is abandoned.
>
> If anyone has any suggestions or parameters that I should be tweaking,
> that would be appreciated.

I use the following in libvirt VM definitions to prevent this:

  <on_poweroff>destroy</on_poweroff>
  <on_reboot>destroy</on_reboot>
  <on_crash>destroy</on_crash>

Vladislav
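To apply this, the persistent domain definition can be edited in place; with on_reboot set to destroy, a reboot from inside the guest simply stops the domain, leaving the restart to the cluster. A short example with a hypothetical domain name:

    # Edit the persistent libvirt definition of the domain.
    virsh edit vm_example

    # Verify the lifecycle settings afterwards.
    virsh dumpxml vm_example | grep -E "on_(poweroff|reboot|crash)"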
Re: [Linux-HA] Xen RA and rebooting
On 2013-09-14T00:28:30, Tom Parker tpar...@cbnco.com wrote:

> Does anyone know of a good way to prevent pacemaker from declaring a VM
> dead if it's rebooted from inside the VM? It seems to be detecting the
> VM as stopped for the brief moment between shutting down and starting
> up.

Hrm. Good question. Because to the monitor, it really looks as if the VM
is temporarily gone, and it doesn't know ... Perhaps we need to keep
looking for it for a few seconds.

Can you kindly file a bug report here so it doesn't get lost
https://github.com/ClusterLabs/resource-agents/issues ?

> Often this causes the cluster to have two copies of the same VM if the
> locks are not set properly (which I have found to be unreliable): one
> that is managed and one that is abandoned.

*This* however is really, really worrisome and sounds like data
corruption. How is this happening?

The work-around right now is to put the VM resource into maintenance mode
for the reboot, or to reboot it via stop/start of the cluster manager.

Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde