On 19/09/16 03:13 PM, Digimer wrote:
> On 19/09/16 03:07 PM, Digimer wrote:
>> On 19/09/16 02:39 PM, Digimer wrote:
>>> On 19/09/16 02:30 PM, Jan Pokorný wrote:
>>>> On 18/09/16 15:37 -0400, Digimer wrote:
>>>>> If, for example, a server's definition file is corrupted while the
>>>>> server is running, rgmanager will put the server into a 'failed'
>>>>> state. That's fine and fair.
>>>>
>>>> Please, be more precise. Is it the "vm" resource agent that you are
>>>> talking about, hence the server is the particular virtual machine to
>>>> be managed? Is the agent in the role of a service (defined at a top
>>>> level) or a standard resource (without special treatment, possibly
>>>> with dependent services further in the group)?
>>>
>>> In 'clustat', vm:foo reports 'failed' after vm.sh is called for a
>>> status check and returns a bad code (because the foo.xml file was
>>> corrupted by a typo that breaks the XML, as an example).
>>>
>>> I'm not sure if that answers your question, sorry.
>>>
>>>>> The problem is that, once the file is fixed, there appears to be no
>>>>> way to go failed -> started without disabling (and thus powering
>>>>> off) the VM. This is troublesome because it forces an interruption
>>>>> when the service could have been placed under resource management
>>>>> without a reboot.
>>>>>
>>>>> For example, when doing 'clusvcadm -e <server>' while the service
>>>>> was 'disabled' (say because of a manual boot of the server),
>>>>> rgmanager detects that the server is running fine and simply marks
>>>>> the server as 'started'. Is there no way to do something similar to
>>>>> go 'failed' -> 'started' without the 'disable' step?
>>>>
>>>> In case it's a VM as a service, this could possibly be "exploited"
>>>> (never tested that, though):
>>>>
>>>>   # MANWIDTH=72 man rgmanager | col -b \
>>>>       | sed -n '/^VIRTUAL MACHINE/{:a;p;n;/^\s*$/d;ba}'
>>>>> VIRTUAL MACHINE FEATURES
>>>>>        Apart from what is noted in the VM resource agent, rgmanager
>>>>>        provides a few convenience features when dealing with
>>>>>        virtual machines.
>>>>>
>>>>>        * it will use live migration when transferring a virtual
>>>>>          machine to a more-preferred host in the cluster as a
>>>>>          consequence of failover domain operation
>>>>>        * it will search the other instances of rgmanager in the
>>>>>          cluster in the case that a user accidentally moves a
>>>>>          virtual machine using other management tools
>>>>>        * unlike services, adding a virtual machine to rgmanager's
>>>>>          configuration will not cause the virtual machine to be
>>>>>          restarted
>>>>>        * removing a virtual machine from rgmanager's configuration
>>>>>          will leave the virtual machine running.
>>>>
>>>> (see the last two items).
>>>
>>> So a possible "recover" would be to remove the VM from rgmanager,
>>> then add it back? I can see that working, but it seems
>>> heavy-handed. :)
>>>
>>>>> I tried freezing the service, no luck. I also tried coalescing via
>>>>> '-c', but that didn't help either.
>>>>
>>>> Any path from "failed" in the resource (group) life-cycle goes
>>>> either through "disabled" or "stopped" if I am not mistaken, so I
>>>> would rather experiment with adding a new service and dropping the
>>>> old one per the above description as a possible workaround (perhaps
>>>> in the reverse order so as to retain the same name for the service,
>>>> indeed unless rgmanager would actively prevent that anyway -- no
>>>> idea).
>>>
>>> This is my understanding as well, yes (that failed must go through
>>> 'disabled' or 'stopped').
>>>
>>> I'll try the remove/re-add option and report back.
>>
>> OK, didn't work.
>>
>> I corrupted the XML definition to cause rgmanager to report it as
>> 'failed', removed it from rgmanager (clustat no longer reported it at
>> all), re-added it, and when it came back, it was still listed as
>> 'failed'.
>
> Ha!
>
> So, it was still flagged as 'failed', so I called '-d' to disable it
> (after adding it back to rgmanager) and it went 'disabled' WITHOUT
> stopping the server. When I called '-e' on node 2 (the server was on
> node 1), it started on node 1 properly and returned to a 'started'
> state without restarting.
>
> I wonder if I could call disable directly from the other node...
So yes, I can. If I call '-d' on a node that ISN'T the host, it flags
the server as stopped without actually shutting it down. Then I can
call '-e' and bring it back up fine.

This feels like I am exploiting a bug though... I wonder if there is a
more "proper" way to recover the server?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
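[Editor's note] For reference, the failed -> started recovery the thread converges on can be sketched as a short shell sequence. This is only a sketch of what is described above, not a documented rgmanager procedure; the service name 'vm:foo' is the hypothetical example from the thread, and the dry-run default (`RUN=echo`) is added here so the script prints the commands instead of issuing them:

```shell
#!/bin/sh
# Sketch of the failed -> started recovery described in this thread.
# Assumes rgmanager's clusvcadm is available on the cluster nodes.
# RUN=echo makes this a dry run; clear it to execute on a real cluster.
RUN=echo

SERVICE="vm:foo"   # hypothetical service name, as reported by clustat

# 1. Fix the corrupted VM definition file first (e.g. the broken foo.xml).

# 2. From a node that is NOT hosting the VM, disable the service.
#    Per the thread, this marks it 'disabled' without powering off the VM.
DISABLE_CMD="clusvcadm -d $SERVICE"
$RUN $DISABLE_CMD

# 3. Re-enable the service; rgmanager detects the VM is already running
#    and returns it to 'started' without a restart.
ENABLE_CMD="clusvcadm -e $SERVICE"
$RUN $ENABLE_CMD
```

Run as-is, this only prints the two clusvcadm invocations. Dropping `RUN=echo` would issue them for real, which, as the thread itself notes, may be leaning on a bug rather than a supported recovery path.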