Re: [ClusterLabs] multiple action= lines sent to STDIN of fencing agents - why?
Hi,

On 15 October 2015 at 16:45:58, Jan Pokorný (jpoko...@redhat.com) wrote:
> On 15/10/15 12:25 +0100, Adam Spiers wrote:
> > I inserted some debugging into fencing.py and found that stonithd
> > sends stuff like this to STDIN of the fencing agents it forks:
> >
> >     action=list
> >     param1=value1
> >     param2=value2
> >     param3=value3
> >     action=list
> >
> > where paramX and valueX come from the configuration of the primitive
> > for the fencing agent.
> >
> > As a corollary, if the primitive for the fencing agent has 'action'
> > defined as one of its parameters, this means that there will be three
> > 'action=' lines, and the middle one could have a different value to
> > the two sandwiching it.
> >
> > When I first saw this, I had an extended #wtf moment and thought it
> > was a bug. But on closer inspection, it seems very deliberate, e.g.
> >
> > https://github.com/ClusterLabs/pacemaker/commit/bfd620645f151b71fafafa279969e9d8bd0fd74f
> >
> > The "regardless of what the admin configured" comment suggests to me
> > that there is an underlying assumption that any fencing agent will
> > ensure that if the same parameter is duplicated on STDIN, the final
> > value will override any previous ones. And indeed fencing.py ensures
> > this, but presumably it is possible to write agents which don't use
> > fencing.py.
> >
> > Is my understanding correct? If so:
> >
> > 1) Is the first 'action=' line intended in order to set some kind of
> >    default action, in the case that the admin didn't configure the
> >    primitive with an 'action=' parameter *and* _action wasn't one of
> >    list/status/monitor/metadata? In what circumstances would this
> >    happen?

Fence agents do not use this mechanism for adding values for attributes,
so if multiple 'action' values occur, it is because the user entered them
manually. By default, if no action is defined, it is reboot (or off). In
the case of pacemaker, the action is added according to the pcmk_*_action
settings.

> > 2) Is this assumption of the agents always being order-sensitive
> >    (i.e. last value always wins) documented anywhere? The best
> >    documentation on the API I could find was here:
> >
> >    https://fedorahosted.org/cluster/wiki/FenceAgentAPI
> >
> >    but it doesn't mention this.
>
> Agreed that this implicit and undocumented arrangement smells fishy.

This info has now been added to the wiki page.

> In fact I raised this point within the RH team 3 months back but there
> wasn't any feedback at that time. I haven't filed a bug because it's
> unclear to me what's intentional and what's not between the two
> interacting sides. Back then, I was proposing that fencing.py should,
> at least, emit a message about subsequent parameter overrides.

Emitting a message when such a situation occurs is a bit tricky but
doable. The problem is that we use 'default' values for some attributes,
and later in the program we can't distinguish who entered the value. So
in some cases the warning will not be emitted.

m,
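For anyone who wants to reproduce this by hand: fence agents read the
same key=value lines from stdin that stonithd writes, so a heredoc piped
into an agent shows the last-value-wins behaviour directly. A minimal
sketch (the agent choice, address and credentials are placeholders, not
taken from the thread):

    # stonithd writes the action, then the configured parameters, then
    # the action again; with fencing.py-based agents the last line wins.
    cat <<EOF | fence_ipmilan
    action=monitor
    ipaddr=192.0.2.10
    login=admin
    passwd=secret
    action=monitor
    EOF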
[ClusterLabs] Fencing questions.
Hi

I am running a 2-node cluster with this config on CentOS 6.5/6.6, where I
have a multi-state resource foo being run in master/slave mode and a
bunch of floating IP addresses configured. Additionally, I have a
colocation constraint for the IP addresses to be colocated with the
master.

Please find the following files attached:
cluster.conf
CIB

Issues that I have:

1. Daemons required for fencing
Earlier we were invoking "cman start quorum" from the pacemaker script,
which ensured that fenced/gfs and other daemons are not started. This was
OK since fencing wasn't being handled earlier. For fencing purposes, do
we only need fenced to be started? We don't have any gfs partitions that
we want to monitor via pacemaker. My concern here is that if I use the
unmodified script, then pacemaker start time increases significantly: I
see a difference of 60 seconds from the earlier startup before "service
pacemaker status" shows up as started.

2. Fencing test cases
Based on the internet queries I could find, apart from pulling out the
dedicated cable, the only other case suggested is killing the corosync
process on one of the nodes. Are there any other basic cases that I
should look at? What about bringing an interface down manually? I
understand that this is an unlikely scenario, but I am just looking for
more ways to test this out.

3. Testing whether fencing is working or not
Previously I have been using fence_ilo4 from the shell to test whether
the command is working. I was assuming that a similar invocation would be
done by stonith when actual fencing needs to be done. However, based on
other threads I could find, people also use fence_tool to try this out.
According to me, this tests whether fencing, when invoked by fenced for a
particular node, succeeds or not. Is that valid? Since we are configuring
fence_pcmk as the fence device, the flow of things is:
fenced -> fence_pcmk -> stonith -> fence agent

4. Fencing agent to be used (fence_ipmilan vs fence_ilo4)
Also, for iLO fencing I see fence_ilo4 and fence_ipmilan both available.
I had been using fence_ilo4 till now.

I think this mail has multiple questions, and I will probably send out
another mail for a few issues I see after fencing takes place.

Thanks in advance
Arjun
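For reference on point 3, fence_pcmk speaks the same stdin API as any
other fence agent, so the fenced -> fence_pcmk -> stonith hop can be
exercised by hand roughly like this (the node name is a placeholder, and
this will really fence the node if it succeeds):

    # Simulate what fenced would send to fence_pcmk, which in turn
    # redirects the request to pacemaker's stonithd:
    cat <<EOF | fence_pcmk
    action=reboot
    port=node2
    EOF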
[ClusterLabs] Is STONITH resource location special?
Hi,

Pacemaker Explained discusses in 13.3 the special treatment of STONITH
resources. Now, I configured 6 fence_ipmilan instances in a cluster which
currently runs on 4 nodes. No STONITH resource started on the node it can
kill, even though I haven't configured location constraints yet. Is this
clever placement accidental, or is it another aspect of the special
treatment of STONITH resources? Should I configure the usual location
constraints, or are they not needed under 1.1.13?

Also, I haven't configured pcmk_host_check, so it should default to
dynamic-list, but the pcmk_host_list setting is still effective,
according to stonith_admin. I don't mind this, but it seems to contradict
the documentation. Who is right?
--
Thanks,
Feri.
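If you decide to make the placement explicit anyway, the usual pattern is
one location constraint per device; a sketch with illustrative resource
and node names (not taken from the cluster above):

    # Keep each fence device off the node it is responsible for killing:
    pcs constraint location fence-node1-ipmi avoids node1
    pcs constraint location fence-node2-ipmi avoids node2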
Re: [ClusterLabs] Fencing questions.
On 19/10/15 06:53 AM, Arjun Pandey wrote:
> Hi
>
> I am running a 2-node cluster with this config on CentOS 6.5/6.6 where

It's important to keep both nodes on the same minor version, particularly
in this case. Please either upgrade CentOS 6.5 to 6.6, or both to 6.7.

> I have a multi-state resource foo being run in master/slave mode and a
> bunch of floating IP addresses configured. Additionally, I have a
> colocation constraint for the IP addresses to be colocated with the
> master.
>
> Please find the following files attached:
> cluster.conf
> CIB

It's preferable on a mailing list to copy the text into the body of the
message. Easier to read.

> Issues that I have:
> 1. Daemons required for fencing
> Earlier we were invoking "cman start quorum" from the pacemaker script,
> which ensured that fenced/gfs and other daemons are not started. This
> was OK since fencing wasn't being handled earlier.

The cman fencing is simply a pass-through to pacemaker. When pacemaker
tells cman that fencing succeeded, it informs DLM and begins cleanup.

> For fencing purposes, do we only need fenced to be started? We don't
> have any gfs partitions that we want to monitor via pacemaker. My
> concern here is that if I use the unmodified script, then pacemaker
> start time increases significantly: I see a difference of 60 seconds
> from the earlier startup before "service pacemaker status" shows up as
> started.

Don't start fenced manually, just start pacemaker and let it handle
everything. Ideally, use the pcs command (and pcsd daemon on the nodes)
to start/stop the cluster, but you'll need to upgrade to 6.7.

> 2. Fencing test cases
> Based on the internet queries I could find, apart from pulling out the
> dedicated cable, the only other case suggested is killing the corosync
> process on one of the nodes.
> Are there any other basic cases that I should look at?
> What about bringing an interface down manually? I understand that this
> is an unlikely scenario, but I am just looking for more ways to test
> this out.

echo c > /proc/sysrq-trigger == kernel panic. It's my preferred test.

Also, killing the power to the node will cause IPMI to fail and will test
your backup fence method, if you have it, or ensure the cluster locks up
if you don't (better to hang than to risk corruption).

> 3. Testing whether fencing is working or not
> Previously I have been using fence_ilo4 from the shell to test whether
> the command is working. I was assuming that a similar invocation would
> be done by stonith when actual fencing needs to be done.
>
> However, based on other threads I could find, people also use
> fence_tool to try this out. According to me, this tests whether
> fencing, when invoked by fenced for a particular node, succeeds or not.
> Is that valid?

fence_tool is just a command to control the cluster's fencing. The
fence_X agents do the actual work.

> Since we are configuring fence_pcmk as the fence device, the flow of
> things is:
> fenced -> fence_pcmk -> stonith -> fence agent

Basically correct.

> 4. Fencing agent to be used (fence_ipmilan vs fence_ilo4)
> Also, for iLO fencing I see fence_ilo4 and fence_ipmilan both
> available. I had been using fence_ilo4 till now.

Whichever works is fine. I believe a lot of the fence_X out-of-band
agents are actually just links to fence_ipmilan, but I might be wrong.

> I think this mail has multiple questions, and I will probably send out
> another mail for a few issues I see after fencing takes place.
>
> Thanks in advance
> Arjun

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
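On point 3, the two layers can also be tested separately: the agent by
itself from the shell, and the full pacemaker fencing path with
stonith_admin. A sketch with placeholder address and credentials (note
that the second command will really fence the node):

    # Test the agent itself, outside the cluster stack:
    fence_ilo4 -a 192.0.2.20 -l admin -p secret -o status

    # Test the whole fencing path, exactly as the cluster would run it:
    stonith_admin --reboot node2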
Re: [ClusterLabs] Fencing questions.
Hi Digimer

Please find my responses inline.

On Mon, Oct 19, 2015 at 8:21 PM, Digimer wrote:
> On 19/10/15 06:53 AM, Arjun Pandey wrote:
> > Hi
> >
> > I am running a 2-node cluster with this config on CentOS 6.5/6.6
> > where
>
> It's important to keep both nodes on the same minor version,
> particularly in this case. Please either upgrade CentOS 6.5 to 6.6, or
> both to 6.7.

[Arjun] My bad. Both the nodes are on CentOS 6.6 now. We used to support
this on 6.5 earlier.

> > I have a multi-state resource foo being run in master/slave mode and
> > a bunch of floating IP addresses configured. Additionally, I have a
> > colocation constraint for the IP addresses to be colocated with the
> > master.
> >
> > Please find the following files attached:
> > cluster.conf
> > CIB
>
> It's preferable on a mailing list to copy the text into the body of the
> message. Easier to read.

[Arjun] Adding now.

> > Issues that I have:
> > 1. Daemons required for fencing
> > Earlier we were invoking "cman start quorum" from the pacemaker
> > script, which ensured that fenced/gfs and other daemons are not
> > started. This was OK since fencing wasn't being handled earlier.
>
> The cman fencing is simply a pass-through to pacemaker. When pacemaker
> tells cman that fencing succeeded, it informs DLM and begins cleanup.
>
> > For fencing purposes, do we only need fenced to be started? We don't
> > have any gfs partitions that we want to monitor via pacemaker. My
> > concern here is that if I use the unmodified script, then pacemaker
> > start time increases significantly: I see a difference of 60 seconds
> > from the earlier startup before "service pacemaker status" shows up
> > as started.
>
> Don't start fenced manually, just start pacemaker and let it handle
> everything. Ideally, use the pcs command (and pcsd daemon on the nodes)
> to start/stop the cluster, but you'll need to upgrade to 6.7.

[Arjun] I am not starting fencing manually, it gets started from the
pacemaker setup itself. However, if I look at the cman init script's
start routine, there's a case where one can call "cman start quorum", in
which fenced/gfs and other daemons are not started.

Source code link:
https://git.fedorahosted.org/cgit/cluster.git/tree/cman/init.d/cman.in?h=RHEL6#n795

This is what gets called from our pacemaker init script. I was wondering
whether I should add a new case in this script, because we don't really
use gfs/dlm. This is leading to a substantial increase in pacemaker
startup time.

> > 2. Fencing test cases
> > Based on the internet queries I could find, apart from pulling out
> > the dedicated cable, the only other case suggested is killing the
> > corosync process on one of the nodes.
> > Are there any other basic cases that I should look at?
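Before patching the init script, it may be worth confirming which daemons
the stock start path actually launches on the nodes; a quick check along
these lines (the daemon list is a guess at the usual suspects):

    # List which cluster daemons are actually running:
    for d in corosync fenced dlm_controld gfs_controld pacemakerd; do
        pgrep -x "$d" >/dev/null && echo "$d: running" || echo "$d: stopped"
    done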
Re: [ClusterLabs] Corosync & pacemaker quit between boot and login
On Fri, Oct 16, 2015 at 06:25:29PM +0200, users-requ...@clusterlabs.org wrote:
> From: Russell Martin
> Subject: [ClusterLabs] Corosync & pacemaker quit between boot and login
>
> ...
> Both corosync and pacemaker seem to start fine during boot (they both
> say "[OK]"). However, once logged in, neither daemon is running.

Hi,

since you are using a desktop install, could it be that NetworkManager
interferes with your network configuration? Otherwise you would have to
look through /var/log/corosync/corosync.log for a hint.

Regards
Matthias
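Two quick checks along those lines (the log path is the one mentioned
above; the service command assumes a sysvinit/upstart-style install):

    # Look for the reason corosync stopped:
    grep -iE 'error|fatal|exit' /var/log/corosync/corosync.log | tail -n 20

    # Check whether NetworkManager is managing the cluster interface:
    service NetworkManager status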
[ClusterLabs] Coming in 1.1.14: remapping sequential reboots to all-off-then-all-on
Pacemaker supports fencing "topologies", allowing multiple fencing
devices to be used (in conjunction or as fallbacks) when a node needs to
be fenced.

However, there is a catch when using something like redundant power
supplies. If you put two power switches in the same topology level, and
Pacemaker needs to reboot the node, it will reboot via the first power
switch and then via the second -- which has no effect, since the supplies
are redundant.

Pacemaker's upstream master branch has new handling that will be part of
the eventual 1.1.14 release. In such a case, it will turn all the devices
off, then turn them all back on again.

With previous versions, there was a complicated configuration workaround
involving creating separate devices for the off and on actions. With the
new version, it happens automatically, and no special configuration is
needed.

Here's an example where node1 is the affected node, and apc1 and apc2 are
the fence devices:

    pcs stonith level add 1 node1 apc1,apc2

Of course you can configure it using crm or XML as well.

The fencing operation will be treated as successful as long as the "off"
commands succeed, because then it is safe for the cluster to recover any
resources that were on the node. Timeouts and errors in the "on" phase
will be logged but ignored.

Any action-specific timeout for the remapped action will be used (for
example, pcmk_off_timeout will be used when executing the "off" command,
not pcmk_reboot_timeout).

The new code knows to skip the "on" step if the fence agent has automatic
unfencing (because it will happen when the node rejoins the cluster).
This allows fence_scsi to work with this feature.
--
Ken Gaillot
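For context, a complete hypothetical setup for the example above would
pair the topology level with two ordinary fence resources, something like
this (all names and credentials are illustrative, not from the
announcement):

    # Two fence devices, one per redundant power circuit:
    pcs stonith create apc1 fence_apc_snmp ipaddr=apc1.example.com \
        login=user passwd=secret pcmk_host_map="node1:1,node2:2"
    pcs stonith create apc2 fence_apc_snmp ipaddr=apc2.example.com \
        login=user passwd=secret pcmk_host_map="node1:1,node2:2"

    # Both devices in one level, so reboots get remapped to off+on:
    pcs stonith level add 1 node1 apc1,apc2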
Re: [ClusterLabs] Coming in 1.1.14: remapping sequential reboots to all-off-then-all-on
On 19/10/15 12:34 PM, Ken Gaillot wrote:
> Pacemaker supports fencing "topologies", allowing multiple fencing
> devices to be used (in conjunction or as fallbacks) when a node needs
> to be fenced.
>
> However, there is a catch when using something like redundant power
> supplies. If you put two power switches in the same topology level, and
> Pacemaker needs to reboot the node, it will reboot via the first power
> switch and then via the second -- which has no effect, since the
> supplies are redundant.
>
> Pacemaker's upstream master branch has new handling that will be part
> of the eventual 1.1.14 release. In such a case, it will turn all the
> devices off, then turn them all back on again.

How long will it stay in the 'off' state? Is it configurable? I ask
because if it's too short, some PSUs may not actually lose power. One or
two seconds should be way more than enough though.

> With previous versions, there was a complicated configuration
> workaround involving creating separate devices for the off and on
> actions. With the new version, it happens automatically, and no special
> configuration is needed.
>
> Here's an example where node1 is the affected node, and apc1 and apc2
> are the fence devices:
>
>     pcs stonith level add 1 node1 apc1,apc2

Where would the outlet definition go? 'apc1:4,apc2:4'?

> Of course you can configure it using crm or XML as well.
>
> The fencing operation will be treated as successful as long as the
> "off" commands succeed, because then it is safe for the cluster to
> recover any resources that were on the node. Timeouts and errors in the
> "on" phase will be logged but ignored.
>
> Any action-specific timeout for the remapped action will be used (for
> example, pcmk_off_timeout will be used when executing the "off"
> command, not pcmk_reboot_timeout).

I think this answers my question about how long it stays off for. What
would be an example config to control the off time then?

> The new code knows to skip the "on" step if the fence agent has
> automatic unfencing (because it will happen when the node rejoins the
> cluster). This allows fence_scsi to work with this feature.

http://i.imgur.com/i7BzivK.png

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
[ClusterLabs] Automatic IPC buffer adjustment in Pacemaker 1.1.11?
Hi,

The http://www.ultrabug.fr/tuning-pacemaker-for-large-clusters/ blog post
states that "Pacemaker v1.1.11 should come with a feature which will
allow the IPC layer to adjust the PCMK_ipc_buffer automagically".
However, I failed to identify this in the ChangeLog. Did it really
happen, can I drop my PCMK_ipc_buffer settings, and if yes, do I need to
switch this feature on somehow?
--
Thanks,
Feri.
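For reference, the manual setting being asked about is an environment
variable read by pacemaker's daemons, typically set in the sysconfig file
(the value below is only an example):

    # /etc/sysconfig/pacemaker (or /etc/default/pacemaker on Debian):
    # IPC buffer size in bytes.
    PCMK_ipc_buffer=2097152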
Re: [ClusterLabs] Coming in 1.1.14: remapping sequential reboots to all-off-then-all-on
On 10/19/2015 11:42 AM, Digimer wrote:
> On 19/10/15 12:34 PM, Ken Gaillot wrote:
>> Pacemaker supports fencing "topologies", allowing multiple fencing
>> devices to be used (in conjunction or as fallbacks) when a node needs
>> to be fenced.
>>
>> However, there is a catch when using something like redundant power
>> supplies. If you put two power switches in the same topology level,
>> and Pacemaker needs to reboot the node, it will reboot via the first
>> power switch and then via the second -- which has no effect, since the
>> supplies are redundant.
>>
>> Pacemaker's upstream master branch has new handling that will be part
>> of the eventual 1.1.14 release. In such a case, it will turn all the
>> devices off, then turn them all back on again.
>
> How long will it stay in the 'off' state? Is it configurable? I ask
> because if it's too short, some PSUs may not actually lose power. One
> or two seconds should be way more than enough though.

It simply waits for the fence agent to return success from the "off"
command before proceeding. I wouldn't assume any particular time between
that and initiating "on", and there's no way to set a delay there -- it's
up to the agent to not return success until the action is actually
complete. The standard says that agents should actually confirm that the
device is in the desired state after sending a command, so hopefully this
is already baked in.

>> With previous versions, there was a complicated configuration
>> workaround involving creating separate devices for the off and on
>> actions. With the new version, it happens automatically, and no
>> special configuration is needed.
>>
>> Here's an example where node1 is the affected node, and apc1 and apc2
>> are the fence devices:
>>
>>     pcs stonith level add 1 node1 apc1,apc2
>
> Where would the outlet definition go? 'apc1:4,apc2:4'?

"apc1" here is the name of a Pacemaker fence resource. Hostname, port,
etc. would be configured in the definition of the "apc1" resource (which
I omitted above to focus on the topology config).

>> Of course you can configure it using crm or XML as well.
>>
>> The fencing operation will be treated as successful as long as the
>> "off" commands succeed, because then it is safe for the cluster to
>> recover any resources that were on the node. Timeouts and errors in
>> the "on" phase will be logged but ignored.
>>
>> Any action-specific timeout for the remapped action will be used (for
>> example, pcmk_off_timeout will be used when executing the "off"
>> command, not pcmk_reboot_timeout).
>
> I think this answers my question about how long it stays off for. What
> would be an example config to control the off time then?

This isn't a delay, but a timeout before declaring the action failed. If
an "off" command does not return in this amount of time, the command (and
the entire topology level) will be considered failed, and the next level
will be tried.

The timeouts are configured in the fence resource definition. So,
combining the above questions, apc1 might be defined like this:

    pcs stonith create apc1 fence_apc_snmp \
        ipaddr=apc1.example.com \
        login=user passwd='supersecret' \
        pcmk_off_timeout=30s \
        pcmk_host_map="node1.example.com:1,node2.example.com:2"

>> The new code knows to skip the "on" step if the fence agent has
>> automatic unfencing (because it will happen when the node rejoins the
>> cluster). This allows fence_scsi to work with this feature.
>
> http://i.imgur.com/i7BzivK.png

:-D
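To watch the remapping in practice with a setup like the one above, a
manual fence request can be issued and the logs followed (the log
location varies by distribution; /var/log/messages is an assumption
here):

    # Request a reboot through pacemaker's fencing subsystem; the logs
    # should show off+off followed by on+on rather than two reboots:
    stonith_admin --reboot node1.example.com

    # Follow what stonith-ng actually did:
    grep -i stonith /var/log/messages | tail -n 40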