On 2020-10-07 2:20 a.m., Digimer wrote:
> On 2020-10-07 1:49 a.m., Andrei Borzenkov wrote:
>> 07.10.2020 06:42, Digimer wrote:
>>> Hi all,
>>>
>>> While developing our program (and not being a production cluster), I
>>> find that when I push broken code to a node, causing the RA to fail to
>>> perform an operation, the node gets fenced. (example below).
>>>
>>> This brings up a question;
>>>
>>> If a single resource fails for any reason and can't be recovered, but
>>> other resources on the node are still operational, how can I suppress a
>>> self-fence? I'd rather one failed resource than having all resources get
>>> killed (they're VMs, so restarting on the peer is ... disruptive).
>>>
>>> If this is a bad approach (sufficiently bad to justify hard-rebooting
>>> other VMs that had been running on the same node), why is that? Are
>>> there any less-bad options for this scenario?
>>>
>>> Obviously, I would never push untested code to a production system,
>>> but knowing now that this is possible (losing a node with its other VMs
>>> on an RA / code fault), I'm worried about some unintended "oops" causing
>>> the loss of a node.
>>>
>>> For example, would it be possible to have the node try to live migrate
>>> services to the other peer, before self-fencing in a scenario like this?
>>> Are there other options / considerations I might be missing here?
>>>
>>> example VM config:
>>>
>>> ====
>>> <primitive class="ocf" id="srv07-el6" provider="alteeve" type="server">
>>>   <instance_attributes id="srv07-el6-instance_attributes">
>>>     <nvpair id="srv07-el6-instance_attributes-name" name="name" value="srv07-el6"/>
>>>   </instance_attributes>
>>>   <meta_attributes id="srv07-el6-meta_attributes">
>>>     <nvpair id="srv07-el6-meta_attributes-allow-migrate" name="allow-migrate" value="true"/>
>>>     <nvpair id="srv07-el6-meta_attributes-migrate_to" name="migrate_to" value="INFINITY"/>
>>>     <nvpair id="srv07-el6-meta_attributes-stop" name="stop" value="INFINITY"/>
>>>     <nvpair id="srv07-el6-meta_attributes-target-role" name="target-role" value="Stopped"/>
>>>   </meta_attributes>
>>>   <operations>
>>>     <op id="srv07-el6-migrate_from-interval-0s" interval="0s" name="migrate_from" timeout="600"/>
>>>     <op id="srv07-el6-migrate_to-interval-0s" interval="0s" name="migrate_to" timeout="INFINITY"/>
>>>     <op id="srv07-el6-monitor-interval-60" interval="60" name="monitor" on-fail="block"/>
>>>     <op id="srv07-el6-notify-interval-0s" interval="0s" name="notify" timeout="20"/>
>>>     <op id="srv07-el6-start-interval-0s" interval="0s" name="start" timeout="30"/>
>>>     <op id="srv07-el6-stop-interval-0s" interval="0s" name="stop" timeout="INFINITY"/>
>>>   </operations>
>>> </primitive>
>>> ====
>>>
>>> Logs from a code oops in the RA triggering a node self-fence;
>>>
>>> ====
>>> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]: notice: srv07-el6_stop_0:36779:stderr [ DBD::Pg::db do failed: ERROR: syntax error at or near "3" ]
>>
>> Only stop operation failure results in stonith by default, you can
>> change it with on-fail operation attribute. The only other sensible
>> value would be "block".
>
> Ah, it looks like I misunderstood how on-fail="block" works. I see in
> the CIB it was only applied to the monitor action (which I probably
> don't want, as I want it to recover if a monitor fails).
>
> I've changed the CIB to below, I'll see how this handles future code
> oopses.
>
> Thanks!
>
> digimer
>
> ====
> <primitive class="ocf" id="srv07-el6" provider="alteeve" type="server">
>   <instance_attributes id="srv07-el6-instance_attributes">
>     <nvpair id="srv07-el6-instance_attributes-name" name="name" value="srv07-el6"/>
>   </instance_attributes>
>   <meta_attributes id="srv07-el6-meta_attributes">
>     <nvpair id="srv07-el6-meta_attributes-allow-migrate" name="allow-migrate" value="true"/>
>     <nvpair id="srv07-el6-meta_attributes-migrate_to" name="migrate_to" value="INFINITY"/>
>     <nvpair id="srv07-el6-meta_attributes-stop" name="stop" value="INFINITY"/>
>     <nvpair id="srv07-el6-meta_attributes-target-role" name="target-role" value="stopped"/>
>   </meta_attributes>
>   <operations>
>     <op id="srv07-el6-migrate_from-interval-0s" interval="0s" name="migrate_from" timeout="600"/>
>     <op id="srv07-el6-migrate_to-interval-0s" interval="0s" name="migrate_to" timeout="INFINITY"/>
>     <op id="srv07-el6-monitor-interval-60" interval="60" name="monitor"/>
>     <op id="srv07-el6-notify-interval-0s" interval="0s" name="notify" timeout="20"/>
>     <op id="srv07-el6-start-interval-0s" interval="0s" name="start" on-fail="block" timeout="INFINITY"/>
>     <op id="srv07-el6-stop-interval-0s" interval="0s" name="stop" on-fail="block" timeout="INFINITY"/>
>   </operations>
> </primitive>
> ====
Update: this worked! I faulted the RA and the server entered a FAILED
state, with no fencing.

Thanks again!

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
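
For anyone who wants to check which failure policy each operation ends up with, here is a rough sketch in Python. It is not part of Pacemaker or of the alteeve RA. The helper name `effective_on_fail` is made up for illustration, and the fallback values encode the documented defaults as I understand them: a failed stop is fenced when stonith is enabled and blocked otherwise, and any other operation defaults to "restart". The embedded fragment is an abridged copy of the changed CIB quoted above.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the changed primitive from this thread (ops only).
CIB_FRAGMENT = """
<primitive class="ocf" id="srv07-el6" provider="alteeve" type="server">
  <operations>
    <op id="srv07-el6-monitor-interval-60" interval="60" name="monitor"/>
    <op id="srv07-el6-start-interval-0s" interval="0s" name="start"
        on-fail="block" timeout="INFINITY"/>
    <op id="srv07-el6-stop-interval-0s" interval="0s" name="stop"
        on-fail="block" timeout="INFINITY"/>
  </operations>
</primitive>
"""


def effective_on_fail(primitive_xml, stonith_enabled=True):
    """Map each op name to its on-fail policy (hypothetical helper).

    Assumption: when on-fail is not set, stop falls back to "fence" if
    stonith is enabled and "block" if not; everything else falls back
    to "restart".
    """
    root = ET.fromstring(primitive_xml)
    policies = {}
    for op in root.iter("op"):
        name = op.get("name")
        if name == "stop":
            default = "fence" if stonith_enabled else "block"
        else:
            default = "restart"
        policies[name] = op.get("on-fail", default)
    return policies


if __name__ == "__main__":
    for name, policy in effective_on_fail(CIB_FRAGMENT).items():
        print(f"{name}: on-fail={policy}")
```

Run against the first config in the thread, the same check would report the stop operation as "fence", which matches the self-fence seen in the logs.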