>>> Digimer <[email protected]> wrote on 07.10.2020 at 05:42 in message
<[email protected]>:
> Hi all,
>
> While developing our program (and not being a production cluster), I
> find that when I push broken code to a node, causing the RA to fail to
> perform an operation, the node gets fenced. (example below).

(I see others have replied, too, but anyway.)
Specifically, it's the "stop" operation that may not fail.

>
> This brings up a question;
>
> If a single resource fails for any reason and can't be recovered, but
> other resources on the node are still operational, how can I suppress a
> self-fence? I'd rather one failed resource than having all resources get
> killed (they're VMs, so restarting on the peer is ... disruptive).

I think you can (on-fail=block, AFAIR). Note: This is not a political
statement about any upcoming elections ;-)

>
> If this is a bad approach (sufficiently bad to justify hard-rebooting
> other VMs that had been running on the same node), why is that? Are
> there any less-bad options for this scenario?
>
> Obviously, I would never push untested code to a production system,
> but knowing now that this is possible (losing a node with its other VMs
> on an RA / code fault), I'm worried about some unintended "oops" causing
> the loss of a node.
>
> For example, would it be possible to have the node try to live migrate
> services to the other peer, before self-fencing in a scenario like this?

As there is no guarantee that migration will succeed without fencing the
node, it could only be done with a timeout; otherwise the node would hang
while waiting for the migration to succeed.

> Are there other options / considerations I might be missing here?
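
To make the on-fail=block idea concrete: the following is only a sketch,
untested here, and it assumes the stop op from your config quoted below is
otherwise left unchanged. The only addition is the on-fail attribute; with
fencing enabled, the default on-fail for a stop operation is "fence".

```xml
<!-- Sketch only: the stop op from the quoted config, with on-fail="block"
     added so that a failed stop blocks the resource instead of fencing
     the node. All other attributes are copied from the original config. -->
<op id="srv07-el6-stop-interval-0s" interval="0s" name="stop"
    timeout="INFINITY" on-fail="block"/>
```

The trade-off is that the cluster then performs no further operations on
that resource, and it cannot safely start it elsewhere, until someone
investigates and cleans up the failure by hand (for example with
"crm_resource --cleanup --resource srv07-el6").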
>
> example VM config:
>
> ====
> <primitive class="ocf" id="srv07-el6" provider="alteeve" type="server">
>   <instance_attributes id="srv07-el6-instance_attributes">
>     <nvpair id="srv07-el6-instance_attributes-name" name="name" value="srv07-el6"/>
>   </instance_attributes>
>   <meta_attributes id="srv07-el6-meta_attributes">
>     <nvpair id="srv07-el6-meta_attributes-allow-migrate" name="allow-migrate" value="true"/>
>     <nvpair id="srv07-el6-meta_attributes-migrate_to" name="migrate_to" value="INFINITY"/>
>     <nvpair id="srv07-el6-meta_attributes-stop" name="stop" value="INFINITY"/>
>     <nvpair id="srv07-el6-meta_attributes-target-role" name="target-role" value="Stopped"/>
>   </meta_attributes>
>   <operations>
>     <op id="srv07-el6-migrate_from-interval-0s" interval="0s" name="migrate_from" timeout="600"/>
>     <op id="srv07-el6-migrate_to-interval-0s" interval="0s" name="migrate_to" timeout="INFINITY"/>
>     <op id="srv07-el6-monitor-interval-60" interval="60" name="monitor" on-fail="block"/>
>     <op id="srv07-el6-notify-interval-0s" interval="0s" name="notify" timeout="20"/>
>     <op id="srv07-el6-start-interval-0s" interval="0s" name="start" timeout="30"/>
>     <op id="srv07-el6-stop-interval-0s" interval="0s" name="stop" timeout="INFINITY"/>
>   </operations>
> </primitive>
> ====
>
> Logs from a code oops in the RA triggering a node self-fence;
>
> ====
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]: notice: srv07-el6_stop_0:36779:stderr [ DBD::Pg::db do failed: ERROR: syntax error at or near "3" ]
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]: notice: srv07-el6_stop_0:36779:stderr [ LINE 1: ...ut off, server_boot_time = 0 WHERE server_uuid = '3d73db4c-d... ]
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]: notice: srv07-el6_stop_0:36779:stderr [ ^ at /usr/share/perl5/Anvil/Tools/Database.pm line 13791. ]

As I'm writing a lot of Perl code, too: Do you know "perl -c" to check
the syntax, BTW?
And don't forget ocf-tester. ;-)

> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]: notice: srv07-el6_stop_0:36779:stderr [ DBD::Pg::db do failed: ERROR: syntax error at or near "3" ]
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]: notice: srv07-el6_stop_0:36779:stderr [ LINE 1: ...ut off, server_boot_time = 0 WHERE server_uuid = '3d73db4c-d... ]
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]: notice: srv07-el6_stop_0:36779:stderr [ ^ at /usr/share/perl5/Anvil/Tools/Database.pm line 13791. ]
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-controld[33819]: notice: Result of stop operation for srv07-el6 on mk-a02n01: 1 (error)
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-controld[33819]: notice: mk-a02n01-srv07-el6_stop_0:51 [ DBD::Pg::db do failed: ERROR: syntax error at or near "3"\nLINE 1: ...ut off, server_boot_time = 0 WHERE server_uuid = '3d73db4c-d...\n ^ at /usr/share/perl5/Anvil/Tools/Database.pm line 13791.\nDBD::Pg::db do failed: ERROR: syntax error at or near "3"\nLINE 1: ...ut off, server_boot_time = 0 WHERE server_uuid = '3d73db4c-d...\n ^ at /usr/share/p
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-attrd[33817]: notice: Setting fail-count-srv07-el6#stop_0[mk-a02n01]: (unset) -> INFINITY
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-attrd[33817]: notice: Setting last-failure-srv07-el6#stop_0[mk-a02n01]: (unset) -> 1602041634
> Connection to mk-a02n01.ifn closed by remote host.
> Connection to mk-a02n01.ifn closed.
> ====
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops."
> - Stephen Jay Gould
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
