[ClusterLabs] Re: [EXT] Avoiding self-fence on RA failure

Ulrich Windl Tue, 06 Oct 2020 23:36:25 -0700

>>> Digimer <[email protected]> wrote on 07.10.2020 at 05:42 in message
<[email protected]>:
> Hi all,
> 
>   While developing our program (and not being a production cluster), I
> find that when I push broken code to a node, causing the RA to fail to
> perform an operation, the node gets fenced. (example below).


(I see others have replied, too, but anyway)
Specifically, it's the "stop" operation that must not fail; a failed stop is
what gets the node fenced.

> 
>   This brings up a question;
> 
>   If a single resource fails for any reason and can't be recovered, but
> other resources on the node are still operational, how can I suppress a
> self-fence? I'd rather one failed resource than having all resources get
> killed (they're VMs, so restarting on the peer is ... disruptive).

I think you can: set on-fail=block on the operation (AFAIR).
Note: This is not a political statement for any near elections ;-)
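
A minimal sketch of what that could look like for the srv07-el6 config quoted
further down, assuming the goal is to keep a failed "stop" from escalating to
fencing. Only on-fail="block" is new; the id, interval and timeout are copied
from the quoted config. As far as I know, with on-fail=block the cluster
performs no further operations on the resource until the failure is cleaned up
manually, so the VM would be left as it is rather than recovered.

```xml
<!-- Sketch only: the stop op from the quoted config, with on-fail="block" added -->
<op id="srv07-el6-stop-interval-0s" interval="0s" name="stop"
    timeout="INFINITY" on-fail="block"/>
```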

> 
>   If this is a bad approach (sufficiently bad to justify hard-rebooting
> other VMs that had been running on the same node), why is that? Are
> there any less-bad options for this scenario?
> 
>   Obviously, I would never push untested code to a production system,
> but knowing now that this is possible (losing a node with its other VMs
> on an RA / code fault), I'm worried about some unintended "oops" causing
> the loss of a node.
> 
>   For example, would it be possible to have the node try to live migrate
> services to the other peer, before self-fencing in a scenario like this?

As there is no guarantee that migration will succeed without fencing the node,
it could only be done with a timeout; otherwise the node would hang while
waiting for migration to succeed.
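
For reference, the config quoted below sets the migrate_to timeout to INFINITY,
so a stuck migration would be waited on indefinitely. A sketch of a bounded
variant follows; the value 600 is an assumption, borrowed from the migrate_from
op in the same config. This does not add any migrate-before-fence behaviour, it
only limits how long a migration attempt may take.

```xml
<!-- Sketch only: migrate_to with a finite timeout instead of INFINITY.
     600 is an assumed value, taken from migrate_from in the quoted config. -->
<op id="srv07-el6-migrate_to-interval-0s" interval="0s"
    name="migrate_to" timeout="600"/>
```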


> Are there other options / considerations I might be missing here?
> 
> example VM config:
> 
> ====
>       <primitive class="ocf" id="srv07-el6" provider="alteeve"
> type="server">
>         <instance_attributes id="srv07-el6-instance_attributes">
>           <nvpair id="srv07-el6-instance_attributes-name" name="name"
> value="srv07-el6"/>
>         </instance_attributes>
>         <meta_attributes id="srv07-el6-meta_attributes">
>           <nvpair id="srv07-el6-meta_attributes-allow-migrate"
> name="allow-migrate" value="true"/>
>           <nvpair id="srv07-el6-meta_attributes-migrate_to"
> name="migrate_to" value="INFINITY"/>
>           <nvpair id="srv07-el6-meta_attributes-stop" name="stop"
> value="INFINITY"/>
>           <nvpair id="srv07-el6-meta_attributes-target-role"
> name="target-role" value="Stopped"/>
>         </meta_attributes>
>         <operations>
>           <op id="srv07-el6-migrate_from-interval-0s" interval="0s"
> name="migrate_from" timeout="600"/>
>           <op id="srv07-el6-migrate_to-interval-0s" interval="0s"
> name="migrate_to" timeout="INFINITY"/>
>           <op id="srv07-el6-monitor-interval-60" interval="60"
> name="monitor" on-fail="block"/>
>           <op id="srv07-el6-notify-interval-0s" interval="0s"
> name="notify" timeout="20"/>
>           <op id="srv07-el6-start-interval-0s" interval="0s"
> name="start" timeout="30"/>
>           <op id="srv07-el6-stop-interval-0s" interval="0s" name="stop"
> timeout="INFINITY"/>
>         </operations>
>       </primitive>
> ====
> 
> Logs from a code oops in the RA triggering a node self-fence;
> 
> ====
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
> srv07-el6_stop_0:36779:stderr [ DBD::Pg::db do failed: ERROR:  syntax
> error at or near "3" ]
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
> srv07-el6_stop_0:36779:stderr [ LINE 1: ...ut off, server_boot_time = 0
> WHERE server_uuid = '3d73db4c-d... ]
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
> srv07-el6_stop_0:36779:stderr [
>                      ^ at /usr/share/perl5/Anvil/Tools/Database.pm line
> 13791. ]

As I write a lot of Perl code, too: do you know about "perl -c" for checking
the syntax, BTW?

And don't forget ocf-tester. ;-)

> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
> srv07-el6_stop_0:36779:stderr [ DBD::Pg::db do failed: ERROR:  syntax
> error at or near "3" ]
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
> srv07-el6_stop_0:36779:stderr [ LINE 1: ...ut off, server_boot_time = 0
> WHERE server_uuid = '3d73db4c-d... ]
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]:  notice:
> srv07-el6_stop_0:36779:stderr [
>                      ^ at /usr/share/perl5/Anvil/Tools/Database.pm line
> 13791. ]
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-controld[33819]:  notice:
> Result of stop operation for srv07-el6 on mk-a02n01: 1 (error)
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-controld[33819]:  notice:
> mk-a02n01-srv07-el6_stop_0:51 [ DBD::Pg::db do failed: ERROR:  syntax
> error at or near "3"\nLINE 1: ...ut off, server_boot_time = 0 WHERE
> server_uuid = '3d73db4c-d...\n
>                    ^ at /usr/share/perl5/Anvil/Tools/Database.pm line
> 13791.\nDBD::Pg::db do failed: ERROR:  syntax error at or near "3"\nLINE
> 1: ...ut off, server_boot_time = 0 WHERE server_uuid = '3d73db4c-d...\n
>                                                             ^ at
> /usr/share/p
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-attrd[33817]:  notice:
> Setting fail-count-srv07-el6#stop_0[mk-a02n01]: (unset) -> INFINITY
> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-attrd[33817]:  notice:
> Setting last-failure-srv07-el6#stop_0[mk-a02n01]: (unset) -> 1602041634
> Connection to mk-a02n01.ifn closed by remote host.
> Connection to mk-a02n01.ifn closed.
> ====
> 
> -- 
> Digimer
> Papers and Projects: https://alteeve.com/w/ 
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



