Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
12.11.2013 09:56, Vladislav Bogdanov wrote:
> ...
> Ah, then in_ccm will be set to false only when corosync (2) is stopped on
> a node, not when pacemaker is stopped? Thus, the current drbd
> agent/fencing logic does not (well) support a stop of just pacemaker in my
> use-case; the messaging layer should be stopped as well. Maybe it should
> also look at the shutdown attribute...

Just for the thread's completeness: with the patch below and the latest
pacemaker tip from the beekhof repository, the drbd fence handler returns
almost immediately, and the drbd resource is promoted without delays on the
other node after shutdown of the pacemaker instance which has it promoted.

--- a/scripts/crm-fence-peer.sh	2013-09-27 10:47:52.000000000 +0000
+++ b/scripts/crm-fence-peer.sh	2013-11-12 13:45:52.274674803 +0000
@@ -500,6 +500,21 @@ guess_if_pacemaker_will_fence()
 	[[ $crmd = banned ]] && will_fence=true
 	if [[ ${expected-down} = down && $in_ccm = false && $crmd != online ]]; then
 		: pacemaker considers this as clean down
+	elif [[ $crmd/$join/$expected = offline/down/down ]] ; then
+		# Check if pacemaker is simply shut down, but membership/quorum is possibly still established (corosync2/cman)
+		# 1.1.11 will set expected=down on a clean shutdown too
+		# Look for the "shutdown" transient node attribute
+		local node_attributes=$(set +x; echo "$cib_xml" | awk "/<node_state [^\n]*uname=\"$DRBD_PEER\"/,/<\/instance_attributes>/" | grep -F -e nvpair )
+		if [ -n "${node_attributes}" ] ; then
+			local shut_down=$(set +x; echo "$node_attributes" | awk '/ name="shutdown"/ {if (match($0, /value="([[:digit:]]+)"/, values)) {print values[1]} }')
+			if [ -n "${shut_down}" ] ; then
+				: pacemaker considers this as clean down
+			else
+				will_fence=true
+			fi
+		else
+			will_fence=true
+		fi
 	elif [[ $in_ccm = false ]] || [[ $crmd != online ]]; then
 		will_fence=true
 	fi

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:
http://bugs.clusterlabs.org
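[Editorial note: the attribute lookup in the patch above can be exercised outside a live cluster. Below is a minimal sketch; the `node_state` fragment and the timestamp value are invented for illustration — a real one comes from `cibadmin -Q` — and the sed extraction stands in for the gawk-specific `match()` call in the patch.]

```shell
# Hypothetical node_state fragment, shaped like `cibadmin -Q` output.
cib_xml='<node_state id="2" uname="node-2" in_ccm="true" crmd="offline" join="down" expected="down">
  <transient_attributes id="2">
    <instance_attributes id="status-2">
      <nvpair id="status-2-shutdown" name="shutdown" value="1384154752"/>
    </instance_attributes>
  </transient_attributes>
</node_state>'

DRBD_PEER=node-2

# Same idea as the patch: select the peer's node_state section,
# keep only nvpair lines, then extract the value of the "shutdown"
# transient attribute (a Unix timestamp when set).
node_attributes=$(echo "$cib_xml" \
    | awk "/<node_state [^\n]*uname=\"$DRBD_PEER\"/,/<\/instance_attributes>/" \
    | grep -F nvpair)
shut_down=$(echo "$node_attributes" \
    | sed -n 's/.*name="shutdown" value="\([0-9][0-9]*\)".*/\1/p')

if [ -n "$shut_down" ]; then
    echo "clean shutdown at $shut_down"   # pacemaker stopped cleanly
else
    echo "would fence"
fi
```

A non-empty `shutdown` attribute is what lets the handler treat a pacemaker-only stop as a clean down even while corosync membership is still up.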
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
11.11.2013 09:00, Vladislav Bogdanov wrote:
> ...
>>> Looking at the crm-fence-peer.sh script, it would determine the peer
>>> state as offline immediately if the node state (all of)
>>> * doesn't contain the expected tag or has it set to down
>>> * has the in_ccm tag set to false
>>> * has the crmd tag set to anything except online
>>>
>>> On the other hand, crmd sets expected = down only after fencing is
>>> complete (probably the same for in_ccm?). Shouldn't it do the same (or
>>> maybe just remove that tag) if a clean shutdown is about to complete?
>>
>> That would make sense.  Are you using the plugin, cman or corosync 2?

This one works in all tests I was able to imagine, but I'm not sure it is
completely safe to set expected=down for the old DC (in the test when drbd
is promoted on the DC and it reboots).

From ddfccc8a40cfece5c29d61f44a4467954d5c5da8 Mon Sep 17 00:00:00 2001
From: Vladislav Bogdanov bub...@hoster-ok.com
Date: Mon, 11 Nov 2013 14:32:48 +0000
Subject: [PATCH] Update node values in cib on clean shutdown

---
 crmd/callbacks.c  |    6 +++++-
 crmd/membership.c |    2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/crmd/callbacks.c b/crmd/callbacks.c
index 3dae17b..9cfb973 100644
--- a/crmd/callbacks.c
+++ b/crmd/callbacks.c
@@ -162,6 +162,8 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
         } else if (safe_str_eq(node->uname, fsa_our_dc) && crm_is_peer_active(node) == FALSE) {
             /* Did the DC leave us? */
             crm_notice("Our peer on the DC (%s) is dead", fsa_our_dc);
+            /* FIXME: is it safe? */
+            crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
             register_fsa_input(C_CRMD_STATUS_CALLBACK, I_ELECTION, NULL);
         }
         break;
@@ -169,6 +171,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d

     if (AM_I_DC) {
         xmlNode *update = NULL;
+        int flags = node_update_peer;
         gboolean alive = crm_is_peer_active(node);
         crm_action_t *down = match_down_event(0, node->uuid, NULL, appeared);

@@ -199,6 +202,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d

             crm_update_peer_join(__FUNCTION__, node, crm_join_none);
             crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
+            flags |= node_update_cluster | node_update_join | node_update_expected;
             check_join_state(fsa_state, __FUNCTION__);

             update_graph(transition_graph, down);
@@ -221,7 +225,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
             crm_trace("Other %p", down);
         }

-        update = do_update_node_cib(node, node_update_peer, NULL, __FUNCTION__);
+        update = do_update_node_cib(node, flags, NULL, __FUNCTION__);
         fsa_cib_anon_update(XML_CIB_TAG_STATUS, update,
                             cib_scope_local | cib_quorum_override | cib_can_create);
         free_xml(update);
diff --git a/crmd/membership.c b/crmd/membership.c
index be1863a..d68b3aa 100644
--- a/crmd/membership.c
+++ b/crmd/membership.c
@@ -152,7 +152,7 @@ do_update_node_cib(crm_node_t * node, int flags, xmlNode * parent, const char *s
     crm_xml_add(node_state, XML_ATTR_UNAME, node->uname);

     if (flags & node_update_cluster) {
-        if (safe_str_eq(node->state, CRM_NODE_ACTIVE)) {
+        if (crm_is_peer_active(node)) {
             value = XML_BOOLEAN_YES;
         } else if (node->state) {
             value = XML_BOOLEAN_NO;
--
1.7.1
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
On 12 Nov 2013, at 10:29 am, Andrew Beekhof and...@beekhof.net wrote:

> On 12 Nov 2013, at 2:46 am, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>> [...]
>> This one works in all tests I was able to imagine, but I'm not sure it is
>> completely safe to set expected=down for the old DC (in the test when
>> drbd is promoted on the DC and it reboots).
>>
>> [...]
>> +            /* FIXME: is it safe? */
>> +            crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
>
> Not at all safe.  It will prevent fencing.
>
>> [...]
>> +            flags |= node_update_cluster | node_update_join | node_update_expected;
>
> This does look ok though

With the exception of 'node_update_cluster'. That didn't change here and
shouldn't be touched until it really does leave the membership.

>> [...]
>> -        if (safe_str_eq(node->state, CRM_NODE_ACTIVE)) {
>> +        if (crm_is_peer_active(node)) {

This is also wrong. XML_NODE_IN_CLUSTER is purely a record of whether the
node is in the current corosync/cman/heartbeat membership.
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
Can you try with these two patches please?

+ Andrew Beekhof (4 seconds ago) fec946a: Fix: crmd: When the DC gracefully shuts down, record the new expected state into the cib (HEAD, master)
+ Andrew Beekhof (10 seconds ago) 740122a: Fix: crmd: When a peer expectedly shuts down, record the new join and expected states into the cib

On 12 Nov 2013, at 11:05 am, Andrew Beekhof and...@beekhof.net wrote:

> [...]
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
12.11.2013 03:05, Andrew Beekhof wrote:
> On 12 Nov 2013, at 10:29 am, Andrew Beekhof and...@beekhof.net wrote:
>> On 12 Nov 2013, at 2:46 am, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>>> [...]
>>> +            /* FIXME: is it safe? */
>>> +            crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
>>
>> Not at all safe.  It will prevent fencing.

I actually tried to kill corosync, and the node has been fenced. Killing
crmd on the DC resulted in its restart and a resource reprobe, and no
fencing. I thought that is probably normal.

>>> [...]
>>> +            flags |= node_update_cluster | node_update_join | node_update_expected;
>>
>> This does look ok though
>
> With the exception of 'node_update_cluster'. That didn't change here and
> shouldn't be touched until it really does leave the membership.

Ah, then in_ccm will be set to false only when corosync (2) is stopped on a
node, not when pacemaker is stopped? Thus, the current drbd agent/fencing
logic does not (well) support a stop of just pacemaker in my use-case; the
messaging layer should be stopped as well. Maybe it should also look at the
shutdown attribute... Would that be sane?

I am also thinking about a workaround: 'service pacemaker stop' first puts
the node into standby, saves a flag somewhere so that the node is put online
automatically right after the next start, and waits until pengine goes to an
idle state. After that it does the actual stop.

> [...]
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
12.11.2013 03:15, Andrew Beekhof wrote:
> Can you try with these two patches please?
>
> + Andrew Beekhof (4 seconds ago) fec946a: Fix: crmd: When the DC gracefully shuts down, record the new expected state into the cib (HEAD, master)
> + Andrew Beekhof (10 seconds ago) 740122a: Fix: crmd: When a peer expectedly shuts down, record the new join and expected states into the cib

Confirmed, they do the trick. Everything is set as expected.

As I wrote earlier, drbd is still stuck until corosync is stopped too. That
is probably a drbd issue, not a pacemaker one.
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
On 5 Nov 2013, at 2:22 am, Vladislav Bogdanov bub...@hoster-ok.com wrote:

> Hi Andrew, David, all,
>
> Just found an interesting fact, don't know if it is a bug or not.
>
> [...]
>
> Looking at the crm-fence-peer.sh script, it would determine the peer state
> as offline immediately if the node state (all of)
> * doesn't contain the expected tag or has it set to down
> * has the in_ccm tag set to false
> * has the crmd tag set to anything except online
>
> On the other hand, crmd sets expected = down only after fencing is
> complete (probably the same for in_ccm?). Shouldn't it do the same (or
> maybe just remove that tag) if a clean shutdown is about to complete?

That would make sense.  Are you using the plugin, cman or corosync 2?

> Or maybe it is possible to provide some different hint for
> crm-fence-peer.sh?
>
> [...]
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
11.11.2013 02:30, Andrew Beekhof wrote:
> On 5 Nov 2013, at 2:22 am, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>> [...]
>> On the other hand, crmd sets expected = down only after fencing is
>> complete (probably the same for in_ccm?). Shouldn't it do the same (or
>> maybe just remove that tag) if a clean shutdown is about to complete?
>
> That would make sense.  Are you using the plugin, cman or corosync 2?

corosync2

>> [...]
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
11.11.2013 06:32, Vladislav Bogdanov wrote:
> 11.11.2013 02:30, Andrew Beekhof wrote:
>> [...]
>> That would make sense.  Are you using the plugin, cman or corosync 2?

Is this ok, or do I miss something?

From a8398bb73a2b66103793c360d0081589f526acf2 Mon Sep 17 00:00:00 2001
From: Vladislav Bogdanov bub...@hoster-ok.com
Date: Mon, 11 Nov 2013 05:59:17 +0000
Subject: [PATCH] Update node values in cib on clean shutdown

---
 crmd/callbacks.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/crmd/callbacks.c b/crmd/callbacks.c
index 3dae17b..8cabffb 100644
--- a/crmd/callbacks.c
+++ b/crmd/callbacks.c
@@ -221,7 +221,9 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
             crm_trace("Other %p", down);
         }

-        update = do_update_node_cib(node, node_update_peer, NULL, __FUNCTION__);
+        update = do_update_node_cib(node,
+                                    node_update_peer | node_update_cluster | node_update_join | node_update_expected,
+                                    NULL, __FUNCTION__);
         fsa_cib_anon_update(XML_CIB_TAG_STATUS, update,
                             cib_scope_local | cib_quorum_override | cib_can_create);
         free_xml(update);
--
1.7.1
[Pacemaker] DRBD promotion timeout after pacemaker stop on other node
Hi Andrew, David, all,

Just found an interesting fact, don't know if it is a bug or not.

When doing 'service pacemaker stop' on a node which has a drbd resource
promoted, that resource is not promoted on the other node, and the promote
operation times out. This is related to the drbd fence integration with
pacemaker and to the insufficient default (recommended) promote timeout for
the drbd resource. crm-fence-peer.sh places its constraint into the cib one
second after the promote operation times out (the promote op has a 90s
timeout, crm-fence-peer.sh uses that value as its own timeout, and it fully
utilizes it if it cannot say for sure that the peer node is in a sane state
- online or cleanly offline). It seems like increasing the promote op
timeout helps, but I'd expect the promotion to complete almost immediately,
instead of waiting an extra 90 seconds for nothing.

Looking at the crm-fence-peer.sh script, it would determine the peer state
as offline immediately if the node state (all of)
* doesn't contain the expected tag or has it set to down
* has the in_ccm tag set to false
* has the crmd tag set to anything except online

On the other hand, crmd sets expected = down only after fencing is complete
(probably the same for in_ccm?). Shouldn't it do the same (or maybe just
remove that tag) if a clean shutdown is about to complete? Or maybe it is
possible to provide some different hint for crm-fence-peer.sh?

Another option (actually a hack) would be to delay shutdown between the
resources stop and the processes stop (so the drbd handler on the other
node determines the peer is still online and places the constraint
immediately), but that is very fragile.

pacemaker is a one-week-old merge of the clusterlabs and beekhof masters,
drbd is 8.4.4. All runs on corosync2 (2.3.1) with libqb 0.16 on CentOS 6.

Vladislav
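[Editorial note: the three node_state conditions above can be prototyped as a tiny shell check. This is a sketch only — the `node_state` line is invented for illustration, and the `${expected:-down}` default mimics the `${expected-down}` fallback used by crm-fence-peer.sh.]

```shell
# Invented node_state element; a live one comes from the cib status section.
node_state='<node_state uname="node-2" in_ccm="false" crmd="offline" expected="down"/>'

# Pull one XML attribute value out of the node_state element.
get_attr() {
    echo "$node_state" | sed -n "s/.* $1=\"\([^\"]*\)\".*/\1/p"
}

in_ccm=$(get_attr in_ccm)
crmd=$(get_attr crmd)
expected=$(get_attr expected)

# Mirror of the crm-fence-peer.sh decision: all three conditions must
# indicate a clean down; otherwise the peer state is uncertain and the
# handler keeps waiting (up to the promote op timeout) before fencing.
if [ "${expected:-down}" = "down" ] && [ "$in_ccm" = "false" ] && [ "$crmd" != "online" ]; then
    echo "peer is cleanly down"
else
    echo "peer state uncertain"
fi
```

This also shows why a pacemaker-only stop stalls: crmd goes offline and expected goes down, but in_ccm stays true while corosync is still running, so the clean-down branch is never taken.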