Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
12.11.2013 09:56, Vladislav Bogdanov wrote:
> ...
> Ah, then in_ccm will be set to false only when corosync (2) is stopped on
> a node, not when pacemaker is stopped? Thus, the current drbd
> agent/fencing logic does not (well) support a stop of just pacemaker in my
> use-case; the messaging layer should be stopped as well. Maybe it should
> also look at the shutdown attribute...

Just for the thread's completeness: with the patch below and the latest
pacemaker tip from the beekhof repository, the drbd fence handler returns
almost immediately, and the drbd resource is promoted without delays on the
other node after shutdown of the pacemaker instance which has it promoted.

--- a/scripts/crm-fence-peer.sh	2013-09-27 10:47:52.000000000 +0000
+++ b/scripts/crm-fence-peer.sh	2013-11-12 13:45:52.274674803 +0000
@@ -500,6 +500,21 @@ guess_if_pacemaker_will_fence()
 	[[ $crmd = banned ]] && will_fence=true
 	if [[ ${expected-down} = down && $in_ccm = false && $crmd != online ]]; then
 		: pacemaker considers this as clean down
+	elif [[ $crmd/$join/$expected = offline/down/down ]] ; then
+		# Check if pacemaker is simply shut down, but membership/quorum is possibly still established (corosync2/cman)
+		# 1.1.11 will set expected=down on a clean shutdown too
+		# Look for the "shutdown" transient node attribute
+		local node_attributes=$(set +x; echo "$cib_xml" | awk "/<node_state [^\n]*uname=\"$DRBD_PEER\"/,/<\/instance_attributes>/" | grep -F -e nvpair )
+		if [ -n "${node_attributes}" ] ; then
+			local shut_down=$(set +x; echo "$node_attributes" | awk '/ name="shutdown"/ {if (match($0, /value="([[:digit:]]+)"/, values)) {print values[1]} }')
+			if [ -n "${shut_down}" ] ; then
+				: pacemaker considers this as clean down
+			else
+				will_fence=true
+			fi
+		else
+			will_fence=true
+		fi
 	elif [[ $in_ccm = false ]] || [[ $crmd != online ]]; then
 		will_fence=true
 	fi

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:
http://bugs.clusterlabs.org
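[Editorial note: the attribute lookup in the patch above can be exercised outside a live cluster. Below is a minimal sketch; the `node_state` fragment and the timestamp value are invented for illustration — a real one comes from `cibadmin -Q` — and the sed extraction stands in for the gawk-specific `match()` call in the patch.]

```shell
# Hypothetical node_state fragment, shaped like `cibadmin -Q` output.
cib_xml='<node_state id="2" uname="node-2" in_ccm="true" crmd="offline" join="down" expected="down">
  <transient_attributes id="2">
    <instance_attributes id="status-2">
      <nvpair id="status-2-shutdown" name="shutdown" value="1384154752"/>
    </instance_attributes>
  </transient_attributes>
</node_state>'

DRBD_PEER=node-2

# Same idea as the patch: select the peer's node_state section,
# keep only nvpair lines, then extract the value of the "shutdown"
# transient attribute (a Unix timestamp when set).
node_attributes=$(echo "$cib_xml" \
    | awk "/<node_state [^\n]*uname=\"$DRBD_PEER\"/,/<\/instance_attributes>/" \
    | grep -F nvpair)
shut_down=$(echo "$node_attributes" \
    | sed -n 's/.*name="shutdown" value="\([0-9][0-9]*\)".*/\1/p')

if [ -n "$shut_down" ]; then
    echo "clean shutdown at $shut_down"   # pacemaker stopped cleanly
else
    echo "would fence"
fi
```

A non-empty `shutdown` attribute is what lets the handler treat a pacemaker-only stop as a clean down even while corosync membership is still up.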
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
11.11.2013 09:00, Vladislav Bogdanov wrote:
> ...
>>> Looking at the crm-fence-peer.sh script, it would determine the peer
>>> state as offline immediately if the node state (all of)
>>> * doesn't contain the expected tag or has it set to down
>>> * has the in_ccm tag set to false
>>> * has the crmd tag set to anything except online
>>>
>>> On the other hand, crmd sets expected = down only after fencing is
>>> complete (probably the same for in_ccm?). Shouldn't it do the same (or
>>> maybe just remove that tag) if a clean shutdown is about to complete?
>>
>> That would make sense.  Are you using the plugin, cman or corosync 2?

This one works in all tests I was able to imagine, but I'm not sure it is
completely safe to set expected=down for the old DC (in the test when drbd
is promoted on the DC and it reboots).

From ddfccc8a40cfece5c29d61f44a4467954d5c5da8 Mon Sep 17 00:00:00 2001
From: Vladislav Bogdanov bub...@hoster-ok.com
Date: Mon, 11 Nov 2013 14:32:48 +0000
Subject: [PATCH] Update node values in cib on clean shutdown

---
 crmd/callbacks.c  |    6 +++++-
 crmd/membership.c |    2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/crmd/callbacks.c b/crmd/callbacks.c
index 3dae17b..9cfb973 100644
--- a/crmd/callbacks.c
+++ b/crmd/callbacks.c
@@ -162,6 +162,8 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
         } else if (safe_str_eq(node->uname, fsa_our_dc) && crm_is_peer_active(node) == FALSE) {
             /* Did the DC leave us? */
             crm_notice("Our peer on the DC (%s) is dead", fsa_our_dc);
+            /* FIXME: is it safe? */
+            crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
             register_fsa_input(C_CRMD_STATUS_CALLBACK, I_ELECTION, NULL);
         }
         break;
@@ -169,6 +171,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d

     if (AM_I_DC) {
         xmlNode *update = NULL;
+        int flags = node_update_peer;
         gboolean alive = crm_is_peer_active(node);
         crm_action_t *down = match_down_event(0, node->uuid, NULL, appeared);

@@ -199,6 +202,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d

             crm_update_peer_join(__FUNCTION__, node, crm_join_none);
             crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
+            flags |= node_update_cluster | node_update_join | node_update_expected;
             check_join_state(fsa_state, __FUNCTION__);

             update_graph(transition_graph, down);
@@ -221,7 +225,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
             crm_trace("Other %p", down);
         }

-        update = do_update_node_cib(node, node_update_peer, NULL, __FUNCTION__);
+        update = do_update_node_cib(node, flags, NULL, __FUNCTION__);
         fsa_cib_anon_update(XML_CIB_TAG_STATUS, update,
                             cib_scope_local | cib_quorum_override | cib_can_create);
         free_xml(update);
diff --git a/crmd/membership.c b/crmd/membership.c
index be1863a..d68b3aa 100644
--- a/crmd/membership.c
+++ b/crmd/membership.c
@@ -152,7 +152,7 @@ do_update_node_cib(crm_node_t * node, int flags, xmlNode * parent, const char *s
     crm_xml_add(node_state, XML_ATTR_UNAME, node->uname);

     if (flags & node_update_cluster) {
-        if (safe_str_eq(node->state, CRM_NODE_ACTIVE)) {
+        if (crm_is_peer_active(node)) {
             value = XML_BOOLEAN_YES;
         } else if (node->state) {
             value = XML_BOOLEAN_NO;
--
1.7.1
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
On 12 Nov 2013, at 10:29 am, Andrew Beekhof and...@beekhof.net wrote:

> On 12 Nov 2013, at 2:46 am, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>> [...]
>> This one works in all tests I was able to imagine, but I'm not sure it is
>> completely safe to set expected=down for the old DC (in the test when
>> drbd is promoted on the DC and it reboots).
>>
>> [...]
>> +            /* FIXME: is it safe? */
>> +            crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
>
> Not at all safe.  It will prevent fencing.
>
>> [...]
>> +            flags |= node_update_cluster | node_update_join | node_update_expected;
>
> This does look ok though

With the exception of 'node_update_cluster'. That didn't change here and
shouldn't be touched until it really does leave the membership.

>> [...]
>> -        if (safe_str_eq(node->state, CRM_NODE_ACTIVE)) {
>> +        if (crm_is_peer_active(node)) {

This is also wrong. XML_NODE_IN_CLUSTER is purely a record of whether the
node is in the current corosync/cman/heartbeat membership.
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
Can you try with these two patches please?

+ Andrew Beekhof (4 seconds ago) fec946a: Fix: crmd: When the DC gracefully shuts down, record the new expected state into the cib (HEAD, master)
+ Andrew Beekhof (10 seconds ago) 740122a: Fix: crmd: When a peer expectedly shuts down, record the new join and expected states into the cib

On 12 Nov 2013, at 11:05 am, Andrew Beekhof and...@beekhof.net wrote:

> [...]
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
12.11.2013 03:05, Andrew Beekhof wrote:
> On 12 Nov 2013, at 10:29 am, Andrew Beekhof and...@beekhof.net wrote:
>> On 12 Nov 2013, at 2:46 am, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>>> [...]
>>> +            /* FIXME: is it safe? */
>>> +            crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
>>
>> Not at all safe.  It will prevent fencing.

I actually tried to kill corosync, and the node has been fenced. Killing
crmd on the DC resulted in its restart and a resource reprobe, and no
fencing. I thought that is probably normal.

>>> [...]
>>> +            flags |= node_update_cluster | node_update_join | node_update_expected;
>>
>> This does look ok though
>
> With the exception of 'node_update_cluster'. That didn't change here and
> shouldn't be touched until it really does leave the membership.

Ah, then in_ccm will be set to false only when corosync (2) is stopped on a
node, not when pacemaker is stopped? Thus, the current drbd agent/fencing
logic does not (well) support a stop of just pacemaker in my use-case; the
messaging layer should be stopped as well. Maybe it should also look at the
shutdown attribute... Would that be sane?

I am also thinking about a workaround: 'service pacemaker stop' first puts
the node into standby, saves a flag somewhere so that the node is put online
automatically right after the next start, and waits until pengine goes to an
idle state. After that it does the actual stop.

> [...]
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
12.11.2013 03:15, Andrew Beekhof wrote:
> Can you try with these two patches please?
>
> + Andrew Beekhof (4 seconds ago) fec946a: Fix: crmd: When the DC gracefully shuts down, record the new expected state into the cib (HEAD, master)
> + Andrew Beekhof (10 seconds ago) 740122a: Fix: crmd: When a peer expectedly shuts down, record the new join and expected states into the cib

Confirmed, they do the trick. Everything is set as expected.

As I wrote earlier, drbd is still stuck until corosync is stopped too. That
is probably a drbd issue, not a pacemaker one.
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
On 5 Nov 2013, at 2:22 am, Vladislav Bogdanov bub...@hoster-ok.com wrote:

> Hi Andrew, David, all,
>
> Just found an interesting fact, don't know if it is a bug or not.
>
> [...]
>
> Looking at the crm-fence-peer.sh script, it would determine the peer state
> as offline immediately if the node state (all of)
> * doesn't contain the expected tag or has it set to down
> * has the in_ccm tag set to false
> * has the crmd tag set to anything except online
>
> On the other hand, crmd sets expected = down only after fencing is
> complete (probably the same for in_ccm?). Shouldn't it do the same (or
> maybe just remove that tag) if a clean shutdown is about to complete?

That would make sense.  Are you using the plugin, cman or corosync 2?

> Or maybe it is possible to provide some different hint for
> crm-fence-peer.sh?
>
> [...]
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
11.11.2013 02:30, Andrew Beekhof wrote:
> On 5 Nov 2013, at 2:22 am, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>> [...]
>> On the other hand, crmd sets expected = down only after fencing is
>> complete (probably the same for in_ccm?). Shouldn't it do the same (or
>> maybe just remove that tag) if a clean shutdown is about to complete?
>
> That would make sense.  Are you using the plugin, cman or corosync 2?

corosync2

>> [...]
Re: [Pacemaker] DRBD promotion timeout after pacemaker stop on other node
11.11.2013 06:32, Vladislav Bogdanov wrote:
> 11.11.2013 02:30, Andrew Beekhof wrote:
>> [...]
>> That would make sense.  Are you using the plugin, cman or corosync 2?

Is this ok, or do I miss something?

From a8398bb73a2b66103793c360d0081589f526acf2 Mon Sep 17 00:00:00 2001
From: Vladislav Bogdanov bub...@hoster-ok.com
Date: Mon, 11 Nov 2013 05:59:17 +0000
Subject: [PATCH] Update node values in cib on clean shutdown

---
 crmd/callbacks.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/crmd/callbacks.c b/crmd/callbacks.c
index 3dae17b..8cabffb 100644
--- a/crmd/callbacks.c
+++ b/crmd/callbacks.c
@@ -221,7 +221,9 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
             crm_trace("Other %p", down);
         }

-        update = do_update_node_cib(node, node_update_peer, NULL, __FUNCTION__);
+        update = do_update_node_cib(node,
+                                    node_update_peer | node_update_cluster | node_update_join | node_update_expected,
+                                    NULL, __FUNCTION__);
         fsa_cib_anon_update(XML_CIB_TAG_STATUS, update,
                             cib_scope_local | cib_quorum_override | cib_can_create);
         free_xml(update);
--
1.7.1
[Pacemaker] DRBD promotion timeout after pacemaker stop on other node
Hi Andrew, David, all,

Just found an interesting fact, don't know if it is a bug or not.

When doing 'service pacemaker stop' on a node which has a drbd resource
promoted, that resource is not promoted on the other node, and the promote
operation times out. This is related to the drbd fence integration with
pacemaker and to the insufficient default (recommended) promote timeout for
the drbd resource. crm-fence-peer.sh places its constraint into the cib one
second after the promote operation times out (the promote op has a 90s
timeout, crm-fence-peer.sh uses that value as its own timeout, and it fully
utilizes it if it cannot say for sure that the peer node is in a sane state
- online or cleanly offline). It seems like increasing the promote op
timeout helps, but I'd expect the promotion to complete almost immediately,
instead of waiting an extra 90 seconds for nothing.

Looking at the crm-fence-peer.sh script, it would determine the peer state
as offline immediately if the node state (all of)
* doesn't contain the expected tag or has it set to down
* has the in_ccm tag set to false
* has the crmd tag set to anything except online

On the other hand, crmd sets expected = down only after fencing is complete
(probably the same for in_ccm?). Shouldn't it do the same (or maybe just
remove that tag) if a clean shutdown is about to complete? Or maybe it is
possible to provide some different hint for crm-fence-peer.sh?

Another option (actually a hack) would be to delay shutdown between the
resources stop and the processes stop (so the drbd handler on the other
node determines the peer is still online and places the constraint
immediately), but that is very fragile.

pacemaker is a one-week-old merge of the clusterlabs and beekhof masters,
drbd is 8.4.4. All runs on corosync2 (2.3.1) with libqb 0.16 on CentOS 6.

Vladislav
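[Editorial note: the three node_state conditions above can be prototyped as a tiny shell check. This is a sketch only — the `node_state` line is invented for illustration, and the `${expected:-down}` default mimics the `${expected-down}` fallback used by crm-fence-peer.sh.]

```shell
# Invented node_state element; a live one comes from the cib status section.
node_state='<node_state uname="node-2" in_ccm="false" crmd="offline" expected="down"/>'

# Pull one XML attribute value out of the node_state element.
get_attr() {
    echo "$node_state" | sed -n "s/.* $1=\"\([^\"]*\)\".*/\1/p"
}

in_ccm=$(get_attr in_ccm)
crmd=$(get_attr crmd)
expected=$(get_attr expected)

# Mirror of the crm-fence-peer.sh decision: all three conditions must
# indicate a clean down; otherwise the peer state is uncertain and the
# handler keeps waiting (up to the promote op timeout) before fencing.
if [ "${expected:-down}" = "down" ] && [ "$in_ccm" = "false" ] && [ "$crmd" != "online" ]; then
    echo "peer is cleanly down"
else
    echo "peer state uncertain"
fi
```

This also shows why a pacemaker-only stop stalls: crmd goes offline and expected goes down, but in_ccm stays true while corosync is still running, so the clean-down branch is never taken.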