Hi Ken,

Thank you for your comments.
> The above is the reason for the behavior you're seeing.
>
> A fenced node can come back up and rejoin the cluster before the fence
> command reports completion. When Pacemaker sees the rejoin, it assumes
> the fence command completed.
>
> However in this case, the lost node rejoined on its own while fencing
> was still in progress, so that was an incorrect assumption.
>
> A proper fix will take some investigation. As a workaround in the
> meantime, you could try increasing the corosync token timeout, so the
> node is not declared lost for brief outages.

We think so, too. We understand that we can work around the problem by
increasing the corosync token timeout.

If you need the logs from when the problem occurred for your
investigation, please contact me.
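For reference, this is roughly how we understand the workaround. A minimal
sketch of the corosync.conf change follows; the 5000 ms value is only an
example, not a tested recommendation for any particular environment:

  # /etc/corosync/corosync.conf (excerpt) -- sketch of the suggested
  # workaround. Only the token value is changed; all other existing
  # settings in the totem section stay as they are.
  totem {
      version: 2

      # Time (in milliseconds) without seeing the totem token before a
      # node is declared lost. Raising it keeps brief network outages
      # from causing the node to be declared lost (and fenced).
      token: 5000
  }

If I understand correctly, corosync would need to be restarted on each
node for the new value to take effect.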
Many Thanks!
Hideo Yamauchi.

----- Original Message -----
> From: Ken Gaillot <kgail...@redhat.com>
> To: users@clusterlabs.org
> Cc:
> Date: 2015/10/29, Thu 23:09
> Subject: Re: [ClusterLabs] [Enhancement] When STONITH is not completed, a resource moves.
>
> On 10/28/2015 08:39 PM, renayama19661...@ybb.ne.jp wrote:
>> Hi All,
>>
>> We ran into the following problem with Pacemaker 1.1.12: a resource was
>> moved while STONITH had not yet completed.
>>
>> The sequence of events in the cluster appears to have been:
>>
>> Step 1) Start the cluster.
>>
>> Step 2) Node 1 fails.
>>
>> Step 3) Node 1 rejoins the cluster before the STONITH executed from
>> node 2 completes.
>>
>> Step 4) Steps 2 and 3 repeat.
>>
>> Step 5) The STONITH from node 2 is still not completed, but a resource
>> moves to node 2.
>>
>>
>> When I looked at the pe file from the time the resource moved to node 2,
>> there was no resource information for node 1.
>> (snip)
>> <status>
>>   <node_state id="3232242311" uname="node1" in_ccm="false" crmd="offline"
>>       crm-debug-origin="do_state_transition" join="down" expected="down">
>>     <transient_attributes id="3232242311">
>>       <instance_attributes id="status-3232242311">
>>         <nvpair id="status-3232242311-last-failure-prm_XXX1"
>>             name="last-failure-prm_XXX1" value="1441957021"/>
>>         <nvpair id="status-3232242311-default_ping_set"
>>             name="default_ping_set" value="300"/>
>>         <nvpair id="status-3232242311-last-failure-prm_XXX2"
>>             name="last-failure-prm_XXX2" value="1441956891"/>
>>         <nvpair id="status-3232242311-shutdown" name="shutdown" value="0"/>
>>         <nvpair id="status-3232242311-probe_complete"
>>             name="probe_complete" value="true"/>
>>       </instance_attributes>
>>     </transient_attributes>
>>   </node_state>
>>   <node_state id="3232242312" in_ccm="true" crmd="online"
>>       crm-debug-origin="do_state_transition" uname="node2" join="member"
>>       expected="member">
>>     <transient_attributes id="3232242312">
>>       <instance_attributes id="status-3232242312">
>>         <nvpair id="status-3232242312-shutdown" name="shutdown" value="0"/>
>>         <nvpair id="status-3232242312-probe_complete"
>>             name="probe_complete" value="true"/>
>>         <nvpair id="status-3232242312-default_ping_set"
>>             name="default_ping_set" value="300"/>
>>       </instance_attributes>
>>     </transient_attributes>
>>     <lrm id="3232242312">
>>       <lrm_resources>
>> (snip)
>>
>> The node's status information in the CIB is deleted while STONITH is
>> still incomplete, and the resource move seems to be caused by the CIB no
>> longer holding the resource information for that node.
>>
>> The trigger was that the cluster communication became unstable. However,
>> the cluster's behavior in this situation is itself a problem.
>>
>> We have not seen this problem on Pacemaker 1.1.13 so far. However, as
>> far as I can see from the source code, the processing is the same.
>>
>> Shouldn't the deletion of the node information be performed only after
>> all of the new node information has been gathered?
>>
>> * crmd/callback.c
>> (snip)
>> void
>> peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *data)
>> {
>> (snip)
>>     if (down) {
>>         const char *task = crm_element_value(down->xml, XML_LRM_ATTR_TASK);
>>
>>         if (alive && safe_str_eq(task, CRM_OP_FENCE)) {
>>             crm_info("Node return implies stonith of %s (action %d) completed",
>>                      node->uname, down->id);
>
> The above is the reason for the behavior you're seeing.
>
> A fenced node can come back up and rejoin the cluster before the fence
> command reports completion. When Pacemaker sees the rejoin, it assumes
> the fence command completed.
>
> However in this case, the lost node rejoined on its own while fencing
> was still in progress, so that was an incorrect assumption.
>
> A proper fix will take some investigation. As a workaround in the
> meantime, you could try increasing the corosync token timeout, so the
> node is not declared lost for brief outages.
>
>>             st_fail_count_reset(node->uname);
>>
>>             erase_status_tag(node->uname, XML_CIB_TAG_LRM, cib_scope_local);
>>             erase_status_tag(node->uname, XML_TAG_TRANSIENT_NODEATTRS, cib_scope_local);
>>             /* down->confirmed = TRUE; Only stonith-ng returning should imply completion */
>>             down->sent_update = TRUE; /* Prevent tengine_stonith_callback() from calling send_stonith_update() */
>>
>> (snip)
>>
>>
>> * We have the logs, but cannot attach them here because they contain
>>   user information.
>> * Please contact me by email if you need them.
>>
>>
>> This issue has been registered in Bugzilla:
>> * http://bugs.clusterlabs.org/show_bug.cgi?id=5254
>>
>>
>> Best Regards,
>> Hideo Yamauchi.

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org