Re: [Pacemaker] About behavior in Action Lost.
2010/10/7 Andrew Beekhof and...@beekhof.net:
> On Thu, Oct 7, 2010 at 11:48 AM, Keisuke MORI keisuke.mori...@gmail.com wrote:
>> Andrew,
>>
>> 2010/9/23 Andrew Beekhof and...@beekhof.net:
>>> Pushed as:
>>>   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>>
>>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>>
>> I would like to backport this to 1.0.
>> Would you agree with this?
>
> I would prefer not to, but if it is important to you then I will agree.

Thank you for your ACK.
It's now in 1.0.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/146e405c1afa

--
Keisuke MORI

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] About behavior in Action Lost.
Andrew,

2010/9/23 Andrew Beekhof and...@beekhof.net:
> Pushed as:
>   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>
> Not sure about applying to 1.0 though, it's a dramatic change in behavior.

I would like to backport this to 1.0.
Would you agree with this?

Without this, the failed node was not fenced when it ought to have been, and the service failed to continue.

I would also think that it would be good to have the same behavior between 1.0 and 1.1 in such a critical condition, to support both versions better.

Thanks,
Keisuke MORI

> On Wed, Sep 22, 2010 at 11:18 AM, renayama19661...@ybb.ne.jp wrote:
>> Hi Andrew,
>>
>> Thank you for your comment.
>>
>>> A long time ago in a galaxy far away, some messaging layers used to lose quite a few actions, including stops.
>>> About the same time, we decided that fencing because a stop action was lost wasn't a good idea.
>>>
>>> The rationale was that if the operation eventually completed, it would end up in the CIB anyway.
>>> And even if it didn't, the PE would continue to try the operation again until the whole node fell over, at which point it would get shot anyway.
>>
>> Sorry... I did not know the fact that there was such an argument in old days.
>>
>>> Now, having said that, things have improved since then and perhaps, in the interest of speeding up recovery in these situations, it is time to stop treating stop operations differently.
>>> Would you agree?
>>
>> That means, you change it in the case of Action Lost of the stop this time to carry out stonith?
>> If my recognition is right, I agree too.
>>
>>     if(timer->action->type != action_type_rsc) {
>>         send_update = FALSE;
>>     } else if(safe_str_eq(task, "cancel")) {
>>         /* we dont need to update the CIB with these */
>>         send_update = FALSE;
>>     }
>>     ---> delete else if(safe_str_eq(task, "stop")){..} ?
>>
>>     if(send_update) {
>>         /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
>>         cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
>>     }
>>
>> Best Regards,
>> Hideo Yamauchi.
Re: [Pacemaker] About behavior in Action Lost.
On Thu, Oct 7, 2010 at 11:48 AM, Keisuke MORI keisuke.mori...@gmail.com wrote:
> Andrew,
>
> 2010/9/23 Andrew Beekhof and...@beekhof.net:
>> Pushed as:
>>   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>
>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>
> I would like to backport this to 1.0.
> Would you agree with this?

I would prefer not to, but if it is important to you then I will agree.

> Without this, the failed node was not fenced when it ought to have been, and the service failed to continue.
>
> I would also think that it would be good to have the same behavior between 1.0 and 1.1 in such a critical condition, to support both versions better.
>
> Thanks,
> Keisuke MORI
>
>> On Wed, Sep 22, 2010 at 11:18 AM, renayama19661...@ybb.ne.jp wrote:
>>> Hi Andrew,
>>>
>>> Thank you for your comment.
>>>
>>>> A long time ago in a galaxy far away, some messaging layers used to lose quite a few actions, including stops.
>>>> About the same time, we decided that fencing because a stop action was lost wasn't a good idea.
>>>>
>>>> The rationale was that if the operation eventually completed, it would end up in the CIB anyway.
>>>> And even if it didn't, the PE would continue to try the operation again until the whole node fell over, at which point it would get shot anyway.
>>>
>>> Sorry... I did not know the fact that there was such an argument in old days.
>>>
>>>> Now, having said that, things have improved since then and perhaps, in the interest of speeding up recovery in these situations, it is time to stop treating stop operations differently.
>>>> Would you agree?
>>>
>>> That means, you change it in the case of Action Lost of the stop this time to carry out stonith?
>>> If my recognition is right, I agree too.
>>>
>>>     if(timer->action->type != action_type_rsc) {
>>>         send_update = FALSE;
>>>     } else if(safe_str_eq(task, "cancel")) {
>>>         /* we dont need to update the CIB with these */
>>>         send_update = FALSE;
>>>     }
>>>     ---> delete else if(safe_str_eq(task, "stop")){..} ?
>>>
>>>     if(send_update) {
>>>         /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
>>>         cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
>>>     }
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>> --- Andrew Beekhof and...@beekhof.net wrote:
>>>
>>>> On Tue, Sep 21, 2010 at 8:59 AM, renayama19661...@ybb.ne.jp wrote:
>>>>> Hi,
>>>>>
>>>>> Node was in state that the load was very high, and we confirmed monitor movement of Pacemaker.
>>>>> Action Lost occurred in stop movement after the error of the monitor occurred.
>>>>>
>>>>> Sep 8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
>>>>> Sep 8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
>>>>>
>>>>> For the load of the node, we think that the stop movement did not go well.
>>>>> But cannot nodes execute stonith.
>>>>
>>>> A long time ago in a galaxy far away, some messaging layers used to lose quite a few actions, including stops.
>>>> About the same time, we decided that fencing because a stop action was lost wasn't a good idea.
>>>>
>>>> The rationale was that if the operation eventually completed, it would end up in the CIB anyway.
>>>> And even if it didn't, the PE would continue to try the operation again until the whole node fell over, at which point it would get shot anyway.
>>>>
>>>> Now, having said that, things have improved since then and perhaps, in the interest of speeding up recovery in these situations, it is time to stop treating stop operations differently.
>>>> Would you agree?
Re: [Pacemaker] About behavior in Action Lost.
Sorry, it probably got rebased before I pushed it.
http://hg.clusterlabs.org/pacemaker/1.1/rev/dd8e37df3e96 should be the right link

On Wed, Sep 29, 2010 at 2:51 AM, renayama19661...@ybb.ne.jp wrote:
> Hi Andrew,
>
>> Pushed as:
>>   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>
>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>
> The change at this link is not found.
> Where did you update it?
>
> Best Regards,
> Hideo Yamauchi.
>
> --- Andrew Beekhof and...@beekhof.net wrote:
>
>> Pushed as:
>>   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>
>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>>
>> On Wed, Sep 22, 2010 at 11:18 AM, renayama19661...@ybb.ne.jp wrote:
>>> Hi Andrew,
>>>
>>> Thank you for your comment.
>>>
>>>> A long time ago in a galaxy far away, some messaging layers used to lose quite a few actions, including stops.
>>>> About the same time, we decided that fencing because a stop action was lost wasn't a good idea.
>>>>
>>>> The rationale was that if the operation eventually completed, it would end up in the CIB anyway.
>>>> And even if it didn't, the PE would continue to try the operation again until the whole node fell over, at which point it would get shot anyway.
>>>
>>> Sorry... I did not know the fact that there was such an argument in old days.
>>>
>>>> Now, having said that, things have improved since then and perhaps, in the interest of speeding up recovery in these situations, it is time to stop treating stop operations differently.
>>>> Would you agree?
>>>
>>> That means, you change it in the case of Action Lost of the stop this time to carry out stonith?
>>> If my recognition is right, I agree too.
>>>
>>>     if(timer->action->type != action_type_rsc) {
>>>         send_update = FALSE;
>>>     } else if(safe_str_eq(task, "cancel")) {
>>>         /* we dont need to update the CIB with these */
>>>         send_update = FALSE;
>>>     }
>>>     ---> delete else if(safe_str_eq(task, "stop")){..} ?
>>>
>>>     if(send_update) {
>>>         /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
>>>         cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
>>>     }
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>> --- Andrew Beekhof and...@beekhof.net wrote:
>>>
>>>> On Tue, Sep 21, 2010 at 8:59 AM, renayama19661...@ybb.ne.jp wrote:
>>>>> Hi,
>>>>>
>>>>> Node was in state that the load was very high, and we confirmed monitor movement of Pacemaker.
>>>>> Action Lost occurred in stop movement after the error of the monitor occurred.
>>>>>
>>>>> Sep 8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
>>>>> Sep 8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
>>>>>
>>>>> For the load of the node, we think that the stop movement did not go well.
>>>>> But cannot nodes execute stonith.
>>>>
>>>> A long time ago in a galaxy far away, some messaging layers used to lose quite a few actions, including stops.
>>>> About the same time, we decided that fencing because a stop action was lost wasn't a good idea.
>>>>
>>>> The rationale was that if the operation eventually completed, it would end up in the CIB anyway.
>>>> And even if it didn't, the PE would continue to try the operation again until the whole node fell over, at which point it would get shot anyway.
>>>>
>>>> Now, having said that, things have improved since then and perhaps, in the interest of speeding up recovery in these situations, it is time to stop treating stop operations differently.
>>>> Would you agree?
Re: [Pacemaker] About behavior in Action Lost.
Hi Andrew,

> Sorry, it probably got rebased before I pushed it.
> http://hg.clusterlabs.org/pacemaker/1.1/rev/dd8e37df3e96 should be the right link

Thanks!!

Hideo Yamauchi.

--- Andrew Beekhof and...@beekhof.net wrote:

> Sorry, it probably got rebased before I pushed it.
> http://hg.clusterlabs.org/pacemaker/1.1/rev/dd8e37df3e96 should be the right link
>
> On Wed, Sep 29, 2010 at 2:51 AM, renayama19661...@ybb.ne.jp wrote:
>> Hi Andrew,
>>
>>> Pushed as:
>>>   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>>
>>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>>
>> The change at this link is not found.
>> Where did you update it?
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>> --- Andrew Beekhof and...@beekhof.net wrote:
>>
>>> Pushed as:
>>>   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>>
>>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>>>
>>> On Wed, Sep 22, 2010 at 11:18 AM, renayama19661...@ybb.ne.jp wrote:
>>>> Hi Andrew,
>>>>
>>>> Thank you for your comment.
>>>>
>>>>> A long time ago in a galaxy far away, some messaging layers used to lose quite a few actions, including stops.
>>>>> About the same time, we decided that fencing because a stop action was lost wasn't a good idea.
>>>>>
>>>>> The rationale was that if the operation eventually completed, it would end up in the CIB anyway.
>>>>> And even if it didn't, the PE would continue to try the operation again until the whole node fell over, at which point it would get shot anyway.
>>>>
>>>> Sorry... I did not know the fact that there was such an argument in old days.
>>>>
>>>>> Now, having said that, things have improved since then and perhaps, in the interest of speeding up recovery in these situations, it is time to stop treating stop operations differently.
>>>>> Would you agree?
>>>>
>>>> That means, you change it in the case of Action Lost of the stop this time to carry out stonith?
>>>> If my recognition is right, I agree too.
>>>>
>>>>     if(timer->action->type != action_type_rsc) {
>>>>         send_update = FALSE;
>>>>     } else if(safe_str_eq(task, "cancel")) {
>>>>         /* we dont need to update the CIB with these */
>>>>         send_update = FALSE;
>>>>     }
>>>>     ---> delete else if(safe_str_eq(task, "stop")){..} ?
>>>>
>>>>     if(send_update) {
>>>>         /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
>>>>         cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
>>>>     }
>>>>
>>>> Best Regards,
>>>> Hideo Yamauchi.
>>>>
>>>> --- Andrew Beekhof and...@beekhof.net wrote:
>>>>
>>>>> On Tue, Sep 21, 2010 at 8:59 AM, renayama19661...@ybb.ne.jp wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Node was in state that the load was very high, and we confirmed monitor movement of Pacemaker.
>>>>>> Action Lost occurred in stop movement after the error of the monitor occurred.
>>>>>>
>>>>>> Sep 8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
>>>>>> Sep 8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
>>>>>>
>>>>>> For the load of the node, we think that the stop movement did not go well.
>>>>>> But cannot nodes execute stonith.
>>>>>
>>>>> A long time ago in a galaxy far away, some messaging layers used to lose quite a few actions, including stops.
>>>>> About the same time, we decided that fencing because a stop action was lost wasn't a good idea.
>>>>>
>>>>> The rationale was that if the operation eventually completed, it would end up in the CIB anyway.
>>>>> And even if it didn't, the PE would continue to try the operation again until the whole node fell over, at which point it would get shot anyway.
>>>>>
>>>>> Now, having said that, things have improved since then and perhaps, in the interest of speeding up recovery in these situations, it is time to stop treating stop operations differently.
>>>>> Would you agree?
Re: [Pacemaker] About behavior in Action Lost.
Hi Andrew,

> Pushed as:
>   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>
> Not sure about applying to 1.0 though, it's a dramatic change in behavior.

The change at this link is not found.
Where did you update it?

Best Regards,
Hideo Yamauchi.

--- Andrew Beekhof and...@beekhof.net wrote:

> Pushed as:
>   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>
> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>
> On Wed, Sep 22, 2010 at 11:18 AM, renayama19661...@ybb.ne.jp wrote:
>> Hi Andrew,
>>
>> Thank you for your comment.
>>
>>> A long time ago in a galaxy far away, some messaging layers used to lose quite a few actions, including stops.
>>> About the same time, we decided that fencing because a stop action was lost wasn't a good idea.
>>>
>>> The rationale was that if the operation eventually completed, it would end up in the CIB anyway.
>>> And even if it didn't, the PE would continue to try the operation again until the whole node fell over, at which point it would get shot anyway.
>>
>> Sorry... I did not know the fact that there was such an argument in old days.
>>
>>> Now, having said that, things have improved since then and perhaps, in the interest of speeding up recovery in these situations, it is time to stop treating stop operations differently.
>>> Would you agree?
>>
>> That means, you change it in the case of Action Lost of the stop this time to carry out stonith?
>> If my recognition is right, I agree too.
>>
>>     if(timer->action->type != action_type_rsc) {
>>         send_update = FALSE;
>>     } else if(safe_str_eq(task, "cancel")) {
>>         /* we dont need to update the CIB with these */
>>         send_update = FALSE;
>>     }
>>     ---> delete else if(safe_str_eq(task, "stop")){..} ?
>>
>>     if(send_update) {
>>         /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
>>         cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
>>     }
>>
>> Best Regards,
>> Hideo Yamauchi.
Re: [Pacemaker] About behavior in Action Lost.
Pushed as:
  http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18

Not sure about applying to 1.0 though, it's a dramatic change in behavior.

On Wed, Sep 22, 2010 at 11:18 AM, renayama19661...@ybb.ne.jp wrote:
> Hi Andrew,
>
> Thank you for your comment.
>
>> A long time ago in a galaxy far away, some messaging layers used to lose quite a few actions, including stops.
>> About the same time, we decided that fencing because a stop action was lost wasn't a good idea.
>>
>> The rationale was that if the operation eventually completed, it would end up in the CIB anyway.
>> And even if it didn't, the PE would continue to try the operation again until the whole node fell over, at which point it would get shot anyway.
>
> Sorry... I did not know the fact that there was such an argument in old days.
>
>> Now, having said that, things have improved since then and perhaps, in the interest of speeding up recovery in these situations, it is time to stop treating stop operations differently.
>> Would you agree?
>
> That means, you change it in the case of Action Lost of the stop this time to carry out stonith?
> If my recognition is right, I agree too.
>
>     if(timer->action->type != action_type_rsc) {
>         send_update = FALSE;
>     } else if(safe_str_eq(task, "cancel")) {
>         /* we dont need to update the CIB with these */
>         send_update = FALSE;
>     }
>     ---> delete else if(safe_str_eq(task, "stop")){..} ?
>
>     if(send_update) {
>         /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
>         cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
>     }
>
> Best Regards,
> Hideo Yamauchi.
>
> --- Andrew Beekhof and...@beekhof.net wrote:
>
>> On Tue, Sep 21, 2010 at 8:59 AM, renayama19661...@ybb.ne.jp wrote:
>>> Hi,
>>>
>>> Node was in state that the load was very high, and we confirmed monitor movement of Pacemaker.
>>> Action Lost occurred in stop movement after the error of the monitor occurred.
>>>
>>> Sep 8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
>>> Sep 8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
>>>
>>> For the load of the node, we think that the stop movement did not go well.
>>> But cannot nodes execute stonith.
>>
>> A long time ago in a galaxy far away, some messaging layers used to lose quite a few actions, including stops.
>> About the same time, we decided that fencing because a stop action was lost wasn't a good idea.
>>
>> The rationale was that if the operation eventually completed, it would end up in the CIB anyway.
>> And even if it didn't, the PE would continue to try the operation again until the whole node fell over, at which point it would get shot anyway.
>>
>> Now, having said that, things have improved since then and perhaps, in the interest of speeding up recovery in these situations, it is time to stop treating stop operations differently.
>> Would you agree?
Re: [Pacemaker] About behavior in Action Lost.
On Tue, Sep 21, 2010 at 8:59 AM, renayama19661...@ybb.ne.jp wrote:
> Hi,
>
> Node was in state that the load was very high, and we confirmed monitor movement of Pacemaker.
> Action Lost occurred in stop movement after the error of the monitor occurred.
>
> Sep 8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
> Sep 8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
>
> For the load of the node, we think that the stop movement did not go well.
> But cannot nodes execute stonith.

A long time ago in a galaxy far away, some messaging layers used to lose quite a few actions, including stops.
About the same time, we decided that fencing because a stop action was lost wasn't a good idea.

The rationale was that if the operation eventually completed, it would end up in the CIB anyway.
And even if it didn't, the PE would continue to try the operation again until the whole node fell over, at which point it would get shot anyway.

Now, having said that, things have improved since then and perhaps, in the interest of speeding up recovery in these situations, it is time to stop treating stop operations differently.
Would you agree?
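
For readers following the thread, the decision under discussion can be sketched as a small predicate: when an action's timer expires ("Action lost"), should a timeout be recorded in the CIB? The block below is only an illustration under stated assumptions, not the Pacemaker source. The enum `action_kind` and the functions `record_timeout_old` and `record_timeout_new` are hypothetical names; only the "cancel" and "stop" task strings and the branch structure are taken from the snippet quoted in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the action types mentioned in the thread;
 * only "resource action" versus "anything else" matters here. */
enum action_kind { KIND_PSEUDO, KIND_RSC, KIND_CRM };

/* Behaviour before the change, as described in the thread:
 * a lost "stop" was never recorded, so it could not lead to fencing. */
bool record_timeout_old(enum action_kind kind, const char *task)
{
    assert(task != NULL);
    if (kind != KIND_RSC) {
        return false;               /* not a resource action */
    }
    if (strcmp(task, "cancel") == 0) {
        return false;               /* cancels are not recorded */
    }
    if (strcmp(task, "stop") == 0) {
        return false;               /* the special case being removed */
    }
    return true;
}

/* Behaviour after the change: "stop" is treated like any other
 * resource action, so a lost stop is recorded as a timeout. */
bool record_timeout_new(enum action_kind kind, const char *task)
{
    assert(task != NULL);
    if (kind != KIND_RSC) {
        return false;
    }
    if (strcmp(task, "cancel") == 0) {
        return false;
    }
    return true;
}
```

The only input on which the two predicates differ is a resource action whose task is "stop": the old one returns false and the new one returns true. Once the timeout is recorded, the stop is seen as a failure rather than silently retried, which is presumably what allows the cluster to fence the node, as asked about in the thread.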