Re: [Pacemaker] About behavior in Action Lost.

2010-10-12 Thread Keisuke MORI
2010/10/7 Andrew Beekhof and...@beekhof.net:
 On Thu, Oct 7, 2010 at 11:48 AM, Keisuke MORI keisuke.mori...@gmail.com 
 wrote:
 Andrew,

 2010/9/23 Andrew Beekhof and...@beekhof.net:
 Pushed as:
   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18

 Not sure about applying to 1.0 though, it's a dramatic change in behavior.

 I would like to backport this to 1.0.
 Would you agree with this?

 I would prefer not to, but if it is important to you then I will agree.


Thank you for your ACK. It's now in 1.0.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/146e405c1afa

-- 
Keisuke MORI

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] About behavior in Action Lost.

2010-10-07 Thread Keisuke MORI
Andrew,

2010/9/23 Andrew Beekhof and...@beekhof.net:
 Pushed as:
   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18

 Not sure about applying to 1.0 though, it's a dramatic change in behavior.

I would like to backport this to 1.0.
Would you agree with this?

Without this change, the failed node was not fenced when it should have
been, and the service could not continue.
I also think it would be good for 1.0 and 1.1 to behave the same way in
such a critical situation, so that both versions are easier to support.

Thanks,
Keisuke MORI


 On Wed, Sep 22, 2010 at 11:18 AM,  renayama19661...@ybb.ne.jp wrote:
 Hi Andrew,

 Thank you for your comment.

 A long time ago in a galaxy far away, some messaging layers used to
 lose quite a few actions, including stops.
 About the same time, we decided that fencing because a stop action was
 lost wasn't a good idea.

 The rationale was that if the operation eventually completed, it would
 end up in the CIB anyway.
 And even if it didn't, the PE would continue to try the operation
 again until the whole node fell over at which point it would get shot
 anyway.

 Sorry, I did not know there had been such a discussion in the past.


 Now, having said that, things have improved since then and perhaps,
 in the interest of speeding up recovery in these situations, it is time
 to stop treating stop operations differently.
 Would you agree?

 Does that mean you will change the behavior so that an Action Lost on a
 stop operation now leads to STONITH?
 If my understanding is right, I agree too.

 if(timer->action->type != action_type_rsc) {
     send_update = FALSE;
 } else if(safe_str_eq(task, "cancel")) {
     /* we dont need to update the CIB with these */
     send_update = FALSE;
 }
 --- delete else if(safe_str_eq(task, "stop")){..} ?

 if(send_update) {
     /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
     cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
 }

 Best Regards,
 Hideo Yamauchi.
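
 For readers following the snippet above, here is a minimal standalone
 sketch of the decision being discussed. It is not the crmd code: the enum
 and the function name should_record_lost_action() are hypothetical
 stand-ins, and the skip_stop flag is an assumption added only so that both
 behaviors (with and without the "stop" branch) can be compared side by side.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the action types used in the quoted snippet;
 * not the real Pacemaker definition. */
enum action_type { action_type_pseudo, action_type_rsc, action_type_crm };

/* Decide whether a lost (timed-out) action should be written to the CIB
 * as a timeout.
 *
 * skip_stop == true  models the behavior described as the old one in this
 *                    thread: a lost "stop" is never recorded.
 * skip_stop == false models the behavior after deleting the "stop" branch:
 *                    a lost "stop" is recorded like any other resource op.
 */
static bool should_record_lost_action(enum action_type type,
                                      const char *task,
                                      bool skip_stop)
{
    if (type != action_type_rsc) {
        /* Only resource actions are candidates for an update. */
        return false;
    }
    if (task != NULL && strcmp(task, "cancel") == 0) {
        /* Cancel operations do not need a CIB update. */
        return false;
    }
    if (skip_stop && task != NULL && strcmp(task, "stop") == 0) {
        /* Old behavior: lost stops are silently skipped. */
        return false;
    }
    return true;
}
```

 With skip_stop set to false, a lost stop on a resource action returns true,
 so it would be reported as a timed-out operation instead of being ignored.
 That is the case this thread proposes to treat the same as other lost
 operations.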



Re: [Pacemaker] About behavior in Action Lost.

2010-10-07 Thread Andrew Beekhof
On Thu, Oct 7, 2010 at 11:48 AM, Keisuke MORI keisuke.mori...@gmail.com wrote:
 Andrew,

 2010/9/23 Andrew Beekhof and...@beekhof.net:
 Pushed as:
   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18

 Not sure about applying to 1.0 though, it's a dramatic change in behavior.

 I would like to backport this to 1.0.
 Would you agree with this?

I would prefer not to, but if it is important to you then I will agree.


 

Re: [Pacemaker] About behavior in Action Lost.

2010-09-29 Thread Andrew Beekhof
Sorry, it probably got rebased before I pushed it.

http://hg.clusterlabs.org/pacemaker/1.1/rev/dd8e37df3e96 should be the
right link

On Wed, Sep 29, 2010 at 2:51 AM,  renayama19661...@ybb.ne.jp wrote:
 Hi Andrew,

 Pushed as:
    http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18

 Not sure about applying to 1.0 though, it's a dramatic change in behavior.

 I could not find the change at this link.
 Where did you push it?

 Best Regards,
 Hideo Yamauchi.


Re: [Pacemaker] About behavior in Action Lost.

2010-09-29 Thread renayama19661014
Hi Andrew,

 Sorry, it probably got rebased before I pushed it.
 
 http://hg.clusterlabs.org/pacemaker/1.1/rev/dd8e37df3e96 should be the
 right link

Thanks!!

Hideo Yamauchi.


Re: [Pacemaker] About behavior in Action Lost.

2010-09-28 Thread renayama19661014
Hi Andrew,

 Pushed as:
http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
 
 Not sure about applying to 1.0 though, it's a dramatic change in behavior.

I could not find the change at this link.
Where did you push it?

Best Regards,
Hideo Yamauchi.



Re: [Pacemaker] About behavior in Action Lost.

2010-09-24 Thread Andrew Beekhof
Pushed as:
   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18

Not sure about applying to 1.0 though, it's a dramatic change in behavior.



Re: [Pacemaker] About behavior in Action Lost.

2010-09-22 Thread Andrew Beekhof
On Tue, Sep 21, 2010 at 8:59 AM,  renayama19661...@ybb.ne.jp wrote:
 Hi,

 The node was under very high load while we were checking Pacemaker's
 monitor behavior.
 After a monitor error occurred, the stop action that followed was lost
 (Action Lost).

 Sep  8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
 Sep  8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost


 We think the stop did not complete because of the load on the node.
 However, the node was not fenced (STONITH was not executed).

A long time ago in a galaxy far away, some messaging layers used to
lose quite a few actions, including stops.
About the same time, we decided that fencing because a stop action was
lost wasn't a good idea.

The rationale was that if the operation eventually completed, it would
end up in the CIB anyway.
And even if it didn't, the PE would continue to try the operation
again until the whole node fell over at which point it would get shot
anyway.

Now, having said that, things have improved since then and perhaps,
in the interest of speeding up recovery in these situations, it is time
to stop treating stop operations differently.
Would you agree?
