Re: [Openais] cpg behavior on transitional membership change

2011-09-03 Thread Jiaju Zhang
On Fri, Sep 02, 2011 at 10:12:11PM +0300, Vladislav Bogdanov wrote:
 02.09.2011 20:55, David Teigland wrote:
 [snip]
  
  I really can't make any sense of the report, sorry.  Maybe reproduce it
  without pacemaker, and then describe the specific steps to create the
  issue and resulting symptoms.  After that we can determine what logs, if
  any, would be useful.
  
 
 I just tried to ask a question about the cluster components' logic based
 on information I discovered from both logs and code analysis. I'm sorry if
 I was unclear; probably some language barrier still exists.
 
 Please see my previous mail, where I tried to add some explanation of why
 I think the current logic is not complete.

Hi Vladislav, I think I understand the problem you described ;)
I'd like to give an example to make things clearer.

In a 3-node cluster, for whatever reason, especially under heavy workload,
corosync may detect that one node disappears and then reappears. So the
membership information changes are as follows:
membership 1: nodeA, nodeB, nodeC
membership 2: nodeB, nodeC
membership 3: nodeA, nodeB, nodeC

From membership change 1 -> 2, dlm_controld knows nodeA is down and
has many things to do, like check_fs_done, check_fencing_done ...
The key point here is that dlm needs to wait until the fencing is really
done before it proceeds. If we employ a cluster filesystem here, like
ocfs2, it also needs the fencing to be really done. I believe that in
the normal case, pacemaker will fence nodeA and then everything should be OK.

However, there is a possibility here that pacemaker won't fence nodeA.
Say nodeA is the original DC of the cluster; when nodeA goes down, the
cluster should elect a new DC. But if the time window between membership
changes 2 -> 3 is too small, nodeA is up again and joins the election
too; then nodeA is elected DC again and it won't fence itself.
Andrew, correct me if my understanding of pacemaker is wrong ;)

So I think a membership change should be like a transaction in the
database or filesystem sense: for membership change 1 -> 2, everything
should be done (e.g. fencing nodeA), no matter whether the following
change 2 -> 3 happens or not. Between the situation where a node
magically disappears and reappears, and the situation where a node goes
down normally and then comes back up, ocfs2 and dlm should not be able
to see any difference; all they can do is wait for the fencing to be done.
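
To make the idea concrete, here is a rough sketch (not real dlm_controld
code) of what I mean by transaction-like handling of configuration changes,
built on the corosync CPG callback; the helpers fencing_completed(),
request_fencing() and apply_membership() are hypothetical names for the
behaviour, not an existing API:

#include <stddef.h>
#include <stdint.h>
#include <corosync/cpg.h>

/* Hypothetical helpers -- these do not exist as such in dlm_controld;
 * they only name the behaviour being argued for. */
extern int  fencing_completed(uint32_t nodeid);
extern void request_fencing(uint32_t nodeid);
extern void apply_membership(const struct cpg_address *members, size_t count);

/* "Transactional" handling of one configuration change: every node that
 * left with NODEDOWN/PROCDOWN must be fenced before this change is
 * considered applied, even if a later change shows the node back again. */
static void confchg_cb(cpg_handle_t handle,
                       const struct cpg_name *group_name,
                       const struct cpg_address *member_list,
                       size_t member_list_entries,
                       const struct cpg_address *left_list,
                       size_t left_list_entries,
                       const struct cpg_address *joined_list,
                       size_t joined_list_entries)
{
        for (size_t i = 0; i < left_list_entries; i++) {
                if (left_list[i].reason != CPG_REASON_NODEDOWN &&
                    left_list[i].reason != CPG_REASON_PROCDOWN)
                        continue;
                if (!fencing_completed(left_list[i].nodeid))
                        request_fencing(left_list[i].nodeid);
                /* in real code: return to the poll loop and re-check later
                 * instead of blocking inside the callback */
                while (!fencing_completed(left_list[i].nodeid))
                        ;
        }
        /* only now does the new membership become visible to the
         * lockspace / filesystem layer */
        apply_membership(member_list, member_list_entries);
}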

Any comments? thoughts?

Thanks,
Jiaju
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] cpg behavior on transitional membership change

2011-09-03 Thread Vladislav Bogdanov
Hi Jiaju,

03.09.2011 19:52, Jiaju Zhang wrote:
 On Fri, Sep 02, 2011 at 10:12:11PM +0300, Vladislav Bogdanov wrote:
 02.09.2011 20:55, David Teigland wrote:
 [snip]

 I really can't make any sense of the report, sorry.  Maybe reproduce it
 without pacemaker, and then describe the specific steps to create the
 issue and resulting symptoms.  After that we can determine what logs, if
 any, would be useful.


 I just tried to ask a question about cluster components logic based on
 information I discovered from both logs and code analysis. I'm sorry if
 I was unclear in that, probably some language barrier still exists.

 Please see my previous mail, I tried to add some explanations why I
 think current logic is not complete.
 
 Hi Vladislav, I think I understand the problem you described ;)
 I'd like to give an example to make things clearer.
 
 In a 3-node cluster, for whatever reason, especially under heavy workload,

(my case)

 corosync may detect that one node disappears and then reappears. So the

BTW, could this be prevented (at least for the majority of cases) by some
corosync timeout params?
Steve?
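
(What I have in mind is something like the totem timeouts in corosync.conf
below; the values are only illustrative, and I'm not sure whether raising
them would really cover this case:)

totem {
        version: 2
        # how long to wait for a token before declaring a node lost
        token: 10000
        token_retransmits_before_loss_const: 20
        # how long to wait for consensus before forming a new membership
        consensus: 12000
}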

 membership information changes are as follows:
 membership 1: nodeA, nodeB, nodeC
 membership 2: nodeB, nodeC
 membership 3: nodeA, nodeB, nodeC

Exactly.

 
 From membership change 1 -> 2, dlm_controld knows nodeA is down and
 has many things to do, like check_fs_done, check_fencing_done ...
 The key point here is that dlm needs to wait until the fencing is really
 done before it proceeds. If we employ a cluster filesystem here, like
 ocfs2, it also needs the fencing to be really done. I believe that in
 the normal case, pacemaker will fence nodeA and then everything should be OK.
 
 However, there is a possibility here that pacemaker won't fence nodeA.
 Say nodeA is the original DC of the cluster; when nodeA goes down, the
 cluster should elect a new DC. But if the time window between membership
 changes 2 -> 3 is too small, nodeA is up again and joins the election
 too; then nodeA is elected DC again and it won't fence itself.

Ahm, I think you are exactly right about what happens.
This matches my case I think, because the node which left the
cluster for a short moment is usually:
1. Started earlier on a cold boot (it is a VM, the others are bare-metal
nodes which boot diskless via PXE, and there is a 1-minute timeout before
those bare-metal systems get their boot image, just because of Cisco's
implementation of ether-channel).
2. Upgraded (and rebooted if needed) first, before the bare-metal nodes are
booted with the new image.

So that node is usually the DC.

 Andrew, correct me if my understanding of pacemaker is wrong ;)
 
 So I think a membership change should be like a transaction in the
 database or filesystem sense: for membership change 1 -> 2, everything
 should be done (e.g. fencing nodeA), no matter whether the following
 change 2 -> 3 happens or not. Between the situation where a node
 magically disappears and reappears, and the situation where a node goes
 down normally and then comes back up, ocfs2 and dlm should not be able
 to see any difference; all they can do is wait for the fencing to be done.
 
 Any comments? thoughts?

I'd have protection at as many layers as possible. So, if you are right
about pacemaker, then it would be great if it could be fixed in pacemaker.
But I'd prefer to be safe and have DLM et al. schedule fencing as well
if they notice an unrecoverable problem, one which leads to a cluster
subsystem freeze if fencing is not done for whatever reason. I suppose the
current behavior is just an artifact from cluster2, where groupd was on
duty for that event (I may be wrong, because I didn't look at the groupd
code closely, but comments in the other daemons' code make me think so).

Thank you for your comments,
Vladislav
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


[Openais] cpg behavior on transitional membership change

2011-09-02 Thread Vladislav Bogdanov
Hi all,

I'm trying to further investigate a problem I described at
https://www.redhat.com/archives/cluster-devel/2011-August/msg00133.html

The main problem for me there is that pacemaker first sees a transitional
membership with left nodes, then it sees a stable membership with those
nodes returned, and does nothing about that. On the other hand,
dlm_controld sees CPG_REASON_NODEDOWN events on the CPGs related to all its
lockspaces (at the same time as the transitional membership change) and
stops the kernel part of each lockspace until the whole cluster is rebooted
(or until some other recovery procedure, which unfortunately does not happen
:( ). It neither requests fencing of the left node nor recovers when the
node returns in the next stable membership.

Could anyone please help me understand what the correct CPG
behavior on a membership change is?
From what I see, CPG emits a CPG_REASON_NODEDOWN event on both the
transitional and the stable membership if there is a node which left the
cluster. Am I correct here? And if I am, is that the right thing?

If yes, is there a way to detect the membership change type (transitional
or stable) through the CPG API?
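
(For reference, all the application gets is the confchg callback declared in
corosync/cpg.h, shown below; I don't see any field there that says whether
the configuration is transitional or regular:)

typedef void (*cpg_confchg_fn_t) (
        cpg_handle_t handle,
        const struct cpg_name *group_name,
        const struct cpg_address *member_list, size_t member_list_entries,
        const struct cpg_address *left_list, size_t left_list_entries,
        const struct cpg_address *joined_list, size_t joined_list_entries);

/* struct cpg_address only carries nodeid, pid and reason
 * (CPG_REASON_JOIN / LEAVE / NODEDOWN / NODEUP / PROCDOWN). */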

Hoping for an answer,

Best regards,
Vladislav
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] cpg behavior on transitional membership change

2011-09-02 Thread Steven Dake
On 09/02/2011 12:59 AM, Vladislav Bogdanov wrote:
 Hi all,
 
 I'm trying to further investigate problem I described at
 https://www.redhat.com/archives/cluster-devel/2011-August/msg00133.html
 
 The main problem for me there is that pacemaker first sees transitional
 membership with left nodes, then it sees stable membership with that
 nodes returned back, and does nothing about that. On the other hand,
 dlm_controld sees CPG_REASON_NODEDOWN events on CPGs related to all its
 lockspaces (at the same time with transitional membership change) and
 stops kernel part of each lockspace until whole cluster is rebooted (or
 until some other recovery procedure which unfortunately does not happen

I believe fenced should reboot the node, but only if there is quorum.
It is possible your cluster has lost quorum during this series of
events.  I have copied Dave for his feedback on this point.

 :( ). It neither requests to fence left node nor recovers when node is
 returned on next stable membership.
 
 Could anyone please help me to understand, what is a correct CPG
 behavior on membership change?
 From what I see, CPG emits CPG_REASON_NODEDOWN event on both
 transitional and stable membership if there is node which left the
 cluster. Am I correct here? And is that a right thing if I am?
 

Line #'s where this happens?

 If yes, is there a way to detect the membership change type (transitional
 or stable) through the CPG API?
 

A transitional membership will always contain a subset of the previous
regular membership.  This means it will always contain 0 or more left
members.  A transitional membership means "the membership of nodes
transitioning from the previous regular membership to the new regular
membership."

A regular configuration is where members are added to the configuration
when detected.  A transitional membership never has nodes added to it.
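
So the best an application could do on its own is a heuristic against the
membership it remembered itself, roughly the sketch below; this only checks
consistency with the subset property and is not a reliable indicator, since
a regular configuration that adds nobody looks exactly the same:

#include <stddef.h>
#include <stdint.h>
#include <corosync/cpg.h>

/* Heuristic sketch, not an API: returns 1 if this configuration change
 * *could* be a transitional one, i.e. nobody joined and the members are
 * a subset of the previously remembered regular membership. */
static int maybe_transitional(const struct cpg_address *member_list,
                              size_t member_list_entries,
                              size_t joined_list_entries,
                              const uint32_t *prev_nodeids, size_t prev_count)
{
        if (joined_list_entries > 0)
                return 0;       /* a transitional config never adds nodes */

        for (size_t i = 0; i < member_list_entries; i++) {
                int seen = 0;
                for (size_t j = 0; j < prev_count; j++)
                        if (member_list[i].nodeid == prev_nodeids[j])
                                seen = 1;
                if (!seen)
                        return 0;       /* new node => regular configuration */
        }
        return 1;
}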

 Hoping for answer,
 

It would be nice if cpg and totem had a direct relationship in how their
transitional and regular configurations were generated, but this doesn't
happen currently.  I am not sure if there is a good reason for this.

Regards
-steve

 Best regards,
 Vladislav
 ___
 Openais mailing list
 Openais@lists.linux-foundation.org
 https://lists.linux-foundation.org/mailman/listinfo/openais

___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] cpg behavior on transitional membership change

2011-09-02 Thread David Teigland
On Fri, Sep 02, 2011 at 10:30:53AM -0700, Steven Dake wrote:
 On 09/02/2011 12:59 AM, Vladislav Bogdanov wrote:
  Hi all,
  
  I'm trying to further investigate problem I described at
  https://www.redhat.com/archives/cluster-devel/2011-August/msg00133.html
  
  The main problem for me there is that pacemaker first sees transitional
  membership with left nodes, then it sees stable membership with that
  nodes returned back, and does nothing about that. On the other hand,
  dlm_controld sees CPG_REASON_NODEDOWN events on CPGs related to all its
  lockspaces (at the same time with transitional membership change) and
  stops kernel part of each lockspace until whole cluster is rebooted (or
  until some other recovery procedure which unfortunately does not happen
 
 I believe fenced should reboot the node, but only if there is quorum.
 It is possible your cluster has lost quorum during this series of
 events.  I have copied Dave for his feedback on this point.

I really can't make any sense of the report, sorry.  Maybe reproduce it
without pacemaker, and then describe the specific steps to create the
issue and resulting symptoms.  After that we can determine what logs, if
any, would be useful.

___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] cpg behavior on transitional membership change

2011-09-02 Thread Vladislav Bogdanov
Hi Steve,

02.09.2011 20:30, Steven Dake wrote:
 On 09/02/2011 12:59 AM, Vladislav Bogdanov wrote:
...
 I'm trying to further investigate problem I described at
 https://www.redhat.com/archives/cluster-devel/2011-August/msg00133.html

 The main problem for me there is that pacemaker first sees transitional
 membership with left nodes, then it sees stable membership with that
 nodes returned back, and does nothing about that. On the other hand,
 dlm_controld sees CPG_REASON_NODEDOWN events on CPGs related to all its
 lockspaces (at the same time with transitional membership change) and
 stops kernel part of each lockspace until whole cluster is rebooted (or
 until some other recovery procedure which unfortunately does not happen
 
 I believe fenced should reboot the node, but only if there is quorum.
 It is possible your cluster has lost quorum during this series of
 events.  I have copied Dave for his feedback on this point.

Aha, I think so too. But fenced doesn't do that, and neither do the other
daemons from cluster3; this part of the code is identical among them, which
is why I think this does not depend on whether the cman or pacemaker stack
is used:
fence/fenced/cpg.c, around line 1440 (as of 3.1.1):
        if (left_list[i].reason == CPG_REASON_NODEDOWN ||
            left_list[i].reason == CPG_REASON_PROCDOWN) {
                memb->failed = 1;
                cg->failed_count++;
        }
        ...
        if (left_list[i].reason == CPG_REASON_PROCDOWN)
                kick_node_from_cluster(memb->nodeid);

probably the last lines should be:

        if (left_list[i].reason == CPG_REASON_NODEDOWN ||
            left_list[i].reason == CPG_REASON_PROCDOWN)
                kick_node_from_cluster(memb->nodeid);

at least in one of the daemons (fenced is a good candidate, but I prefer
dlm_controld)?

About quorum: the 3-node cluster was split into two partitions, one with the
2 bare-metal nodes and one with the VM node.
When I found that, the two bare-metal nodes were in the 'kern_stop' state,
having passed through the 'kern_stop,fencing' state I suppose. The VM did
not have quorum, so it was left in the 'kern_stop,fencing' state.

dlm dump says:
1313579105 clvmd add_change cg 4 remove nodeid 1543767306 reason 3
That means a CPG_REASON_NODEDOWN event.

Then:
1313579105 Node 1543767306/mgmt01 has not been shot yet
1313579105 clvmd check_fencing 1543767306 wait add 1313562825 fail
1313579105 last 0
1313579107 Node 1543767306/mgmt01 was last shot 'now'

This is not true; there is no line about fencing actually being scheduled
(and it is clear from the code why). This could be a deficiency of the .pcmk
dlm_controld variant, but I don't think that is important here.

1313579107 clvmd check_fencing 1543767306 done add 1313562825 fail
1313579105 last 1313579107
1313579107 clvmd check_fencing done
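
As far as I can reconstruct from these lines, the check boils down to the
comparison sketched below (names are mine, not the actual dlm_controld code):
fencing counts as done as soon as the reported last-fenced time is newer
than the failure time, which is why a bogus "was last shot 'now'" is enough
for it to pass.

#include <stdint.h>

/* Reconstruction from the log lines above, not the real code:
 * fail_time   - when the NODEDOWN for this member was recorded
 * last_fenced - the last time the fencing layer claims the node was shot */
static int fencing_done_for_member(uint64_t fail_time, uint64_t last_fenced)
{
        if (last_fenced > fail_time)
                return 1;       /* "check_fencing ... done" */
        return 0;               /* keep waiting: "has not been shot yet" */
}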


 
 :( ). It neither requests to fence left node nor recovers when node is
 returned on next stable membership.

 Could anyone please help me to understand, what is a correct CPG
 behavior on membership change?
 From what I see, CPG emits CPG_REASON_NODEDOWN event on both
 transitional and stable membership if there is node which left the
 cluster. Am I correct here? And is that a right thing if I am?

Ah, I may have mixed something up, it was quite a while ago. Actually, yes,
that was the transitional one. There was only one such event.


 
 Line #'s where this happens?

I just saw that in the pacemaker plugin logs and in the dlm_tool dump logs.
Their timestamps are identical.

 
 If yes, is there a way to detect the membership change type (transitional
 or stable) through the CPG API?

 
 A transitional membership will always contain a subset of the previous
 regular membership.  This means it will always contains 0 or more left
 members.  A transitional membership means The membership of nodes
 transitioning from previous regular membership to new regular mebmership.

 
 A regular configuration is where members are added to the configuration
 when detected.  A transitional membership never has nodes added to it.

Thank you very much for the clarification.
Shouldn't pacemaker then schedule fencing itself (from the partition
with quorum) if there are left nodes? BTW, there was actually only a
second or two between the transitional and the regular membership. I
probably need to ask Andrew for the details of the pacemaker logic.

Unfortunately I lost those logs and can hardly reproduce this :(
It was a VM which left the cluster, and it probably just suffered from
insufficient host CPU time.

And... just wondering, what could be a reason to recalculate the membership
if there are 0 left or added members?


 
 Hoping for answer,

 
 It would be nice if cpg and totem had a direct relationship in how their
 transitional and regular configurations were generated, but this doesn't
 happen currently.  I am not sure if there is a good reason for this.

Pacemaker uses totem? At least it doesn't use cpg. Maybe that is the
reason it does not fence from within?

Thank you very much,
Vladislav

Re: [Openais] cpg behavior on transitional membership change

2011-09-02 Thread Vladislav Bogdanov
02.09.2011 20:55, David Teigland wrote:
[snip]
 
 I really can't make any sense of the report, sorry.  Maybe reproduce it
 without pacemaker, and then describe the specific steps to create the
 issue and resulting symptoms.  After that we can determine what logs, if
 any, would be useful.
 

I just tried to ask a question about the cluster components' logic based on
information I discovered from both logs and code analysis. I'm sorry if
I was unclear; probably some language barrier still exists.

Please see my previous mail, where I tried to add some explanation of why
I think the current logic is not complete.

Thank you,
Vladislav
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais