First of all, thanks for your answer!

On 3/7/18 11:16 AM, Jan Friesse wrote:
> Thomas,
> 
> 
>> Hi,
>>
>> first some background info for my questions I'm going to ask:
>> We use corosync as a basis for our distributed realtime configuration
>> file system (pmxcfs)[1].
> 
> nice
> 
>>
>> We got some reports of a completely hanging FS, with the only
>> correlations being high load, often IO, and most of the time a message
>> that corosync did not get scheduled for longer than the token timeout.
>>
>> See this example from a three node cluster, first:
>>
>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [MAIN  ] Corosync 
>>> main process was not scheduled for 3767.3159 ms (threshold is 1320.0000 
>>> ms). Consider token timeout increase.
>>
>> then we get a few log entries saying that JOIN or LEAVE messages were
>> thrown away (understandable for this event):
>>
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [TOTEM ] JOIN or 
>> LEAVE message was thrown away during flush operation.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Corosync main 
>> process was not scheduled for 3767.3159 ms (threshold is 1320.0000 ms). 
>> Consider token timeout increase.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
>> message was thrown away during flush operation.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
>> message was thrown away during flush operation.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
>> message was thrown away during flush operation.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
>> message was thrown away during flush operation.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
>> message was thrown away during flush operation.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] A new 
>> membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] Failed to 
>> receive the leave message. failed: 2 3
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] A new membership 
>> (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] Failed to receive 
>> the leave message. failed: 2 3
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [QUORUM] 
>> Members[3]: 1 2 3
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [MAIN  ] Completed 
>> service synchronization, ready to provide service.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [QUORUM] Members[3]: 1 2 3
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Completed service 
>> synchronization, ready to provide service.
>>
>> Until recently we were really stepping in the dark, with everything from
>> kernel bugs to our filesystem logic in mind as a possible cause...  But
>> then we had the luck to trigger this on our test systems and went to
>> town with gdb on the core dump, finding that we can reproduce this by
>> pausing the leader (from our FS POV) for a short moment (possibly shorter
>> than the token timeout), so that a new leader gets elected, and then
>> resuming our leader node VM again.
>>
>> The problem I saw was that while the leader had a log entry which
>> proved that it noticed its blackout:
>>> [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 
>>> 2 3 left: 2 3
> 
> I know it looks weird but it's perfectly fine and expected.
> 

It seemed OK from this node's POV; just the missing config change CB
was a bit odd to us.

>>
>> our FS cpg_confchg_fn callback[2] was never called, thus it thought it
> 
> That shouldn't happen
> 

So we really should get a config change CB on the paused node after
unpausing, with all other (online) nodes in both the leave and the join
member lists?
Just asking again to confirm my thinking and that I did not misunderstand
you. :)
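
If so, something like the following in the confchg handler should be
enough to notice the blackout locally. This is just a rough sketch with
made-up names (not the actual pmxcfs code), relying on the pattern that
after a pause every other node shows up in both the left and the joined
list of one and the same event:

  #include <stdbool.h>
  #include <stddef.h>
  #include <corosync/cpg.h>

  static void my_confchg_fn(cpg_handle_t handle,
      const struct cpg_name *group_name,
      const struct cpg_address *member_list, size_t member_list_entries,
      const struct cpg_address *left_list, size_t left_list_entries,
      const struct cpg_address *joined_list, size_t joined_list_entries)
  {
      /* after a local blackout, every node in the left list should
       * reappear in the joined list of the very same event */
      bool blackout = left_list_entries > 0 &&
          left_list_entries == joined_list_entries;

      for (size_t i = 0; blackout && i < left_list_entries; i++) {
          bool rejoined = false;
          for (size_t j = 0; j < joined_list_entries; j++) {
              if (joined_list[j].nodeid == left_list[i].nodeid) {
                  rejoined = true;
                  break;
              }
          }
          blackout = rejoined;
      }

      if (blackout) {
          /* we were most likely the paused node: trigger a full
           * state resync here instead of assuming we are in sync */
      }
  }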

>> was still in sync and nothing ever happened, until another member
>> triggered this callback, by either leaving or (re-)joining.
>>
>> Looking in the cpg.c code I saw that there's another callback, namely
>> cpg_totem_confchg_fn, which seemed a bit odd as we did not set that
> 
> This callback is not necessary to have as long as the information about the 
> cpg group is enough. cpg_totem_confchg_fn contains information about all 
> processors (nodes).
> 

OK, makes sense.
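
For reference, the prototype from corosync/cpg.h indeed only carries the
ring id and the flat list of processor nodeids, with no per-group
information (the handler name below is just a placeholder):

  #include <corosync/cpg.h>

  static void my_totem_confchg_fn(cpg_handle_t handle,
      struct cpg_ring_id ring_id,
      uint32_t member_list_entries,
      const uint32_t *member_list)
  {
      /* member_list holds the nodeids of every processor in the new
       * ring; ring_id identifies the totem membership round */
  }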

>> one... (I'm not the original author of the FS and it dates back at least
>> to 2010, so maybe cpg_initialize was not yet deprecated there, and
>> thus cpg_model_initialize was not used then)
> 
>>
>> I switched over to using cpg_model_initialize and set the totem_confchg
>> callback, but for the "blacked out node" it gets called twice after the
>> event, and always shows all members...
>>
>> So to finally get to my questions:
>>
>> * why doesn't the cpg_confchg_fn CB get called when a node has a short
>>   blackout (i.e., corosync not being scheduled for a bit of time),
>>   with all other nodes in its leave and join lists, as the log
>>   would suggest ("Members joined: 2 3 left: 2 3")?
> 
> I believe it was called but not when corosync was paused.
> 

But I never see my registered callback called after unpausing, and
also when printing out the dispatch_data->id inside cpg_dispatch
I never get one with "MESSAGE_RES_CPG_CONFCHG_CALLBACK" after
unpausing, only a "MESSAGE_RES_CPG_TOTEM_CONFCHG_CALLBACK".
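
FWIW, the registration now looks roughly like this (simplified, using
the placeholder handler names from the sketches above instead of the
actual pmxcfs identifiers):

  #include <corosync/cpg.h>

  /* message delivery handler, stubbed out here for brevity */
  static void my_deliver_fn(cpg_handle_t handle,
      const struct cpg_name *group_name,
      uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
  {
  }

  static cpg_model_v1_data_t cpg_model_data = {
      .model                = CPG_MODEL_V1,
      .cpg_deliver_fn       = my_deliver_fn,        /* message delivery */
      .cpg_confchg_fn       = my_confchg_fn,        /* group membership changes */
      .cpg_totem_confchg_fn = my_totem_confchg_fn,  /* totem ring changes */
  };

  static cpg_handle_t cpg_handle;

  static int setup_cpg(void)
  {
      cs_error_t res = cpg_model_initialize(&cpg_handle, CPG_MODEL_V1,
          (cpg_model_data_t *) &cpg_model_data, NULL);
      return (res == CS_OK) ? 0 : -1;
  }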

>>
>> * If that doesn't seem like a good idea, what can we use to reliably
>>   detect such a node blackout?
> 
> It's not possible to detect this from the affected node; it must be detected 
> from the other nodes.
> 

You mean not directly, as in: we cannot differentiate between a legitimate
fall-out and such a pause, whether caused artificially or by real high IO.
That makes sense.

>>
>> As a workaround I added logic for the case where, through a config change,
>> a node with a lower ID joined. The node which was leader until then triggers
>> a CPG leave, enforcing a cluster-wide config change event, which this time
>> the blacked-out node also gets, so it syncs up again.
>> This works, but it does not feel really nice, IMO...
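
In case a concrete sketch helps, that workaround looks roughly like
this (the names are illustrative, not the actual dfsm.c identifiers,
and a real implementation would probably defer the cpg_leave to the
main loop instead of calling it from inside the dispatch callback):

  static uint32_t local_nodeid;  /* filled via cpg_local_get() at startup */
  static bool we_are_leader;     /* maintained by the FS election logic */

  static void check_forced_confchg(cpg_handle_t handle,
      const struct cpg_name *group_name,
      const struct cpg_address *joined_list, size_t joined_list_entries)
  {
      if (!we_are_leader)
          return;

      for (size_t i = 0; i < joined_list_entries; i++) {
          if (joined_list[i].nodeid < local_nodeid) {
              /* a lower-ID node joined: leave the group to force a
               * cluster-wide config change which even a previously
               * blacked-out node will see; we rejoin right after */
              cpg_leave(handle, group_name);
              break;
          }
      }
  }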
> 
> Ok, let me explain what exactly happened in your test, simplified to two 
> nodes:
> 
> - Nodes A and B are running corosync and the same cpg application
> - Node A is paused for a time > token_timeout
> - Node B detects that node A is not responding
>   - Creates a new membership (with only node B)
>   - Sends a notification to the cpg app about node A's leave
> - Node A is still paused, so it cannot send a notification to its cpg app
> - Node A is unpaused
>   - The pause is detected and it starts forming a new membership
>   - From node A's point of view, node B left - simply because a node never 
> considers itself as having left
>   - Node A is able to contact node B, so the final membership is A, B.
>   - Node A can finally deliver two cpg conf changes to the cpg application: 
> one about the leave of node B and a second one about the join of node B.
> 

Exactly that last point is what's missing here, and I'd also expect it
to happen... I only get the config change on Node B for the Node A
join, while Node A only shows the log entries quoted above... :/

> Now it's really the cpg application's problem to synchronize its data. Many 
> applications (usually FSs) use quorum together with fencing to find out 
> which cluster partition is quorate and to clean the inquorate one.
> 
> Hopefully my explanation helps you, and feel free to ask more questions!
> 

It helps, but I'm still a bit unsure about why the CB could not happen here;
I may need to dive a bit deeper into corosync :)

> Regards,
>   Honza
> 
>>
>> Any help would be appreciated, many thanks!
>>
>> cheers,
>> Thomas
>>
>> [1]: 
>> https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD
>> [2]: 
>> https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096
>>


