Re: [ClusterLabs] both nodes OFFLINE

2017-05-23 Thread 石井 俊直
Hi.

Thanks for the reply. And sorry for the late follow-up: the problem in my report
has been solved. As mentioned, the corosync versions were not the same, and
"syncing" the versions solved the problem. This was just an installation problem:
although we used Ansible to update the rpm packages, one update failed and we
missed that it had happened.
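
For anyone who hits the same thing: a quick way to confirm that every node really
ended up with the same packages after an automated update is to compare the
installed versions directly. A minimal sketch, assuming an RPM-based install and
a hypothetical Ansible inventory group named "cluster":

  # ad-hoc check across all cluster nodes
  ansible cluster -m command -a "rpm -q pacemaker corosync"

  # or run locally on each node
  rpm -q pacemaker corosync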

> On 2017/05/23 7:12, Ken Gaillot wrote:
> 
> On 05/13/2017 01:36 AM, 石井 俊直 wrote:
>> Hi.
>> 
>> We sometimes have a problem in our two-node cluster on CentOS7. Let node-2 and node-3
>> be the names of the nodes. When the problem happens, both nodes are shown as OFFLINE
>> on node-3, while on node-2 only node-3 is shown as OFFLINE.
>> 
>> When that happens, the following log messages are added repeatedly on node-2, and the
>> log file (/var/log/cluster/corosync.log) grows to hundreds of megabytes in a short time.
>> The log messages on node-3 are different.
>> 
>> The erroneous state is temporarily resolved if the OS of node-2 is restarted. On the
>> other hand, restarting the OS of node-3 results in the same state.
>> 
>> I searched the mailing list archive and found a post (Mon Oct 1 01:27:39 CEST 2012)
>> about the "Discarding update with feature set" problem. According to that message, our
>> problem may be solved by removing /var/lib/pacemaker/crm/cib.* on node-2.
>> 
>> What I want to know is whether removing the above files on just one of the nodes
>> is safe. If there is another way to solve the problem, I'd like to hear it.
>> 
>> Thanks.
>> 
>> —— from corosync.log  
>> cib:error: cib_perform_op:   Discarding update with feature set 
>> '3.0.11' greater than our own '3.0.10'
> 
> This implies that the pacemaker versions are different on the two nodes.
> Usually, when the pacemaker version changes, the feature set version
> also changes, meaning the newer version introduces features that won't
> work with older pacemaker versions.
> 
> Running a cluster with mixed pacemaker versions in such a case is
> allowed, but only during a rolling upgrade. Once an older node leaves
> the cluster for any reason, it will not be allowed to rejoin until it is
> upgraded.
> 
> Removing the cib files won't help, since node-2 apparently does not
> support node-3's pacemaker version.
> 
> If that's not the situation you are in, please give more details, as
> this should not be possible otherwise.
> 
>> cib:error: cib_process_request:  Completed cib_replace operation for 
>> section 'all': Protocol not supported (rc=-93, origin=node-3/crmd/12708, 
>> version=0.83.30)
>> crmd:   error: finalize_sync_callback:   Sync from node-3 failed: 
>> Protocol not supported
>> crmd:info: register_fsa_error_adv:   Resetting the current action 
>> list
>> crmd: warning: do_log:   Input I_ELECTION_DC received in state 
>> S_FINALIZE_JOIN from finalize_sync_callback
>> crmd:info: do_state_transition:  State transition S_FINALIZE_JOIN -> 
>> S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL 
>> origin=finalize_sync_callback
>> crmd:info: crm_update_peer_join: initialize_join: Node node-2[1] - 
>> join-6329 phase 2 -> 0
>> crmd:info: crm_update_peer_join: initialize_join: Node node-3[2] - 
>> join-6329 phase 2 -> 0
>> crmd:info: update_dc:Unset DC. Was node-2
>> crmd:info: join_make_offer:  join-6329: Sending offer to node-2
>> crmd:info: crm_update_peer_join: join_make_offer: Node node-2[1] - 
>> join-6329 phase 0 -> 1
>> crmd:info: join_make_offer:  join-6329: Sending offer to node-3
>> crmd:info: crm_update_peer_join: join_make_offer: Node node-3[2] - 
>> join-6329 phase 0 -> 1
>> crmd:info: do_dc_join_offer_all: join-6329: Waiting on 2 outstanding 
>> join acks
>> crmd:info: update_dc:Set DC to node-2 (3.0.10)
>> crmd:info: crm_update_peer_join: do_dc_join_filter_offer: Node node-2[1] 
>> - join-6329 phase 1 -> 2
>> crmd:info: crm_update_peer_join: do_dc_join_filter_offer: Node node-3[2] 
>> - join-6329 phase 1 -> 2
>> crmd:info: do_state_transition:  State transition S_INTEGRATION -> 
>> S_FINALIZE_JOIN | input=I_INTEGRATED cause=C_FSA_INTERNAL 
>> origin=check_join_state
>> crmd:info: crmd_join_phase_log:  join-6329: node-2=integrated
>> crmd:info: crmd_join_phase_log:  join-6329: node-3=integrated
>> crmd:  notice: do_dc_join_finalize:  Syncing the Cluster Information Base 
>> from node-3 to rest of cluster | join-6329
>> crmd:  notice: do_dc_join_finalize:  Requested version   <generation_tuple crm_feature_set="3.0.11" validate-with="pacemaker-2.5" epoch="84" 
>> num_updates="1" admin_epoch="0" cib-last-written="Thu May 11 08:05:45 2017" 
>> update-origin="node-2" update-client="crm_resource" update-user="root" 
>> have-quorum="1"/>
>> cib: info: cib_process_request:  Forwarding cib_sync operation for 
>> section 'all' to node-3 (origin=local/crmd/12710)
>> cib: info: cib_process_replace:  Digest matched on replace from node-3: 
>> 85a19c7927c54ccb15794f2720e07ce1
>> cib: info: cib_process_replace:  Replaced 0.83.30 with 0.84.1 from node-3

Re: [ClusterLabs] both nodes OFFLINE

2017-05-22 Thread Ken Gaillot
On 05/13/2017 01:36 AM, 石井 俊直 wrote:
> Hi.
> 
> We sometimes have a problem in our two-node cluster on CentOS7. Let node-2 and node-3
> be the names of the nodes. When the problem happens, both nodes are shown as OFFLINE
> on node-3, while on node-2 only node-3 is shown as OFFLINE.
> 
> When that happens, the following log messages are added repeatedly on node-2, and the
> log file (/var/log/cluster/corosync.log) grows to hundreds of megabytes in a short time.
> The log messages on node-3 are different.
> 
> The erroneous state is temporarily resolved if the OS of node-2 is restarted. On the
> other hand, restarting the OS of node-3 results in the same state.
> 
> I searched the mailing list archive and found a post (Mon Oct 1 01:27:39 CEST 2012)
> about the "Discarding update with feature set" problem. According to that message, our
> problem may be solved by removing /var/lib/pacemaker/crm/cib.* on node-2.
> 
> What I want to know is whether removing the above files on just one of the nodes
> is safe. If there is another way to solve the problem, I'd like to hear it.
> 
> Thanks.
> 
> —— from corosync.log  
> cib:error: cib_perform_op:Discarding update with feature set 
> '3.0.11' greater than our own '3.0.10'

This implies that the pacemaker versions are different on the two nodes.
Usually, when the pacemaker version changes, the feature set version
also changes, meaning the newer version introduces features that won't
work with older pacemaker versions.

Running a cluster with mixed pacemaker versions in such a case is
allowed, but only during a rolling upgrade. Once an older node leaves
the cluster for any reason, it will not be allowed to rejoin until it is
upgraded.

Removing the cib files won't help, since node-2 apparently does not
support node-3's pacemaker version.

If that's not the situation you are in, please give more details, as
this should not be possible otherwise.
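
A quick way to confirm what each node is actually running (a minimal sketch,
assuming an RPM-based install such as CentOS 7; pacemakerd's --features option
is assumed to be available, as it is in current 1.1.x builds):

  # on each node, compare the installed package versions
  rpm -q pacemaker corosync

  # on each node, print the Pacemaker build and the feature set it supports
  pacemakerd --features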

> cib:error: cib_process_request:   Completed cib_replace operation for 
> section 'all': Protocol not supported (rc=-93, origin=node-3/crmd/12708, 
> version=0.83.30)
> crmd:   error: finalize_sync_callback:Sync from node-3 failed: 
> Protocol not supported
> crmd:info: register_fsa_error_adv:Resetting the current action 
> list
> crmd: warning: do_log:Input I_ELECTION_DC received in state 
> S_FINALIZE_JOIN from finalize_sync_callback
> crmd:info: do_state_transition:   State transition S_FINALIZE_JOIN -> 
> S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL 
> origin=finalize_sync_callback
> crmd:info: crm_update_peer_join:  initialize_join: Node node-2[1] - 
> join-6329 phase 2 -> 0
> crmd:info: crm_update_peer_join:  initialize_join: Node node-3[2] - 
> join-6329 phase 2 -> 0
> crmd:info: update_dc: Unset DC. Was node-2
> crmd:info: join_make_offer:   join-6329: Sending offer to node-2
> crmd:info: crm_update_peer_join:  join_make_offer: Node node-2[1] - 
> join-6329 phase 0 -> 1
> crmd:info: join_make_offer:   join-6329: Sending offer to node-3
> crmd:info: crm_update_peer_join:  join_make_offer: Node node-3[2] - 
> join-6329 phase 0 -> 1
> crmd:info: do_dc_join_offer_all:  join-6329: Waiting on 2 outstanding 
> join acks
> crmd:info: update_dc: Set DC to node-2 (3.0.10)
> crmd:info: crm_update_peer_join:  do_dc_join_filter_offer: Node node-2[1] 
> - join-6329 phase 1 -> 2
> crmd:info: crm_update_peer_join:  do_dc_join_filter_offer: Node node-3[2] 
> - join-6329 phase 1 -> 2
> crmd:info: do_state_transition:   State transition S_INTEGRATION -> 
> S_FINALIZE_JOIN | input=I_INTEGRATED cause=C_FSA_INTERNAL 
> origin=check_join_state
> crmd:info: crmd_join_phase_log:   join-6329: node-2=integrated
> crmd:info: crmd_join_phase_log:   join-6329: node-3=integrated
> crmd:  notice: do_dc_join_finalize:   Syncing the Cluster Information Base 
> from node-3 to rest of cluster | join-6329
> crmd:  notice: do_dc_join_finalize:   Requested version   <generation_tuple crm_feature_set="3.0.11" validate-with="pacemaker-2.5" epoch="84" 
> num_updates="1" admin_epoch="0" cib-last-written="Thu May 11 08:05:45 2017" 
> update-origin="node-2" update-client="crm_resource" update-user="root" 
> have-quorum="1"/>
> cib: info: cib_process_request:   Forwarding cib_sync operation for 
> section 'all' to node-3 (origin=local/crmd/12710)
> cib: info: cib_process_replace:   Digest matched on replace from node-3: 
> 85a19c7927c54ccb15794f2720e07ce1
> cib: info: cib_process_replace:   Replaced 0.83.30 with 0.84.1 from node-3
> cib: info: __xml_diff_object: Moved node_state@crmd (3 -> 2)

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] both nodes OFFLINE

2017-05-13 Thread 石井 俊直
Hi.

We sometimes have a problem in our two-node cluster on CentOS7. Let node-2 and node-3
be the names of the nodes. When the problem happens, both nodes are shown as OFFLINE
on node-3, while on node-2 only node-3 is shown as OFFLINE.

When that happens, the following log messages are added repeatedly on node-2, and the
log file (/var/log/cluster/corosync.log) grows to hundreds of megabytes in a short time.
The log messages on node-3 are different.

The erroneous state is temporarily resolved if the OS of node-2 is restarted. On the
other hand, restarting the OS of node-3 results in the same state.

I searched the mailing list archive and found a post (Mon Oct 1 01:27:39 CEST 2012)
about the "Discarding update with feature set" problem. According to that message, our
problem may be solved by removing /var/lib/pacemaker/crm/cib.* on node-2.

What I want to know is whether removing the above files on just one of the nodes
is safe. If there is another way to solve the problem, I'd like to hear it.

Thanks.

—— from corosync.log  
cib:error: cib_perform_op:  Discarding update with feature set '3.0.11' 
greater than our own '3.0.10'
cib:error: cib_process_request: Completed cib_replace operation for 
section 'all': Protocol not supported (rc=-93, origin=node-3/crmd/12708, 
version=0.83.30)
crmd:   error: finalize_sync_callback:  Sync from node-3 failed: Protocol not 
supported
crmd:info: register_fsa_error_adv:  Resetting the current action list
crmd: warning: do_log:  Input I_ELECTION_DC received in state S_FINALIZE_JOIN 
from finalize_sync_callback
crmd:info: do_state_transition: State transition S_FINALIZE_JOIN -> 
S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL 
origin=finalize_sync_callback
crmd:info: crm_update_peer_join:initialize_join: Node node-2[1] - 
join-6329 phase 2 -> 0
crmd:info: crm_update_peer_join:initialize_join: Node node-3[2] - 
join-6329 phase 2 -> 0
crmd:info: update_dc:   Unset DC. Was node-2
crmd:info: join_make_offer: join-6329: Sending offer to node-2
crmd:info: crm_update_peer_join:join_make_offer: Node node-2[1] - 
join-6329 phase 0 -> 1
crmd:info: join_make_offer: join-6329: Sending offer to node-3
crmd:info: crm_update_peer_join:join_make_offer: Node node-3[2] - 
join-6329 phase 0 -> 1
crmd:info: do_dc_join_offer_all:join-6329: Waiting on 2 outstanding 
join acks
crmd:info: update_dc:   Set DC to node-2 (3.0.10)
crmd:info: crm_update_peer_join:do_dc_join_filter_offer: Node node-2[1] 
- join-6329 phase 1 -> 2
crmd:info: crm_update_peer_join:do_dc_join_filter_offer: Node node-3[2] 
- join-6329 phase 1 -> 2
crmd:info: do_state_transition: State transition S_INTEGRATION -> 
S_FINALIZE_JOIN | input=I_INTEGRATED cause=C_FSA_INTERNAL 
origin=check_join_state
crmd:info: crmd_join_phase_log: join-6329: node-2=integrated
crmd:info: crmd_join_phase_log: join-6329: node-3=integrated
crmd:  notice: do_dc_join_finalize: Syncing the Cluster Information Base 
from node-3 to rest of cluster | join-6329
crmd:  notice: do_dc_join_finalize: Requested version   <generation_tuple crm_feature_set="3.0.11" validate-with="pacemaker-2.5" epoch="84" num_updates="1" admin_epoch="0" cib-last-written="Thu May 11 08:05:45 2017" update-origin="node-2" update-client="crm_resource" update-user="root" have-quorum="1"/>
cib: info: cib_process_request: Forwarding cib_sync operation for 
section 'all' to node-3 (origin=local/crmd/12710)
cib: info: cib_process_replace: Digest matched on replace from node-3: 
85a19c7927c54ccb15794f2720e07ce1
cib: info: cib_process_replace: Replaced 0.83.30 with 0.84.1 from node-3
cib: info: __xml_diff_object:   Moved node_state@crmd (3 -> 2)