On Fri, 2017-11-24 at 18:00 +0800, Hui Xiang wrote:
> Jan,
>
> Thanks very much for your help. I am getting further now, but it
> still looks very strange.
>
> 1. To use "debug-promote", I upgraded Pacemaker from 1.1.12 to
> 1.1.16, and pcs to 0.9.160.
>
> 2. Recreated the resources with the commands below:
> pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
>   master_ip=192.168.0.99 \
>   op monitor interval="10s" \
>   op monitor interval="11s" role=Master
> pcs resource master ovndb_servers-master ovndb_servers \
>   meta notify="true" master-max="1" master-node-max="1" \
>   clone-max="3" clone-node-max="1"
> pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=192.168.0.99 \
>   op monitor interval=10s
> pcs constraint colocation add VirtualIP with master ovndb_servers-master \
>   score=INFINITY
>
> 3. pcs status
>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>      Stopped: [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]
>  VirtualIP    (ocf::heartbeat:IPaddr2):    Stopped
>
> 4. Manually ran 'debug-start' on all 3 nodes and 'debug-promote' on
> one of them:
> on [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]:
> # pcs resource debug-start ovndb_servers --full
> on [ node-1.domain.tld ]:
> # pcs resource debug-promote ovndb_servers --full

Before running debug-* commands, I'd unmanage the resource or put the
cluster in maintenance mode, so Pacemaker doesn't try to "correct"
your actions.
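For example, either of these should work (a sketch; substitute your
own resource names):

# pcs property set maintenance-mode=true
  ... run the debug-* commands ...
# pcs property set maintenance-mode=false

or, to stop managing just this one resource:

# pcs resource unmanage ovndb_servers-master
  ... run the debug-* commands ...
# pcs resource manage ovndb_servers-master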
> 5. pcs status
>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>      Stopped: [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]
>  VirtualIP    (ocf::heartbeat:IPaddr2):    Stopped
>
> 6. However, I have seen that one of the ovndb_servers instances has
> indeed been promoted to master, but pcs status still shows everything
> 'Stopped'. What am I missing?

It's hard to tell from these logs. It's possible the resource agent's
monitor command is not exiting with the expected status values:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_requirements_for_multi_state_resource_agents

One of the nodes will be elected the DC, meaning it coordinates the
cluster's actions. The DC's logs will have more "pengine:" messages,
with each action that needs to be taken (e.g. "* Start <rsc> <node>").
You can look through those actions to see what the cluster decided to
do -- whether the resources were ever started, whether any were
promoted, and whether any were explicitly stopped.
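For example, on the DC, something like this should pull out the
scheduler's decisions (the log path here is an assumption; depending
on the distro and version it may be /var/log/pacemaker.log,
/var/log/cluster/corosync.log, or just syslog):

# grep "pengine:" /var/log/pacemaker.log | \
      grep -E "Start|Stop|Promote|Demote|Recover"

You can also replay the scheduler's decision against the live CIB,
with allocation scores, using crm_simulate:

# crm_simulate -sL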
> > stderr: + 17:45:59: ocf_log:327: __OCF_MSG='ovndb_servers: Promoting node-1.domain.tld as the master'
> > stderr: + 17:45:59: ocf_log:329: case "${__OCF_PRIO}" in
> > stderr: + 17:45:59: ocf_log:333: __OCF_PRIO=INFO
> > stderr: + 17:45:59: ocf_log:338: '[' INFO = DEBUG ']'
> > stderr: + 17:45:59: ocf_log:341: ha_log 'INFO: ovndb_servers: Promoting node-1.domain.tld as the master'
> > stderr: + 17:45:59: ha_log:253: __ha_log 'INFO: ovndb_servers: Promoting node-1.domain.tld as the master'
> > stderr: + 17:45:59: __ha_log:185: local ignore_stderr=false
> > stderr: + 17:45:59: __ha_log:186: local loglevel
> > stderr: + 17:45:59: __ha_log:188: '[' 'xINFO: ovndb_servers: Promoting node-1.domain.tld as the master' = x--ignore-stderr ']'
> > stderr: + 17:45:59: __ha_log:190: '[' none = '' ']'
> > stderr: + 17:45:59: __ha_log:192: tty
> > stderr: + 17:45:59: __ha_log:193: '[' x = x0 -a x = xdebug ']'
> > stderr: + 17:45:59: __ha_log:195: '[' false = true ']'
> > stderr: + 17:45:59: __ha_log:199: '[' '' ']'
> > stderr: + 17:45:59: __ha_log:202: echo 'INFO: ovndb_servers: Promoting node-1.domain.tld as the master'
> > stderr: INFO: ovndb_servers: Promoting node-1.domain.tld as the master
> > stderr: + 17:45:59: __ha_log:204: return 0
> > stderr: + 17:45:59: ovsdb_server_promote:378: /usr/sbin/crm_attribute --type crm_config --name OVN_REPL_INFO -s ovn_ovsdb_master_server -v node-1.domain.tld
> > stderr: + 17:45:59: ovsdb_server_promote:379: ovsdb_server_master_update 8
> > stderr: + 17:45:59: ovsdb_server_master_update:214: case $1 in
> > stderr: + 17:45:59: ovsdb_server_master_update:218: /usr/sbin/crm_master -l reboot -v 10
> > stderr: + 17:45:59: ovsdb_server_promote:380: return 0
> > stderr: + 17:45:59: 458: rc=0
> > stderr: + 17:45:59: 459: exit 0

> On 23/11/17 23:52 +0800, Hui Xiang wrote:
> > I am working on HA with 3 nodes, with the following configuration:
> >
> > """
> > pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
> >   master_ip=168.254.101.2 \
> >   op monitor interval="10s" \
> >   op monitor interval="11s" role=Master
> > pcs resource master ovndb_servers-master ovndb_servers \
> >   meta notify="true" master-max="1" master-node-max="1" \
> >   clone-max="3" clone-node-max="1"
> > pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=168.254.101.2 \
> >   op monitor interval=10s
> > pcs constraint order promote ovndb_servers-master then VirtualIP
> > pcs constraint colocation add VirtualIP with master ovndb_servers-master \
> >   score=INFINITY
> > """

> (Out of curiosity, this looks like a mix of output from
>   pcs config export pcs-commands  [or clufter cib2pcscmd -s]
> and manual editing. Is this a good guess?)

> It's the output of "pcs status".

> > However, after setting it up as above, no master is being selected
> > and everything stays stopped, even though the pacemaker log shows
> > node-1 has been chosen as the master. I am confused about where
> > this goes wrong; any help would be very much appreciated.
> >
> > Master/Slave Set: ovndb_servers-master [ovndb_servers]
> >      Stopped: [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]
> > VirtualIP    (ocf::heartbeat:IPaddr2):    Stopped
> >
> > # pacemaker log
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++ /cib/configuration/resources:  <primitive class="ocf" id="ovndb_servers" provider="ovn" type="ovndb-servers"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++   <instance_attributes id="ovndb_servers-instance_attributes">
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++     <nvpair id="ovndb_servers-instance_attributes-master_ip" name="master_ip" value="168.254.101.2"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++   </instance_attributes>
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++   <operations>
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++     <op id="ovndb_servers-start-timeout-30s" interval="0s" name="start" timeout="30s"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++     <op id="ovndb_servers-stop-timeout-20s" interval="0s" name="stop" timeout="20s"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++     <op id="ovndb_servers-promote-timeout-50s" interval="0s" name="promote" timeout="50s"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++     <op id="ovndb_servers-demote-timeout-50s" interval="0s" name="demote" timeout="50s"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++     <op id="ovndb_servers-monitor-interval-10s" interval="10s" name="monitor"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++     <op id="ovndb_servers-monitor-interval-11s-role-Master" interval="11s" name="monitor" role="Master"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++   </operations>
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++ </primitive>
> >
> > Nov 23 23:06:03 [665249] node-1.domain.tld attrd: info: attrd_peer_update: Setting master-ovndb_servers[node-1.domain.tld]: (null) -> 5 from node-1.domain.tld

> If your ocf:ovn:ovndb-servers agent, in master mode, can plausibly
> run something like "attrd_updater -n master-ovndb_servers -U 5", then
> it was indeed launched OK, and if it then does not continue to run as
> expected, there may be a problem with the agent itself.

> no change.
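Two things might be worth checking by hand at this point (a sketch
only; the agent path follows the standard OCF layout for the "ovn"
provider, and the environment variables are my assumptions about what
the agent needs when run outside the cluster).

First, what the monitor action actually returns on each node; a
multi-state agent is expected to exit 0 (OCF_SUCCESS) on a running
slave, 8 (OCF_RUNNING_MASTER) on the master, and 7 (OCF_NOT_RUNNING)
when stopped:

# OCF_ROOT=/usr/lib/ocf \
  OCF_RESOURCE_INSTANCE=ovndb_servers \
  OCF_RESKEY_master_ip=192.168.0.99 \
  /usr/lib/ocf/resource.d/ovn/ovndb-servers monitor; echo "rc=$?"

Second, whether the master score the agent sets via crm_master really
lands in the CIB as a transient node attribute:

# crm_attribute -N node-1.domain.tld -n master-ovndb_servers -l reboot -G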
> You can try running "pcs resource debug-promote ovndb_servers --full"
> to examine the execution details (assuming the agent responds to the
> OCF_TRACE_RA=1 environment variable, which shell-based agents built
> on top of the sourceable ocf-shellfuncs shell library from the
> resource-agents project, hence including the agents it ships,
> customarily do).

> Yes, thanks, it's helpful.

> > Nov 23 23:06:03 [665251] node-1.domain.tld crmd: notice: process_lrm_event: Operation ovndb_servers_monitor_0: ok (node=node-1.domain.tld, call=185, rc=0, cib-update=88, confirmed=true)
> > <29>Nov 23 23:06:03 node-1 crmd[665251]: notice: process_lrm_event: Operation ovndb_servers_monitor_0: ok (node=node-1.domain.tld, call=185, rc=0, cib-update=88, confirmed=true)
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: Diff: --- 0.630.2 2
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: Diff: +++ 0.630.3 (null)
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: + /cib: @num_updates=3
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_perform_op: ++ /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']: <nvpair id="status-1-master-ovndb_servers" name="master-ovndb_servers" value="5"/>
> > Nov 23 23:06:03 [665246] node-1.domain.tld cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node-3.domain.tld/attrd/80, version=0.630.3)

> Also depends if there's anything interesting after this point...

-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org