>>> "Scott Greenlese" <swgre...@us.ibm.com> schrieb am 01.03.2017 um 22:07 in Nachricht <offc50c6dc.1138528d-on002580d6.006f49aa-852580d6.00740...@notes.na.collabserv.c m>:
> Hi..
>
> I am running a few corosync "passive mode" Redundant Ring Protocol (RRP)
> failure scenarios, where my cluster has several remote-node VirtualDomain
> resources running on each node in the cluster, which have been configured
> to allow Live Guest Migration (LGM) operations.
>
> While both corosync rings are active, if I drop ring0 on a given node
> where I have remote nodes (guests) running, I noticed that the guest is
> shut down / restarted on the same host, after which the connection is
> re-established and the guest proceeds to run on that same cluster node.

Could it be you forgot "allow-migrate=true" at the resource level, or some
migration IP address at the node level? I only have SLES11 here...

> I am wondering why pacemaker doesn't try to "live" migrate the remote
> node (guest) to a different node instead of rebooting the guest. Is there
> some way to configure the remote nodes such that the recovery action is
> LGM instead of reboot when the host-to-remote_node connection is lost in
> an RRP situation? I guess the next question is: is it even possible to
> LGM a remote node guest if the corosync ring fails over from ring0 to
> ring1 (or vice versa)?
>
> # For example, here's a remote node's VirtualDomain resource definition.
>
> [root@zs95kj]# pcs resource show zs95kjg110102_res
>  Resource: zs95kjg110102_res (class=ocf provider=heartbeat type=VirtualDomain)
>   Attributes: config=/guestxml/nfs1/zs95kjg110102.xml
>               hypervisor=qemu:///system migration_transport=ssh
>   Meta Attrs: allow-migrate=true remote-node=zs95kjg110102
>               remote-addr=10.20.110.102
>   Operations: start interval=0s timeout=480 (zs95kjg110102_res-start-interval-0s)
>               stop interval=0s timeout=120 (zs95kjg110102_res-stop-interval-0s)
>               monitor interval=30s (zs95kjg110102_res-monitor-interval-30s)
>               migrate-from interval=0s timeout=1200 (zs95kjg110102_res-migrate-from-interval-0s)
>               migrate-to interval=0s timeout=1200 (zs95kjg110102_res-migrate-to-interval-0s)
> [root@zs95kj VD]#
>
> # My RRP rings are active, configured with "rrp_mode: passive"
>
> [root@zs95kj ~]# corosync-cfgtool -s
> Printing ring status.
> Local node ID 2
> RING ID 0
>         id      = 10.20.93.12
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 10.20.94.212
>         status  = ring 1 active with no faults
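A side note on the VirtualDomain definition above: with allow-migrate=true,
a live migration (when the cluster does attempt one) boils down to the agent
shelling out to virsh. A rough sketch only, assuming what
migration_transport=ssh implies; the exact URI the agent builds varies by
agent version, and <target-host> is a placeholder:

    # Approximately what the VirtualDomain agent runs for a migrate-to
    virsh --connect qemu:///system migrate --live zs95kjg110102 \
          qemu+ssh://<target-host>/system

So whichever network virsh/SSH uses toward <target-host> must stay reachable
during the ring failover, independently of corosync's rings.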
> # Here's the corosync.conf ..
>
> [root@zs95kj ~]# cat /etc/corosync/corosync.conf
> totem {
>     version: 2
>     secauth: off
>     cluster_name: test_cluster_2
>     transport: udpu
>     rrp_mode: passive
> }
>
> nodelist {
>     node {
>         ring0_addr: zs95kjpcs1
>         ring1_addr: zs95kjpcs2
>         nodeid: 2
>     }
>
>     node {
>         ring0_addr: zs95KLpcs1
>         ring1_addr: zs95KLpcs2
>         nodeid: 3
>     }
>
>     node {
>         ring0_addr: zs90kppcs1
>         ring1_addr: zs90kppcs2
>         nodeid: 4
>     }
>
>     node {
>         ring0_addr: zs93KLpcs1
>         ring1_addr: zs93KLpcs2
>         nodeid: 5
>     }
>
>     node {
>         ring0_addr: zs93kjpcs1
>         ring1_addr: zs93kjpcs2
>         nodeid: 1
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
> }
>
> logging {
>     to_logfile: yes
>     logfile: /var/log/corosync/corosync.log
>     timestamp: on
>     syslog_facility: daemon
>     to_syslog: yes
>     debug: on
>
>     logger_subsys {
>         debug: off
>         subsys: QUORUM
>     }
> }
>
> # Here's the vlan / route situation on cluster node zs95kj:
>
> ring0 is on vlan1293
> ring1 is on vlan1294
>
> [root@zs95kj ~]# route -n
> Kernel IP routing table
> Destination    Gateway       Genmask        Flags Metric Ref Use Iface
> 0.0.0.0        10.20.93.254  0.0.0.0        UG    400    0   0   vlan1293  << default route to guests from ring0
> 9.0.0.0        9.12.23.1     255.0.0.0      UG    400    0   0   vlan508
> 9.12.23.0      0.0.0.0       255.255.255.0  U     400    0   0   vlan508
> 10.20.92.0     0.0.0.0       255.255.255.0  U     400    0   0   vlan1292
> 10.20.93.0     0.0.0.0       255.255.255.0  U     0      0   0   vlan1293  << ring0 IPs
> 10.20.93.0     0.0.0.0       255.255.255.0  U     400    0   0   vlan1293
> 10.20.94.0     0.0.0.0       255.255.255.0  U     0      0   0   vlan1294  << ring1 IPs
> 10.20.94.0     0.0.0.0       255.255.255.0  U     400    0   0   vlan1294
> 10.20.101.0    0.0.0.0       255.255.255.0  U     400    0   0   vlan1298
> 10.20.109.0    10.20.94.254  255.255.255.0  UG    400    0   0   vlan1294  << route to guests on 10.20.109 from ring1
> 10.20.110.0    10.20.94.254  255.255.255.0  UG    400    0   0   vlan1294  << route to guests on 10.20.110 from ring1
> 169.254.0.0    0.0.0.0       255.255.0.0    U     1007   0   0   enccw0.0.02e0
> 169.254.0.0    0.0.0.0       255.255.0.0    U     1016   0   0   ovsbridge1
> 192.168.122.0  0.0.0.0       255.255.255.0  U     0      0   0   virbr0
>
> # On the remote node, you can see we have a connection back to the host.
>
> Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
> Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: lrmd
> Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: notice: lrmd_init_remote_tls_server: Starting a tls listener on port 3121.
> Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: notice: bind_and_listen: Listening on address ::
> Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_ro
> Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_rw
> Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_shm
> Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: attrd
> Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: stonith-ng
> Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: crmd
> Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info: main: Starting
> Feb 28 14:30:27 [928] zs95kjg110102 pacemaker_remoted: notice: lrmd_remote_listen: LRMD client connection established. 0x9ec18b50 id: 93e25ef0-4ff8-45ac-a6ed-f13b64588326
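To see which path the host actually takes toward the guest's remote-addr
(and thus which network carries the pacemaker_remote TCP connection), the
kernel can be asked directly; a quick check, using the remote-addr from the
resource definition above:

    # Which route and source address does the host pick for the remote node?
    ip route get 10.20.110.102

Given the routing table above this should resolve via 10.20.94.254 on
vlan1294; compare that with the source address of the ESTABLISHED port-3121
connection in the netstat output below.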
> zs95kjg110102:~ # netstat -anp
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address        Foreign Address     State        PID/Program name
> tcp        0      0 0.0.0.0:22           0.0.0.0:*           LISTEN       946/sshd
> tcp        0      0 127.0.0.1:25         0.0.0.0:*           LISTEN       1022/master
> tcp        0      0 0.0.0.0:5666         0.0.0.0:*           LISTEN       931/xinetd
> tcp        0      0 0.0.0.0:5801         0.0.0.0:*           LISTEN       931/xinetd
> tcp        0      0 0.0.0.0:5901         0.0.0.0:*           LISTEN       931/xinetd
> tcp        0      0 :::21                :::*                LISTEN       926/vsftpd
> tcp        0      0 :::22                :::*                LISTEN       946/sshd
> tcp        0      0 ::1:25               :::*                LISTEN       1022/master
> tcp        0      0 :::44931             :::*                LISTEN       1068/xdm
> tcp        0      0 :::80                :::*                LISTEN       929/httpd-prefork
> tcp        0      0 :::3121              :::*                LISTEN       928/pacemaker_remot
> tcp        0      0 10.20.110.102:3121   10.20.93.12:46425   ESTABLISHED  928/pacemaker_remot
> udp        0      0 :::177               :::*                             1068/xdm
>
> ## Drop the ring0 (vlan1293) interface on cluster node zs95kj, causing failover to ring1 (vlan1294)
>
> [root@zs95kj]# date;ifdown vlan1293
> Tue Feb 28 15:54:11 EST 2017
> Device 'vlan1293' successfully disconnected.
>
> ## Confirm that ring0 is now offline (a.k.a. "FAULTY")
>
> [root@zs95kj]# date;corosync-cfgtool -s
> Tue Feb 28 15:54:49 EST 2017
> Printing ring status.
> Local node ID 2
> RING ID 0
>         id      = 10.20.93.12
>         status  = Marking ringid 0 interface 10.20.93.12 FAULTY
> RING ID 1
>         id      = 10.20.94.212
>         status  = ring 1 active with no faults
> [root@zs95kj VD]#
>
> # See that the resource stayed local to cluster node zs95kj.
>
> [root@zs95kj]# date;pcs resource show |grep zs95kjg110102
> Tue Feb 28 15:55:32 EST 2017
>  zs95kjg110102_res  (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
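Side note: depending on the corosync version and configuration, a ring
marked FAULTY may not re-enable itself once the interface comes back. After
restoring vlan1293, the redundant ring state can be reset cluster-wide with:

    # Re-enable redundant ring operation after the fault is repaired
    corosync-cfgtool -r

Whether this happens automatically depends on the RRP auto-recovery support
in the corosync build in use, so it is worth checking after each test.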
> # On the remote node, new entries in pacemaker.log show the connection
> # being re-established.
>
> Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted: notice: crm_signal_dispatch: Invoking handler for signal 15: Terminated
> Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted: info: lrmd_shutdown: Terminating with 1 clients
> Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_withdraw: withdrawing server sockets
> Feb 28 15:55:17 [928] zs95kjg110102 pacemaker_remoted: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
> Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: lrmd
> Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: notice: lrmd_init_remote_tls_server: Starting a tls listener on port 3121.
> Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: notice: bind_and_listen: Listening on address ::
> Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_ro
> Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_rw
> Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_shm
> Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: attrd
> Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: stonith-ng
> Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: qb_ipcs_us_publish: server name: crmd
> Feb 28 15:55:37 [942] zs95kjg110102 pacemaker_remoted: info: main: Starting
> Feb 28 15:55:38 [942] zs95kjg110102 pacemaker_remoted: notice: lrmd_remote_listen: LRMD client connection established. 0xbed1ab50 id: b19ed532-6f61-4d9c-9439-ffb836eea34f
> zs95kjg110102:~ #
>
> zs95kjg110102:~ # netstat -anp |less
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address        Foreign Address      State        PID/Program name
> tcp        0      0 0.0.0.0:22           0.0.0.0:*            LISTEN       961/sshd
> tcp        0      0 127.0.0.1:25         0.0.0.0:*            LISTEN       1065/master
> tcp        0      0 0.0.0.0:5666         0.0.0.0:*            LISTEN       946/xinetd
> tcp        0      0 0.0.0.0:5801         0.0.0.0:*            LISTEN       946/xinetd
> tcp        0      0 0.0.0.0:5901         0.0.0.0:*            LISTEN       946/xinetd
> tcp        0      0 10.20.110.102:22     10.20.94.32:57749    ESTABLISHED  1134/0
> tcp        0      0 :::21                :::*                 LISTEN       941/vsftpd
> tcp        0      0 :::22                :::*                 LISTEN       961/sshd
> tcp        0      0 ::1:25               :::*                 LISTEN       1065/master
> tcp        0      0 :::80                :::*                 LISTEN       944/httpd-prefork
> tcp        0      0 :::3121              :::*                 LISTEN       942/pacemaker_remot
> tcp        0      0 :::34836             :::*                 LISTEN       1070/xdm
> tcp        0      0 10.20.110.102:3121   10.20.94.212:49666   ESTABLISHED  942/pacemaker_remot
> udp        0      0 :::177               :::*                              1070/xdm
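Worth noting in the two netstat snapshots: before the test the
pacemaker_remote session came from 10.20.93.12 (the ring0 address); after
the restart it is established from 10.20.94.212 (the ring1 address). So the
host-to-guest TCP connection did move to the other network, but via a
teardown and reconnect rather than transparently. A quick way to watch this
from either end while testing:

    # Show established pacemaker_remote (TCP 3121) sessions and peer addresses
    ss -tnp | grep ':3121'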
> ## On host node zs95kj, system messages show the remote node (guest) being
> ## shut down and started (but no attempt to LGM).
>
> [root@zs95kj ~]# grep "Feb 28" /var/log/messages |grep zs95kjg110102
> Feb 28 15:55:07 zs95kj crmd[121380]: error: Operation zs95kjg110102_monitor_30000: Timed Out (node=zs95kjpcs1, call=2, timeout=30000ms)
> Feb 28 15:55:07 zs95kj crmd[121380]: error: Unexpected disconnect on remote-node zs95kjg110102
> Feb 28 15:55:17 zs95kj crmd[121380]: notice: Operation zs95kjg110102_stop_0: ok (node=zs95kjpcs1, call=38, rc=0, cib-update=370, confirmed=true)
> Feb 28 15:55:17 zs95kj attrd[121378]: notice: Removing all zs95kjg110102 attributes for zs95kjpcs1
> Feb 28 15:55:17 zs95kj VirtualDomain(zs95kjg110102_res)[173127]: INFO: Issuing graceful shutdown request for domain zs95kjg110102.
> Feb 28 15:55:23 zs95kj systemd-machined: Machine qemu-38-zs95kjg110102 terminated.
> Feb 28 15:55:23 zs95kj crmd[121380]: notice: Operation zs95kjg110102_res_stop_0: ok (node=zs95kjpcs1, call=858, rc=0, cib-update=378, confirmed=true)
> Feb 28 15:55:24 zs95kj systemd-machined: New machine qemu-64-zs95kjg110102.
> Feb 28 15:55:24 zs95kj systemd: Started Virtual Machine qemu-64-zs95kjg110102.
> Feb 28 15:55:24 zs95kj systemd: Starting Virtual Machine qemu-64-zs95kjg110102.
> Feb 28 15:55:25 zs95kj crmd[121380]: notice: Operation zs95kjg110102_res_start_0: ok (node=zs95kjpcs1, call=859, rc=0, cib-update=385, confirmed=true)
> Feb 28 15:55:38 zs95kj crmd[121380]: notice: Operation zs95kjg110102_start_0: ok (node=zs95kjpcs1, call=44, rc=0, cib-update=387, confirmed=true)
> [root@zs95kj ~]#
>
> Once the remote node re-established its connection, there was no further
> remote node / resource instability.
>
> Anyway, just wondering why there was no attempt to migrate this remote
> node guest as opposed to rebooting it. Is it necessary to reboot the guest
> in order for it to be managed by pacemaker and corosync over the ring1
> interface if ring0 goes down? Is live guest migration even possible if
> ring0 goes away and ring1 takes over?
>
> Thanks in advance..
>
> Scott Greenlese ... KVM on System Z - Solutions Test, IBM Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
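One way to answer that last question empirically: while ring0 is still
FAULTY, request a migration explicitly and watch whether it completes. With
pcs, roughly (syntax from memory, I only have SLES11/crmsh here; zs95KLpcs1
is simply one of the other nodes from your nodelist):

    # Ask pacemaker to live-migrate the guest to another node, ring1 only
    pcs resource move zs95kjg110102_res zs95KLpcs1

    # Afterwards, remove the location constraint that the move created
    pcs resource clear zs95kjg110102_res

If that succeeds, LGM over ring1 is possible in principle, and the
stop/start you saw would be about how pacemaker recovers a failed
remote-node connection rather than about the ring failover itself.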