Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources
Ken (and Ulrich),

Found it! You're right, we do deliver a man page...

man]# find . -name *Virtual* -print
./man7/ocf_heartbeat_VirtualDomain.7.gz

# rpm -q --whatprovides /usr/share/man/man7/ocf_heartbeat_VirtualDomain.7.gz
resource-agents-3.9.7-4.el7_2.kvmibm1_1_3.1.s390x

Much obliged, sir(s).

Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com
PHONE: 8/293-7301 (845-433-7301)  M/S: POK 42HA/P966


From: Ken Gaillot <kgail...@redhat.com>
To: users@clusterlabs.org
Date: 02/01/2017 10:33 AM
Subject: Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

On 02/01/2017 09:15 AM, Scott Greenlese wrote:
> Hi all...
>
> Just a quick follow-up.
>
> Thought I should come clean and share with you that the incorrect
> "migrate-to" operation name defined in my VirtualDomain resource was my
> mistake. It was mis-coded in the virtual guest provisioning script. I
> have since changed it to "migrate_to", and of course the specified live
> migration timeout value is working effectively now. (For some reason, I
> assumed we were letting that operation meta value default.)
>
> I was wondering if someone could refer me to the definitive online link
> for pacemaker resource man pages? I don't see any resource man pages
> installed on my system anywhere. I found this one online:
> https://www.mankier.com/7/ocf_heartbeat_VirtualDomain but is there a
> more 'official' page I should refer our Linux KVM on System z customers
> to?

All distributions that I know of include the man pages with the packages
they distribute. Are you building from source? They are named like
"man ocf_heartbeat_IPaddr2".

FYI, after following this thread, the pcs developers are making a change
so that pcs refuses to add an unrecognized operation unless the user uses
--force. Thanks for being involved in the community; this is how we learn
to improve!

> Thanks again for your assistance.
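Until that pcs change lands, an operation name can be sanity-checked by hand against the actions the agent actually advertises in its OCF meta-data. The sketch below is self-contained: the here-doc stands in for real `meta-data` output (on a typical install something like `OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/VirtualDomain meta-data` would produce the full XML; that path and invocation are assumptions about your layout, not taken from this thread).

```shell
# Extract the advertised <action name="..."> entries from OCF meta-data
# XML. The here-doc is a trimmed stand-in for the agent's real output.
actions=$(sed -n 's/.*<action name="\([^"]*\)".*/\1/p' <<'EOF'
<actions>
<action name="start"        timeout="90s" />
<action name="stop"         timeout="90s" />
<action name="monitor"      timeout="30s" interval="10s" depth="0" />
<action name="migrate_from" timeout="60s" />
<action name="migrate_to"   timeout="120s" />
</actions>
EOF
)

# Check a valid and an invalid operation name against the list:
for op in migrate_to migrate-to; do
    if echo "$actions" | grep -qx "$op"; then
        echo "$op: recognized"
    else
        echo "$op: NOT advertised by this agent"
    fi
done
# prints:
#   migrate_to: recognized
#   migrate-to: NOT advertised by this agent
```

With a check like this in the provisioning script, the mis-coded "migrate-to" would have been caught before the resource was ever created.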
Hi all...

Just a quick follow-up.

Thought I should come clean and share with you that the incorrect
"migrate-to" operation name defined in my VirtualDomain resource was my
mistake. It was mis-coded in the virtual guest provisioning script. I
have since changed it to "migrate_to", and of course the specified live
migration timeout value is working effectively now. (For some reason, I
assumed we were letting that operation meta value default.)

I was wondering if someone could refer me to the definitive online link
for pacemaker resource man pages? I don't see any resource man pages
installed on my system anywhere. I found this one online:
https://www.mankier.com/7/ocf_heartbeat_VirtualDomain but is there a
more 'official' page I should refer our Linux KVM on System z customers
to?

Thanks again for your assistance.

Scott Greenlese ... IBM KVM on System Z Solution Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com


From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
To: <users@clusterlabs.org>, Scott Greenlese/Poughkeepsie/IBM@IBMUS
Cc: "Si Bo Niu" <nius...@cn.ibm.com>, Michael Tebolt/Poughkeepsie/IBM@IBMUS
Date: 01/27/2017 02:32 AM
Subject: Antw: Re: [ClusterLabs] Antw: Re: Live Guest Migration timeouts for VirtualDomain resources

>>> "Scott Greenlese" <swgre...@us.ibm.com> wrote on 27.01.2017 at 02:47 in message
<of63cd0e10.d58c4c3d-on002580b5.0005c410-852580b5.0009d...@notes.na.collabserv.com>:

> Hi guys..
>
> Well, today I confirmed that what Ulrich said is correct. If I update
> the VirtualDomain resource with the operation name "migrate_to" instead
> of "migrate-to", it effectively overrides the 1200ms default and
> enforces the new value.
>
> I am wondering how I would have known that I was using the wrong
> operation name, when the initial operation name is already incorrect
> when the resource is created?

For SLES 11, I made a quick (portable non-portable unstable) try (print
the operations known to an RA):

# crm ra info VirtualDomain | sed -n -e "/Operations' defaults/,\$p"
Operations' defaults (advisory minimum):

    start         timeout=90
    stop          timeout=90
    status        timeout=30 interval=10
    monitor       timeout=30 interval=10
    migrate_from  timeout=60
    migrate_to    timeout=120

Regards,
Ulrich

> This is what the meta data for my resource looked like after making the
> update:
>
> [root@zs95kj VD]# date;pcs resource update zs95kjg110065_res op migrate_to timeout="360s"
> Thu Jan 26 16:43:11 EST 2017
> You have new mail in /var/spool/mail/root
>
> [root@zs95kj VD]# date;pcs resource show zs95kjg110065_res
> Thu Jan 26 16:43:46 EST 2017
> Resource: zs95kjg110065_res (class=ocf provider=heartbeat type=VirtualDomain)
>  Attributes: config=/guestxml/nfs1/zs95kjg110065.xml hypervisor=qemu:///system migration_transport=ssh
>  Meta Attrs: allow-migrate=true
>  Operations: start interval=0s timeout=120 (zs95kjg110065_res-start-interval-0s)
>              stop interval=0s timeout=120 (zs95kjg110065_res-stop-interval-0s)
>              monitor interval=30s (zs95kjg110065_res-monitor-interval-30s)
>              migrate-from interval=0s timeout=1200 (zs95kjg110065_res-migrate-from-interval-0s)
>              migrate-to interval=0s timeout=1200 (zs95kjg110065_res-migrate-to-interval-0s)  <<< Original op name / value
>              migrate_to interval=0s timeout=360s (zs95kjg110065_res-migrate_to-interval-0s)  <<< New op name / value
>
> Where does that original op name come from in the VirtualDomain
> resource definition? How can we get the initial meta value changed and
> shipped with a valid operation name (i.e. migrate_to), and maybe a more
> reasonable migrate_to timeout value... something significantly higher
> than 1200ms, i.e. 1.2 seconds? Can I report this request as a bugzilla
> on the RHEL side, or should this go to my internal IBM bugzilla for KVM
> on System Z development?
>
> Anyway, thanks so much for identifying my issue. I can reconfigure my
> resources to make them tolerate longer migration execution times.
>
>
> Scott Greenlese ... IBM KVM on System Z Solution Test
> INTERNET: swgre...@us.ibm.com


From: Ken Gaillot <kgail...@redhat.com>
To: Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de>, users@clusterlabs.org
Date: 01/19/2017 10:26 AM
Subject: Re: [ClusterLabs] Antw: Re: Live Guest Migration timeouts for VirtualDomain resources

On 01/19/2017 01:36 AM, Ulrich Windl wrote:
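The units confusion running through these messages comes down to how duration strings are parsed. As a rough illustration only (this is not Pacemaker's actual parser; how a bare, unit-less number is interpreted has varied by context and version, which is exactly why the thread's advice is to always write the unit explicitly), a converter might look like:

```shell
# to_ms DURATION - convert a duration string ("360s", "2m", "1200ms",
# or a bare number) to milliseconds. Bare numbers are treated as
# milliseconds here, matching the interpretation discussed in this
# thread; real Pacemaker parsing may differ, so always write the unit.
to_ms() {
    case "$1" in
        *ms) echo "${1%ms}" ;;
        *s)  echo "$(( ${1%s} * 1000 ))" ;;
        *m)  echo "$(( ${1%m} * 60000 ))" ;;
        *h)  echo "$(( ${1%h} * 3600000 ))" ;;
        *)   echo "$1" ;;
    esac
}

to_ms 360s    # prints 360000 -- the explicit form used in the update above
to_ms 1200    # prints 1200   -- why a unit-less "timeout=1200" can mean 1.2 seconds
```

Under this reading, the shipped default of "1200" is a 1.2-second migration timeout, while "1200s" would be the 20 minutes Scott expected.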
Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

Here is the pacemaker debug detail for the timed-out migration:

[...] debug: log_finished: finished - rsc:zs95kjg110061_res action:migrate_to call_id:941 pid:135045 exit-code:1 exec-time:20003ms queue-time:0ms
Jan 17 13:55:14 [27552] zs95kj lrmd: ( lrmd.c:1292 ) trace: lrmd_rsc_execute: Nothing further to do for zs95kjg110061_res
Jan 17 13:55:14 [27555] zs95kj crmd: ( utils.c:1942 ) debug: create_operation_update: do_update_resource: Updating resource zs95kjg110061_res after migrate_to op Timed Out (interval=0)
Jan 17 13:55:14 [27555] zs95kj crmd: ( lrm.c:2397 ) error: process_lrm_event: Operation zs95kjg110061_res_migrate_to_0: Timed Out (node=zs95kjpcs1, call=941, timeout=20000ms)
Jan 17 13:55:14 [27555] zs95kj crmd: ( lrm.c:196 ) debug: update_history_cache: Updating history for 'zs95kjg110061_res' with migrate_to op

Any ideas?

Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com


From: Ken Gaillot <kgail...@redhat.com>
To: users@clusterlabs.org
Date: 01/17/2017 11:41 AM
Subject: Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

On 01/17/2017 10:19 AM, Scott Greenlese wrote:
> Hi..
>
> I've been testing live guest migration (LGM) with VirtualDomain
> resources, which are guests running on Linux KVM / System Z managed by
> pacemaker.
>
> I'm looking for documentation that explains how to configure my
> VirtualDomain resources such that they will not time out prematurely
> when there is a heavy I/O workload running on the guest.
>
> If I perform the LGM with an unmanaged guest (resource disabled), it
> takes anywhere from 2 - 5 minutes to complete the LGM.
>
> Example:
>
> # Migrate guest, specify a timeout value of 600s
>
> [root@zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live --persistent --undefinesource --timeout 600 --verbose zs95kjg110061 qemu+ssh://zs90kppcs1/system
> Mon Jan 16 16:35:32 EST 2017
> Migration: [100 %]
>
> [root@zs95kj VD]# date
> Mon Jan 16 16:40:01 EST 2017
>
> Start: 16:35:32
> End:   16:40:01
> Total: 4 min 29 sec
>
> In comparison, when the guest is managed by pacemaker, and enabled for
> LGM, I get this:
>
> [root@zs95kj VD]# date;pcs resource show zs95kjg110061_res
> Mon Jan 16 15:13:33 EST 2017
> Resource: zs95kjg110061_res (class=ocf provider=heartbeat type=VirtualDomain)
>  Attributes: config=/guestxml/nfs1/zs95kjg110061.xml hypervisor=qemu:///system migration_transport=ssh
>  Meta Attrs: allow-migrate=true remote-node=zs95kjg110061 remote-addr=10.20.110.61
>  Operations: start interval=0s timeout=480 (zs95kjg110061_res-start-interval-0s)
>              stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
>              monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
>              migrate-from interval=0s timeout=1200 (zs95kjg110061_res-migrate-from-interval-0s)
>              migrate-to interval=0s timeout=1200 (zs95kjg110061_res-migrate-to-interval-0s)
>
> NOTE: I didn't specify any migrate-to value for timeout, so it
> defaulted to 1200. Is this seconds? If so, that's 20 minutes, ample
> time to complete a 5 minute migration.

Not sure where the default of 1200 comes from, but I believe the default
is milliseconds if no unit is specified. Normally you'd specify
something like "timeout=1200s".

> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:27:01 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
>
> [root@zs95kj VD]# date;pcs resource move zs95kjg110061_res zs95kjpcs1
> Mon Jan 16 14:45:39 EST 2017
> You have new mail in /var/spool/mail/root
>
> Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO: zs95kjg110061: Starting live migration to zs95kjpcs1 (using: virsh --connect=qemu:///system --quiet migrate --live zs95kjg110061 qemu+ssh://zs95kjpcs1/system ).
> Jan 16 14:45:57 zs90kp lrmd[12798]: warning: zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
> Jan 16 14:45:57 zs90kp lrmd[12798]: warning: zs95kjg110061_res_migrate_to_0:21050 - timed out after 2ms
> Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978, timeout=2ms)
> Jan 16 14:45:58 zs90kp journal: operation failed: migration job: unexpectedly failed
>
> So, the migration timed out after 2ms. Assuming ms is milliseconds,
> that's only 20 seconds. So, it seems that LGM timeout has nothing to do
> with migrate-to on the resource definition.

Yes, ms is milliseconds. Pacemaker internally represents all times in
milliseconds, even though in most actual usage, it has 1-second
granularity.

If your specified timeout is 1200ms, I'm not sure why it's using 2ms.
There may be a minimum enforced somewhere.

> Also, what is the expected behavior when the migration times out? I
> watched the VirtualDomain resource state during the migration process...
>
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:45:57 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:02 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:06 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:08 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
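As a cross-check on the figures quoted in this exchange, the unmanaged virsh migration window (16:35:32 to 16:40:01) can be verified with a little date arithmetic; this sketch assumes GNU coreutils `date -d`:

```shell
# Compute the elapsed time of the unmanaged virsh migration quoted above.
start="16:35:32"   # Mon Jan 16 16:35:32 EST 2017 (migration started)
end="16:40:01"     # Mon Jan 16 16:40:01 EST 2017 (migration finished)

# Anchor both times to the same (arbitrary) date to subtract them.
s=$(date -u -d "1970-01-01 $start" +%s)
e=$(date -u -d "1970-01-01 $end" +%s)
elapsed=$(( e - s ))

printf '%d seconds (%d min %d sec)\n' "$elapsed" $(( elapsed / 60 )) $(( elapsed % 60 ))
# prints: 269 seconds (4 min 29 sec)
```

That 269-second migration comfortably fits a 1200-second timeout but is hopeless against a 1200-millisecond one, which is the crux of the whole thread.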