Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

2017-02-01 Thread Scott Greenlese

Ken (and Ulrich),

Found it!  You're right, we do deliver a man page...

man]# find . -name *Virtual* -print
./man7/ocf_heartbeat_VirtualDomain.7.gz

# rpm -q --whatprovides /usr/share/man/man7/ocf_heartbeat_VirtualDomain.7.gz
resource-agents-3.9.7-4.el7_2.kvmibm1_1_3.1.s390x

Much obliged, sir(s).

Scott Greenlese ... IBM z/BX Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com
  PHONE:  8/293-7301 (845-433-7301)  M/S:  POK 42HA/P966




From:   Ken Gaillot <kgail...@redhat.com>
To: users@clusterlabs.org
Date:   02/01/2017 10:33 AM
Subject:        Re: [ClusterLabs] Live Guest Migration timeouts for
    VirtualDomain resources



On 02/01/2017 09:15 AM, Scott Greenlese wrote:
> Hi all...
>
> Just a quick follow-up.
>
> Thought I should come clean and share with you that the incorrect
> "migrate-to" operation name defined in my VirtualDomain
> resource was my mistake. It was mis-coded in the virtual guest
> provisioning script. I have since changed it to "migrate_to"
> and of course, the specified live migration timeout value is working
> effectively now. (For some reason, I assumed we were letting that
> operation meta value default).
>
> I was wondering if someone could refer me to the definitive online link
> for pacemaker resource man pages? I don't see any resource man pages
> installed
> on my system anywhere. I found this one online:
> https://www.mankier.com/7/ocf_heartbeat_VirtualDomain but is there a
> more 'official' page I should refer our
> Linux KVM on System z customers to?

All distributions that I know of include the man pages with the packages
they distribute. Are you building from source? They are named like "man
ocf_heartbeat_IPaddr2".
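The naming is mechanical (ocf_<provider>_<agent>), so the page name can be derived from the agent name. A tiny sketch; the `ra_manpage` helper is illustrative, not part of any shipped tooling:

```shell
# Illustrative helper (not shipped with any package): derive the
# man-page name for an ocf:heartbeat agent from the agent name.
ra_manpage() {
    printf 'ocf_heartbeat_%s\n' "$1"
}

ra_manpage VirtualDomain   # → ocf_heartbeat_VirtualDomain
# then, on a system with the resource-agents package installed:
#   man "$(ra_manpage VirtualDomain)"
```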

FYI after following this thread, the pcs developers are making a change
so that pcs refuses to add an unrecognized operation unless the user
uses --force. Thanks for being involved in the community; this is how we
learn to improve!

> Thanks again for your assistance.
>
> Scott Greenlese ...IBM KVM on System Z Solution Test Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
>
>
>
> From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
> To: <users@clusterlabs.org>, Scott Greenlese/Poughkeepsie/IBM@IBMUS
> Cc: "Si Bo Niu" <nius...@cn.ibm.com>, Michael Tebolt/Poughkeepsie/IBM@IBMUS
> Date: 01/27/2017 02:32 AM
> Subject: Antw: Re: [ClusterLabs] Antw: Re: Live Guest Migration timeouts
> for VirtualDomain resources
>
> 
>
>
>
>>>> "Scott Greenlese" <swgre...@us.ibm.com> wrote on 27.01.2017 at 02:47 in message
> <of63cd0e10.d58c4c3d-on002580b5.0005c410-852580b5.0009d...@notes.na.collabserv.com>:
>
>> Hi guys..
>>
>> Well, today I confirmed that what Ulrich said is correct.  If I update the
>> VirtualDomain resource with the operation name "migrate_to" instead of
>> "migrate-to", it effectively overrides the 1200ms default with the new
>> value.
>>
>> I am wondering how I would have known that I was using the wrong operation
>> name, when the initial operation name is already incorrect
>> when the resource is created?
>
> For SLES 11, I made a quick (portable non-portable unstable) try (print
> the operations known to an RA):
> # crm ra info VirtualDomain |sed -n -e "/Operations' defaults/,\$p"
> Operations' defaults (advisory minimum):
>
>    start         timeout=90
>    stop          timeout=90
>    status        timeout=30 interval=10
>    monitor       timeout=30 interval=10
>    migrate_from  timeout=60
>    migrate_to    timeout=120
>
> Regards,
> Ulrich
>
>>
>> This is what the meta data for my resource looked like after making the
>> update:
>>
>> [root@zs95kj VD]# date;pcs resource update zs95kjg110065_res op migrate_to
>> timeout="360s"
>> Thu Jan 26 16:43:11 EST 2017
>> You have new mail in /var/spool/mail/root
>>
>> [root@zs95kj VD]# date;pcs resource show zs95kjg110065_res
>> Thu Jan 26 16:43:46 EST 2017
>>  Resource: zs95kjg110065_res (class=ocf provider=heartbeat
>> type=VirtualDomain)
>>   Attributes: config=/guestxml/nfs1/zs95kjg110065.xml
>> hypervisor=qemu:///system migration_transport=ssh
>> [...]

Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

2017-02-01 Thread Ken Gaillot
On 02/01/2017 09:15 AM, Scott Greenlese wrote:
> Hi all...
> 
> Just a quick follow-up.
> 
> Thought I should come clean and share with you that the incorrect
> "migrate-to" operation name defined in my VirtualDomain
> resource was my mistake. It was mis-coded in the virtual guest
> provisioning script. I have since changed it to "migrate_to"
> and of course, the specified live migration timeout value is working
> effectively now. (For some reason, I assumed we were letting that
> operation meta value default).
> 
> I was wondering if someone could refer me to the definitive online link
> for pacemaker resource man pages? I don't see any resource man pages
> installed
> on my system anywhere. I found this one online:
> https://www.mankier.com/7/ocf_heartbeat_VirtualDomain but is there a
> more 'official' page I should refer our
> Linux KVM on System z customers to?

All distributions that I know of include the man pages with the packages
they distribute. Are you building from source? They are named like "man
ocf_heartbeat_IPaddr2".

FYI after following this thread, the pcs developers are making a change
so that pcs refuses to add an unrecognized operation unless the user
uses --force. Thanks for being involved in the community; this is how we
learn to improve!

> Thanks again for your assistance.
> 
> Scott Greenlese ...IBM KVM on System Z Solution Test Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> 
> 
> 
> From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
> To: <users@clusterlabs.org>, Scott Greenlese/Poughkeepsie/IBM@IBMUS
> Cc: "Si Bo Niu" <nius...@cn.ibm.com>, Michael Tebolt/Poughkeepsie/IBM@IBMUS
> Date: 01/27/2017 02:32 AM
> Subject: Antw: Re: [ClusterLabs] Antw: Re: Live Guest Migration timeouts
> for VirtualDomain resources
> 
> 
> 
> 
> 
>>>> "Scott Greenlese" <swgre...@us.ibm.com> wrote on 27.01.2017 at 02:47 in message
> <of63cd0e10.d58c4c3d-on002580b5.0005c410-852580b5.0009d...@notes.na.collabserv.com>:
> 
>> Hi guys..
>>
>> Well, today I confirmed that what Ulrich said is correct.  If I update the
>> VirtualDomain resource with the operation name  "migrate_to" instead of
>> "migrate-to", it effectively overrides the 1200ms default with the new
>> value.
>>
>> I am wondering how I would have known that I was using the wrong operation
>> name, when the initial operation name is already incorrect
>> when the resource is created?
> 
> For SLES 11, I made a quick (portable non-portable unstable) try (print
> the operations known to an RA):
> # crm ra info VirtualDomain |sed -n -e "/Operations' defaults/,\$p"
> Operations' defaults (advisory minimum):
> 
>    start         timeout=90
>    stop          timeout=90
>    status        timeout=30 interval=10
>    monitor       timeout=30 interval=10
>    migrate_from  timeout=60
>    migrate_to    timeout=120
> 
> Regards,
> Ulrich
> 
>>
>> This is what the meta data for my resource looked like after making the
>> update:
>>
>> [root@zs95kj VD]# date;pcs resource update zs95kjg110065_res op migrate_to
>> timeout="360s"
>> Thu Jan 26 16:43:11 EST 2017
>> You have new mail in /var/spool/mail/root
>>
>> [root@zs95kj VD]# date;pcs resource show zs95kjg110065_res
>> Thu Jan 26 16:43:46 EST 2017
>>  Resource: zs95kjg110065_res (class=ocf provider=heartbeat
>> type=VirtualDomain)
>>   Attributes: config=/guestxml/nfs1/zs95kjg110065.xml
>> hypervisor=qemu:///system migration_transport=ssh
>>   Meta Attrs: allow-migrate=true
>>   Operations: start interval=0s timeout=120
>> (zs95kjg110065_res-start-interval-0s)
>>   stop interval=0s timeout=120
>> (zs95kjg110065_res-stop-interval-0s)
>>   monitor interval=30s (zs95kjg110065_res-monitor-interval-30s)
>>   migrate-from interval=0s timeout=1200
>> (zs95kjg110065_res-migrate-from-interval-0s)
>>   migrate-to interval=0s timeout=1200
>> (zs95kjg110065_res-migrate-to-interval-0s)   <<< Original op name / value
>>   migrate_to interval=0s timeout=360s
>> (zs95kjg110065_res-migrate_to-interval-0s)  <<< New op name / value

Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

2017-02-01 Thread Scott Greenlese

Hi all...

Just a quick follow-up.

Thought I should come clean and share with you that the incorrect
"migrate-to" operation name defined in my VirtualDomain
resource was my mistake.  It was mis-coded in the virtual guest
provisioning script.  I have since changed it to "migrate_to"
and of course, the specified live migration timeout value is working
effectively now.  (For some reason, I assumed we were letting that
operation meta value default).

I was wondering if someone could refer me to the definitive online link for
pacemaker resource man pages?  I don't see any resource man pages installed
on my system anywhere.   I found this one online:
https://www.mankier.com/7/ocf_heartbeat_VirtualDomain  but is there a more
'official' page I should refer our
Linux KVM on System z customers to?

Thanks again for your assistance.

Scott Greenlese ...IBM KVM on System Z Solution Test Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com




From:   "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
To: <users@clusterlabs.org>, Scott Greenlese/Poughkeepsie/IBM@IBMUS
Cc: "Si Bo Niu" <nius...@cn.ibm.com>, Michael Tebolt/Poughkeepsie/IBM@IBMUS
Date:   01/27/2017 02:32 AM
Subject:    Antw: Re: [ClusterLabs] Antw: Re: Live Guest Migration timeouts
for VirtualDomain resources



>>> "Scott Greenlese" <swgre...@us.ibm.com> wrote on 27.01.2017 at 02:47 in message
<of63cd0e10.d58c4c3d-on002580b5.0005c410-852580b5.0009d...@notes.na.collabserv.com>:

> Hi guys..
>
> Well, today I confirmed that what Ulrich said is correct.  If I update the
> VirtualDomain resource with the operation name "migrate_to" instead of
> "migrate-to", it effectively overrides the 1200ms default with the new
> value.
>
> I am wondering how I would have known that I was using the wrong operation
> name, when the initial operation name is already incorrect
> when the resource is created?

For SLES 11, I made a quick (portable non-portable unstable) try (print the
operations known to an RA):
 # crm ra info VirtualDomain |sed -n -e "/Operations' defaults/,\$p"
Operations' defaults (advisory minimum):

start         timeout=90
stop          timeout=90
status        timeout=30 interval=10
monitor       timeout=30 interval=10
migrate_from  timeout=60
migrate_to    timeout=120

Regards,
Ulrich
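Where crmsh isn't available, the same action list can be pulled from the agent's metadata XML (on pcs-based systems, one possible source is `crm_resource --show-metadata ocf:heartbeat:VirtualDomain`). A sketch against a trimmed, illustrative metadata sample, showing that "migrate_to" is advertised while "migrate-to" is not:

```shell
# Extract the advertised action names from resource-agent metadata XML.
# The sample below is a trimmed illustration, NOT the agent's full metadata.
list_actions() {
    sed -n 's/.*<action name="\([^"]*\)".*/\1/p'
}

metadata='<actions>
<action name="start" timeout="90s"/>
<action name="stop" timeout="90s"/>
<action name="monitor" timeout="30s" interval="10s"/>
<action name="migrate_from" timeout="60s"/>
<action name="migrate_to" timeout="120s"/>
</actions>'

# Prints one valid operation name per line; "migrate-to" never appears.
printf '%s\n' "$metadata" | list_actions
```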

>
> This is what the meta data for my resource looked like after making the
> update:
>
> [root@zs95kj VD]# date;pcs resource update zs95kjg110065_res op migrate_to
> timeout="360s"
> Thu Jan 26 16:43:11 EST 2017
> You have new mail in /var/spool/mail/root
>
> [root@zs95kj VD]# date;pcs resource show zs95kjg110065_res
> Thu Jan 26 16:43:46 EST 2017
>  Resource: zs95kjg110065_res (class=ocf provider=heartbeat
> type=VirtualDomain)
>   Attributes: config=/guestxml/nfs1/zs95kjg110065.xml
> hypervisor=qemu:///system migration_transport=ssh
>   Meta Attrs: allow-migrate=true
>   Operations: start interval=0s timeout=120
> (zs95kjg110065_res-start-interval-0s)
>   stop interval=0s timeout=120
> (zs95kjg110065_res-stop-interval-0s)
>   monitor interval=30s (zs95kjg110065_res-monitor-interval-30s)
>   migrate-from interval=0s timeout=1200
> (zs95kjg110065_res-migrate-from-interval-0s)
>   migrate-to interval=0s timeout=1200
> (zs95kjg110065_res-migrate-to-interval-0s)   <<< Original op name / value
>   migrate_to interval=0s timeout=360s
> (zs95kjg110065_res-migrate_to-interval-0s)  <<< New op name / value
>
>
> Where does that original op name come from in the VirtualDomain resource
> definition?  How can we get the initial meta value changed and shipped with
> a valid operation name (i.e. migrate_to), and maybe a more reasonable
> migrate_to timeout value... something significantly higher than 1200ms,
> i.e. 1.2 seconds?  Can I report this request as a bugzilla on the RHEL
> side, or should this go to my internal IBM bugzilla for KVM on System Z
> development?
>
> Anyway, thanks so much for identifying my issue.  I can reconfigure my
> resources to make them tolerate longer migration execution times.
>
>
> Scott Greenlese ... IBM KVM on System Z Solution Test
>   INTERNET:  swgre...@us.ibm.com
>
>
>
>
> From:  Ken Gaillot <kgail...@redhat.com>
> To:Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de>,
> users@clusterlabs.org
> Date:  01/19/2017 10:26 AM
> Subject:   Re: [ClusterLabs] Antw: Re: Live Guest Migration timeouts for
> VirtualDomain resources
>
>
>
> On 01/19/2017 01:36 AM, Ulrich Windl wrote:
>> [...]

Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

2017-01-18 Thread Tomas Jelinek
[...] with migrate_to op


Any ideas?



Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com


Inactive hide details for Ken Gaillot ---01/17/2017 11:41:53 AM---On
01/17/2017 10:19 AM, Scott Greenlese wrote: > Hi..Ken Gaillot
---01/17/2017 11:41:53 AM---On 01/17/2017 10:19 AM, Scott Greenlese
wrote: > Hi..

From: Ken Gaillot <kgail...@redhat.com>
To: users@clusterlabs.org
Date: 01/17/2017 11:41 AM
Subject: Re: [ClusterLabs] Live Guest Migration timeouts for
VirtualDomain resources





On 01/17/2017 10:19 AM, Scott Greenlese wrote:

Hi..

I've been testing live guest migration (LGM) with VirtualDomain
resources, which are guests running on Linux KVM / System Z
managed by pacemaker.

I'm looking for documentation that explains how to configure my
VirtualDomain resources such that they will not timeout
prematurely when there is a heavy I/O workload running on the guest.

If I perform the LGM with an unmanaged guest (resource disabled), it
takes anywhere from 2 - 5 minutes to complete the LGM.
Example:

# Migrate guest, specify a timeout value of 600s

[root@zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live
--persistent --undefinesource *--timeout 600* --verbose zs95kjg110061
qemu+ssh://zs90kppcs1/system
Mon Jan 16 16:35:32 EST 2017

Migration: [100 %]

[root@zs95kj VD]# date
Mon Jan 16 16:40:01 EST 2017
[root@zs95kj VD]#

Start: 16:35:32
End: 16:40:01
Total: *4 min 29 sec*
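The elapsed time above was read off two `date` calls by hand; capturing epoch seconds makes the arithmetic automatic. A small sketch (the migrate command itself is elided and stands in as a comment):

```shell
# Compute elapsed wall-clock seconds around a long-running command.
elapsed() {
    echo $(( $2 - $1 ))
}

start=$(date +%s)
# ... run the long migration command here (e.g. the virsh migrate above) ...
end=$(date +%s)
echo "took $(elapsed "$start" "$end") seconds"

elapsed 0 269   # → 269 seconds, i.e. the 4 min 29 sec observed above
```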


In comparison, when the guest is managed by pacemaker, and enabled for
LGM ... I get this:

[root@zs95kj VD]# date;pcs resource show zs95kjg110061_res
Mon Jan 16 15:13:33 EST 2017
Resource: zs95kjg110061_res (class=ocf provider=heartbeat
type=VirtualDomain)
Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
hypervisor=qemu:///system migration_transport=ssh
Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
remote-addr=10.20.110.61
Operations: start interval=0s timeout=480
(zs95kjg110061_res-start-interval-0s)
stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
migrate-from interval=0s timeout=1200
(zs95kjg110061_res-migrate-from-interval-0s)
*migrate-to* interval=0s *timeout=1200*
(zs95kjg110061_res-migrate-to-interval-0s)

NOTE: I didn't specify any migrate-to value for timeout, so it defaulted
to 1200. Is this seconds? If so, that's 20 minutes,
ample time to complete a 5 minute migration.


Not sure where the default of 1200 comes from, but I believe the default
is milliseconds if no unit is specified. Normally you'd specify
something like "timeout=1200s".
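The units trap can be made concrete with a toy converter. This mimics the interpretation described here (a bare number taken as milliseconds, a unit suffix scaling it); it is an illustration, not Pacemaker's actual parsing code:

```shell
# Toy converter (illustration only, NOT Pacemaker's parser): a bare
# number is milliseconds; "ms", "s" and "min" suffixes scale accordingly.
to_msec() {
    case "$1" in
        *min) echo $(( ${1%min} * 60000 )) ;;
        *ms)  echo "${1%ms}" ;;
        *s)   echo $(( ${1%s} * 1000 )) ;;
        *)    echo "$1" ;;
    esac
}

to_msec 1200     # → 1200   (only 1.2 seconds!)
to_msec 1200s    # → 1200000
to_msec 360s     # → 360000
```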


[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:27:01 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
[root@zs95kj VD]#


[root@zs95kj VD]# date;*pcs resource move zs95kjg110061_res zs95kjpcs1*
Mon Jan 16 14:45:39 EST 2017
You have new mail in /var/spool/mail/root


Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO:
zs95kjg110061: *Starting live migration to zs95kjpcs1 (using: virsh
--connect=qemu:///system --quiet migrate --live zs95kjg110061
qemu+ssh://zs95kjpcs1/system ).*
Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms
Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation
zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978,
timeout=20000ms)
Jan 16 14:45:58 zs90kp journal: operation failed: migration job:
unexpectedly failed
[root@zs90KP VD]#

So, the migration timed out after 20000ms. Assuming ms is milliseconds,
that's only 20 seconds. So, it seems that LGM timeout has
nothing to do with *migrate-to* on the resource definition.


Yes, ms is milliseconds. Pacemaker internally represents all times in
milliseconds, even though in most actual usage, it has 1-second granularity.

If your specified timeout is 1200ms, I'm not sure why it's using
20000ms. There may be a minimum enforced somewhere.
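If a minimum really is enforced, its effect would look like this toy sketch; the 20000 ms floor here is an assumption inferred from the logged behavior in this thread, not a documented Pacemaker constant:

```shell
# Hypothetical illustration of an enforced minimum: a tiny specified
# timeout gets silently raised to the floor. The 20000 ms value is an
# assumption drawn from the logs above, not a documented constant.
effective_timeout() {
    floor=20000
    if [ "$1" -lt "$floor" ]; then echo "$floor"; else echo "$1"; fi
}

effective_timeout 1200     # → 20000  (a 1200 ms request hits the floor)
effective_timeout 360000   # → 360000 (a large request passes through)
```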


Also, what is the expected behavior when the migration times out? I
watched the VirtualDomain resource state during the migration process...

[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:45:57 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:02 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:06 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:08 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;

Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

2017-01-18 Thread Ken Gaillot
[...] log_finished:
> finished - rsc:zs95kjg110061_res action:migrate_to call_id:941
> pid:135045 exit-code:1 exec-time:20003ms queue-time:0ms
> Jan 17 13:55:14 [27552] zs95kj lrmd: ( lrmd.c:1292 ) trace:
> lrmd_rsc_execute: Nothing further to do for zs95kjg110061_res
> Jan 17 13:55:14 [27555] zs95kj crmd: ( utils.c:1942 ) debug:
> create_operation_update: do_update_resource: Updating resource
> zs95kjg110061_res after migrate_to op Timed Out (interval=0)
> Jan 17 13:55:14 [27555] zs95kj crmd: ( lrm.c:2397 ) error:
> process_lrm_event: Operation zs95kjg110061_res_migrate_to_0: Timed Out
> (node=zs95kjpcs1, call=941, timeout=20000ms)
> Jan 17 13:55:14 [27555] zs95kj crmd: ( lrm.c:196 ) debug:
> update_history_cache: Updating history for 'zs95kjg110061_res' with
> migrate_to op
> 
> 
> Any ideas?
> 
> 
> 
> Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> 
> 
> 
> From: Ken Gaillot <kgail...@redhat.com>
> To: users@clusterlabs.org
> Date: 01/17/2017 11:41 AM
> Subject: Re: [ClusterLabs] Live Guest Migration timeouts for
> VirtualDomain resources
> 
> 
> 
> 
> 
> On 01/17/2017 10:19 AM, Scott Greenlese wrote:
>> Hi..
>>
>> I've been testing live guest migration (LGM) with VirtualDomain
>> resources, which are guests running on Linux KVM / System Z
>> managed by pacemaker.
>>
>> I'm looking for documentation that explains how to configure my
>> VirtualDomain resources such that they will not timeout
>> prematurely when there is a heavy I/O workload running on the guest.
>>
>> If I perform the LGM with an unmanaged guest (resource disabled), it
>> takes anywhere from 2 - 5 minutes to complete the LGM.
>> Example:
>>
>> # Migrate guest, specify a timeout value of 600s
>>
>> [root@zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live
>> --persistent --undefinesource *--timeout 600* --verbose zs95kjg110061
>> qemu+ssh://zs90kppcs1/system
>> Mon Jan 16 16:35:32 EST 2017
>>
>> Migration: [100 %]
>>
>> [root@zs95kj VD]# date
>> Mon Jan 16 16:40:01 EST 2017
>> [root@zs95kj VD]#
>>
>> Start: 16:35:32
>> End: 16:40:01
>> Total: *4 min 29 sec*
>>
>>
>> In comparison, when the guest is managed by pacemaker, and enabled for
>> LGM ... I get this:
>>
>> [root@zs95kj VD]# date;pcs resource show zs95kjg110061_res
>> Mon Jan 16 15:13:33 EST 2017
>> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
>> type=VirtualDomain)
>> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
>> hypervisor=qemu:///system migration_transport=ssh
>> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
>> remote-addr=10.20.110.61
>> Operations: start interval=0s timeout=480
>> (zs95kjg110061_res-start-interval-0s)
>> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
>> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
>> migrate-from interval=0s timeout=1200
>> (zs95kjg110061_res-migrate-from-interval-0s)
>> *migrate-to* interval=0s *timeout=1200*
>> (zs95kjg110061_res-migrate-to-interval-0s)
>>
>> NOTE: I didn't specify any migrate-to value for timeout, so it defaulted
>> to 1200. Is this seconds? If so, that's 20 minutes,
>> ample time to complete a 5 minute migration.
> 
> Not sure where the default of 1200 comes from, but I believe the default
> is milliseconds if no unit is specified. Normally you'd specify
> something like "timeout=1200s".
> 
>> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>> Mon Jan 16 14:27:01 EST 2017
>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
>> [root@zs95kj VD]#
>>
>>
>> [root@zs95kj VD]# date;*pcs resource move zs95kjg110061_res zs95kjpcs1*
>> Mon Jan 16 14:45:39 EST 2017
>> You have new mail in /var/spool/mail/root
>>
>>
>> Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO:
>> zs95kjg110061: *Starting live migration to zs95kjpcs1 (using: virsh
>> --connect=qemu:///system --quiet migrate --live zs95kjg110061
>> qemu+ssh://zs95kjpcs1/system ).*
>> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
>> zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
>> Jan 16 14:45:57 zs90kp lrmd[12798]: [...]

Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

2017-01-17 Thread Scott Greenlese
Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com




From:   Ken Gaillot <kgail...@redhat.com>
To: users@clusterlabs.org
Date:   01/17/2017 11:41 AM
Subject:    Re: [ClusterLabs] Live Guest Migration timeouts for
    VirtualDomain resources



On 01/17/2017 10:19 AM, Scott Greenlese wrote:
> Hi..
>
> I've been testing live guest migration (LGM) with VirtualDomain
> resources, which are guests running on Linux KVM / System Z
> managed by pacemaker.
>
> I'm looking for documentation that explains how to configure my
> VirtualDomain resources such that they will not timeout
> prematurely when there is a heavy I/O workload running on the guest.
>
> If I perform the LGM with an unmanaged guest (resource disabled), it
> takes anywhere from 2 - 5 minutes to complete the LGM.
> Example:
>
> # Migrate guest, specify a timeout value of 600s
>
> [root@zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live
> --persistent --undefinesource *--timeout 600* --verbose zs95kjg110061
> qemu+ssh://zs90kppcs1/system
> Mon Jan 16 16:35:32 EST 2017
>
> Migration: [100 %]
>
> [root@zs95kj VD]# date
> Mon Jan 16 16:40:01 EST 2017
> [root@zs95kj VD]#
>
> Start: 16:35:32
> End: 16:40:01
> Total: *4 min 29 sec*
>
>
> In comparison, when the guest is managed by pacemaker, and enabled for
> LGM ... I get this:
>
> [root@zs95kj VD]# date;pcs resource show zs95kjg110061_res
> Mon Jan 16 15:13:33 EST 2017
> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
> type=VirtualDomain)
> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
> hypervisor=qemu:///system migration_transport=ssh
> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
> remote-addr=10.20.110.61
> Operations: start interval=0s timeout=480
> (zs95kjg110061_res-start-interval-0s)
> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
> migrate-from interval=0s timeout=1200
> (zs95kjg110061_res-migrate-from-interval-0s)
> *migrate-to* interval=0s *timeout=1200*
> (zs95kjg110061_res-migrate-to-interval-0s)
>
> NOTE: I didn't specify any migrate-to value for timeout, so it defaulted
> to 1200. Is this seconds? If so, that's 20 minutes,
> ample time to complete a 5 minute migration.

Not sure where the default of 1200 comes from, but I believe the default
is milliseconds if no unit is specified. Normally you'd specify
something like "timeout=1200s".

> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:27:01 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
> [root@zs95kj VD]#
>
>
> [root@zs95kj VD]# date;*pcs resource move zs95kjg110061_res zs95kjpcs1*
> Mon Jan 16 14:45:39 EST 2017
> You have new mail in /var/spool/mail/root
>
>
> Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO:
> zs95kjg110061: *Starting live migration to zs95kjpcs1 (using: virsh
> --connect=qemu:///system --quiet migrate --live zs95kjg110061
> qemu+ssh://zs95kjpcs1/system ).*
> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
> zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
> zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms
> Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation
> zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978,
> timeout=20000ms)
> Jan 16 14:45:58 zs90kp journal: operation failed: migration job:
> unexpectedly failed
> [root@zs90KP VD]#
>
> So, the migration timed out after 20000ms. Assuming ms is milliseconds,
> that's only 20 seconds. So, it seems that LGM timeout has
> nothing to do with *migrate-to* on the resource definition.

Yes, ms is milliseconds. Pacemaker internally represents all times in
milliseconds, even though in most actual usage, it has 1-second
granularity.

If your specified timeout is 1200ms, I'm not sure why it's using
20000ms. There may be a minimum enforced somewhere.

> Also, what is the expected behavior when the migration times out? I
> watched the VirtualDomain resource state during the migration process...
>
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:45:57 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:02 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:06 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res