Re: [vdsm] Intended Migration Time-out Behaviour and its current state

2013-07-10 Thread Mark Wu

On 07/09/2013 09:20 AM, Vinzenz Feenstra wrote:

Hi,

While addressing a change for the migration progress timeout 
(http://gerrit.ovirt.org/#/c/16602/) I happened to notice that there 
are currently some strange things implemented which don't really make 
sense to me, and it seems that they have lost their original intent 
and functionality over time.


I'd like to explain here what is currently happening in the code 
related to the 3 locations where the migration_timeout value is used.


1. Migration progress timeout (this seems to be working fine)
The progress is checked via Vm._dom.jobInfo() every 
"migration_monitor_interval" seconds (default is 10 seconds).
If dataRemaining is less than the last remembered value, the 
current timestamp is taken and remembered.


If there is no progress (no lower value of dataRemaining) and the 
time elapsed since the last remembered timestamp exceeds 
'migration_timeout', the migration job is aborted and the 
MigrationMonitorThread is stopped.

Note: as part of Changeset 16602 migration_timeout has been replaced 
with migration_progress_timeout in the code.
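
For illustration, here is a minimal sketch of that progress check as I 
read it (this is not the actual MigrationMonitorThread code; 'dom' 
stands for the Vm._dom wrapper, 'stop_event' is a threading.Event I 
introduce here, and the defaults mirror the 10 second interval and the 
150s suggested further below):

    import time

    def monitor_migration_progress(dom, stop_event,
                                   monitor_interval=10,
                                   progress_timeout=150):
        # Sketch only: abort the migration job if dataRemaining stops
        # shrinking for longer than progress_timeout seconds.
        DATA_REMAINING = 5  # position of dataRemaining in jobInfo()'s list
        smallest = None
        last_progress = time.time()
        while not stop_event.wait(monitor_interval):
            data_remaining = dom.jobInfo()[DATA_REMAINING]
            if smallest is None or data_remaining < smallest:
                # Progress: remember the new low-water mark and reset
                # the no-progress clock.
                smallest = data_remaining
                last_progress = time.time()
            elif time.time() - last_progress > progress_timeout:
                # No progress for too long: abort the job and stop
                # monitoring.
                dom.abortJob()
                break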


2. Migration destination timeout

In _waitForIncomingMigrationFinish:

We wait 'migration_timeout' seconds on the 
_incomingMigrationFinished event. If the wait times out, we simply 
try to create a new NotifyingVirDomain instance by looking up 
the VirDomainPtr instance via 
libvirt.lookupByUUIDString, which only fails if the domain is 
not yet defined in libvirt.


Without even checking whether _incomingMigrationFinished is still 
not set, it continues:

In my opinion, it should check whether _incomingMigrationFinished is 
still not set before calling _domDependentInit, because 
libvirt.lookupByUUIDString can return the domain even if the 
migration is not finished.


It then calls _domDependentInit, which initializes the internal device 
information and starts the statistics thread, and it also calls 
_dom.resume().

Note: I deliberately ignored the scenario where 
_incomingMigrationFinished is set first; it is not as problematic as 
the one mentioned above.
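
Roughly, this is how I read the _waitForIncomingMigrationFinish flow, 
with the check suggested above added (the function and parameter names 
here are mine; 'connection' stands for the libvirt connection and 
'finished_event' for _incomingMigrationFinished; this is not the 
actual vdsm code):

    def wait_for_incoming_migration_finish(connection, vm_uuid,
                                           finished_event,
                                           migration_timeout):
        # Sketch only: 'connection' is a libvirt connection,
        # 'finished_event' stands for _incomingMigrationFinished.
        finished_event.wait(migration_timeout)

        # Current behaviour: the domain is looked up whether or not the
        # event fired; lookupByUUIDString() only fails if the domain is
        # not yet defined in libvirt.
        dom = connection.lookupByUUIDString(vm_uuid)

        # Suggested extra check: the lookup can succeed while the
        # migration is still running, so do not continue if the event
        # is still not set.
        if not finished_event.is_set():
            raise RuntimeError('incoming migration still in progress')

        # This is where _domDependentInit() would run (device info +
        # statistics thread) before resuming the domain.
        dom.resume()
        return dom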


3. MigrationDowntimeThread

_migrationTimeout is defined as follows:

    def _migrationTimeout(self):
        timeout = config.getint('vars', 'migration_timeout')
        mem = int(self.conf['memSize'])
        if mem > 2048:
            timeout = timeout * mem / 2048
        return timeout
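
(For example, a guest with 8192 MB of memory gets 4 times, i.e. 
8192 / 2048, the configured 'migration_timeout'.)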

The MigrationDowntimeThread is initialized with the value returned by 
_migrationTimeout divided by 2:

    t = MigrationDowntimeThread(int(self._downtime),
                                self._migrationTimeout() / 2)

The second value is called 'wait' here.

In the MigrationDowntimeThread.run method there is a loop which is 
executed 'migration_downtime_steps' (10) times.
In each step it first waits (wait / migration_downtime_steps) seconds, 
and then increases the downtime value by one tenth (1 / 
migration_downtime_steps) of the passed downtime and passes it to 
_dom.migrateSetMaxDowntime.
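
Put together, I read the run loop as behaving roughly like this sketch 
(not the actual thread code; 'dom' stands for Vm._dom, 'wait' is the 
_migrationTimeout() / 2 value above and 'steps' is 
migration_downtime_steps):

    import time

    def downtime_thread_run(dom, downtime, wait, steps=10):
        # Sketch only: raise the tolerated downtime step by step.
        for i in range(1, steps + 1):
            # Each step first sleeps for its share of the total wait
            # time ...
            time.sleep(wait / steps)
            # ... then raises the max downtime by one more tenth of
            # the requested value and pushes it to libvirt.
            dom.migrateSetMaxDowntime(int(downtime * i / steps), 0)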

Note: There are two things about this that confuse me:
   1. The migrateSetMaxDowntime documentation says: "Sets maximum 
tolerable time for which the domain is allowed to be paused at the end 
of live migration."

      Why does this need to be increased gradually over time?
   2. Why does this MigrationDowntimeThread run only for half of the 
time returned by _migrationTimeout?

Here's my understanding:
As we know, a smaller migration downtime has less impact on the guest's 
applications, but it is harder to achieve, especially for a guest with 
a high memory workload.
So I think the intention of MigrationDowntimeThread.run is to start 
with a small downtime and gradually increase it if the migration can't 
be finished.


As for the 2nd question, I guess it just wants to leave enough time 
for the last try with the biggest downtime.






-

Now, could someone please look over my description and tell me whether 
the behaviour I have described here is the intended behaviour?
To me it seems that with patchset 16602 the 1st described part makes 
sense. The value could probably be even lower than the currently 
suggested 150s.


Parts 2 and 3, however, don't seem to work as intended, at least 
that's how it looks to me.


Please enlighten me if this is the way it was intended. :-)
I have some ideas regarding these things which could be part of the 
continuation of the refactoring work in vm.py, but first I would like 
to understand the original intent. Thanks.




___
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel


[vdsm] Intended Migration Time-out Behaviour and its current state

2013-07-09 Thread Vinzenz Feenstra

Hi,

While addressing a change for the migration progress timeout 
(http://gerrit.ovirt.org/#/c/16602/) I happened to notice that there 
are currently some strange things implemented which don't really make 
sense to me, and it seems that they have lost their original intent 
and functionality over time.


I'd like to explain here what is currently happening in the code related 
to the 3 locations where the migration_timeout value is used.


1. Migration progress timeout (this seems to be working fine)
The progress is checked via Vm._dom.jobInfo() every 
"migration_monitor_interval" seconds (default is 10 seconds).
If dataRemaining is less than the last remembered value, the 
current timestamp is taken and remembered.


If there is no progress (no lower value of dataRemaining) and the 
time elapsed since the last remembered timestamp exceeds 
'migration_timeout', the migration job is aborted and the 
MigrationMonitorThread is stopped.

Note: as part of Changeset 16602 migration_timeout has been replaced 
with migration_progress_timeout in the code.


2. Migration destination timeout

In _waitForIncomingMigrationFinish:

We wait 'migration_timeout' seconds on the 
_incomingMigrationFinished event. If the wait times out, we simply 
try to create a new NotifyingVirDomain instance by looking up 
the VirDomainPtr instance via 
libvirt.lookupByUUIDString, which only fails if the domain is 
not yet defined in libvirt.


Without even checking whether _incomingMigrationFinished is still 
not set, it continues:


It then calls _domDependentInit, which initializes the internal device 
information and starts the statistics thread, and it also calls 
_dom.resume().

Note: I deliberately ignored the scenario where 
_incomingMigrationFinished is set first; it is not as problematic as 
the one mentioned above.


3. MigrationDowntimeThread

_migrationTimeout is defined as follows:

    def _migrationTimeout(self):
        timeout = config.getint('vars', 'migration_timeout')
        mem = int(self.conf['memSize'])
        if mem > 2048:
            timeout = timeout * mem / 2048
        return timeout

The MigrationDowntimeThread is initialized with the value returned by 
_migrationTimeout divided by 2:

    t = MigrationDowntimeThread(int(self._downtime),
                                self._migrationTimeout() / 2)

The second value is called 'wait' here.

In the MigrationDowntimeThread.run method there is a loop which is 
executed 'migration_downtime_steps' (10) times.

In each step it first waits (wait / migration_downtime_steps) seconds, 
and then increases the downtime value by one tenth (1 / 
migration_downtime_steps) of the passed downtime and passes it to 
_dom.migrateSetMaxDowntime.

Note: There are two things about this that confuse me:
   1. The migrateSetMaxDowntime documentation says: "Sets maximum 
tolerable time for which the domain is allowed to be paused at the end 
of live migration."

      Why does this need to be increased gradually over time?
   2. Why does this MigrationDowntimeThread run only for half of the 
time returned by _migrationTimeout?



-

Now, could someone please look over my description and tell me whether 
the behaviour I have described here is the intended behaviour?
To me it seems that with patchset 16602 the 1st described part makes 
sense. The value could probably be even lower than the currently 
suggested 150s.


Parts 2 and 3, however, don't seem to work as intended, at least 
that's how it looks to me.


Please enlighten me if this is the way it was intended. :-)
I have some ideas regarding these things which could be part of the 
continuation of the refactoring work in vm.py, but first I would like 
to understand the original intent. Thanks.


--
Regards,

Vinzenz Feenstra | Senior Software Engineer
RedHat Engineering Virtualization R & D
Phone: +420 532 294 625
IRC: vfeenstr or evilissimo

Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com

___
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel