While addressing a change for the migration progress timeout
(http://gerrit.ovirt.org/#/c/16602/) I happened to notice that some of the
things currently implemented don't really make sense to me; it looks as if
they have lost their original intent and functionality over time.
I'd like to explain here what is currently happening in the code at the
3 locations where the migration_timeout value is used.
1. Migration progress timeout (this seems to be working fine)
The progress is checked via Vm._dom.jobInfo() every
"migration_monitor_interval" seconds (default is 10 seconds).
If dataRemaining is lower than the last remembered value, the
current timestamp is taken and remembered.
If there is no progress (no lower value of dataRemaining) and the
time difference from the last remembered timestamp is longer than
'migration_timeout', then the migration job is aborted and the
MigrationMonitorThread stops.
Note: as part of Changeset 16602 migration_timeout has been replaced
with migration_progress_timeout in the code.
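Roughly, this is what I understand part 1 to do (a condensed sketch only, not
the actual vdsm code; the function name and parameters are simplified and the
jobInfo index is just for illustration):

    import time

    def monitor_progress(dom, progress_timeout, monitor_interval=10):
        # sketch of the progress monitoring logic described above
        smallest_data_remaining = None
        last_progress_time = time.time()
        while True:
            time.sleep(monitor_interval)
            job_info = dom.jobInfo()
            data_remaining = job_info[5]  # dataRemaining (virDomainGetJobInfo)
            if (smallest_data_remaining is None
                    or data_remaining < smallest_data_remaining):
                # progress was made: remember the new low-water mark and time
                smallest_data_remaining = data_remaining
                last_progress_time = time.time()
            elif time.time() - last_progress_time > progress_timeout:
                # stalled for longer than the timeout: abort the migration job
                dom.abortJob()
                break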
2. Migration destination timeout
We wait 'migration_timeout' seconds on the
_incomingMigrationFinished event. If the wait times out, we simply
try to create a new NotifyingVirDomain instance by looking up
the VirDomainPtr instance via
libvirt.lookupByUUIDString, which will only fail if the domain is
not yet defined in libvirt.
Without even checking whether _incomingMigrationFinished is still
not set, it continues:
It calls _domDependentInit, which initializes the internal device
information and fires up
the statistics thread, and it also calls _dom.resume().
Note: I intentionally ignored the scenario where _incomingMigrationFinished
is set first; it is not as problematic as the one described here.
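Condensed into a few lines, the destination-side handling described in
part 2 looks to me like this (again just a sketch with simplified and
hypothetical names such as wait_incoming_migration and dom_dependent_init;
it is not the actual vdsm code):

    import threading
    import libvirt

    def dom_dependent_init(dom):
        # placeholder for Vm._domDependentInit(): device info, stats thread
        pass

    def wait_incoming_migration(conn, vm_uuid, incoming_migration_finished,
                                migration_timeout):
        # wait up to migration_timeout seconds for the incoming migration
        incoming_migration_finished.wait(migration_timeout)
        # the event is *not* re-checked here; on timeout we just look the
        # domain up, which only fails if it is not yet defined in libvirt
        dom = conn.lookupByUUIDString(vm_uuid)
        dom_dependent_init(dom)  # device info + statistics thread
        dom.resume()             # resume the guest regardless of the timeout
        return dom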
3. Migration downtime (MigrationDowntimeThread)
_migrationTimeout is defined as follows:
timeout = config.getint('vars', 'migration_timeout')
mem = int(self.conf['memSize'])
if mem > 2048:
timeout = timeout * mem / 2048
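For example (assuming, purely for illustration, a configured
migration_timeout of 300 seconds): a guest with memSize = 8192 would get
300 * 8192 / 2048 = 1200 seconds, while anything up to 2048 MB keeps the
plain 300 seconds.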
The MigrationDowntimeThread is initialized with a value retrieved
from _migrationTimeout and divided by 2:
=> t = MigrationDowntimeThread(int(self._downtime),
self._migrationTimeout() / 2)
The second value is called 'wait' there.
In the MigrationDowntimeThread.run method there is a loop which is
executed 'migration_downtime_steps' (10) times.
In each step it first waits (wait / migration_downtime_steps) seconds,
and then it increases the value by a tenth (1 in
migration_downtime_steps) of the passed downtime and sets it via
migrateSetMaxDowntime (sketched below).
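Put into a few lines of Python, my reading of that loop is roughly the
following (a sketch only; the class and attribute names are simplified and
DOWNTIME_STEPS stands in for the migration_downtime_steps config value):

    import threading

    DOWNTIME_STEPS = 10  # stands in for migration_downtime_steps

    class DowntimeThreadSketch(threading.Thread):
        def __init__(self, dom, downtime, wait):
            threading.Thread.__init__(self)
            self._dom = dom            # libvirt domain
            self._downtime = downtime  # requested maximum downtime (ms)
            self._wait = wait          # here: _migrationTimeout() / 2
            self._stop = threading.Event()

        def run(self):
            for step in range(DOWNTIME_STEPS):
                # wait (wait / migration_downtime_steps) seconds per step
                self._stop.wait(self._wait / DOWNTIME_STEPS)
                if self._stop.is_set():
                    break
                # raise the tolerated downtime by another tenth of the
                # requested value: downtime/10, 2*downtime/10, ..., downtime
                self._dom.migrateSetMaxDowntime(
                    self._downtime * (step + 1) // DOWNTIME_STEPS, 0)

        def stop(self):
            self._stop.set()

With downtime = 500 and the default 10 steps this would set
50, 100, ..., 500 ms, spread over 'wait' seconds in total, i.e. over half
of _migrationTimeout(), which is exactly what puzzles me in the note below.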
Note: There are two things which are confusing me about this one:
1. The migrateSetMaxDowntime documentation says: "Sets maximum tolerable
time for which the domain is allowed to be paused at the end of live
migration." Why does this need to be increased gradually over time?
2. Why is this MigrationDowntimeThread running only for half of the
time returned by _migrationTimeout?
Now, could someone please look over my description and tell me whether the
behaviour I have described here is the intended behaviour?
To me it seems that, with patchset 16602, the 1st described part
makes sense. Probably the value could even be lower than the current default.
Parts 2 and 3, however, don't seem to behave as intended, at least that's
how it looks to me.
Please enlighten me if this is the way it was intended. :-)
I have some ideas regarding these things which could be part of the
continuation of the re-factoring work in vm.py, but first I would like
to understand the original intent. Thanks.
Vinzenz Feenstra | Senior Software Engineer
Red Hat Engineering Virtualization R & D
Phone: +420 532 294 625
IRC: vfeenstr or evilissimo