On 07/09/2013 09:20 AM, Vinzenz Feenstra wrote:
Hi,

While addressing a change for the migration progress timeout (http://gerrit.ovirt.org/#/c/16602/) I happened to notice that there are currently some strange things implemented which don't really make sense to me, and it seems they have lost their original intent and functionality over time.

I'd like to explain here what is currently happening in the code in the three locations where the migration_timeout value is used.

1. Migration progress timeout (this seems to be working fine)
The progress is checked via Vm._dom.jobInfo() every 'migration_monitor_interval' seconds (default: 10 seconds). If dataRemaining is lower than the last remembered value, the current timestamp is taken and remembered.

If there is no progress (no lower value of dataRemaining) and the time since the last remembered timestamp exceeds 'migration_timeout', the migration job is aborted and the MigrationMonitorThread is stopped.

Note: as part of Changeset 16602 migration_timeout has been replaced with migration_progress_timeout in the code.
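
To make sure we're talking about the same logic, here is a minimal sketch of how I read the progress check; the function shape, the jobInfo() field index and the default values are my own simplifications/assumptions, not the literal MigrationMonitorThread code:

    import time

    def monitor_progress(dom, stop_event,
                         monitor_interval=10,     # migration_monitor_interval
                         progress_timeout=150):   # migration_(progress_)timeout
        # Abort the migration job if dataRemaining stops shrinking for too long.
        last_data_remaining = None
        last_progress_time = time.time()
        while True:
            stop_event.wait(monitor_interval)
            if stop_event.isSet():
                break
            job_info = dom.jobInfo()              # virDomainGetJobInfo()
            data_remaining = job_info[5]          # the dataRemaining field
            now = time.time()
            if last_data_remaining is None or data_remaining < last_data_remaining:
                # Progress was made: remember the new low-water mark and when we saw it.
                last_data_remaining = data_remaining
                last_progress_time = now
            elif now - last_progress_time > progress_timeout:
                # No progress for longer than the timeout: abort the job and stop monitoring.
                dom.abortJob()
                break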

2. Migration destination timeout

    In _waitForIncomingMigrationFinish:

We wait 'migration_timeout' seconds on the _incomingMigrationFinished event. If the wait times out, we simply try to create a new NotifyingVirDomain instance by looking up the virDomainPtr instance via libvirt.lookupByUUIDString, which will only fail if the domain is not yet defined in libvirt.

Without even checking whether _incomingMigrationFinished is still unset, it then calls _domDependentInit, which initializes the internal device information and fires up the statistics thread, and it also calls _dom.resume().

In my opinion, it should check whether _incomingMigrationFinished is still unset before calling _domDependentInit, because libvirt.lookupByUUIDString can return the domain even if the migration is not finished.

Note: I deliberately ignored the other scenario, where _incomingMigrationFinished gets set first; it is not as problematic as the one mentioned above.
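
To check that I read the destination-side flow correctly, here is a rough sketch of what happens today plus the extra check I am suggesting; the helper names (lookup_domain, dom_dependent_init), the error handling and the default timeout value are placeholders of mine, not the actual vm.py code:

    def wait_for_incoming_migration_finish(finished_event, lookup_domain,
                                           dom_dependent_init,
                                           migration_timeout=300):  # default assumed
        # Wait up to 'migration_timeout' seconds for the end-of-migration event.
        finished_event.wait(migration_timeout)
        # Current behaviour as I read it: go straight to the lookup, which only
        # fails if the domain is not yet defined in libvirt.
        dom = lookup_domain()            # wraps libvirt's lookupByUUIDString()
        # Suggested extra check: the lookup can succeed while the migration is
        # still in flight, so bail out instead of resuming a half-migrated VM.
        if not finished_event.isSet():
            raise RuntimeError('incoming migration still not finished')
        dom_dependent_init(dom)          # init device info, start the stats thread
        dom.resume()
        return dom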

3. MigrationDowntimeThread

    _migrationTimeout is defined as follows:
    def _migrationTimeout(self):
        timeout = config.getint('vars', 'migration_timeout')
        mem = int(self.conf['memSize'])
        if mem > 2048:
            # scale the timeout up for guests with more than 2048 MB of memory
            timeout = timeout * mem / 2048
        return timeout

The MigrationDowntimeThread is initialized with the value returned by _migrationTimeout, divided by 2:
    t = MigrationDowntimeThread(int(self._downtime), self._migrationTimeout() / 2)
The second argument is called 'wait' there.

In the MigrationDowntimeThread.run method there is a loop which is executed 'migration_downtime_steps' (10) times. Each iteration first waits wait / migration_downtime_steps seconds, then increases the downtime by another tenth (1 / migration_downtime_steps) of the requested downtime and passes it to _dom.migrateSetMaxDowntime.
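
For reference, this is how I read MigrationDowntimeThread, written as a self-contained sketch; the constructor signature and names are simplified, and the real code in vm.py differs in the details:

    import threading

    class MigrationDowntimeThread(threading.Thread):
        def __init__(self, dom, downtime, wait, steps=10):  # steps = migration_downtime_steps
            super(MigrationDowntimeThread, self).__init__()
            self._dom = dom
            self._downtime = downtime   # the downtime requested for this migration
            self._wait = wait           # _migrationTimeout() / 2
            self._steps = steps
            self._stop = threading.Event()

        def run(self):
            for i in range(self._steps):
                # Sleep wait / migration_downtime_steps seconds between steps.
                self._stop.wait(self._wait / self._steps)
                if self._stop.isSet():
                    break
                # Raise the tolerated downtime by another 1/steps of the target.
                self._dom.migrateSetMaxDowntime(
                    self._downtime * (i + 1) / self._steps, 0)

        def cancel(self):
            self._stop.set()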

Note: There are two things that confuse me about this one:
1. The migrateSetMaxDowntime documentation says: "Sets maximum tolerable time for which the domain is allowed to be paused at the end of live migration."
       Why does this need to be increased gradually over time?
2. Why does the MigrationDowntimeThread run for only half of the time returned by _migrationTimeout?
Here's my understanding:
As we know, a smaller migration downtime has less impact on the guest's applications, but it is harder to achieve, especially for a guest with a high memory workload. So I think the intention of MigrationDowntimeThread.run is to start with a small downtime and gradually increase it if the migration can't be finished.

As for the 2nd question, I guess it just wants to leave enough time for the last try with the biggest downtime.
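
To put some numbers on it (the values are made up purely for illustration, not the real defaults): assume migration_timeout = 300, memSize = 8192 and a requested downtime of 50. Then _migrationTimeout() returns 300 * 8192 / 2048 = 1200, wait = 1200 / 2 = 600, each of the 10 steps sleeps 60 seconds and raises the max downtime to 5, 10, ..., 50, and the other 600 seconds of the scaled timeout remain for the migration to converge with that final, largest value.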




-----------------------------

Now, could someone please look over my description and tell me whether the behaviour I have described here is the intended behaviour? To me it seems that, with patchset 16602, the 1st part described above makes sense. The value could probably be even lower than the currently suggested 150s.

Parts 2 and 3, however, don't seem to behave as intended; at least that's how it looks to me.

Please enlighten me if this is the way it was intended. :-)
I have some ideas regarding these things which could be part of the continuation of the refactoring work in vm.py, but first I would like to understand the original intent. Thanks.

