Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote: > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote: > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote: > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote: > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote: > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > > > > > > "Michael S. Tsirkin" wrote: > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin > > > > > > > wrote: > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé > > > > > > >> wrote: > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin > > > > > > >> > wrote: > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela > > > > > > >> > > wrote: > > > > > > >> > > > If we have a paused guest, it can't unplug the network VF > > > > > > >> > > > device, so > > > > > > >> > > > we wait there forever. Just change the code to give one > > > > > > >> > > > error on that > > > > > > >> > > > case. > > > > > > >> > > > > > > > > > >> > > > Signed-off-by: Juan Quintela > > > > > > >> > > > > > > > > >> > > It's certainly possible but it's management that created > > > > > > >> > > this situation after all - why do we bother to enforce > > > > > > >> > > a policy? It is possible that management will unpause > > > > > > >> > > immediately > > > > > > >> > > afterwards and everything will proceed smoothly. > > > > > > >> > > > > > > > > >> > > Yes migration will not happen until guest is > > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck > > > > > > >> > > because of a bug. > > > > > > >> > > > > > > > >> > That's pretty different behaviour from how migration normally > > > > > > >> > handles > > > > > > >> > a paused guest, which is that it is guaranteed to complete the > > > > > > >> > migration > > > > > > >> > in as short a time as network bandwidth allows. > > > > > > >> > > > > > > > >> > Just ignoring the situation I think will lead to surprise apps > > > > > > >> > / admins, > > > > > > >> > because the person/entity invoking the migration is not likely > > > > > > >> > to have > > > > > > >> > checked wether this particular guest uses net failover or not > > > > > > >> > before > > > > > > >> > invoking - they'll just be expecting a paused migration to run > > > > > > >> > fast and > > > > > > >> > be guaranteed to complete. > > > > > > >> > > > > > > > >> > Regards, > > > > > > >> > Daniel > > > > > > >> > > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation > > > > > > >> too: > > > > > > >> pausing guest after migration started but before device was > > > > > > >> unplugged? > > > > > > >> > > > > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail > > > > > > > pausing guest until migration is cancelled? > > > > > > > > > > > > > > All this seems heavy handed to me ... > > > > > > > > > > > > This is the minimal fix that I can think of. > > > > > > > > > > > > Further solution would be: > > > > > > - Add a new migration parameter: migrate-paused > > > > > > - change libvirt to use the new parameter if it exist > > > > > > - in qemu, when we do start migration (but after we wait for the > > > > > > unplug > > > > > > device) paused the guest before starting migration and resume it > > > > > > after > > > > > > migration finish. > > > > > > > > > > It would also have to handle issuing of paused after migration has > > > > > been started - delay the pause request until the nuplug is complete > > > > > is one answer. > > > > > > > > Hmm my worry would be that pausing is one way to give cpu > > > > resources back to host. It's problematic if guest can delay > > > > that indefinitely. > > > > > > hmm, yes, that is awkward. Perhaps we should just report an explicit > > > error then. > > > > Report an error in response to which command? Do you mean > > fail migration? > > If mgt attempt to pause an existing migration that hasn't finished > the PCI unplug stage, then fail the pause request. OK so I guess I'll apply the rest of the patchset, and let's see a complete patch that makes pause and migrate mutually exclusive? > > > > > In normal cases this won't happen, as unplug will have > > > easily completed before the mgmt app pauses the running migration. > > > In broken/malicious guest cases, this at least ives mgmt a heads up > > > that something is wrong and they might then decide to cancel the > > > migration. > > Regards, > Daniel > -- > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o-https://fstop138.berrange.com :| > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Thu, Dec 03, 2020 at 12:16:24PM +, Daniel P. Berrangé wrote: > On Thu, Dec 03, 2020 at 07:11:17AM -0500, Michael S. Tsirkin wrote: > > On Thu, Dec 03, 2020 at 11:43:41AM +, Dr. David Alan Gilbert wrote: > > > Another way to solve this would be to remove the unplugging from the > > > migration layer and leave it as a problem for the management layer to do > > > the unplug. > > > > Daniel described the problem with modular management tools which expect > > pausing or slowing down guest to cause migration to converge. > > > > Point is, it actually *will* make it converge but only if you > > pause it after unplug. > > > > As it is, these tools fundamentally can not handle failover > > requiring guest cooperation. Moving code between layers won't help. > > Introducing failure modes as this patch does won't help either > > especially since Daniel wrote there are countless tools like this. > > We just break them all but have no resources to fix them, > > this does not help at all. > > > > We can just leave the situation as is. > > > > Or if we do want to be nice to these tools, how about we > > unpause the guest until unplug, then pause it again? > > This actually addresses the problem instead of > > shifting the blame, does it not? > > This is a very bad idea because it changes the execution status of the > guest behind the apps/admins back, and that cannot be assumed to be a > safe thing todo. > > Regards, > Daniel My question is this: management gets an error since guest was paused presumably by someone (admin?) outside its control. How does it know when it is unpaused? > -- > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o-https://fstop138.berrange.com :| > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Thu, Dec 03, 2020 at 07:11:17AM -0500, Michael S. Tsirkin wrote: > On Thu, Dec 03, 2020 at 11:43:41AM +, Dr. David Alan Gilbert wrote: > > Another way to solve this would be to remove the unplugging from the > > migration layer and leave it as a problem for the management layer to do > > the unplug. > > Daniel described the problem with modular management tools which expect > pausing or slowing down guest to cause migration to converge. > > Point is, it actually *will* make it converge but only if you > pause it after unplug. > > As it is, these tools fundamentally can not handle failover > requiring guest cooperation. Moving code between layers won't help. > Introducing failure modes as this patch does won't help either > especially since Daniel wrote there are countless tools like this. > We just break them all but have no resources to fix them, > this does not help at all. > > We can just leave the situation as is. > > Or if we do want to be nice to these tools, how about we > unpause the guest until unplug, then pause it again? > This actually addresses the problem instead of > shifting the blame, does it not? This is a very bad idea because it changes the execution status of the guest behind the apps/admins back, and that cannot be assumed to be a safe thing todo. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Thu, Dec 03, 2020 at 12:06:14PM +, Daniel P. Berrangé wrote: > > > It isn't really about the admin. It is about countless existing mgmt apps > > > that expect migration will always succeed if the VM is paused. The mgmt > > > apps triggering the migraiton is not neccessarily the same as the app > > > which introduced the use of NIC failover in the config. > > > > > > eg in OpenStack Nova provides the VM config, but there are completely > > > separate apps that are built todo automation on top of Nova which > > > this is liable to break. There's no human admin there to diagnose > > > this and re-try with unpause, as all the logic is in the apps. > > > > > > > > > Regards, > > > Daniel > > > > So let's say pause fails. Won't this break these theoretical apps? > > Yes, they are broken in a way that can now actually be seen, as opposed > to just hanging forever in migration. Right but there are countless apps out there, and they are all broken, maybe not too hard to debug if one knows where to look, but still ... And again only true for pause then migrate, migrate then pause is still broken, and if we fix it by failing pause then no one knows whether apps are prepared to handle a failure in the pause command, previously if used correctly it would never fail I think ... -- MST
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Thu, Dec 03, 2020 at 11:43:41AM +, Dr. David Alan Gilbert wrote: > Another way to solve this would be to remove the unplugging from the > migration layer and leave it as a problem for the management layer to do > the unplug. Daniel described the problem with modular management tools which expect pausing or slowing down guest to cause migration to converge. Point is, it actually *will* make it converge but only if you pause it after unplug. As it is, these tools fundamentally can not handle failover requiring guest cooperation. Moving code between layers won't help. Introducing failure modes as this patch does won't help either especially since Daniel wrote there are countless tools like this. We just break them all but have no resources to fix them, this does not help at all. We can just leave the situation as is. Or if we do want to be nice to these tools, how about we unpause the guest until unplug, then pause it again? This actually addresses the problem instead of shifting the blame, does it not? -- MST
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Thu, Dec 03, 2020 at 07:01:11AM -0500, Michael S. Tsirkin wrote: > On Thu, Dec 03, 2020 at 11:45:12AM +, Daniel P. Berrangé wrote: > > On Thu, Dec 03, 2020 at 06:40:11AM -0500, Michael S. Tsirkin wrote: > > > On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote: > > > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote: > > > > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote: > > > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote: > > > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé > > > > > > > wrote: > > > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin > > > > > > > > wrote: > > > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé > > > > > > > > > wrote: > > > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela > > > > > > > > > > wrote: > > > > > > > > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. > > > > > > > > > > > > Tsirkin wrote: > > > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. > > > > > > > > > > > >> Berrangé wrote: > > > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. > > > > > > > > > > > >> > Tsirkin wrote: > > > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan > > > > > > > > > > > >> > > Quintela wrote: > > > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the > > > > > > > > > > > >> > > > network VF device, so > > > > > > > > > > > >> > > > we wait there forever. Just change the code to > > > > > > > > > > > >> > > > give one error on that > > > > > > > > > > > >> > > > case. > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > Signed-off-by: Juan Quintela > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > It's certainly possible but it's management that > > > > > > > > > > > >> > > created > > > > > > > > > > > >> > > this situation after all - why do we bother to > > > > > > > > > > > >> > > enforce > > > > > > > > > > > >> > > a policy? It is possible that management will > > > > > > > > > > > >> > > unpause immediately > > > > > > > > > > > >> > > afterwards and everything will proceed smoothly. > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > Yes migration will not happen until guest is > > > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that > > > > > > > > > > > >> > > is stuck > > > > > > > > > > > >> > > because of a bug. > > > > > > > > > > > >> > > > > > > > > > > > > >> > That's pretty different behaviour from how migration > > > > > > > > > > > >> > normally handles > > > > > > > > > > > >> > a paused guest, which is that it is guaranteed to > > > > > > > > > > > >> > complete the migration > > > > > > > > > > > >> > in as short a time as network bandwidth allows. > > > > > > > > > > > >> > > > > > > > > > > > > >> > Just ignoring the situation I think will lead to > > > > > > > > > > > >> > surprise apps / admins, > > > > > > > > > > > >> > because the person/entity invoking the migration is > > > > > > > > > > > >> > not likely to have > > > > > > > > > > > >> > checked wether this particular guest uses net > > > > > > > > > > > >> > failover or not before > > > > > > > > > > > >> > invoking - they'll just be expecting a paused > > > > > > > > > > > >> > migration to run fast and > > > > > > > > > > > >> > be guaranteed to complete. > > > > > > > > > > > >> > > > > > > > > > > > > >> > Regards, > > > > > > > > > > > >> > Daniel > > > > > > > > > > > >> > > > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse > > > > > > > > > > > >> situation too: > > > > > > > > > > > >> pausing guest after migration started but before > > > > > > > > > > > >> device was > > > > > > > > > > > >> unplugged? > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - > > > > > > > > > > > > fail > > > > > > > > > > > > pausing guest until migration is cancelled? > > > > > > > > > > > > > > > > > > > > > > > > All this seems heavy handed to me ... > > > > > > > > > > > > > > > > > > > > > > This is the minimal fix that I can think of. > > > > > > > > > > > > > > > > > > > > > > Further solution would be: > > > > > > > > > > > - Add a new migration parameter: migrate-paused > > > > > > > > > > > - change libvirt to use the new parameter if it exist > > > > > > > > > > > - in qemu, when we do start migration (but after we wait > > > > > > > > > > > for the unplug > > > > > > > > > > > device) paused the guest before starting migration and > > > > > > > > > > > resume it after > > > > > > > > > > > migration finish. > > > > > > > > > > > > > > > > > > > > It would also have to handle issuing of paused after > > > > > > > > > > migration has > > > > > > > >
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
* Michael S. Tsirkin (m...@redhat.com) wrote: > On Thu, Dec 03, 2020 at 11:43:41AM +, Dr. David Alan Gilbert wrote: > > * Michael S. Tsirkin (m...@redhat.com) wrote: > > > On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote: > > > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote: > > > > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote: > > > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote: > > > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé > > > > > > > wrote: > > > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin > > > > > > > > wrote: > > > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé > > > > > > > > > wrote: > > > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela > > > > > > > > > > wrote: > > > > > > > > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. > > > > > > > > > > > > Tsirkin wrote: > > > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. > > > > > > > > > > > >> Berrangé wrote: > > > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. > > > > > > > > > > > >> > Tsirkin wrote: > > > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan > > > > > > > > > > > >> > > Quintela wrote: > > > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the > > > > > > > > > > > >> > > > network VF device, so > > > > > > > > > > > >> > > > we wait there forever. Just change the code to > > > > > > > > > > > >> > > > give one error on that > > > > > > > > > > > >> > > > case. > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > Signed-off-by: Juan Quintela > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > It's certainly possible but it's management that > > > > > > > > > > > >> > > created > > > > > > > > > > > >> > > this situation after all - why do we bother to > > > > > > > > > > > >> > > enforce > > > > > > > > > > > >> > > a policy? It is possible that management will > > > > > > > > > > > >> > > unpause immediately > > > > > > > > > > > >> > > afterwards and everything will proceed smoothly. > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > Yes migration will not happen until guest is > > > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that > > > > > > > > > > > >> > > is stuck > > > > > > > > > > > >> > > because of a bug. > > > > > > > > > > > >> > > > > > > > > > > > > >> > That's pretty different behaviour from how migration > > > > > > > > > > > >> > normally handles > > > > > > > > > > > >> > a paused guest, which is that it is guaranteed to > > > > > > > > > > > >> > complete the migration > > > > > > > > > > > >> > in as short a time as network bandwidth allows. > > > > > > > > > > > >> > > > > > > > > > > > > >> > Just ignoring the situation I think will lead to > > > > > > > > > > > >> > surprise apps / admins, > > > > > > > > > > > >> > because the person/entity invoking the migration is > > > > > > > > > > > >> > not likely to have > > > > > > > > > > > >> > checked wether this particular guest uses net > > > > > > > > > > > >> > failover or not before > > > > > > > > > > > >> > invoking - they'll just be expecting a paused > > > > > > > > > > > >> > migration to run fast and > > > > > > > > > > > >> > be guaranteed to complete. > > > > > > > > > > > >> > > > > > > > > > > > > >> > Regards, > > > > > > > > > > > >> > Daniel > > > > > > > > > > > >> > > > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse > > > > > > > > > > > >> situation too: > > > > > > > > > > > >> pausing guest after migration started but before > > > > > > > > > > > >> device was > > > > > > > > > > > >> unplugged? > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - > > > > > > > > > > > > fail > > > > > > > > > > > > pausing guest until migration is cancelled? > > > > > > > > > > > > > > > > > > > > > > > > All this seems heavy handed to me ... > > > > > > > > > > > > > > > > > > > > > > This is the minimal fix that I can think of. > > > > > > > > > > > > > > > > > > > > > > Further solution would be: > > > > > > > > > > > - Add a new migration parameter: migrate-paused > > > > > > > > > > > - change libvirt to use the new parameter if it exist > > > > > > > > > > > - in qemu, when we do start migration (but after we wait > > > > > > > > > > > for the unplug > > > > > > > > > > > device) paused the guest before starting migration and > > > > > > > > > > > resume it after > > > > > > > > > > > migration finish. > > > > > > > > > > > > > > > > > > > > It would also have to handle issuing of paused after > > > > > > > > > > migration has > > > > > > > > > > been started - delay the pause
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Thu, Dec 03, 2020 at 11:43:41AM +, Dr. David Alan Gilbert wrote: > * Michael S. Tsirkin (m...@redhat.com) wrote: > > On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote: > > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote: > > > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote: > > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote: > > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote: > > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin > > > > > > > wrote: > > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé > > > > > > > > wrote: > > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > > > > > > > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. > > > > > > > > > > > Tsirkin wrote: > > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. > > > > > > > > > > >> Berrangé wrote: > > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. > > > > > > > > > > >> > Tsirkin wrote: > > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan > > > > > > > > > > >> > > Quintela wrote: > > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the > > > > > > > > > > >> > > > network VF device, so > > > > > > > > > > >> > > > we wait there forever. Just change the code to > > > > > > > > > > >> > > > give one error on that > > > > > > > > > > >> > > > case. > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > Signed-off-by: Juan Quintela > > > > > > > > > > >> > > > > > > > > > > > > >> > > It's certainly possible but it's management that > > > > > > > > > > >> > > created > > > > > > > > > > >> > > this situation after all - why do we bother to > > > > > > > > > > >> > > enforce > > > > > > > > > > >> > > a policy? It is possible that management will > > > > > > > > > > >> > > unpause immediately > > > > > > > > > > >> > > afterwards and everything will proceed smoothly. > > > > > > > > > > >> > > > > > > > > > > > > >> > > Yes migration will not happen until guest is > > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that > > > > > > > > > > >> > > is stuck > > > > > > > > > > >> > > because of a bug. > > > > > > > > > > >> > > > > > > > > > > > >> > That's pretty different behaviour from how migration > > > > > > > > > > >> > normally handles > > > > > > > > > > >> > a paused guest, which is that it is guaranteed to > > > > > > > > > > >> > complete the migration > > > > > > > > > > >> > in as short a time as network bandwidth allows. > > > > > > > > > > >> > > > > > > > > > > > >> > Just ignoring the situation I think will lead to > > > > > > > > > > >> > surprise apps / admins, > > > > > > > > > > >> > because the person/entity invoking the migration is > > > > > > > > > > >> > not likely to have > > > > > > > > > > >> > checked wether this particular guest uses net failover > > > > > > > > > > >> > or not before > > > > > > > > > > >> > invoking - they'll just be expecting a paused > > > > > > > > > > >> > migration to run fast and > > > > > > > > > > >> > be guaranteed to complete. > > > > > > > > > > >> > > > > > > > > > > > >> > Regards, > > > > > > > > > > >> > Daniel > > > > > > > > > > >> > > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse > > > > > > > > > > >> situation too: > > > > > > > > > > >> pausing guest after migration started but before device > > > > > > > > > > >> was > > > > > > > > > > >> unplugged? > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - > > > > > > > > > > > fail > > > > > > > > > > > pausing guest until migration is cancelled? > > > > > > > > > > > > > > > > > > > > > > All this seems heavy handed to me ... > > > > > > > > > > > > > > > > > > > > This is the minimal fix that I can think of. > > > > > > > > > > > > > > > > > > > > Further solution would be: > > > > > > > > > > - Add a new migration parameter: migrate-paused > > > > > > > > > > - change libvirt to use the new parameter if it exist > > > > > > > > > > - in qemu, when we do start migration (but after we wait > > > > > > > > > > for the unplug > > > > > > > > > > device) paused the guest before starting migration and > > > > > > > > > > resume it after > > > > > > > > > > migration finish. > > > > > > > > > > > > > > > > > > It would also have to handle issuing of paused after > > > > > > > > > migration has > > > > > > > > > been started - delay the pause request until the nuplug is > > > > > > > > > complete > > > > > > > > > is one answer. > > > > > > > > > > > > > > > > Hmm my worry would be that pausing is one way to give cpu > > > > > > > > resources back to host. It's problematic if guest can delay > > > > > > > > that indefinitely. > >
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Thu, Dec 03, 2020 at 11:45:12AM +, Daniel P. Berrangé wrote: > On Thu, Dec 03, 2020 at 06:40:11AM -0500, Michael S. Tsirkin wrote: > > On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote: > > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote: > > > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote: > > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote: > > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote: > > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin > > > > > > > wrote: > > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé > > > > > > > > wrote: > > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > > > > > > > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. > > > > > > > > > > > Tsirkin wrote: > > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. > > > > > > > > > > >> Berrangé wrote: > > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. > > > > > > > > > > >> > Tsirkin wrote: > > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan > > > > > > > > > > >> > > Quintela wrote: > > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the > > > > > > > > > > >> > > > network VF device, so > > > > > > > > > > >> > > > we wait there forever. Just change the code to > > > > > > > > > > >> > > > give one error on that > > > > > > > > > > >> > > > case. > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > Signed-off-by: Juan Quintela > > > > > > > > > > >> > > > > > > > > > > > > >> > > It's certainly possible but it's management that > > > > > > > > > > >> > > created > > > > > > > > > > >> > > this situation after all - why do we bother to > > > > > > > > > > >> > > enforce > > > > > > > > > > >> > > a policy? It is possible that management will > > > > > > > > > > >> > > unpause immediately > > > > > > > > > > >> > > afterwards and everything will proceed smoothly. > > > > > > > > > > >> > > > > > > > > > > > > >> > > Yes migration will not happen until guest is > > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that > > > > > > > > > > >> > > is stuck > > > > > > > > > > >> > > because of a bug. > > > > > > > > > > >> > > > > > > > > > > > >> > That's pretty different behaviour from how migration > > > > > > > > > > >> > normally handles > > > > > > > > > > >> > a paused guest, which is that it is guaranteed to > > > > > > > > > > >> > complete the migration > > > > > > > > > > >> > in as short a time as network bandwidth allows. > > > > > > > > > > >> > > > > > > > > > > > >> > Just ignoring the situation I think will lead to > > > > > > > > > > >> > surprise apps / admins, > > > > > > > > > > >> > because the person/entity invoking the migration is > > > > > > > > > > >> > not likely to have > > > > > > > > > > >> > checked wether this particular guest uses net failover > > > > > > > > > > >> > or not before > > > > > > > > > > >> > invoking - they'll just be expecting a paused > > > > > > > > > > >> > migration to run fast and > > > > > > > > > > >> > be guaranteed to complete. > > > > > > > > > > >> > > > > > > > > > > > >> > Regards, > > > > > > > > > > >> > Daniel > > > > > > > > > > >> > > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse > > > > > > > > > > >> situation too: > > > > > > > > > > >> pausing guest after migration started but before device > > > > > > > > > > >> was > > > > > > > > > > >> unplugged? > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - > > > > > > > > > > > fail > > > > > > > > > > > pausing guest until migration is cancelled? > > > > > > > > > > > > > > > > > > > > > > All this seems heavy handed to me ... > > > > > > > > > > > > > > > > > > > > This is the minimal fix that I can think of. > > > > > > > > > > > > > > > > > > > > Further solution would be: > > > > > > > > > > - Add a new migration parameter: migrate-paused > > > > > > > > > > - change libvirt to use the new parameter if it exist > > > > > > > > > > - in qemu, when we do start migration (but after we wait > > > > > > > > > > for the unplug > > > > > > > > > > device) paused the guest before starting migration and > > > > > > > > > > resume it after > > > > > > > > > > migration finish. > > > > > > > > > > > > > > > > > > It would also have to handle issuing of paused after > > > > > > > > > migration has > > > > > > > > > been started - delay the pause request until the nuplug is > > > > > > > > > complete > > > > > > > > > is one answer. > > > > > > > > > > > > > > > > Hmm my worry would be that pausing is one way to give cpu > > > > > > > > resources back to host. It's problematic if guest can delay > > > > > > > > that
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Thu, Dec 03, 2020 at 06:40:11AM -0500, Michael S. Tsirkin wrote: > On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote: > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote: > > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote: > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote: > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote: > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote: > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé > > > > > > > wrote: > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > > > > > > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. > > > > > > > > > > Tsirkin wrote: > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. > > > > > > > > > >> Berrangé wrote: > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. > > > > > > > > > >> > Tsirkin wrote: > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan > > > > > > > > > >> > > Quintela wrote: > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the > > > > > > > > > >> > > > network VF device, so > > > > > > > > > >> > > > we wait there forever. Just change the code to give > > > > > > > > > >> > > > one error on that > > > > > > > > > >> > > > case. > > > > > > > > > >> > > > > > > > > > > > > >> > > > Signed-off-by: Juan Quintela > > > > > > > > > >> > > > > > > > > > > > >> > > It's certainly possible but it's management that > > > > > > > > > >> > > created > > > > > > > > > >> > > this situation after all - why do we bother to enforce > > > > > > > > > >> > > a policy? It is possible that management will unpause > > > > > > > > > >> > > immediately > > > > > > > > > >> > > afterwards and everything will proceed smoothly. > > > > > > > > > >> > > > > > > > > > > > >> > > Yes migration will not happen until guest is > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is > > > > > > > > > >> > > stuck > > > > > > > > > >> > > because of a bug. > > > > > > > > > >> > > > > > > > > > > >> > That's pretty different behaviour from how migration > > > > > > > > > >> > normally handles > > > > > > > > > >> > a paused guest, which is that it is guaranteed to > > > > > > > > > >> > complete the migration > > > > > > > > > >> > in as short a time as network bandwidth allows. > > > > > > > > > >> > > > > > > > > > > >> > Just ignoring the situation I think will lead to > > > > > > > > > >> > surprise apps / admins, > > > > > > > > > >> > because the person/entity invoking the migration is not > > > > > > > > > >> > likely to have > > > > > > > > > >> > checked wether this particular guest uses net failover > > > > > > > > > >> > or not before > > > > > > > > > >> > invoking - they'll just be expecting a paused migration > > > > > > > > > >> > to run fast and > > > > > > > > > >> > be guaranteed to complete. > > > > > > > > > >> > > > > > > > > > > >> > Regards, > > > > > > > > > >> > Daniel > > > > > > > > > >> > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse > > > > > > > > > >> situation too: > > > > > > > > > >> pausing guest after migration started but before device was > > > > > > > > > >> unplugged? > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail > > > > > > > > > > pausing guest until migration is cancelled? > > > > > > > > > > > > > > > > > > > > All this seems heavy handed to me ... > > > > > > > > > > > > > > > > > > This is the minimal fix that I can think of. > > > > > > > > > > > > > > > > > > Further solution would be: > > > > > > > > > - Add a new migration parameter: migrate-paused > > > > > > > > > - change libvirt to use the new parameter if it exist > > > > > > > > > - in qemu, when we do start migration (but after we wait for > > > > > > > > > the unplug > > > > > > > > > device) paused the guest before starting migration and > > > > > > > > > resume it after > > > > > > > > > migration finish. > > > > > > > > > > > > > > > > It would also have to handle issuing of paused after migration > > > > > > > > has > > > > > > > > been started - delay the pause request until the nuplug is > > > > > > > > complete > > > > > > > > is one answer. > > > > > > > > > > > > > > Hmm my worry would be that pausing is one way to give cpu > > > > > > > resources back to host. It's problematic if guest can delay > > > > > > > that indefinitely. > > > > > > > > > > > > hmm, yes, that is awkward. Perhaps we should just report an > > > > > > explicit > > > > > > error then. > > > > > > > > > > Report an error in response to which command? Do you mean > > > > > fail migration? > > > > > > > > If mgt attempt to pause an existing migration that hasn't finished
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
* Michael S. Tsirkin (m...@redhat.com) wrote: > On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote: > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote: > > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote: > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote: > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote: > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote: > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé > > > > > > > wrote: > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > > > > > > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. > > > > > > > > > > Tsirkin wrote: > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. > > > > > > > > > >> Berrangé wrote: > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. > > > > > > > > > >> > Tsirkin wrote: > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan > > > > > > > > > >> > > Quintela wrote: > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the > > > > > > > > > >> > > > network VF device, so > > > > > > > > > >> > > > we wait there forever. Just change the code to give > > > > > > > > > >> > > > one error on that > > > > > > > > > >> > > > case. > > > > > > > > > >> > > > > > > > > > > > > >> > > > Signed-off-by: Juan Quintela > > > > > > > > > >> > > > > > > > > > > > >> > > It's certainly possible but it's management that > > > > > > > > > >> > > created > > > > > > > > > >> > > this situation after all - why do we bother to enforce > > > > > > > > > >> > > a policy? It is possible that management will unpause > > > > > > > > > >> > > immediately > > > > > > > > > >> > > afterwards and everything will proceed smoothly. > > > > > > > > > >> > > > > > > > > > > > >> > > Yes migration will not happen until guest is > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is > > > > > > > > > >> > > stuck > > > > > > > > > >> > > because of a bug. > > > > > > > > > >> > > > > > > > > > > >> > That's pretty different behaviour from how migration > > > > > > > > > >> > normally handles > > > > > > > > > >> > a paused guest, which is that it is guaranteed to > > > > > > > > > >> > complete the migration > > > > > > > > > >> > in as short a time as network bandwidth allows. > > > > > > > > > >> > > > > > > > > > > >> > Just ignoring the situation I think will lead to > > > > > > > > > >> > surprise apps / admins, > > > > > > > > > >> > because the person/entity invoking the migration is not > > > > > > > > > >> > likely to have > > > > > > > > > >> > checked wether this particular guest uses net failover > > > > > > > > > >> > or not before > > > > > > > > > >> > invoking - they'll just be expecting a paused migration > > > > > > > > > >> > to run fast and > > > > > > > > > >> > be guaranteed to complete. > > > > > > > > > >> > > > > > > > > > > >> > Regards, > > > > > > > > > >> > Daniel > > > > > > > > > >> > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse > > > > > > > > > >> situation too: > > > > > > > > > >> pausing guest after migration started but before device was > > > > > > > > > >> unplugged? > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail > > > > > > > > > > pausing guest until migration is cancelled? > > > > > > > > > > > > > > > > > > > > All this seems heavy handed to me ... > > > > > > > > > > > > > > > > > > This is the minimal fix that I can think of. > > > > > > > > > > > > > > > > > > Further solution would be: > > > > > > > > > - Add a new migration parameter: migrate-paused > > > > > > > > > - change libvirt to use the new parameter if it exist > > > > > > > > > - in qemu, when we do start migration (but after we wait for > > > > > > > > > the unplug > > > > > > > > > device) paused the guest before starting migration and > > > > > > > > > resume it after > > > > > > > > > migration finish. > > > > > > > > > > > > > > > > It would also have to handle issuing of paused after migration > > > > > > > > has > > > > > > > > been started - delay the pause request until the nuplug is > > > > > > > > complete > > > > > > > > is one answer. > > > > > > > > > > > > > > Hmm my worry would be that pausing is one way to give cpu > > > > > > > resources back to host. It's problematic if guest can delay > > > > > > > that indefinitely. > > > > > > > > > > > > hmm, yes, that is awkward. Perhaps we should just report an > > > > > > explicit > > > > > > error then. > > > > > > > > > > Report an error in response to which command? Do you mean > > > > > fail migration? > > > > > > > > If mgt attempt to pause an existing migration that hasn't finished > > > > the PCI unplug
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote: > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote: > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote: > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote: > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote: > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote: > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote: > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > > > > > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin > > > > > > > > > wrote: > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. > > > > > > > > >> Berrangé wrote: > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. > > > > > > > > >> > Tsirkin wrote: > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela > > > > > > > > >> > > wrote: > > > > > > > > >> > > > If we have a paused guest, it can't unplug the network > > > > > > > > >> > > > VF device, so > > > > > > > > >> > > > we wait there forever. Just change the code to give > > > > > > > > >> > > > one error on that > > > > > > > > >> > > > case. > > > > > > > > >> > > > > > > > > > > > >> > > > Signed-off-by: Juan Quintela > > > > > > > > >> > > > > > > > > > > >> > > It's certainly possible but it's management that created > > > > > > > > >> > > this situation after all - why do we bother to enforce > > > > > > > > >> > > a policy? It is possible that management will unpause > > > > > > > > >> > > immediately > > > > > > > > >> > > afterwards and everything will proceed smoothly. > > > > > > > > >> > > > > > > > > > > >> > > Yes migration will not happen until guest is > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is > > > > > > > > >> > > stuck > > > > > > > > >> > > because of a bug. > > > > > > > > >> > > > > > > > > > >> > That's pretty different behaviour from how migration > > > > > > > > >> > normally handles > > > > > > > > >> > a paused guest, which is that it is guaranteed to complete > > > > > > > > >> > the migration > > > > > > > > >> > in as short a time as network bandwidth allows. > > > > > > > > >> > > > > > > > > > >> > Just ignoring the situation I think will lead to surprise > > > > > > > > >> > apps / admins, > > > > > > > > >> > because the person/entity invoking the migration is not > > > > > > > > >> > likely to have > > > > > > > > >> > checked wether this particular guest uses net failover or > > > > > > > > >> > not before > > > > > > > > >> > invoking - they'll just be expecting a paused migration to > > > > > > > > >> > run fast and > > > > > > > > >> > be guaranteed to complete. > > > > > > > > >> > > > > > > > > > >> > Regards, > > > > > > > > >> > Daniel > > > > > > > > >> > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse > > > > > > > > >> situation too: > > > > > > > > >> pausing guest after migration started but before device was > > > > > > > > >> unplugged? > > > > > > > > >> > > > > > > > > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail > > > > > > > > > pausing guest until migration is cancelled? > > > > > > > > > > > > > > > > > > All this seems heavy handed to me ... > > > > > > > > > > > > > > > > This is the minimal fix that I can think of. > > > > > > > > > > > > > > > > Further solution would be: > > > > > > > > - Add a new migration parameter: migrate-paused > > > > > > > > - change libvirt to use the new parameter if it exist > > > > > > > > - in qemu, when we do start migration (but after we wait for > > > > > > > > the unplug > > > > > > > > device) paused the guest before starting migration and resume > > > > > > > > it after > > > > > > > > migration finish. > > > > > > > > > > > > > > It would also have to handle issuing of paused after migration has > > > > > > > been started - delay the pause request until the nuplug is > > > > > > > complete > > > > > > > is one answer. > > > > > > > > > > > > Hmm my worry would be that pausing is one way to give cpu > > > > > > resources back to host. It's problematic if guest can delay > > > > > > that indefinitely. > > > > > > > > > > hmm, yes, that is awkward. Perhaps we should just report an explicit > > > > > error then. > > > > > > > > Report an error in response to which command? Do you mean > > > > fail migration? > > > > > > If mgt attempt to pause an existing migration that hasn't finished > > > the PCI unplug stage, then fail the pause request. > > > > Pause guest not migration ... > > Might be tricky ... > > > > Let me ask this, why not just produce a warning > > that migration wan't finish until guest actually runs? > > User will then know and unpause the guest when he wants > > migration to succeed ...
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote: > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote: > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote: > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote: > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote: > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote: > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > > > > > > > "Michael S. Tsirkin" wrote: > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin > > > > > > > > wrote: > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. > > > > > > > >> Berrangé wrote: > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin > > > > > > > >> > wrote: > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela > > > > > > > >> > > wrote: > > > > > > > >> > > > If we have a paused guest, it can't unplug the network > > > > > > > >> > > > VF device, so > > > > > > > >> > > > we wait there forever. Just change the code to give one > > > > > > > >> > > > error on that > > > > > > > >> > > > case. > > > > > > > >> > > > > > > > > > > >> > > > Signed-off-by: Juan Quintela > > > > > > > >> > > > > > > > > > >> > > It's certainly possible but it's management that created > > > > > > > >> > > this situation after all - why do we bother to enforce > > > > > > > >> > > a policy? It is possible that management will unpause > > > > > > > >> > > immediately > > > > > > > >> > > afterwards and everything will proceed smoothly. > > > > > > > >> > > > > > > > > > >> > > Yes migration will not happen until guest is > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck > > > > > > > >> > > because of a bug. > > > > > > > >> > > > > > > > > >> > That's pretty different behaviour from how migration > > > > > > > >> > normally handles > > > > > > > >> > a paused guest, which is that it is guaranteed to complete > > > > > > > >> > the migration > > > > > > > >> > in as short a time as network bandwidth allows. > > > > > > > >> > > > > > > > > >> > Just ignoring the situation I think will lead to surprise > > > > > > > >> > apps / admins, > > > > > > > >> > because the person/entity invoking the migration is not > > > > > > > >> > likely to have > > > > > > > >> > checked wether this particular guest uses net failover or > > > > > > > >> > not before > > > > > > > >> > invoking - they'll just be expecting a paused migration to > > > > > > > >> > run fast and > > > > > > > >> > be guaranteed to complete. > > > > > > > >> > > > > > > > > >> > Regards, > > > > > > > >> > Daniel > > > > > > > >> > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse > > > > > > > >> situation too: > > > > > > > >> pausing guest after migration started but before device was > > > > > > > >> unplugged? > > > > > > > >> > > > > > > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail > > > > > > > > pausing guest until migration is cancelled? > > > > > > > > > > > > > > > > All this seems heavy handed to me ... > > > > > > > > > > > > > > This is the minimal fix that I can think of. > > > > > > > > > > > > > > Further solution would be: > > > > > > > - Add a new migration parameter: migrate-paused > > > > > > > - change libvirt to use the new parameter if it exist > > > > > > > - in qemu, when we do start migration (but after we wait for the > > > > > > > unplug > > > > > > > device) paused the guest before starting migration and resume > > > > > > > it after > > > > > > > migration finish. > > > > > > > > > > > > It would also have to handle issuing of paused after migration has > > > > > > been started - delay the pause request until the nuplug is complete > > > > > > is one answer. > > > > > > > > > > Hmm my worry would be that pausing is one way to give cpu > > > > > resources back to host. It's problematic if guest can delay > > > > > that indefinitely. > > > > > > > > hmm, yes, that is awkward. Perhaps we should just report an explicit > > > > error then. > > > > > > Report an error in response to which command? Do you mean > > > fail migration? > > > > If mgt attempt to pause an existing migration that hasn't finished > > the PCI unplug stage, then fail the pause request. > > Pause guest not migration ... > Might be tricky ... > > Let me ask this, why not just produce a warning > that migration wan't finish until guest actually runs? > User will then know and unpause the guest when he wants > migration to succeed ... A warning is going to be essentally invisible if the pause command succeeeds. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote: > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote: > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote: > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote: > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote: > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > > > > > > "Michael S. Tsirkin" wrote: > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin > > > > > > > wrote: > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé > > > > > > >> wrote: > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin > > > > > > >> > wrote: > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela > > > > > > >> > > wrote: > > > > > > >> > > > If we have a paused guest, it can't unplug the network VF > > > > > > >> > > > device, so > > > > > > >> > > > we wait there forever. Just change the code to give one > > > > > > >> > > > error on that > > > > > > >> > > > case. > > > > > > >> > > > > > > > > > >> > > > Signed-off-by: Juan Quintela > > > > > > >> > > > > > > > > >> > > It's certainly possible but it's management that created > > > > > > >> > > this situation after all - why do we bother to enforce > > > > > > >> > > a policy? It is possible that management will unpause > > > > > > >> > > immediately > > > > > > >> > > afterwards and everything will proceed smoothly. > > > > > > >> > > > > > > > > >> > > Yes migration will not happen until guest is > > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck > > > > > > >> > > because of a bug. > > > > > > >> > > > > > > > >> > That's pretty different behaviour from how migration normally > > > > > > >> > handles > > > > > > >> > a paused guest, which is that it is guaranteed to complete the > > > > > > >> > migration > > > > > > >> > in as short a time as network bandwidth allows. > > > > > > >> > > > > > > > >> > Just ignoring the situation I think will lead to surprise apps > > > > > > >> > / admins, > > > > > > >> > because the person/entity invoking the migration is not likely > > > > > > >> > to have > > > > > > >> > checked wether this particular guest uses net failover or not > > > > > > >> > before > > > > > > >> > invoking - they'll just be expecting a paused migration to run > > > > > > >> > fast and > > > > > > >> > be guaranteed to complete. > > > > > > >> > > > > > > > >> > Regards, > > > > > > >> > Daniel > > > > > > >> > > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation > > > > > > >> too: > > > > > > >> pausing guest after migration started but before device was > > > > > > >> unplugged? > > > > > > >> > > > > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail > > > > > > > pausing guest until migration is cancelled? > > > > > > > > > > > > > > All this seems heavy handed to me ... > > > > > > > > > > > > This is the minimal fix that I can think of. > > > > > > > > > > > > Further solution would be: > > > > > > - Add a new migration parameter: migrate-paused > > > > > > - change libvirt to use the new parameter if it exist > > > > > > - in qemu, when we do start migration (but after we wait for the > > > > > > unplug > > > > > > device) paused the guest before starting migration and resume it > > > > > > after > > > > > > migration finish. > > > > > > > > > > It would also have to handle issuing of paused after migration has > > > > > been started - delay the pause request until the nuplug is complete > > > > > is one answer. > > > > > > > > Hmm my worry would be that pausing is one way to give cpu > > > > resources back to host. It's problematic if guest can delay > > > > that indefinitely. > > > > > > hmm, yes, that is awkward. Perhaps we should just report an explicit > > > error then. > > > > Report an error in response to which command? Do you mean > > fail migration? > > If mgt attempt to pause an existing migration that hasn't finished > the PCI unplug stage, then fail the pause request. Pause guest not migration ... Might be tricky ... Let me ask this, why not just produce a warning that migration wan't finish until guest actually runs? User will then know and unpause the guest when he wants migration to succeed ... For example, user can restrict the amount of cpu using cgroups to a level where almost no progress is made. QEMU can't detect this > > > > > In normal cases this won't happen, as unplug will have > > > easily completed before the mgmt app pauses the running migration. > > > In broken/malicious guest cases, this at least ives mgmt a heads up > > > that something is wrong and they might then decide to cancel the > > > migration. > > Regards, > Daniel > -- > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| > |:
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote: > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote: > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote: > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote: > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > > > > > "Michael S. Tsirkin" wrote: > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote: > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé > > > > > >> wrote: > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin > > > > > >> > wrote: > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote: > > > > > >> > > > If we have a paused guest, it can't unplug the network VF > > > > > >> > > > device, so > > > > > >> > > > we wait there forever. Just change the code to give one > > > > > >> > > > error on that > > > > > >> > > > case. > > > > > >> > > > > > > > > >> > > > Signed-off-by: Juan Quintela > > > > > >> > > > > > > > >> > > It's certainly possible but it's management that created > > > > > >> > > this situation after all - why do we bother to enforce > > > > > >> > > a policy? It is possible that management will unpause > > > > > >> > > immediately > > > > > >> > > afterwards and everything will proceed smoothly. > > > > > >> > > > > > > > >> > > Yes migration will not happen until guest is > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck > > > > > >> > > because of a bug. > > > > > >> > > > > > > >> > That's pretty different behaviour from how migration normally > > > > > >> > handles > > > > > >> > a paused guest, which is that it is guaranteed to complete the > > > > > >> > migration > > > > > >> > in as short a time as network bandwidth allows. > > > > > >> > > > > > > >> > Just ignoring the situation I think will lead to surprise apps / > > > > > >> > admins, > > > > > >> > because the person/entity invoking the migration is not likely > > > > > >> > to have > > > > > >> > checked wether this particular guest uses net failover or not > > > > > >> > before > > > > > >> > invoking - they'll just be expecting a paused migration to run > > > > > >> > fast and > > > > > >> > be guaranteed to complete. > > > > > >> > > > > > > >> > Regards, > > > > > >> > Daniel > > > > > >> > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation > > > > > >> too: > > > > > >> pausing guest after migration started but before device was > > > > > >> unplugged? > > > > > >> > > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail > > > > > > pausing guest until migration is cancelled? > > > > > > > > > > > > All this seems heavy handed to me ... > > > > > > > > > > This is the minimal fix that I can think of. > > > > > > > > > > Further solution would be: > > > > > - Add a new migration parameter: migrate-paused > > > > > - change libvirt to use the new parameter if it exist > > > > > - in qemu, when we do start migration (but after we wait for the > > > > > unplug > > > > > device) paused the guest before starting migration and resume it > > > > > after > > > > > migration finish. > > > > > > > > It would also have to handle issuing of paused after migration has > > > > been started - delay the pause request until the nuplug is complete > > > > is one answer. > > > > > > Hmm my worry would be that pausing is one way to give cpu > > > resources back to host. It's problematic if guest can delay > > > that indefinitely. > > > > hmm, yes, that is awkward. Perhaps we should just report an explicit > > error then. > > Report an error in response to which command? Do you mean > fail migration? If mgt attempt to pause an existing migration that hasn't finished the PCI unplug stage, then fail the pause request. > > > In normal cases this won't happen, as unplug will have > > easily completed before the mgmt app pauses the running migration. > > In broken/malicious guest cases, this at least ives mgmt a heads up > > that something is wrong and they might then decide to cancel the > > migration. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote: > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote: > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > > > "Michael S. Tsirkin" wrote: > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote: > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote: > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote: > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote: > > > >> > > > If we have a paused guest, it can't unplug the network VF > > > >> > > > device, so > > > >> > > > we wait there forever. Just change the code to give one error > > > >> > > > on that > > > >> > > > case. > > > >> > > > > > > >> > > > Signed-off-by: Juan Quintela > > > >> > > > > > >> > > It's certainly possible but it's management that created > > > >> > > this situation after all - why do we bother to enforce > > > >> > > a policy? It is possible that management will unpause immediately > > > >> > > afterwards and everything will proceed smoothly. > > > >> > > > > > >> > > Yes migration will not happen until guest is > > > >> > > unpaused but the same it true of e.g. a guest that is stuck > > > >> > > because of a bug. > > > >> > > > > >> > That's pretty different behaviour from how migration normally handles > > > >> > a paused guest, which is that it is guaranteed to complete the > > > >> > migration > > > >> > in as short a time as network bandwidth allows. > > > >> > > > > >> > Just ignoring the situation I think will lead to surprise apps / > > > >> > admins, > > > >> > because the person/entity invoking the migration is not likely to > > > >> > have > > > >> > checked wether this particular guest uses net failover or not before > > > >> > invoking - they'll just be expecting a paused migration to run fast > > > >> > and > > > >> > be guaranteed to complete. > > > >> > > > > >> > Regards, > > > >> > Daniel > > > >> > > > >> Okay I guess. But then shouldn't we handle the reverse situation too: > > > >> pausing guest after migration started but before device was > > > >> unplugged? > > > >> > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail > > > > pausing guest until migration is cancelled? > > > > > > > > All this seems heavy handed to me ... > > > > > > This is the minimal fix that I can think of. > > > > > > Further solution would be: > > > - Add a new migration parameter: migrate-paused > > > - change libvirt to use the new parameter if it exist > > > - in qemu, when we do start migration (but after we wait for the unplug > > > device) paused the guest before starting migration and resume it after > > > migration finish. > > > > It would also have to handle issuing of paused after migration has > > been started - delay the pause request until the nuplug is complete > > is one answer. > > Hmm my worry would be that pausing is one way to give cpu > resources back to host. It's problematic if guest can delay > that indefinitely. hmm, yes, that is awkward. Perhaps we should just report an explicit error then. In normal cases this won't happen, as unplug will have easily completed before the mgmt app pauses the running migration. In broken/malicious guest cases, this at least ives mgmt a heads up that something is wrong and they might then decide to cancel the migration. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
"Michael S. Tsirkin" wrote: > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote: >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote: >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote: >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote: >> > > > If we have a paused guest, it can't unplug the network VF device, so >> > > > we wait there forever. Just change the code to give one error on that >> > > > case. >> > > > >> > > > Signed-off-by: Juan Quintela >> > > >> > > It's certainly possible but it's management that created >> > > this situation after all - why do we bother to enforce >> > > a policy? It is possible that management will unpause immediately >> > > afterwards and everything will proceed smoothly. >> > > >> > > Yes migration will not happen until guest is >> > > unpaused but the same it true of e.g. a guest that is stuck >> > > because of a bug. >> > >> > That's pretty different behaviour from how migration normally handles >> > a paused guest, which is that it is guaranteed to complete the migration >> > in as short a time as network bandwidth allows. >> > >> > Just ignoring the situation I think will lead to surprise apps / admins, >> > because the person/entity invoking the migration is not likely to have >> > checked wether this particular guest uses net failover or not before >> > invoking - they'll just be expecting a paused migration to run fast and >> > be guaranteed to complete. >> > >> > Regards, >> > Daniel >> >> Okay I guess. But then shouldn't we handle the reverse situation too: >> pausing guest after migration started but before device was >> unplugged? >> > > Thinking of which, I have no idea how we'd handle it - fail > pausing guest until migration is cancelled? > > All this seems heavy handed to me ... This is the minimal fix that I can think of. Further solution would be: - Add a new migration parameter: migrate-paused - change libvirt to use the new parameter if it exist - in qemu, when we do start migration (but after we wait for the unplug device) paused the guest before starting migration and resume it after migration finish. My understanding talking with Laine is that they use this functionality by default for migration, saving, etc, i.e. it is not an isolated case. Later, Juan.
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote: > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote: > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote: > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > > > > "Michael S. Tsirkin" wrote: > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote: > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé > > > > >> wrote: > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote: > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote: > > > > >> > > > If we have a paused guest, it can't unplug the network VF > > > > >> > > > device, so > > > > >> > > > we wait there forever. Just change the code to give one error > > > > >> > > > on that > > > > >> > > > case. > > > > >> > > > > > > > >> > > > Signed-off-by: Juan Quintela > > > > >> > > > > > > >> > > It's certainly possible but it's management that created > > > > >> > > this situation after all - why do we bother to enforce > > > > >> > > a policy? It is possible that management will unpause immediately > > > > >> > > afterwards and everything will proceed smoothly. > > > > >> > > > > > > >> > > Yes migration will not happen until guest is > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck > > > > >> > > because of a bug. > > > > >> > > > > > >> > That's pretty different behaviour from how migration normally > > > > >> > handles > > > > >> > a paused guest, which is that it is guaranteed to complete the > > > > >> > migration > > > > >> > in as short a time as network bandwidth allows. > > > > >> > > > > > >> > Just ignoring the situation I think will lead to surprise apps / > > > > >> > admins, > > > > >> > because the person/entity invoking the migration is not likely to > > > > >> > have > > > > >> > checked wether this particular guest uses net failover or not > > > > >> > before > > > > >> > invoking - they'll just be expecting a paused migration to run > > > > >> > fast and > > > > >> > be guaranteed to complete. > > > > >> > > > > > >> > Regards, > > > > >> > Daniel > > > > >> > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too: > > > > >> pausing guest after migration started but before device was > > > > >> unplugged? > > > > >> > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail > > > > > pausing guest until migration is cancelled? > > > > > > > > > > All this seems heavy handed to me ... > > > > > > > > This is the minimal fix that I can think of. > > > > > > > > Further solution would be: > > > > - Add a new migration parameter: migrate-paused > > > > - change libvirt to use the new parameter if it exist > > > > - in qemu, when we do start migration (but after we wait for the unplug > > > > device) paused the guest before starting migration and resume it after > > > > migration finish. > > > > > > It would also have to handle issuing of paused after migration has > > > been started - delay the pause request until the nuplug is complete > > > is one answer. > > > > Hmm my worry would be that pausing is one way to give cpu > > resources back to host. It's problematic if guest can delay > > that indefinitely. > > hmm, yes, that is awkward. Perhaps we should just report an explicit > error then. Report an error in response to which command? Do you mean fail migration? > In normal cases this won't happen, as unplug will have > easily completed before the mgmt app pauses the running migration. > In broken/malicious guest cases, this at least ives mgmt a heads up > that something is wrong and they might then decide to cancel the > migration. > > > Regards, > Daniel > -- > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o-https://fstop138.berrange.com :| > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote: > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote: > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote: > > > If we have a paused guest, it can't unplug the network VF device, so > > > we wait there forever. Just change the code to give one error on that > > > case. > > > > > > Signed-off-by: Juan Quintela > > > > It's certainly possible but it's management that created > > this situation after all - why do we bother to enforce > > a policy? It is possible that management will unpause immediately > > afterwards and everything will proceed smoothly. > > > > Yes migration will not happen until guest is > > unpaused but the same it true of e.g. a guest that is stuck > > because of a bug. > > That's pretty different behaviour from how migration normally handles > a paused guest, which is that it is guaranteed to complete the migration > in as short a time as network bandwidth allows. > > Just ignoring the situation I think will lead to surprise apps / admins, > because the person/entity invoking the migration is not likely to have > checked wether this particular guest uses net failover or not before > invoking - they'll just be expecting a paused migration to run fast and > be guaranteed to complete. > > Regards, > Daniel Okay I guess. But then shouldn't we handle the reverse situation too: pausing guest after migration started but before device was unplugged? > -- > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o-https://fstop138.berrange.com :| > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote: > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > > "Michael S. Tsirkin" wrote: > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote: > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote: > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote: > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote: > > >> > > > If we have a paused guest, it can't unplug the network VF device, > > >> > > > so > > >> > > > we wait there forever. Just change the code to give one error on > > >> > > > that > > >> > > > case. > > >> > > > > > >> > > > Signed-off-by: Juan Quintela > > >> > > > > >> > > It's certainly possible but it's management that created > > >> > > this situation after all - why do we bother to enforce > > >> > > a policy? It is possible that management will unpause immediately > > >> > > afterwards and everything will proceed smoothly. > > >> > > > > >> > > Yes migration will not happen until guest is > > >> > > unpaused but the same it true of e.g. a guest that is stuck > > >> > > because of a bug. > > >> > > > >> > That's pretty different behaviour from how migration normally handles > > >> > a paused guest, which is that it is guaranteed to complete the > > >> > migration > > >> > in as short a time as network bandwidth allows. > > >> > > > >> > Just ignoring the situation I think will lead to surprise apps / > > >> > admins, > > >> > because the person/entity invoking the migration is not likely to have > > >> > checked wether this particular guest uses net failover or not before > > >> > invoking - they'll just be expecting a paused migration to run fast and > > >> > be guaranteed to complete. > > >> > > > >> > Regards, > > >> > Daniel > > >> > > >> Okay I guess. But then shouldn't we handle the reverse situation too: > > >> pausing guest after migration started but before device was > > >> unplugged? > > >> > > > > > > Thinking of which, I have no idea how we'd handle it - fail > > > pausing guest until migration is cancelled? > > > > > > All this seems heavy handed to me ... > > > > This is the minimal fix that I can think of. > > > > Further solution would be: > > - Add a new migration parameter: migrate-paused > > - change libvirt to use the new parameter if it exist > > - in qemu, when we do start migration (but after we wait for the unplug > > device) paused the guest before starting migration and resume it after > > migration finish. > > It would also have to handle issuing of paused after migration has > been started - delay the pause request until the nuplug is complete > is one answer. Hmm my worry would be that pausing is one way to give cpu resources back to host. It's problematic if guest can delay that indefinitely. > > My understanding talking with Laine is that they use this functionality > > by default for migration, saving, etc, i.e. it is not an isolated case. > > Yep, save-to-disk always runs in the paused state, and migration is > also paused by default unless the mgmt app explicitl asks for live > migration. > > Regards, > Daniel > -- > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o-https://fstop138.berrange.com :| > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Wed, Dec 02, 2020 at 05:31:50AM -0500, Michael S. Tsirkin wrote: > On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote: > > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote: > > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote: > > > > If we have a paused guest, it can't unplug the network VF device, so > > > > we wait there forever. Just change the code to give one error on that > > > > case. > > > > > > > > Signed-off-by: Juan Quintela > > > > > > It's certainly possible but it's management that created > > > this situation after all - why do we bother to enforce > > > a policy? It is possible that management will unpause immediately > > > afterwards and everything will proceed smoothly. > > > > > > Yes migration will not happen until guest is > > > unpaused but the same it true of e.g. a guest that is stuck > > > because of a bug. > > > > That's pretty different behaviour from how migration normally handles > > a paused guest, which is that it is guaranteed to complete the migration > > in as short a time as network bandwidth allows. > > > > Just ignoring the situation I think will lead to surprise apps / admins, > > because the person/entity invoking the migration is not likely to have > > checked wether this particular guest uses net failover or not before > > invoking - they'll just be expecting a paused migration to run fast and > > be guaranteed to complete. > > Okay I guess. But then shouldn't we handle the reverse situation too: > pausing guest after migration started but before device was > unplugged? Yeah we likely want todo something there. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote: > On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote: > > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote: > > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote: > > > > If we have a paused guest, it can't unplug the network VF device, so > > > > we wait there forever. Just change the code to give one error on that > > > > case. > > > > > > > > Signed-off-by: Juan Quintela > > > > > > It's certainly possible but it's management that created > > > this situation after all - why do we bother to enforce > > > a policy? It is possible that management will unpause immediately > > > afterwards and everything will proceed smoothly. > > > > > > Yes migration will not happen until guest is > > > unpaused but the same it true of e.g. a guest that is stuck > > > because of a bug. > > > > That's pretty different behaviour from how migration normally handles > > a paused guest, which is that it is guaranteed to complete the migration > > in as short a time as network bandwidth allows. > > > > Just ignoring the situation I think will lead to surprise apps / admins, > > because the person/entity invoking the migration is not likely to have > > checked wether this particular guest uses net failover or not before > > invoking - they'll just be expecting a paused migration to run fast and > > be guaranteed to complete. > > > > Regards, > > Daniel > > Okay I guess. But then shouldn't we handle the reverse situation too: > pausing guest after migration started but before device was > unplugged? > Thinking of which, I have no idea how we'd handle it - fail pausing guest until migration is cancelled? All this seems heavy handed to me ... > > -- > > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange > > :| > > |: https://libvirt.org -o-https://fstop138.berrange.com > > :| > > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange > > :|
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote: > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote: > > If we have a paused guest, it can't unplug the network VF device, so > > we wait there forever. Just change the code to give one error on that > > case. > > > > Signed-off-by: Juan Quintela > > It's certainly possible but it's management that created > this situation after all - why do we bother to enforce > a policy? It is possible that management will unpause immediately > afterwards and everything will proceed smoothly. > > Yes migration will not happen until guest is > unpaused but the same it true of e.g. a guest that is stuck > because of a bug. That's pretty different behaviour from how migration normally handles a paused guest, which is that it is guaranteed to complete the migration in as short a time as network bandwidth allows. Just ignoring the situation I think will lead to surprise apps / admins, because the person/entity invoking the migration is not likely to have checked wether this particular guest uses net failover or not before invoking - they'll just be expecting a paused migration to run fast and be guaranteed to complete. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote: > "Michael S. Tsirkin" wrote: > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote: > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote: > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote: > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote: > >> > > > If we have a paused guest, it can't unplug the network VF device, so > >> > > > we wait there forever. Just change the code to give one error on > >> > > > that > >> > > > case. > >> > > > > >> > > > Signed-off-by: Juan Quintela > >> > > > >> > > It's certainly possible but it's management that created > >> > > this situation after all - why do we bother to enforce > >> > > a policy? It is possible that management will unpause immediately > >> > > afterwards and everything will proceed smoothly. > >> > > > >> > > Yes migration will not happen until guest is > >> > > unpaused but the same it true of e.g. a guest that is stuck > >> > > because of a bug. > >> > > >> > That's pretty different behaviour from how migration normally handles > >> > a paused guest, which is that it is guaranteed to complete the migration > >> > in as short a time as network bandwidth allows. > >> > > >> > Just ignoring the situation I think will lead to surprise apps / admins, > >> > because the person/entity invoking the migration is not likely to have > >> > checked wether this particular guest uses net failover or not before > >> > invoking - they'll just be expecting a paused migration to run fast and > >> > be guaranteed to complete. > >> > > >> > Regards, > >> > Daniel > >> > >> Okay I guess. But then shouldn't we handle the reverse situation too: > >> pausing guest after migration started but before device was > >> unplugged? > >> > > > > Thinking of which, I have no idea how we'd handle it - fail > > pausing guest until migration is cancelled? > > > > All this seems heavy handed to me ... > > This is the minimal fix that I can think of. > > Further solution would be: > - Add a new migration parameter: migrate-paused > - change libvirt to use the new parameter if it exist > - in qemu, when we do start migration (but after we wait for the unplug > device) paused the guest before starting migration and resume it after > migration finish. It would also have to handle issuing of paused after migration has been started - delay the pause request until the nuplug is complete is one answer. > My understanding talking with Laine is that they use this functionality > by default for migration, saving, etc, i.e. it is not an isolated case. Yep, save-to-disk always runs in the paused state, and migration is also paused by default unless the mgmt app explicitl asks for live migration. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote: > If we have a paused guest, it can't unplug the network VF device, so > we wait there forever. Just change the code to give one error on that > case. > > Signed-off-by: Juan Quintela It's certainly possible but it's management that created this situation after all - why do we bother to enforce a policy? It is possible that management will unpause immediately afterwards and everything will proceed smoothly. Yes migration will not happen until guest is unpaused but the same it true of e.g. a guest that is stuck because of a bug. > --- > migration/migration.c | 13 + > 1 file changed, 13 insertions(+) > > diff --git a/migration/migration.c b/migration/migration.c > index 87a9b59f83..d44fc880f9 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -3548,6 +3548,18 @@ static void *migration_thread(void *opaque) > qemu_savevm_state_setup(s->to_dst_file); > > if (qemu_savevm_state_guest_unplug_pending()) { > +/* if guest is paused, it can send back the wait event */ > +if (!runstate_is_running()) { > +Error *local_err = NULL; > + > +error_setg(_err, "migration: network failover and " > + "guest is paused'"); > +migrate_set_error(s, local_err); > +error_free(local_err); > +migrate_set_state(>state, MIGRATION_STATUS_SETUP, > + MIGRATION_STATUS_FAILED); > +goto end; > +} > migrate_set_state(>state, MIGRATION_STATUS_SETUP, >MIGRATION_STATUS_WAIT_UNPLUG); > > @@ -3597,6 +3609,7 @@ static void *migration_thread(void *opaque) > } > > trace_migration_thread_after_loop(); > +end: > migration_iteration_finish(s); > object_unref(OBJECT(s)); > rcu_unregister_thread(); > -- > 2.26.2
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
* Juan Quintela (quint...@redhat.com) wrote: > If we have a paused guest, it can't unplug the network VF device, so > we wait there forever. Just change the code to give one error on that > case. > > Signed-off-by: Juan Quintela Reviewed-by: Dr. David Alan Gilbert > --- > migration/migration.c | 13 + > 1 file changed, 13 insertions(+) > > diff --git a/migration/migration.c b/migration/migration.c > index 87a9b59f83..d44fc880f9 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -3548,6 +3548,18 @@ static void *migration_thread(void *opaque) > qemu_savevm_state_setup(s->to_dst_file); > > if (qemu_savevm_state_guest_unplug_pending()) { > +/* if guest is paused, it can send back the wait event */ > +if (!runstate_is_running()) { > +Error *local_err = NULL; > + > +error_setg(_err, "migration: network failover and " > + "guest is paused'"); > +migrate_set_error(s, local_err); > +error_free(local_err); > +migrate_set_state(>state, MIGRATION_STATUS_SETUP, > + MIGRATION_STATUS_FAILED); > +goto end; > +} > migrate_set_state(>state, MIGRATION_STATUS_SETUP, >MIGRATION_STATUS_WAIT_UNPLUG); > > @@ -3597,6 +3609,7 @@ static void *migration_thread(void *opaque) > } > > trace_migration_thread_after_loop(); > +end: > migration_iteration_finish(s); > object_unref(OBJECT(s)); > rcu_unregister_thread(); > -- > 2.26.2 > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK