Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-08 Thread Michael S. Tsirkin
On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote:
> On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote:
> > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote:
> > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin 
> > > > > > > wrote:
> > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé 
> > > > > > >> wrote:
> > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin 
> > > > > > >> > wrote:
> > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela 
> > > > > > >> > > wrote:
> > > > > > >> > > > If we have a paused guest, it can't unplug the network VF 
> > > > > > >> > > > device, so
> > > > > > >> > > > we wait there forever.  Just change the code to give one 
> > > > > > >> > > > error on that
> > > > > > >> > > > case.
> > > > > > >> > > > 
> > > > > > >> > > > Signed-off-by: Juan Quintela 
> > > > > > >> > > 
> > > > > > >> > > It's certainly possible but it's management that created
> > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > >> > > a policy? It is possible that management will unpause 
> > > > > > >> > > immediately
> > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > >> > > 
> > > > > > >> > > Yes migration will not happen until guest is
> > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > >> > > because of a bug.
> > > > > > >> > 
> > > > > > >> > That's pretty different behaviour from how migration normally 
> > > > > > >> > handles
> > > > > > >> > a paused guest, which is that it is guaranteed to complete the 
> > > > > > >> > migration
> > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > >> > 
> > > > > > >> > Just ignoring the situation I think will lead to surprise apps 
> > > > > > >> > / admins,
> > > > > > >> > because the person/entity invoking the migration is not likely 
> > > > > > >> > to have
> > > > > > >> > checked wether this particular guest uses net failover or not 
> > > > > > >> > before
> > > > > > >> > invoking - they'll just be expecting a paused migration to run 
> > > > > > >> > fast and
> > > > > > >> > be guaranteed to complete.
> > > > > > >> > 
> > > > > > >> > Regards,
> > > > > > >> > Daniel
> > > > > > >> 
> > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation 
> > > > > > >> too:
> > > > > > >> pausing guest after migration started but before device was
> > > > > > >> unplugged?
> > > > > > >> 
> > > > > > >
> > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > pausing guest until migration is cancelled?
> > > > > > >
> > > > > > > All this seems heavy handed to me ...
> > > > > > 
> > > > > > This is the minimal fix that I can think of.
> > > > > > 
> > > > > > Further solution would be:
> > > > > > - Add a new migration parameter: migrate-paused
> > > > > > - change libvirt to use the new parameter if it exist
> > > > > > - in qemu, when we do start migration (but after we wait for the 
> > > > > > unplug
> > > > > >   device) paused the guest before starting migration and resume it 
> > > > > > after
> > > > > >   migration finish.
> > > > > 
> > > > > It would also have to handle issuing of paused after migration has
> > > > > been started - delay the pause request until the nuplug is complete
> > > > > is one answer.
> > > > 
> > > > Hmm my worry would be that pausing is one way to give cpu
> > > > resources back to host. It's problematic if guest can delay
> > > > that indefinitely.
> > > 
> > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > error then.
> > 
> > Report an error in response to which command? Do you mean
> > fail migration?
> 
> If mgt attempt to pause an existing migration that hasn't finished
> the PCI unplug stage, then fail the pause request.

OK so I guess I'll apply the rest of the patchset, and let's see
a complete patch that makes pause and migrate mutually exclusive?


> > 
> > > In normal cases this won't happen, as unplug will have
> > > easily completed before the mgmt app pauses the running migration.
> > > In broken/malicious guest cases, this at least ives mgmt a heads up
> > > that something is wrong and they might then decide to cancel the
> > > migration.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-08 Thread Michael S. Tsirkin
On Thu, Dec 03, 2020 at 12:16:24PM +, Daniel P. Berrangé wrote:
> On Thu, Dec 03, 2020 at 07:11:17AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Dec 03, 2020 at 11:43:41AM +, Dr. David Alan Gilbert wrote:
> > > Another way to solve this would be to remove the unplugging from the
> > > migration layer and leave it as a problem for the management layer to do
> > > the unplug.
> > 
> > Daniel described the problem with modular management tools which expect
> > pausing or slowing down guest to cause migration to converge.
> > 
> > Point is, it actually *will* make it converge but only if you
> > pause it after unplug.
> > 
> > As it is, these tools fundamentally can not handle failover
> > requiring guest cooperation. Moving code between layers won't help.
> > Introducing failure modes as this patch does won't help either
> > especially since Daniel wrote there are countless tools like this.
> > We just break them all but have no resources to fix them,
> > this does not help at all.
> > 
> > We can just leave the situation as is.
> > 
> > Or if we do want to be nice to these tools, how about we
> > unpause the guest until unplug, then pause it again?
> > This actually addresses the problem instead of
> > shifting the blame, does it not?
> 
> This is a very bad idea because it changes the execution status of the
> guest behind the apps/admins back, and that cannot be assumed to be a
> safe thing todo.
> 
> Regards,
> Daniel

My question is this: management gets an error since guest was paused
presumably by someone (admin?) outside its control.
How does it know when it is unpaused?


> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-03 Thread Daniel P . Berrangé
On Thu, Dec 03, 2020 at 07:11:17AM -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 03, 2020 at 11:43:41AM +, Dr. David Alan Gilbert wrote:
> > Another way to solve this would be to remove the unplugging from the
> > migration layer and leave it as a problem for the management layer to do
> > the unplug.
> 
> Daniel described the problem with modular management tools which expect
> pausing or slowing down guest to cause migration to converge.
> 
> Point is, it actually *will* make it converge but only if you
> pause it after unplug.
> 
> As it is, these tools fundamentally can not handle failover
> requiring guest cooperation. Moving code between layers won't help.
> Introducing failure modes as this patch does won't help either
> especially since Daniel wrote there are countless tools like this.
> We just break them all but have no resources to fix them,
> this does not help at all.
> 
> We can just leave the situation as is.
> 
> Or if we do want to be nice to these tools, how about we
> unpause the guest until unplug, then pause it again?
> This actually addresses the problem instead of
> shifting the blame, does it not?

This is a very bad idea because it changes the execution status of the
guest behind the apps/admins back, and that cannot be assumed to be a
safe thing todo.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-03 Thread Michael S. Tsirkin
On Thu, Dec 03, 2020 at 12:06:14PM +, Daniel P. Berrangé wrote:
> > > It isn't really about the admin.  It is about countless existing mgmt apps
> > > that expect migration will always succeed if the VM is paused. The mgmt
> > > apps triggering the migraiton is not neccessarily the same as the app
> > > which introduced the use of NIC failover in the config.
> > > 
> > > eg in OpenStack Nova provides the VM config, but there are completely
> > > separate apps that are built todo automation on top of Nova which 
> > > this is liable to break. There's no human admin there to diagnose
> > > this and re-try with unpause, as all the logic is in the apps.
> > > 
> > > 
> > > Regards,
> > > Daniel
> > 
> > So let's say pause fails. Won't this break these theoretical apps?
> 
> Yes, they are broken in a way that can now actually be seen, as opposed
> to just hanging forever in migration.

Right but there are countless apps out there, and they are all broken,
maybe not too hard to debug if one knows where to look, but still ...

And again only true for pause then migrate, migrate then pause is still
broken, and if we fix it by failing pause then no one knows whether apps
are prepared to handle a failure in the pause command, previously if
used correctly it would never fail I think ...

-- 
MST




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-03 Thread Michael S. Tsirkin
On Thu, Dec 03, 2020 at 11:43:41AM +, Dr. David Alan Gilbert wrote:
> Another way to solve this would be to remove the unplugging from the
> migration layer and leave it as a problem for the management layer to do
> the unplug.

Daniel described the problem with modular management tools which expect
pausing or slowing down guest to cause migration to converge.

Point is, it actually *will* make it converge but only if you
pause it after unplug.

As it is, these tools fundamentally can not handle failover
requiring guest cooperation. Moving code between layers won't help.
Introducing failure modes as this patch does won't help either
especially since Daniel wrote there are countless tools like this.
We just break them all but have no resources to fix them,
this does not help at all.

We can just leave the situation as is.

Or if we do want to be nice to these tools, how about we
unpause the guest until unplug, then pause it again?
This actually addresses the problem instead of
shifting the blame, does it not?

-- 
MST




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-03 Thread Daniel P . Berrangé
On Thu, Dec 03, 2020 at 07:01:11AM -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 03, 2020 at 11:45:12AM +, Daniel P. Berrangé wrote:
> > On Thu, Dec 03, 2020 at 06:40:11AM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote:
> > > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote:
> > > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé 
> > > > > > > wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin 
> > > > > > > > wrote:
> > > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé 
> > > > > > > > > wrote:
> > > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela 
> > > > > > > > > > wrote:
> > > > > > > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. 
> > > > > > > > > > > > Tsirkin wrote:
> > > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. 
> > > > > > > > > > > >> Berrangé wrote:
> > > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. 
> > > > > > > > > > > >> > Tsirkin wrote:
> > > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan 
> > > > > > > > > > > >> > > Quintela wrote:
> > > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the 
> > > > > > > > > > > >> > > > network VF device, so
> > > > > > > > > > > >> > > > we wait there forever.  Just change the code to 
> > > > > > > > > > > >> > > > give one error on that
> > > > > > > > > > > >> > > > case.
> > > > > > > > > > > >> > > > 
> > > > > > > > > > > >> > > > Signed-off-by: Juan Quintela 
> > > > > > > > > > > >> > > > 
> > > > > > > > > > > >> > > 
> > > > > > > > > > > >> > > It's certainly possible but it's management that 
> > > > > > > > > > > >> > > created
> > > > > > > > > > > >> > > this situation after all - why do we bother to 
> > > > > > > > > > > >> > > enforce
> > > > > > > > > > > >> > > a policy? It is possible that management will 
> > > > > > > > > > > >> > > unpause immediately
> > > > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > > > >> > > 
> > > > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that 
> > > > > > > > > > > >> > > is stuck
> > > > > > > > > > > >> > > because of a bug.
> > > > > > > > > > > >> > 
> > > > > > > > > > > >> > That's pretty different behaviour from how migration 
> > > > > > > > > > > >> > normally handles
> > > > > > > > > > > >> > a paused guest, which is that it is guaranteed to 
> > > > > > > > > > > >> > complete the migration
> > > > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > > > >> > 
> > > > > > > > > > > >> > Just ignoring the situation I think will lead to 
> > > > > > > > > > > >> > surprise apps / admins,
> > > > > > > > > > > >> > because the person/entity invoking the migration is 
> > > > > > > > > > > >> > not likely to have
> > > > > > > > > > > >> > checked wether this particular guest uses net 
> > > > > > > > > > > >> > failover or not before
> > > > > > > > > > > >> > invoking - they'll just be expecting a paused 
> > > > > > > > > > > >> > migration to run fast and
> > > > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > > > >> > 
> > > > > > > > > > > >> > Regards,
> > > > > > > > > > > >> > Daniel
> > > > > > > > > > > >> 
> > > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse 
> > > > > > > > > > > >> situation too:
> > > > > > > > > > > >> pausing guest after migration started but before 
> > > > > > > > > > > >> device was
> > > > > > > > > > > >> unplugged?
> > > > > > > > > > > >> 
> > > > > > > > > > > >
> > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - 
> > > > > > > > > > > > fail
> > > > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > > > >
> > > > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > > > 
> > > > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > > > 
> > > > > > > > > > > Further solution would be:
> > > > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > > > > - in qemu, when we do start migration (but after we wait 
> > > > > > > > > > > for the unplug
> > > > > > > > > > >   device) paused the guest before starting migration and 
> > > > > > > > > > > resume it after
> > > > > > > > > > >   migration finish.
> > > > > > > > > > 
> > > > > > > > > > It would also have to handle issuing of paused after 
> > > > > > > > > > migration has
> > > > > > > > 

Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-03 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Thu, Dec 03, 2020 at 11:43:41AM +, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote:
> > > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote:
> > > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé 
> > > > > > > wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin 
> > > > > > > > wrote:
> > > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé 
> > > > > > > > > wrote:
> > > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela 
> > > > > > > > > > wrote:
> > > > > > > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. 
> > > > > > > > > > > > Tsirkin wrote:
> > > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. 
> > > > > > > > > > > >> Berrangé wrote:
> > > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. 
> > > > > > > > > > > >> > Tsirkin wrote:
> > > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan 
> > > > > > > > > > > >> > > Quintela wrote:
> > > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the 
> > > > > > > > > > > >> > > > network VF device, so
> > > > > > > > > > > >> > > > we wait there forever.  Just change the code to 
> > > > > > > > > > > >> > > > give one error on that
> > > > > > > > > > > >> > > > case.
> > > > > > > > > > > >> > > > 
> > > > > > > > > > > >> > > > Signed-off-by: Juan Quintela 
> > > > > > > > > > > >> > > > 
> > > > > > > > > > > >> > > 
> > > > > > > > > > > >> > > It's certainly possible but it's management that 
> > > > > > > > > > > >> > > created
> > > > > > > > > > > >> > > this situation after all - why do we bother to 
> > > > > > > > > > > >> > > enforce
> > > > > > > > > > > >> > > a policy? It is possible that management will 
> > > > > > > > > > > >> > > unpause immediately
> > > > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > > > >> > > 
> > > > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that 
> > > > > > > > > > > >> > > is stuck
> > > > > > > > > > > >> > > because of a bug.
> > > > > > > > > > > >> > 
> > > > > > > > > > > >> > That's pretty different behaviour from how migration 
> > > > > > > > > > > >> > normally handles
> > > > > > > > > > > >> > a paused guest, which is that it is guaranteed to 
> > > > > > > > > > > >> > complete the migration
> > > > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > > > >> > 
> > > > > > > > > > > >> > Just ignoring the situation I think will lead to 
> > > > > > > > > > > >> > surprise apps / admins,
> > > > > > > > > > > >> > because the person/entity invoking the migration is 
> > > > > > > > > > > >> > not likely to have
> > > > > > > > > > > >> > checked wether this particular guest uses net 
> > > > > > > > > > > >> > failover or not before
> > > > > > > > > > > >> > invoking - they'll just be expecting a paused 
> > > > > > > > > > > >> > migration to run fast and
> > > > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > > > >> > 
> > > > > > > > > > > >> > Regards,
> > > > > > > > > > > >> > Daniel
> > > > > > > > > > > >> 
> > > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse 
> > > > > > > > > > > >> situation too:
> > > > > > > > > > > >> pausing guest after migration started but before 
> > > > > > > > > > > >> device was
> > > > > > > > > > > >> unplugged?
> > > > > > > > > > > >> 
> > > > > > > > > > > >
> > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - 
> > > > > > > > > > > > fail
> > > > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > > > >
> > > > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > > > 
> > > > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > > > 
> > > > > > > > > > > Further solution would be:
> > > > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > > > > - in qemu, when we do start migration (but after we wait 
> > > > > > > > > > > for the unplug
> > > > > > > > > > >   device) paused the guest before starting migration and 
> > > > > > > > > > > resume it after
> > > > > > > > > > >   migration finish.
> > > > > > > > > > 
> > > > > > > > > > It would also have to handle issuing of paused after 
> > > > > > > > > > migration has
> > > > > > > > > > been started - delay the pause 

Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-03 Thread Michael S. Tsirkin
On Thu, Dec 03, 2020 at 11:43:41AM +, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (m...@redhat.com) wrote:
> > On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote:
> > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote:
> > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote:
> > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin 
> > > > > > > wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé 
> > > > > > > > wrote:
> > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. 
> > > > > > > > > > > Tsirkin wrote:
> > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. 
> > > > > > > > > > >> Berrangé wrote:
> > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. 
> > > > > > > > > > >> > Tsirkin wrote:
> > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan 
> > > > > > > > > > >> > > Quintela wrote:
> > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the 
> > > > > > > > > > >> > > > network VF device, so
> > > > > > > > > > >> > > > we wait there forever.  Just change the code to 
> > > > > > > > > > >> > > > give one error on that
> > > > > > > > > > >> > > > case.
> > > > > > > > > > >> > > > 
> > > > > > > > > > >> > > > Signed-off-by: Juan Quintela 
> > > > > > > > > > >> > > 
> > > > > > > > > > >> > > It's certainly possible but it's management that 
> > > > > > > > > > >> > > created
> > > > > > > > > > >> > > this situation after all - why do we bother to 
> > > > > > > > > > >> > > enforce
> > > > > > > > > > >> > > a policy? It is possible that management will 
> > > > > > > > > > >> > > unpause immediately
> > > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > > >> > > 
> > > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that 
> > > > > > > > > > >> > > is stuck
> > > > > > > > > > >> > > because of a bug.
> > > > > > > > > > >> > 
> > > > > > > > > > >> > That's pretty different behaviour from how migration 
> > > > > > > > > > >> > normally handles
> > > > > > > > > > >> > a paused guest, which is that it is guaranteed to 
> > > > > > > > > > >> > complete the migration
> > > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > > >> > 
> > > > > > > > > > >> > Just ignoring the situation I think will lead to 
> > > > > > > > > > >> > surprise apps / admins,
> > > > > > > > > > >> > because the person/entity invoking the migration is 
> > > > > > > > > > >> > not likely to have
> > > > > > > > > > >> > checked wether this particular guest uses net failover 
> > > > > > > > > > >> > or not before
> > > > > > > > > > >> > invoking - they'll just be expecting a paused 
> > > > > > > > > > >> > migration to run fast and
> > > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > > >> > 
> > > > > > > > > > >> > Regards,
> > > > > > > > > > >> > Daniel
> > > > > > > > > > >> 
> > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse 
> > > > > > > > > > >> situation too:
> > > > > > > > > > >> pausing guest after migration started but before device 
> > > > > > > > > > >> was
> > > > > > > > > > >> unplugged?
> > > > > > > > > > >> 
> > > > > > > > > > >
> > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - 
> > > > > > > > > > > fail
> > > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > > >
> > > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > > 
> > > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > > 
> > > > > > > > > > Further solution would be:
> > > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > > > - in qemu, when we do start migration (but after we wait 
> > > > > > > > > > for the unplug
> > > > > > > > > >   device) paused the guest before starting migration and 
> > > > > > > > > > resume it after
> > > > > > > > > >   migration finish.
> > > > > > > > > 
> > > > > > > > > It would also have to handle issuing of paused after 
> > > > > > > > > migration has
> > > > > > > > > been started - delay the pause request until the nuplug is 
> > > > > > > > > complete
> > > > > > > > > is one answer.
> > > > > > > > 
> > > > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > > > resources back to host. It's problematic if guest can delay
> > > > > > > > that indefinitely.
> > 

Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-03 Thread Michael S. Tsirkin
On Thu, Dec 03, 2020 at 11:45:12AM +, Daniel P. Berrangé wrote:
> On Thu, Dec 03, 2020 at 06:40:11AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote:
> > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote:
> > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote:
> > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin 
> > > > > > > wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé 
> > > > > > > > wrote:
> > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. 
> > > > > > > > > > > Tsirkin wrote:
> > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. 
> > > > > > > > > > >> Berrangé wrote:
> > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. 
> > > > > > > > > > >> > Tsirkin wrote:
> > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan 
> > > > > > > > > > >> > > Quintela wrote:
> > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the 
> > > > > > > > > > >> > > > network VF device, so
> > > > > > > > > > >> > > > we wait there forever.  Just change the code to 
> > > > > > > > > > >> > > > give one error on that
> > > > > > > > > > >> > > > case.
> > > > > > > > > > >> > > > 
> > > > > > > > > > >> > > > Signed-off-by: Juan Quintela 
> > > > > > > > > > >> > > 
> > > > > > > > > > >> > > It's certainly possible but it's management that 
> > > > > > > > > > >> > > created
> > > > > > > > > > >> > > this situation after all - why do we bother to 
> > > > > > > > > > >> > > enforce
> > > > > > > > > > >> > > a policy? It is possible that management will 
> > > > > > > > > > >> > > unpause immediately
> > > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > > >> > > 
> > > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that 
> > > > > > > > > > >> > > is stuck
> > > > > > > > > > >> > > because of a bug.
> > > > > > > > > > >> > 
> > > > > > > > > > >> > That's pretty different behaviour from how migration 
> > > > > > > > > > >> > normally handles
> > > > > > > > > > >> > a paused guest, which is that it is guaranteed to 
> > > > > > > > > > >> > complete the migration
> > > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > > >> > 
> > > > > > > > > > >> > Just ignoring the situation I think will lead to 
> > > > > > > > > > >> > surprise apps / admins,
> > > > > > > > > > >> > because the person/entity invoking the migration is 
> > > > > > > > > > >> > not likely to have
> > > > > > > > > > >> > checked wether this particular guest uses net failover 
> > > > > > > > > > >> > or not before
> > > > > > > > > > >> > invoking - they'll just be expecting a paused 
> > > > > > > > > > >> > migration to run fast and
> > > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > > >> > 
> > > > > > > > > > >> > Regards,
> > > > > > > > > > >> > Daniel
> > > > > > > > > > >> 
> > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse 
> > > > > > > > > > >> situation too:
> > > > > > > > > > >> pausing guest after migration started but before device 
> > > > > > > > > > >> was
> > > > > > > > > > >> unplugged?
> > > > > > > > > > >> 
> > > > > > > > > > >
> > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - 
> > > > > > > > > > > fail
> > > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > > >
> > > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > > 
> > > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > > 
> > > > > > > > > > Further solution would be:
> > > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > > > - in qemu, when we do start migration (but after we wait 
> > > > > > > > > > for the unplug
> > > > > > > > > >   device) paused the guest before starting migration and 
> > > > > > > > > > resume it after
> > > > > > > > > >   migration finish.
> > > > > > > > > 
> > > > > > > > > It would also have to handle issuing of paused after 
> > > > > > > > > migration has
> > > > > > > > > been started - delay the pause request until the nuplug is 
> > > > > > > > > complete
> > > > > > > > > is one answer.
> > > > > > > > 
> > > > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > > > resources back to host. It's problematic if guest can delay
> > > > > > > > that 

Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-03 Thread Daniel P . Berrangé
On Thu, Dec 03, 2020 at 06:40:11AM -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote:
> > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote:
> > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote:
> > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé 
> > > > > > > wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. 
> > > > > > > > > > Tsirkin wrote:
> > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. 
> > > > > > > > > >> Berrangé wrote:
> > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. 
> > > > > > > > > >> > Tsirkin wrote:
> > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan 
> > > > > > > > > >> > > Quintela wrote:
> > > > > > > > > >> > > > If we have a paused guest, it can't unplug the 
> > > > > > > > > >> > > > network VF device, so
> > > > > > > > > >> > > > we wait there forever.  Just change the code to give 
> > > > > > > > > >> > > > one error on that
> > > > > > > > > >> > > > case.
> > > > > > > > > >> > > > 
> > > > > > > > > >> > > > Signed-off-by: Juan Quintela 
> > > > > > > > > >> > > 
> > > > > > > > > >> > > It's certainly possible but it's management that 
> > > > > > > > > >> > > created
> > > > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > > > >> > > a policy? It is possible that management will unpause 
> > > > > > > > > >> > > immediately
> > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > >> > > 
> > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is 
> > > > > > > > > >> > > stuck
> > > > > > > > > >> > > because of a bug.
> > > > > > > > > >> > 
> > > > > > > > > >> > That's pretty different behaviour from how migration 
> > > > > > > > > >> > normally handles
> > > > > > > > > >> > a paused guest, which is that it is guaranteed to 
> > > > > > > > > >> > complete the migration
> > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > >> > 
> > > > > > > > > >> > Just ignoring the situation I think will lead to 
> > > > > > > > > >> > surprise apps / admins,
> > > > > > > > > >> > because the person/entity invoking the migration is not 
> > > > > > > > > >> > likely to have
> > > > > > > > > >> > checked wether this particular guest uses net failover 
> > > > > > > > > >> > or not before
> > > > > > > > > >> > invoking - they'll just be expecting a paused migration 
> > > > > > > > > >> > to run fast and
> > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > >> > 
> > > > > > > > > >> > Regards,
> > > > > > > > > >> > Daniel
> > > > > > > > > >> 
> > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse 
> > > > > > > > > >> situation too:
> > > > > > > > > >> pausing guest after migration started but before device was
> > > > > > > > > >> unplugged?
> > > > > > > > > >> 
> > > > > > > > > >
> > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > >
> > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > 
> > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > 
> > > > > > > > > Further solution would be:
> > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > > - in qemu, when we do start migration (but after we wait for 
> > > > > > > > > the unplug
> > > > > > > > >   device) paused the guest before starting migration and 
> > > > > > > > > resume it after
> > > > > > > > >   migration finish.
> > > > > > > > 
> > > > > > > > It would also have to handle issuing of paused after migration 
> > > > > > > > has
> > > > > > > > been started - delay the pause request until the nuplug is 
> > > > > > > > complete
> > > > > > > > is one answer.
> > > > > > > 
> > > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > > resources back to host. It's problematic if guest can delay
> > > > > > > that indefinitely.
> > > > > > 
> > > > > > hmm, yes, that is awkward.  Perhaps we should just report an 
> > > > > > explicit
> > > > > > error then.
> > > > > 
> > > > > Report an error in response to which command? Do you mean
> > > > > fail migration?
> > > > 
> > > > If mgt attempt to pause an existing migration that hasn't finished

Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-03 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote:
> > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote:
> > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote:
> > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé 
> > > > > > > wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. 
> > > > > > > > > > Tsirkin wrote:
> > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. 
> > > > > > > > > >> Berrangé wrote:
> > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. 
> > > > > > > > > >> > Tsirkin wrote:
> > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan 
> > > > > > > > > >> > > Quintela wrote:
> > > > > > > > > >> > > > If we have a paused guest, it can't unplug the 
> > > > > > > > > >> > > > network VF device, so
> > > > > > > > > >> > > > we wait there forever.  Just change the code to give 
> > > > > > > > > >> > > > one error on that
> > > > > > > > > >> > > > case.
> > > > > > > > > >> > > > 
> > > > > > > > > >> > > > Signed-off-by: Juan Quintela 
> > > > > > > > > >> > > 
> > > > > > > > > >> > > It's certainly possible but it's management that 
> > > > > > > > > >> > > created
> > > > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > > > >> > > a policy? It is possible that management will unpause 
> > > > > > > > > >> > > immediately
> > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > >> > > 
> > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is 
> > > > > > > > > >> > > stuck
> > > > > > > > > >> > > because of a bug.
> > > > > > > > > >> > 
> > > > > > > > > >> > That's pretty different behaviour from how migration 
> > > > > > > > > >> > normally handles
> > > > > > > > > >> > a paused guest, which is that it is guaranteed to 
> > > > > > > > > >> > complete the migration
> > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > >> > 
> > > > > > > > > >> > Just ignoring the situation I think will lead to 
> > > > > > > > > >> > surprise apps / admins,
> > > > > > > > > >> > because the person/entity invoking the migration is not 
> > > > > > > > > >> > likely to have
> > > > > > > > > >> > checked wether this particular guest uses net failover 
> > > > > > > > > >> > or not before
> > > > > > > > > >> > invoking - they'll just be expecting a paused migration 
> > > > > > > > > >> > to run fast and
> > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > >> > 
> > > > > > > > > >> > Regards,
> > > > > > > > > >> > Daniel
> > > > > > > > > >> 
> > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse 
> > > > > > > > > >> situation too:
> > > > > > > > > >> pausing guest after migration started but before device was
> > > > > > > > > >> unplugged?
> > > > > > > > > >> 
> > > > > > > > > >
> > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > >
> > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > 
> > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > 
> > > > > > > > > Further solution would be:
> > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > > - in qemu, when we do start migration (but after we wait for 
> > > > > > > > > the unplug
> > > > > > > > >   device) paused the guest before starting migration and 
> > > > > > > > > resume it after
> > > > > > > > >   migration finish.
> > > > > > > > 
> > > > > > > > It would also have to handle issuing of paused after migration 
> > > > > > > > has
> > > > > > > > been started - delay the pause request until the nuplug is 
> > > > > > > > complete
> > > > > > > > is one answer.
> > > > > > > 
> > > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > > resources back to host. It's problematic if guest can delay
> > > > > > > that indefinitely.
> > > > > > 
> > > > > > hmm, yes, that is awkward.  Perhaps we should just report an 
> > > > > > explicit
> > > > > > error then.
> > > > > 
> > > > > Report an error in response to which command? Do you mean
> > > > > fail migration?
> > > > 
> > > > If mgt attempt to pause an existing migration that hasn't finished
> > > > the PCI unplug 

Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-03 Thread Michael S. Tsirkin
On Thu, Dec 03, 2020 at 11:32:53AM +, Daniel P. Berrangé wrote:
> On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote:
> > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote:
> > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote:
> > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin 
> > > > > > > > > wrote:
> > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. 
> > > > > > > > >> Berrangé wrote:
> > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. 
> > > > > > > > >> > Tsirkin wrote:
> > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela 
> > > > > > > > >> > > wrote:
> > > > > > > > >> > > > If we have a paused guest, it can't unplug the network 
> > > > > > > > >> > > > VF device, so
> > > > > > > > >> > > > we wait there forever.  Just change the code to give 
> > > > > > > > >> > > > one error on that
> > > > > > > > >> > > > case.
> > > > > > > > >> > > > 
> > > > > > > > >> > > > Signed-off-by: Juan Quintela 
> > > > > > > > >> > > 
> > > > > > > > >> > > It's certainly possible but it's management that created
> > > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > > >> > > a policy? It is possible that management will unpause 
> > > > > > > > >> > > immediately
> > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > >> > > 
> > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is 
> > > > > > > > >> > > stuck
> > > > > > > > >> > > because of a bug.
> > > > > > > > >> > 
> > > > > > > > >> > That's pretty different behaviour from how migration 
> > > > > > > > >> > normally handles
> > > > > > > > >> > a paused guest, which is that it is guaranteed to complete 
> > > > > > > > >> > the migration
> > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > >> > 
> > > > > > > > >> > Just ignoring the situation I think will lead to surprise 
> > > > > > > > >> > apps / admins,
> > > > > > > > >> > because the person/entity invoking the migration is not 
> > > > > > > > >> > likely to have
> > > > > > > > >> > checked wether this particular guest uses net failover or 
> > > > > > > > >> > not before
> > > > > > > > >> > invoking - they'll just be expecting a paused migration to 
> > > > > > > > >> > run fast and
> > > > > > > > >> > be guaranteed to complete.
> > > > > > > > >> > 
> > > > > > > > >> > Regards,
> > > > > > > > >> > Daniel
> > > > > > > > >> 
> > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse 
> > > > > > > > >> situation too:
> > > > > > > > >> pausing guest after migration started but before device was
> > > > > > > > >> unplugged?
> > > > > > > > >> 
> > > > > > > > >
> > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > >
> > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > 
> > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > 
> > > > > > > > Further solution would be:
> > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > - in qemu, when we do start migration (but after we wait for 
> > > > > > > > the unplug
> > > > > > > >   device) paused the guest before starting migration and resume 
> > > > > > > > it after
> > > > > > > >   migration finish.
> > > > > > > 
> > > > > > > It would also have to handle issuing of paused after migration has
> > > > > > > been started - delay the pause request until the nuplug is 
> > > > > > > complete
> > > > > > > is one answer.
> > > > > > 
> > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > resources back to host. It's problematic if guest can delay
> > > > > > that indefinitely.
> > > > > 
> > > > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > > > error then.
> > > > 
> > > > Report an error in response to which command? Do you mean
> > > > fail migration?
> > > 
> > > If mgt attempt to pause an existing migration that hasn't finished
> > > the PCI unplug stage, then fail the pause request.
> > 
> > Pause guest not migration ...
> > Might be tricky ...
> > 
> > Let me ask this, why not just produce a warning
> > that migration wan't finish until guest actually runs?
> > User will then know and unpause the guest when he wants
> > migration to succeed ...

Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-03 Thread Daniel P . Berrangé
On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote:
> > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote:
> > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote:
> > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin 
> > > > > > > > wrote:
> > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. 
> > > > > > > >> Berrangé wrote:
> > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin 
> > > > > > > >> > wrote:
> > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela 
> > > > > > > >> > > wrote:
> > > > > > > >> > > > If we have a paused guest, it can't unplug the network 
> > > > > > > >> > > > VF device, so
> > > > > > > >> > > > we wait there forever.  Just change the code to give one 
> > > > > > > >> > > > error on that
> > > > > > > >> > > > case.
> > > > > > > >> > > > 
> > > > > > > >> > > > Signed-off-by: Juan Quintela 
> > > > > > > >> > > 
> > > > > > > >> > > It's certainly possible but it's management that created
> > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > >> > > a policy? It is possible that management will unpause 
> > > > > > > >> > > immediately
> > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > >> > > 
> > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > > >> > > because of a bug.
> > > > > > > >> > 
> > > > > > > >> > That's pretty different behaviour from how migration 
> > > > > > > >> > normally handles
> > > > > > > >> > a paused guest, which is that it is guaranteed to complete 
> > > > > > > >> > the migration
> > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > >> > 
> > > > > > > >> > Just ignoring the situation I think will lead to surprise 
> > > > > > > >> > apps / admins,
> > > > > > > >> > because the person/entity invoking the migration is not 
> > > > > > > >> > likely to have
> > > > > > > >> > checked wether this particular guest uses net failover or 
> > > > > > > >> > not before
> > > > > > > >> > invoking - they'll just be expecting a paused migration to 
> > > > > > > >> > run fast and
> > > > > > > >> > be guaranteed to complete.
> > > > > > > >> > 
> > > > > > > >> > Regards,
> > > > > > > >> > Daniel
> > > > > > > >> 
> > > > > > > >> Okay I guess. But then shouldn't we handle the reverse 
> > > > > > > >> situation too:
> > > > > > > >> pausing guest after migration started but before device was
> > > > > > > >> unplugged?
> > > > > > > >> 
> > > > > > > >
> > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > pausing guest until migration is cancelled?
> > > > > > > >
> > > > > > > > All this seems heavy handed to me ...
> > > > > > > 
> > > > > > > This is the minimal fix that I can think of.
> > > > > > > 
> > > > > > > Further solution would be:
> > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > - in qemu, when we do start migration (but after we wait for the 
> > > > > > > unplug
> > > > > > >   device) paused the guest before starting migration and resume 
> > > > > > > it after
> > > > > > >   migration finish.
> > > > > > 
> > > > > > It would also have to handle issuing of paused after migration has
> > > > > > been started - delay the pause request until the nuplug is complete
> > > > > > is one answer.
> > > > > 
> > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > resources back to host. It's problematic if guest can delay
> > > > > that indefinitely.
> > > > 
> > > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > > error then.
> > > 
> > > Report an error in response to which command? Do you mean
> > > fail migration?
> > 
> > If mgt attempt to pause an existing migration that hasn't finished
> > the PCI unplug stage, then fail the pause request.
> 
> Pause guest not migration ...
> Might be tricky ...
> 
> Let me ask this, why not just produce a warning
> that migration wan't finish until guest actually runs?
> User will then know and unpause the guest when he wants
> migration to succeed ...

A warning is going to be essentally invisible if the pause command
succeeeds. 

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-

Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-03 Thread Michael S. Tsirkin
On Wed, Dec 02, 2020 at 12:01:21PM +, Daniel P. Berrangé wrote:
> On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote:
> > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote:
> > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin 
> > > > > > > wrote:
> > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé 
> > > > > > >> wrote:
> > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin 
> > > > > > >> > wrote:
> > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela 
> > > > > > >> > > wrote:
> > > > > > >> > > > If we have a paused guest, it can't unplug the network VF 
> > > > > > >> > > > device, so
> > > > > > >> > > > we wait there forever.  Just change the code to give one 
> > > > > > >> > > > error on that
> > > > > > >> > > > case.
> > > > > > >> > > > 
> > > > > > >> > > > Signed-off-by: Juan Quintela 
> > > > > > >> > > 
> > > > > > >> > > It's certainly possible but it's management that created
> > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > >> > > a policy? It is possible that management will unpause 
> > > > > > >> > > immediately
> > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > >> > > 
> > > > > > >> > > Yes migration will not happen until guest is
> > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > >> > > because of a bug.
> > > > > > >> > 
> > > > > > >> > That's pretty different behaviour from how migration normally 
> > > > > > >> > handles
> > > > > > >> > a paused guest, which is that it is guaranteed to complete the 
> > > > > > >> > migration
> > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > >> > 
> > > > > > >> > Just ignoring the situation I think will lead to surprise apps 
> > > > > > >> > / admins,
> > > > > > >> > because the person/entity invoking the migration is not likely 
> > > > > > >> > to have
> > > > > > >> > checked wether this particular guest uses net failover or not 
> > > > > > >> > before
> > > > > > >> > invoking - they'll just be expecting a paused migration to run 
> > > > > > >> > fast and
> > > > > > >> > be guaranteed to complete.
> > > > > > >> > 
> > > > > > >> > Regards,
> > > > > > >> > Daniel
> > > > > > >> 
> > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation 
> > > > > > >> too:
> > > > > > >> pausing guest after migration started but before device was
> > > > > > >> unplugged?
> > > > > > >> 
> > > > > > >
> > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > pausing guest until migration is cancelled?
> > > > > > >
> > > > > > > All this seems heavy handed to me ...
> > > > > > 
> > > > > > This is the minimal fix that I can think of.
> > > > > > 
> > > > > > Further solution would be:
> > > > > > - Add a new migration parameter: migrate-paused
> > > > > > - change libvirt to use the new parameter if it exist
> > > > > > - in qemu, when we do start migration (but after we wait for the 
> > > > > > unplug
> > > > > >   device) paused the guest before starting migration and resume it 
> > > > > > after
> > > > > >   migration finish.
> > > > > 
> > > > > It would also have to handle issuing of paused after migration has
> > > > > been started - delay the pause request until the nuplug is complete
> > > > > is one answer.
> > > > 
> > > > Hmm my worry would be that pausing is one way to give cpu
> > > > resources back to host. It's problematic if guest can delay
> > > > that indefinitely.
> > > 
> > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > error then.
> > 
> > Report an error in response to which command? Do you mean
> > fail migration?
> 
> If mgt attempt to pause an existing migration that hasn't finished
> the PCI unplug stage, then fail the pause request.

Pause guest not migration ...
Might be tricky ...

Let me ask this, why not just produce a warning
that migration wan't finish until guest actually runs?
User will then know and unpause the guest when he wants
migration to succeed ...


For example, user can restrict the amount of cpu
using cgroups to a level where almost no progress
is made. QEMU can't detect this 



> > 
> > > In normal cases this won't happen, as unplug will have
> > > easily completed before the mgmt app pauses the running migration.
> > > In broken/malicious guest cases, this at least ives mgmt a heads up
> > > that something is wrong and they might then decide to cancel the
> > > migration.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: 

Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-02 Thread Daniel P . Berrangé
On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote:
> > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote:
> > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > "Michael S. Tsirkin"  wrote:
> > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé 
> > > > > >> wrote:
> > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin 
> > > > > >> > wrote:
> > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > >> > > > If we have a paused guest, it can't unplug the network VF 
> > > > > >> > > > device, so
> > > > > >> > > > we wait there forever.  Just change the code to give one 
> > > > > >> > > > error on that
> > > > > >> > > > case.
> > > > > >> > > > 
> > > > > >> > > > Signed-off-by: Juan Quintela 
> > > > > >> > > 
> > > > > >> > > It's certainly possible but it's management that created
> > > > > >> > > this situation after all - why do we bother to enforce
> > > > > >> > > a policy? It is possible that management will unpause 
> > > > > >> > > immediately
> > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > >> > > 
> > > > > >> > > Yes migration will not happen until guest is
> > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > >> > > because of a bug.
> > > > > >> > 
> > > > > >> > That's pretty different behaviour from how migration normally 
> > > > > >> > handles
> > > > > >> > a paused guest, which is that it is guaranteed to complete the 
> > > > > >> > migration
> > > > > >> > in as short a time as network bandwidth allows.
> > > > > >> > 
> > > > > >> > Just ignoring the situation I think will lead to surprise apps / 
> > > > > >> > admins,
> > > > > >> > because the person/entity invoking the migration is not likely 
> > > > > >> > to have
> > > > > >> > checked wether this particular guest uses net failover or not 
> > > > > >> > before
> > > > > >> > invoking - they'll just be expecting a paused migration to run 
> > > > > >> > fast and
> > > > > >> > be guaranteed to complete.
> > > > > >> > 
> > > > > >> > Regards,
> > > > > >> > Daniel
> > > > > >> 
> > > > > >> Okay I guess. But then shouldn't we handle the reverse situation 
> > > > > >> too:
> > > > > >> pausing guest after migration started but before device was
> > > > > >> unplugged?
> > > > > >> 
> > > > > >
> > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > pausing guest until migration is cancelled?
> > > > > >
> > > > > > All this seems heavy handed to me ...
> > > > > 
> > > > > This is the minimal fix that I can think of.
> > > > > 
> > > > > Further solution would be:
> > > > > - Add a new migration parameter: migrate-paused
> > > > > - change libvirt to use the new parameter if it exist
> > > > > - in qemu, when we do start migration (but after we wait for the 
> > > > > unplug
> > > > >   device) paused the guest before starting migration and resume it 
> > > > > after
> > > > >   migration finish.
> > > > 
> > > > It would also have to handle issuing of paused after migration has
> > > > been started - delay the pause request until the nuplug is complete
> > > > is one answer.
> > > 
> > > Hmm my worry would be that pausing is one way to give cpu
> > > resources back to host. It's problematic if guest can delay
> > > that indefinitely.
> > 
> > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > error then.
> 
> Report an error in response to which command? Do you mean
> fail migration?

If mgt attempt to pause an existing migration that hasn't finished
the PCI unplug stage, then fail the pause request.

> 
> > In normal cases this won't happen, as unplug will have
> > easily completed before the mgmt app pauses the running migration.
> > In broken/malicious guest cases, this at least ives mgmt a heads up
> > that something is wrong and they might then decide to cancel the
> > migration.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-02 Thread Daniel P . Berrangé
On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote:
> > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > "Michael S. Tsirkin"  wrote:
> > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote:
> > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > >> > > > If we have a paused guest, it can't unplug the network VF 
> > > >> > > > device, so
> > > >> > > > we wait there forever.  Just change the code to give one error 
> > > >> > > > on that
> > > >> > > > case.
> > > >> > > > 
> > > >> > > > Signed-off-by: Juan Quintela 
> > > >> > > 
> > > >> > > It's certainly possible but it's management that created
> > > >> > > this situation after all - why do we bother to enforce
> > > >> > > a policy? It is possible that management will unpause immediately
> > > >> > > afterwards and everything will proceed smoothly.
> > > >> > > 
> > > >> > > Yes migration will not happen until guest is
> > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > >> > > because of a bug.
> > > >> > 
> > > >> > That's pretty different behaviour from how migration normally handles
> > > >> > a paused guest, which is that it is guaranteed to complete the 
> > > >> > migration
> > > >> > in as short a time as network bandwidth allows.
> > > >> > 
> > > >> > Just ignoring the situation I think will lead to surprise apps / 
> > > >> > admins,
> > > >> > because the person/entity invoking the migration is not likely to 
> > > >> > have
> > > >> > checked wether this particular guest uses net failover or not before
> > > >> > invoking - they'll just be expecting a paused migration to run fast 
> > > >> > and
> > > >> > be guaranteed to complete.
> > > >> > 
> > > >> > Regards,
> > > >> > Daniel
> > > >> 
> > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > >> pausing guest after migration started but before device was
> > > >> unplugged?
> > > >> 
> > > >
> > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > pausing guest until migration is cancelled?
> > > >
> > > > All this seems heavy handed to me ...
> > > 
> > > This is the minimal fix that I can think of.
> > > 
> > > Further solution would be:
> > > - Add a new migration parameter: migrate-paused
> > > - change libvirt to use the new parameter if it exist
> > > - in qemu, when we do start migration (but after we wait for the unplug
> > >   device) paused the guest before starting migration and resume it after
> > >   migration finish.
> > 
> > It would also have to handle issuing of paused after migration has
> > been started - delay the pause request until the nuplug is complete
> > is one answer.
> 
> Hmm my worry would be that pausing is one way to give cpu
> resources back to host. It's problematic if guest can delay
> that indefinitely.

hmm, yes, that is awkward.  Perhaps we should just report an explicit
error then. In normal cases this won't happen, as unplug will have
easily completed before the mgmt app pauses the running migration.
In broken/malicious guest cases, this at least ives mgmt a heads up
that something is wrong and they might then decide to cancel the
migration.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-02 Thread Juan Quintela
"Michael S. Tsirkin"  wrote:
> On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
>> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote:
>> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
>> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
>> > > > If we have a paused guest, it can't unplug the network VF device, so
>> > > > we wait there forever.  Just change the code to give one error on that
>> > > > case.
>> > > > 
>> > > > Signed-off-by: Juan Quintela 
>> > > 
>> > > It's certainly possible but it's management that created
>> > > this situation after all - why do we bother to enforce
>> > > a policy? It is possible that management will unpause immediately
>> > > afterwards and everything will proceed smoothly.
>> > > 
>> > > Yes migration will not happen until guest is
>> > > unpaused but the same it true of e.g. a guest that is stuck
>> > > because of a bug.
>> > 
>> > That's pretty different behaviour from how migration normally handles
>> > a paused guest, which is that it is guaranteed to complete the migration
>> > in as short a time as network bandwidth allows.
>> > 
>> > Just ignoring the situation I think will lead to surprise apps / admins,
>> > because the person/entity invoking the migration is not likely to have
>> > checked wether this particular guest uses net failover or not before
>> > invoking - they'll just be expecting a paused migration to run fast and
>> > be guaranteed to complete.
>> > 
>> > Regards,
>> > Daniel
>> 
>> Okay I guess. But then shouldn't we handle the reverse situation too:
>> pausing guest after migration started but before device was
>> unplugged?
>> 
>
> Thinking of which, I have no idea how we'd handle it - fail
> pausing guest until migration is cancelled?
>
> All this seems heavy handed to me ...

This is the minimal fix that I can think of.

Further solution would be:
- Add a new migration parameter: migrate-paused
- change libvirt to use the new parameter if it exist
- in qemu, when we do start migration (but after we wait for the unplug
  device) paused the guest before starting migration and resume it after
  migration finish.

My understanding talking with Laine is that they use this functionality
by default for migration, saving, etc, i.e. it is not an isolated case.

Later, Juan.




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-02 Thread Michael S. Tsirkin
On Wed, Dec 02, 2020 at 11:26:39AM +, Daniel P. Berrangé wrote:
> On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote:
> > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > "Michael S. Tsirkin"  wrote:
> > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé 
> > > > >> wrote:
> > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > >> > > > If we have a paused guest, it can't unplug the network VF 
> > > > >> > > > device, so
> > > > >> > > > we wait there forever.  Just change the code to give one error 
> > > > >> > > > on that
> > > > >> > > > case.
> > > > >> > > > 
> > > > >> > > > Signed-off-by: Juan Quintela 
> > > > >> > > 
> > > > >> > > It's certainly possible but it's management that created
> > > > >> > > this situation after all - why do we bother to enforce
> > > > >> > > a policy? It is possible that management will unpause immediately
> > > > >> > > afterwards and everything will proceed smoothly.
> > > > >> > > 
> > > > >> > > Yes migration will not happen until guest is
> > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > >> > > because of a bug.
> > > > >> > 
> > > > >> > That's pretty different behaviour from how migration normally 
> > > > >> > handles
> > > > >> > a paused guest, which is that it is guaranteed to complete the 
> > > > >> > migration
> > > > >> > in as short a time as network bandwidth allows.
> > > > >> > 
> > > > >> > Just ignoring the situation I think will lead to surprise apps / 
> > > > >> > admins,
> > > > >> > because the person/entity invoking the migration is not likely to 
> > > > >> > have
> > > > >> > checked wether this particular guest uses net failover or not 
> > > > >> > before
> > > > >> > invoking - they'll just be expecting a paused migration to run 
> > > > >> > fast and
> > > > >> > be guaranteed to complete.
> > > > >> > 
> > > > >> > Regards,
> > > > >> > Daniel
> > > > >> 
> > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > >> pausing guest after migration started but before device was
> > > > >> unplugged?
> > > > >> 
> > > > >
> > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > pausing guest until migration is cancelled?
> > > > >
> > > > > All this seems heavy handed to me ...
> > > > 
> > > > This is the minimal fix that I can think of.
> > > > 
> > > > Further solution would be:
> > > > - Add a new migration parameter: migrate-paused
> > > > - change libvirt to use the new parameter if it exist
> > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > >   device) paused the guest before starting migration and resume it after
> > > >   migration finish.
> > > 
> > > It would also have to handle issuing of paused after migration has
> > > been started - delay the pause request until the nuplug is complete
> > > is one answer.
> > 
> > Hmm my worry would be that pausing is one way to give cpu
> > resources back to host. It's problematic if guest can delay
> > that indefinitely.
> 
> hmm, yes, that is awkward.  Perhaps we should just report an explicit
> error then.

Report an error in response to which command? Do you mean
fail migration?

> In normal cases this won't happen, as unplug will have
> easily completed before the mgmt app pauses the running migration.
> In broken/malicious guest cases, this at least ives mgmt a heads up
> that something is wrong and they might then decide to cancel the
> migration.
> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-02 Thread Michael S. Tsirkin
On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote:
> On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > If we have a paused guest, it can't unplug the network VF device, so
> > > we wait there forever.  Just change the code to give one error on that
> > > case.
> > > 
> > > Signed-off-by: Juan Quintela 
> > 
> > It's certainly possible but it's management that created
> > this situation after all - why do we bother to enforce
> > a policy? It is possible that management will unpause immediately
> > afterwards and everything will proceed smoothly.
> > 
> > Yes migration will not happen until guest is
> > unpaused but the same it true of e.g. a guest that is stuck
> > because of a bug.
> 
> That's pretty different behaviour from how migration normally handles
> a paused guest, which is that it is guaranteed to complete the migration
> in as short a time as network bandwidth allows.
> 
> Just ignoring the situation I think will lead to surprise apps / admins,
> because the person/entity invoking the migration is not likely to have
> checked wether this particular guest uses net failover or not before
> invoking - they'll just be expecting a paused migration to run fast and
> be guaranteed to complete.
> 
> Regards,
> Daniel

Okay I guess. But then shouldn't we handle the reverse situation too:
pausing guest after migration started but before device was
unplugged?


> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-02 Thread Michael S. Tsirkin
On Wed, Dec 02, 2020 at 10:55:15AM +, Daniel P. Berrangé wrote:
> On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > "Michael S. Tsirkin"  wrote:
> > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote:
> > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > >> > > > If we have a paused guest, it can't unplug the network VF device, 
> > >> > > > so
> > >> > > > we wait there forever.  Just change the code to give one error on 
> > >> > > > that
> > >> > > > case.
> > >> > > > 
> > >> > > > Signed-off-by: Juan Quintela 
> > >> > > 
> > >> > > It's certainly possible but it's management that created
> > >> > > this situation after all - why do we bother to enforce
> > >> > > a policy? It is possible that management will unpause immediately
> > >> > > afterwards and everything will proceed smoothly.
> > >> > > 
> > >> > > Yes migration will not happen until guest is
> > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > >> > > because of a bug.
> > >> > 
> > >> > That's pretty different behaviour from how migration normally handles
> > >> > a paused guest, which is that it is guaranteed to complete the 
> > >> > migration
> > >> > in as short a time as network bandwidth allows.
> > >> > 
> > >> > Just ignoring the situation I think will lead to surprise apps / 
> > >> > admins,
> > >> > because the person/entity invoking the migration is not likely to have
> > >> > checked wether this particular guest uses net failover or not before
> > >> > invoking - they'll just be expecting a paused migration to run fast and
> > >> > be guaranteed to complete.
> > >> > 
> > >> > Regards,
> > >> > Daniel
> > >> 
> > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > >> pausing guest after migration started but before device was
> > >> unplugged?
> > >> 
> > >
> > > Thinking of which, I have no idea how we'd handle it - fail
> > > pausing guest until migration is cancelled?
> > >
> > > All this seems heavy handed to me ...
> > 
> > This is the minimal fix that I can think of.
> > 
> > Further solution would be:
> > - Add a new migration parameter: migrate-paused
> > - change libvirt to use the new parameter if it exist
> > - in qemu, when we do start migration (but after we wait for the unplug
> >   device) paused the guest before starting migration and resume it after
> >   migration finish.
> 
> It would also have to handle issuing of paused after migration has
> been started - delay the pause request until the nuplug is complete
> is one answer.

Hmm my worry would be that pausing is one way to give cpu
resources back to host. It's problematic if guest can delay
that indefinitely.

> > My understanding talking with Laine is that they use this functionality
> > by default for migration, saving, etc, i.e. it is not an isolated case.
> 
> Yep,  save-to-disk always runs in the paused state, and migration is
> also paused by default unless the mgmt app explicitl asks for live
> migration.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-02 Thread Daniel P . Berrangé
On Wed, Dec 02, 2020 at 05:31:50AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote:
> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > we wait there forever.  Just change the code to give one error on that
> > > > case.
> > > > 
> > > > Signed-off-by: Juan Quintela 
> > > 
> > > It's certainly possible but it's management that created
> > > this situation after all - why do we bother to enforce
> > > a policy? It is possible that management will unpause immediately
> > > afterwards and everything will proceed smoothly.
> > > 
> > > Yes migration will not happen until guest is
> > > unpaused but the same it true of e.g. a guest that is stuck
> > > because of a bug.
> > 
> > That's pretty different behaviour from how migration normally handles
> > a paused guest, which is that it is guaranteed to complete the migration
> > in as short a time as network bandwidth allows.
> > 
> > Just ignoring the situation I think will lead to surprise apps / admins,
> > because the person/entity invoking the migration is not likely to have
> > checked wether this particular guest uses net failover or not before
> > invoking - they'll just be expecting a paused migration to run fast and
> > be guaranteed to complete.
>
> Okay I guess. But then shouldn't we handle the reverse situation too:
> pausing guest after migration started but before device was
> unplugged?

Yeah we likely want todo something there.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-02 Thread Michael S. Tsirkin
On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote:
> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > we wait there forever.  Just change the code to give one error on that
> > > > case.
> > > > 
> > > > Signed-off-by: Juan Quintela 
> > > 
> > > It's certainly possible but it's management that created
> > > this situation after all - why do we bother to enforce
> > > a policy? It is possible that management will unpause immediately
> > > afterwards and everything will proceed smoothly.
> > > 
> > > Yes migration will not happen until guest is
> > > unpaused but the same it true of e.g. a guest that is stuck
> > > because of a bug.
> > 
> > That's pretty different behaviour from how migration normally handles
> > a paused guest, which is that it is guaranteed to complete the migration
> > in as short a time as network bandwidth allows.
> > 
> > Just ignoring the situation I think will lead to surprise apps / admins,
> > because the person/entity invoking the migration is not likely to have
> > checked wether this particular guest uses net failover or not before
> > invoking - they'll just be expecting a paused migration to run fast and
> > be guaranteed to complete.
> > 
> > Regards,
> > Daniel
> 
> Okay I guess. But then shouldn't we handle the reverse situation too:
> pausing guest after migration started but before device was
> unplugged?
> 

Thinking of which, I have no idea how we'd handle it - fail
pausing guest until migration is cancelled?

All this seems heavy handed to me ...

> > -- 
> > |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange 
> > :|
> > |: https://libvirt.org -o-https://fstop138.berrange.com 
> > :|
> > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange 
> > :|




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-02 Thread Daniel P . Berrangé
On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > If we have a paused guest, it can't unplug the network VF device, so
> > we wait there forever.  Just change the code to give one error on that
> > case.
> > 
> > Signed-off-by: Juan Quintela 
> 
> It's certainly possible but it's management that created
> this situation after all - why do we bother to enforce
> a policy? It is possible that management will unpause immediately
> afterwards and everything will proceed smoothly.
> 
> Yes migration will not happen until guest is
> unpaused but the same it true of e.g. a guest that is stuck
> because of a bug.

That's pretty different behaviour from how migration normally handles
a paused guest, which is that it is guaranteed to complete the migration
in as short a time as network bandwidth allows.

Just ignoring the situation I think will lead to surprise apps / admins,
because the person/entity invoking the migration is not likely to have
checked wether this particular guest uses net failover or not before
invoking - they'll just be expecting a paused migration to run fast and
be guaranteed to complete.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-02 Thread Daniel P . Berrangé
On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> "Michael S. Tsirkin"  wrote:
> > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> >> On Wed, Dec 02, 2020 at 10:27:18AM +, Daniel P. Berrangé wrote:
> >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> >> > > > If we have a paused guest, it can't unplug the network VF device, so
> >> > > > we wait there forever.  Just change the code to give one error on 
> >> > > > that
> >> > > > case.
> >> > > > 
> >> > > > Signed-off-by: Juan Quintela 
> >> > > 
> >> > > It's certainly possible but it's management that created
> >> > > this situation after all - why do we bother to enforce
> >> > > a policy? It is possible that management will unpause immediately
> >> > > afterwards and everything will proceed smoothly.
> >> > > 
> >> > > Yes migration will not happen until guest is
> >> > > unpaused but the same it true of e.g. a guest that is stuck
> >> > > because of a bug.
> >> > 
> >> > That's pretty different behaviour from how migration normally handles
> >> > a paused guest, which is that it is guaranteed to complete the migration
> >> > in as short a time as network bandwidth allows.
> >> > 
> >> > Just ignoring the situation I think will lead to surprise apps / admins,
> >> > because the person/entity invoking the migration is not likely to have
> >> > checked wether this particular guest uses net failover or not before
> >> > invoking - they'll just be expecting a paused migration to run fast and
> >> > be guaranteed to complete.
> >> > 
> >> > Regards,
> >> > Daniel
> >> 
> >> Okay I guess. But then shouldn't we handle the reverse situation too:
> >> pausing guest after migration started but before device was
> >> unplugged?
> >> 
> >
> > Thinking of which, I have no idea how we'd handle it - fail
> > pausing guest until migration is cancelled?
> >
> > All this seems heavy handed to me ...
> 
> This is the minimal fix that I can think of.
> 
> Further solution would be:
> - Add a new migration parameter: migrate-paused
> - change libvirt to use the new parameter if it exist
> - in qemu, when we do start migration (but after we wait for the unplug
>   device) paused the guest before starting migration and resume it after
>   migration finish.

It would also have to handle issuing of paused after migration has
been started - delay the pause request until the nuplug is complete
is one answer.
 
> My understanding talking with Laine is that they use this functionality
> by default for migration, saving, etc, i.e. it is not an isolated case.

Yep,  save-to-disk always runs in the paused state, and migration is
also paused by default unless the mgmt app explicitl asks for live
migration.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-12-02 Thread Michael S. Tsirkin
On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> If we have a paused guest, it can't unplug the network VF device, so
> we wait there forever.  Just change the code to give one error on that
> case.
> 
> Signed-off-by: Juan Quintela 

It's certainly possible but it's management that created
this situation after all - why do we bother to enforce
a policy? It is possible that management will unpause immediately
afterwards and everything will proceed smoothly.

Yes migration will not happen until guest is
unpaused but the same it true of e.g. a guest that is stuck
because of a bug.




> ---
>  migration/migration.c | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 87a9b59f83..d44fc880f9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3548,6 +3548,18 @@ static void *migration_thread(void *opaque)
>  qemu_savevm_state_setup(s->to_dst_file);
>  
>  if (qemu_savevm_state_guest_unplug_pending()) {
> +/* if guest is paused, it can send back the wait event */
> +if (!runstate_is_running()) {
> +Error *local_err = NULL;
> +
> +error_setg(_err, "migration: network failover and "
> +   "guest is paused'");
> +migrate_set_error(s, local_err);
> +error_free(local_err);
> +migrate_set_state(>state, MIGRATION_STATUS_SETUP,
> +  MIGRATION_STATUS_FAILED);
> +goto end;
> +}
>  migrate_set_state(>state, MIGRATION_STATUS_SETUP,
>MIGRATION_STATUS_WAIT_UNPLUG);
>  
> @@ -3597,6 +3609,7 @@ static void *migration_thread(void *opaque)
>  }
>  
>  trace_migration_thread_after_loop();
> +end:
>  migration_iteration_finish(s);
>  object_unref(OBJECT(s));
>  rcu_unregister_thread();
> -- 
> 2.26.2




Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest

2020-11-25 Thread Dr. David Alan Gilbert
* Juan Quintela (quint...@redhat.com) wrote:
> If we have a paused guest, it can't unplug the network VF device, so
> we wait there forever.  Just change the code to give one error on that
> case.
> 
> Signed-off-by: Juan Quintela 

Reviewed-by: Dr. David Alan Gilbert 

> ---
>  migration/migration.c | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 87a9b59f83..d44fc880f9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3548,6 +3548,18 @@ static void *migration_thread(void *opaque)
>  qemu_savevm_state_setup(s->to_dst_file);
>  
>  if (qemu_savevm_state_guest_unplug_pending()) {
> +/* if guest is paused, it can send back the wait event */
> +if (!runstate_is_running()) {
> +Error *local_err = NULL;
> +
> +error_setg(_err, "migration: network failover and "
> +   "guest is paused'");
> +migrate_set_error(s, local_err);
> +error_free(local_err);
> +migrate_set_state(>state, MIGRATION_STATUS_SETUP,
> +  MIGRATION_STATUS_FAILED);
> +goto end;
> +}
>  migrate_set_state(>state, MIGRATION_STATUS_SETUP,
>MIGRATION_STATUS_WAIT_UNPLUG);
>  
> @@ -3597,6 +3609,7 @@ static void *migration_thread(void *opaque)
>  }
>  
>  trace_migration_thread_after_loop();
> +end:
>  migration_iteration_finish(s);
>  object_unref(OBJECT(s));
>  rcu_unregister_thread();
> -- 
> 2.26.2
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK