Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-22 Thread Alex Williamson
On Thu, 23 Aug 2018 04:02:43 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Thursday, August 23, 2018 11:47 AM
> > 
> > On Wed, 22 Aug 2018 02:30:12 +
> > "Tian, Kevin"  wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Wednesday, August 22, 2018 10:08 AM
> > > >
> > > > On Wed, 22 Aug 2018 01:27:05 +
> > > > "Tian, Kevin"  wrote:
> > > >  
> > > > > > From: Wang, Zhi A
> > > > > > Sent: Wednesday, August 22, 2018 2:43 AM  
> > > > > > >
> > > > > > > Are there any suggestions how we can deal with security issues?
> > > > > > > Allowing userspace to provide a data stream representing the
> > > > > > > internal state of a virtual device model living within the kernel
> > > > > > > seems troublesome.  If we need to trust the data stream, do we need
> > > > > > > to somehow make the operation more privileged than what a vfio user
> > > > > > > might have otherwise?  Does the data stream need to be somehow
> > > > > > > signed and how might we do that?  How can we build in protection
> > > > > > > against an untrusted restore image?  Thanks,
> > > > >
> > > > > imo it is not necessary. restoring mdev state should be handled as if
> > > > > guest is programming the mdev.  
> > > >
> > > > To me this suggests that a state save/restore is just an algorithm
> > > > executed by userspace using the existing vfio device accesses.  This is
> > > > not at all what we've been discussing for migration.  I believe the  
> > >
> > > not algorithm by userspace. It's kernel driver to apply the audit when
> > > receiving opaque state data.  
> > 
> > And a kernel driver receiving and processing opaque state data from a
> > user doesn't raise security concerns for you?  
> 
> opaque is from userspace p.o.v. kernel driver understands the actual
> format and thus can audit when restoring the state.

Which only means that we risk having untold security issues within each
separate mdev vendor driver.

> > > > interface we've been hashing out exposes opaque device state through
> > > > a vfio region.  We therefore must assume that that opaque data contains
> > > > not only device state, but also emulation state, similar to what we see
> > > > for any QEMU device.  Not only is there internal emulation state, but
> > > > we have no guarantee that the device state goes through the same
> > > > auditing as it does through the vfio interface.  Since this device and
> > > > emulation state live inside the kernel and not just within the user's
> > > > own process, a malicious user can do far more than shoot themselves.
> > > > It would be one thing if devices were IOMMU isolated, but they're not,
> > > > they're isolated through vendor and device specific mechanisms, and for
> > > > all we know the parameters of that isolation are included in the
> > > > restore state.  I don't see how we can say this is not an issue.
> > >
> > > I didn't quite get this. My understanding is that isolation configuration
> > > is completed when a mdev is created on DEST machine given a type
> > > definition. The state image contains just runtime data reflecting what
> > > guest driver does on SRC machine. Restoring such state shouldn't
> > > change the isolation policy.  
> > 
> > Let's invent an example where the mdev vendor driver has a set of
> > pinned pages which are the current working set for the device at the
> > time of migration.  Information about that pinning might be included in
> > the opaque migration state.  If a malicious user discovers this, they
> > can potentially also craft a modified state which can exploit the host
> > kernel isolation.  
> 
> pinned pages may be not a good example. the pin knowledge could be
> reconstructed when restoring the state (e.g. in GVT-g pinning is triggered
> by shadowing GPU page table which has to be recreated on DEST). 

There are always ways for vendor drivers to do this correctly, but
again, one vendor doing it correctly doesn't prevent this from being a
gaping security issue with unending vulnerabilities for other vendors.

> > > > > Then all the audits/security checks
> > > > > enforced in normal emulation path should still apply. vendor driver
> > > > > may choose to audit every state restore operation one-by-one, or
> > > > > do it altogether at a synchronization point (e.g. when the mdev is
> > > > > rescheduled, similar to what we did before VMENTRY).
> > > >
> > > > Giving the vendor driver the choice of whether to be secure or not is
> > > > exactly what I'm trying to propose we spend some time thinking about.
> > > > For instance, what if instead of allowing the user to load device state
> > > > through a region, the kernel could side load it using something similar
> > > > to the firmware loading path.  The user could be provided with a file
> > > > name token that they push through the vfio 

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-22 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Thursday, August 23, 2018 11:47 AM
> 
> On Wed, 22 Aug 2018 02:30:12 +
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > Sent: Wednesday, August 22, 2018 10:08 AM
> > >
> > > On Wed, 22 Aug 2018 01:27:05 +
> > > "Tian, Kevin"  wrote:
> > >
> > > > > From: Wang, Zhi A
> > > > > Sent: Wednesday, August 22, 2018 2:43 AM
> > > > > >
> > > > > > Are there any suggestions how we can deal with security issues?
> > > > > > Allowing userspace to provide a data stream representing the
> > > > > > internal state of a virtual device model living within the kernel
> > > > > > seems troublesome.  If we need to trust the data stream, do we need
> > > > > > to somehow make the operation more privileged than what a vfio user
> > > > > > might have otherwise?  Does the data stream need to be somehow
> > > > > > signed and how might we do that?  How can we build in protection
> > > > > > against an untrusted restore image?  Thanks,
> > > >
> > > > imo it is not necessary. restoring mdev state should be handled as if
> > > > guest is programming the mdev.
> > >
> > > To me this suggests that a state save/restore is just an algorithm
> > > executed by userspace using the existing vfio device accesses.  This is
> > > not at all what we've been discussing for migration.  I believe the
> >
> > not algorithm by userspace. It's kernel driver to apply the audit when
> > receiving opaque state data.
> 
> And a kernel driver receiving and processing opaque state data from a
> user doesn't raise security concerns for you?

It is opaque only from the userspace point of view. The kernel driver
understands the actual format and thus can audit the data when restoring
the state.

> 
> > > interface we've been hashing out exposes opaque device state through
> > > a vfio region.  We therefore must assume that that opaque data contains
> > > not only device state, but also emulation state, similar to what we see
> > > for any QEMU device.  Not only is there internal emulation state, but
> > > we have no guarantee that the device state goes through the same
> > > auditing as it does through the vfio interface.  Since this device and
> > > emulation state live inside the kernel and not just within the user's
> > > own process, a malicious user can do far more than shoot themselves.
> > > It would be one thing if devices were IOMMU isolated, but they're not,
> > > they're isolated through vendor and device specific mechanisms, and for
> > > all we know the parameters of that isolation are included in the
> > > restore state.  I don't see how we can say this is not an issue.
> >
> > I didn't quite get this. My understanding is that isolation configuration
> > is completed when a mdev is created on DEST machine given a type
> > definition. The state image contains just runtime data reflecting what
> > guest driver does on SRC machine. Restoring such state shouldn't
> > change the isolation policy.
> 
> Let's invent an example where the mdev vendor driver has a set of
> pinned pages which are the current working set for the device at the
> time of migration.  Information about that pinning might be included in
> the opaque migration state.  If a malicious user discovers this, they
> can potentially also craft a modified state which can exploit the host
> kernel isolation.

Pinned pages may not be a good example. The pinning information could be
reconstructed when restoring the state (e.g. in GVT-g, pinning is triggered
by shadowing the GPU page table, which has to be recreated on DEST).

> 
> > > > Then all the audits/security checks
> > > > enforced in normal emulation path should still apply. vendor driver
> > > > may choose to audit every state restore operation one-by-one, or
> > > > do it altogether at a synchronization point (e.g. when the mdev is
> > > > rescheduled, similar to what we did before VMENTRY).
> > >
> > > Giving the vendor driver the choice of whether to be secure or not is
> > > exactly what I'm trying to propose we spend some time thinking about.
> > > For instance, what if instead of allowing the user to load device state
> > > through a region, the kernel could side load it using something similar
> > > to the firmware loading path.  The user could be provided with a file
> > > name token that they push through the vfio interface to trigger the
> > > state loading from a location with proper file level ACLs such that the
> > > image can be considered trusted.  Unfortunately the collateral is that
> > > libvirt would need to become the secure delivery entity, somehow
> > > stripping this section of the migration stream into a file and
> > > providing a token for the user to ask the kernel to load it.  What are
> > > some other options?  Could save/restore be done simply as an
> > > algorithmic script matched to stack of data, as I read into your first
> > > statement above?  I have doubts that we can achieve 

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-22 Thread Alex Williamson
On Wed, 22 Aug 2018 02:30:12 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Wednesday, August 22, 2018 10:08 AM
> > 
> > On Wed, 22 Aug 2018 01:27:05 +
> > "Tian, Kevin"  wrote:
> >   
> > > > From: Wang, Zhi A
> > > > Sent: Wednesday, August 22, 2018 2:43 AM  
> > > > >
> > > > > Are there any suggestions how we can deal with security issues?
> > > > > Allowing userspace to provide a data stream representing the internal
> > > > > state of a virtual device model living within the kernel seems
> > > > > troublesome.  If we need to trust the data stream, do we need to
> > > > > somehow make the operation more privileged than what a vfio user
> > > > > might have otherwise?  Does the data stream need to be somehow signed
> > > > > and how might we do that?  How can we build in protection against an
> > > > > untrusted restore image?  Thanks,
> > >
> > > imo it is not necessary. restoring mdev state should be handled as if
> > > guest is programming the mdev.  
> > 
> > To me this suggests that a state save/restore is just an algorithm
> > executed by userspace using the existing vfio device accesses.  This is
> > not at all what we've been discussing for migration.  I believe the  
> 
> not algorithm by userspace. It's kernel driver to apply the audit when
> receiving opaque state data.

And a kernel driver receiving and processing opaque state data from a
user doesn't raise security concerns for you?

> > interface we've been hashing out exposes opaque device state through a
> > vfio region.  We therefore must assume that that opaque data contains
> > not only device state, but also emulation state, similar to what we see
> > for any QEMU device.  Not only is there internal emulation state, but
> > we have no guarantee that the device state goes through the same
> > auditing as it does through the vfio interface.  Since this device and
> > emulation state live inside the kernel and not just within the user's
> > own process, a malicious user can do far more than shoot themselves.  It
> > would be one thing if devices were IOMMU isolated, but they're not,
> > they're isolated through vendor and device specific mechanisms, and for
> > all we know the parameters of that isolation are included in the
> > restore state.  I don't see how we can say this is not an issue.  
> 
> I didn't quite get this. My understanding is that isolation configuration
> is completed when a mdev is created on DEST machine given a type
> definition. The state image contains just runtime data reflecting what
> guest driver does on SRC machine. Restoring such state shouldn't
> change the isolation policy.

Let's invent an example where the mdev vendor driver has a set of
pinned pages which are the current working set for the device at the
time of migration.  Information about that pinning might be included in
the opaque migration state.  If a malicious user discovers this, they
can potentially also craft a modified state which can exploit the host
kernel isolation.

> > > Then all the audits/security checks
> > > enforced in normal emulation path should still apply. vendor driver
> > > may choose to audit every state restore operation one-by-one, or
> > > do it altogether at a synchronization point (e.g. when the mdev is
> > > rescheduled, similar to what we did before VMENTRY).
> > 
> > Giving the vendor driver the choice of whether to be secure or not is
> > exactly what I'm trying to propose we spend some time thinking about.
> > For instance, what if instead of allowing the user to load device state
> > through a region, the kernel could side load it using something similar
> > to the firmware loading path.  The user could be provided with a file
> > name token that they push through the vfio interface to trigger the
> > state loading from a location with proper file level ACLs such that the
> > image can be considered trusted.  Unfortunately the collateral is that
> > libvirt would need to become the secure delivery entity, somehow
> > stripping this section of the migration stream into a file and
> > providing a token for the user to ask the kernel to load it.  What are
> > some other options?  Could save/restore be done simply as an
> > algorithmic script matched to stack of data, as I read into your first
> > statement above?  I have doubts that we can achieve the internal state
> > we need, or maybe even the performance we need using such a process.
> > Thanks,
> >   
> 
> for GVT-g I think we invoke common functions as used in emulation path
> to recover vGPU state, e.g. gtt rw handler, etc. Zhi can correct me if
> I'm wrong.

One example of migration state being restored in a secure manner does
not prove that such an interface is universally secure or a good idea.

> Can you elaborate the difference between device state and emulation
> state which you mentioned earlier? We may need look at some concrete
> example to understand 

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-21 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, August 22, 2018 10:08 AM
> 
> On Wed, 22 Aug 2018 01:27:05 +
> "Tian, Kevin"  wrote:
> 
> > > From: Wang, Zhi A
> > > Sent: Wednesday, August 22, 2018 2:43 AM
> > > >
> > > > Are there any suggestions how we can deal with security issues?
> > > > Allowing userspace to provide a data stream representing the internal
> > > > state of a virtual device model living within the kernel seems
> > > > troublesome.  If we need to trust the data stream, do we need to
> > > > somehow make the operation more privileged than what a vfio user
> > > > might have otherwise?  Does the data stream need to be somehow signed
> > > > and how might we do that?  How can we build in protection against an
> > > > untrusted restore image?  Thanks,
> >
> > imo it is not necessary. restoring mdev state should be handled as if
> > guest is programming the mdev.
> 
> To me this suggests that a state save/restore is just an algorithm
> executed by userspace using the existing vfio device accesses.  This is
> not at all what we've been discussing for migration.  I believe the

Not an algorithm run by userspace. It is the kernel driver that applies
the audit when it receives the opaque state data.

> interface we've been hashing out exposes opaque device state through a
> vfio region.  We therefore must assume that that opaque data contains
> not only device state, but also emulation state, similar to what we see
> for any QEMU device.  Not only is there internal emulation state, but
> we have no guarantee that the device state goes through the same
> auditing as it does through the vfio interface.  Since this device and
> emulation state live inside the kernel and not just within the user's
> own process, a malicious user can do far more than shoot themselves.  It
> would be one thing if devices were IOMMU isolated, but they're not,
> they're isolated through vendor and device specific mechanisms, and for
> all we know the parameters of that isolation are included in the
> restore state.  I don't see how we can say this is not an issue.

I didn't quite get this. My understanding is that the isolation
configuration is completed when an mdev is created on the DEST machine
from a given type definition. The state image contains just runtime data
reflecting what the guest driver did on the SRC machine. Restoring such
state shouldn't change the isolation policy.

> 
> > Then all the audits/security checks
> > enforced in normal emulation path should still apply. vendor driver
> > may choose to audit every state restore operation one-by-one, or
> > do it altogether at a synchronization point (e.g. when the mdev is
> > rescheduled, similar to what we did before VMENTRY).
> 
> Giving the vendor driver the choice of whether to be secure or not is
> exactly what I'm trying to propose we spend some time thinking about.
> For instance, what if instead of allowing the user to load device state
> through a region, the kernel could side load it using something similar
> to the firmware loading path.  The user could be provided with a file
> name token that they push through the vfio interface to trigger the
> state loading from a location with proper file level ACLs such that the
> image can be considered trusted.  Unfortunately the collateral is that
> libvirt would need to become the secure delivery entity, somehow
> stripping this section of the migration stream into a file and
> providing a token for the user to ask the kernel to load it.  What are
> some other options?  Could save/restore be done simply as an
> algorithmic script matched to stack of data, as I read into your first
> statement above?  I have doubts that we can achieve the internal state
> we need, or maybe even the performance we need using such a process.
> Thanks,
> 

For GVT-g, I think we invoke the same common functions used in the
emulation path to recover vGPU state, e.g. the gtt rw handler. Zhi can
correct me if I'm wrong.

Can you elaborate on the difference between device state and emulation
state which you mentioned earlier? We may need to look at a concrete
example to understand the actual problem here.

Thanks
Kevin

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-21 Thread Alex Williamson
On Wed, 22 Aug 2018 01:27:05 +
"Tian, Kevin"  wrote:

> > From: Wang, Zhi A
> > Sent: Wednesday, August 22, 2018 2:43 AM  
> > >
> > > Are there any suggestions how we can deal with security issues?
> > > Allowing userspace to provide a data stream representing the internal
> > > state of a virtual device model living within the kernel seems
> > > troublesome.  If we need to trust the data stream, do we need to
> > > somehow make the operation more privileged than what a vfio user
> > > might have otherwise?  Does the data stream need to be somehow signed
> > > and how might we do that?  How can we build in protection against an
> > > untrusted restore image?  Thanks,
> 
> imo it is not necessary. restoring mdev state should be handled as if
> guest is programming the mdev.

To me this suggests that a state save/restore is just an algorithm
executed by userspace using the existing vfio device accesses.  This is
not at all what we've been discussing for migration.  I believe the
interface we've been hashing out exposes opaque device state through a
vfio region.  We therefore must assume that that opaque data contains
not only device state, but also emulation state, similar to what we see
for any QEMU device.  Not only is there internal emulation state, but
we have no guarantee that the device state goes through the same
auditing as it does through the vfio interface.  Since this device and
emulation state live inside the kernel and not just within the user's
own process, a malicious user can do far more than shoot themselves.  It
would be one thing if devices were IOMMU isolated, but they're not,
they're isolated through vendor and device specific mechanisms, and for
all we know the parameters of that isolation are included in the
restore state.  I don't see how we can say this is not an issue.

> Then all the audits/security checks
> enforced in normal emulation path should still apply. vendor driver
> may choose to audit every state restore operation one-by-one, or
> do it altogether at a synchronization point (e.g. when the mdev is
> rescheduled, similar to what we did before VMENTRY).

Giving the vendor driver the choice of whether to be secure or not is
exactly what I'm trying to propose we spend some time thinking about.
For instance, what if instead of allowing the user to load device state
through a region, the kernel could side load it using something similar
to the firmware loading path.  The user could be provided with a file
name token that they push through the vfio interface to trigger the
state loading from a location with proper file level ACLs such that the
image can be considered trusted.  Unfortunately the collateral is that
libvirt would need to become the secure delivery entity, somehow
stripping this section of the migration stream into a file and
providing a token for the user to ask the kernel to load it.  What are
some other options?  Could save/restore be done simply as an
algorithmic script matched to stack of data, as I read into your first
statement above?  I have doubts that we can achieve the internal state
we need, or maybe even the performance we need using such a process.
Thanks,

Alex
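
As a rough illustration of the side-loading idea above, here is a minimal
userspace sketch in Python. It is not a kernel interface and nothing in it
exists in vfio: the function name, the exception, the trusted directory and
the ownership rule are all assumptions made for the example. It only shows
the shape of the check: the token is a plain file name, and the image is read
only if the file's owner and mode suggest an unprivileged user could not have
written it. It assumes a POSIX system.

```python
import os
import stat


class UntrustedImage(Exception):
    """Raised when the token or the file behind it fails the checks."""


def load_state_by_token(token, trusted_dir, trusted_uid=0):
    """Resolve a file-name token inside trusted_dir and read the image.

    Hypothetical helper: the name, arguments and policy are illustrative.
    """
    # The token must be a bare file name, never a path.
    if token in ("", ".", "..") or token != os.path.basename(token):
        raise UntrustedImage("token must be a plain file name")
    path = os.path.join(trusted_dir, token)
    # O_NOFOLLOW refuses a symlink planted in place of the image.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        st = os.fstat(fd)
        if not stat.S_ISREG(st.st_mode):
            raise UntrustedImage("not a regular file")
        if st.st_uid != trusted_uid:
            raise UntrustedImage("unexpected owner")
        if st.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
            raise UntrustedImage("file is writable by group or others")
        with open(fd, "rb", closefd=False) as f:
            return f.read()
    finally:
        os.close(fd)
```

The checks run on the already opened descriptor rather than on the path, so
the file cannot be swapped between the check and the read. Whoever places the
file in the trusted directory (libvirt, in the proposal above) remains the
part that actually establishes trust.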



Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-21 Thread Tian, Kevin
> From: Wang, Zhi A
> Sent: Wednesday, August 22, 2018 2:43 AM
> >
> > Are there any suggestions how we can deal with security issues?
> > Allowing userspace to provide a data stream representing the internal
> > state of a virtual device model living within the kernel seems
> > troublesome.  If we need to trust the data stream, do we need to
> > somehow make the operation more privileged than what a vfio user
> > might have otherwise?  Does the data stream need to be somehow signed
> > and how might we do that?  How can we build in protection against an
> > untrusted restore image?  Thanks,

IMO it is not necessary. Restoring mdev state should be handled as if the
guest were programming the mdev. Then all the audits/security checks
enforced in the normal emulation path still apply. The vendor driver
may choose to audit every state restore operation one-by-one, or
do it altogether at a synchronization point (e.g. when the mdev is
rescheduled, similar to what we did before VMENTRY).
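
To make the "restore as if the guest is programming the mdev" idea concrete,
here is a toy sketch in Python. It is not GVT-g code; the class, the register
offsets and the limits are all invented for illustration. The point is only
that restore replays writes through the same audited entry point that guest
writes use, and commits nothing unless every write passes.

```python
class AuditError(Exception):
    """Raised when a write fails the emulation-path audit."""


class ToyMdev:
    # Hypothetical writable registers and the largest value each accepts.
    LIMITS = {0x00: 0xFF, 0x04: 0xFFFF}

    def __init__(self):
        self.regs = {}

    def emulated_write(self, offset, value):
        """Single audited entry point, shared by guest writes and restore."""
        limit = self.LIMITS.get(offset)
        if limit is None:
            raise AuditError(f"write to unknown register {offset:#x}")
        if not 0 <= value <= limit:
            raise AuditError(f"value {value:#x} out of range for {offset:#x}")
        self.regs[offset] = value

    def restore(self, image):
        """Replay (offset, value) pairs; commit only if all of them pass."""
        staged = ToyMdev()
        for offset, value in image:
            staged.emulated_write(offset, value)
        self.regs = staged.regs
```

This covers only state that can be expressed as guest-visible writes. It says
nothing about internal emulation state that has no such write path, which is
the case raised elsewhere in this thread.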

> What a good point!
> 
> I looked into the kernel module security case, which seems similar to this
> one. The security of loading a kernel module relies on root privilege and
> a signature.
> 
> For root privilege, QEMU could run as non-root under libvirtd, so this
> wouldn't be an option.
> 
> For signatures, I am wondering if there are any similar cases in other
> kernel components, like KVM or other modules which provide ioctls to
> userspace. Maybe they don't even load a binary from userspace, but
> they could suffer from a DDoS flood from userspace. Maybe some ioctls or
> interfaces in the kernel should only allow signed/trusted userspace
> applications to call them. (Previously it was "allow signed kernel modules to load".)
> 
> Thanks,
> Zhi.
> 
> >
> > Alex
> >



Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-21 Thread Zhi Wang




On 08/21/18 07:08, Alex Williamson wrote:
> On Sun, 19 Aug 2018 22:25:19 +0800
> Zhi Wang  wrote:
>
> > Share some updates of my work on this topic recently:
> >
> > Thanks for Erik's guidance and advice. Now my PoC patches almost work.
> > Will send the RFC soon.
> >
> > Mostly the ideas are based on Alex's idea: a match between a device
> > state version and a minimum required version
> >
> >
> > "Match of versions" in Libvirt
> >
> > Initialization stage:
> >
> > - Libvirt would detect if there is any device state version in a
> > "mdev_type" of a mediated device when creating a mdev node in node
> > device tree.
> >   - If the "mdev_type" of a mediated device *has* a device state version,
> > then this mediated device supports migration.
> >   - If not, (compatibility case, mostly for old vendor drivers which
> > don't support migration), this mediated device doesn't support migration
> >
> > Migration stage:
> >
> > - Libvirt would put the mdev information inside cookies and send them
> > between src machine and dst machine. So a new type of cookie would be
> > added here.
> >
> > There are different versions of migration protocols in libvirt. Each of
> > them starts to send cookies in different sequence. The idea here is to
> > let the match happen as early as possible. It looks like the QEMU driver
> > in libvirt only supports the V2/V3 protocols.
> >
> >
> > V2 proto:
> >
> > - The match would happen in SRC machine after the DST machine transfers
> > the cookies with mdev information back to the SRC machine during the
> > "preparation" stage. The disadvantage is the DST virtual machine has
> > already been created in "preparation" stage. If the match fails, the
> > virtual machine in DST machine has to be killed as well, which would
> > waste some time.
> >
> > V3 proto:
> >
> > - The match would happen in DST machine after the SRC machine transfers
> > the cookies to the DST machine during the "begin" stage. As the DST
> > machine hasn't entered into "preparation" stage at this time, the
> > virtual machine hasn't been created in DST machine at this point. No
> > extra VM destroy is needed if the match fails. This would be the ideal
> > place for a match.
> >
> > "Match of version" in QEMU level
> >
> > There are several different types of migration in libvirt. In a
> > migration with hypervisor native transport, the target machine might
> > not even have libvirtd; the migration happens between device models
> > directly. So we need a match at the QEMU level as well. We might still need
> > Kirti's approach as the last level match.
>
> The kernel and vendor driver will always have a last opportunity to nak
> a migration, the purpose of making certain information readily
> available to libvirt is only to allow userspace some insight into where
> a migration is likely to be successful.  Even if we expose these things
> to userspace, it's the kernel's responsibility to validate the
> migration data.

Yes. The vendor driver should be the last keeper to nak a migration. It
should be implemented inside the vendor driver.

> In fact, pushing state information for a device into
> the kernel would seem to be a massive security target.  For instance
> how many vulnerabilities might a malicious user be able to exploit in
> the code that parses the device specific state information?  How do we
> even detect non-malicious user errors, like trying to migrate GVTg
> device state to an NVIDIA vGPU?

For now, we only depend on mdev_type, after the discussion of vendor id
or device id.

> The latter at least suggests that the kernel needs to perform the same
> set of validation that we're trying to enable userspace to do.
> Cornelia also mentioned that some mdev devices are more or less shells
> within which a device is configured, such as ccw and likely the crypto
> ap devices.  In those cases the mdev type might not be sufficient meta
> data about what we're dealing with.  This might suggest some sort of
> header within the migration region parsed by common code for basic
> validation.

Yes. If we could validate it earlier, all the better, since we would not
need to wait until the DST machine starts the VM and tries to load the
first state.

> Are there any suggestions how we can deal with security issues?
> Allowing userspace to provide a data stream representing the internal
> state of a virtual device model living within the kernel seems
> troublesome.  If we need to trust the data stream, do we need to
> somehow make the operation more privileged than what a vfio user might
> have otherwise?  Does the data stream need to be somehow signed and how
> might we do that?  How can we build in protection against an untrusted
> restore image?  Thanks,

What a good point!

I looked into the kernel module security case, which seems similar to this
one. The security of loading a kernel module relies on root privilege and
a signature.

For root privilege, QEMU could run as non-root under libvirtd, so this
wouldn't be an option.

For signatures, I am wondering if there are any similar cases in other
kernel components, like KVM or other modules which provide ioctls to
userspace. Maybe they don't even load a binary from userspace, but
they could suffer from a DDoS flood from userspace. Maybe some

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-20 Thread Alex Williamson
On Sun, 19 Aug 2018 22:25:19 +0800
Zhi Wang  wrote:

> Share some updates of my work on this topic recently:
> 
> Thanks for Erik's guidance and advice. Now my PoC patches almost work.
> Will send the RFC soon.
> 
> Mostly the ideas are based on Alex's idea: a match between a device 
> state version and a minimum required version
> 
> 
> "Match of versions" in Libvirt
> 
> Initialization stage:
> 
> - Libvirt would detect if there is any device state version in a 
> "mdev_type" of a mediated device when creating a mdev node in node 
> device tree.
>   - If the "mdev_type" of a mediated device *has* a device state version, 
> then this mediated device supports migration.
>   - If not, (compatibility case, mostly for old vendor drivers which 
> don't support migration), this mediated device doesn't support migration
> 
> Migration stage:
> 
> - Libvirt would put the mdev information inside cookies and send them 
> between src machine and dst machine. So a new type of cookie would be 
> added here.
> 
> There are different versions of migration protocols in libvirt. Each of
> them starts to send cookies in different sequence. The idea here is to
> let the match happen as early as possible. It looks like the QEMU driver
> in libvirt only supports the V2/V3 protocols.
> 
> 
> V2 proto:
> 
> - The match would happen in SRC machine after the DST machine transfers 
> the cookies with mdev information back to the SRC machine during the 
> "preparation" stage. The disadvantage is the DST virtual machine has 
> already been created in "preparation" stage. If the match fails, the 
> virtual machine in DST machine has to be killed as well, which would 
> waste some time.
> 
> V3 proto:
> 
> - The match would happen on the DST machine after the SRC machine transfers
> the cookies to the DST machine during the "begin" stage. As the DST
> machine hasn't entered the "preparation" stage at this time, the
> virtual machine hasn't been created on the DST machine yet. No
> extra VM destroy is needed if the match fails. This would be the ideal
> place for a match.
> 
> "Match of version" in QEMU level
> 
> There are several different types of migration in libvirt. In a
> migration with hypervisor-native transport, the target machine might
> not even have libvirtd, as the migration happens between device models
> directly. So we need a match at the QEMU level as well. We might still need
> Kirti's approach as the last-level match.

The kernel and vendor driver will always have a last opportunity to nak
a migration; the purpose of making certain information readily
available to libvirt is only to allow userspace some insight into where
a migration is likely to be successful.  Even if we expose these things
to userspace, it's the kernel's responsibility to validate the
migration data.  In fact, pushing state information for a device into
the kernel would seem to be a massive security target.  For instance,
how many vulnerabilities might a malicious user be able to exploit in
the code that parses the device-specific state information?  How do we
even detect non-malicious user errors, like trying to migrate GVTg
device state to an NVIDIA vGPU?

The latter at least suggests that the kernel needs to perform the same
set of validation that we're trying to enable userspace to do.
Cornelia also mentioned that some mdev devices are more or less shells
within which a device is configured, such as ccw and likely the crypto
ap devices.  In those cases the mdev type might not be sufficient
metadata about what we're dealing with.  This might suggest some sort of
header within the migration region, parsed by common code for basic
validation.
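
As a rough illustration of the header idea above, common code could reject
an obviously mismatched stream before any vendor-specific parsing runs.
Nothing like this exists in VFIO as far as this thread goes: the layout,
the magic value and all field names below are assumptions invented for
the sketch.

```python
import struct

# Hypothetical header layout, invented for illustration only:
#   4s   magic
#   32s  vendor driver name, NUL padded
#   32s  mdev type name, NUL padded
#   I    device state version
#   I    payload length in bytes
HEADER = struct.Struct("<4s32s32sII")
MAGIC = b"MDEV"


def pack_header(driver, mdev_type, version, payload_len):
    """Build the hypothetical header on the source side."""
    return HEADER.pack(MAGIC, driver.encode(), mdev_type.encode(),
                       version, payload_len)


def validate_header(blob, driver, mdev_type, min_version, max_version):
    """Return the payload if the header matches this device, else raise."""
    if len(blob) < HEADER.size:
        raise ValueError("stream shorter than header")
    magic, src_driver, src_type, version, payload_len = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if src_driver.rstrip(b"\0").decode(errors="replace") != driver:
        raise ValueError("vendor driver mismatch")
    if src_type.rstrip(b"\0").decode(errors="replace") != mdev_type:
        raise ValueError("mdev type mismatch")
    if not (min_version <= version <= max_version):
        raise ValueError("unsupported device state version")
    if payload_len != len(blob) - HEADER.size:
        raise ValueError("payload length mismatch")
    # Only now would the opaque payload be handed to the vendor driver.
    return blob[HEADER.size:]
```

With such a check, the GVTg-to-NVIDIA mistake would fail at the vendor
driver comparison.  It only catches honest errors, though; it does nothing
about a malicious payload that carries a well-formed header.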

Are there any suggestions how we can deal with security issues?
Allowing userspace to provide a data stream representing the internal
state of a virtual device model living within the kernel seems
troublesome.  If we need to trust the data stream, do we need to
somehow make the operation more privileged than what a vfio user might
have otherwise?  Does the data stream need to be somehow signed and how
might we do that?  How can we build in protection against an untrusted
restore image?  Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-19 Thread Zhi Wang

Sharing some recent updates on my work on this topic:

Thanks to Erik for his guidance and advice. My PoC patches almost work now;
I will send the RFC soon.


The ideas are mostly based on Alex's suggestion: a match between a device
state version and a minimum required version.



"Match of versions" in Libvirt

Initialization stage:

- Libvirt would detect whether the "mdev_type" of a mediated device has a
device state version when creating an mdev node in the node device tree.
	- If the "mdev_type" of a mediated device *has* a device state version,
then this mediated device supports migration.
	- If not (the compatibility case, mostly for old vendor drivers which
don't support migration), this mediated device doesn't support migration.
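
A minimal sketch of the detection step above, assuming the vendor driver
exposes the version as a sysfs attribute inside the mdev type directory
(under mdev_supported_types). The attribute name "device_state_version" is
hypothetical; the PoC patches may name or place it differently.

```python
from pathlib import Path

# Hypothetical attribute name, assumed for this sketch only.
VERSION_ATTR = "device_state_version"


def read_device_state_version(mdev_type_dir):
    """Return the version string, or None if the type has no such attribute."""
    attr = Path(mdev_type_dir) / VERSION_ATTR
    try:
        return attr.read_text().strip()
    except FileNotFoundError:
        # Compatibility case: an old vendor driver without migration support.
        return None


def supports_migration(mdev_type_dir):
    """A mediated device is migratable only if its type carries a version."""
    return read_device_state_version(mdev_type_dir) is not None
```

The point of the sketch is only that the absence of the attribute is
treated as "no migration support" rather than as an error.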


Migration stage:

- Libvirt would put the mdev information inside cookies and send them 
between src machine and dst machine. So a new type of cookie would be 
added here.


There are different versions of the migration protocol in libvirt. Each of
them sends cookies in a different sequence. The idea here is to let the
match happen as early as possible. It looks like the QEMU driver in
libvirt only supports the V2/V3 protocols.



V2 proto:

- The match would happen on the SRC machine after the DST machine transfers
the cookies with mdev information back to the SRC machine during the
"preparation" stage. The disadvantage is that the DST virtual machine has
already been created in the "preparation" stage. If the match fails, the
virtual machine on the DST machine has to be killed as well, which would
waste some time.


V3 proto:

- The match would happen on the DST machine after the SRC machine transfers
the cookies to the DST machine during the "begin" stage. As the DST
machine hasn't entered the "preparation" stage at this time, the
virtual machine hasn't been created on the DST machine yet. No
extra VM destroy is needed if the match fails. This would be the ideal
place for a match.
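
The check the DST machine would run on the cookie from the "begin" stage
could look roughly like the sketch below. The dictionaries are hypothetical
stand-ins for the new cookie type (real libvirt cookies are XML), and the
"source version must be at least the destination's minimum required
version" rule is my reading of the idea, not a settled interface.

```python
def parse_version(text):
    """Turn "5.3" into (5, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def match_mdev_cookie(src_cookie, dst_info):
    """Decide on the destination whether the source's mdev can be restored.

    src_cookie: {"mdev_type": str, "state_version": str}, sent by the source
    dst_info:   {"mdev_type": str, "min_required_version": str}
    Both layouts are hypothetical.
    """
    if src_cookie["mdev_type"] != dst_info["mdev_type"]:
        return False
    return parse_version(src_cookie["state_version"]) >= parse_version(
        dst_info["min_required_version"]
    )
```

Comparing parsed tuples rather than raw strings avoids "5.10" sorting
before "5.9".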


"Match of version" in QEMU level

There are several different types of migration in libvirt. In a
migration with hypervisor-native transport, the target machine might
not even have libvirtd, as the migration happens between device models
directly. So we need a match at the QEMU level as well. We might still need
Kirti's approach as the last-level match.


Thanks,
Zhi.

On 08/11/18 05:28, Zhi Wang wrote:

Hi Alex and Kirti:

Thanks for your reply and discussion. :)  Sorry for my late reply; there was
quite some work and email to catch up on after my vacation.


From my point of view, failing the migration because of a version mismatch
at different levels has different pros and cons.


- Match version at the userspace toolkit level, like in QEMU and Libvirt:

Pros: Better responsiveness, since the version match would be figured out
before actually suspending/resuming devices. All the userspace toolkits
could provide this information to the UI or other management tools, like
virsh and virt-manager, so it would be helpful for the administrator to
know what's happening through the management interface.


Cons: The vendor driver has to expose the version information. Some vendor
drivers might not wish to expose that explicitly. Considering that mdev
could be highly related to different vendors and different devices, this
might happen in the future as well.


- Match version at the device state level (vendor-specific):

Pros: The vendor driver doesn't need to explain and expose an explicit
version of the device state.


Cons: Waste of bandwidth. Poor responsiveness and less informative.

How about we combine the two ideas? The vendor driver could decide whether
to use the device state or not. But still, the error information could be a
problem, since it could be hard for a management tool like virsh or
virt-manager to get an error message from a remote node.


Let me cook up some RFC patches next week.

Have a great weekend. :)

Thanks,
Zhi.

-Original Message-
From: Alex Williamson [mailto:alex.william...@redhat.com] Sent: Monday, 
August 6, 2018 10:22 PM

To: Kirti Wankhede 
Cc: Wang, Zhi A ; libvir-list@redhat.com
Subject: Re: Matching the type of mediated devices in the migration

On Mon, 6 Aug 2018 23:45:21 +0530
Kirti Wankhede  wrote:


On 8/3/2018 11:26 PM, Alex Williamson wrote:
> On Fri, 3 Aug 2018 12:07:58 +
> "Wang, Zhi A"  wrote:
>
>> Hi:
>>
>> Thanks for unfolding your idea. The picture is clearer to me now. I
>> didn't realize that you also want to support cross hardware migration.
>> Well, I thought for a while, the cross hardware migration might be not
>> popular in vGPU case but could be quite popular in other mdev cases.
>
> Exactly, we need to think beyond the implementation for a specific
> vendor or class of device.
>
>> Let me continue my summary:
>>
>> Mdev dev type has already included a parent driver name/a group
>> name/physical device version/configuration type. For example
>> i915-GVTg_V5_4. The driver name and the group name could already
>> distinguish the vendor and the product 

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-10 Thread Zhi Wang

Hi Alex and Kirti:

Thanks for your reply and discussion. :)  Sorry for my late reply; there was
quite some work and email to catch up on after my vacation.


From my point of view, failing the migration because of a version mismatch
at different levels has different pros and cons.


- Match version at the userspace toolkit level, like in QEMU and Libvirt:

Pros: Better responsiveness, since the version match would be figured out
before actually suspending/resuming devices. All the userspace toolkits
could provide this information to the UI or other management tools, like
virsh and virt-manager, so it would be helpful for the administrator to
know what's happening through the management interface.


Cons: The vendor driver has to expose the version information. Some vendor
drivers might not wish to expose that explicitly. Considering that mdev
could be highly related to different vendors and different devices, this
might happen in the future as well.


- Match version at the device state level (vendor-specific):

Pros: The vendor driver doesn't need to explain and expose an explicit
version of the device state.


Cons: Waste of bandwidth. Poor responsiveness and less informative.

How about we combine the two ideas? The vendor driver could decide whether
to use the device state or not. But still, the error information could be a
problem, since it could be hard for a management tool like virsh or
virt-manager to get an error message from a remote node.


Let me cook up some RFC patches next week.

Have a great weekend. :)

Thanks,
Zhi.

-Original Message-
From: Alex Williamson [mailto:alex.william...@redhat.com] Sent: Monday, 
August 6, 2018 10:22 PM

To: Kirti Wankhede 
Cc: Wang, Zhi A ; libvir-list@redhat.com
Subject: Re: Matching the type of mediated devices in the migration

On Mon, 6 Aug 2018 23:45:21 +0530
Kirti Wankhede  wrote:


On 8/3/2018 11:26 PM, Alex Williamson wrote:
> On Fri, 3 Aug 2018 12:07:58 +
> "Wang, Zhi A"  wrote:
>
>> Hi:
>>
>> Thanks for unfolding your idea. The picture is clearer to me now. I didn't
>> realize that you also want to support cross hardware migration. Well, I
>> thought for a while, the cross hardware migration might be not popular in
>> vGPU case but could be quite popular in other mdev cases.
>
> Exactly, we need to think beyond the implementation for a specific
> vendor or class of device.
>
>> Let me continue my summary:
>>
>> Mdev dev type has already included a parent driver name/a group
>> name/physical device version/configuration type. For example
>> i915-GVTg_V5_4. The driver name and the group name could already
>> distinguish the vendor and the product between different mdevs, e.g.
>> between Intel and Nvidia, between vGPU or vOther.
>
> Note that there are only two identifiers here, a vendor driver and a
> type.  We included the vendor driver to avoid namespace collisions
> between vendors.  The type itself should be considered opaque
> regardless of how a specific vendor makes use of it.
>
>> Each device provides a collection of the version of device state of data
>> stream in a preferred order in a mdev type, as newer version of device
>> state might contains more information which might help on performances.
>>
>> Let's say a new device N and an old device O, they both support mdev_type
>> M.
>>
>> For example:
>> Device N is newer and supports the versions of device state: [ 6.3 6.2
>> 6.1 ] in mdev type M
>> Device O is older and supports the versions of device state: [ 5.3 5.2
>> 5.1 ] in mdev type M
>>
>> - Version scheme of device state in backwards compatibility case: Migrate
>> a VM from a VM with device O to a VM with device N, the mdev type is M.
>>
>> Device N: [ 6.3 6.2 6.1 5.3 ] in M
>> Device O: [ 5.3 5.2 5.1 ] in M
>> Version used in migration: 5.3
>> The new device directly supports mdev_type M with the preferred version on
>> Device O. Good, best situation.
>>
>> Device N: [ 6.3 6.2 6.1 5.2 ] in M
>> Device O: [ 5.3 5.2 5.1 ] in M
>> Version used in migration: 5.2
>> The new device supports mdev_type M, but not the preferred version. After
>> the migration, the vendor driver might have to disable some features which
>> is not mentioned in 5.2 device state. But this totally depends on the
>> vendor driver. If user wish to achieve the best experience, he should
>> update the vendor driver in device N, which supports the preferred version
>> on device O.
>>
>> Device N: [ 6.3 6.2 6.1 ] in M
>> Device O: [ 5.3 5.2 5.1 ] in M
>> Version used in migration: None
>> No version is matched. Migration would fail. User should update the vendor
>> driver on device N and device O.
>>
>> - Version scheme of device state in forwards compatibility case: Migrate a
>> VM from a VM with N to a VM with device O, the mdev type is M.
>>
>> Device N: [ 6.3 6.2 6.1 ] in M
>> Device O: [ 5.3 5.2 5.1 ] in M, but the user updates the vendor
>> driver on device O. Now device O could support [ 5.3 5.2 5.1 6.1 ]
>> (As an old device, the Device O still prefers version 

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-06 Thread Alex Williamson
On Mon, 6 Aug 2018 23:45:21 +0530
Kirti Wankhede  wrote:

> On 8/3/2018 11:26 PM, Alex Williamson wrote:
> > On Fri, 3 Aug 2018 12:07:58 +
> > "Wang, Zhi A"  wrote:
> >   
> >> Hi:
> >>
> >> Thanks for unfolding your idea. The picture is clearer to me now. I didn't 
> >> realize that you also want to support cross hardware migration. Well, I 
> >> thought for a while, the cross hardware migration might be not popular in 
> >> vGPU case but could be quite popular in other mdev cases.  
> > 
> > Exactly, we need to think beyond the implementation for a specific
> > vendor or class of device.
> >
> >> Let me continue my summary:
> >>
> >> Mdev dev type has already included a parent driver name/a group 
> >> name/physical device version/configuration type. For example 
> >> i915-GVTg_V5_4. The driver name and the group name could already 
> >> distinguish the vendor and the product between different mdevs, e.g. 
> >> between Intel and Nvidia, between vGPU or vOther.  
> > 
> > Note that there are only two identifiers here, a vendor driver and a
> > type.  We included the vendor driver to avoid namespace collisions
> > between vendors.  The type itself should be considered opaque regardless
> > of how a specific vendor makes use of it.
> >   
> >> Each device provides a collection of the version of device state of data 
> >> stream in a preferred order in a mdev type, as newer version of device 
> >> state might contains more information which might help on performances. 
> >>
> >> Let's say a new device N and an old device O, they both support mdev_type 
> >> M.
> >>
> >> For example:
> >> Device N is newer and supports the versions of device state: [ 6.3  6.2 
> >> .6.1 ] in mdev type M
> >> Device O is older and supports the versions of device state: [ 5.3 5.2 5.1 
> >> ] in mdev type M
> >>
> >> - Version scheme of device state in backwards compatibility case: Migrate 
> >> a VM from a VM with device O to a VM with device N, the mdev type is M.
> >>
> >> Device N: [ 6.3 6.2 6.1 5.3 ] in M
> >> Device O: [ 5.3 5.2 5.1 ] in M
> >> Version used in migration: 5.3
> >> The new device directly supports mdev_type M with the preferred version on 
> >> Device O. Good, best situation.
> >>
> >> Device N: [ 6.3 6.2 6.1 5.2 ] in M
> >> Device O: [ 5.3 5.2 5.1 ] in M
> >> Version used in migration: 5.2
> >> The new device supports mdev_type M, but not the preferred version. After 
> >> the migration, the vendor driver might have to disable some features which 
> >> is not mentioned in 5.2 device state. But this totally depends on the 
> >> vendor driver. If user wish to achieve the best experience, he should 
> >> update the vendor driver in device N, which supports the preferred version 
> >> on device O.
> >>
> >> Device N: [ 6.3 6.2 6.1 ] in M
> >> Device O: [ 5.3 5.2 5.1 ] in M
> >> Version used in migration: None
> >> No version is matched. Migration would fail. User should update the vendor 
> >> driver on device N and device O.
> >>
> >> - Version scheme of device state in forwards compatibility case: Migrate a 
> >> VM from a VM with N to a VM with device O, the mdev type is M.
> >>
> >> Device N: [ 6.3 6.2 .6.1 ] in M
> >> Device O: [ 5.3 5.2 5.1 ] in M, but the user updates the vendor driver on 
> >> device O. Now device O could support [ 5.3 5.2 5.1 6.1 ] (As an old 
> >> device, the Device O still prefers version 5.3)
> >> Version used in migration: 6.1
> >> As the new device states is going to migrate to an old device, the vendor 
> >> driver on old device might have to specially dealing with the new version 
> >> of device state. It depends on the vendor driver. 
> >>
> >> - QEMU has to figure out and choose the version of device states before 
> >> reading device state from the region. (Perhaps we can put the option of 
> >> selection in the control part of the region as well)
> >> - Libvirt will check if there is any match of the version in the 
> >> collection in device O and device N before migration.
> >> - Each mdev_type has its own collection of versions. (Device can support 
> >> different versions in different types)
> >> - Better the collection is not a range, better they could be a collection 
> >> of the version strings. (The vendor driver might drop some versions during 
> >> the upgrade since they are not ideal)  
> > 
> > I believe that QEMU has always avoided trying to negotiate a migration
> > version.  We can only negotiate if the target is online and since a
> > save/restore is essentially an offline migration, there's no
> > opportunity for negotiation.  Therefore I think we need to assume the
> > source version is fixed.  If we need to expose an older migration
> > interface, I think we'd need to consider instantiating the mdev with
> > that specification or configuring it via attributes before usage, just
> > like QEMU does with specifying a machine type version.
> > 
> > Providing an explicit list of compatible versions also seems like it
> > could quickly get out of hand, 

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-06 Thread Kirti Wankhede



On 8/3/2018 11:26 PM, Alex Williamson wrote:
> On Fri, 3 Aug 2018 12:07:58 +
> "Wang, Zhi A"  wrote:
> 
>> Hi:
>>
>> Thanks for unfolding your idea. The picture is clearer to me now. I didn't 
>> realize that you also want to support cross hardware migration. Well, I 
>> thought for a while, the cross hardware migration might be not popular in 
>> vGPU case but could be quite popular in other mdev cases.
> 
> Exactly, we need to think beyond the implementation for a specific
> vendor or class of device.
>  
>> Let me continue my summary:
>>
>> Mdev dev type has already included a parent driver name/a group 
>> name/physical device version/configuration type. For example i915-GVTg_V5_4. 
>> The driver name and the group name could already distinguish the vendor and 
>> the product between different mdevs, e.g. between Intel and Nvidia, between 
>> vGPU or vOther.
> 
> Note that there are only two identifiers here, a vendor driver and a
> type.  We included the vendor driver to avoid namespace collisions
> between vendors.  The type itself should be considered opaque regardless
> of how a specific vendor makes use of it.
> 
>> Each device provides a collection of the version of device state of data 
>> stream in a preferred order in a mdev type, as newer version of device state 
>> might contains more information which might help on performances. 
>>
>> Let's say a new device N and an old device O, they both support mdev_type M.
>>
>> For example:
>> Device N is newer and supports the versions of device state: [ 6.3  6.2 .6.1 
>> ] in mdev type M
>> Device O is older and supports the versions of device state: [ 5.3 5.2 5.1 ] 
>> in mdev type M
>>
>> - Version scheme of device state in backwards compatibility case: Migrate a 
>> VM from a VM with device O to a VM with device N, the mdev type is M.
>>
>> Device N: [ 6.3 6.2 6.1 5.3 ] in M
>> Device O: [ 5.3 5.2 5.1 ] in M
>> Version used in migration: 5.3
>> The new device directly supports mdev_type M with the preferred version on 
>> Device O. Good, best situation.
>>
>> Device N: [ 6.3 6.2 6.1 5.2 ] in M
>> Device O: [ 5.3 5.2 5.1 ] in M
>> Version used in migration: 5.2
>> The new device supports mdev_type M, but not the preferred version. After 
>> the migration, the vendor driver might have to disable some features which 
>> is not mentioned in 5.2 device state. But this totally depends on the vendor 
>> driver. If user wish to achieve the best experience, he should update the 
>> vendor driver in device N, which supports the preferred version on device O.
>>
>> Device N: [ 6.3 6.2 6.1 ] in M
>> Device O: [ 5.3 5.2 5.1 ] in M
>> Version used in migration: None
>> No version is matched. Migration would fail. User should update the vendor 
>> driver on device N and device O.
>>
>> - Version scheme of device state in forwards compatibility case: Migrate a 
>> VM from a VM with N to a VM with device O, the mdev type is M.
>>
>> Device N: [ 6.3 6.2 .6.1 ] in M
>> Device O: [ 5.3 5.2 5.1 ] in M, but the user updates the vendor driver on 
>> device O. Now device O could support [ 5.3 5.2 5.1 6.1 ] (As an old device, 
>> the Device O still prefers version 5.3)
>> Version used in migration: 6.1
>> As the new device states is going to migrate to an old device, the vendor 
>> driver on old device might have to specially dealing with the new version of 
>> device state. It depends on the vendor driver. 
>>
>> - QEMU has to figure out and choose the version of device states before 
>> reading device state from the region. (Perhaps we can put the option of 
>> selection in the control part of the region as well)
>> - Libvirt will check if there is any match of the version in the collection 
>> in device O and device N before migration.
>> - Each mdev_type has its own collection of versions. (Device can support 
>> different versions in different types)
>> - Better the collection is not a range, better they could be a collection of 
>> the version strings. (The vendor driver might drop some versions during the 
>> upgrade since they are not ideal)
> 
> I believe that QEMU has always avoided trying to negotiate a migration
> version.  We can only negotiate if the target is online and since a
> save/restore is essentially an offline migration, there's no
> opportunity for negotiation.  Therefore I think we need to assume the
> source version is fixed.  If we need to expose an older migration
> interface, I think we'd need to consider instantiating the mdev with
> that specification or configuring it via attributes before usage, just
> like QEMU does with specifying a machine type version.
> 
> Providing an explicit list of compatible versions also seems like it
> could quickly get out of hand, imagine a driver with regular releases
> that maintains compatibility for years.  The list could get
> unmanageable.
> 
> To be honest, I'm pretty dubious whether vendors will actually implement
> cross version migration, or really consider migration compatibility at
> all, 

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-05 Thread Wang, Zhi A
Hi:

Thanks for elaborating on your idea. The picture is clearer to me now. I didn't
realize that you also want to support cross-hardware migration. After thinking
about it for a while, cross-hardware migration might not be popular in the vGPU
case but could be quite popular in other mdev cases.

Let me continue my summary:

The mdev type already includes a parent driver name, a group name, a physical
device version and a configuration type, for example i915-GVTg_V5_4. The driver
name and the group name can already distinguish the vendor and the product
between different mdevs, e.g. between Intel and Nvidia, between vGPU or vOther.

Each device provides, per mdev type, a collection of device state (data stream)
versions in preferred order, as a newer version of the device state might
contain more information, which might help performance.

Let's say a new device N and an old device O both support mdev_type M.

For example:
Device N is newer and supports the versions of device state: [ 6.3 6.2 6.1 ]
in mdev type M
Device O is older and supports the versions of device state: [ 5.3 5.2 5.1 ] in
mdev type M

- Version scheme of device state in the backwards compatibility case: Migrate
from a VM with device O to a VM with device N; the mdev type is M.

Device N: [ 6.3 6.2 6.1 5.3 ] in M
Device O: [ 5.3 5.2 5.1 ] in M
Version used in migration: 5.3
The new device directly supports mdev_type M with the preferred version on 
Device O. Good, best situation.

Device N: [ 6.3 6.2 6.1 5.2 ] in M
Device O: [ 5.3 5.2 5.1 ] in M
Version used in migration: 5.2
The new device supports mdev_type M, but not the preferred version. After the
migration, the vendor driver might have to disable some features which are not
described in the 5.2 device state. But this totally depends on the vendor
driver. If the user wishes to achieve the best experience, they should update
the vendor driver for device N so that it supports the preferred version of
device O.

Device N: [ 6.3 6.2 6.1 ] in M
Device O: [ 5.3 5.2 5.1 ] in M
Version used in migration: None
No version is matched. Migration would fail. User should update the vendor 
driver on device N and device O.

- Version scheme of device state in the forwards compatibility case: Migrate
from a VM with device N to a VM with device O; the mdev type is M.

Device N: [ 6.3 6.2 6.1 ] in M
Device O: [ 5.3 5.2 5.1 ] in M, but the user updates the vendor driver on
device O. Now device O could support [ 5.3 5.2 5.1 6.1 ] (as an old device,
device O still prefers version 5.3)
Version used in migration: 6.1
As the new device state is going to be migrated to an old device, the vendor
driver on the old device might have to handle the new version of the device
state specially. It depends on the vendor driver.

- QEMU has to figure out and choose the version of the device state before
reading the device state from the region. (Perhaps we can put the selection
option in the control part of the region as well.)
- Libvirt will check whether there is any matching version in the collections
of device O and device N before migration.
- Each mdev_type has its own collection of versions. (A device can support
different versions in different types.)
- The collection should preferably not be a range but a collection of version
strings. (The vendor driver might drop some versions during an upgrade since
they are not ideal.)
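
To make the selection rule in the cases above concrete, here is a minimal
sketch. It assumes the source's preference order decides and that versions
are compared as opaque strings; the function name is made up for
illustration.

```python
def choose_version(source_versions, target_versions):
    """Return the first version in the source's preference order that the
    target also supports, or None if there is no common version."""
    supported = set(target_versions)
    for version in source_versions:
        if version in supported:
            return version
    return None


device_o = ["5.3", "5.2", "5.1"]

# Backwards compatibility cases: migrate from device O to device N.
print(choose_version(device_o, ["6.3", "6.2", "6.1", "5.3"]))  # 5.3
print(choose_version(device_o, ["6.3", "6.2", "6.1", "5.2"]))  # 5.2
print(choose_version(device_o, ["6.3", "6.2", "6.1"]))         # None

# Forwards compatibility case: migrate from device N to an updated device O.
print(choose_version(["6.3", "6.2", "6.1"],
                     ["5.3", "5.2", "5.1", "6.1"]))            # 6.1
```

Because the lists are plain collections of strings rather than ranges, a
vendor driver dropping a version simply removes one entry.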

That's the picture so far in my mind.

Thanks,
Zhi.

-Original Message-
From: Alex Williamson [mailto:alex.william...@redhat.com] 
Sent: Wednesday, August 1, 2018 8:19 PM
To: Wang, Zhi A 
Cc: libvir-list@redhat.com; kwankh...@nvidia.com
Subject: Re: Matching the type of mediated devices in the migration

On Wed, 1 Aug 2018 10:22:39 +
"Wang, Zhi A"  wrote:

> Hi:
> 
> Let me summarize the understanding so far I got from the discussions since I 
> am new to this discussion.
> 
> The mdev_type would be a generic stuff since we don't want userspace 
> application to be confused. The example of mdev_type is:

I don't think 'generic' is the right term here.  An mdev_type is a specific
thing with a defined interface; we just don't define what that interface is.
 
> There are several pre-defined mdev_types with different configurations, let's 
> say MDEV_TYPE A/B/C. The HW 1.0 might only support MDEV_TYPE A, the HW 2.0 
> might support both MDEV_TYPE A and B, but due to HW difference, we cannot 
> migrate MDEV_TYPE A with HW 1.0 to MDEV_TYPE A with HW 2.0 even they have the 
> same MDEV_TYPE. So we need a device version either in the existing MDEV_TYPE 
> or a new sysfs entry.

This is correct; if a foo_type_a is exposed by the same vendor driver on 
different hardware, then the vendor driver is guaranteeing those mdev devices 
are software compatible to the user.  Whether the vendor driver is willing or 
able to support migration across the underlying hardware is a separate 
question.  Migration compatibility and user compatibility are separate features.

> Libvirt would have to 

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-05 Thread Alex Williamson
On Tue, 31 Jul 2018 04:05:11 +0800
Zhi Wang  wrote:

> On 07/30/18 23:56, Alex Williamson wrote:
> > On Sun, 29 Jul 2018 21:19:41 +
> > "Wang, Zhi A"  wrote:
> >   
> >> BACKGROUND
> >>
> >> As the live migration of mdev is going to be supported in VFIO, a scheme 
> >> of deciding if a mdev could be migratable between the source machine and 
> >> the destination machine is needed. Mostly, this email is going to discuss 
> >> a possible solution which needs fewer modifications of libvirt/VFIO.
> >>
> >> The configuration of a mdev is located in the domain XML, which guides 
> >> libvirt how to find the mdev and generating the command line for QEMU. It 
> >> basically only includes the UUID of a mdev. The domain XML of the source 
> >> machine and destination machine are going to be compared before the 
> >> migration really happens. Each configuration item would be compared and 
> >> checked by libvirt. If one item of the source machine is different from 
> >> the item of destination machine, the migration fails. For mdev, there is 
> >> no any check/match before the migration happens yet.
> >>
> >> The user could use the node device list of libvirt to list the host 
> >> devices and see the capabilities of those devices. The current node device 
> >> code of libvirt has already been able to extract the supported mdev types 
> >> from a host PCI device, plus some basic information, like max supported 
> >> mdev instance of a host PCI device.
> >>
> >> THE SOLUTION
> >>
> >> To strictly check the mdev type and make sure the migration happens 
> >> between the compatible mediated devices, three new mandatory elements in 
> >> the domain XML below the hostdev element would be introduced:
> >>
> >> vendorid: The vendor ID of the mdev, which comes from the host PCI device. 
> >> A user could obtain this information from the host PCI device which 
> >> supports mdev in the node device list.
> >> productid: The product ID of the mdev, which also comes from the host PCI 
> >> device. A user could obtain this information from the same approach above. 
> >>  
> > 
> > The parent of an mdev device is not necessarily a PCI device.  
> Good point. I didn't get that.
> >   
> >> mdevtype: The type of the mdev. As the creation of the mdev is managed by 
> >> the user, the user knows the type of the mdev and would be responsible for 
> >> filling out this information.
> >>
> >> These three elements are only needed when the device API of a mdev is 
> >> "vfio-PCI". Take the example of mdev configuration from 
> >> https://libvirt.org/formatdomain.html to illustrate the modification:
> >>
> >>
> >>  
> >>  
> >>
> > >>    <vendorid>0xdead</vendorid>
> > >>    <productid>0xbeef</productid>
> > >>    <mdevtype>type</mdevtype>
> >>  
> >>  
> >>
> >> With the newly introduced elements above, the flow of the creation of a 
> >> domain XML with mdev will be like:
> >>
> >> 1. The user obtains the vendorid/productid from node device list
> >> 2. The user fills the vendorid/productid/mdevtype in the domain XML
> >> 3. When a migration happens, libvirt check these elements. If one item is 
> >> different between two domain XML, then migration fails.  
> > 
> > I don't see how this solves anything.  The vendor and product are
> > redundant and specific to PCI hosted mdev devices.  These do nothing
> > to enhance the definition of an mdev type, where we've decided the
> > mdev type is a guest software compatible definition of a device.
> > Simply knowing the type doesn't help me know that the state data
> > between source and target is compatible.  This is the difference
> > between knowing I'm migrating from machine 'pc-i440fx' to 'pc-i440fx'
> > versus 'pc-i440fx-2.12' to 'pc-i440fx-2.11'.  We need somehow to define
> > a version of a device, what we consider to be compatible versions for
> > migration, and hopefully some standard(ish) mechanism libvirt could
> > use to determine this.  Thanks,
> >   
> 
> I see your point. We could combine these stuff together and improve 
> "mdev" type, not by introducing new stuff to decide the compatibility. 
> Let me know if I misunderstood.
> 
> I guess you are now talking about "the thing" we should give libvirt. 
> Are you implying that the mdev type we give in libvirt should be a 
> string? If we could take the inspiration of PCI device? Like:
> 
> class name - vendor name - product name - version
> 
> mdev type  gpu-intel-gen9-11
> gpu-nvidia-grid-11
> 
> Then every mdev driver needs to fill these information and VFIO could 
> combine and expose them as the name of folder in mdev_supported_types. 
> Libvirt could address the mdev type by reading the mdev_type in UUID folder.

I don't think this is practical, the mdev vendor driver already
guarantees that a given mdev type is software compatible regardless of
the underlying hardware or driver version.  If it's not compatible in
these ways, different mdev types should be used.  If we then cross that
definition with migration compatibility then 

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-05 Thread Zhi Wang




On 07/30/18 23:56, Alex Williamson wrote:

On Sun, 29 Jul 2018 21:19:41 +
"Wang, Zhi A"  wrote:


BACKGROUND

As the live migration of mdev is going to be supported in VFIO, a scheme of 
deciding if a mdev could be migratable between the source machine and the 
destination machine is needed. Mostly, this email is going to discuss a 
possible solution which needs fewer modifications of libvirt/VFIO.

The configuration of a mdev is located in the domain XML, which tells libvirt 
how to find the mdev and generate the command line for QEMU. It basically 
only includes the UUID of a mdev. The domain XML of the source machine and 
that of the destination machine are compared before the migration really 
happens. Each configuration item is compared and checked by libvirt. If 
one item on the source machine differs from the item on the destination 
machine, the migration fails. For mdev, there is no check/match before the 
migration happens yet.

The user could use the node device list of libvirt to list the host devices and 
see the capabilities of those devices. The current node device code of libvirt 
has already been able to extract the supported mdev types from a host PCI 
device, plus some basic information, like max supported mdev instance of a host 
PCI device.

THE SOLUTION

To strictly check the mdev type and make sure the migration happens between the 
compatible mediated devices, three new mandatory elements in the domain XML 
below the hostdev element would be introduced:

vendorid: The vendor ID of the mdev, which comes from the host PCI device. A 
user could obtain this information from the host PCI device which supports mdev 
in the node device list.
productid: The product ID of the mdev, which also comes from the host PCI 
device. A user could obtain this information from the same approach above.


The parent of an mdev device is not necessarily a PCI device.

Good point. I didn't get that.



mdevtype: The type of the mdev. As the creation of the mdev is managed by the 
user, the user knows the type of the mdev and would be responsible for filling 
out this information.

These three elements are only needed when the device API of a mdev is 
"vfio-pci". Take the example of mdev configuration from 
https://libvirt.org/formatdomain.html to illustrate the modification:





   <vendorid>0xdead</vendorid>
   <productid>0xbeef</productid>
   <mdevtype>type</mdevtype>
 
 

With the newly introduced elements above, the flow of the creation of a domain 
XML with mdev will be like:

1. The user obtains the vendorid/productid from node device list
2. The user fills the vendorid/productid/mdevtype in the domain XML
3. When a migration happens, libvirt check these elements. If one item is 
different between two domain XML, then migration fails.
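
For illustration, the three-step flow above could be sketched as follows. This is only a sketch of the proposal as described: the element names vendorid/productid/mdevtype come from the proposal, while the surrounding XML layout, the helper names and the sample document are assumptions, and none of this is existing libvirt code.

```python
import xml.etree.ElementTree as ET

# Element names taken from the proposal; everything else here is hypothetical.
CHECKED_ELEMENTS = ("vendorid", "productid", "mdevtype")

# Hypothetical domain XML fragment; the real libvirt schema has no such elements.
SAMPLE_XML = """
<domain>
  <devices>
    <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
      <source>
        <vendorid>0xdead</vendorid>
        <productid>0xbeef</productid>
        <mdevtype>type</mdevtype>
      </source>
    </hostdev>
  </devices>
</domain>
"""


def mdev_fields(domain_xml):
    """Collect the proposed elements of every mdev hostdev, in document order."""
    root = ET.fromstring(domain_xml)
    fields = []
    for hostdev in root.iter("hostdev"):
        if hostdev.get("type") != "mdev":
            continue
        source = hostdev.find("source")
        values = []
        for name in CHECKED_ELEMENTS:
            text = source.findtext(name) if source is not None else None
            values.append((text or "").strip())
        fields.append(tuple(values))
    return fields


def mdev_config_matches(src_xml, dst_xml):
    """Step 3 of the flow: fail if any item differs between the two XMLs."""
    return mdev_fields(src_xml) == mdev_fields(dst_xml)
```

Note that this is exactly the plain equality check that the replies in this thread argue is insufficient, since it says nothing about whether the device state data is compatible.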


I don't see how this solves anything.  The vendor and product are
redundant and specific to PCI hosted mdev devices.  These do nothing
to enhance the definition of an mdev type, where we've decided the
mdev type is a guest software compatible definition of a device.
Simply knowing the type doesn't help me know that the state data
between source and target is compatible.  This is the difference
between knowing I'm migrating from machine 'pc-i440fx' to 'pc-i440fx'
versus 'pc-i440fx-2.12' to 'pc-i440fx-2.11'.  We need somehow to define
a version of a device, what we consider to be compatible versions for
migration, and hopefully some standard(ish) mechanism libvirt could
use to determine this.  Thanks,



I see your point. We could combine these pieces and improve the "mdev" 
type itself, rather than introducing new elements to decide compatibility. 
Let me know if I misunderstood.


I guess you are now talking about "the thing" we should give libvirt. 
Are you implying that the mdev type we give to libvirt should be a 
string? Could we take inspiration from PCI devices? Like:


   class name - vendor name - product name - version

mdev type  gpu-intel-gen9-11
   gpu-nvidia-grid-11

Then every mdev driver would need to fill in this information, and VFIO 
could combine it and expose it as the folder name in mdev_supported_types. 
Libvirt could identify the mdev type by reading mdev_type in the UUID folder.
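
A minimal sketch of the lookup mentioned above, assuming the usual mdev sysfs layout in which the mdev_type entry in the UUID folder is a symbolic link to the type's folder under mdev_supported_types. The function name and the default path argument are illustrative, not libvirt code.

```python
import os


def mdev_type_of(uuid, devices_root="/sys/bus/mdev/devices"):
    """Return the type folder name that <uuid>/mdev_type links to.

    devices_root is a parameter so the function can be pointed at a test
    tree; the default is the assumed sysfs location on a host with mdev.
    """
    link = os.path.join(devices_root, uuid, "mdev_type")
    # The link target is the type directory; its last component is the name.
    return os.path.basename(os.readlink(link))
```

With the naming proposed above, the returned string is what libvirt would compare between the source and the destination.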


BTW,

As far as I read the code, the migration check function checks quite a 
lot of things before the migration really happens, not only the machine 
type.


Mdev is listed as a sub-hierarchy of hostdev in the migration check 
function. "hostdev" in the code means "a host device", like a 
passthrough PCI device. The function checks the compatibility of the 
source device and the destination device by type, e.g. for a PCI 
passthrough device it checks the BDF. For mdev, it doesn't check 
anything right now. That's how this idea came about: let libvirt have 
something to check so it knows whether the mdevs on the source machine 
and the destination machine are compatible.


Simply knowing the type is not enough currently and we need prepare 
something to let libvirt check the 

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-03 Thread Alex Williamson
On Fri, 3 Aug 2018 12:07:58 +
"Wang, Zhi A"  wrote:

> Hi:
> 
> Thanks for unfolding your idea. The picture is clearer to me now. I didn't 
> realize that you also want to support cross hardware migration. Well, I 
> thought for a while, the cross hardware migration might be not popular in 
> vGPU case but could be quite popular in other mdev cases.

Exactly, we need to think beyond the implementation for a specific
vendor or class of device.
 
> Let me continue my summary:
> 
> Mdev dev type has already included a parent driver name/a group name/physical 
> device version/configuration type. For example i915-GVTg_V5_4. The driver 
> name and the group name could already distinguish the vendor and the product 
> between different mdevs, e.g. between Intel and Nvidia, between vGPU or 
> vOther.

Note that there are only two identifiers here, a vendor driver and a
type.  We included the vendor driver to avoid namespace collisions
between vendors.  The type itself should be considered opaque regardless
of how a specific vendor makes use of it.
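
To make the two-identifier point concrete, here is a small sketch that splits a type name such as the i915-GVTg_V5_4 example into its vendor driver part and its opaque type part. Splitting at the first hyphen is an assumption made for this example (it would misparse a driver name that itself contains a hyphen), and the function name is hypothetical.

```python
def split_mdev_type(name):
    """Split '<vendor driver>-<type>' at the first hyphen.

    The type part is returned untouched: it is opaque and must not be
    parsed any further by userspace.
    """
    driver, sep, type_id = name.partition("-")
    if not sep or not driver or not type_id:
        raise ValueError("not of the form <driver>-<type>: %r" % name)
    return driver, type_id


# split_mdev_type("i915-GVTg_V5_4") returns ("i915", "GVTg_V5_4")
```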

> Each device provides, per mdev type, a collection of device state data 
> stream versions in preferred order, as a newer version of the device state 
> might contain more information, which might help performance. 
> 
> Let's say there is a new device N and an old device O, and they both support 
> mdev_type M.
> 
> For example:
> Device N is newer and supports the versions of device state: [ 6.3 6.2 6.1 ] 
> in mdev type M
> Device O is older and supports the versions of device state: [ 5.3 5.2 5.1 ] 
> in mdev type M
> 
> - Version scheme of device state in backwards compatibility case: Migrate a 
> VM from a VM with device O to a VM with device N, the mdev type is M.
> 
> Device N: [ 6.3 6.2 6.1 5.3 ] in M
> Device O: [ 5.3 5.2 5.1 ] in M
> Version used in migration: 5.3
> The new device directly supports mdev_type M with the preferred version on 
> Device O. Good, best situation.
> 
> Device N: [ 6.3 6.2 6.1 5.2 ] in M
> Device O: [ 5.3 5.2 5.1 ] in M
> Version used in migration: 5.2
> The new device supports mdev_type M, but not the preferred version. After the 
> migration, the vendor driver might have to disable some features which are not 
> covered by the 5.2 device state. But this totally depends on the vendor driver. 
> If the user wishes to achieve the best experience, they should update the 
> vendor driver on device N so that it supports the preferred version of device O.
> 
> Device N: [ 6.3 6.2 6.1 ] in M
> Device O: [ 5.3 5.2 5.1 ] in M
> Version used in migration: None
> No version is matched. Migration would fail. User should update the vendor 
> driver on device N and device O.
> 
> - Version scheme of device state in forwards compatibility case: Migrate a VM 
> from a VM with device N to a VM with device O, the mdev type is M.
> 
> Device N: [ 6.3 6.2 6.1 ] in M
> Device O: [ 5.3 5.2 5.1 ] in M, but the user updates the vendor driver on 
> device O. Now device O could support [ 5.3 5.2 5.1 6.1 ] (As an old device, 
> the Device O still prefers version 5.3)
> Version used in migration: 6.1
> As the new device state is going to be migrated to an old device, the vendor 
> driver on the old device might have to handle the new version of the device 
> state specially. It depends on the vendor driver. 
> 
> - QEMU has to figure out and choose the version of device state before 
> reading the device state from the region. (Perhaps we can put the selection 
> option in the control part of the region as well)
> - Libvirt will check whether there is any matching version in the collections 
> of device O and device N before migration.
> - Each mdev_type has its own collection of versions. (A device can support 
> different versions in different types)
> - The collection should preferably not be a range but a set of version 
> strings. (The vendor driver might drop some versions during an upgrade since 
> they are not ideal)
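
The selection rule implied by the quoted examples can be sketched as below: walk the source device's versions in its preferred order and take the first one the target also lists. The function name is hypothetical and the version strings are the ones from the examples; this only illustrates the quoted proposal, not an agreed interface.

```python
def pick_state_version(source_preferred, target_supported):
    """Return the first version in the source's preference order that the
    target also supports, or None if there is no common version."""
    supported = set(target_supported)
    for version in source_preferred:
        if version in supported:
            return version
    return None


old = ["5.3", "5.2", "5.1"]

# Backwards compatibility cases (source is device O, target is device N).
assert pick_state_version(old, ["6.3", "6.2", "6.1", "5.3"]) == "5.3"
assert pick_state_version(old, ["6.3", "6.2", "6.1", "5.2"]) == "5.2"
assert pick_state_version(old, ["6.3", "6.2", "6.1"]) is None

# Forwards compatibility case (source is device N, target is updated device O).
assert pick_state_version(["6.3", "6.2", "6.1"], old + ["6.1"]) == "6.1"
```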

I believe that QEMU has always avoided trying to negotiate a migration
version.  We can only negotiate if the target is online and since a
save/restore is essentially an offline migration, there's no
opportunity for negotiation.  Therefore I think we need to assume the
source version is fixed.  If we need to expose an older migration
interface, I think we'd need to consider instantiating the mdev with
that specification or configuring it via attributes before usage, just
like QEMU does with specifying a machine type version.

Providing an explicit list of compatible versions also seems like it
could quickly get out of hand, imagine a driver with regular releases
that maintains compatibility for years.  The list could get
unmanageable.

To be honest, I'm pretty dubious whether vendors will actually implement
cross version migration, or really consider migration compatibility at
all, which is why I think we need to impose migration compatibility with
this sort of interface.  A vendor that doesn't want to support cross
version migration can simply 

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-01 Thread Alex Williamson
On Wed, 1 Aug 2018 10:22:39 +
"Wang, Zhi A"  wrote:

> Hi:
> 
> Let me summarize the understanding so far I got from the discussions since I 
> am new to this discussion.
> 
> The mdev_type would be a generic stuff since we don't want userspace 
> application to be confused. The example of mdev_type is:

I don't think 'generic' is the right term here.  An mdev_type is a
specific thing with a defined interface, we just don't define what that
interface is.
 
> There are several pre-defined mdev_types with different configurations, let's 
> say MDEV_TYPE A/B/C. The HW 1.0 might only support MDEV_TYPE A, the HW 2.0 
> might support both MDEV_TYPE A and B, but due to HW difference, we cannot 
> migrate MDEV_TYPE A with HW 1.0 to MDEV_TYPE A with HW 2.0 even they have the 
> same MDEV_TYPE. So we need a device version either in the existing MDEV_TYPE 
> or a new sysfs entry.

This is correct, if a foo_type_a is exposed by the same vendor
driver on different hardware, then the vendor driver is guaranteeing
those mdev devices are software compatible to the user.  Whether the
vendor driver is willing or able to support migration across the
underlying hardware is a separate question.  Migration compatibility
and user compatibility are separate features.

> Libvirt would have to check MDEV_TYPE match between source machine and 
> destination machine, then the device version. If any of them is different, 
> then it fails the migration.

Device version of what?  The hardware?  The mdev?  If the device
version represents a different software interface, then the mdev type
should be different.  If the device version represents a migration
interface compatibility then we should define it as such.

> If my above understanding is correct, for VFIO part, we could define the 
> device version as string or a magic number. For example, the vendor mdev 
> driver could pass the vendor/device id and a version to VFIO and VFIO could 
> expose them in the UUID sysfs no matter through a new sysfs entry or through 
> existing MDEV_TYPE.

As above, why are we trying to infer migration compatibility from a
device version?  What does a device version imply?  What if a vendor
driver wants to support cross version migration?

> I prefer to expose it in the mdev_supported_types, since the libvirt node 
> device list could extract the device version when it enumerates the host PCI 
> devices or other devices which support mdev. We can also put it into the UUID 
> sysfs, but the user might have to first log on to the target machine and then 
> check the UUID and the device version by themselves, based on the current code 
> of libvirt. I suppose all the host device management would be in node device 
> in libvirt, which provides remote management of the host devices.
> 
> For the format of a device version, an example would be:
> 
> Vendor ID(16bit)Device ID(16bit)Class ID(16bit)Version(16bit)

This is no different from the mdev type, these are user visible
attributes of the device which should not change without also changing
the type.  Why do these necessarily convey that the migration stream is
also compatible?

> For string version of the device version, I guess we have to define the max 
> string length, which is hard to say yet. Also, a magic number is easier to be 
> put into the state data header during the migration.

I don't think we've accomplished anything with this "device version".
If anything, I think we're looking for a sysfs representation of a
migration stream version where userspace would match the vendor, type,
and migration stream version to determine compatibility.  For vendor
drivers that want to provide backwards compatibility, perhaps an
optional minimum migration stream version would be provided, which
would therefore imply that the format of the version can be parsed into
a monotonically increasing value so that userspace can compare a stream
produced by a source to a range supported by a target.  Thanks,

Alex
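
As a rough sketch of the userspace check suggested in the last paragraph above: match the vendor and type, then compare the source's stream version against the range the target accepts. It assumes the version has already been parsed into an integer, and that a target without a minimum accepts only its own version; the class and field names are hypothetical, not an existing sysfs interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MigrationInfo:
    vendor: str                        # vendor driver name
    mdev_type: str                     # opaque mdev type
    version: int                       # migration stream version, parsed
    min_version: Optional[int] = None  # optional oldest accepted version


def can_migrate(source, target):
    """True if the stream produced by source falls in target's range."""
    if (source.vendor, source.mdev_type) != (target.vendor, target.mdev_type):
        return False
    # Assumption: without a minimum, only an exact version match is accepted.
    lowest = target.version if target.min_version is None else target.min_version
    return lowest <= source.version <= target.version
```

For example, a source at version 3 would be accepted by a target at version 5 with a minimum of 2, and rejected by a target at version 5 that provides no minimum.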

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-01 Thread Wang, Zhi A
Hi:

Let me summarize my understanding of the discussion so far, since I am new 
to it.

The mdev_type would be something generic, since we don't want userspace 
applications to be confused. An example of mdev_type is:

There are several pre-defined mdev_types with different configurations, let's 
say MDEV_TYPE A/B/C. The HW 1.0 might only support MDEV_TYPE A, the HW 2.0 
might support both MDEV_TYPE A and B, but due to HW difference, we cannot 
migrate MDEV_TYPE A with HW 1.0 to MDEV_TYPE A with HW 2.0 even they have the 
same MDEV_TYPE. So we need a device version either in the existing MDEV_TYPE or 
a new sysfs entry.

Libvirt would have to check MDEV_TYPE match between source machine and 
destination machine, then the device version. If any of them is different, then 
it fails the migration.

If my above understanding is correct, for VFIO part, we could define the device 
version as string or a magic number. For example, the vendor mdev driver could 
pass the vendor/device id and a version to VFIO and VFIO could expose them in 
the UUID sysfs no matter through a new sysfs entry or through existing 
MDEV_TYPE.

I prefer to expose it in the mdev_supported_types, since the libvirt node 
device list could extract the device version when it enumerates the host PCI 
devices or other devices which support mdev. We can also put it into the UUID 
sysfs, but the user might have to first log on to the target machine and then 
check the UUID and the device version by themselves, based on the current code 
of libvirt. I suppose all the host device management would be in node device 
in libvirt, which provides remote management of the host devices.

For the format of a device version, an example would be:

Vendor ID(16bit)Device ID(16bit)Class ID(16bit)Version(16bit)

For string version of the device version, I guess we have to define the max 
string length, which is hard to say yet. Also, a magic number is easier to be 
put into the state data header during the migration.
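
To illustrate the magic-number layout above, the four 16-bit fields can be packed into a single 64-bit value. The field order follows the format line; the big-endian byte order, the function names and the sample class ID and version values are assumptions made only for this sketch.

```python
import struct

# Four unsigned 16-bit fields; big-endian byte order is an assumption.
_LAYOUT = ">HHHH"


def pack_device_version(vendor_id, device_id, class_id, version):
    """Pack the four fields into 8 bytes, e.g. for a state data header."""
    return struct.pack(_LAYOUT, vendor_id, device_id, class_id, version)


def unpack_device_version(blob):
    """Return (vendor_id, device_id, class_id, version)."""
    return struct.unpack(_LAYOUT, blob)


blob = pack_device_version(0xDEAD, 0xBEEF, 0x0300, 0x0001)
assert len(blob) == 8
assert int.from_bytes(blob, "big") == 0xDEADBEEF03000001
assert unpack_device_version(blob) == (0xDEAD, 0xBEEF, 0x0300, 0x0001)
```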

Thanks,
Zhi.

-----Original Message-----
From: Alex Williamson [mailto:alex.william...@redhat.com] 
Sent: Tuesday, July 31, 2018 12:49 AM
To: Wang, Zhi A 
Cc: libvir-list@redhat.com; kwankh...@nvidia.com
Subject: Re: Matching the type of mediated devices in the migration

On Tue, 31 Jul 2018 04:05:11 +0800
Zhi Wang  wrote:

> On 07/30/18 23:56, Alex Williamson wrote:
> > On Sun, 29 Jul 2018 21:19:41 +
> > "Wang, Zhi A"  wrote:
> >   
> >> BACKGROUND
> >>
> >> As the live migration of mdev is going to be supported in VFIO, a scheme 
> >> of deciding if a mdev could be migratable between the source machine and 
> >> the destination machine is needed. Mostly, this email is going to discuss 
> >> a possible solution which needs fewer modifications of libvirt/VFIO.
> >>
> >> The configuration of a mdev is located in the domain XML, which guides 
> >> libvirt how to find the mdev and generating the command line for QEMU. It 
> >> basically only includes the UUID of a mdev. The domain XML of the source 
> >> machine and destination machine are going to be compared before the 
> >> migration really happens. Each configuration item would be compared and 
> >> checked by libvirt. If one item of the source machine is different from 
> >> the item of destination machine, the migration fails. For mdev, there is 
> >> no any check/match before the migration happens yet.
> >>
> >> The user could use the node device list of libvirt to list the host 
> >> devices and see the capabilities of those devices. The current node device 
> >> code of libvirt has already been able to extract the supported mdev types 
> >> from a host PCI device, plus some basic information, like max supported 
> >> mdev instance of a host PCI device.
> >>
> >> THE SOLUTION
> >>
> >> To strictly check the mdev type and make sure the migration happens 
> >> between the compatible mediated devices, three new mandatory elements in 
> >> the domain XML below the hostdev element would be introduced:
> >>
> >> vendorid: The vendor ID of the mdev, which comes from the host PCI device. 
> >> A user could obtain this information from the host PCI device which 
> >> supports mdev in the node device list.
> >> productid: The product ID of the mdev, which also comes from the host PCI 
> >> device. A user could obtain this information from the same approach above. 
> >>  
> > 
> > The parent of an mdev device is not necessarily a PCI device.  
> Good point. I didn't get that.
> >   
> >> mdevtype: The type of the mdev. As the creation of the mdev is managed by 
> >> the user, the user knows the type of the mdev and would be responsible for 
> >> filling out this information.
> >>
> >> These three elements are only needed when the device API of a mdev is 
> >> "vfio-pci". Take the example of mdev configuration from 
> >> https://libvirt.org/formatdomain.html to illustrate the modification:
> >>
> >>
> >>  
> >>  
> >>
> >>      <vendorid>0xdead</vendorid>
> >>      <productid>0xbeef</productid>
> >> 

Re: [libvirt] Matching the type of mediated devices in the migration

2018-07-30 Thread Zhi Wang

Hi Erik:

Thanks for the reply and also the detailed guide. :) I understand that 
your idea is a comprehensive and generic approach for matching and 
checking all kinds of "hostdev"s in libvirt, not only an mdev-specific 
one, since mdev is a sub-hierarchy of "hostdev". If your idea comes true, 
mdev would benefit from it naturally. :)


Cross compatibility is quite a good point. I haven't seen such a 
possibility in Intel products so far, but Nvidia might possibly support 
it. If we go with a device version, not only vendor ID and product ID, 
then the vendor would be able to control the cross compatibility.


Thanks again for the reply and guide. :)

Thanks,
Zhi.

On 07/30/18 20:28, Erik Skultety wrote:

On Sun, Jul 29, 2018 at 09:19:41PM +, Wang, Zhi A wrote:

BACKGROUND

As the live migration of mdev is going to be supported in VFIO, a scheme of 
deciding if a mdev could be migratable between the source machine and the 
destination machine is needed. Mostly, this email is going to discuss a 
possible solution which needs fewer modifications of libvirt/VFIO.

The configuration of a mdev is located in the domain XML, which guides libvirt 
how to find the mdev and generating the command line for QEMU. It basically 
only includes the UUID of a mdev. The domain XML of the source machine and 
destination machine are going to be compared before the migration really 
happens. Each configuration item would be compared and checked by libvirt. If 
one item of the source machine is different from the item of destination 
machine, the migration fails. For mdev, there is no any check/match before the 
migration happens yet.

The user could use the node device list of libvirt to list the host devices and 
see the capabilities of those devices. The current node device code of libvirt 
has already been able to extract the supported mdev types from a host PCI 
device, plus some basic information, like max supported mdev instance of a host 
PCI device.

THE SOLUTION

To strictly check the mdev type and make sure the migration happens between the 
compatible mediated devices, three new mandatory elements in the domain XML 
below the hostdev element would be introduced:

vendorid: The vendor ID of the mdev, which comes from the host PCI device. A 
user could obtain this information from the host PCI device which supports mdev 
in the node device list.
productid: The product ID of the mdev, which also comes from the host PCI 
device. A user could obtain this information from the same approach above.
mdevtype: The type of the mdev. As the creation of the mdev is managed by the 
user, the user knows the type of the mdev and would be responsible for filling 
out this information.


As you pointed out, we have this information, we therefore shouldn't duplicate
it within the domain XML. AFAIK we can probe that information from the
node-device driver before starting migration, put it into the migration cookie,
send the cookie over to the destination, retrieve the info from the cookie,
perform some checks and decide whether we should continue or abort the
migration. Or is there something I'm missing ? (this can very much be the case
as I'm not very familiar with the migration code)



These three elements are only needed when the device API of a mdev is 
"vfio-pci". Take the example of mdev configuration from 
https://libvirt.org/formatdomain.html to illustrate the modification:





   <vendorid>0xdead</vendorid>
   <productid>0xbeef</productid>
   <mdevtype>type</mdevtype>
 
 

With the newly introduced elements above, the flow of the creation of a domain 
XML with mdev will be like:

1. The user obtains the vendorid/productid from node device list
2. The user fills the vendorid/productid/mdevtype in the domain XML
3. When a migration happens, libvirt check these elements. If one item is 
different between two domain XML, then migration fails.


What kind of checks are we talking about? Speaking of vendor/product IDs,
simple string comparison doesn't scale, as libvirt would have to compensate for
every future update to the vendor driver. IOW, if a vendor decides that in
driver version A only matching product IDs are allowed in migration, but a
fresh new driver version B allows certain product IDs to be cross compatible in
terms of migration, libvirt must have access to this kind of information,
otherwise we're just going to end up being a dumping ground holding a massive
database of all the compatible combinations. The same goes for mdevtype:
ensuring compatibility between types is the vendor's responsibility and may be
subject to change which is out of libvirt's hands. Thus, if libvirt is the one
ultimately making a qualified decision about migration, we need to be able to
query this kind of data ad-hoc rather than having it as part of libvirt.

Erik





Re: [libvirt] Matching the type of mediated devices in the migration

2018-07-30 Thread Alex Williamson
On Sun, 29 Jul 2018 21:19:41 +
"Wang, Zhi A"  wrote:

> BACKGROUND
> 
> As the live migration of mdev is going to be supported in VFIO, a scheme of 
> deciding if a mdev could be migratable between the source machine and the 
> destination machine is needed. Mostly, this email is going to discuss a 
> possible solution which needs fewer modifications of libvirt/VFIO.
> 
> The configuration of a mdev is located in the domain XML, which guides 
> libvirt how to find the mdev and generating the command line for QEMU. It 
> basically only includes the UUID of a mdev. The domain XML of the source 
> machine and destination machine are going to be compared before the migration 
> really happens. Each configuration item would be compared and checked by 
> libvirt. If one item of the source machine is different from the item of 
> destination machine, the migration fails. For mdev, there is no any 
> check/match before the migration happens yet.
> 
> The user could use the node device list of libvirt to list the host devices 
> and see the capabilities of those devices. The current node device code of 
> libvirt has already been able to extract the supported mdev types from a host 
> PCI device, plus some basic information, like max supported mdev instance of 
> a host PCI device.
> 
> THE SOLUTION
> 
> To strictly check the mdev type and make sure the migration happens between 
> the compatible mediated devices, three new mandatory elements in the domain 
> XML below the hostdev element would be introduced:
> 
> vendorid: The vendor ID of the mdev, which comes from the host PCI device. A 
> user could obtain this information from the host PCI device which supports 
> mdev in the node device list.
> productid: The product ID of the mdev, which also comes from the host PCI 
> device. A user could obtain this information from the same approach above.

The parent of an mdev device is not necessarily a PCI device.

> mdevtype: The type of the mdev. As the creation of the mdev is managed by the 
> user, the user knows the type of the mdev and would be responsible for 
> filling out this information.
> 
> These three elements are only needed when the device API of a mdev is 
> "vfio-pci". Take the example of mdev configuration from 
> https://libvirt.org/formatdomain.html to illustrate the modification:
> 
>   
> 
> 
>   
>   <vendorid>0xdead</vendorid>
>   <productid>0xbeef</productid>
>   <mdevtype>type</mdevtype>
> 
> 
> 
> With the newly introduced elements above, the flow of the creation of a 
> domain XML with mdev will be like:
> 
> 1. The user obtains the vendorid/productid from node device list
> 2. The user fills the vendorid/productid/mdevtype in the domain XML
> 3. When a migration happens, libvirt check these elements. If one item is 
> different between two domain XML, then migration fails.

I don't see how this solves anything.  The vendor and product are
redundant and specific to PCI hosted mdev devices.  These do nothing
to enhance the definition of an mdev type, where we've decided the
mdev type is a guest software compatible definition of a device.
Simply knowing the type doesn't help me know that the state data
between source and target is compatible.  This is the difference
between knowing I'm migrating from machine 'pc-i440fx' to 'pc-i440fx'
versus 'pc-i440fx-2.12' to 'pc-i440fx-2.11'.  We need somehow to define
a version of a device, what we consider to be compatible versions for
migration, and hopefully some standard(ish) mechanism libvirt could
use to determine this.  Thanks,

Alex



Re: [libvirt] Matching the type of mediated devices in the migration

2018-07-30 Thread Erik Skultety
On Sun, Jul 29, 2018 at 09:19:41PM +, Wang, Zhi A wrote:
> BACKGROUND
>
> As the live migration of mdev is going to be supported in VFIO, a scheme of 
> deciding if a mdev could be migratable between the source machine and the 
> destination machine is needed. Mostly, this email is going to discuss a 
> possible solution which needs fewer modifications of libvirt/VFIO.
>
> The configuration of a mdev is located in the domain XML, which guides 
> libvirt how to find the mdev and generating the command line for QEMU. It 
> basically only includes the UUID of a mdev. The domain XML of the source 
> machine and destination machine are going to be compared before the migration 
> really happens. Each configuration item would be compared and checked by 
> libvirt. If one item of the source machine is different from the item of 
> destination machine, the migration fails. For mdev, there is no any 
> check/match before the migration happens yet.
>
> The user could use the node device list of libvirt to list the host devices 
> and see the capabilities of those devices. The current node device code of 
> libvirt has already been able to extract the supported mdev types from a host 
> PCI device, plus some basic information, like max supported mdev instance of 
> a host PCI device.
>
> THE SOLUTION
>
> To strictly check the mdev type and make sure the migration happens between 
> the compatible mediated devices, three new mandatory elements in the domain 
> XML below the hostdev element would be introduced:
>
> vendorid: The vendor ID of the mdev, which comes from the host PCI device. A 
> user could obtain this information from the host PCI device which supports 
> mdev in the node device list.
> productid: The product ID of the mdev, which also comes from the host PCI 
> device. A user could obtain this information from the same approach above.
> mdevtype: The type of the mdev. As the creation of the mdev is managed by the 
> user, the user knows the type of the mdev and would be responsible for 
> filling out this information.

As you pointed out, we have this information, we therefore shouldn't duplicate
it within the domain XML. AFAIK we can probe that information from the
node-device driver before starting migration, put it into the migration cookie,
send the cookie over to the destination, retrieve the info from the cookie,
perform some checks and decide whether we should continue or abort the
migration. Or is there something I'm missing ? (this can very much be the case
as I'm not very familiar with the migration code)
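
The cookie-based flow suggested above could look roughly like the sketch below: probe the mdev types on the source, carry them to the destination, and decide there whether to continue. The JSON encoding, the "mdev_types" key and the function names are placeholders chosen for brevity; libvirt's actual cookie format and migration code are not shown in this thread and are not what is modelled here.

```python
import json


def build_cookie(source_mdev_types):
    """Source side: serialize the mdev types probed from the node device
    driver (hypothetical payload layout)."""
    return json.dumps({"mdev_types": sorted(source_mdev_types)})


def check_cookie(cookie, destination_types):
    """Destination side: return (ok, missing), where missing lists the
    types from the cookie that the destination does not offer."""
    wanted = json.loads(cookie)["mdev_types"]
    missing = [t for t in wanted if t not in destination_types]
    return (not missing, missing)


cookie = build_cookie(["i915-GVTg_V5_4"])
assert check_cookie(cookie, {"i915-GVTg_V5_4", "i915-GVTg_V5_8"}) == (True, [])
assert check_cookie(cookie, {"i915-GVTg_V5_8"}) == (False, ["i915-GVTg_V5_4"])
```

The membership test stands in for whatever check is finally chosen; as the rest of this message argues, a plain comparison of names is unlikely to be enough.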

>
> These three elements are only needed when the device API of a mdev is 
> "vfio-pci". Take the example of mdev configuration from 
> https://libvirt.org/formatdomain.html to illustrate the modification:
>
>   
> 
> 
>   
>   <vendorid>0xdead</vendorid>
>   <productid>0xbeef</productid>
>   <mdevtype>type</mdevtype>
> 
> 
>
> With the newly introduced elements above, the flow of the creation of a 
> domain XML with mdev will be like:
>
> 1. The user obtains the vendorid/productid from node device list
> 2. The user fills the vendorid/productid/mdevtype in the domain XML
> 3. When a migration happens, libvirt check these elements. If one item is 
> different between two domain XML, then migration fails.

What kind of checks are we talking about? Speaking of vendor/product IDs,
simple string comparison doesn't scale, as libvirt would have to compensate for
every future update to the vendor driver. IOW, if a vendor decides that in
driver version A only matching product IDs are allowed in migration, but a
fresh new driver version B allows certain product IDs to be cross compatible in
terms of migration, libvirt must have access to this kind of information,
otherwise we're just going to end up being a dumping ground holding a massive
database of all the compatible combinations. The same goes for mdevtype:
ensuring compatibility between types is the vendor's responsibility and may be
subject to change which is out of libvirt's hands. Thus, if libvirt is the one
ultimately making a qualified decision about migration, we need to be able to
query this kind of data ad-hoc rather than having it as part of libvirt.

Erik
