Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-30 Thread Michael S. Tsirkin
On Mon, Jul 30, 2018 at 10:42:05AM -0600, Alex Williamson wrote:
> On Mon, 30 Jul 2018 18:49:58 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Mon, Jul 30, 2018 at 09:01:37AM -0600, Alex Williamson wrote:
> > > > > but I don't think it can be done
> > > > > atomically with respect to inflight DMA of a physical device where we
> > > > > cannot halt the device without interfering with its state.
> > > > 
> > > > Guests never add pages to the balloon if they are under DMA,
> > > > so that's fine - there's never an in-flight DMA; if
> > > > there is, the guest is buggy and it's OK to crash it.
> > > 
> > > It's not the ballooned page that I'm trying to note, it's the entire
> > > remainder of the SubRegion which needs to be unmapped to remove that
> > > one page.  It's more compatible from an IOMMU perspective in that we're
> > > only unmapping with the same granularity with which we mapped, but it's
> > > incompatible with inflight DMA as we have no idea what DMA targets may
> > > reside within the remainder of that mapping while it's temporarily
> > > unmapped.  
> > 
> > I see. Yes, you need to be careful to replace the host IOMMU PTE
> > atomically. The same applies to the vIOMMU though - if the guest changes
> > a PTE atomically, the host should do the same.
> 
> I'm not sure the hardware supports atomic updates in these cases and
> therefore I don't think the vIOMMU does either.  Thanks,
> 
> Alex

Interesting. What makes you think simply writing into the PTE
and then flushing the cache isn't atomic?

-- 
MST
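
For context, the atomic update being referred to is a single aligned 64-bit
store into the page-table entry followed by an IOTLB invalidation.  A
minimal, driver-agnostic sketch (iommu_replace_pte() and flush_iotlb() are
illustrative names, not any real IOMMU driver's API):

  #include <stdint.h>

  typedef uint64_t iommu_pte_t;

  static void iommu_replace_pte(iommu_pte_t *pte, iommu_pte_t new_val,
                                void (*flush_iotlb)(uint64_t iova),
                                uint64_t iova)
  {
      /* Single aligned 64-bit store: a concurrent table walk sees either
       * the old or the new entry, never a torn mixture of the two. */
      __atomic_store_n(pte, new_val, __ATOMIC_RELEASE);

      /* Stale translations may still sit in the IOTLB until this completes. */
      flush_iotlb(iova);
  }

Alex's concern, as the rest of the thread makes clear, is not this
single-PTE case but the tear-down-and-remap of a whole mapping, which
cannot be made to appear atomic to a device.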



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-30 Thread Alex Williamson
On Mon, 30 Jul 2018 18:49:58 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, Jul 30, 2018 at 09:01:37AM -0600, Alex Williamson wrote:
> > > > but I don't think it can be done
> > > > atomically with respect to inflight DMA of a physical device where we
> > > > cannot halt the device without interfering with its state.
> > > 
> > > Guests never add pages to the balloon if they are under DMA,
> > > so that's fine - there's never an in-flight DMA; if
> > > there is, the guest is buggy and it's OK to crash it.
> > 
> > It's not the ballooned page that I'm trying to note, it's the entire
> > remainder of the SubRegion which needs to be unmapped to remove that
> > one page.  It's more compatible from an IOMMU perspective in that we're
> > only unmapping with the same granularity with which we mapped, but it's
> > incompatible with inflight DMA as we have no idea what DMA targets may
> > reside within the remainder of that mapping while it's temporarily
> > unmapped.  
> 
> I see. Yes, you need to be careful to replace the host IOMMU PTE
> atomically. The same applies to the vIOMMU though - if the guest changes
> a PTE atomically, the host should do the same.

I'm not sure the hardware supports atomic updates in these cases and
therefore I don't think the vIOMMU does either.  Thanks,

Alex



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-30 Thread Michael S. Tsirkin
On Mon, Jul 30, 2018 at 09:01:37AM -0600, Alex Williamson wrote:
> > > but I don't think it can be done
> > > atomically with respect to inflight DMA of a physical device where we
> > > cannot halt the device without interfering with its state.  
> > 
> > Guests never add pages to the balloon if they are under DMA,
> > so that's fine - there's never an in-flight DMA; if
> > there is, the guest is buggy and it's OK to crash it.
> 
> It's not the ballooned page that I'm trying to note, it's the entire
> remainder of the SubRegion which needs to be unmapped to remove that
> one page.  It's more compatible from an IOMMU perspective in that we're
> only unmapping with the same granularity with which we mapped, but it's
> incompatible with inflight DMA as we have no idea what DMA targets may
> reside within the remainder of that mapping while it's temporarily
> unmapped.

I see. Yes, you need to be careful to replace the host IOMMU PTE
atomically. The same applies to the vIOMMU though - if the guest changes
a PTE atomically, the host should do the same.

--
MST



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-30 Thread Alex Williamson
On Mon, 30 Jul 2018 17:51:28 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, Jul 30, 2018 at 08:39:39AM -0600, Alex Williamson wrote:
> > This is more
> > compatible with the IOMMU mappings,  
> 
> Precisely. These are at page granularity.

(This/these being memory API mappings for context)

SubRegions are not page-granular; the entire previous SubRegion needs to
be unmapped and any remaining SubRegions re-mapped.
 
> > but I don't think it can be done
> > atomically with respect to inflight DMA of a physical device where we
> > cannot halt the device without interfering with its state.  
> 
> Guests never add pages to the balloon if they are under DMA,
> so that's fine - there's never an in-flight DMA; if
> there is, the guest is buggy and it's OK to crash it.

It's not the ballooned page that I'm trying to note, it's the entire
remainder of the SubRegion which needs to be unmapped to remove that
one page.  It's more compatible from an IOMMU perspective in that we're
only unmapping with the same granularity with which we mapped, but it's
incompatible with inflight DMA as we have no idea what DMA targets may
reside within the remainder of that mapping while it's temporarily
unmapped.  Thanks,

Alex



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-30 Thread David Hildenbrand
On 30.07.2018 16:58, Michael S. Tsirkin wrote:
> On Mon, Jul 30, 2018 at 04:46:25PM +0200, David Hildenbrand wrote:
>> On 30.07.2018 15:59, Michael S. Tsirkin wrote:
>>> On Mon, Jul 30, 2018 at 03:54:04PM +0200, David Hildenbrand wrote:
 On 30.07.2018 15:34, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
>> Directly assigned vfio devices have never been compatible with
>> ballooning.  Zapping MADV_DONTNEED pages happens completely
>> independent of vfio page pinning and IOMMU mapping, leaving us with
>> inconsistent GPA to HPA mapping between vCPUs and assigned devices
>> when the balloon deflates.  Mediated devices can theoretically do
>> better, if we make the assumption that the mdev vendor driver is fully
>> synchronized to the actual working set of the guest driver.  In that
>> case the guest balloon driver should never be able to allocate an mdev
>> pinned page for balloon inflation.  Unfortunately, QEMU can't know the
>> workings of the vendor driver pinning, and doesn't actually know the
>> difference between mdev devices and directly assigned devices.  Until
>> we can sort out how the vfio IOMMU backend can tell us if ballooning
>> is safe, the best approach is to disable ballooning any time a vfio
>> device is attached.
>>
>> To do that, simply make the balloon inhibitor a counter rather than a
>> boolean, fixup a case where KVM can then simply use the inhibit
>> interface, and inhibit ballooning any time a vfio device is attached.
>> I'm expecting we'll expose some sort of flag similar to
>> KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
>> this.  An addition we could consider here would be yet another device
>> option for vfio, such as x-disable-balloon-inhibit, in case there are
>> mdev devices that behave in a manner compatible with ballooning.
>>
>> Please let me know if this looks like a good idea.  Thanks,
>>
>> Alex
>
> It's probably the only reasonable thing to do for this release.
>
> Long term however, why can't balloon notify vfio as pages are
> added and removed? VFIO could update its mappings then.

 What if the guest is rebooted and pages are silently getting reused
 without getting a deflation request first?
>>>
>>> Good point. To handle that we'd need to deflate fully on
>>> device reset, allowing access to all memory again.
>>
>> 1. Doing it from the guest: not reliable. E.g. think about crashes +
>> reboots, or a plain "system_reset" in QEMU. Deflation is definitely not
>> reliably possible.
>>
>> 2. Doing it in QEMU balloon implementation. Not possible. We don't track
>> the memory that has been inflated (and also should not do it).
>>
>> So the only thing we could do is "deflate all guest memory" which
>> implies a madvise WILLNEED on all guest memory. We definitely don't want
>> this. We could inform vfio about all guest memory.
> 
> Exactly. No need to track anything, we just need QEMU to allow access to
> all guest memory.
> 
>> Everything sounds like a big hack that should be handled internally in
>> the kernel.
> 
> What exactly do you want the kernel to do?

As already discussed (in this thread? I don't remember), Alex was asking
whether there is some kind of notifier mechanism in the kernel to get
notified when a fresh page is used in memory that was previously
madvised with MADV_DONTNEED. Then that page could be immediately re-pinned.

-- 

Thanks,

David / dhildenb



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-30 Thread Michael S. Tsirkin
On Mon, Jul 30, 2018 at 04:46:25PM +0200, David Hildenbrand wrote:
> On 30.07.2018 15:59, Michael S. Tsirkin wrote:
> > On Mon, Jul 30, 2018 at 03:54:04PM +0200, David Hildenbrand wrote:
> >> On 30.07.2018 15:34, Michael S. Tsirkin wrote:
> >>> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
>  Directly assigned vfio devices have never been compatible with
>  ballooning.  Zapping MADV_DONTNEED pages happens completely
>  independent of vfio page pinning and IOMMU mapping, leaving us with
>  inconsistent GPA to HPA mapping between vCPUs and assigned devices
>  when the balloon deflates.  Mediated devices can theoretically do
>  better, if we make the assumption that the mdev vendor driver is fully
>  synchronized to the actual working set of the guest driver.  In that
>  case the guest balloon driver should never be able to allocate an mdev
>  pinned page for balloon inflation.  Unfortunately, QEMU can't know the
>  workings of the vendor driver pinning, and doesn't actually know the
>  difference between mdev devices and directly assigned devices.  Until
>  we can sort out how the vfio IOMMU backend can tell us if ballooning
>  is safe, the best approach is to disable ballooning any time a vfio
>  device is attached.
> 
>  To do that, simply make the balloon inhibitor a counter rather than a
>  boolean, fixup a case where KVM can then simply use the inhibit
>  interface, and inhibit ballooning any time a vfio device is attached.
>  I'm expecting we'll expose some sort of flag similar to
>  KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
>  this.  An addition we could consider here would be yet another device
>  option for vfio, such as x-disable-balloon-inhibit, in case there are
>  mdev devices that behave in a manner compatible with ballooning.
> 
>  Please let me know if this looks like a good idea.  Thanks,
> 
>  Alex
> >>>
> >>> It's probably the only reasonable thing to do for this release.
> >>>
> >>> Long term however, why can't balloon notify vfio as pages are
> >>> added and removed? VFIO could update its mappings then.
> >>
> >> What if the guest is rebooted and pages are silently getting reused
> >> without getting a deflation request first?
> > 
> > Good point. To handle that we'd need to deflate fully on
> > device reset, allowing access to all memory again.
> 
> 1. Doing it from the guest: not reliable. E.g. think about crashes +
> reboots, or a plain "system_reset" in QEMU. Deflation is definitely not
> reliably possible.
> 
> 2. Doing it in QEMU balloon implementation. Not possible. We don't track
> the memory that has been inflated (and also should not do it).
>
> So the only thing we could do is "deflate all guest memory" which
> implies a madvise WILLNEED on all guest memory. We definitely don't want
> this. We could inform vfio about all guest memory.

Exactly. No need to track anything, we just need QEMU to allow access to
all guest memory.

> Everything sounds like a big hack that should be handled internally in
> the kernel.

What exactly do you want the kernel to do?

> -- 
> 
> Thanks,
> 
> David / dhildenb



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-30 Thread Michael S. Tsirkin
On Mon, Jul 30, 2018 at 08:39:39AM -0600, Alex Williamson wrote:
> This is more
> compatible with the IOMMU mappings,

Precisely. These are at page granularity.

> but I don't think it can be done
> atomically with respect to inflight DMA of a physical device where we
> cannot halt the device without interfering with its state.

Guests never add pages to the balloon if they are under DMA,
so that's fine - there's never an in-flight DMA; if
there is, the guest is buggy and it's OK to crash it.

-- 
MST



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-30 Thread David Hildenbrand
On 30.07.2018 15:59, Michael S. Tsirkin wrote:
> On Mon, Jul 30, 2018 at 03:54:04PM +0200, David Hildenbrand wrote:
>> On 30.07.2018 15:34, Michael S. Tsirkin wrote:
>>> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
 Directly assigned vfio devices have never been compatible with
 ballooning.  Zapping MADV_DONTNEED pages happens completely
 independent of vfio page pinning and IOMMU mapping, leaving us with
 inconsistent GPA to HPA mapping between vCPUs and assigned devices
 when the balloon deflates.  Mediated devices can theoretically do
 better, if we make the assumption that the mdev vendor driver is fully
 synchronized to the actual working set of the guest driver.  In that
 case the guest balloon driver should never be able to allocate an mdev
 pinned page for balloon inflation.  Unfortunately, QEMU can't know the
 workings of the vendor driver pinning, and doesn't actually know the
 difference between mdev devices and directly assigned devices.  Until
 we can sort out how the vfio IOMMU backend can tell us if ballooning
 is safe, the best approach is to disable ballooning any time a vfio
 device is attached.

 To do that, simply make the balloon inhibitor a counter rather than a
 boolean, fixup a case where KVM can then simply use the inhibit
 interface, and inhibit ballooning any time a vfio device is attached.
 I'm expecting we'll expose some sort of flag similar to
 KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
 this.  An addition we could consider here would be yet another device
 option for vfio, such as x-disable-balloon-inhibit, in case there are
 mdev devices that behave in a manner compatible with ballooning.

 Please let me know if this looks like a good idea.  Thanks,

 Alex
>>>
>>> It's probably the only reasonable thing to do for this release.
>>>
>>> Long term however, why can't balloon notify vfio as pages are
>>> added and removed? VFIO could update its mappings then.
>>
>> What if the guest is rebooted and pages are silently getting reused
>> without getting a deflation request first?
> 
> Good point. To handle that we'd need to deflate fully on
> device reset, allowing access to all memory again.

1. Doing it from the guest: not reliable. E.g. think about crashes +
reboots, or a plain "system_reset" in QEMU. Deflation is definitely not
reliably possible.

2. Doing it in QEMU balloon implementation. Not possible. We don't track
the memory that has been inflated (and also should not do it).

So the only thing we could do is "deflate all guest memory" which
implies a madvise WILLNEED on all guest memory. We definitely don't want
this. We could inform vfio about all guest memory.

Everything sounds like a big hack that should be handled internally in
the kernel.

-- 

Thanks,

David / dhildenb
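
To make the madvise point above concrete, here is a simplified sketch of
what inflation and deflation do to the host mapping of guest RAM, using
raw madvise(2) (QEMU wraps this in its own helpers; host_addr and
page_size are assumed to be the HVA and size of the ballooned page):

  #include <sys/mman.h>
  #include <stddef.h>

  static int balloon_page_inflate(void *host_addr, size_t page_size)
  {
      /* Drop the backing page; the next CPU access faults in a fresh zero
       * page, but a device with a pinned IOMMU mapping still targets the
       * old host page. */
      return madvise(host_addr, page_size, MADV_DONTNEED);
  }

  static int balloon_page_deflate(void *host_addr, size_t page_size)
  {
      /* "Deflate all guest memory" as discussed above would be this call
       * over every RAM block; it prefaults pages but cannot restore the
       * old host physical addresses. */
      return madvise(host_addr, page_size, MADV_WILLNEED);
  }

This is also why assigned devices break: after MADV_DONTNEED, the vCPU and
the pinned device end up looking at different host pages for the same GPA.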



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-30 Thread Alex Williamson
On Mon, 30 Jul 2018 16:34:09 +0300
"Michael S. Tsirkin"  wrote:

> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> > Directly assigned vfio devices have never been compatible with
> > ballooning.  Zapping MADV_DONTNEED pages happens completely
> > independent of vfio page pinning and IOMMU mapping, leaving us with
> > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > when the balloon deflates.  Mediated devices can theoretically do
> > better, if we make the assumption that the mdev vendor driver is fully
> > synchronized to the actual working set of the guest driver.  In that
> > case the guest balloon driver should never be able to allocate an mdev
> > pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> > workings of the vendor driver pinning, and doesn't actually know the
> > difference between mdev devices and directly assigned devices.  Until
> > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > is safe, the best approach is to disable ballooning any time a vfio
> > device is attached.
> > 
> > To do that, simply make the balloon inhibitor a counter rather than a
> > boolean, fixup a case where KVM can then simply use the inhibit
> > interface, and inhibit ballooning any time a vfio device is attached.
> > I'm expecting we'll expose some sort of flag similar to
> > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > this.  An addition we could consider here would be yet another device
> > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > mdev devices that behave in a manner compatible with ballooning.
> > 
> > Please let me know if this looks like a good idea.  Thanks,
> > 
> > Alex  
> 
> It's probably the only reasonable thing to do for this release.
> 
> Long term however, why can't balloon notify vfio as pages are
> added and removed? VFIO could update its mappings then.

Are you thinking of a notifier outside of the memory API or updating
the memory API to reflect the current ballooning state?  In the former
case, we don't have page level granularity for mapping and un-mapping.
We could invent a mechanism for userspace to specify page granularity
mapping to the vfio kernel module, but that incurs a cost at the
hardware and host level with poor IOTLB efficiency and excessive page
tables.  Additionally, how would a notifier approach handle hot-added
devices, is the notifier replayed for each added device?  This starts
to sound more like the existing functionality of the memory API.

If we go through the memory API then we also don't really have page
level granularity, removing a page from a SubRegion will remove the
entire region and add back the remaining SubRegion(s).  This is more
compatible with the IOMMU mappings, but I don't think it can be done
atomically with respect to inflight DMA of a physical device where we
cannot halt the device without interfering with its state.  Thanks,

Alex
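
To illustrate the granularity problem Alex describes: with the type1 vfio
IOMMU backend, removing one ballooned page from an existing mapping means
unmapping the whole original mapping and re-mapping the two remaining
pieces.  A sketch with error handling omitted, where the container fd, the
original (iova, vaddr, size) mapping and the ballooned page at hole_iova
are assumed; everything between the unmap and the final map is the
non-atomic window in which in-flight DMA to any other address in the range
faults:

  #include <linux/vfio.h>
  #include <sys/ioctl.h>
  #include <stdint.h>

  static void remap_around_ballooned_page(int container, uint64_t iova,
                                          uint64_t vaddr, uint64_t size,
                                          uint64_t hole_iova,
                                          uint64_t page_size)
  {
      struct vfio_iommu_type1_dma_unmap unmap = {
          .argsz = sizeof(unmap),
          .iova  = iova,
          .size  = size,        /* unmap granularity == original mapping */
      };
      struct vfio_iommu_type1_dma_map map = {
          .argsz = sizeof(map),
          .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
      };

      ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);

      /* Re-map the piece below the ballooned page... */
      map.iova  = iova;
      map.vaddr = vaddr;
      map.size  = hole_iova - iova;
      if (map.size) {
          ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
      }

      /* ...and the piece above it. */
      map.iova  = hole_iova + page_size;
      map.vaddr = vaddr + (map.iova - iova);
      map.size  = iova + size - map.iova;
      if (map.size) {
          ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
      }
  }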



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-30 Thread Michael S. Tsirkin
On Mon, Jul 30, 2018 at 03:54:04PM +0200, David Hildenbrand wrote:
> On 30.07.2018 15:34, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> >> Directly assigned vfio devices have never been compatible with
> >> ballooning.  Zapping MADV_DONTNEED pages happens completely
> >> independent of vfio page pinning and IOMMU mapping, leaving us with
> >> inconsistent GPA to HPA mapping between vCPUs and assigned devices
> >> when the balloon deflates.  Mediated devices can theoretically do
> >> better, if we make the assumption that the mdev vendor driver is fully
> >> synchronized to the actual working set of the guest driver.  In that
> >> case the guest balloon driver should never be able to allocate an mdev
> >> pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> >> workings of the vendor driver pinning, and doesn't actually know the
> >> difference between mdev devices and directly assigned devices.  Until
> >> we can sort out how the vfio IOMMU backend can tell us if ballooning
> >> is safe, the best approach is to disable ballooning any time a vfio
> >> device is attached.
> >>
> >> To do that, simply make the balloon inhibitor a counter rather than a
> >> boolean, fixup a case where KVM can then simply use the inhibit
> >> interface, and inhibit ballooning any time a vfio device is attached.
> >> I'm expecting we'll expose some sort of flag similar to
> >> KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> >> this.  An addition we could consider here would be yet another device
> >> option for vfio, such as x-disable-balloon-inhibit, in case there are
> >> mdev devices that behave in a manner compatible with ballooning.
> >>
> >> Please let me know if this looks like a good idea.  Thanks,
> >>
> >> Alex
> > 
> > It's probably the only reasonable thing to do for this release.
> > 
> > Long term however, why can't balloon notify vfio as pages are
> > added and removed? VFIO could update its mappings then.
> 
> What if the guest is rebooted and pages are silently getting reused
> without getting a deflation request first?

Good point. To handle that we'd need to deflate fully on
device reset, allowing access to all memory again.

> -- 
> 
> Thanks,
> 
> David / dhildenb



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-30 Thread David Hildenbrand
On 30.07.2018 15:34, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
>> Directly assigned vfio devices have never been compatible with
>> ballooning.  Zapping MADV_DONTNEED pages happens completely
>> independent of vfio page pinning and IOMMU mapping, leaving us with
>> inconsistent GPA to HPA mapping between vCPUs and assigned devices
>> when the balloon deflates.  Mediated devices can theoretically do
>> better, if we make the assumption that the mdev vendor driver is fully
>> synchronized to the actual working set of the guest driver.  In that
>> case the guest balloon driver should never be able to allocate an mdev
>> pinned page for balloon inflation.  Unfortunately, QEMU can't know the
>> workings of the vendor driver pinning, and doesn't actually know the
>> difference between mdev devices and directly assigned devices.  Until
>> we can sort out how the vfio IOMMU backend can tell us if ballooning
>> is safe, the best approach is to disable ballooning any time a vfio
>> device is attached.
>>
>> To do that, simply make the balloon inhibitor a counter rather than a
>> boolean, fixup a case where KVM can then simply use the inhibit
>> interface, and inhibit ballooning any time a vfio device is attached.
>> I'm expecting we'll expose some sort of flag similar to
>> KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
>> this.  An addition we could consider here would be yet another device
>> option for vfio, such as x-disable-balloon-inhibit, in case there are
>> mdev devices that behave in a manner compatible with ballooning.
>>
>> Please let me know if this looks like a good idea.  Thanks,
>>
>> Alex
> 
> It's probably the only reasonable thing to do for this release.
> 
> Long term however, why can't balloon notify vfio as pages are
> added and removed? VFIO could update its mappings then.

What if the guest is rebooted and pages are silently getting reused
without getting a deflation request first?

-- 

Thanks,

David / dhildenb



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-30 Thread Michael S. Tsirkin
On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> Directly assigned vfio devices have never been compatible with
> ballooning.  Zapping MADV_DONTNEED pages happens completely
> independent of vfio page pinning and IOMMU mapping, leaving us with
> inconsistent GPA to HPA mapping between vCPUs and assigned devices
> when the balloon deflates.  Mediated devices can theoretically do
> better, if we make the assumption that the mdev vendor driver is fully
> synchronized to the actual working set of the guest driver.  In that
> case the guest balloon driver should never be able to allocate an mdev
> pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> workings of the vendor driver pinning, and doesn't actually know the
> difference between mdev devices and directly assigned devices.  Until
> we can sort out how the vfio IOMMU backend can tell us if ballooning
> is safe, the best approach is to disable ballooning any time a vfio
> device is attached.
> 
> To do that, simply make the balloon inhibitor a counter rather than a
> boolean, fixup a case where KVM can then simply use the inhibit
> interface, and inhibit ballooning any time a vfio device is attached.
> I'm expecting we'll expose some sort of flag similar to
> KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> this.  An addition we could consider here would be yet another device
> option for vfio, such as x-disable-balloon-inhibit, in case there are
> mdev devices that behave in a manner compatible with ballooning.
> 
> Please let me know if this looks like a good idea.  Thanks,
> 
> Alex

It's probably the only reasonable thing to do for this release.

Long term however, why can't balloon notify vfio as pages are
added and removed? VFIO could update its mappings then.

> ---
> 
> Alex Williamson (3):
>   balloon: Allow nested inhibits
>   kvm: Use inhibit to prevent ballooning without synchronous mmu
>   vfio: Inhibit ballooning
> 
> 
>  accel/kvm/kvm-all.c|4 
>  balloon.c  |7 ---
>  hw/vfio/common.c   |6 ++
>  hw/virtio/virtio-balloon.c |4 +---
>  4 files changed, 15 insertions(+), 6 deletions(-)
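
For reference, the "counter rather than a boolean" change in patch 1
amounts to roughly the following; a sketch modeled on QEMU's
qemu_balloon_inhibit()/qemu_balloon_is_inhibited(), not the literal patch:

  #include <stdbool.h>
  #include <stdatomic.h>
  #include <assert.h>

  static atomic_int balloon_inhibit_count;

  /* Each user that cannot tolerate ballooning (KVM without a synchronous
   * MMU, every attached vfio device) holds one reference while that
   * condition lasts. */
  void qemu_balloon_inhibit(bool state)
  {
      atomic_fetch_add(&balloon_inhibit_count, state ? 1 : -1);
      assert(atomic_load(&balloon_inhibit_count) >= 0);
  }

  /* The virtio-balloon device checks this before honoring an inflate. */
  bool qemu_balloon_is_inhibited(void)
  {
      return atomic_load(&balloon_inhibit_count) > 0;
  }

In this sketch, the hw/vfio/common.c hunk in the diffstat above would
simply call qemu_balloon_inhibit(true) when a device is attached and
qemu_balloon_inhibit(false) when it is torn down.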



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-19 Thread Peter Xu
On Thu, Jul 19, 2018 at 09:01:46AM -0600, Alex Williamson wrote:
> On Thu, 19 Jul 2018 13:40:51 +0800
> Peter Xu  wrote:
> > On Wed, Jul 18, 2018 at 10:31:33AM -0600, Alex Williamson wrote:
> > > On Wed, 18 Jul 2018 14:48:03 +0800
> > > Peter Xu  wrote:
> > > > I'm wondering what if we want to do that somehow some day... Whether
> > > > it'll work if we just let vfio-pci devices register some guest
> > > > memory invalidation hook (just like the iommu notifiers, but for guest
> > > > memory address space instead), then we map/unmap the IOMMU pages there
> > > > for vfio-pci device to make sure the inflated balloon pages are not
> > > > mapped and also make sure new pages are remapped with correct HPA
> > > > after deflated.  This is a pure question out of my curiosity, and for
> > > > sure it makes little sense if the answer of the first question above
> > > > is positive.  
> > > 
> > > This is why I mention the KVM MMU synchronization flag above.  KVM
> > > essentially had this same problem and fixed it with with MMU notifiers
> > > in the kernel.  They expose that KVM has the capability of handling
> > > such a scenario via a feature flag.  We can do the same with vfio.  In
> > > scenarios where we're able to fix this, we could expose a flag on the
> > > container indicating support for the same sort of thing.  
> > 
> > Sorry I didn't really catch that point when replying.  So that's why we
> > have had the mmu notifiers... Hmm, glad to know that.
> > 
> > But I would guess that if we want that notifier for vfio it should be
> > in QEMU rather than the kernel one since kernel vfio driver should not
> > have enough information on the GPA address space, hence it might not
> > be able to rebuild the mapping when a new page is mapped?  While QEMU
> > should be able to get both GPA and HVA easily when the balloon device
> > wants to deflate a page. [1]
> 
> This is where the vfio IOMMU backend comes into play.  vfio devices
> make use of MemoryListeners to register the HVA to GPA translations
> within the AddressSpace of a device.  When we're using an IOMMU, we pin
> those HVAs in order to make the HPA static and insert the GPA to HPA
> mappings into the IOMMU.  When we don't have an IOMMU, the IOMMU
> backend is storing those HVA to GPA translations so that the mediated
> device vendor driver can make pinning requests.  The vendor driver
> requests pinning of a given set of GPAs and the IOMMU backend pins the
> matching HVA to provide an HPA.
> 
> When a page is ballooned, it's zapped from the process address space,
> so we need to invalidate the HVA to HPA mapping.  When the page is
> restored, we still have the correct HVA, but we need a notifier to tell
> us to put it back into play, re-pinning and inserting the mapping into
> the IOMMU if we have one.
> 
> In order for QEMU to do this, this ballooned page would need to be
> reflected in the memory API.  This would be quite simple, inserting a
> MemoryRegion overlapping the RAM page which is ballooned out and
> removing it when the balloon is deflated.  But we run into the same
> problems with mapping granularity.  In order to accommodate this new
> overlap, the memory API would first remove the previous mapping, split
> or truncate the region, then reinsert the result.  Just like if we tried
> to do this in the IOMMU, it's not atomic with respect to device DMA.  In
> order to achieve this model, the memory API would need to operate
> entirely on page size regions.  Now imagine that every MiB of guest RAM
> requires 256 ioctls to map (assuming 4KiB pages), 256K per GiB.  Clearly
> we'd want to use a larger granularity for efficiency.  If we allow the
> user to specify the granularity, perhaps abstracting that granularity
> as the size of a DIMM, suddenly we've moved from memory ballooning to
> memory hotplug, where the latter does make use of the memory API and
> has none of these issues AIUI.

I see.  Indeed pc-dimm seems to be more suitable here.  And I think I
better understand the awkwardness that the page granularity problem
brings - since we would need this page granularity even in the QEMU
memory API, we'd possibly end up with 4k-sized memory regions filling
the whole RAM address space.  That sounds like a hard mission.

> 
> > > There are a few complications to this support though.  First ballooning
> > > works at page size granularity, but IOMMU mapping can make use of
> > > arbitrary superpage sizes and the IOMMU API only guarantees unmap
> > > granularity equal to the original mapping.  Therefore we cannot unmap
> > > individual pages unless we require that all mappings through the IOMMU
> > > API are done with page granularity, precluding the use of superpages by
> > > the IOMMU and thereby inflicting higher IOTLB overhead.  Unlike a CPU,
> > > we can't invalidate the mappings and fault them back in or halt the
> > > processor to make the page table updates appear atomic.  The device is
> > > considered always running and interfering with that would likely lead
> > > to functional issues.

Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-19 Thread Alex Williamson
On Thu, 19 Jul 2018 12:49:23 +0800
Peter Xu  wrote:

> On Wed, Jul 18, 2018 at 11:36:40AM +0200, Cornelia Huck wrote:
> > On Wed, 18 Jul 2018 14:48:03 +0800
> > Peter Xu  wrote:
> >   
> > > On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:  
> > > > Directly assigned vfio devices have never been compatible with
> > > > ballooning.  Zapping MADV_DONTNEED pages happens completely
> > > > independent of vfio page pinning and IOMMU mapping, leaving us with
> > > > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > > > when the balloon deflates.  Mediated devices can theoretically do
> > > > better, if we make the assumption that the mdev vendor driver is fully
> > > > synchronized to the actual working set of the guest driver.  In that
> > > > case the guest balloon driver should never be able to allocate an mdev
> > > > pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> > > > workings of the vendor driver pinning, and doesn't actually know the
> > > > difference between mdev devices and directly assigned devices.  Until
> > > > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > > > is safe, the best approach is to disable ballooning any time a vfio
> > > > device is attached.
> > > > 
> > > > To do that, simply make the balloon inhibitor a counter rather than a
> > > > boolean, fixup a case where KVM can then simply use the inhibit
> > > > interface, and inhibit ballooning any time a vfio device is attached.
> > > > I'm expecting we'll expose some sort of flag similar to
> > > > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > > > this.  An addition we could consider here would be yet another device
> > > > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > > > mdev devices that behave in a manner compatible with ballooning.
> > > > 
> > > > Please let me know if this looks like a good idea.  Thanks,
> > > 
> > > IMHO patches 1-2 are good cleanup as standalone patches...
> > > 
> > > I totally have no idea whether people would like to use vfio-pci
> > > and the balloon device at the same time.  After all, vfio-pci is
> > > mostly for performance players, so I would vaguely guess that they
> > > don't really care about thin provisioning of memory at all, hence the usage
> > > scenario might not exist much.  Is that the major reason that we'd
> > > just like to disable it (which makes sense to me)?  
> > 
> > Don't people use vfio-pci as well if they want some special
> > capabilities from the passed-through device? (At least that's the main
> > use case for vfio-ccw, not any performance considerations.)  
> 
> Good to know these.
> 
> Off topic: could I further ask what these capabilities are, and why
> they can't be emulated (or are hard to emulate) if we
> don't care about performance?

Are you assuming that anything that isn't strictly performance focused
for device assignment is self contained, fully documented, suitable for
emulation, and there's someone willing and able to invest and upstream
(open source) that emulation?  What about things like data acquisition
devices, TV capture cards, serial ports, real-time control systems,
etc.  This is one of the basic tenets of device assignment: it
provides users the ability to migrate physical systems to virtual, even
if the entire reason for the system existing is tied to hardware.  The
world is more than just NICs and HBAs.  Thanks,

Alex



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-19 Thread Alex Williamson
On Thu, 19 Jul 2018 13:40:51 +0800
Peter Xu  wrote:
> On Wed, Jul 18, 2018 at 10:31:33AM -0600, Alex Williamson wrote:
> > On Wed, 18 Jul 2018 14:48:03 +0800
> > Peter Xu  wrote:
> > > I'm wondering what if we want to do that somehow some day... Whether
> > > it'll work if we just let vfio-pci devices register some guest
> > > memory invalidation hook (just like the iommu notifiers, but for guest
> > > memory address space instead), then we map/unmap the IOMMU pages there
> > > for vfio-pci device to make sure the inflated balloon pages are not
> > > mapped and also make sure new pages are remapped with correct HPA
> > > after deflated.  This is a pure question out of my curiosity, and for
> > > sure it makes little sense if the answer of the first question above
> > > is positive.  
> > 
> > This is why I mention the KVM MMU synchronization flag above.  KVM
> > essentially had this same problem and fixed it with with MMU notifiers
> > in the kernel.  They expose that KVM has the capability of handling
> > such a scenario via a feature flag.  We can do the same with vfio.  In
> > scenarios where we're able to fix this, we could expose a flag on the
> > container indicating support for the same sort of thing.  
> 
> Sorry I didn't really catch that point when replying.  So that's why we
> have had the mmu notifiers... Hmm, glad to know that.
> 
> But I would guess that if we want that notifier for vfio it should be
> in QEMU rather than the kernel one since kernel vfio driver should not
> have enough information on the GPA address space, hence it might not
> be able to rebuild the mapping when a new page is mapped?  While QEMU
> should be able to get both GPA and HVA easily when the balloon device
> wants to deflate a page. [1]

This is where the vfio IOMMU backend comes into play.  vfio devices
make use of MemoryListeners to register the HVA to GPA translations
within the AddressSpace of a device.  When we're using an IOMMU, we pin
those HVAs in order to make the HPA static and insert the GPA to HPA
mappings into the IOMMU.  When we don't have an IOMMU, the IOMMU
backend is storing those HVA to GPA translations so that the mediated
device vendor driver can make pinning requests.  The vendor driver
requests pinning of a given set of GPAs and the IOMMU backend pins the
matching HVA to provide an HPA.

When a page is ballooned, it's zapped from the process address space,
so we need to invalidate the HVA to HPA mapping.  When the page is
restored, we still have the correct HVA, but we need a notifier to tell
us to put it back into play, re-pinning and inserting the mapping into
the IOMMU if we have one.

In order for QEMU to do this, this ballooned page would need to be
reflected in the memory API.  This would be quite simple, inserting a
MemoryRegion overlapping the RAM page which is ballooned out and
removing it when the balloon is deflated.  But we run into the same
problems with mapping granularity.  In order to accommodate this new
overlap, the memory API would first remove the previous mapping, split
or truncate the region, then reinsert the result.  Just like if we tried
to do this in the IOMMU, it's not atomic with respect to device DMA.  In
order to achieve this model, the memory API would need to operate
entirely on page size regions.  Now imagine that every MiB of guest RAM
requires 256 ioctls to map (assuming 4KiB pages), 256K per GiB.  Clearly
we'd want to use a larger granularity for efficiency.  If we allow the
user to specify the granularity, perhaps abstracting that granularity
as the size of a DIMM, suddenly we've moved from memory ballooning to
memory hotplug, where the latter does make use of the memory API and
has none of these issues AIUI.
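
A sketch of the "MemoryRegion overlapping the RAM page" idea above, using
the existing memory API calls (the names balloon_hole and system_memory
are illustrative, and this is not a proposal).  The flatview rebuild this
triggers removes and re-adds the surrounding RAM ranges, which is exactly
the step that cannot be made atomic with respect to device DMA:

  #include "qemu/osdep.h"
  #include "exec/memory.h"

  static MemoryRegion balloon_hole;

  /* Punch a hole: a higher-priority subregion with no backing shadows the
   * ballooned page, so the range no longer resolves to guest RAM. */
  static void balloon_punch_hole(MemoryRegion *system_memory,
                                 hwaddr gpa, uint64_t page_size)
  {
      memory_region_init(&balloon_hole, NULL, "balloon-hole", page_size);
      memory_region_add_subregion_overlap(system_memory, gpa,
                                          &balloon_hole, 1);
  }

  /* Deflate: removing the hole makes the underlying RAM visible again. */
  static void balloon_fill_hole(MemoryRegion *system_memory)
  {
      memory_region_del_subregion(system_memory, &balloon_hole);
  }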

> > There are a few complications to this support though.  First ballooning
> > works at page size granularity, but IOMMU mapping can make use of
> > arbitrary superpage sizes and the IOMMU API only guarantees unmap
> > granularity equal to the original mapping.  Therefore we cannot unmap
> > individual pages unless we require that all mappings through the IOMMU
> > API are done with page granularity, precluding the use of superpages by
> > the IOMMU and thereby inflicting higher IOTLB overhead.  Unlike a CPU,
> > we can't invalidate the mappings and fault them back in or halt the
> > processor to make the page table updates appear atomic.  The device is
> > considered always running and interfering with that would likely lead
> > to functional issues.  
> 
> Indeed.  Actually a VT-d emulation bug was fixed just months ago where
> the QEMU shadow page code for the device quickly unmapped and rebuilt
> the pages, but within that window DMA happened and hit the missing page
> entries, hence a DMA error.  I wish I had learnt that earlier from you!
> Then the bug would have been even more obvious to me.
> 
> And I would guess that if we want to do that in the future, the
> easiest way as the first step would be that we just tell vfio to avoid
> using huge p

Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-19 Thread Peter Xu
On Thu, Jul 19, 2018 at 10:42:22AM +0200, Cornelia Huck wrote:
> On Thu, 19 Jul 2018 12:49:23 +0800
> Peter Xu  wrote:
> 
> > On Wed, Jul 18, 2018 at 11:36:40AM +0200, Cornelia Huck wrote:
> > > On Wed, 18 Jul 2018 14:48:03 +0800
> > > Peter Xu  wrote:
> 
> > > > I totally have no idea whether people would like to use vfio-pci
> > > > and the balloon device at the same time.  After all, vfio-pci is
> > > > mostly for performance players, so I would vaguely guess that they
> > > > don't really care about thin provisioning of memory at all, hence the usage
> > > > scenario might not exist much.  Is that the major reason that we'd
> > > > just like to disable it (which makes sense to me)?  
> > > 
> > > Don't people use vfio-pci as well if they want some special
> > > capabilities from the passed-through device? (At least that's the main
> > > use case for vfio-ccw, not any performance considerations.)  
> > 
> > Good to know these.
> > 
> > Off topic: could I further ask what these capabilities are, and why
> > they can't be emulated (or are hard to emulate) if we
> > don't care about performance?
> 
> For vfio-ccw, the (current) main use case is ECKD DASD. While this is
> basically a block device, it has some useful features (path failover,
> remote copy, exclusive locking) that are not replicated by any emulated
> device (and I'm not sure how to do this, either). It also has things
> like a weird disk layout, which we _could_ emulate, but I'm not sure
> why we would do that (it is mainly interesting if you want to share a
> disk with a traditional mainframe OS).
> 
> Other use cases I'm thinking about are related to the
> on-list-but-not-yet-merged vfio-ap crypto adapter support: Using a
> crypto adapter that has been certified without needing to make keys
> available to the OS, getting real randomness out of the card and so on.
> 
> Generally, I'd think anything that needs complicated transformations,
> interaction with other ecosystems or exploiting reliability features
> might be a case for using assignment: not just because of performance,
> but also because you don't need to reinvent the wheel.

I'd confess that mainframe brought many interesting (and historical)
facts to me and I even feel like I understand virtualization a bit
better with them. :)

And I believe that the crypto use case is a good one too since that
can also happen on all architectures, not only as something special to
the mainframe.  Security, randomness, and possibly other things are
good reasons to use assigned devices indeed.

Thanks for sharing these!

-- 
Peter Xu



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-19 Thread Cornelia Huck
On Thu, 19 Jul 2018 12:49:23 +0800
Peter Xu  wrote:

> On Wed, Jul 18, 2018 at 11:36:40AM +0200, Cornelia Huck wrote:
> > On Wed, 18 Jul 2018 14:48:03 +0800
> > Peter Xu  wrote:

> > > I totally have no idea whether people would like to use vfio-pci
> > > and the balloon device at the same time.  After all, vfio-pci is
> > > mostly for performance players, so I would vaguely guess that they
> > > don't really care about thin provisioning of memory at all, hence the usage
> > > scenario might not exist much.  Is that the major reason that we'd
> > > just like to disable it (which makes sense to me)?  
> > 
> > Don't people use vfio-pci as well if they want some special
> > capabilities from the passed-through device? (At least that's the main
> > use case for vfio-ccw, not any performance considerations.)  
> 
> Good to know these.
> 
> Off topic: could I further ask what these capabilities are, and why
> they can't be emulated (or are hard to emulate) if we
> don't care about performance?

For vfio-ccw, the (current) main use case is ECKD DASD. While this is
basically a block device, it has some useful features (path failover,
remote copy, exclusive locking) that are not replicated by any emulated
device (and I'm not sure how to do this, either). It also has things
like a weird disk layout, which we _could_ emulate, but I'm not sure
why we would do that (it is mainly interesting if you want to share a
disk with a traditional mainframe OS).

Other use cases I'm thinking about are related to the
on-list-but-not-yet-merged vfio-ap crypto adapter support: Using a
crypto adapter that has been certified without needing to make keys
available to the OS, getting real randomness out of the card and so on.

Generally, I'd think anything that needs complicated transformations,
interaction with other ecosystems or exploiting reliability features
might be a case for using assignment: not just because of performance,
but also because you don't need to reinvent the wheel.



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-18 Thread Peter Xu
On Wed, Jul 18, 2018 at 10:31:33AM -0600, Alex Williamson wrote:
> On Wed, 18 Jul 2018 14:48:03 +0800
> Peter Xu  wrote:
> 
> > On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> > > Directly assigned vfio devices have never been compatible with
> > > ballooning.  Zapping MADV_DONTNEED pages happens completely
> > > independent of vfio page pinning and IOMMU mapping, leaving us with
> > > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > > when the balloon deflates.  Mediated devices can theoretically do
> > > better, if we make the assumption that the mdev vendor driver is fully
> > > synchronized to the actual working set of the guest driver.  In that
> > > case the guest balloon driver should never be able to allocate an mdev
> > > pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> > > workings of the vendor driver pinning, and doesn't actually know the
> > > difference between mdev devices and directly assigned devices.  Until
> > > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > > is safe, the best approach is to disable ballooning any time a vfio
> > > device is attached.
> > > 
> > > To do that, simply make the balloon inhibitor a counter rather than a
> > > boolean, fixup a case where KVM can then simply use the inhibit
> > > interface, and inhibit ballooning any time a vfio device is attached.
> > > I'm expecting we'll expose some sort of flag similar to
> > > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > > this.  An addition we could consider here would be yet another device
> > > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > > mdev devices that behave in a manner compatible with ballooning.
> > > 
> > > Please let me know if this looks like a good idea.  Thanks,  
> > 
> > IMHO patches 1-2 are good cleanup as standalone patches...
> > 
> > I totally have no idea whether people would like to use vfio-pci
> > and the balloon device at the same time.  After all, vfio-pci is
> > mostly for performance players, so I would vaguely guess that they
> > don't really care about thin provisioning of memory at all, hence the usage
> > scenario might not exist much.  Is that the major reason that we'd
> > just like to disable it (which makes sense to me)?
> 
> Well, the major reason for disabling it is that it currently doesn't
> work and when the balloon is deflated, the device and vCPU are talking
> to different host pages for the same GPA for previously ballooned
> pages.  Regardless of the amenability of device assignment to various
> usage scenarios, that's a bad thing.  I guess most device assignment
> users have either realized this doesn't work and avoid it, or perhaps
> they have VMs tuned more for performance than density and (again) don't
> use ballooning.

Makes sense to me.

>  
> > I'm wondering what if we want to do that somehow some day... Whether
> > it'll work if we just let vfio-pci devices register some guest
> > memory invalidation hook (just like the iommu notifiers, but for guest
> > memory address space instead), then we map/unmap the IOMMU pages there
> > for vfio-pci device to make sure the inflated balloon pages are not
> > mapped and also make sure new pages are remapped with correct HPA
> > after deflated.  This is a pure question out of my curiosity, and for
> > sure it makes little sense if the answer of the first question above
> > is positive.
> 
> This is why I mention the KVM MMU synchronization flag above.  KVM
> essentially had this same problem and fixed it with with MMU notifiers
> in the kernel.  They expose that KVM has the capability of handling
> such a scenario via a feature flag.  We can do the same with vfio.  In
> scenarios where we're able to fix this, we could expose a flag on the
> container indicating support for the same sort of thing.

Sorry I didn't really catch that point when replying.  So that's why we
have had the mmu notifiers... Hmm, glad to know that.

But I would guess that if we want that notifier for vfio it should be
in QEMU rather than the kernel one since kernel vfio driver should not
have enough information on the GPA address space, hence it might not
be able to rebuild the mapping when a new page is mapped?  While QEMU
should be able to get both GPA and HVA easily when the balloon device
wants to deflate a page. [1]

> 
> There are a few complications to this support though.  First ballooning
> works at page size granularity, but IOMMU mapping can make use of
> arbitrary superpage sizes and the IOMMU API only guarantees unmap
> granularity equal to the original mapping.  Therefore we cannot unmap
> individual pages unless we require that all mappings through the IOMMU
> API are done with page granularity, precluding the use of superpages by
> the IOMMU and thereby inflicting higher IOTLB overhead.  Unlike a CPU,
> we can't invalidate the mappings and fault them back in or halt the
> processor to make the page table updates appear atomic.

Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-18 Thread Peter Xu
On Wed, Jul 18, 2018 at 11:36:40AM +0200, Cornelia Huck wrote:
> On Wed, 18 Jul 2018 14:48:03 +0800
> Peter Xu  wrote:
> 
> > On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> > > Directly assigned vfio devices have never been compatible with
> > > ballooning.  Zapping MADV_DONTNEED pages happens completely
> > > independent of vfio page pinning and IOMMU mapping, leaving us with
> > > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > > when the balloon deflates.  Mediated devices can theoretically do
> > > better, if we make the assumption that the mdev vendor driver is fully
> > > synchronized to the actual working set of the guest driver.  In that
> > > case the guest balloon driver should never be able to allocate an mdev
> > > pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> > > workings of the vendor driver pinning, and doesn't actually know the
> > > difference between mdev devices and directly assigned devices.  Until
> > > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > > is safe, the best approach is to disable ballooning any time a vfio
> > > device is attached.
> > > 
> > > To do that, simply make the balloon inhibitor a counter rather than a
> > > boolean, fixup a case where KVM can then simply use the inhibit
> > > interface, and inhibit ballooning any time a vfio device is attached.
> > > I'm expecting we'll expose some sort of flag similar to
> > > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > > this.  An addition we could consider here would be yet another device
> > > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > > mdev devices that behave in a manner compatible with ballooning.
> > > 
> > > Please let me know if this looks like a good idea.  Thanks,  
> > 
> > IMHO patches 1-2 are good cleanup as standalone patches...
> > 
> > I totally have no idea whether people would like to use vfio-pci
> > and the balloon device at the same time.  After all, vfio-pci is
> > mostly for performance players, so I would vaguely guess that they
> > don't really care about thin provisioning of memory at all, hence the usage
> > scenario might not exist much.  Is that the major reason that we'd
> > just like to disable it (which makes sense to me)?
> 
> Don't people use vfio-pci as well if they want some special
> capabilities from the passed-through device? (At least that's the main
> use case for vfio-ccw, not any performance considerations.)

Good to know these.

Off topic: could I further ask what these capabilities are, and why
they can't be emulated (or are hard to emulate) if we
don't care about performance?

(Any link or keyword would be welcomed too if there is)

Regards,

-- 
Peter Xu



Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-18 Thread Alex Williamson
On Wed, 18 Jul 2018 14:48:03 +0800
Peter Xu  wrote:

> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> > Directly assigned vfio devices have never been compatible with
> > ballooning.  Zapping MADV_DONTNEED pages happens completely
> > independent of vfio page pinning and IOMMU mapping, leaving us with
> > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > when the balloon deflates.  Mediated devices can theoretically do
> > better, if we make the assumption that the mdev vendor driver is fully
> > synchronized to the actual working set of the guest driver.  In that
> > case the guest balloon driver should never be able to allocate an mdev
> > pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> > workings of the vendor driver pinning, and doesn't actually know the
> > difference between mdev devices and directly assigned devices.  Until
> > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > is safe, the best approach is to disable ballooning any time a vfio
> > device is attached.
> > 
> > To do that, simply make the balloon inhibitor a counter rather than a
> > boolean, fixup a case where KVM can then simply use the inhibit
> > interface, and inhibit ballooning any time a vfio device is attached.
> > I'm expecting we'll expose some sort of flag similar to
> > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > this.  An addition we could consider here would be yet another device
> > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > mdev devices that behave in a manner compatible with ballooning.
> > 
> > Please let me know if this looks like a good idea.  Thanks,  
> 
> IMHO patches 1-2 are good cleanup as standalone patches...
> 
> I totally have no idea whether people would like to use vfio-pci
> and the balloon device at the same time.  After all, vfio-pci is
> mostly for performance players, so I would vaguely guess that they
> don't really care about thin provisioning of memory at all, hence the usage
> scenario might not exist much.  Is that the major reason that we'd
> just like to disable it (which makes sense to me)?

Well, the major reason for disabling it is that it currently doesn't
work and when the balloon is deflated, the device and vCPU are talking
to different host pages for the same GPA for previously ballooned
pages.  Regardless of the amenability of device assignment to various
usage scenarios, that's a bad thing.  I guess most device assignment
users have either realized this doesn't work and avoid it, or perhaps
they have VMs tuned more for performance than density and (again) don't
use ballooning.
 
> I'm wondering what if we want to do that somehow some day... Whether
> it'll work if we just let vfio-pci devices register some guest
> memory invalidation hook (just like the iommu notifiers, but for guest
> memory address space instead), then we map/unmap the IOMMU pages there
> for vfio-pci device to make sure the inflated balloon pages are not
> mapped and also make sure new pages are remapped with correct HPA
> after deflated.  This is a pure question out of my curiosity, and for
> sure it makes little sense if the answer of the first question above
> is positive.

This is why I mention the KVM MMU synchronization flag above.  KVM
essentially had this same problem and fixed it with with MMU notifiers
in the kernel.  They expose that KVM has the capability of handling
such a scenario via a feature flag.  We can do the same with vfio.  In
scenarios where we're able to fix this, we could expose a flag on the
container indicating support for the same sort of thing.
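
For readers unfamiliar with the kernel interface being referenced: KVM
keeps its shadow/stage-2 page tables coherent by registering an
mmu_notifier against the QEMU process mm, which is what KVM_CAP_SYNC_MMU
advertises.  A purely hypothetical vfio-side hook might look like the
sketch below (the callback signature follows the older, pre-4.19 form;
vfio registers no such notifier today, so names and wiring are
assumptions):

  #include <linux/mmu_notifier.h>
  #include <linux/mm.h>

  static void vfio_mn_invalidate_range_start(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long start,
                                             unsigned long end)
  {
      /* Unpin and IOMMU-unmap the HVAs in [start, end) -- subject to the
       * mapping-granularity caveats discussed in this thread. */
  }

  static const struct mmu_notifier_ops vfio_mn_ops = {
      .invalidate_range_start = vfio_mn_invalidate_range_start,
  };

  static struct mmu_notifier vfio_mn = {
      .ops = &vfio_mn_ops,
  };

  /* Registration would tie the notifier to QEMU's mm, as KVM does:
   *   mmu_notifier_register(&vfio_mn, current->mm);
   */

As the following paragraphs explain, invalidation alone is not enough for
a device, because there is no fault path to re-populate the mapping
afterwards.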

There are a few complications to this support though.  First ballooning
works at page size granularity, but IOMMU mapping can make use of
arbitrary superpage sizes and the IOMMU API only guarantees unmap
granularity equal to the original mapping.  Therefore we cannot unmap
individual pages unless we require that all mappings through the IOMMU
API are done with page granularity, precluding the use of superpages by
the IOMMU and thereby inflicting higher IOTLB overhead.  Unlike a CPU,
we can't invalidate the mappings and fault them back in or halt the
processor to make the page table updates appear atomic.  The device is
considered always running and interfering with that would likely lead
to functional issues.

Second MMU notifiers seem to provide invalidation, pte change notices,
and page aging interfaces, so if a page is consumed by the balloon
inflating, we can invalidate it (modulo the issues in the previous
paragraph), but how do we re-populate the mapping through the IOMMU
when the page is released as the balloon is deflated?  KVM seems to do
this by handling the page fault, but we don't really have that option
for devices.  If we try to solve this only for mdev devices, we can
request invalidation down to the vendor driver with page granularity,
and we could assume a vendor driver that's well synchronized to the
guest's working set would never have such a page pinned in the first
place.
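
Purely as a strawman for that mdev-only case, the sort of hook Peter
mentions might look something like this (a hypothetical interface, not
anything that exists today):

    #include <stdint.h>

    /* Hypothetical per-device callbacks, sketched for discussion only. */
    typedef struct GuestMemoryNotifier {
        /* page handed to the balloon: drop pins/mappings for the range */
        void (*invalidate)(void *opaque, uint64_t gpa, uint64_t size);
        /* page returned by the balloon: re-pin and re-map at the new HPA */
        void (*repopulate)(void *opaque, uint64_t gpa, uint64_t size);
        void *opaque;
    } GuestMemoryNotifier;

The open question is still who drives repopulate for a directly
assigned device, since there's no fault to catch.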

Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-18 Thread Cornelia Huck
On Wed, 18 Jul 2018 14:48:03 +0800
Peter Xu  wrote:

> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> > Directly assigned vfio devices have never been compatible with
> > ballooning.  Zapping MADV_DONTNEED pages happens completely
> > independent of vfio page pinning and IOMMU mapping, leaving us with
> > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > when the balloon deflates.  Mediated devices can theoretically do
> > better, if we make the assumption that the mdev vendor driver is fully
> > synchronized to the actual working set of the guest driver.  In that
> > case the guest balloon driver should never be able to allocate an mdev
> > pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> > workings of the vendor driver pinning, and doesn't actually know the
> > difference between mdev devices and directly assigned devices.  Until
> > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > is safe, the best approach is to disable ballooning any time a vfio
> > device is attached.
> > 
> > To do that, simply make the balloon inhibitor a counter rather than a
> > boolean, fixup a case where KVM can then simply use the inhibit
> > interface, and inhibit ballooning any time a vfio device is attached.
> > I'm expecting we'll expose some sort of flag similar to
> > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > this.  An addition we could consider here would be yet another device
> > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > mdev devices that behave in a manner compatible with ballooning.
> > 
> > Please let me know if this looks like a good idea.  Thanks,  
> 
> IMHO patches 1-2 are good cleanup as standalone patches...
> 
> I honestly have no idea whether people would like to use vfio-pci
> and the balloon device at the same time.  After all, vfio-pci is
> mostly for performance players, so I would vaguely guess that they
> don't really care about thin provisioning of memory at all, hence the
> usage scenario might not really exist.  Is that the major reason that
> we'd just like to disable it (which makes sense to me)?

Don't people use vfio-pci as well if they want some special
capabilities from the passed-through device? (At least that's the main
use case for vfio-ccw, not any performance considerations.)

> 
> I'm wondering what would happen if we want to do that somehow some
> day...  Would it work if we just let vfio-pci devices register some
> guest memory invalidation hook (just like the IOMMU notifiers, but for
> the guest memory address space instead), then map/unmap the IOMMU
> pages there for the vfio-pci device, to make sure the inflated balloon
> pages are not mapped and also that new pages are remapped with the
> correct HPA after deflation?  This is purely a question out of
> curiosity, and of course it makes little sense if the answer to the
> first question above is yes.
> 
> Thanks,
> 




Re: [Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-17 Thread Peter Xu
On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> Directly assigned vfio devices have never been compatible with
> ballooning.  Zapping MADV_DONTNEED pages happens completely
> independent of vfio page pinning and IOMMU mapping, leaving us with
> inconsistent GPA to HPA mapping between vCPUs and assigned devices
> when the balloon deflates.  Mediated devices can theoretically do
> better, if we make the assumption that the mdev vendor driver is fully
> synchronized to the actual working set of the guest driver.  In that
> case the guest balloon driver should never be able to allocate an mdev
> pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> workings of the vendor driver pinning, and doesn't actually know the
> difference between mdev devices and directly assigned devices.  Until
> we can sort out how the vfio IOMMU backend can tell us if ballooning
> is safe, the best approach is to disable ballooning any time a vfio
> device is attached.
> 
> To do that, simply make the balloon inhibitor a counter rather than a
> boolean, fixup a case where KVM can then simply use the inhibit
> interface, and inhibit ballooning any time a vfio device is attached.
> I'm expecting we'll expose some sort of flag similar to
> KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> this.  An addition we could consider here would be yet another device
> option for vfio, such as x-disable-balloon-inhibit, in case there are
> mdev devices that behave in a manner compatible with ballooning.
> 
> Please let me know if this looks like a good idea.  Thanks,

IMHO patches 1-2 are good cleanup as standalone patches...

I honestly have no idea whether people would like to use vfio-pci
and the balloon device at the same time.  After all, vfio-pci is
mostly for performance players, so I would vaguely guess that they
don't really care about thin provisioning of memory at all, hence the
usage scenario might not really exist.  Is that the major reason that
we'd just like to disable it (which makes sense to me)?

I'm wondering what would happen if we want to do that somehow some
day...  Would it work if we just let vfio-pci devices register some
guest memory invalidation hook (just like the IOMMU notifiers, but for
the guest memory address space instead), then map/unmap the IOMMU
pages there for the vfio-pci device, to make sure the inflated balloon
pages are not mapped and also that new pages are remapped with the
correct HPA after deflation?  This is purely a question out of
curiosity, and of course it makes little sense if the answer to the
first question above is yes.

Thanks,

-- 
Peter Xu



[Qemu-devel] [RFC PATCH 0/3] Balloon inhibit enhancements

2018-07-17 Thread Alex Williamson
Directly assigned vfio devices have never been compatible with
ballooning.  Zapping MADV_DONTNEED pages happens completely
independent of vfio page pinning and IOMMU mapping, leaving us with
inconsistent GPA to HPA mapping between vCPUs and assigned devices
when the balloon deflates.  Mediated devices can theoretically do
better, if we make the assumption that the mdev vendor driver is fully
synchronized to the actual working set of the guest driver.  In that
case the guest balloon driver should never be able to allocate an mdev
pinned page for balloon inflation.  Unfortunately, QEMU can't know the
workings of the vendor driver pinning, and doesn't actually know the
difference between mdev devices and directly assigned devices.  Until
we can sort out how the vfio IOMMU backend can tell us if ballooning
is safe, the best approach is to disable ballooning any time a vfio
device is attached.

To do that, simply make the balloon inhibitor a counter rather than a
boolean, fixup a case where KVM can then simply use the inhibit
interface, and inhibit ballooning any time a vfio device is attached.
I'm expecting we'll expose some sort of flag similar to
KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
this.  An addition we could consider here would be yet another device
option for vfio, such as x-disable-balloon-inhibit, in case there are
mdev devices that behave in a manner compatible with ballooning.
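
If we went that route, usage would presumably look something like the
following (option name and default still up for debate):

    -device vfio-pci,host=02:00.0,x-disable-balloon-inhibit=on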

Please let me know if this looks like a good idea.  Thanks,

Alex

---

Alex Williamson (3):
  balloon: Allow nested inhibits
  kvm: Use inhibit to prevent ballooning without synchronous mmu
  vfio: Inhibit ballooning


 accel/kvm/kvm-all.c        |    4
 balloon.c                  |    7 ---
 hw/vfio/common.c           |    6 ++
 hw/virtio/virtio-balloon.c |    4 +---
 4 files changed, 15 insertions(+), 6 deletions(-)
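
For reference, a minimal sketch of what the counter-based inhibitor
might look like in balloon.c (illustrative only, ignoring locking and
not necessarily matching the actual patch):

    #include <assert.h>
    #include <stdbool.h>

    static int balloon_inhibit_count;

    bool qemu_balloon_is_inhibited(void)
    {
        /* Any outstanding inhibit keeps ballooning disabled. */
        return balloon_inhibit_count > 0;
    }

    void qemu_balloon_inhibit(bool state)
    {
        /* Nested users each take or release one reference. */
        balloon_inhibit_count += state ? 1 : -1;
        assert(balloon_inhibit_count >= 0);
    }

Each user (KVM without a synchronous MMU, each attached vfio device,
and so on) can then take its own reference without clobbering anyone
else's.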