[PATCH net-next] net: allow to call netif_reset_xps_queues() under cpus_read_lock

2018-08-06 Thread Andrei Vagin
From: Andrei Vagin 

The definition of static_key_slow_inc() has cpus_read_lock in place. In the
virtio_net driver, XPS queues are initialized after setting the queue:cpu
affinity in virtnet_set_affinity() which is already protected within
cpus_read_lock. Lockdep prints a warning when we are trying to acquire
cpus_read_lock when it is already held.

This patch adds the ability to call __netif_set_xps_queue() under
cpus_read_lock().
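
For illustration, a minimal sketch of the locking shape this introduces, assuming
kernel context; the names below are illustrative stand-ins, not the patched
netdev functions:

#include <linux/cpu.h>
#include <linux/jump_label.h>

static struct static_key example_key = STATIC_KEY_INIT_FALSE;

/* Caller says whether cpu_hotplug_lock is already read-held. */
static void do_work(bool cpuslocked)
{
	if (!cpuslocked)
		cpus_read_lock();

	/* The _cpuslocked variant never re-acquires cpu_hotplug_lock,
	 * so there is no recursive-locking splat either way. */
	static_key_slow_inc_cpuslocked(&example_key);

	if (!cpuslocked)
		cpus_read_unlock();
}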


WARNING: possible recursive locking detected
4.18.0-rc3-next-20180703+ #1 Not tainted

swapper/0/1 is trying to acquire lock:
cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: static_key_slow_inc+0xe/0x20

but task is already holding lock:
cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0

other info that might help us debug this:
 Possible unsafe locking scenario:

   CPU0
   ----
  lock(cpu_hotplug_lock.rw_sem);
  lock(cpu_hotplug_lock.rw_sem);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by swapper/0/1:
 #0: 244bc7da (&dev->mutex){....}, at: __driver_attach+0x5a/0x110
 #1: cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
 #2: 5cd8463f (xps_map_mutex){+.+.}, at: __netif_set_xps_queue+0x8d/0xc60

Cc: "Nambiar, Amritha" 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")

Signed-off-by: Andrei Vagin 
---
 drivers/net/virtio_net.c  |  4 +++-
 include/linux/netdevice.h |  2 +-
 net/core/dev.c            | 23 +++++++++++++++++------
 net/core/net-sysfs.c      |  2 +-
 4 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 62311dde6e71..a4abcfcf26b2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1903,9 +1903,11 @@ static void virtnet_set_affinity(struct virtnet_info *vi)
 
i = 0;
for_each_online_cpu(cpu) {
+   const unsigned long *mask = cpumask_bits(cpumask_of(cpu));
+
virtqueue_set_affinity(vi->rq[i].vq, cpu);
virtqueue_set_affinity(vi->sq[i].vq, cpu);
-   netif_set_xps_queue(vi->dev, cpumask_of(cpu), i);
+   __netif_set_xps_queue(vi->dev, mask, i, false, true);
i++;
}
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 282e2e95ad5b..124f9a00ce71 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3320,7 +3320,7 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
u16 index);
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
- u16 index, bool is_rxqs_map);
+ u16 index, bool is_rxqs_map, bool cpuslocked);
 
 /**
  * netif_attr_test_mask - Test a CPU or Rx queue set in a mask
diff --git a/net/core/dev.c b/net/core/dev.c
index f68122f0ab02..d6a0f64ccdd9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2176,6 +2176,7 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
	if (!static_key_false(&xps_needed))
		return;

+	cpus_read_lock();
	mutex_lock(&xps_map_mutex);

	if (static_key_false(&xps_rxqs_needed)) {
@@ -2199,10 +2200,11 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,

 out_no_maps:
	if (static_key_enabled(&xps_rxqs_needed))
-		static_key_slow_dec(&xps_rxqs_needed);
+		static_key_slow_dec_cpuslocked(&xps_rxqs_needed);

-	static_key_slow_dec(&xps_needed);
+	static_key_slow_dec_cpuslocked(&xps_needed);
	mutex_unlock(&xps_map_mutex);
+	cpus_read_unlock();
 }
 
 static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
@@ -2251,7 +2253,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
 }
 
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
- u16 index, bool is_rxqs_map)
+ u16 index, bool is_rxqs_map, bool cpuslocked)
 {
const unsigned long *online_mask = NULL, *possible_mask = NULL;
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
@@ -2275,6 +2277,9 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
return -EINVAL;
}
 
+	if (!cpuslocked)
+		cpus_read_lock();
+
	mutex_lock(&xps_map_mutex);
if (is_rxqs_map) {
maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
@@ -2317,9 +2322,9 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
if (!new_dev_maps)
goto out_no_new_maps;
 
-	static_key_slow_inc(&xps_needed);
+	static_key_slow_inc_cpuslocked(&xps_needed);

Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Benjamin Herrenschmidt
On Tue, 2018-08-07 at 02:45 +0300, Michael S. Tsirkin wrote:
> > OK well, assuming Christoph can solve the direct case in a way that
> > also works for the virtio !iommu case, we still want some bit of logic
> > somewhere that will "switch" to swiotlb based ops if the DMA mask is
> > limited.
> > 
> > You mentioned an RFC for that ? Do you happen to have a link ?
> 
> No but Christoph did I think.

Ok I missed that, sorry, I'll dig it out. Thanks.

Cheers,
Ben.




Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Michael S. Tsirkin
On Tue, Aug 07, 2018 at 08:13:56AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2018-08-07 at 00:46 +0300, Michael S. Tsirkin wrote:
> > On Tue, Aug 07, 2018 at 07:26:35AM +1000, Benjamin Herrenschmidt wrote:
> > > On Mon, 2018-08-06 at 23:35 +0300, Michael S. Tsirkin wrote:
> > > > > As I said replying to Christoph, we are "leaking" into the interface
> > > > > something here that is really what the VM is doing to itself, which
> > > > > is to stash its memory away in an inaccessible place.
> > > > > 
> > > > > Cheers,
> > > > > Ben.
> > > > 
> > > > I think Christoph merely objects to the specific implementation.  If
> > > > instead you do something like tweak dev->bus_dma_mask for the virtio
> > > > device I think he won't object.
> > > 
> > > Well, we don't have "bus_dma_mask" yet ..or you mean dma_mask ?
> > > 
> > > So, something like that would be a possibility, but the problem is that
> > > the current virtio (guest side) implementation doesn't honor this when
> > > not using dma ops and will not use dma ops if not using iommu, so back
> > > to square one.
> > 
> > Well we have the RFC for that - the switch to using DMA ops
> > unconditionally isn't problematic in itself IMHO; for now that RFC is
> > blocked by its performance overhead, but Christoph says he's trying to
> > remove that for direct mappings, so we should hopefully be able to get
> > there in X weeks.
> 
> That would be good yes.
> 
>  ../..
> 
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -155,7 +155,7 @@ static bool vring_use_dma_api(struct virtio_device
> > > *vdev)
> > >  * the DMA API if we're a Xen guest, which at least allows
> > >  * all of the sensible Xen configurations to work correctly.
> > >  */
> > > -   if (xen_domain())
> > > +   if (xen_domain() || arch_virtio_direct_dma_ops(&vdev->dev))
> > > return true;
> > >  
> > > return false;
> > 
> > Right but can't we fix the retpoline overhead such that
> > vring_use_dma_api will not be called on data path any longer, making
> > this a setup time check?
> 
> Yes it needs to be a setup time check regardless actually !
> 
> The above is broken, sorry I was a bit quick here (too early in the
> morning... ugh). We don't want the arch to go override the dma ops
> every time that is called.
> 
> But yes, if we can fix the overhead, it becomes just a matter of
> setting up the "right" ops automatically.
> 
> > > (Passing the dev allows the arch to know this is a virtio device in
> > > "direct" mode or whatever we want to call the !iommu case, and
> > > construct appropriate DMA ops for it, which aren't the same as the DMA
> > > ops of any other PCI device who *do* use the iommu).
> > 
> > I think that's where Christoph might have specific ideas about it.
> 
> OK well, assuming Christoph can solve the direct case in a way that
> also works for the virtio !iommu case, we still want some bit of logic
> somewhere that will "switch" to swiotlb based ops if the DMA mask is
> limited.
> 
> You mentioned an RFC for that ? Do you happen to have a link ?

No but Christoph did I think.

> It would be indeed ideal if all we had to do was setup some kind of
> bus_dma_mask on all PCI devices and have virtio automagically insert
> swiotlb when necessary.
> 
> Cheers,
> Ben.
> 


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Benjamin Herrenschmidt
On Mon, 2018-08-06 at 23:35 +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 07, 2018 at 05:56:59AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-08-06 at 16:46 +0300, Michael S. Tsirkin wrote:
> > > 
> > > > Right, we'll need some quirk to disable balloons  in the guest I
> > > > suppose.
> > > > 
> > > > Passing something from libvirt is cumbersome because the end user may
> > > > not even need to know about secure VMs. There are use cases where the
> > > > security is a contract down to some special application running inside
> > > > the secure VM, the sysadmin knows nothing about.
> > > > 
> > > > Also there's repercussions all the way to admin tools, web UIs etc...
> > > > so it's fairly wide ranging.
> > > > 
> > > > So as long as we only need to quirk a couple of devices, it's much
> > > > better contained that way.
> > > 
> > > So just the balloon thing already means that yes management and all the
> > > way to the user tools must know this is going on. Otherwise
> > > user will try to inflate the balloon and wonder why this does not work.
> > 
> > There are *dozens* of management systems out there, not even all open
> > source; we won't ever see the end of the tunnel if we need
> > to teach every single one of them, including end users, about
> > platform-specific new VM flags like that.
> > 
> > .../...
> 
> In the end I suspect you will find you have to.

Maybe... we'll tackle this if/when we have to.

For balloon I suspect it's not such a big deal because once secure, all
the guest memory goes into the secure memory, which isn't visible to or
accounted for by the hypervisor, so there's nothing to steal; but the guest
is also using no HV memory (other than the few "non-secure" pages used
for swiotlb and a couple of other kernel things).

Future versions of our secure architecture might allow turning
arbitrary pages of memory secure/non-secure rather than relying on a
separate physical pool, in which case the balloon will be able to work
normally.

Cheers,
Ben.




Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Benjamin Herrenschmidt
On Tue, 2018-08-07 at 08:13 +1000, Benjamin Herrenschmidt wrote:
> 
> OK well, assuming Christoph can solve the direct case in a way that
> also works for the virtio !iommu case, we still want some bit of logic
> somewhere that will "switch" to swiotlb based ops if the DMA mask is
> limited.
> 
> You mentioned an RFC for that ? Do you happen to have a link ?
> 
> It would be indeed ideal if all we had to do was setup some kind of
> bus_dma_mask on all PCI devices and have virtio automagically insert
> swiotlb when necessary.

Actually... I can think of a simpler option (Anshuman, didn't you
prototype this earlier ?):

Since that limitation of requiring bounce buffering via swiotlb is true
of any device in a secure VM, whether it goes through the iommu or not,
the iommu remapping is essentially pointless.

Thus, we could ensure that the iommu maps 1:1 the swiotlb bounce buffer
(either that or we configure it as "disabled" which is equivalent in
this case).

That way, we can now use the basic swiotlb ops everywhere, the same
dma_ops (swiotlb) will work whether a device uses the iommu or not.

Which boils down now to only making virtio use dma ops; there is no
need to override the dma_ops.

Which means all we have to do is either make xen_domain() return true
(yuck) or replace that one test with arch_virtio_force_dma_api() which
resolves to xen_domain() on x86 and can do something else for us.

As to using a virtio feature flag for that, which is what Christoph
proposes, I'm not a fan of it because this means effectively exposing
this to the peer, ie the interface. I don't think it belongs there. The
hypervisor side of the interface, whether it's qemu, vmware,
hyperv etc..., has no business knowing how the guest manages its dma
operations, and may not even be aware of the access limitations (in our
case they are somewhat guest self-imposed).

Now, if this flag really is what we have to do, then we'd probably need
a qemu hack which will go set that flag on all virtio devices when it
detects that the VM is going secure.

But I don't think that's where that information "need to use the dma
API even for direct mode" belongs.

Cheers,
Ben.
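
A minimal sketch of that replacement, assuming a weak generic default;
arch_virtio_force_dma_api is the name suggested above, while the signature
and weak-symbol plumbing are assumptions:

#include <linux/virtio.h>
#include <xen/xen.h>

/* Generic default: nothing forces the DMA API. A secure-VM arch
 * would override this and return true. */
bool __weak arch_virtio_force_dma_api(struct virtio_device *vdev)
{
	return false;
}

static bool vring_use_dma_api(struct virtio_device *vdev)
{
	if (xen_domain() || arch_virtio_force_dma_api(vdev))
		return true;

	return false;
}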






Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Benjamin Herrenschmidt
On Tue, 2018-08-07 at 00:46 +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 07, 2018 at 07:26:35AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-08-06 at 23:35 +0300, Michael S. Tsirkin wrote:
> > > > As I said replying to Christoph, we are "leaking" into the interface
> > > > something here that is really what the VM is doing to itself, which
> > > > is to stash its memory away in an inaccessible place.
> > > > 
> > > > Cheers,
> > > > Ben.
> > > 
> > > I think Christoph merely objects to the specific implementation.  If
> > > instead you do something like tweak dev->bus_dma_mask for the virtio
> > > device I think he won't object.
> > 
> > Well, we don't have "bus_dma_mask" yet ..or you mean dma_mask ?
> > 
> > So, something like that would be a possibility, but the problem is that
> > the current virtio (guest side) implementation doesn't honor this when
> > not using dma ops and will not use dma ops if not using iommu, so back
> > to square one.
> 
> Well we have the RFC for that - the switch to using DMA ops
> unconditionally isn't problematic in itself IMHO; for now that RFC is
> blocked by its performance overhead, but Christoph says he's trying to
> remove that for direct mappings, so we should hopefully be able to get
> there in X weeks.

That would be good yes.

 ../..

> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -155,7 +155,7 @@ static bool vring_use_dma_api(struct virtio_device
> > *vdev)
> >  * the DMA API if we're a Xen guest, which at least allows
> >  * all of the sensible Xen configurations to work correctly.
> >  */
> > -   if (xen_domain())
> > +   if (xen_domain() || arch_virtio_direct_dma_ops(&vdev->dev))
> > return true;
> >  
> > return false;
> 
> Right but can't we fix the retpoline overhead such that
> vring_use_dma_api will not be called on data path any longer, making
> this a setup time check?

Yes it needs to be a setup time check regardless actually !

The above is broken, sorry I was a bit quick here (too early in the
morning... ugh). We don't want the arch to go override the dma ops
> every time that is called.

But yes, if we can fix the overhead, it becomes just a matter of
setting up the "right" ops automatically.

> > (Passing the dev allows the arch to know this is a virtio device in
> > "direct" mode or whatever we want to call the !iommu case, and
> > construct appropriate DMA ops for it, which aren't the same as the DMA
> > ops of any other PCI device who *do* use the iommu).
> 
> I think that's where Christoph might have specific ideas about it.

OK well, assuming Christoph can solve the direct case in a way that
also works for the virtio !iommu case, we still want some bit of logic
somewhere that will "switch" to swiotlb based ops if the DMA mask is
limited.

You mentioned an RFC for that ? Do you happen to have a link ?

It would be indeed ideal if all we had to do was setup some kind of
bus_dma_mask on all PCI devices and have virtio automagically insert
swiotlb when necessary.

Cheers,
Ben.




Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Michael S. Tsirkin
On Tue, Aug 07, 2018 at 07:26:35AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-08-06 at 23:35 +0300, Michael S. Tsirkin wrote:
> > > As I said replying to Christoph, we are "leaking" into the interface
> > > something here that is really what the VM is doing to itself, which
> > > is to stash its memory away in an inaccessible place.
> > > 
> > > Cheers,
> > > Ben.
> > 
> > I think Christoph merely objects to the specific implementation.  If
> > instead you do something like tweak dev->bus_dma_mask for the virtio
> > device I think he won't object.
> 
> Well, we don't have "bus_dma_mask" yet ..or you mean dma_mask ?
> 
> So, something like that would be a possibility, but the problem is that
> the current virtio (guest side) implementation doesn't honor this when
> not using dma ops and will not use dma ops if not using iommu, so back
> to square one.

Well we have the RFC for that - the switch to using DMA ops unconditionally
isn't problematic in itself IMHO; for now that RFC is blocked by its
performance overhead, but Christoph says he's trying to remove that for
direct mappings, so we should hopefully be able to get there in X weeks.

> Christoph seems to want to use a flag in the interface to make
> the guest use dma_ops, which is what I don't understand.
> 
> What would be needed then would be something along the lines of virtio
> noticing that dma_mask isn't big enough to cover all of memory (which
> isn't something generic code can easily do here for various reasons I
> can elaborate if you want, but that specific test more/less has to be
> arch specific), and in that case, force itself to use DMA ops routed to
> swiotlb.
> 
> I'd rather have arch code do the bulk of that work, don't you think ?
> 
> Which brings me back to this option, which may be the simplest and
> avoids the overhead of the proposed series (I found the series to be a
> nice cleanup but retpoline does kick us in the nuts here).
> 
> So what about this ?
> 
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -155,7 +155,7 @@ static bool vring_use_dma_api(struct virtio_device
> *vdev)
>  * the DMA API if we're a Xen guest, which at least allows
>  * all of the sensible Xen configurations to work correctly.
>  */
> -   if (xen_domain())
> +   if (xen_domain() || arch_virtio_direct_dma_ops(&vdev->dev))
> return true;
>  
> return false;

Right but can't we fix the retpoline overhead such that
vring_use_dma_api will not be called on data path any longer, making
this a setup time check?


> (Passing the dev allows the arch to know this is a virtio device in
> "direct" mode or whatever we want to call the !iommu case, and
> construct appropriate DMA ops for it, which aren't the same as the DMA
> ops of any other PCI device who *do* use the iommu).

I think that's where Christoph might have specific ideas about it.

> Otherwise, the harder option would be for us to hack so that
> xen_domain() returns true in our setup (gross), and have the arch code,
> when it sets up PCI device DMA ops, have a gross hack to identify
> virtio PCI devices, check their F_IOMMU flag itself, and set up the
> different ops at that point.
> 
> As for those "special" ops, they are of course just normal swiotlb ops,
> there's nothing "special" other that they aren't the ops that other PCI
> device on that bus use.
> 
> Cheers,
> Ben.

-- 
MST
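
A minimal sketch of the setup-time check suggested here, in self-contained C
with illustrative names: evaluate the policy once when the virtqueue is
created and cache the result, so the data path tests a plain flag instead of
re-running (potentially retpoline-priced) policy code:

#include <stdbool.h>

struct vq_sketch {
	bool use_dma_api;	/* decided once at setup time */
};

static bool decide_dma_policy(void)
{
	/* xen_domain(), arch quirks, feature bits, ... evaluated once */
	return false;
}

static void vq_setup(struct vq_sketch *vq)
{
	vq->use_dma_api = decide_dma_policy();
}

static inline bool vq_use_dma_api(const struct vq_sketch *vq)
{
	return vq->use_dma_api;	/* hot path: one predictable branch */
}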


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Benjamin Herrenschmidt
On Mon, 2018-08-06 at 23:35 +0300, Michael S. Tsirkin wrote:
> > As I said replying to Christoph, we are "leaking" into the interface
> > something here that is really what the VM is doing to itself, which
> > is to stash its memory away in an inaccessible place.
> > 
> > Cheers,
> > Ben.
> 
> I think Christoph merely objects to the specific implementation.  If
> instead you do something like tweak dev->bus_dma_mask for the virtio
> device I think he won't object.

Well, we don't have "bus_dma_mask" yet ..or you mean dma_mask ?

So, something like that would be a possibility, but the problem is that
the current virtio (guest side) implementation doesn't honor this when
not using dma ops and will not use dma ops if not using iommu, so back
to square one.

Christoph seems to want to use a flag in the interface to make
the guest use dma_ops, which is what I don't understand.

What would be needed then would be something along the lines of virtio
noticing that dma_mask isn't big enough to cover all of memory (which
isn't something generic code can easily do here for various reasons I
can elaborate if you want, but that specific test more/less has to be
arch specific), and in that case, force itself to use DMA ops routed to
swiotlb.

I'd rather have arch code do the bulk of that work, don't you think ?

Which brings me back to this option, which may be the simplest and
avoids the overhead of the proposed series (I found the series to be a
nice cleanup but retpoline does kick us in the nuts here).

So what about this ?

--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -155,7 +155,7 @@ static bool vring_use_dma_api(struct virtio_device
*vdev)
 * the DMA API if we're a Xen guest, which at least allows
 * all of the sensible Xen configurations to work correctly.
 */
-   if (xen_domain())
+   if (xen_domain() || arch_virtio_direct_dma_ops(&vdev->dev))
return true;
 
return false;

(Passing the dev allows the arch to know this is a virtio device in
"direct" mode or whatever we want to call the !iommu case, and
construct appropriate DMA ops for it, which aren't the same as the DMA
ops of any other PCI device who *do* use the iommu).

Otherwise, the harder option would be for us to hack so that
xen_domain() returns true in our setup (gross), and have the arch code,
when it sets up PCI device DMA ops, have a gross hack to identify
virtio PCI devices, check their F_IOMMU flag itself, and set up the
different ops at that point.

As for those "special" ops, they are of course just normal swiotlb ops,
there's nothing "special" other that they aren't the ops that other PCI
device on that bus use.

Cheers,
Ben.




Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Michael S. Tsirkin
On Tue, Aug 07, 2018 at 05:56:59AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-08-06 at 16:46 +0300, Michael S. Tsirkin wrote:
> > 
> > > Right, we'll need some quirk to disable balloons  in the guest I
> > > suppose.
> > > 
> > > Passing something from libvirt is cumbersome because the end user may
> > > not even need to know about secure VMs. There are use cases where the
> > > security is a contract down to some special application running inside
> > > the secure VM, the sysadmin knows nothing about.
> > > 
> > > Also there's repercussions all the way to admin tools, web UIs etc...
> > > so it's fairly wide ranging.
> > > 
> > > So as long as we only need to quirk a couple of devices, it's much
> > > better contained that way.
> > 
> > So just the balloon thing already means that yes management and all the
> > way to the user tools must know this is going on. Otherwise
> > user will try to inflate the balloon and wonder why this does not work.
> 
> There are *dozens* of management systems out there, not even all open
> source; we won't ever see the end of the tunnel if we need
> to teach every single one of them, including end users, about
> platform-specific new VM flags like that.
> 
> .../...

In the end I suspect you will find you have to.

> > Here's another example: you can't migrate a secure vm to hypervisor
> > which doesn't support this feature. Again management tools above libvirt
> > need to know otherwise they will try.
> 
> There will have to be a new machine type for that I suppose, yes,
> though it's not just the hypervisor that needs to know about the
> modified migration stream, it's also the need to have a compatible
> ultravisor with the right keys on the other side.
> 
> So migration is going to be special and require extra admin work in all
> cases yes. But not all secure VMs are meant to be migratable.
> 
> In any case, back to the problem at hand. What a qemu flag gives us is
> just a way to force iommu at VM creation time.

I don't think a qemu flag is strictly required for the problem at hand.

> This is rather sub-optimal, we don't really want the iommu in the way,
> so it's at best a "workaround", and it's not really solving the real
> problem.

This specific problem, I think I agree.

> As I said replying to Christoph, we are "leaking" into the interface
> something here that is really what the VM is doing to itself, which
> is to stash its memory away in an inaccessible place.
> 
> Cheers,
> Ben.

I think Christoph merely objects to the specific implementation.  If
instead you do something like tweak dev->bus_dma_mask for the virtio
device I think he won't object.

-- 
MST


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Benjamin Herrenschmidt
On Mon, 2018-08-06 at 16:46 +0300, Michael S. Tsirkin wrote:
> 
> > Right, we'll need some quirk to disable balloons  in the guest I
> > suppose.
> > 
> > Passing something from libvirt is cumbersome because the end user may
> > not even need to know about secure VMs. There are use cases where the
> > security is a contract down to some special application running inside
> > the secure VM, the sysadmin knows nothing about.
> > 
> > Also there's repercussions all the way to admin tools, web UIs etc...
> > so it's fairly wide ranging.
> > 
> > So as long as we only need to quirk a couple of devices, it's much
> > better contained that way.
> 
> So just the balloon thing already means that yes management and all the
> way to the user tools must know this is going on. Otherwise
> user will try to inflate the balloon and wonder why this does not work.

There are *dozens* of management systems out there, not even all open
source; we won't ever see the end of the tunnel if we need
to teach every single one of them, including end users, about
platform-specific new VM flags like that.

.../...

> Here's another example: you can't migrate a secure vm to hypervisor
> which doesn't support this feature. Again management tools above libvirt
> need to know otherwise they will try.

There will have to be a new machine type for that I suppose, yes,
though it's not just the hypervisor that needs to know about the
modified migration stream, it's also the need to have a compatible
ultravisor with the right keys on the other side.

So migration is going to be special and require extra admin work in all
cases yes. But not all secure VMs are meant to be migratable.

In any case, back to the problem at hand. What a qemu flag gives us is
just a way to force iommu at VM creation time.

This is rather sub-optimal, we don't really want the iommu in the way,
so it's at best a "workaround", and it's not really solving the real
problem.

As I said replying to Christoph, we are "leaking" into the interface
something here that is really what the VM is doing to itself, which
is to stash its memory away in an inaccessible place.

Cheers,
Ben.




Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Benjamin Herrenschmidt
On Mon, 2018-08-06 at 02:42 -0700, Christoph Hellwig wrote:
> On Mon, Aug 06, 2018 at 07:16:47AM +1000, Benjamin Herrenschmidt wrote:
> > Who would set this bit ? qemu ? Under what circumstances ?
> 
> I don't really care who sets what.  The implementation might not even
> involve qemu.
> 
> It is your job to write a coherent interface specification that does
> not depend on the used components.  The hypervisor might be PAPR,
> Linux + qemu, VMware, Hyperv or something so secret that you'd have
> to shoot me if you had to tell me.  The guest might be Linux, FreeBSD,
> AIX, OS400 or a Hipster project of the day in Rust.  As long as we
> properly specify the interface it simply does not matter.

That's the point Christoph. The interface is today's interface. It does
NOT change. That information is not part of the interface.

It's the VM itself that is stashing away its memory in a secret place,
and thus needs to do bounce buffering. There is no change to the virtio
interface per-se.

> > What would be the effect of this bit while VIRTIO_F_IOMMU is NOT set,
> > ie, what would qemu do and what would Linux do ? I'm not sure I fully
> > understand your idea.
> 
> In a perfect world we'd just reuse VIRTIO_F_IOMMU and clarify the
> description, which currently is rather vague but basically captures
> the use case.  Currently it is:
> 
> VIRTIO_F_IOMMU_PLATFORM(33)
> This feature indicates that the device is behind an IOMMU that
> translates bus addresses from the device into physical addresses in
> memory. If this feature bit is set to 0, then the device emits
> physical addresses which are not translated further, even though an
> IOMMU may be present.
> 
> And I'd change it to something like:
> 
> VIRTIO_F_PLATFORM_DMA(33)
> This feature indicates that the device emits platform specific
> bus addresses that might not be identical to physical address.
> The translation of physical to bus address is platform specific
> and defined by the platform specification for the bus that the virtio
> device is attached to.
> If this feature bit is set to 0, then the device emits
> physical addresses which are not translated further, even if
> the platform would normally require translations for the bus that
> the virtio device is attached to.
> 
> If we can't change the definition any more we should deprecate the
> old VIRTIO_F_IOMMU_PLATFORM bit, and require the VIRTIO_F_IOMMU_PLATFORM
> and VIRTIO_F_PLATFORM_DMA to be not set at the same time.

But this doesn't really change our problem does it ?

None of what happens in our case is part of the "interface". The
suggestion to force the iommu ON was simply that it was a "workaround"
as by doing so, we get to override the DMA ops, but that's just a
trick.

Fundamentally, what we need to solve is pretty much entirely a guest
problem.

> > I'm trying to understand because the limitation is not a device side
> > limitation, it's not a qemu limitation, it's actually more of a VM
> > limitation. It has most of its memory pages made inaccessible for
> > security reasons. The platform from a qemu/KVM perspective is almost
> > entirely normal.
> 
> Well, find a way to describe this either in the qemu specification using
> new feature bits, or by using something like the above.

But again, why do you want to involve the interface, and thus the
hypervisor for something that is essentially what the guest is doing to
itself ?

It really is something we need to solve locally to the guest, it's not
part of the interface.

Cheers,
Ben.




Re: [PATCH net-next V2] vhost: switch to use new message format

2018-08-06 Thread David Miller
From: Jason Wang 
Date: Mon,  6 Aug 2018 11:17:47 +0800

> We used to have a message like:
> 
> struct vhost_msg {
>   int type;
>   union {
>   struct vhost_iotlb_msg iotlb;
>   __u8 padding[64];
>   };
> };
> 
> Unfortunately, there will be a 32bit hole on 64bit machines because
> of the alignment. This leads to different formats for the 32bit API and
> 64bit API. What's more, it will break 32bit programs running on 64bit
> machines.
> 
> So fix this by introducing a new message type with an explicit
> 32bit reserved field after the type, like:
> 
> struct vhost_msg_v2 {
>   __u32 type;
>   __u32 reserved;
>   union {
>   struct vhost_iotlb_msg iotlb;
>   __u8 padding[64];
>   };
> };
> 
> We will have a consistent ABI after switching to use this. To enable
> this capability, introduce a new ioctl (VHOST_SET_BACKEND_FEATURE) for
> userspace to enable this feature (VHOST_BACKEND_F_IOTLB_V2).
> 
> Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> Signed-off-by: Jason Wang 
> ---
> Changes from V1:
> - use __u32 instead of int for type

Applied, thanks Jason.
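
For illustration, a small self-contained C11 program showing the hole the
commit message describes; the struct layouts mirror the quoted v1/v2
messages, the iotlb field types are assumptions modelled on vhost_iotlb_msg,
and the printed offsets follow from the C alignment rules:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct iotlb_like { uint64_t iova, size, uaddr; uint8_t perm, type; };

struct vhost_msg_v1 {
	int type;		/* 4 bytes, then a hole on LP64 */
	union { struct iotlb_like iotlb; uint8_t padding[64]; };
};

struct vhost_msg_v2 {
	uint32_t type;
	uint32_t reserved;	/* makes the layout explicit everywhere */
	union { struct iotlb_like iotlb; uint8_t padding[64]; };
};

int main(void)
{
	/* LP64: v1 offset is 8 (4-byte hole); 32bit x86: 4. v2 is 8 on both. */
	printf("v1 union offset: %zu\n", offsetof(struct vhost_msg_v1, iotlb));
	printf("v2 union offset: %zu\n", offsetof(struct vhost_msg_v2, iotlb));
	return 0;
}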


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Christoph Hellwig
On Mon, Aug 06, 2018 at 07:13:32PM +0300, Michael S. Tsirkin wrote:
> Oh that makes sense then. Could you post a pointer pls so
> this patchset is rebased on top (there are things to
> change about 4/4 but 1-3 could go in if they don't add
> overhead)?

The dma mapping direct calls will need major work vs what I posted.
I plan to start that work in about two weeks once returning from my
vacation.


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Michael S. Tsirkin
On Mon, Aug 06, 2018 at 09:10:40AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 06, 2018 at 07:06:05PM +0300, Michael S. Tsirkin wrote:
> > > I've done something very similar in the thread I posted a few years
> > > ago.
> > 
> > Right so that was before spectre where a virtual call was cheaper :(
> 
> Sorry, I meant days, not years.  The whole point of the thread was the
> slowdowns due to retpolines, which are the software spectre mitigation.

Oh that makes sense then. Could you post a pointer pls so
this patchset is rebased on top (there are things to
change about 4/4 but 1-3 could go in if they don't add
overhead)?

-- 
MST


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Christoph Hellwig
On Mon, Aug 06, 2018 at 07:06:05PM +0300, Michael S. Tsirkin wrote:
> > I've done something very similar in the thread I posted a few years
> > ago.
> 
> Right so that was before spectre where a virtual call was cheaper :(

Sorry, I meant days, not years.  The whole point of the thread was the
slowdowns due to retpolines, which are the software spectre mitigation.


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Michael S. Tsirkin
On Mon, Aug 06, 2018 at 08:24:06AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 06, 2018 at 04:36:43PM +0300, Michael S. Tsirkin wrote:
> > On Mon, Aug 06, 2018 at 02:32:28PM +0530, Anshuman Khandual wrote:
> > > On 08/05/2018 05:54 AM, Michael S. Tsirkin wrote:
> > > > On Fri, Aug 03, 2018 at 08:21:26PM -0500, Benjamin Herrenschmidt wrote:
> > > >> On Fri, 2018-08-03 at 22:08 +0300, Michael S. Tsirkin wrote:
> > > >> Please go through these patches and review whether this approach 
> > > >> broadly
> > > >> makes sense. I will appreciate suggestions, inputs, comments 
> > > >> regarding
> > > >> the patches or the approach in general. Thank you.
> > > >
> > > > Jason did some work on profiling this. Unfortunately he reports
> > > > about 4% extra overhead from this switch on x86 with no vIOMMU.
> > > 
> > >  The test is rather simple, just run pktgen 
> > >  (pktgen_sample01_simple.sh) in
> > >  guest and measure PPS on tap on host.
> > > 
> > >  Thanks
> > > >>>
> > > >>> Could you supply host configuration involved please?
> > > >>
> > > >> I wonder how much of that could be caused by Spectre mitigations
> > > >> blowing up indirect function calls...
> > > >>
> > > >> Cheers,
> > > >> Ben.
> > > > 
> > > > I won't be surprised. If yes I suggested a way to mitigate the overhead.
> > > 
> > > Did we get better results (lower regression due to indirect calls) with
> > > the suggested mitigation ? Just curious.
> > 
> > I'm referring to this:
> > I wonder whether we can support map_sg and friends being NULL, then use
> > that when mapping is an identity. A conditional branch there is likely
> > very cheap.
> > 
> > I don't think anyone tried implementing this yes.
> 
> I've done something very similar in the thread I posted a few years
> ago.

Right so that was before spectre where a virtual call was cheaper :(

> I plan to get a version of that upstream for 4.20, but it won't
> cover the virtio case, just the real direct mapping.

I guess this RFC will have to be reworked on top and performance retested.

Thanks,

-- 
MST


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Christoph Hellwig
On Mon, Aug 06, 2018 at 04:36:43PM +0300, Michael S. Tsirkin wrote:
> On Mon, Aug 06, 2018 at 02:32:28PM +0530, Anshuman Khandual wrote:
> > On 08/05/2018 05:54 AM, Michael S. Tsirkin wrote:
> > > On Fri, Aug 03, 2018 at 08:21:26PM -0500, Benjamin Herrenschmidt wrote:
> > >> On Fri, 2018-08-03 at 22:08 +0300, Michael S. Tsirkin wrote:
> > >> Please go through these patches and review whether this approach 
> > >> broadly
> > >> makes sense. I will appreciate suggestions, inputs, comments 
> > >> regarding
> > >> the patches or the approach in general. Thank you.
> > >
> > > Jason did some work on profiling this. Unfortunately he reports
> > > about 4% extra overhead from this switch on x86 with no vIOMMU.
> > 
> >  The test is rather simple, just run pktgen (pktgen_sample01_simple.sh) 
> >  in
> >  guest and measure PPS on tap on host.
> > 
> >  Thanks
> > >>>
> > >>> Could you supply host configuration involved please?
> > >>
> > >> I wonder how much of that could be caused by Spectre mitigations
> > >> blowing up indirect function calls...
> > >>
> > >> Cheers,
> > >> Ben.
> > > 
> > > I won't be surprised. If yes I suggested a way to mitigate the overhead.
> > 
> > Did we get better results (lower regression due to indirect calls) with
> > the suggested mitigation ? Just curious.
> 
> I'm referring to this:
>   I wonder whether we can support map_sg and friends being NULL, then use
>   that when mapping is an identity. A conditional branch there is likely
>   very cheap.
> 
> I don't think anyone tried implementing this yes.

I've done something very similar in the thread I posted a few years
ago. I plan to get a version of that upstream for 4.20, but it won't
cover the virtio case, just the real direct mapping.
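
A minimal sketch of the quoted idea, with illustrative names and not the code
that was eventually merged: a NULL method means "identity mapping", so the
fast path is a cheap branch rather than an indirect (retpoline-priced) call:

#include <linux/dma-mapping.h>
#include <linux/mm.h>

static inline dma_addr_t map_page_fast(struct device *dev, struct page *page,
				       unsigned long offset, size_t size,
				       enum dma_data_direction dir)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	if (!ops || !ops->map_page)	/* identity: no translation needed */
		return page_to_phys(page) + offset;

	return ops->map_page(dev, page, offset, size, dir, 0);
}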


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Will Deacon
Hi Michael,

On Sun, Aug 05, 2018 at 03:27:42AM +0300, Michael S. Tsirkin wrote:
> On Wed, Aug 01, 2018 at 09:16:38AM +0100, Will Deacon wrote:
> > On Tue, Jul 31, 2018 at 03:36:22PM -0500, Benjamin Herrenschmidt wrote:
> > > On Tue, 2018-07-31 at 10:30 -0700, Christoph Hellwig wrote:
> > > > > However the question people raise is that DMA API is already full of
> > > > > arch-specific tricks the likes of which are outlined in your post 
> > > > > linked
> > > > > above. How is this one much worse?
> > > > 
> > > > None of these warts is visible to the driver, they are all handled in
> > > > the architecture (possibly on a per-bus basis).
> > > > 
> > > > So for virtio we really need to decide if it has one set of behavior
> > > > as specified in the virtio spec, or if it behaves exactly as if it
> > > > was on a PCI bus, or in fact probably both as you lined up.  But no
> > > > magic arch specific behavior inbetween.
> > > 
> > > The only arch specific behaviour is needed in the case where it doesn't
> > > behave like PCI. In this case, the PCI DMA ops are not suitable, but in
> > > our secure VMs, we still need to make it use swiotlb in order to bounce
> > > through non-secure pages.
> > 
> > On arm/arm64, the problem we have is that legacy virtio devices on the MMIO
> > transport (so definitely not PCI) have historically been advertised by qemu
> > as not being cache coherent, but because the virtio core has bypassed DMA
> > ops then everything has happened to work. If we blindly enable the arch DMA
> > ops, we'll plumb in the non-coherent ops and start getting data corruption,
> > so we do need a way to quirk virtio as being "always coherent" if we want to
> > use the DMA ops (which we do, because our emulation platforms have an IOMMU
> > for all virtio devices).
> > 
> > Will
> 
> Right that's not very different from placing the device within the IOMMU
> domain but in fact bypassing the IOMMU

Hmm, I'm not sure I follow you here -- the IOMMU bypassing is handled
inside the IOMMU driver, so we'd still end up with non-coherent DMA ops
for the guest accesses. The presence of an IOMMU doesn't imply coherency for
us. Or am I missing your point here?

> I wonder whether anyone ever needs a non coherent virtio-mmio. If yes we
> can extend PLATFORM_IOMMU to cover that or add another bit.

I think that's probably the right way around: assume that legacy virtio-mmio
devices are coherent by default.

> What exactly do the non-coherent ops do that causes the corruption?

The non-coherent ops mean that the guest ends up allocating the vring queues
using non-cacheable mappings, whereas qemu/hypervisor uses a cacheable
mapping despite not advertising the devices as being cache-coherent.

This hits something in the architecture known as "mismatched aliases", which
means that coherency is lost between the guest and the hypervisor,
consequently resulting in data not being visible and ordering not being
guaranteed. The usual symptom is that the device appears to lock up iirc,
because the guest and the hypervisor are unable to communicate with each
other.

Does that help to clarify things?

Thanks,

Will
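
For illustration, a sketch of the allocation split Will describes, with
illustrative names (the real virtio_ring code differs in detail): with DMA
ops in use the ring honours whatever coherency the device declares, while the
legacy bypass hands out ordinary cacheable memory and a physical address:

#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/slab.h>

static void *ring_alloc(struct device *dev, size_t size, dma_addr_t *dma,
			bool use_dma_api)
{
	void *va;

	if (use_dma_api)
		/* may be a non-cacheable mapping for a non-coherent device */
		return dma_alloc_coherent(dev, size, dma, GFP_KERNEL);

	/* legacy bypass: cacheable kernel memory, phys addr as "dma" addr */
	va = kzalloc(size, GFP_KERNEL);
	if (va)
		*dma = (dma_addr_t)virt_to_phys(va);
	return va;
}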


RE: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-06 Thread Wang, Wei W
On Monday, August 6, 2018 9:29 PM, Tetsuo Handa wrote:
> On 2018/08/06 21:44, Wang, Wei W wrote:
> > On Monday, August 6, 2018 6:29 PM, Tetsuo Handa wrote:
> >> On 2018/08/06 18:56, Wei Wang wrote:
> >>> On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
>  On 2018/08/03 17:32, Wei Wang wrote:
> > +static int virtio_balloon_register_shrinker(struct virtio_balloon
> > +*vb) {
> > +    vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
> > +    vb->shrinker.count_objects = virtio_balloon_shrinker_count;
> > +    vb->shrinker.batch = 0;
> > +    vb->shrinker.seeks = DEFAULT_SEEKS;
>  Why is the flags field not set? If vb is allocated by
>  kmalloc(GFP_KERNEL) and is nowhere zero-cleared, KASAN would
> complain about it.
> >>>
> >>> Could you point to where in the code that would complain?
> >>> I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and
> >>> they don't seem related to that.
> >>
> >> Where is vb->shrinker.flags initialized?
> >
> > Is that mandatory to be initialized?
> 
> Of course. ;-)
> 
> > I find it's not initialized in most shrinkers (e.g. zs_register_shrinker,
> huge_zero_page_shrinker).
> 
> Because most shrinkers are "statically initialized (which means that
> unspecified fields are implicitly zero-cleared)" or "dynamically allocated 
> with
> __GFP_ZERO or zero-cleared using
> memset() (which means that all fields are once zero-cleared)".
> 
> And if you once zero-clear vb at allocation time, you will get a bonus that
> calling unregister_shrinker() without corresponding register_shrinker() is 
> safe
> (which will simplify initialization failure path).

Oh, I see, thanks. So it sounds better to directly kzalloc vb.

Best,
Wei
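
A minimal sketch of that allocation, assuming a stand-in struct (illustrative
names, not the virtio_balloon code):

#include <linux/shrinker.h>
#include <linux/slab.h>

struct balloon_sketch {		/* stand-in for struct virtio_balloon */
	struct shrinker shrinker;
	/* ... other fields ... */
};

static struct balloon_sketch *balloon_alloc(void)
{
	/*
	 * kzalloc, not kmalloc: every shrinker field (flags included)
	 * starts zeroed, and unregister_shrinker() without a prior
	 * register_shrinker() stays safe, simplifying the failure path.
	 */
	return kzalloc(sizeof(struct balloon_sketch), GFP_KERNEL);
}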

Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Michael S. Tsirkin
On Sun, Aug 05, 2018 at 02:52:54PM +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2018-08-05 at 03:22 +0300, Michael S. Tsirkin wrote:
> > I see the allure of this, but I think down the road you will
> > discover passing a flag in libvirt XML saying
> > "please use a secure mode" or whatever is a good idea.
> > 
> > Even thought it is probably not required to address this
> > specific issue.
> > 
> > For example, I don't think ballooning works in secure mode,
> > you will be able to teach libvirt not to try to add a
> > balloon to the guest.
> 
> Right, we'll need some quirk to disable balloons  in the guest I
> suppose.
> 
> Passing something from libvirt is cumbersome because the end user may
> not even need to know about secure VMs. There are use cases where the
> security is a contract down to some special application running inside
> the secure VM, the sysadmin knows nothing about.
> 
> Also there's repercussions all the way to admin tools, web UIs etc...
> so it's fairly wide ranging.
> 
> So as long as we only need to quirk a couple of devices, it's much
> better contained that way.

So just the balloon thing already means that yes management and all the
way to the user tools must know this is going on. Otherwise
user will try to inflate the balloon and wonder why this does not work.

> > > Later on, (we may have even already run Linux at that point,
> > > unsecurely, as we can use Linux as a bootloader under some
> > > circumstances), we start a "secure image".
> > > 
> > > This is a kernel zImage that includes a "ticket" that has the
> > > appropriate signature etc... so that when that kernel starts, it can
> > > authenticate with the ultravisor, be verified (along with its ramdisk)
> > > etc... and copied (by the UV) into secure memory & run from there.
> > > 
> > > At that point, the hypervisor is informed that the VM has become
> > > secure.
> > > 
> > > So at that point, we could exit to qemu to inform it of the change,
> > 
> > That's probably a good idea too.
> 
> We probably will have to tell qemu eventually for migration, as we'll
> need some kind of key exchange phase etc... to deal with the crypto
> aspects (the actual page copy is sorted via encrypting the secure pages
> back to normal pages in qemu, but we'll need extra metadata).
> 
> > > and
> > > have it walk the qtree and "Switch" all the virtio devices to use the
> > > IOMMU I suppose, but it feels a lot grosser to me.
> > 
> > That part feels gross, yes.
> > 
> > > That's the only other option I can think of.
> > > 
> > > > However in this specific case, the flag does not need to come from the
> > > > hypervisor, it can be set by arch boot code I think.
> > > > Christoph do you see a problem with that?
> > > 
> > > The above could do that yes. Another approach would be to do it from a
> > > small virtio "quirk" that pokes a bit in the device to force it to
> > > iommu mode when it detects that we are running in a secure VM. That's a
> > > bit warty on the virtio side but probably not as much as having a qemu
> > > one that walks all of the virtio devices to change how they behave.
> > > 
> > > What do you reckon ?
> > 
> > I think you are right that for the dma limit the hypervisor doesn't seem
> > to need to know.
> 
> It's not just a limit mind you. It's a range, at least if we allocate
> just a single pool of insecure pages. swiotlb feels like a better
> option for us.
> 
> > > What we want to avoid is to expose any of this to the *end user* or
> > > libvirt or any other higher level of the management stack. We really
> > > want that stuff to remain contained between the VM itself, KVM and
> > > maybe qemu.
> > > 
> > > We will need some other qemu changes for migration so that's ok. But
> > > the minute you start touching libvirt and the higher levels it becomes
> > > a nightmare.
> > > 
> > > Cheers,
> > > Ben.
> > 
> > I don't believe you'll be able to avoid that entirely. The split between
> libvirt and qemu is more about community than about code; random bits of
> functionality tend to land on random sides of that fence.  Better add a
> > tag in domain XML early is my advice. Having said that, it's your
> > hypervisor. I'm just suggesting that when hypervisor does somehow need
> > to care then I suspect most people won't be receptive to the argument
> > that changing libvirt is a nightmare.
> 
> It only needs to care at runtime. The problem isn't changing libvirt
> per-se, I don't have a problem with that. The problem is that it means
> creating two categories of machines "secure" and "non-secure", which is
> end-user visible, and thus has to be escalated to all the various
> management stacks, UIs, etc... out there.
> 
> In addition, there are some cases where the individual creating the VMs
> may not have any idea that they are secure.
> 
> But yes, if we have to, we'll do it. However, so far, we don't think
> it's a great idea.
> 
> Cheers,
> Ben.

Here's another example: you can't migrate a secure vm to hypervisor
which doesn't support this feature. Again management tools above libvirt
need to know otherwise they will try.

Re: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-06 Thread Tetsuo Handa
On 2018/08/06 21:44, Wang, Wei W wrote:
> On Monday, August 6, 2018 6:29 PM, Tetsuo Handa wrote:
>> On 2018/08/06 18:56, Wei Wang wrote:
>>> On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
 On 2018/08/03 17:32, Wei Wang wrote:
> +static int virtio_balloon_register_shrinker(struct virtio_balloon
> +*vb) {
> +    vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
> +    vb->shrinker.count_objects = virtio_balloon_shrinker_count;
> +    vb->shrinker.batch = 0;
> +    vb->shrinker.seeks = DEFAULT_SEEKS;
 Why is the flags field not set? If vb is allocated by kmalloc(GFP_KERNEL)
 and is nowhere zero-cleared, KASAN would complain about it.
>>>
>>> Could you point to where in the code that would complain?
>>> I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and
>>> they don't seem related to that.
>>
>> Where is vb->shrinker.flags initialized?
> 
> Is that mandatory to be initialized?

Of course. ;-)

> I find it's not initialized in most shrinkers (e.g. zs_register_shrinker, 
> huge_zero_page_shrinker).

Because most shrinkers are "statically initialized (which means that 
unspecified fields are
implicitly zero-cleared)" or "dynamically allocated with __GFP_ZERO or 
zero-cleared using
memset() (which means that all fields are once zero-cleared)".

And if you once zero-clear vb at allocation time, you will get a bonus that
calling unregister_shrinker() without corresponding register_shrinker() is safe
(which will simplify initialization failure path).

RE: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-06 Thread Wang, Wei W
On Monday, August 6, 2018 6:29 PM, Tetsuo Handa wrote:
> On 2018/08/06 18:56, Wei Wang wrote:
> > On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
> >> On 2018/08/03 17:32, Wei Wang wrote:
> >>> +static int virtio_balloon_register_shrinker(struct virtio_balloon
> >>> +*vb) {
> >>> +    vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
> >>> +    vb->shrinker.count_objects = virtio_balloon_shrinker_count;
> >>> +    vb->shrinker.batch = 0;
> >>> +    vb->shrinker.seeks = DEFAULT_SEEKS;
> >> Why is the flags field not set? If vb is allocated by kmalloc(GFP_KERNEL)
> >> and is nowhere zero-cleared, KASAN would complain about it.
> >
> > Could you point to where in the code that would complain?
> > I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and
> > they don't seem related to that.
> 
> Where is vb->shrinker.flags initialized?

Is that mandatory to be initialized? I find it's not initialized in most 
shrinkers (e.g. zs_register_shrinker, huge_zero_page_shrinker).

Best,
Wei

Re: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-06 Thread Tetsuo Handa
On 2018/08/06 18:56, Wei Wang wrote:
> On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
>> On 2018/08/03 17:32, Wei Wang wrote:
>>> +static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
>>> +{
>>> +    vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
>>> +    vb->shrinker.count_objects = virtio_balloon_shrinker_count;
>>> +    vb->shrinker.batch = 0;
>>> +    vb->shrinker.seeks = DEFAULT_SEEKS;
>> Why is the flags field not set? If vb is allocated by kmalloc(GFP_KERNEL)
>> and is nowhere zero-cleared, KASAN would complain about it.
> 
> Could you point to where in the code that would complain?
> I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and they
> don't seem related to that.

Where is vb->shrinker.flags initialized?

Re: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-06 Thread Wei Wang

On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
> On 2018/08/03 17:32, Wei Wang wrote:
>> +static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
>> +{
>> +   vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
>> +   vb->shrinker.count_objects = virtio_balloon_shrinker_count;
>> +   vb->shrinker.batch = 0;
>> +   vb->shrinker.seeks = DEFAULT_SEEKS;
>
> Why is the flags field not set? If vb is allocated by kmalloc(GFP_KERNEL)
> and is nowhere zero-cleared, KASAN would complain about it.

Could you point to where in the code that would complain?
I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and they
don't seem related to that.



Best,
Wei


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Christoph Hellwig
On Mon, Aug 06, 2018 at 07:16:47AM +1000, Benjamin Herrenschmidt wrote:
> Who would set this bit ? qemu ? Under what circumstances ?

I don't really care who sets what.  The implementation might not even
involve qemu.

It is your job to write a coherent interface specification that does
not depend on the used components.  The hypervisor might be PAPR,
Linux + qemu, VMware, Hyperv or something so secret that you'd have
to shoot me if you had to tell me.  The guest might be Linux, FreeBSD,
AIX, OS400 or a Hipster project of the day in Rust.  As long as we
properly specify the interface it simply does not matter.

> What would be the effect of this bit while VIRTIO_F_IOMMU is NOT set,
> ie, what would qemu do and what would Linux do ? I'm not sure I fully
> understand your idea.

In a perfect world we'd just reuse VIRTIO_F_IOMMU and clarify the
description, which currently is rather vague but basically captures
the use case.  Currently it is:

VIRTIO_F_IOMMU_PLATFORM(33)
This feature indicates that the device is behind an IOMMU that
translates bus addresses from the device into physical addresses in
memory. If this feature bit is set to 0, then the device emits
physical addresses which are not translated further, even though an
IOMMU may be present.

And I'd change it to something like:

VIRTIO_F_PLATFORM_DMA(33)
This feature indicates that the device emits platform specific
bus addresses that might not be identical to physical address.
The translation of physical to bus address is platform specific
and defined by the platform specification for the bus that the virtio
device is attached to.
If this feature bit is set to 0, then the device emits
physical addresses which are not translated further, even if
the platform would normally require translations for the bus that
the virtio device is attached to.

If we can't change the definition any more we should deprecate the
old VIRTIO_F_IOMMU_PLATFORM bit, and require the VIRTIO_F_IOMMU_PLATFORM
and VIRTIO_F_PLATFORM_DMA to be not set at the same time.

> I'm trying to understand because the limitation is not a device side
> limitation, it's not a qemu limitation, it's actually more of a VM
> limitation. It has most of its memory pages made inaccessible for
> security reasons. The platform from a qemu/KVM perspective is almost
> entirely normal.

Well, find a way to describe this either in the qemu specification using
new feature bits, or by using something like the above.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Anshuman Khandual
On 08/05/2018 05:54 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 03, 2018 at 08:21:26PM -0500, Benjamin Herrenschmidt wrote:
>> On Fri, 2018-08-03 at 22:08 +0300, Michael S. Tsirkin wrote:
>> Please go through these patches and review whether this approach broadly
>> makes sense. I will appreciate suggestions, inputs, comments regarding
>> the patches or the approach in general. Thank you.
>
> Jason did some work on profiling this. Unfortunately he reports
> about 4% extra overhead from this switch on x86 with no vIOMMU.

 The test is rather simple, just run pktgen (pktgen_sample01_simple.sh) in
 guest and measure PPS on tap on host.

 Thanks
>>>
>>> Could you supply host configuration involved please?
>>
>> I wonder how much of that could be caused by Spectre mitigations
>> blowing up indirect function calls...
>>
>> Cheers,
>> Ben.
> 
> I won't be surprised. If yes I suggested a way to mitigate the overhead.

Did we get better results (lower regression due to indirect calls) with
the suggested mitigation ? Just curious.
