Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-03 Thread Benjamin Herrenschmidt
On Fri, 2018-08-03 at 22:08 +0300, Michael S. Tsirkin wrote:
> > > > Please go through these patches and review whether this approach broadly
> > > > makes sense. I will appreciate suggestions, inputs, comments regarding
> > > > the patches or the approach in general. Thank you.
> > > 
> > > Jason did some work on profiling this. Unfortunately he reports
> > > about 4% extra overhead from this switch on x86 with no vIOMMU.
> > 
> > The test is rather simple, just run pktgen (pktgen_sample01_simple.sh) in
> > guest and measure PPS on tap on host.
> > 
> > Thanks
> 
> Could you supply host configuration involved please?

I wonder how much of that could be caused by Spectre mitigations
blowing up indirect function calls...
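
For reference, here is a minimal userspace sketch (not kernel code) of the
two call shapes being compared. The current virtio path branches and uses the
physical address directly; the proposed path always goes through an ops table,
which costs an indirect call on every mapping. All sketch_* names are
hypothetical stand-ins, not the kernel's dma_map_ops API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's dma_addr_t. */
typedef uint64_t sketch_dma_addr;

/* Hypothetical one-entry ops table; the real dma_map_ops has many hooks. */
struct sketch_dma_ops {
    sketch_dma_addr (*map)(uint64_t pa);
};

/* "return pa (roughly)": identity mapping used when no IOMMU is involved. */
static sketch_dma_addr sketch_direct_map(uint64_t pa)
{
    return pa;
}

static const struct sketch_dma_ops sketch_direct_ops = {
    .map = sketch_direct_map,
};

/* Current shape: the no-IOMMU case is a plain branch, no indirect call. */
static sketch_dma_addr sketch_map_branch(bool use_dma_api,
                                         const struct sketch_dma_ops *ops,
                                         uint64_t pa)
{
    if (use_dma_api)
        return ops->map(pa);
    return pa;
}

/* Proposed shape: every mapping goes through the function pointer. */
static sketch_dma_addr sketch_map_via_ops(const struct sketch_dma_ops *ops,
                                          uint64_t pa)
{
    return ops->map(pa);
}
```

Both shapes return the same address for the identity case; the difference is
only in how the call is dispatched, which is where retpoline-style mitigations
could plausibly add cost.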

Cheers,
Ben.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-03 Thread Benjamin Herrenschmidt
On Fri, 2018-08-03 at 22:07 +0300, Michael S. Tsirkin wrote:
> On Fri, Aug 03, 2018 at 10:58:36AM -0500, Benjamin Herrenschmidt wrote:
> > On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> > > >   2- Make virtio use the DMA API with our custom platform-provided
> > > > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > > > running on a secure VM in our case.
> > > 
> > > And total NAK the custom platform-provided part of this.  We need
> > > a flag passed in from the hypervisor that the device needs all bus
> > > specific dma api treatment, and then just use the normal platform
> > > dma mapping setup.
> > 
> > Christoph, as I have explained already, we do NOT have a way to provide
> > such a flag as neither the hypervisor nor qemu knows anything about
> > this when the VM is created.
> 
> I think the fact you can't add flags from the hypervisor is
> a sign of a problematic architecture, you should look at
> adding that down the road - you will likely need it at some point.

Well, we can later in the boot process. At VM creation time, it's just
a normal VM. The VM firmware, bootloader etc... are just operating
normally etc...

Later on, (we may have even already run Linux at that point,
unsecurely, as we can use Linux as a bootloader under some
circumstances), we start a "secure image".

This is a kernel zImage that includes a "ticket" that has the
appropriate signature etc... so that when that kernel starts, it can
authenticate with the ultravisor, be verified (along with its ramdisk)
etc... and copied (by the UV) into secure memory & run from there.

At that point, the hypervisor is informed that the VM has become
secure.

So at that point, we could exit to qemu to inform it of the change, and
have it walk the qtree and "Switch" all the virtio devices to use the
IOMMU I suppose, but it feels a lot grosser to me.

That's the only other option I can think of.

> However in this specific case, the flag does not need to come from the
> hypervisor, it can be set by arch boot code I think.
> Christoph do you see a problem with that?

The above could do that yes. Another approach would be to do it from a
small virtio "quirk" that pokes a bit in the device to force it to
iommu mode when it detects that we are running in a secure VM. That's a
bit warty on the virtio side but probably not as much as having a qemu
one that walks all of the virtio devices to change how they behave.
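
A rough sketch of what such a quirk might look like, in plain C rather than
kernel code. Bit 33 is VIRTIO_F_IOMMU_PLATFORM in the virtio spec; the struct,
the helper names and the secure_guest argument are hypothetical and only
illustrate the idea of forcing the feature bit from the guest side.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bit number of VIRTIO_F_IOMMU_PLATFORM in the virtio spec. */
#define SKETCH_VIRTIO_F_IOMMU_PLATFORM 33

/* Hypothetical, heavily reduced device state: just the feature bits. */
struct sketch_virtio_device {
    uint64_t features;
};

/*
 * Hypothetical quirk: if the guest has become a secure VM, force the
 * IOMMU_PLATFORM feature so the DMA API path is taken for this device.
 */
static void sketch_secure_vm_quirk(struct sketch_virtio_device *vdev,
                                   bool secure_guest)
{
    if (secure_guest)
        vdev->features |= 1ULL << SKETCH_VIRTIO_F_IOMMU_PLATFORM;
}

/* Hypothetical check used later when deciding how to map buffers. */
static bool sketch_wants_dma_api(const struct sketch_virtio_device *vdev)
{
    return (vdev->features >> SKETCH_VIRTIO_F_IOMMU_PLATFORM) & 1;
}
```

The point of this shape is that nothing outside the guest has to know about
the transition: the decision is taken where the secure state is known.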

What do you reckon ?

What we want to avoid is to expose any of this to the *end user* or
libvirt or any other higher level of the management stack. We really
want that stuff to remain contained between the VM itself, KVM and
maybe qemu.

We will need some other qemu changes for migration so that's ok. But
the minute you start touching libvirt and the higher levels it becomes
a nightmare.

Cheers,
Ben.

> > >  To get swiotlb you'll need to then use the DT/ACPI
> > > dma-range property to limit the addressable range, and a swiotlb
> > > capable platform will use swiotlb automatically.
> > 
> > This cannot be done as you describe it.
> > 
> > The VM is created as a *normal* VM. The DT stuff is generated by qemu
> > at a point where it has *no idea* that the VM will later become secure
> > and thus will have to restrict which pages can be used for "DMA".
> > 
> > The VM will *at runtime* turn itself into a secure VM via interactions
> > with the security HW and the Ultravisor layer (which sits below the
> > HV). This happens way after the DT has been created and consumed, the
> > qemu devices instantiated etc...
> > 
> > Only the guest kernel knows because it initiates the transition. When
> > that happens, the virtio devices have already been used by the guest
> > firmware, bootloader, possibly another kernel that kexeced the "secure"
> > one, etc... 
> > 
> > So instead of running around saying NAK NAK NAK, please explain how we
> > can solve that differently.
> > 
> > Ben.
> > 



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-03 Thread Benjamin Herrenschmidt
On Fri, 2018-08-03 at 22:07 +0300, Michael S. Tsirkin wrote:
> On Fri, Aug 03, 2018 at 10:58:36AM -0500, Benjamin Herrenschmidt wrote:
> > On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> > > >   2- Make virtio use the DMA API with our custom platform-provided
> > > > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > > > running on a secure VM in our case.
> > > 
> > > And total NAK the custom platform-provided part of this.  We need
> > > a flag passed in from the hypervisor that the device needs all bus
> > > specific dma api treatment, and then just use the normal platform
> > > dma mapping setup.
> > 
> > Christoph, as I have explained already, we do NOT have a way to provide
> > such a flag as neither the hypervisor nor qemu knows anything about
> > this when the VM is created.
> 
> I think the fact you can't add flags from the hypervisor is
> a sign of a problematic architecture, you should look at
> adding that down the road - you will likely need it at some point.

(Apologies if you got this twice, my mailer had a brain fart and I don't
know if the first one got through & am about to disappear in a plane for 17h)

Well, we can later in the boot process. At VM creation time, it's just
a normal VM. The VM firmware, bootloader etc... are just operating
normally etc...

Later on, (we may have even already run Linux at that point,
unsecurely, as we can use Linux as a bootloader under some
circumstances), we start a "secure image".

This is a kernel zImage that includes a "ticket" that has the
appropriate signature etc... so that when that kernel starts, it can
authenticate with the ultravisor, be verified (along with its ramdisk)
etc... and copied (by the UV) into secure memory & run from there.

At that point, the hypervisor is informed that the VM has become
secure.

So at that point, we could exit to qemu to inform it of the change, and
have it walk the qtree and "Switch" all the virtio devices to use the
IOMMU I suppose, but it feels a lot grosser to me.

That's the only other option I can think of.

> However in this specific case, the flag does not need to come from the
> hypervisor, it can be set by arch boot code I think.
> Christoph do you see a problem with that?

The above could do that yes. Another approach would be to do it from a
small virtio "quirk" that pokes a bit in the device to force it to
iommu mode when it detects that we are running in a secure VM. That's a
bit warty on the virtio side but probably not as much as having a qemu
one that walks all of the virtio devices to change how they behave.

What do you reckon ?

What we want to avoid is to expose any of this to the *end user* or
libvirt or any other higher level of the management stack. We really
want that stuff to remain contained between the VM itself, KVM and
maybe qemu.

We will need some other qemu changes for migration so that's ok. But
the minute you start touching libvirt and the higher levels it becomes
a nightmare.

Cheers,
Ben.

> > >  To get swiotlb you'll need to then use the DT/ACPI
> > > dma-range property to limit the addressable range, and a swiotlb
> > > capable platform will use swiotlb automatically.
> > 
> > This cannot be done as you describe it.
> > 
> > The VM is created as a *normal* VM. The DT stuff is generated by qemu
> > at a point where it has *no idea* that the VM will later become secure
> > and thus will have to restrict which pages can be used for "DMA".
> > 
> > The VM will *at runtime* turn itself into a secure VM via interactions
> > with the security HW and the Ultravisor layer (which sits below the
> > HV). This happens way after the DT has been created and consumed, the
> > qemu devices instantiated etc...
> > 
> > Only the guest kernel knows because it initiates the transition. When
> > that happens, the virtio devices have already been used by the guest
> > firmware, bootloader, possibly another kernel that kexeced the "secure"
> > one, etc... 
> > 
> > So instead of running around saying NAK NAK NAK, please explain how we
> > can solve that differently.
> > 
> > Ben.
> > 



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-03 Thread Benjamin Herrenschmidt
On Fri, 2018-08-03 at 22:07 +0300, Michael S. Tsirkin wrote:
> On Fri, Aug 03, 2018 at 10:58:36AM -0500, Benjamin Herrenschmidt wrote:
> > On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> > > >   2- Make virtio use the DMA API with our custom platform-provided
> > > > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > > > running on a secure VM in our case.
> > > 
> > > And total NAK the custom platform-provided part of this.  We need
> > > a flag passed in from the hypervisor that the device needs all bus
> > > specific dma api treatment, and then just use the normal platform
> > > dma mapping setup.
> > 
> > Christoph, as I have explained already, we do NOT have a way to provide
> > such a flag as neither the hypervisor nor qemu knows anything about
> > this when the VM is created.
> 
> I think the fact you can't add flags from the hypervisor is
> a sign of a problematic architecture, you should look at
> adding that down the road - you will likely need it at some point.

Well, we can later in the boot process. At VM creation time, it's just
a normal VM. The VM firmware, bootloader etc... are just operating
normally etc...

Later on, (we may have even already run Linux at that point,
unsecurely, as we can use Linux as a bootloader under some
circumstances), we start a "secure image".

This is a kernel zImage that includes a "ticket" that has the
appropriate signature etc... so that when that kernel starts, it can
authenticate with the ultravisor, be verified (along with its ramdisk)
etc... and copied (by the UV) into secure memory & run from there.

At that point, the hypervisor is informed that the VM has become
secure.

So at that point, we could exit to qemu to inform it of the change, and
have it walk the qtree and "Switch" all the virtio devices to use the
IOMMU I suppose, but it feels a lot grosser to me.

That's the only other option I can think of.

> However in this specific case, the flag does not need to come from the
> hypervisor, it can be set by arch boot code I think.
> Christoph do you see a problem with that?

The above could do that yes. Another approach would be to do it from a
small virtio "quirk" that pokes a bit in the device to force it to
iommu mode when it detects that we are running in a secure VM. That's a
bit warty on the virtio side but probably not as much as having a qemu
one that walks all of the virtio devices to change how they behave.

What do you reckon ?

Cheers,
Ben.

> > >  To get swiotlb you'll need to then use the DT/ACPI
> > > dma-range property to limit the addressable range, and a swiotlb
> > > capable platform will use swiotlb automatically.
> > 
> > This cannot be done as you describe it.
> > 
> > The VM is created as a *normal* VM. The DT stuff is generated by qemu
> > at a point where it has *no idea* that the VM will later become secure
> > and thus will have to restrict which pages can be used for "DMA".
> > 
> > The VM will *at runtime* turn itself into a secure VM via interactions
> > with the security HW and the Ultravisor layer (which sits below the
> > HV). This happens way after the DT has been created and consumed, the
> > qemu devices instantiated etc...
> > 
> > Only the guest kernel knows because it initiates the transition. When
> > that happens, the virtio devices have already been used by the guest
> > firmware, bootloader, possibly another kernel that kexeced the "secure"
> > one, etc... 
> > 
> > So instead of running around saying NAK NAK NAK, please explain how we
> > can solve that differently.
> > 
> > Ben.
> > 



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-03 Thread Michael S. Tsirkin
On Fri, Aug 03, 2018 at 12:05:07AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 02, 2018 at 04:13:09PM -0500, Benjamin Herrenschmidt wrote:
> > So let's differentiate the two problems of having an IOMMU (real or
> > emulated) which indeed adds overhead etc... and using the DMA API.
> > 
> > At the moment, virtio does this all over the place:
> > 
> >     if (use_dma_api)
> >         dma_map/alloc_something(...)
> >     else
> >         use_pa
> > 
> > The idea of the patch set is to do two, somewhat orthogonal, changes
> > that together achieve what we want. Let me know where you think there
> > is "a bunch of issues" because I'm missing it:
> > 
> >  1- Replace the above if/else constructs with just calling the DMA API,
> > and have virtio, at initialization, hookup its own dma_ops that just
> > "return pa" (roughly) when the IOMMU stuff isn't used.
> > 
> > This adds an indirect function call to the path that previously didn't
> > have one (the else case above). Is that a significant/measurable
> > overhead ?
> 
> If you call it often enough it does:
> 
> https://www.spinics.net/lists/netdev/msg495413.html
> 
> >  2- Make virtio use the DMA API with our custom platform-provided
> > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > running on a secure VM in our case.
> 
> And total NAK the custom platform-provided part of this.  We need
> a flag passed in from the hypervisor that the device needs all bus
> specific dma api treatment, and then just use the normal platform
> dma mapping setup.  To get swiotlb you'll need to then use the DT/ACPI
> dma-range property to limit the addressable range, and a swiotlb
> capable platform will use swiotlb automatically.

It seems reasonable to teach a platform to override dma-range
for a specific device e.g. in case it knows about bugs in ACPI.

-- 
MST


Re: [net-next, v6, 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue

2018-08-03 Thread Michael S. Tsirkin
On Fri, Aug 03, 2018 at 12:06:51PM -0700, Andrei Vagin wrote:
> On Fri, Aug 03, 2018 at 12:08:05AM +0300, Michael S. Tsirkin wrote:
> > On Thu, Aug 02, 2018 at 02:04:12PM -0700, Nambiar, Amritha wrote:
> > > On 8/1/2018 5:11 PM, Andrei Vagin wrote:
> > > > On Tue, Jul 10, 2018 at 07:28:49PM -0700, Nambiar, Amritha wrote:
> > > >> With this patch series, I introduced static_key for XPS maps
> > > >> (xps_needed), so static_key_slow_inc() is used to switch branches. The
> > > >> definition of static_key_slow_inc() has cpus_read_lock in place. In the
> > > >> virtio_net driver, XPS queues are initialized after setting the
> > > >> queue:cpu affinity in virtnet_set_affinity() which is already protected
> > > >> within cpus_read_lock. Hence, the warning here trying to acquire
> > > >> cpus_read_lock when it is already held.
> > > >>
> > > >> A quick fix for this would be to just extract netif_set_xps_queue() out
> > > >> of the lock by simply wrapping it with another put/get_online_cpus
> > > >> (unlock right before and hold lock right after).
> > > > 
> > > > virtnet_set_affinity() is called from virtnet_cpu_online(), which is
> > > > called under cpus_read_lock too.
> > > > 
> > > > It looks like now we can't call netif_set_xps_queue() from cpu hotplug
> > > > callbacks.
> > > > 
> > > > I can suggest a very straightforward fix for this problem. The patch is
> > > > attached.
> > > > 
> > > 
> > > Thanks for looking into this. I was thinking of fixing this in the
> > > virtio_net driver by moving the XPS initialization (and have a new
> > > get_affinity utility) in the ndo_open (so it is together with other tx
> > > preparation) instead of probe. Your patch solves this in general for
> > > setting up cpu hotplug callbacks which is under cpus_read_lock.
> > 
> > 
> > I like this too. Could you repost in a standard way
> > (inline, with your signoff etc) so we can ack this for
> > net-next?
> 
> When I was testing this patch, I got the following KASAN warning. Michael,
> could you take a look at it? Maybe you will understand what was going wrong
> there.
> 
> https://api.travis-ci.org/v3/job/410701353/log.txt
> 
> [7.275033] ==
> [7.275226] BUG: KASAN: slab-out-of-bounds in virtnet_poll+0xaa1/0xd00
> [7.275359] Read of size 8 at addr 8801d444a000 by task ip/370
> [7.275488] 
> [7.275610] CPU: 1 PID: 370 Comm: ip Not tainted 4.18.0-rc6+ #1
> [7.275613] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> [7.275616] Call Trace:
> [7.275621]  
> [7.275630]  dump_stack+0x71/0xab
> [7.275640]  print_address_description+0x6a/0x270
> [7.275648]  kasan_report+0x258/0x380
> [7.275653]  ? virtnet_poll+0xaa1/0xd00
> [7.275661]  virtnet_poll+0xaa1/0xd00
> [7.275680]  ? receive_buf+0x5920/0x5920
> [7.275689]  ? do_raw_spin_unlock+0x54/0x220
> [7.275699]  ? find_held_lock+0x32/0x1c0
> [7.275710]  ? rcu_process_callbacks+0xa60/0xd20
> [7.275736]  net_rx_action+0x2ee/0xad0
> [7.275748]  ? rcu_note_context_switch+0x320/0x320
> [7.275754]  ? napi_complete_done+0x300/0x300
> [7.275763]  ? native_apic_msr_write+0x27/0x30
> [7.275768]  ? lapic_next_event+0x5b/0x90
> [7.275775]  ? clockevents_program_event+0x21d/0x2f0
> [7.275791]  __do_softirq+0x19a/0x623
> [7.275807]  do_softirq_own_stack+0x2a/0x40
> [7.275811]  
> [7.275818]  do_softirq.part.18+0x6a/0x80
> [7.275825]  __local_bh_enable_ip+0x49/0x50
> [7.275829]  virtnet_open+0x129/0x440
> [7.275841]  __dev_open+0x189/0x2c0
> [7.275848]  ? dev_set_rx_mode+0x30/0x30
> [7.275857]  ? do_raw_spin_unlock+0x54/0x220
> [7.275866]  __dev_change_flags+0x3a9/0x4f0
> [7.275873]  ? dev_set_allmulti+0x10/0x10
> [7.275889]  dev_change_flags+0x7a/0x150
> [7.275900]  do_setlink+0x9fe/0x2e40
> [7.275910]  ? deref_stack_reg+0xad/0xe0
> [7.275917]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
> [7.275922]  ? find_held_lock+0x32/0x1c0
> [7.275929]  ? rtnetlink_put_metrics+0x460/0x460
> [7.275935]  ? virtqueue_add_sgs+0x9e2/0xde0
> [7.275953]  ? virtscsi_add_cmd+0x454/0x780
> [7.275964]  ? find_held_lock+0x32/0x1c0
> [7.275973]  ? deref_stack_reg+0xad/0xe0
> [7.275979]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
> [7.275985]  ? lock_downgrade+0x5e0/0x5e0
> [7.275993]  ? memset+0x1f/0x40
> [7.276008]  ? nla_parse+0x33/0x290
> [7.276016]  rtnl_newlink+0x954/0x1120
> [7.276030]  ? rtnl_link_unregister+0x250/0x250
> [7.276044]  ? is_bpf_text_address+0x5/0x60
> [7.276054]  ? lock_downgrade+0x5e0/0x5e0
> [7.276057]  ? lock_acquire+0x10b/0x2a0
> [7.276072]  ? deref_stack_reg+0xad/0xe0
> [7.276078]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
> [7.276085]  ? __kernel_text_address+0xe/0x30
> [7.276090]  ? unwind_get_return_address+0x5f/0xa0
> [7.276103]  

Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-03 Thread Michael S. Tsirkin
On Fri, Aug 03, 2018 at 10:41:41AM +0800, Jason Wang wrote:
> 
> 
> On 2018年08月03日 04:55, Michael S. Tsirkin wrote:
> > On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
> > > This patch series is the follow up on the discussions we had before about
> > > the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
> > > for virito devices (https://patchwork.kernel.org/patch/10417371/). There
> > > were suggestions about doing away with two different paths of transactions
> > > with the host/QEMU, first being the direct GPA and the other being the DMA
> > > API based translations.
> > > 
> > > First patch attempts to create a direct GPA mapping based DMA operations
> > > structure called 'virtio_direct_dma_ops' with exact same implementation
> > > of the direct GPA path which virtio core currently has but just wrapped in
> > > a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
> > > the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
> > > existing semantics. The second patch does exactly that inside the function
> > > virtio_finalize_features(). The third patch removes the default direct GPA
> > > path from virtio core forcing it to use DMA API callbacks for all devices.
> > > Now with that change, every device must have a DMA operations structure
> > > associated with it. The fourth patch adds an additional hook which gives
> > > the platform an opportunity to do yet another override if required. This
> > > platform hook can be used on POWER Ultravisor based protected guests to
> > > load up SWIOTLB DMA callbacks to do the required (as discussed previously
> > > in the above mentioned thread how host is allowed to access only parts of
> > > the guest GPA range) bounce buffering into the shared memory for all I/O
> > > scatter gather buffers to be consumed on the host side.
> > > 
> > > Please go through these patches and review whether this approach broadly
> > > makes sense. I will appreciate suggestions, inputs, comments regarding
> > > the patches or the approach in general. Thank you.
> > Jason did some work on profiling this. Unfortunately he reports
> > about 4% extra overhead from this switch on x86 with no vIOMMU.
> 
> The test is rather simple, just run pktgen (pktgen_sample01_simple.sh) in
> guest and measure PPS on tap on host.
> 
> Thanks

Could you supply host configuration involved please?

> > 
> > I expect he's writing up the data in more detail, but
> > just wanted to let you know this would be one more
> > thing to debug before we can just switch to DMA APIs.
> > 
> > 
> > > > Anshuman Khandual (4):
> > > >   virtio: Define virtio_direct_dma_ops structure
> > > >   virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
> > > >   virtio: Force virtio core to use DMA API callbacks for all virtio devices
> > > >   virtio: Add platform specific DMA API translation for virito devices
> > > 
> > > >   arch/powerpc/include/asm/dma-mapping.h |  6 +++
> > > >   arch/powerpc/platforms/pseries/iommu.c |  6 +++
> > > >   drivers/virtio/virtio.c                | 72 ++
> > > >   drivers/virtio/virtio_pci_common.h     |  3 ++
> > > >   drivers/virtio/virtio_ring.c           | 65 +-
> > > >   5 files changed, 89 insertions(+), 63 deletions(-)
> > > 
> > > -- 
> > > 2.9.3

Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-03 Thread Benjamin Herrenschmidt
On Fri, 2018-08-03 at 09:02 -0700, Christoph Hellwig wrote:
> On Fri, Aug 03, 2018 at 10:58:36AM -0500, Benjamin Herrenschmidt wrote:
> > On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> > > >   2- Make virtio use the DMA API with our custom platform-provided
> > > > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > > > running on a secure VM in our case.
> > > 
> > > And total NAK the custom platform-provided part of this.  We need
> > > a flag passed in from the hypervisor that the device needs all bus
> > > specific dma api treatment, and then just use the normal platform
> > > dma mapping setup.
> > 
> > Christoph, as I have explained already, we do NOT have a way to provide
> > such a flag as neither the hypervisor nor qemu knows anything about
> > this when the VM is created.
> 
> Well, if your setup is so fucked up I see no way to support it in Linux.
> 
> Let's end the discussion right now then.

You are saying something along the lines of "I don't like an
instruction in your ISA, let's not support your entire CPU architecture
in Linux".

Our setup is not fucked. It makes a LOT of sense and it's a very
sensible design. It's hitting a problem due to a corner case oddity in
virtio bypassing the MMU, we've worked around such corner cases many
times in the past without any problem, I fail to see what the problem
is here.

We aren't going to cancel years of HW and SW development for our
security infrastructure because you don't like a two-line hook into virtio
to make things work and aren't willing to even consider the options.

Ben.
 



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-03 Thread Christoph Hellwig
On Fri, Aug 03, 2018 at 10:58:36AM -0500, Benjamin Herrenschmidt wrote:
> On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> > >   2- Make virtio use the DMA API with our custom platform-provided
> > > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > > running on a secure VM in our case.
> > 
> > And total NAK the custom platform-provided part of this.  We need
> > a flag passed in from the hypervisor that the device needs all bus
> > specific dma api treatment, and then just use the normal platform
> > dma mapping setup.
> 
> Christoph, as I have explained already, we do NOT have a way to provide
> such a flag as neither the hypervisor nor qemu knows anything about
> this when the VM is created.

Well, if your setup is so fucked up I see no way to support it in Linux.

Let's end the discussion right now then.


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-03 Thread Benjamin Herrenschmidt
On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> >   2- Make virtio use the DMA API with our custom platform-provided
> > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > running on a secure VM in our case.
> 
> And total NAK the custom platform-provided part of this.  We need
> a flag passed in from the hypervisor that the device needs all bus
> specific dma api treatment, and then just use the normal platform
> dma mapping setup.

Christoph, as I have explained already, we do NOT have a way to provide
such a flag as neither the hypervisor nor qemu knows anything about
this when the VM is created.

>  To get swiotlb you'll need to then use the DT/ACPI
> dma-range property to limit the addressable range, and a swiotlb
> capable platform will use swiotlb automatically.

This cannot be done as you describe it.

The VM is created as a *normal* VM. The DT stuff is generated by qemu
at a point where it has *no idea* that the VM will later become secure
and thus will have to restrict which pages can be used for "DMA".

The VM will *at runtime* turn itself into a secure VM via interactions
with the security HW and the Ultravisor layer (which sits below the
HV). This happens way after the DT has been created and consumed, the
qemu devices instantiated etc...

Only the guest kernel knows because it initiates the transition. When
that happens, the virtio devices have already been used by the guest
firmware, bootloader, possibly another kernel that kexeced the "secure"
one, etc... 

So instead of running around saying NAK NAK NAK, please explain how we
can solve that differently.

Ben.




Re: [PATCH] crypto: virtio: Replace GFP_ATOMIC with GFP_KERNEL in __virtio_crypto_ablkcipher_do_req()

2018-08-03 Thread Herbert Xu
On Mon, Jul 23, 2018 at 04:43:46PM +0800, Jia-Ju Bai wrote:
> __virtio_crypto_ablkcipher_do_req() is never called in atomic context.
> 
> __virtio_crypto_ablkcipher_do_req() is only called by 
> virtio_crypto_ablkcipher_crypt_req(), which is only called by 
> virtcrypto_find_vqs() that is never called in atomic context.
> 
> __virtio_crypto_ablkcipher_do_req() calls kzalloc_node() with GFP_ATOMIC,
> which is not necessary.
> GFP_ATOMIC can be replaced with GFP_KERNEL.
> 
> This is found by a static analysis tool named DCNS written by myself.
> I also manually check the kernel code before reporting it.
> 
> Signed-off-by: Jia-Ju Bai 

Patch applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-03 Thread Tetsuo Handa
On 2018/08/03 17:32, Wei Wang wrote:
> +static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
> +{
> + vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
> + vb->shrinker.count_objects = virtio_balloon_shrinker_count;
> + vb->shrinker.batch = 0;
> + vb->shrinker.seeks = DEFAULT_SEEKS;

Why is the flags field not set? If vb is allocated by kmalloc(GFP_KERNEL)
and is never zeroed anywhere, KASAN would complain about it.

> +
> + return register_shrinker(&vb->shrinker);
> +}


[PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-03 Thread Wei Wang
The OOM notifier is being deprecated for the following reasons:
- As a callout from the oom context, it is too subtle and easy to
  generate bugs and corner cases which are hard to track;
- It is called too late (after the reclaiming has been performed).
  Drivers with a large amount of reclaimable memory are expected to
  release it at an early stage of memory pressure;
- The notifier callback isn't aware of oom constraints;
Link: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure. The balloon pages are
given back to mm adaptively by returning the number of pages that the
reclaimer is asking for (i.e. sc->nr_to_scan).

Currently the max possible value of sc->nr_to_scan passed to the balloon
shrinker is SHRINK_BATCH, which is 128. This is smaller than the
limitation that only VIRTIO_BALLOON_ARRAY_PFNS_MAX (256) pages can be
returned via one invocation of leak_balloon. But this patch still
considers the case that SHRINK_BATCH or shrinker->batch could be changed
to a value larger than VIRTIO_BALLOON_ARRAY_PFNS_MAX, which will need to
do multiple invocations of leak_balloon.
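
The multiple-invocation case described above can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the patch's code: sketch_leak() is a hypothetical stand-in for leak_balloon() that frees at most 256 pages per call, and struct sketch_balloon only tracks a page count.

```c
#include <assert.h>

/* Per-call cap, taken from the VIRTIO_BALLOON_ARRAY_PFNS_MAX value
 * quoted in the commit message. */
#define SKETCH_ARRAY_PFNS_MAX 256UL

/* Hypothetical stand-in for struct virtio_balloon. */
struct sketch_balloon {
	unsigned long num_pages;	/* pages currently held in the balloon */
};

/* Hypothetical stand-in for leak_balloon(): releases at most
 * SKETCH_ARRAY_PFNS_MAX pages and returns how many were released. */
static unsigned long sketch_leak(struct sketch_balloon *vb, unsigned long num)
{
	if (num > SKETCH_ARRAY_PFNS_MAX)
		num = SKETCH_ARRAY_PFNS_MAX;
	if (num > vb->num_pages)
		num = vb->num_pages;
	vb->num_pages -= num;
	return num;
}

/* Keep calling the capped helper until the request is satisfied or the
 * balloon is empty; returns the total number of pages released. */
static unsigned long sketch_shrink(struct sketch_balloon *vb,
				   unsigned long pages_to_free)
{
	unsigned long freed = 0;

	while (vb->num_pages && pages_to_free) {
		unsigned long n = sketch_leak(vb, pages_to_free);

		assert(n > 0);	/* each iteration must make progress */
		freed += n;
		pages_to_free -= n;
	}
	return freed;
}
```

With a balloon of 1000 pages, a request for 600 pages takes three calls to the capped helper (256 + 256 + 88); a request of 128 pages, today's SHRINK_BATCH, takes one.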

Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been used
to release balloon pages on OOM. We continue to use this feature bit for
the shrinker, so the shrinker is only registered when this feature bit
has been negotiated with host.

Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 111 ++--
 1 file changed, 60 insertions(+), 51 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8100e77..612a359 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -27,7 +27,6 @@
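 #include <linux/slab.h>
 #include <linux/module.h>
 #include <linux/balloon_compaction.h>
-#include <linux/oom.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
 #include <linux/mount.h>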
@@ -40,13 +39,8 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
-#define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
-static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
-module_param(oom_pages, int, S_IRUSR | S_IWUSR);
-MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
-
 #ifdef CONFIG_BALLOON_COMPACTION
 static struct vfsmount *balloon_mnt;
 #endif
@@ -86,8 +80,8 @@ struct virtio_balloon {
/* Memory statistics */
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
-   /* To register callback in oom notifier call chain */
-   struct notifier_block nb;
+   /* To register a shrinker to shrink memory upon memory pressure */
+   struct shrinker shrinker;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -365,38 +359,6 @@ static void update_balloon_size(struct virtio_balloon *vb)
  &actual);
 }
 
-/*
- * virtballoon_oom_notify - release pages when system is under severe
- * memory pressure (called from out_of_memory())
- * @self : notifier block struct
- * @dummy: not used
- * @parm : returned - number of freed pages
- *
- * The balancing of memory by use of the virtio balloon should not cause
- * the termination of processes while there are pages in the balloon.
- * If virtio balloon manages to release some memory, it will make the
- * system return and retry the allocation that forced the OOM killer
- * to run.
- */
-static int virtballoon_oom_notify(struct notifier_block *self,
- unsigned long dummy, void *parm)
-{
-   struct virtio_balloon *vb;
-   unsigned long *freed;
-   unsigned num_freed_pages;
-
-   vb = container_of(self, struct virtio_balloon, nb);
-   if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
-   return NOTIFY_OK;
-
-   freed = parm;
-   num_freed_pages = leak_balloon(vb, oom_pages);
-   update_balloon_size(vb);
-   *freed += num_freed_pages;
-
-   return NOTIFY_OK;
-}
-
 static void update_balloon_stats_func(struct work_struct *work)
 {
struct virtio_balloon *vb;
@@ -550,6 +512,53 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+   unsigned long pages_to_free, pages_freed = 0;
+   struct virtio_balloon *vb = container_of(shrinker,
+   struct virtio_balloon, shrinker);
+
+   pages_to_free = sc->nr_to_scan * VIRTIO_BALLOON_PAGES_PER_PAGE;
+
+   /*
+* One invocation of leak_balloon can deflate at most
+* VIRTIO_BALLOON_ARRAY_PFNS_MAX balloon pages, so we call it
+* multiple times to deflate pages till reaching pages_to_free.
+*/
+   while (vb->num_pages && pages_to_free) {
+   

[PATCH v3 0/2] virtio-balloon: some improvements

2018-08-03 Thread Wei Wang
This series is split from the "Virtio-balloon: support free page
reporting" series to make some improvements.

ChangeLog:
v2->v3:
- shrink the balloon pages according to the amount requested by the
  reclaimer, instead of using a user specified number;
v1->v2:
- register the shrinker when VIRTIO_BALLOON_F_DEFLATE_ON_OOM is
  negotiated.

Wei Wang (2):
  virtio-balloon: remove BUG() in init_vqs
  virtio_balloon: replace oom notifier with shrinker

 drivers/virtio/virtio_balloon.c | 121 ++--
 1 file changed, 67 insertions(+), 54 deletions(-)

-- 
2.7.4



Re: [PATCH net-next] vhost: switch to use new message format

2018-08-03 Thread Michael S. Tsirkin
On Fri, Aug 03, 2018 at 03:04:51PM +0800, Jason Wang wrote:
> We used to have a message like:
> 
> struct vhost_msg {
>   int type;
>   union {
>   struct vhost_iotlb_msg iotlb;
>   __u8 padding[64];
>   };
> };
> 
> Unfortunately, there will be a 32-bit hole on 64-bit machines because
> of the alignment. This leads to different formats between the 32-bit
> API and the 64-bit API. What's more, it will break 32-bit programs
> running on 64-bit machines.
> 
> So fixing this by introducing a new message type with an explicit
> 32bit reserved field after type like:
> 
> struct vhost_msg_v2 {
>   int type;
>   __u32 reserved;
>   union {
>   struct vhost_iotlb_msg iotlb;
>   __u8 padding[64];
>   };
> };
> 
> We will have a consistent ABI after switching to use this. To enable
> this capability, introduce a new ioctl (VHOST_SET_BACKEND_FEATURES) for
> userspace to enable this feature (VHOST_BACKEND_F_IOTLB_MSG_V2).
> 
> Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/net.c| 30 
>  drivers/vhost/vhost.c  | 71 
> ++
>  drivers/vhost/vhost.h  | 11 ++-
>  include/uapi/linux/vhost.h | 18 
>  4 files changed, 111 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 367d802..4e656f8 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -78,6 +78,10 @@ enum {
>  };
>  
>  enum {
> + VHOST_NET_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2)
> +};
> +
> +enum {
>   VHOST_NET_VQ_RX = 0,
>   VHOST_NET_VQ_TX = 1,
>   VHOST_NET_VQ_MAX = 2,
> @@ -1399,6 +1403,21 @@ static long vhost_net_reset_owner(struct vhost_net *n)
>   return err;
>  }
>  
> +static int vhost_net_set_backend_features(struct vhost_net *n, u64 features)
> +{
> + int i;
> +
> + mutex_lock(&n->dev.mutex);
> + for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
> + mutex_lock(&n->vqs[i].vq.mutex);
> + n->vqs[i].vq.acked_backend_features = features;
> + mutex_unlock(&n->vqs[i].vq.mutex);
> + }
> + mutex_unlock(&n->dev.mutex);
> +
> + return 0;
> +}
> +
>  static int vhost_net_set_features(struct vhost_net *n, u64 features)
>  {
>   size_t vhost_hlen, sock_hlen, hdr_len;
> @@ -1489,6 +1508,17 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
>   if (features & ~VHOST_NET_FEATURES)
>   return -EOPNOTSUPP;
>   return vhost_net_set_features(n, features);
> + case VHOST_GET_BACKEND_FEATURES:
> + features = VHOST_NET_BACKEND_FEATURES;
> + if (copy_to_user(featurep, &features, sizeof(features)))
> + return -EFAULT;
> + return 0;
> + case VHOST_SET_BACKEND_FEATURES:
> + if (copy_from_user(&features, featurep, sizeof(features)))
> + return -EFAULT;
> + if (features & ~VHOST_NET_BACKEND_FEATURES)
> + return -EOPNOTSUPP;
> + return vhost_net_set_backend_features(n, features);
>   case VHOST_RESET_OWNER:
>   return vhost_net_reset_owner(n);
>   case VHOST_SET_OWNER:
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index a502f1a..6f6c42d 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -315,6 +315,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>   vq->log_addr = -1ull;
>   vq->private_data = NULL;
>   vq->acked_features = 0;
> + vq->acked_backend_features = 0;
>   vq->log_base = NULL;
>   vq->error_ctx = NULL;
>   vq->kick = NULL;
> @@ -1027,28 +1028,40 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>  ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
>struct iov_iter *from)
>  {
> - struct vhost_msg_node node;
> - unsigned size = sizeof(struct vhost_msg);
> - size_t ret;
> - int err;
> + struct vhost_iotlb_msg msg;
> + size_t offset;
> + int type, ret;
>  
> - if (iov_iter_count(from) < size)
> - return 0;
> - ret = copy_from_iter(&node.msg, size, from);
> - if (ret != size)
> + ret = copy_from_iter(&type, sizeof(type), from);
> + if (ret != sizeof(type))
>   goto done;
>  
> - switch (node.msg.type) {
> + switch (type) {
>   case VHOST_IOTLB_MSG:
> - err = vhost_process_iotlb_msg(dev, &node.msg.iotlb);
> - if (err)
> - ret = err;
> + /* There may be a hole after type for V1 message type,
> +  * so skip it here.
> +  */
> + offset = offsetof(struct vhost_msg, iotlb) - sizeof(int);
> + break;
> + case VHOST_IOTLB_MSG_V2:
> + offset = sizeof(__u32);
>   break;
>   default:
>   ret = -EINVAL;
> - break;
> +   

Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-03 Thread Christoph Hellwig
On Thu, Aug 02, 2018 at 11:53:08PM +0300, Michael S. Tsirkin wrote:
> > We don't need cache flushing tricks.
> 
> You don't but do real devices on same platform need them?

IBM's Power platforms are always cache coherent.  Some powerpc
platforms do not have cache-coherent DMA, but I guess this scheme isn't
intended for them.


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-03 Thread Christoph Hellwig
On Thu, Aug 02, 2018 at 04:13:09PM -0500, Benjamin Herrenschmidt wrote:
> So let's differentiate the two problems of having an IOMMU (real or
> emulated) which indeed adds overhead etc... and using the DMA API.
> 
> At the moment, virtio does this all over the place:
> 
>   if (use_dma_api)
>   dma_map/alloc_something(...)
>   else
>   use_pa
> 
> The idea of the patch set is to do two, somewhat orthogonal, changes
> that together achieve what we want. Let me know where you think there
> is "a bunch of issues" because I'm missing it:
> 
>  1- Replace the above if/else constructs with just calling the DMA API,
> and have virtio, at initialization, hookup its own dma_ops that just
> "return pa" (roughly) when the IOMMU stuff isn't used.
> 
> This adds an indirect function call to the path that previously didn't
> have one (the else case above). Is that a significant/measurable
> overhead ?

If you call it often enough it does:

https://www.spinics.net/lists/netdev/msg495413.html
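
To make the trade-off under discussion concrete, here is a minimal user-space sketch in C contrasting the two styles: the if/else on a use_dma_api flag versus an unconditional call through an ops table. It is not kernel code. Every sketch_* name is hypothetical, the identity hook plays the role of "return pa", and the offset hook merely stands in for any translating mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sketch_dma_addr_t;

/* Hypothetical ops table; the real dma_map_ops has many more hooks. */
struct sketch_dma_ops {
	sketch_dma_addr_t (*map)(uint64_t pa);
};

/* "return pa" hook for the case where no translation is needed. */
static sketch_dma_addr_t sketch_map_identity(uint64_t pa)
{
	return pa;
}

/* Illustrative translating hook: a fixed offset, purely an assumption. */
static sketch_dma_addr_t sketch_map_offset(uint64_t pa)
{
	return pa + 0x1000;
}

static const struct sketch_dma_ops sketch_identity_ops = {
	.map = sketch_map_identity,
};

static const struct sketch_dma_ops sketch_offset_ops = {
	.map = sketch_map_offset,
};

struct sketch_device {
	bool use_dma_api;
	const struct sketch_dma_ops *ops;
};

/* Style 1: branch at every call site; the else path has no indirect call. */
static sketch_dma_addr_t sketch_map_branchy(const struct sketch_device *d,
					    uint64_t pa)
{
	if (d->use_dma_api)
		return d->ops->map(pa);
	return pa;
}

/* Style 2: always go through the ops table, which costs one indirect
 * call even when the hook is the identity mapping. */
static sketch_dma_addr_t sketch_map_via_ops(const struct sketch_device *d,
					    uint64_t pa)
{
	assert(d->ops && d->ops->map);
	return d->ops->map(pa);
}
```

Both styles return the same address for a given device; the difference is only that the second always pays for the indirect call, which is the cost being debated in this thread.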

>  2- Make virtio use the DMA API with our custom platform-provided
> swiotlb callbacks when needed, that is when not using IOMMU *and*
> running on a secure VM in our case.

And total NAK the custom platform-provided part of this.  We need
a flag passed in from the hypervisor that the device needs all bus
specific dma api treatment, and then just use the normal platform
dma mapping setup.  To get swiotlb you'll need to then use the DT/ACPI
dma-range property to limit the addressable range, and a swiotlb
capable platform will use swiotlb automatically.


[PATCH net-next] vhost: switch to use new message format

2018-08-03 Thread Jason Wang
We used to have a message like:

struct vhost_msg {
	int type;
	union {
		struct vhost_iotlb_msg iotlb;
		__u8 padding[64];
	};
};
Unfortunately, there will be a 32-bit hole on 64-bit machines because
of the alignment. This leads to different formats between the 32-bit
API and the 64-bit API. What's more, it will break 32-bit programs
running on 64-bit machines.

So fixing this by introducing a new message type with an explicit
32bit reserved field after type like:

struct vhost_msg_v2 {
	int type;
	__u32 reserved;
	union {
		struct vhost_iotlb_msg iotlb;
		__u8 padding[64];
	};
};
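
The layout difference between the two structures can be checked with a small user-space sketch. This is an illustration only: the sketch_* types are local stand-ins rather than the uapi definitions, the payload's field list is an assumption chosen so that it contains 64-bit members, and a 4-byte int is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct vhost_iotlb_msg. The 64-bit members are what make
 * the payload 8-byte aligned on typical 64-bit ABIs. */
struct sketch_iotlb_msg {
	uint64_t iova;
	uint64_t size;
	uint64_t uaddr;
	uint8_t perm;
	uint8_t type;
};

/* V1 layout: the compiler is free to pad between 'type' and the union. */
struct sketch_msg_v1 {
	int type;
	union {
		struct sketch_iotlb_msg iotlb;
		uint8_t padding[64];
	};
};

/* V2 layout: the hole is spelled out as an explicit reserved field. */
struct sketch_msg_v2 {
	int type;
	uint32_t reserved;
	union {
		struct sketch_iotlb_msg iotlb;
		uint8_t padding[64];
	};
};

/* With the explicit field, the payload offset no longer depends on the
 * ABI's alignment rule for 64-bit integers. */
static_assert(offsetof(struct sketch_msg_v2, iotlb) == 8,
	      "V2 payload offset is fixed");

/* Bytes between the end of 'type' and the start of the payload. */
static size_t sketch_v1_skip(void)
{
	return offsetof(struct sketch_msg_v1, iotlb) - sizeof(int);
}

static size_t sketch_v2_skip(void)
{
	return offsetof(struct sketch_msg_v2, iotlb) - sizeof(int);
}
```

On an ABI that aligns 64-bit integers to 8 bytes, sketch_v1_skip() is 4; on one that aligns them to 4 bytes (32-bit x86, for example), it is 0. sketch_v2_skip() is 4 in both cases, which is the property the new format relies on.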

We will have a consistent ABI after switching to use this. To enable
this capability, introduce a new ioctl (VHOST_SET_BACKEND_FEATURES) for
userspace to enable this feature (VHOST_BACKEND_F_IOTLB_MSG_V2).
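
The get/set negotiation described above boils down to a mask check: userspace reads the supported bits, then acknowledges a subset, and anything outside the supported set is rejected. A minimal user-space sketch follows, assuming a hypothetical bit number and sketch_* helpers that mimic the ioctl handlers without issuing any ioctl.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit number, chosen for illustration only. */
#define SKETCH_F_IOTLB_MSG_V2	1
#define SKETCH_BACKEND_FEATURES	(1ULL << SKETCH_F_IOTLB_MSG_V2)

/* Stand-in for the per-virtqueue state that stores the acked bits. */
struct sketch_backend {
	uint64_t acked_backend_features;
};

/* Mimics the "get" side: report what the backend supports. */
static uint64_t sketch_get_features(void)
{
	return SKETCH_BACKEND_FEATURES;
}

/* Mimics the "set" side: refuse unknown bits, otherwise record them. */
static int sketch_set_features(struct sketch_backend *b, uint64_t features)
{
	if (features & ~SKETCH_BACKEND_FEATURES)
		return -EOPNOTSUPP;
	b->acked_backend_features = features;
	return 0;
}

/* True once the V2 message format has been acknowledged. */
static int sketch_uses_v2(const struct sketch_backend *b)
{
	return (b->acked_backend_features & SKETCH_BACKEND_FEATURES) != 0;
}
```

A caller that never acknowledges the bit keeps the old message format, so existing userspace is unaffected; a rejected request leaves the previously acknowledged bits untouched.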

Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c| 30 
 drivers/vhost/vhost.c  | 71 ++
 drivers/vhost/vhost.h  | 11 ++-
 include/uapi/linux/vhost.h | 18 
 4 files changed, 111 insertions(+), 19 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 367d802..4e656f8 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -78,6 +78,10 @@ enum {
 };
 
 enum {
+   VHOST_NET_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2)
+};
+
+enum {
VHOST_NET_VQ_RX = 0,
VHOST_NET_VQ_TX = 1,
VHOST_NET_VQ_MAX = 2,
@@ -1399,6 +1403,21 @@ static long vhost_net_reset_owner(struct vhost_net *n)
return err;
 }
 
+static int vhost_net_set_backend_features(struct vhost_net *n, u64 features)
+{
+   int i;
+
+   mutex_lock(&n->dev.mutex);
+   for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
+   mutex_lock(&n->vqs[i].vq.mutex);
+   n->vqs[i].vq.acked_backend_features = features;
+   mutex_unlock(&n->vqs[i].vq.mutex);
+   }
+   mutex_unlock(&n->dev.mutex);
+
+   return 0;
+}
+
 static int vhost_net_set_features(struct vhost_net *n, u64 features)
 {
size_t vhost_hlen, sock_hlen, hdr_len;
@@ -1489,6 +1508,17 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
if (features & ~VHOST_NET_FEATURES)
return -EOPNOTSUPP;
return vhost_net_set_features(n, features);
+   case VHOST_GET_BACKEND_FEATURES:
+   features = VHOST_NET_BACKEND_FEATURES;
+   if (copy_to_user(featurep, &features, sizeof(features)))
+   return -EFAULT;
+   return 0;
+   case VHOST_SET_BACKEND_FEATURES:
+   if (copy_from_user(&features, featurep, sizeof(features)))
+   return -EFAULT;
+   if (features & ~VHOST_NET_BACKEND_FEATURES)
+   return -EOPNOTSUPP;
+   return vhost_net_set_backend_features(n, features);
case VHOST_RESET_OWNER:
return vhost_net_reset_owner(n);
case VHOST_SET_OWNER:
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a502f1a..6f6c42d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -315,6 +315,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->log_addr = -1ull;
vq->private_data = NULL;
vq->acked_features = 0;
+   vq->acked_backend_features = 0;
vq->log_base = NULL;
vq->error_ctx = NULL;
vq->kick = NULL;
@@ -1027,28 +1028,40 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
 struct iov_iter *from)
 {
-   struct vhost_msg_node node;
-   unsigned size = sizeof(struct vhost_msg);
-   size_t ret;
-   int err;
+   struct vhost_iotlb_msg msg;
+   size_t offset;
+   int type, ret;
 
-   if (iov_iter_count(from) < size)
-   return 0;
-   ret = copy_from_iter(&node.msg, size, from);
-   if (ret != size)
+   ret = copy_from_iter(&type, sizeof(type), from);
+   if (ret != sizeof(type))
goto done;
 
-   switch (node.msg.type) {
+   switch (type) {
case VHOST_IOTLB_MSG:
-   err = vhost_process_iotlb_msg(dev, &node.msg.iotlb);
-   if (err)
-   ret = err;
+   /* There may be a hole after type for V1 message type,
+* so skip it here.
+*/
+   offset = offsetof(struct vhost_msg, iotlb) - sizeof(int);
+   break;
+   case VHOST_IOTLB_MSG_V2:
+   offset = sizeof(__u32);
break;
default:
ret = -EINVAL;
-   break;
+   goto done;
+   }
+
+   iov_iter_advance(from, offset);
+   ret = copy_from_iter(&msg, sizeof(msg), from);
+   if (ret != sizeof(msg))
+   goto done;
+   if