[PATCH RFC] vhost: basic device IOTLB support

2015-12-30 Thread Jason Wang
This patch tries to implement a device IOTLB for vhost. This could be
used in co-operation with a userspace (qemu) implementation of an
IOMMU for a secure DMA environment in the guest.

The idea is simple. When vhost meets an IOTLB miss, it requests the
assistance of userspace to do the translation. This is done through:

- Fill the translation request at a preset userspace address (this
  address is set through the VHOST_SET_IOTLB_REQUEST_ENTRY ioctl).
- Notify userspace through an eventfd (this eventfd is set through the
  VHOST_SET_IOTLB_FD ioctl).

When userspace finishes the translation, it updates the vhost IOTLB
through the VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
snooping IOMMU IOTLB invalidations and using VHOST_UPDATE_IOTLB to
invalidate any matching entry in vhost.
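
As a minimal sketch (not the patch itself), the kernel-side miss path
could look roughly like this; iotlb_request, iotlb_ctx, iotlb_req_mutex
and iotlb_completion come from the diff below, while the iova field and
the VHOST_IOTLB_MISS type are assumptions:

  /* Hedged sketch: ask userspace to translate one IOVA and wait. */
  static int vhost_iotlb_miss(struct vhost_dev *dev, u64 iova)
  {
          struct vhost_iotlb_entry req = {};
          int ret = 0;

          req.iova = iova & PAGE_MASK;       /* assumed field */
          req.flags.type = VHOST_IOTLB_MISS; /* assumed type */

          mutex_lock(&dev->iotlb_req_mutex);
          /* 1. Fill the preset address from VHOST_SET_IOTLB_REQUEST_ENTRY. */
          if (copy_to_user(dev->iotlb_request, &req, sizeof(req))) {
                  ret = -EFAULT;
          } else {
                  /* 2. Kick userspace through the VHOST_SET_IOTLB_FD eventfd. */
                  eventfd_signal(dev->iotlb_ctx, 1);
                  /* 3. The VHOST_UPDATE_IOTLB handler signals this completion. */
                  wait_for_completion(&dev->iotlb_completion);
          }
          mutex_unlock(&dev->iotlb_req_mutex);
          return ret;
  }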

For simplicity, the IOTLB is implemented as a simple hash array. The
index is calculated from the IOVA page frame number, which only works
at PAGE_SIZE granularity.
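
A hedged sketch of the lookup against that array, reusing
vhost_iotlb_hash() and dev->iotlb[] from the diff below (the iova field
and the VHOST_IOTLB_VALID flag are assumptions mirroring
VHOST_IOTLB_INVALID):

  /* Caller is expected to hold dev->iotlb_lock. */
  static struct vhost_iotlb_entry *vhost_iotlb_lookup(struct vhost_dev *dev,
                                                      u64 iova)
  {
          struct vhost_iotlb_entry *e = &dev->iotlb[vhost_iotlb_hash(iova)];

          if (e->flags.valid == VHOST_IOTLB_VALID &&   /* assumed flag */
              e->iova == (iova & PAGE_MASK))           /* assumed field */
                  return e;
          return NULL; /* miss: fall back to the userspace request path */
  }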

A qemu implementation (for reference) is available at:
g...@github.com:jasowang/qemu.git iommu
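
On the userspace side, the reply loop might look roughly like the
hedged sketch below; only the ioctl names come from this patch, while
the entry layout, the VHOST_IOTLB_VALID flag and translate_iova() are
assumptions:

  /* Hedged qemu-side sketch: answer one miss per eventfd tick. req was
   * registered via VHOST_SET_IOTLB_REQUEST_ENTRY and iotlb_efd via
   * VHOST_SET_IOTLB_FD. */
  static void handle_iotlb_misses(int vhost_fd, int iotlb_efd,
                                  struct vhost_iotlb_entry *req)
  {
          uint64_t cnt;

          while (read(iotlb_efd, &cnt, sizeof(cnt)) == sizeof(cnt)) {
                  /* translate_iova() stands in for the emulated IOMMU walk. */
                  req->userspace_addr = translate_iova(req->iova);
                  req->flags.valid = VHOST_IOTLB_VALID;
                  ioctl(vhost_fd, VHOST_UPDATE_IOTLB, req); /* wakes vhost */
          }
  }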

TODO & Known issues:

- read/write permission validation is not implemented.
- no feature negotiation.
- VHOST_SET_MEM_TABLE is not reused (maybe there's a chance).
- works at PAGE_SIZE level only; large mappings are not supported.
- better data structure for the IOTLB instead of a simple hash array.
- better API, e.g. using mmap() instead of a preset userspace address.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c|   2 +-
 drivers/vhost/vhost.c  | 190 -
 drivers/vhost/vhost.h  |  13 
 include/uapi/linux/vhost.h |  26 +++
 4 files changed, 229 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..a172be9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1083,7 +1083,7 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
r = vhost_dev_ioctl(&n->dev, ioctl, argp);
if (r == -ENOIOCTLCMD)
r = vhost_vring_ioctl(&n->dev, ioctl, argp);
-   else
+   else if (ioctl != VHOST_UPDATE_IOTLB)
vhost_net_flush(n);
mutex_unlock(&n->dev.mutex);
return r;
return r;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..729fe05 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -113,6 +113,11 @@ static void vhost_init_is_le(struct vhost_virtqueue *vq)
 }
 #endif /* CONFIG_VHOST_CROSS_ENDIAN_LEGACY */
 
+static inline int vhost_iotlb_hash(u64 iova)
+{
+   return (iova >> PAGE_SHIFT) & (VHOST_IOTLB_SIZE - 1);
+}
+
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
poll_table *pt)
 {
@@ -384,8 +389,14 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->memory = NULL;
dev->mm = NULL;
spin_lock_init(&dev->work_lock);
+   spin_lock_init(&dev->iotlb_lock);
+   mutex_init(&dev->iotlb_req_mutex);
INIT_LIST_HEAD(&dev->work_list);
dev->worker = NULL;
+   dev->iotlb_request = NULL;
+   dev->iotlb_ctx = NULL;
+   dev->iotlb_file = NULL;
+   dev->pending_request.flags.type = VHOST_IOTLB_INVALIDATE;
 
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
@@ -393,12 +404,17 @@ void vhost_dev_init(struct vhost_dev *dev,
vq->indirect = NULL;
vq->heads = NULL;
vq->dev = dev;
+   vq->iotlb_request = NULL;
mutex_init(&vq->mutex);
vhost_vq_reset(dev, vq);
if (vq->handle_kick)
vhost_poll_init(&vq->poll, vq->handle_kick,
POLLIN, dev);
}
+
+   init_completion(&dev->iotlb_completion);
+   for (i = 0; i < VHOST_IOTLB_SIZE; i++)
+   dev->iotlb[i].flags.valid = VHOST_IOTLB_INVALID;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
 
@@ -940,9 +956,10 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 {
struct file *eventfp, *filep = NULL;
struct eventfd_ctx *ctx = NULL;
+   struct vhost_iotlb_entry entry;
u64 p;
long r;
-   int i, fd;
+   int index, i, fd;
 
/* If you are not the owner, you can become one */
if (ioctl == VHOST_SET_OWNER) {
@@ -1008,6 +1025,80 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
if (filep)
fput(filep);
break;
+   case VHOST_SET_IOTLB_FD:
+   r = get_user(fd, (int __user *)argp);
+   if (r < 0)
+   break;
+   eventfp = fd == -1 ? NULL : eventfd_fget(fd);
+   if (IS_ERR(eventfp)) {
+   r = PTR_ERR(eventfp);
+   break;
+   }
+   if (eventfp != d->iotlb_file) {
+

[PATCH] kvm: x86: fix comment about {mmu,nested_mmu}.gva_to_gpa

2015-12-30 Thread David Matlack
The comment had the meaning of mmu.gva_to_gpa and nested_mmu.gva_to_gpa
swapped. Fix that, and also add some details describing how each translation
works.
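
For illustration only (the helper names below are hypothetical, not
KVM API), the two paths the new comment describes:

  /* arch.mmu.gva_to_gpa: despite the name, the input is an L2 *gpa*;
   * a single walk of L1's nested page tables (e.g. EPT12). */
  gpa_t translate_l2_gpa(gpa_t l2_gpa)
  {
          return walk_nested_tables(l2_gpa);
  }

  /* arch.nested_mmu.gva_to_gpa: the input is an L2 *gva*; L2's page
   * tables are the first level of translation, and every table access
   * in that walk is itself translated through EPT12 as the second
   * level. */
  gpa_t translate_l2_gva(gva_t l2_gva)
  {
          gpa_t l2_gpa = walk_l2_page_tables(l2_gva);
          return walk_nested_tables(l2_gpa);
  }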

Signed-off-by: David Matlack 
---
 arch/x86/kvm/mmu.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e7c2c14..098a9c2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4058,10 +4058,12 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
g_context->inject_page_fault = kvm_inject_page_fault;
 
/*
-* Note that arch.mmu.gva_to_gpa translates l2_gva to l1_gpa. The
-* translation of l2_gpa to l1_gpa addresses is done using the
-* arch.nested_mmu.gva_to_gpa function. Basically the gva_to_gpa
-* functions between mmu and nested_mmu are swapped.
+* Note that arch.mmu.gva_to_gpa translates l2_gpa to l1_gpa using
+* L1's nested page tables (e.g. EPT12). The nested translation
+* of l2_gva to l1_gpa is done by arch.nested_mmu.gva_to_gpa using
+* L2's page tables as the first level of translation and L1's
+* nested page tables as the second level of translation. Basically
+* the gva_to_gpa functions between mmu and nested_mmu are swapped.
 */
if (!is_paging(vcpu)) {
g_context->nx = false;
-- 
2.6.0.rc2.230.g3dd15c0



Re: How to reserve guest physical region for ACPI

2015-12-30 Thread Michael S. Tsirkin
On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:
> On Mon, 28 Dec 2015 14:50:15 +0200
> "Michael S. Tsirkin"  wrote:
> 
> > On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:
> > > 
> > > Hi Michael, Paolo,
> > > 
> > > Now it is time to return to the challenge of how to reserve a guest
> > > physical region internally used by ACPI.
> > > 
> > > Igor suggested that:
> > > | An alternative place to allocate reserve from could be high memory.
> > > | For pc we have "reserved-memory-end" which currently makes sure
> > > | that hotpluggable memory range isn't used by firmware
> > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)
> > 
> > I don't want to tie things to reserved-memory-end because this
> > does not scale: next time we need to reserve memory,
> > we'll need to find yet another way to figure out what is where.
> Could you elaborate a bit more on the problem you're seeing?
> 
> To me it looks like it scales rather well.
> For example, let's imagine that we are adding a device that has some
> on-device memory that should be mapped into GPA. The code to do so
> would look like:
> 
>   pc_machine_device_plug_cb(dev)
>   {
>       ...
>       if (dev == OUR_NEW_DEVICE_TYPE) {
>           memory_region_add_subregion(as, current_reserved_end, &dev->mr);
>           set_new_reserved_end(current_reserved_end +
>                                memory_region_size(&dev->mr));
>       }
>   }
> 
> we can practically add any number of new devices that way.

Yes, but we'll have to build a host-side allocator for these, and
that's nasty. We'll also have to maintain these addresses indefinitely
(at least per machine version) as they are guest visible. Not only
that, there's no way for the guest to know if we move things around,
so basically we'll never be able to change the addresses.


>  
> > I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> > support 64 bit RAM instead (and maybe a way to allocate and
> > zero-initialize buffer without loading it through fwcfg), this way bios
> > does the allocation, and addresses can be patched into acpi.
> and then the guest side needs to parse/execute some AML that would
> initialize the QEMU side so it would know where to write data.

Well not really - we can put it in a data table, by itself
so it's easy to find.

AML is only needed if access from ACPI is desired.


> bios-linker-loader is a great interface for initializing some
> guest-owned data and linking it together, but I think it adds
> unnecessary complexity and is misused if it's used to handle
> device-owned data/on-device memory, as in this case and the VMGID case.

I want a generic interface for the guest to enumerate these things.
The linker seems quite reasonable, but if you see a reason why it
won't do, or want to propose a better interface, fine.

PCI would do, too - though windows guys had concerns about
returning PCI BARs from ACPI.


> There was an RFC on the list to make the BIOS boot from NVDIMM,
> already doing some ACPI table lookup/parsing. Now if they were forced
> to also parse and execute AML to initialize QEMU with a guest-allocated
> address, that would complicate them quite a bit.

If they just need to find a table by name, it won't be
too bad, will it?

> While with the NVDIMM control memory region mapped directly by QEMU,
> the respective patches don't need to initialize QEMU in any way; all
> they would need is to read the necessary data from the control region.
> 
> Also, using bios-linker-loader takes away some usable RAM from the
> guest, and in the end that doesn't scale: the more devices I add, the
> less usable RAM is left for the guest OS, while all the device needs
> is a piece of GPA address space that would belong to it.

I don't get this comment. I don't think it's MMIO that is wanted.
If it's backed by qemu virtual memory then it's RAM.

> > 
> > See patch at the bottom that might be handy.
> > 
> > > he also came up with a way to use a 64-bit address in DSDT/SSDT.rev = 1:
> > > | when writing ASL one shall make sure that only XP supported
> > > | features are in global scope, which is evaluated when tables
> > > | are loaded and features of rev2 and higher are inside methods.
> > > | That way XP doesn't crash as long as it doesn't evaluate the
> > > | unsupported opcode, and one can guard those opcodes by checking the
> > > | _REV object if necessary.
> > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)
> > 
> > Yes, this technique works.
> > 
> > An alternative is to add an XSDT, XP ignores that.
> > XSDT at the moment breaks OVMF (because it loads both
> > the RSDT and the XSDT, which is wrong), but I think
> > Laszlo was working on a fix for that.
> Using XSDT would increase the RAM occupied by ACPI tables, as it
> would duplicate the DSDT + non-XP-supported AML at global namespace.

Not at all - I posted patches linking to the same
tables from both RSDT and XSDT at some point.
Only the list of pointers would be different.

> So far we've managed to keep the DSDT compatible with XP while
> introducing features from v2 and higher ACPI 

Re: How to reserve guest physical region for ACPI

2015-12-30 Thread Igor Mammedov
On Mon, 28 Dec 2015 14:50:15 +0200
"Michael S. Tsirkin"  wrote:

> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:
> > 
> > Hi Michael, Paolo,
> > 
> > Now it is time to return to the challenge of how to reserve a guest
> > physical region internally used by ACPI.
> > 
> > Igor suggested that:
> > | An alternative place to allocate reserve from could be high memory.
> > | For pc we have "reserved-memory-end" which currently makes sure
> > | that hotpluggable memory range isn't used by firmware
> > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)
> 
> I don't want to tie things to reserved-memory-end because this
> does not scale: next time we need to reserve memory,
> we'll need to find yet another way to figure out what is where.
Could you elaborate a bit more on the problem you're seeing?

To me it looks like it scales rather well.
For example, let's imagine that we are adding a device that has some
on-device memory that should be mapped into GPA. The code to do so
would look like:

  pc_machine_device_plug_cb(dev)
  {
      ...
      if (dev == OUR_NEW_DEVICE_TYPE) {
          memory_region_add_subregion(as, current_reserved_end, &dev->mr);
          set_new_reserved_end(current_reserved_end +
                               memory_region_size(&dev->mr));
      }
  }

we can practically add any number of new devices that way.

 
> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> support 64 bit RAM instead (and maybe a way to allocate and
> zero-initialize buffer without loading it through fwcfg), this way bios
> does the allocation, and addresses can be patched into acpi.
and then the guest side needs to parse/execute some AML that would
initialize the QEMU side so it would know where to write data.

bios-linker-loader is a great interface for initializing some
guest-owned data and linking it together, but I think it adds
unnecessary complexity and is misused if it's used to handle
device-owned data/on-device memory, as in this case and the VMGID case.

There was an RFC on the list to make the BIOS boot from NVDIMM,
already doing some ACPI table lookup/parsing. Now if they were forced
to also parse and execute AML to initialize QEMU with a guest-allocated
address, that would complicate them quite a bit.
While with the NVDIMM control memory region mapped directly by QEMU,
the respective patches don't need to initialize QEMU in any way; all
they would need is to read the necessary data from the control region.

Also, using bios-linker-loader takes away some usable RAM from the
guest, and in the end that doesn't scale: the more devices I add, the
less usable RAM is left for the guest OS, while all the device needs
is a piece of GPA address space that would belong to it.

> 
> See patch at the bottom that might be handy.
> 
> > he also came up with a way to use a 64-bit address in DSDT/SSDT.rev = 1:
> > | when writing ASL one shall make sure that only XP supported
> > | features are in global scope, which is evaluated when tables
> > | are loaded and features of rev2 and higher are inside methods.
> > | That way XP doesn't crash as long as it doesn't evaluate the
> > | unsupported opcode, and one can guard those opcodes by checking the
> > | _REV object if necessary.
> > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)
> 
> Yes, this technique works.
> 
> An alternative is to add an XSDT, XP ignores that.
> XSDT at the moment breaks OVMF (because it loads both
> the RSDT and the XSDT, which is wrong), but I think
> Laszlo was working on a fix for that.
Using XSDT would increase the RAM occupied by ACPI tables, as it
would duplicate the DSDT + non-XP-supported AML at global namespace.

So far we've managed to keep the DSDT compatible with XP while
introducing features from v2 and higher ACPI revisions as
AML that is only evaluated on demand.
We can continue doing so unless we have to unconditionally
add incompatible AML at global scope.


> 
> > Michael, Paolo, what do you think about these ideas?
> > 
> > Thanks!
> 
> 
> 
> So using the patch below, we can add Name(PQRS, 0x0) at the top of the
> SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> current offset so we can add that to the linker.
> 
> This won't work if you append the Name to the Aml structure (these can
> be nested to arbitrary depth using aml_append), so using a plain GArray
> for this API makes sense to me.
> 
> --->
> 
> acpi: add build_append_named_dword, returning an offset in buffer
> 
> This is a very limited form of support for runtime patching -
> similar in functionality to what we can do with ACPI_EXTRACT
> macros in python, but implemented in C.
> 
> This is to allow ACPI code direct access to data tables -
> which is exactly what DataTableRegion is there for, except
> no known windows release so far implements DataTableRegion.
Unsupported means Windows will BSOD, so it's practically
unusable unless MS patches currently existing Windows
versions.

Another thing about DataTableRegion is that ACPI tables are
supposed to have static content which matches the checksum in
the table 
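
For reference, a hedged sketch of the build_append_named_dword helper
quoted above; build_append_byte() and build_append_namestring() are
existing aml-build helpers, but the exact signature and encoding here
are assumptions, not the posted patch:

  /* Append Name(name, 0x0) to a raw GArray and return the offset of
   * the dword placeholder so the linker can patch it later. */
  static int build_append_named_dword(GArray *array, const char *name)
  {
          int offset, i;

          build_append_byte(array, 0x08);             /* NameOp */
          build_append_namestring(array, "%s", name);
          build_append_byte(array, 0x0C);             /* DWordPrefix */

          offset = array->len;
          for (i = 0; i < 4; i++) {
                  build_append_byte(array, 0x0);      /* patched by linker */
          }
          return offset;
  }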

Re: QUESTION: mmu.gva_to_gpa versus nested_mmu.gva_to_gpa

2015-12-30 Thread Paolo Bonzini


On 29/12/2015 17:37, David Matlack wrote:
>> > Yes, it's correct.

s/it's/you're/ :)

Paolo


[PATCH] virtio/s390: use dev_to_virtio

2015-12-30 Thread Geliang Tang
Use dev_to_virtio() instead of open-coding it.
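
For reference, dev_to_virtio() is the container_of() helper from
include/linux/virtio.h:

  static inline struct virtio_device *dev_to_virtio(struct device *_dev)
  {
          return container_of(_dev, struct virtio_device, dev);
  }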

Signed-off-by: Geliang Tang 
---
 drivers/s390/virtio/virtio_ccw.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index 1b83159..97231e1 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -945,8 +945,7 @@ static struct virtio_config_ops virtio_ccw_config_ops = {
 
 static void virtio_ccw_release_dev(struct device *_d)
 {
-   struct virtio_device *dev = container_of(_d, struct virtio_device,
-dev);
+   struct virtio_device *dev = dev_to_virtio(_d);
struct virtio_ccw_device *vcdev = to_vc_device(dev);
 
kfree(vcdev->status);
-- 
2.5.0




Re: QUESTION: mmu.gva_to_gpa versus nested_mmu.gva_to_gpa

2015-12-30 Thread David Matlack
On Wed, Dec 30, 2015 at 3:36 AM, Paolo Bonzini  wrote:
>
>
> On 29/12/2015 17:37, David Matlack wrote:
>>> > Yes, it's correct.
>
> s/it's/you're/ :)

Ah ok. Thanks for your help!

I will send a patch to fix the comment then.

>
> Paolo