Re: GET_RNG_SEED hypercall ABI? (Re: [PATCH v5 0/5] random,x86,kvm: Rework arch RNG seeds and get some from kvm)

2014-08-29 Thread Paolo Bonzini
Il 29/08/2014 02:13, Andy Lutomirski ha scritto:
 Hmm.  Then, assuming that someone manages to allocate a
 cross-hypervisor MSR number for this, what am I supposed to do in the
 KVM code?  Just make it available unconditionally?  I don't see why
 that wouldn't work reliably, but it seems like an odd design.

The odd part of it is what Gleb mentioned.

 Also, the one and only native feature flag I tested (rdtscp) actually
 does work: RDTSCP seems to send #UD if QEMU is passed -cpu
 host,-rdtscp.

True, and I'm not sure why.  There are a couple others.  I was thinking
more of things like SSE, AVX or DE (that affects the availability of a
bit in CR4).

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[question] virtio-blk performance degradation happened with virtio-serial

2014-08-29 Thread Zhang Haoyu
Hi, all

I started a VM with virtio-serial (default number of ports: 31) and found that
virtio-blk performance degraded by about 25%; the problem is 100% reproducible.
without virtio-serial:
4k-read-random 1186 IOPS
with virtio-serial:
4k-read-random 871 IOPS

But if I use the max_ports=2 option to limit the maximum number of virtio-serial
ports, the I/O performance degradation is much less severe, about 5%.

Also, IDE disk performance does not degrade with virtio-serial.
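For reference, the reported drop can be checked with a trivial helper (a sketch; the function name is mine, the IOPS figures are the ones measured above):

```c
#include <assert.h>

/* Percentage drop in IOPS between two measurements. */
double iops_drop_pct(double before, double after)
{
    return (before - after) * 100.0 / before;
}
```

With the numbers above this gives roughly 26.6%, consistent with the reported "about 25%".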

[environment]
Host OS: linux-3.10
QEMU: 2.0.1
Guest OS: windows server 2008

[qemu command]
/usr/bin/kvm -id 1587174272642 -chardev 
socket,id=qmp,path=/var/run/qemu-server/1587174272642.qmp,server,nowait -mon 
chardev=qmp,mode=control -vnc :0,websocket,to=200 -enable-kvm -pidfile 
/var/run/qemu-server/1587174272642.pid -daemonize -name win2008-32 -smp 
sockets=1,cores=1 -cpu core2duo -nodefaults -vga cirrus -no-hpet -k en-us -boot 
menu=on,splash-time=8000 -m 2048 -usb -drive 
if=none,id=drive-ide0,media=cdrom,aio=native -device 
ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200 -drive 
file=/sf/data/local/images/host-00e081de43d7/cea072c4294f/win2008-32.vm/vm-disk-1.qcow2,if=none,id=drive-ide2,cache=none,aio=native
 -device ide-hd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=100 -netdev 
type=tap,id=net0,ifname=158717427264200,script=/sf/etc/kvm/vtp-bridge -device 
e1000,mac=FE:FC:FE:D3:F9:2B,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300
-rtc driftfix=slew,clock=rt,base=localtime -global 
kvm-pit.lost_tick_policy=discard -global PIIX4_PM.disable_s3=1 -global 
PIIX4_PM.disable_s4=1

Any ideas?

Thanks,
Zhang Haoyu



Re: [PATCH 0/3] kvm-unit-tests: Check CPUID level/xlevel before using CPUID data

2014-08-29 Thread Paolo Bonzini
Il 28/08/2014 20:02, Eduardo Habkost ha scritto:
 Change the kvm-unit-tests x86 code to always check CPUID level/xlevel before
 looking at CPUID data. Otherwise, the test code will be looking at bogus data.
 
 Eduardo Habkost (3):
   x86: apic: Look up MAXPHYADDR on CPUID correctly
   x86: vmx: Use cpuid_maxphyaddr()
   x86: Check level, xlevel before returning CPUID data
 
  lib/x86/processor.h | 18 +-
  x86/apic.c  |  2 +-
  x86/vmx.c   |  6 +++---
  3 files changed, 21 insertions(+), 5 deletions(-)
 

Applying this series, thanks.

Paolo


Re: [PATCH 2/2] kvm: x86: fix stale mmio cache bug

2014-08-29 Thread Paolo Bonzini
Il 28/08/2014 23:10, David Matlack ha scritto:
 Paolo,
 It seems like this patch ([PATCH 2/2] kvm: x86: fix stale mmio cache)
 is ready to go. Is there anything blocking it from being merged?
 
 (It should be fine to merge this on its own, independent of the fix
 discussed in [PATCH 1/2] KVM: fix cache stale memslot info with
 correct mmio generation number, https://lkml.org/lkml/2014/8/14/62.)

I'll post the full series today.  Sorry, I've been swamped a bit.

Paolo


[PATCH 03/13] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops

2014-08-29 Thread Alexey Kardashevskiy
Modern IBM POWERPC systems support multiple IOMMU tables per PE
so we need a more reliable way (compared to container_of()) to get
a PE pointer from the iommu_table struct pointer used in IOMMU functions.

At the moment IOMMU group data points to an iommu_table struct. This
introduces a spapr_tce_iommu_group struct which keeps an iommu_owner
and a spapr_tce_iommu_ops struct. For IODA, iommu_owner is a pointer to
the pnv_ioda_pe struct, for others it is still a pointer to
the iommu_table struct. The ops structs correspond to the type which
iommu_owner points to.

This defines a get_table() callback which returns an iommu_table
by its number.

As the IOMMU group data pointer now points to a variable type instead of
iommu_table, the VFIO SPAPR TCE driver is updated to use the new type.
This changes the tce_container struct to store an iommu_group pointer instead
of an iommu_table pointer.

So, it was:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to iommu_table via iommu_group_get_iommudata();

now it is:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to spapr_tce_iommu_group via
iommu_group_get_iommudata();
- spapr_tce_iommu_group points to either (depending on .get_table()):
- iommu_table;
- pnv_ioda_pe;

This uses pnv_ioda1_iommu_get_table for both IODA1 and IODA2, but IODA2 will
get its own pnv_ioda2_iommu_get_table soon, and pnv_ioda1_iommu_get_table
will then only be used for IODA1.
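The indirection described above can be sketched in plain C (kernel types are stubbed out here; the struct layout follows the patch, the test wiring is illustrative only):

```c
#include <assert.h>
#include <stddef.h>

/* Stubbed-down versions of the kernel structures, for illustration only. */
struct iommu_table { unsigned long it_size; };

struct spapr_tce_iommu_group;
struct spapr_tce_iommu_ops {
    struct iommu_table *(*get_table)(struct spapr_tce_iommu_group *data,
                                     int num);
};
struct spapr_tce_iommu_group {
    void *iommu_owner;              /* iommu_table or pnv_ioda_pe */
    struct spapr_tce_iommu_ops *ops;
};

/* Default case: the owner is the iommu_table itself, window 0 only. */
struct iommu_table *default_get_table(struct spapr_tce_iommu_group *data,
                                      int num)
{
    struct iommu_table *tbl = data->iommu_owner;

    return (num == 0 && tbl->it_size) ? tbl : NULL;
}
```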

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/iommu.h|   6 ++
 arch/powerpc/include/asm/tce.h  |  13 +++
 arch/powerpc/kernel/iommu.c |  35 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c   |  31 +-
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   1 +
 arch/powerpc/platforms/powernv/pci.c|   2 +-
 arch/powerpc/platforms/pseries/iommu.c  |  10 +-
 drivers/vfio/vfio_iommu_spapr_tce.c | 148 ++--
 8 files changed, 208 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 42632c7..84ee339 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -108,13 +108,19 @@ extern void iommu_free_table(struct iommu_table *tbl, 
const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
int nid);
+
+struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
 extern void iommu_register_group(struct iommu_table *tbl,
+void *iommu_owner,
+struct spapr_tce_iommu_ops *ops,
 int pci_domain_number, unsigned long pe_num);
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 #else
 static inline void iommu_register_group(struct iommu_table *tbl,
+   void *iommu_owner,
+   struct spapr_tce_iommu_ops *ops,
int pci_domain_number,
unsigned long pe_num)
 {
diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 743f36b..9f159eb 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -50,5 +50,18 @@
 #define TCE_PCI_READ   0x1 /* read from PCI allowed */
 #define TCE_VB_WRITE   0x1 /* write from VB allowed */
 
+struct spapr_tce_iommu_group;
+
+struct spapr_tce_iommu_ops {
+   struct iommu_table *(*get_table)(
+   struct spapr_tce_iommu_group *data,
+   int num);
+};
+
+struct spapr_tce_iommu_group {
+   void *iommu_owner;
+   struct spapr_tce_iommu_ops *ops;
+};
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_TCE_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b378f78..1c5dae7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -878,24 +878,53 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t 
size,
  */
 static void group_release(void *iommu_data)
 {
-   struct iommu_table *tbl = iommu_data;
-   tbl->it_group = NULL;
+   kfree(iommu_data);
 }
 
+static struct iommu_table *spapr_tce_default_get_table(
+   struct spapr_tce_iommu_group *data, int num)
+{
+   struct iommu_table *tbl = data->iommu_owner;
+
+   switch (num) {
+   case 0:
+   if (tbl->it_size)
+   return tbl;
+   /* fallthru */
+   default:
+   return NULL;
+   }
+}
+
+static struct spapr_tce_iommu_ops spapr_tce_default_ops = {
+   .get_table = spapr_tce_default_get_table
+};
+
 void iommu_register_group(struct iommu_table *tbl,
+   void *iommu_owner, struct spapr_tce_iommu_ops *ops,
int pci_domain_number, unsigned long 

[PATCH 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows

2014-08-29 Thread Alexey Kardashevskiy
This defines and implements VFIO IOMMU API which lets the userspace
create and remove DMA windows.

This updates VFIO_IOMMU_SPAPR_TCE_GET_INFO to return the number of
available windows and page mask.

This adds VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE
to allow the user space to create and remove window(s).

The VFIO IOMMU driver does basic sanity checks and calls corresponding
SPAPR TCE functions. At the moment only IODA2 (POWER8 PCI host bridge)
implements them.

This advertises VFIO_IOMMU_SPAPR_TCE_FLAG_DDW capability via
VFIO_IOMMU_SPAPR_TCE_GET_INFO.

This calls platform DDW reset() callback when IOMMU is being disabled
to reset the DMA configuration to its original state.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 135 ++--
 include/uapi/linux/vfio.h   |  25 ++-
 2 files changed, 153 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 0dccbc4..b518891 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -190,18 +190,25 @@ static void tce_iommu_disable(struct tce_container 
*container)
 
container->enabled = false;
 
-   if (!container->grp)
return;
 
data = iommu_group_get_iommudata(container->grp);
if (!data || !data->iommu_owner || !data->ops->get_table)
return;
 
-   tbl = data->ops->get_table(data, 0);
-   if (!tbl)
-   return;
 
+   if (current->mm) {
+   tbl = data->ops->get_table(data, 0);
+   if (tbl)
+   decrement_locked_vm(tbl);

+   tbl = data->ops->get_table(data, 1);
+   if (tbl)
+   decrement_locked_vm(tbl);
+   }
+
+   if (data->ops->reset)
+   data->ops->reset(data);
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -243,7 +250,7 @@ static long tce_iommu_ioctl(void *iommu_data,
 unsigned int cmd, unsigned long arg)
 {
struct tce_container *container = iommu_data;
-   unsigned long minsz;
+   unsigned long minsz, ddwsz;
long ret;
 
switch (cmd) {
@@ -288,6 +295,28 @@ static long tce_iommu_ioctl(void *iommu_data,
info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
info.flags = 0;
 
+   ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
+   page_size_mask);
+
+   if (info.argsz == ddwsz) {
+   if (data->ops->query && data->ops->create &&
+   data->ops->remove) {
+   info.flags |= VFIO_IOMMU_SPAPR_TCE_FLAG_DDW;
+
+   ret = data->ops->query(data,
+   &info.current_windows,
+   &info.windows_available,
+   &info.page_size_mask);
+   if (ret)
+   return ret;
+   } else {
+   info.current_windows = 0;
+   info.windows_available = 0;
+   info.page_size_mask = 0;
+   }
+   minsz = ddwsz;
+   }
+
if (copy_to_user((void __user *)arg, &info, minsz))
return -EFAULT;
 
@@ -412,12 +441,106 @@ static long tce_iommu_ioctl(void *iommu_data,
tce_iommu_disable(container);
mutex_unlock(&container->lock);
return 0;
+
case VFIO_EEH_PE_OP:
if (!container->grp)
return -ENODEV;
 
return vfio_spapr_iommu_eeh_ioctl(container->grp,
  cmd, arg);
+
+   case VFIO_IOMMU_SPAPR_TCE_CREATE: {
+   struct vfio_iommu_spapr_tce_create create;
+   struct spapr_tce_iommu_group *data;
+   struct iommu_table *tbl;
+
+   if (WARN_ON(!container->grp))
+   return -ENXIO;
+
+   data = iommu_group_get_iommudata(container->grp);
+
+   minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
+   start_addr);
+
+   if (copy_from_user(&create, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (create.argsz < minsz)
+   return -EINVAL;
+
+   if (create.flags)
+   return -EINVAL;
+
+   if (!data->ops->create || !data->iommu_owner)
+   return -ENOSYS;
+
+   BUG_ON(!data || !data->ops || !data->ops->remove);
+
+   

[PATCH 11/13] vfio: powerpc/spapr: Move locked_vm accounting to helpers

2014-08-29 Thread Alexey Kardashevskiy
This moves locked pages accounting into helpers.
Later they will be reused for Dynamic DMA windows (DDW).

While we are here, update the comment explaining why RLIMIT_MEMLOCK
might be required to be bigger than the guest RAM.
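The accounting these helpers perform reduces to a shift conversion between IOMMU pages and system pages. A user-space sketch (the shift values in the test are illustrative: 4K IOMMU pages, 64K system pages):

```c
#include <assert.h>

/* Number of system pages pinned for a TCE table with it_size entries:
 * bytes covered = it_size << iommu_page_shift, then convert to pages. */
unsigned long locked_pages(unsigned long it_size,
                           unsigned int iommu_page_shift,
                           unsigned int page_shift)
{
    return (it_size << iommu_page_shift) >> page_shift;
}
```

For a 2GB window of 4K TCEs on a 64K-page host, this charges 32768 system pages against RLIMIT_MEMLOCK.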

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 71 +++--
 1 file changed, 53 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 1c1a9c4..c9fac97 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -29,6 +29,46 @@
 static void tce_iommu_detach_group(void *iommu_data,
struct iommu_group *iommu_group);
 
+static long try_increment_locked_vm(struct iommu_table *tbl)
+{
+   long ret = 0, locked, lock_limit, npages;
+
+   if (!current || !current->mm)
+   return -ESRCH; /* process exited */
+
+   npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+
+   down_write(&current->mm->mmap_sem);
+   locked = current->mm->locked_vm + npages;
+   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+   if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+   pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
+   rlimit(RLIMIT_MEMLOCK));
+   ret = -ENOMEM;
+   } else {
+   current->mm->locked_vm += npages;
+   }
+   up_write(&current->mm->mmap_sem);
+
+   return ret;
+}
+
+static void decrement_locked_vm(struct iommu_table *tbl)
+{
+   long npages;
+
+   if (!current || !current->mm)
+   return; /* process exited */
+
+   npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+
+   down_write(&current->mm->mmap_sem);
+   if (npages > current->mm->locked_vm)
+   npages = current->mm->locked_vm;
+   current->mm->locked_vm -= npages;
+   up_write(&current->mm->mmap_sem);
+}
+
 /*
  * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
  *
@@ -86,7 +126,6 @@ static void tce_iommu_take_ownership_notify(struct 
spapr_tce_iommu_group *data,
 static int tce_iommu_enable(struct tce_container *container)
 {
int ret = 0;
-   unsigned long locked, lock_limit, npages;
struct iommu_table *tbl;
struct spapr_tce_iommu_group *data;
 
@@ -120,24 +159,23 @@ static int tce_iommu_enable(struct tce_container 
*container)
 * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
 * that would effectively kill the guest at random points, much better
 * enforcing the limit based on the max that the guest can map.
+*
+* Unfortunately at the moment it counts whole tables, no matter how
+* much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups
+* each with 2GB DMA window, 8GB will be counted here. The reason for
+* this is that we cannot tell here the amount of RAM used by the guest
+* as this information is only available from KVM and VFIO is
+* KVM agnostic.
 */
tbl = data->ops->get_table(data, 0);
if (!tbl)
return -ENXIO;
 
-   down_write(&current->mm->mmap_sem);
-   npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
-   locked = current->mm->locked_vm + npages;
-   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-   if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
-   pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
-   rlimit(RLIMIT_MEMLOCK));
-   ret = -ENOMEM;
-   } else {
-   current->mm->locked_vm += npages;
-   container->enabled = true;
-   }
-   up_write(&current->mm->mmap_sem);
+   ret = try_increment_locked_vm(tbl);
+   if (ret)
+   return ret;
+
+   container-enabled = true;
 
return ret;
 }
@@ -163,10 +201,7 @@ static void tce_iommu_disable(struct tce_container 
*container)
if (!tbl)
return;
 
-   down_write(&current->mm->mmap_sem);
-   current->mm->locked_vm -= (tbl->it_size <<
-   IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
-   up_write(&current->mm->mmap_sem);
+   decrement_locked_vm(tbl);
 }
 
 static void *tce_iommu_open(unsigned long arg)
-- 
2.0.0



[PATCH 12/13] vfio: powerpc/spapr: Use it_page_size

2014-08-29 Thread Alexey Kardashevskiy
This makes use of it_page_size from the iommu_table struct, as the page size
can differ.

This also replaces the now-missing IOMMU_PAGE_SHIFT macro in commented-out
debug code, as the recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index c9fac97..0dccbc4 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -36,7 +36,7 @@ static long try_increment_locked_vm(struct iommu_table *tbl)
if (!current || !current->mm)
return -ESRCH; /* process exited */
 
-   npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+   npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
 
down_write(&current->mm->mmap_sem);
locked = current->mm->locked_vm + npages;
@@ -60,7 +60,7 @@ static void decrement_locked_vm(struct iommu_table *tbl)
if (!current || !current->mm)
return; /* process exited */
 
-   npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+   npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
 
down_write(&current->mm->mmap_sem);
if (npages > current->mm->locked_vm)
@@ -284,8 +284,8 @@ static long tce_iommu_ioctl(void *iommu_data,
if (info.argsz < minsz)
return -EINVAL;
 
-   info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K;
-   info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K;
+   info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
+   info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
info.flags = 0;
 
if (copy_to_user((void __user *)arg, &info, minsz))
@@ -318,10 +318,6 @@ static long tce_iommu_ioctl(void *iommu_data,
VFIO_DMA_MAP_FLAG_WRITE))
return -EINVAL;
 
-   if ((param.size & ~IOMMU_PAGE_MASK_4K) ||
-   (param.vaddr & ~IOMMU_PAGE_MASK_4K))
-   return -EINVAL;
-
/* iova is checked by the IOMMU API */
tce = param.vaddr;
if (param.flags & VFIO_DMA_MAP_FLAG_READ)
@@ -334,21 +330,25 @@ static long tce_iommu_ioctl(void *iommu_data,
return -ENXIO;
BUG_ON(!tbl->it_group);
 
+   if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
+   (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
+   return -EINVAL;
+
ret = iommu_tce_put_param_check(tbl, param.iova, tce);
if (ret)
return ret;
 
-   for (i = 0; i < (param.size >> IOMMU_PAGE_SHIFT_4K); ++i) {
+   for (i = 0; i < (param.size >> tbl->it_page_shift); ++i) {
ret = iommu_put_tce_user_mode(tbl,
-   (param.iova >> IOMMU_PAGE_SHIFT_4K) + i,
+   (param.iova >> tbl->it_page_shift) + i,
tce);
if (ret)
break;
-   tce += IOMMU_PAGE_SIZE_4K;
+   tce += IOMMU_PAGE_SIZE(tbl);
}
if (ret)
iommu_clear_tces_and_put_pages(tbl,
-   param.iova >> IOMMU_PAGE_SHIFT_4K, i);
+   param.iova >> tbl->it_page_shift, i);
 
iommu_flush_tce(tbl);
 
@@ -379,23 +379,23 @@ static long tce_iommu_ioctl(void *iommu_data,
if (param.flags)
return -EINVAL;
 
-   if (param.size & ~IOMMU_PAGE_MASK_4K)
-   return -EINVAL;
-
tbl = spapr_tce_find_table(container, data, param.iova);
if (!tbl)
return -ENXIO;
 
+   if (param.size & ~IOMMU_PAGE_MASK(tbl))
+   return -EINVAL;
+
BUG_ON(!tbl->it_group);
 
ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
-   param.size >> IOMMU_PAGE_SHIFT_4K);
+   param.size >> tbl->it_page_shift);
if (ret)
return ret;
 
ret = iommu_clear_tces_and_put_pages(tbl,
-   param.iova >> IOMMU_PAGE_SHIFT_4K,
-   param.size >> IOMMU_PAGE_SHIFT_4K);
+   param.iova >> tbl->it_page_shift,
+   param.size >> tbl->it_page_shift);
iommu_flush_tce(tbl);
 
return ret;
-- 
2.0.0


[PATCH 10/13] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA

2014-08-29 Thread Alexey Kardashevskiy
SPAPR defines an interface to create additional DMA windows dynamically.
Dynamically means that the window is not allocated before the guest has even
started; the guest can request it later. In practice, existing Linux guests
check for the capability and, if it is present, create and map a DMA window
as big as the entire guest RAM.

This adds 4 callbacks to the spapr_tce_iommu_ops struct:
1. query - ibm,query-pe-dma-window - returns number/size of windows
which can be created (one, any page size);

2. create - ibm,create-pe-dma-window - creates a window;

3. remove - ibm,remove-pe-dma-window - removes a window; removing
the default 32bit window is not allowed by this patch, this will be added
later if needed;

4. reset - ibm,reset-pe-dma-window - resets the DMA window configuration
to the default state; as the default window cannot be removed, it only
removes the additional window if one was created.

The next patch will add corresponding ioctls to VFIO SPAPR TCE driver to
provide necessary support to the userspace.
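As an illustration of the page-size mask that query reports, here is a user-space sketch (the DDW_PGSIZE_* values match the ones defined in the patch below; the helper itself is hypothetical):

```c
#include <assert.h>

/* Page size flags for ibm,query-pe-dma-window, as in the patch. */
#define DDW_PGSIZE_4K   0x01
#define DDW_PGSIZE_64K  0x02
#define DDW_PGSIZE_16M  0x04

/* Returns non-zero if a window with the given page shift can be created. */
int ddw_pgsize_supported(unsigned int mask, unsigned int page_shift)
{
    switch (page_shift) {
    case 12: return mask & DDW_PGSIZE_4K;
    case 16: return mask & DDW_PGSIZE_64K;
    case 24: return mask & DDW_PGSIZE_16M;
    default: return 0;
    }
}
```

IODA2 in this series advertises exactly the 4K, 64K and 16M sizes.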

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/tce.h|  22 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 159 +-
 arch/powerpc/platforms/powernv/pci.h  |   1 +
 3 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index e6355f9..23b0362 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -58,6 +58,28 @@ struct spapr_tce_iommu_ops {
int num);
void (*take_ownership)(struct spapr_tce_iommu_group *data,
bool enable);
+
+   /* Dynamic DMA window */
+   /* Page size flags for ibm,query-pe-dma-window */
+#define DDW_PGSIZE_4K   0x01
+#define DDW_PGSIZE_64K  0x02
+#define DDW_PGSIZE_16M  0x04
+#define DDW_PGSIZE_32M  0x08
+#define DDW_PGSIZE_64M  0x10
+#define DDW_PGSIZE_128M 0x20
+#define DDW_PGSIZE_256M 0x40
+#define DDW_PGSIZE_16G  0x80
+   long (*query)(struct spapr_tce_iommu_group *data,
+   __u32 *current_windows,
+   __u32 *windows_available,
+   __u32 *page_size_mask);
+   long (*create)(struct spapr_tce_iommu_group *data,
+   __u32 page_shift,
+   __u32 window_shift,
+   struct iommu_table **ptbl);
+   long (*remove)(struct spapr_tce_iommu_group *data,
+   struct iommu_table *tbl);
+   long (*reset)(struct spapr_tce_iommu_group *data);
 };
 
 struct spapr_tce_iommu_group {
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 296f49b..a6318cb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1154,6 +1154,26 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb 
*phb,
pnv_pci_ioda2_set_bypass(pe, true);
 }
 
+static struct iommu_table *pnv_ioda2_iommu_get_table(
+   struct spapr_tce_iommu_group *data,
+   int num)
+{
+   struct pnv_ioda_pe *pe = data->iommu_owner;
+
+   switch (num) {
+   case 0:
+   if (pe->tce32.table.it_size)
+   return &pe->tce32.table;
+   return NULL;
+   case 1:
+   if (pe->tce64.table.it_size)
+   return &pe->tce64.table;
+   return NULL;
+   default:
+   return NULL;
+   }
+}
+
 static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
 bool enable)
 {
@@ -1162,9 +1182,146 @@ static void pnv_ioda2_take_ownership(struct 
spapr_tce_iommu_group *data,
pnv_pci_ioda2_set_bypass(pe, !enable);
 }
 
+static long pnv_pci_ioda2_ddw_query(struct spapr_tce_iommu_group *data,
+   __u32 *current_windows,
+   __u32 *windows_available, __u32 *page_size_mask)
+{
+   struct pnv_ioda_pe *pe = data->iommu_owner;
+
+   *windows_available = 2;
+   *current_windows = 0;
+   if (pe->tce32.table.it_size) {
+   --*windows_available;
+   ++*current_windows;
+   }
+   if (pe->tce64.table.it_size) {
+   --*windows_available;
+   ++*current_windows;
+   }
+   *page_size_mask =
+   DDW_PGSIZE_4K |
+   DDW_PGSIZE_64K |
+   DDW_PGSIZE_16M;
+
+   return 0;
+}
+
+static long pnv_pci_ioda2_ddw_create(struct spapr_tce_iommu_group *data,
+   __u32 page_shift, __u32 window_shift,
+   struct iommu_table **ptbl)
+{
+   struct pnv_ioda_pe *pe = data->iommu_owner;
+   struct pnv_phb *phb = pe->phb;
+   struct page *tce_mem = NULL;
+   void *addr;
+   long ret;
+   unsigned long tce_table_size =
+   (1ULL << (window_shift - page_shift)) * 8;
+   unsigned 

[PATCH 05/13] powerpc/iommu: Fix IOMMU ownership control functions

2014-08-29 Thread Alexey Kardashevskiy
This adds missing locks in iommu_take_ownership()/
iommu_release_ownership().

This marks all pages busy in iommu_table::it_map in order to catch
errors if there is an attempt to use this table while ownership over it
is taken.

This only clears TCE content if there is no page marked busy in it_map.
Clearing must be done outside of the table locks as iommu_clear_tce()
called from iommu_clear_tces_and_put_pages() does this.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/kernel/iommu.c | 36 +---
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index c2c8d9d..cd80867 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1126,33 +1126,55 @@ EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
 
 int iommu_take_ownership(struct iommu_table *tbl)
 {
-   unsigned long sz = (tbl->it_size + 7) >> 3;
+   unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+   int ret = 0, bit0 = 0;
+
+   spin_lock_irqsave(&tbl->large_pool.lock, flags);
+   for (i = 0; i < tbl->nr_pools; i++)
+   spin_lock(&tbl->pools[i].lock);
 
if (tbl->it_offset == 0)
-   clear_bit(0, tbl->it_map);
+   bit0 = test_and_clear_bit(0, tbl->it_map);
 
if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
pr_err("iommu_tce: it_map is not empty");
-   return -EBUSY;
+   ret = -EBUSY;
+   if (bit0)
+   set_bit(0, tbl->it_map);
+   } else {
+   memset(tbl->it_map, 0xff, sz);
}
 
-   memset(tbl->it_map, 0xff, sz);
-   iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+   for (i = 0; i < tbl->nr_pools; i++)
+   spin_unlock(&tbl->pools[i].lock);
+   spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 
-   return 0;
+   if (!ret)
+   iommu_clear_tces_and_put_pages(tbl, tbl->it_offset,
+   tbl->it_size);
+   return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
 
 void iommu_release_ownership(struct iommu_table *tbl)
 {
-   unsigned long sz = (tbl->it_size + 7) >> 3;
+   unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 
iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+
+   spin_lock_irqsave(&tbl->large_pool.lock, flags);
+   for (i = 0; i < tbl->nr_pools; i++)
+   spin_lock(&tbl->pools[i].lock);
+
memset(tbl->it_map, 0, sz);
 
/* Restore bit#0 set by iommu_init_table() */
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
+
+   for (i = 0; i < tbl->nr_pools; i++)
+   spin_unlock(&tbl->pools[i].lock);
+   spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
-- 
2.0.0



[PATCH 09/13] powerpc/pseries/lpar: Enable VFIO

2014-08-29 Thread Alexey Kardashevskiy
The previous patch introduced the iommu_table_ops::exchange() callback,
which effectively disabled VFIO on pseries. This implements exchange()
for pseries/lpar so VFIO can work in nested guests.

Since the exchange() callback returns the old TCE, it has to call H_GET_TCE
for every TCE being put into the table, so VFIO performance in guests
running under PR KVM is expected to be slower than in guests running under
HV KVM or on bare metal hosts.
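The read-then-write semantics can be sketched in plain C (the hypercalls are stubbed as plain array accesses; names are illustrative only):

```c
#include <assert.h>
#include <stddef.h>

/* Toy TCE table: exchange writes the new entry and returns the old one,
 * mimicking what an H_GET_TCE followed by H_PUT_TCE achieves per entry. */
unsigned long tce_exchange(unsigned long *table, size_t index,
                           unsigned long new_tce)
{
    unsigned long old = table[index];   /* H_GET_TCE */
    table[index] = new_tce;             /* H_PUT_TCE */
    return old;
}
```

The extra read per entry is exactly where the PR KVM overhead comes from.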

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/platforms/pseries/iommu.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9a7364f..ae15b5a 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -138,13 +138,14 @@ static void tce_freemulti_pSeriesLP(struct iommu_table*, 
long, long);
 
 static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
long npages, unsigned long uaddr,
+   unsigned long *old_tces,
enum dma_data_direction direction,
struct dma_attrs *attrs)
 {
u64 rc = 0;
u64 proto_tce, tce;
u64 rpn;
-   int ret = 0;
+   int ret = 0, i = 0;
long tcenum_start = tcenum, npages_start = npages;
 
rpn = __pa(uaddr) >> TCE_SHIFT;
@@ -154,6 +155,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, 
long tcenum,
 
while (npages--) {
tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
+   if (old_tces)
+   plpar_tce_get((u64)tbl->it_index, (u64)tcenum << 12,
+   &old_tces[i++]);
rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
 
if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
@@ -179,8 +183,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, 
long tcenum,
 
 static DEFINE_PER_CPU(__be64 *, tce_page);
 
-static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+static int tce_xchg_pSeriesLP(struct iommu_table *tbl, long tcenum,
 long npages, unsigned long uaddr,
+unsigned long *old_tces,
 enum dma_data_direction direction,
 struct dma_attrs *attrs)
 {
@@ -195,6 +200,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
 
if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
+  old_tces,
   direction, attrs);
}
 
@@ -211,6 +217,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
if (!tcep) {
local_irq_restore(flags);
return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
+   old_tces,
direction, attrs);
}
__get_cpu_var(tce_page) = tcep;
@@ -232,6 +239,10 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
for (l = 0; l < limit; l++) {
tcep[l] = cpu_to_be64(proto_tce |
(rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
rpn++;
+   if (old_tces)
+   plpar_tce_get((u64)tbl->it_index,
+   (u64)(tcenum + l) << 12,
+   &old_tces[tcenum + l]);
}
 
rc = plpar_tce_put_indirect((u64)tbl->it_index,
@@ -262,6 +273,15 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
return ret;
 }
 
+static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+long npages, unsigned long uaddr,
+enum dma_data_direction direction,
+struct dma_attrs *attrs)
+{
+   return tce_xchg_pSeriesLP(tbl, tcenum, npages, uaddr, NULL,
+   direction, attrs);
+}
+
 static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long 
npages)
 {
u64 rc;
@@ -637,6 +657,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 
 struct iommu_table_ops iommu_table_lpar_multi_ops = {
.set = tce_buildmulti_pSeriesLP,
+   .exchange = tce_xchg_pSeriesLP,
.clear = tce_freemulti_pSeriesLP,
.get = tce_get_pSeriesLP
 };
-- 
2.0.0


[PATCH 06/13] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table

2014-08-29 Thread Alexey Kardashevskiy
This adds an iommu_table_ops struct and puts a pointer to it into
the iommu_table struct. This moves the tce_build/tce_free/tce_get/tce_flush
callbacks from ppc_md to the new struct, where they really belong.

This adds an extra @ops parameter to iommu_init_table() to make sure
that we do not leave any IOMMU table without iommu_table_ops. @it_ops is
initialized in the very beginning as iommu_init_table() calls
iommu_table_clear() and the latter uses callbacks already.

This does s/tce_build/set/, s/tce_free/clear/ and removes tce_ prefixes
for better readability.

This removes tce_xxx_rm handlers from ppc_md as well but does not add
them to iommu_table_ops, this will be done later if we decide to support
TCE hypercalls in real mode.

This always uses tce_buildmulti_pSeriesLP/tce_freemulti_pSeriesLP as
callbacks for pseries. This changes multi callbacks to fall back to
tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
present. The reason for this is we still have to support multitce=off
boot parameter in disable_multitce() and we do not want to walk through
all IOMMU tables in the system and replace multi callbacks with single
ones.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/iommu.h| 20 +++-
 arch/powerpc/include/asm/machdep.h  | 25 ---
 arch/powerpc/kernel/iommu.c | 50 -
 arch/powerpc/kernel/vio.c   |  5 ++-
 arch/powerpc/platforms/cell/iommu.c |  9 --
 arch/powerpc/platforms/pasemi/iommu.c   |  8 +++--
 arch/powerpc/platforms/powernv/pci-ioda.c   |  4 +--
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  3 +-
 arch/powerpc/platforms/powernv/pci.c| 24 --
 arch/powerpc/platforms/powernv/pci.h|  1 +
 arch/powerpc/platforms/pseries/iommu.c  | 42 +---
 arch/powerpc/sysdev/dart_iommu.c| 13 
 12 files changed, 102 insertions(+), 102 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2b0b01d..c725e4a 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -43,6 +43,22 @@
 extern int iommu_is_off;
 extern int iommu_force_on;
 
+struct iommu_table_ops {
+   int (*set)(struct iommu_table *tbl,
+   long index, long npages,
+   unsigned long uaddr,
+   enum dma_data_direction direction,
+   struct dma_attrs *attrs);
+   void (*clear)(struct iommu_table *tbl,
+   long index, long npages);
+   unsigned long (*get)(struct iommu_table *tbl, long index);
+   void (*flush)(struct iommu_table *tbl);
+};
+
+/* These are used by VIO */
+extern struct iommu_table_ops iommu_table_lpar_multi_ops;
+extern struct iommu_table_ops iommu_table_pseries_ops;
+
 /*
  * IOMAP_MAX_ORDER defines the largest contiguous block
  * of dma space we can get.  IOMAP_MAX_ORDER = 13
@@ -77,6 +93,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
 #endif
+   struct iommu_table_ops *it_ops;
 };
 
 /* Pure 2^n version of get_order */
@@ -106,7 +123,8 @@ extern void iommu_free_table(struct iommu_table *tbl, const 
char *node_name);
  * structure
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
-   int nid);
+   int nid,
+   struct iommu_table_ops *ops);
 
 struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index b125cea..1fc824d 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -65,31 +65,6 @@ struct machdep_calls {
 * destroyed as well */
void(*hpte_clear_all)(void);
 
-   int (*tce_build)(struct iommu_table *tbl,
-long index,
-long npages,
-unsigned long uaddr,
-enum dma_data_direction direction,
-struct dma_attrs *attrs);
-   void(*tce_free)(struct iommu_table *tbl,
-   long index,
-   long npages);
-   unsigned long   (*tce_get)(struct iommu_table *tbl,
-   long index);
-   void(*tce_flush)(struct iommu_table *tbl);
-
-   /* _rm versions are for real mode use only */
-   int (*tce_build_rm)(struct iommu_table *tbl,
-long index,
-long npages,
-unsigned long uaddr,
-enum dma_data_direction 

[PATCH 07/13] powerpc/powernv: Do not set read flag if direction==DMA_NONE

2014-08-29 Thread Alexey Kardashevskiy
Normally a bitmap from the iommu_table is used to track which TCE entries
are in use. Since we are going to use the iommu_table without its locks and
do xchg() instead, it becomes essential not to set permission bits that are
not implied by the direction flag.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/platforms/powernv/pci.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index deddcad..ab79e2d 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -628,10 +628,18 @@ static int pnv_tce_build(struct iommu_table *tbl, long 
index, long npages,
__be64 *tcep, *tces;
u64 rpn;
 
-   proto_tce = TCE_PCI_READ; // Read allowed
-
-   if (direction != DMA_TO_DEVICE)
-   proto_tce |= TCE_PCI_WRITE;
+   switch (direction) {
+   case DMA_BIDIRECTIONAL:
+   case DMA_FROM_DEVICE:
+   proto_tce = TCE_PCI_READ | TCE_PCI_WRITE;
+   break;
+   case DMA_TO_DEVICE:
+   proto_tce = TCE_PCI_READ;
+   break;
+   default:
+   proto_tce = 0;
+   break;
+   }
 
	tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
	rpn = __pa(uaddr) >> tbl->it_page_shift;
-- 
2.0.0



[PATCH 04/13] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()

2014-08-29 Thread Alexey Kardashevskiy
At the moment the iommu_table struct has a set_bypass() callback which enables/
disables DMA bypass on the IODA2 PHB. This is exposed to POWERPC IOMMU code,
which calls this callback when external IOMMU users such as VFIO are
about to take control over a PHB.

Since the set_bypass() is not really an iommu_table function but PE's
function, and we have an ops struct per IOMMU owner, let's move
set_bypass() to the spapr_tce_iommu_ops struct.

As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and
has very little to do with PEs, this moves take_ownership() calls to
the VFIO SPAPR TCE driver.

This renames set_bypass() to take_ownership() as it is not necessarily
just enabling bypassing, it can be something else/more so let's give it
a generic name. The bool parameter is inverted.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/iommu.h  |  1 -
 arch/powerpc/include/asm/tce.h|  2 ++
 arch/powerpc/kernel/iommu.c   | 12 
 arch/powerpc/platforms/powernv/pci-ioda.c | 20 
 drivers/vfio/vfio_iommu_spapr_tce.c   | 16 
 5 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 84ee339..2b0b01d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -77,7 +77,6 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
 #endif
-   void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 9f159eb..e6355f9 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -56,6 +56,8 @@ struct spapr_tce_iommu_ops {
struct iommu_table *(*get_table)(
struct spapr_tce_iommu_group *data,
int num);
+   void (*take_ownership)(struct spapr_tce_iommu_group *data,
+   bool enable);
 };
 
 struct spapr_tce_iommu_group {
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 1c5dae7..c2c8d9d 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1139,14 +1139,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
	memset(tbl->it_map, 0xff, sz);
	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 
-   /*
-* Disable iommu bypass, otherwise the user can DMA to all of
-* our physical memory via the bypass window instead of just
-* the pages that has been explicitly mapped into the iommu
-*/
-	if (tbl->set_bypass)
-		tbl->set_bypass(tbl, false);
-
return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
@@ -1161,10 +1153,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
/* Restore bit#0 set by iommu_init_table() */
if (tbl-it_offset == 0)
		set_bit(0, tbl->it_map);
-
-	/* The kernel owns the device now, we can restore the iommu bypass */
-	if (tbl->set_bypass)
-		tbl->set_bypass(tbl, true);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2d32a1c..8cb2f31 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1105,10 +1105,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
*phb,
__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
 
-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 {
-   struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
- tce32.table);
	uint16_t window_id = (pe->pe_number << 1 ) + 1;
int64_t rc;
 
@@ -1136,7 +1134,7 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table 
*tbl, bool enable)
 * host side.
 */
	if (pe->pdev)
-		set_iommu_table_base(&pe->pdev->dev, tbl);
+		set_iommu_table_base(&pe->pdev->dev, &pe->tce32.table);
	else
		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
}
@@ -1152,15 +1150,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct 
pnv_phb *phb,
/* TVE #1 is selected by PCI address bit 59 */
	pe->tce_bypass_base = 1ull << 59;
 
-   /* Install set_bypass callback for VFIO */
-	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
-
/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
+   pnv_pci_ioda2_set_bypass(pe, true);
+}
+
+static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
+bool 

[PATCH 08/13] powerpc/powernv: Release replaced TCE

2014-08-29 Thread Alexey Kardashevskiy
At the moment writing new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However PAPR specification allows
the guest to write new TCE value without clearing it first.

Another problem this patch is addressing is the use of pool locks for
external IOMMU users such as VFIO. The pool locks protect the DMA page
allocator rather than the entries, and since the host kernel does not
control what pages are in use, there is no point in pool locks;
exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same
thing as set() plus it returns replaced TCE(s) so the caller can release
the pages afterwards.

This makes iommu_tce_build() put pages returned by exchange().

This replaces iommu_clear_tce() with iommu_tce_build(), which can now
call exchange() with TCE==NULL (i.e. clear).

This preserves permission bits in TCE in iommu_put_tce_user_mode().

This removes use of pool locks for external IOMMU uses.

This disables external IOMMU use (i.e. VFIO) for IOMMUs which do not
implement exchange() callback. Therefore the powernv platform is
the only supported one after this patch.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/iommu.h |  8 +++--
 arch/powerpc/kernel/iommu.c  | 62 
 arch/powerpc/platforms/powernv/pci.c | 40 +++
 3 files changed, 67 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index c725e4a..8e0537d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -49,6 +49,12 @@ struct iommu_table_ops {
unsigned long uaddr,
enum dma_data_direction direction,
struct dma_attrs *attrs);
+   int (*exchange)(struct iommu_table *tbl,
+   long index, long npages,
+   unsigned long uaddr,
+   unsigned long *old_tces,
+   enum dma_data_direction direction,
+   struct dma_attrs *attrs);
void (*clear)(struct iommu_table *tbl,
long index, long npages);
unsigned long (*get)(struct iommu_table *tbl, long index);
@@ -209,8 +215,6 @@ extern int iommu_tce_put_param_check(struct iommu_table 
*tbl,
unsigned long ioba, unsigned long tce);
 extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
-   unsigned long entry);
 extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
 extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 678fee8..39ccce7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1006,43 +1006,11 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
 
-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
-   unsigned long oldtce;
-   struct iommu_pool *pool = get_pool(tbl, entry);
-
-	spin_lock(&(pool->lock));
-
-	oldtce = tbl->it_ops->get(tbl, entry);
-	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-		tbl->it_ops->clear(tbl, entry, 1);
-	else
-		oldtce = 0;
-
-	spin_unlock(&(pool->lock));
-
-   return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
 int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
 {
-   unsigned long oldtce;
-   struct page *page;
-
for ( ; pages; --pages, ++entry) {
-   oldtce = iommu_clear_tce(tbl, entry);
-   if (!oldtce)
-   continue;
-
-		page = pfn_to_page(oldtce >> PAGE_SHIFT);
-		WARN_ON(!page);
-		if (page) {
-			if (oldtce & TCE_PCI_WRITE)
-				SetPageDirty(page);
-			put_page(page);
-		}
+   iommu_tce_build(tbl, entry, 0, DMA_NONE);
}
 
return 0;
@@ -1056,18 +1024,19 @@ EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
 int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
unsigned long hwaddr, enum dma_data_direction direction)
 {
-   int ret = -EBUSY;
+   int ret;
unsigned long oldtce;
-   struct iommu_pool *pool = get_pool(tbl, entry);
 
-	spin_lock(&(pool->lock));
+	ret = tbl->it_ops->exchange(tbl, entry, 1, hwaddr, &oldtce,
+			direction, NULL);
 
-	oldtce = tbl->it_ops->get(tbl, entry);
-   /* Add new entry if it is not 

[PATCH 02/13] powerpc/powernv: Make invalidate() a callback

2014-08-29 Thread Alexey Kardashevskiy
At the moment pnv_pci_ioda_tce_invalidate() gets the PE pointer via
container_of(tbl). Since we are going to have to add Dynamic DMA windows
and that means having 2 IOMMU tables per PE, this is not going to work.

This implements pnv_pci_ioda(1|2)_tce_invalidate as a pnv_ioda_pe callback.

This adds a pnv_iommu_table wrapper around iommu_table and stores a pointer
to PE there. PNV's ppc_md.tce_build() call uses this to find PE and
do the invalidation. This will be used later for Dynamic DMA windows too.

This registers invalidate() callbacks for IODA1 and IODA2:
- pnv_pci_ioda1_tce_invalidate;
- pnv_pci_ioda2_tce_invalidate.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 35 ---
 arch/powerpc/platforms/powernv/pci.c  | 31 ---
 arch/powerpc/platforms/powernv/pci.h  | 13 +++-
 3 files changed, 48 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index df241b1..136e765 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -857,7 +857,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, 
struct pci_dev *pdev
 
pe = phb-ioda.pe_array[pdn-pe_number];
	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32.table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -884,7 +884,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
} else {
		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, &pe->tce32.table);
	}
	*pdev->dev.dma_mask = dma_mask;
return 0;
@@ -899,9 +899,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
	list_for_each_entry(dev, &bus->devices, bus_list) {
		if (add_to_iommu_group)
			set_iommu_table_base_and_group(&dev->dev,
-						       &pe->tce32_table);
+						       &pe->tce32.table);
		else
-			set_iommu_table_base(&dev->dev, &pe->tce32_table);
+			set_iommu_table_base(&dev->dev, &pe->tce32.table);
 
		if (dev->subordinate)
			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -988,19 +988,6 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
pnv_ioda_pe *pe,
}
 }
 
-void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
-__be64 *startp, __be64 *endp, bool rm)
-{
-   struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
- tce32_table);
-	struct pnv_phb *phb = pe->phb;
-
-	if (phb->type == PNV_PHB_IODA1)
-   pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
-   else
-   pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
-}
-
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
  struct pnv_ioda_pe *pe, unsigned int base,
  unsigned int segs)
@@ -1058,9 +1045,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
*phb,
}
 
/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->tce32.table;
	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
				  base << 28, IOMMU_PAGE_SHIFT_4K);
+	pe->tce32.pe = pe;
+	pe->tce32.invalidate_fn = pnv_pci_ioda1_tce_invalidate;
 
/* OPAL variant of P7IOC SW invalidated TCEs */
	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
@@ -1097,7 +1086,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
- tce32_table);
+ tce32.table);
	uint16_t window_id = (pe->pe_number << 1 ) + 1;
int64_t rc;
 
@@ -1142,10 +1131,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct 
pnv_phb *phb,
	pe->tce_bypass_base = 1ull << 59;
 
/* Install set_bypass callback for VFIO */
-	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
 
	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
 }
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1193,9 +1182,11 @@ static void 

[PATCH 01/13] powerpc/iommu: Check that TCE page size is equal to it_page_size

2014-08-29 Thread Alexey Kardashevskiy
This checks that the TCE table page size is not bigger than the size of
the page we have just pinned and whose physical address we are going to put into the table.

Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned up TCE page.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/kernel/iommu.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a10642a..b378f78 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -38,6 +38,7 @@
 #include <linux/pci.h>
 #include <linux/iommu.h>
 #include <linux/sched.h>
+#include <linux/hugetlb.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/iommu.h>
@@ -1059,16 +1060,37 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, 
unsigned long entry,
			tce, entry << tbl->it_page_shift, ret); */
return -EFAULT;
}
+
+   /*
+* Check that the TCE table granularity is not bigger than the size of
+* a page we just found. Otherwise the hardware can get access to
+	 * a bigger memory chunk than it should.
+*/
+   if (PageHuge(page)) {
+   struct page *head = compound_head(page);
+   long shift = PAGE_SHIFT + compound_order(head);
+
+		if (shift < tbl->it_page_shift) {
+   ret = -EINVAL;
+   goto put_page_exit;
+   }
+
+   }
+
hwaddr = (unsigned long) page_address(page) + offset;
 
ret = iommu_tce_build(tbl, entry, hwaddr, direction);
if (ret)
-   put_page(page);
+   goto put_page_exit;
 
-	if (ret < 0)
-		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+   return 0;
+
+put_page_exit:
+	pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
		__func__, entry << tbl->it_page_shift, tce, ret);
 
+   put_page(page);
+
return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
-- 
2.0.0



[PATCH 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows

2014-08-29 Thread Alexey Kardashevskiy
This enables PAPR defined feature called Dynamic DMA windows (DDW).

Each Partitionable Endpoint (IOMMU group) has a separate DMA window on
a PCI bus where devices are allowed to perform DMA. By default a 1GB or
2GB window is allocated at host boot time, and these windows are
used when an IOMMU group is passed to the userspace (guest). These windows
are mapped at zero offset on a PCI bus.

Hi-speed devices may suffer from the limited size of this window. On the host
side a TCE bypass mode is enabled on the POWER8 CPU, which implements
direct mapping of the host memory to a PCI bus at 1<<59.

For the guest, PAPR defines a DDW RTAS API which allows the pseries guest
to query the hypervisor if it supports DDW and what are the parameters
of possible windows.

Currently POWER8 supports 2 DMA windows per PE: the already mentioned and used
small 32bit window, and a 64bit window which can only start at 1<<59 and
can support various page sizes.

This patchset reworks PPC IOMMU code and adds necessary structures
to extend it to support big windows.

When the guest detects the feature and the PE is capable of 64bit DMA,
it does the following:
1. queries the hypervisor about the number of available windows and page masks;
2. creates a window with the biggest possible page size (current guests can do
64K or 16MB TCEs);
3. maps the entire guest RAM via H_PUT_TCE* hypercalls;
4. switches dma_ops to direct_dma_ops on the selected PE.

Once this is done, H_PUT_TCE is not called anymore and the guest gets
maximum performance.


Please comment. Thanks!


Alexey Kardashevskiy (13):
  powerpc/iommu: Check that TCE page size is equal to it_page_size
  powerpc/powernv: Make invalidate() a callback
  powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
  powerpc/powernv: Convert/move set_bypass() callback to
take_ownership()
  powerpc/iommu: Fix IOMMU ownership control functions
  powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
  powerpc/powernv: Do not set read flag if direction==DMA_NONE
  powerpc/powernv: Release replaced TCE
  powerpc/pseries/lpar: Enable VFIO
  powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA
  vfio: powerpc/spapr: Move locked_vm accounting to helpers
  vfio: powerpc/spapr: Use it_page_size
  vfio: powerpc/spapr: Enable Dynamic DMA windows

 arch/powerpc/include/asm/iommu.h|  35 ++-
 arch/powerpc/include/asm/machdep.h  |  25 --
 arch/powerpc/include/asm/tce.h  |  37 +++
 arch/powerpc/kernel/iommu.c | 213 +--
 arch/powerpc/kernel/vio.c   |   5 +-
 arch/powerpc/platforms/cell/iommu.c |   9 +-
 arch/powerpc/platforms/pasemi/iommu.c   |   8 +-
 arch/powerpc/platforms/powernv/pci-ioda.c   | 233 +++--
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   4 +-
 arch/powerpc/platforms/powernv/pci.c| 113 +---
 arch/powerpc/platforms/powernv/pci.h|  15 +-
 arch/powerpc/platforms/pseries/iommu.c  |  77 --
 arch/powerpc/sysdev/dart_iommu.c|  13 +-
 drivers/vfio/vfio_iommu_spapr_tce.c | 384 +++-
 include/uapi/linux/vfio.h   |  25 +-
 15 files changed, 925 insertions(+), 271 deletions(-)

-- 
2.0.0



Re: [PATCH v2 1/4] kvmtool: ARM: Use KVM_ARM_PREFERRED_TARGET vm ioctl to determine target cpu

2014-08-29 Thread Reinhard Moselbach
Hi Anup,

On 26/08/14 10:22, Anup Patel wrote:
 Instead, of trying out each and every target type we should
 use KVM_ARM_PREFERRED_TARGET vm ioctl to determine target type
 for KVM ARM/ARM64.
 
 If KVM_ARM_PREFERRED_TARGET vm ioctl fails then we fallback to
 old method of trying all known target types.

So as the algorithm currently works, it does not give us much
improvement over the current behaviour. We still need to list each
supported MPIDR both in kvmtool and in the kernel.
Looking more closely at the code, beside the target id we only need the
kvm_target_arm[] list for the compatible string and the init() function.
The latter is (currently) the same for all supported type, so we could
use that as a standard fallback function.
The compatible string seems to be completely ignored by the ARM64
kernel, so we could as well pass arm,armv8 all the time.
In ARM(32) kernels we seem to not make any real use of it for CPUs which
we care for (with virtualisation extensions).

So what about the following:
We keep the list as it is, but do not extend it for future CPUs, except
those in need of a special compatible string or a specific init
function. Instead we rely on PREFERRED_TARGET for all current and
upcoming CPUs (meaning unsupported CPUs must use a 3.12 kernel or higher).
If PREFERRED_TARGET works, we scan the list anyway (to find CPUs needing
special treatment), but on failure of finding something in the list we
just go ahead:
- with the target ID the kernel returned,
- an arm,armv8 compatible string (for arm64, not sure about arm) and
- call the standard kvmtool init function

This should relieve us of the burden of adding each supported CPU to
kvmtool.

Does that make sense, or am I missing something?
I will hack something up to prove that it works.

Also there is now a race on big.LITTLE systems: if the PREFERRED_TARGET
ioctl is executed on one cluster, while the KVM_ARM_VCPU_INIT call is
done on another core with a different MPIDR, then the kernel will refuse
to init the CPU. I don't know of a good solution for this (except the
sledgehammer pinning with sched_setaffinity to the current core, which
is racy, too, but should at least work somehow ;-)
Any ideas?

 Signed-off-by: Pranavkumar Sawargaonkar pranavku...@linaro.org
 Signed-off-by: Anup Patel anup.pa...@linaro.org
 ---
  tools/kvm/arm/kvm-cpu.c |   46 +++---
  1 file changed, 35 insertions(+), 11 deletions(-)
 
 diff --git a/tools/kvm/arm/kvm-cpu.c b/tools/kvm/arm/kvm-cpu.c
 index aeaa4cf..c010e9c 100644
 --- a/tools/kvm/arm/kvm-cpu.c
 +++ b/tools/kvm/arm/kvm-cpu.c
 @@ -34,6 +34,7 @@ struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, 
 unsigned long cpu_id)
   struct kvm_cpu *vcpu;
   int coalesced_offset, mmap_size, err = -1;
   unsigned int i;
 + struct kvm_vcpu_init preferred_init;
   struct kvm_vcpu_init vcpu_init = {
   .features = ARM_VCPU_FEATURE_FLAGS(kvm, cpu_id)
   };
 @@ -55,20 +56,42 @@ struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, 
 unsigned long cpu_id)
  	if (vcpu->kvm_run == MAP_FAILED)
  		die("unable to mmap vcpu fd");
  
 - /* Find an appropriate target CPU type. */
  -	for (i = 0; i < ARRAY_SIZE(kvm_arm_targets); ++i) {
 - if (!kvm_arm_targets[i])
 - continue;
 - target = kvm_arm_targets[i];
 - vcpu_init.target = target-id;
 + /*
 +  * If preferred target ioctl successful then use preferred target
 +  * else try each and every target type.
 +  */
  +	err = ioctl(kvm->vm_fd, KVM_ARM_PREFERRED_TARGET, &preferred_init);
 + if (!err) {
 + /* Match preferred target CPU type. */
 + target = NULL;
  +		for (i = 0; i < ARRAY_SIZE(kvm_arm_targets); ++i) {
 + if (!kvm_arm_targets[i])
 + continue;
  +			if (kvm_arm_targets[i]->id == preferred_init.target) {
 + target = kvm_arm_targets[i];
 + break;
 + }
 + }
 +
 + vcpu_init.target = preferred_init.target;
  		err = ioctl(vcpu->vcpu_fd, KVM_ARM_VCPU_INIT, &vcpu_init);
 - if (!err)
 - break;
  +		if (err || target->init(vcpu))
  +			die("Unable to initialise vcpu for preferred target");

So that segfaults if the CPU is not in kvmtools list (as target is still
NULL). In the current implementation we should bail out (but better use
the algorithm described above).

Also these two line can be moved outside of the loop and joined with the
last two lines from the else clause ...

 + } else {
 + /* Find an appropriate target CPU type. */
  +		for (i = 0; i < ARRAY_SIZE(kvm_arm_targets); ++i) {
 + if (!kvm_arm_targets[i])
 + continue;
 + target = kvm_arm_targets[i];
 +

[PATCH] KVM: vmx: VMXOFF emulation in vm86 should cause #UD

2014-08-29 Thread Nadav Amit
Unlike VMCALL, the instructions VMXOFF, VMLAUNCH and VMRESUME should cause a #UD
exception in real-mode or vm86.  However, the emulator considers all these
instructions the same for the matter of mode checks, and emulation upon exit
due to a #UD exception.

As a result, the hypervisor behaves incorrectly in vm86 mode: VMXOFF, VMLAUNCH
or VMRESUME in vm86 cause an exit due to #UD. The hypervisor then emulates these
instructions and injects #GP into the guest instead of #UD.

This patch creates a new group for these instructions and marks only VMCALL as
an instruction which can be emulated.

Signed-off-by: Nadav Amit na...@cs.technion.ac.il
---
 arch/x86/kvm/emulate.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index e5bf130..a240fac 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3139,12 +3139,8 @@ static int em_clts(struct x86_emulate_ctxt *ctxt)
 
 static int em_vmcall(struct x86_emulate_ctxt *ctxt)
 {
-   int rc;
-
-	if (ctxt->modrm_mod != 3 || ctxt->modrm_rm != 1)
-		return X86EMUL_UNHANDLEABLE;
+	int rc = ctxt->ops->fix_hypercall(ctxt);
 
-   rc = ctxt-ops-fix_hypercall(ctxt);
if (rc != X86EMUL_CONTINUE)
return rc;
 
@@ -3562,6 +3558,12 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
	F2bv(((_f) | DstReg | SrcMem | ModRM) & ~Lock, _e), \
	F2bv(((_f) & ~Lock) | DstAcc | SrcImm, _e)
 
+static const struct opcode group7_rm0[] = {
+   N,
+   I(SrcNone | Priv | EmulateOnUD, em_vmcall),
+   N, N, N, N, N, N,
+};
+
 static const struct opcode group7_rm1[] = {
DI(SrcNone | Priv, monitor),
DI(SrcNone | Priv, mwait),
@@ -3655,7 +3657,7 @@ static const struct group_dual group7 = { {
II(SrcMem16 | Mov | Priv,   em_lmsw, lmsw),
II(SrcMem | ByteOp | Priv | NoAccess,   em_invlpg, invlpg),
 }, {
-   I(SrcNone | Priv | EmulateOnUD, em_vmcall),
+   EXT(0, group7_rm0),
EXT(0, group7_rm1),
N, EXT(0, group7_rm3),
II(SrcNone | DstMem | Mov,  em_smsw, smsw), N,
-- 
1.9.1



Re: [PATCH] KVM: vmx: VMXOFF emulation in vm86 should cause #UD

2014-08-29 Thread Paolo Bonzini
Il 29/08/2014 10:26, Nadav Amit ha scritto:
 Unlike VMCALL, the instructions VMXOFF, VMLAUNCH and VMRESUME should cause a 
 UD
 exception in real-mode or vm86.  However, the emulator considers all these
 instructions the same for the matter of mode checks, and emulation upon exit
 due to #UD exception.
 
 As a result, the hypervisor behaves incorrectly on vm86 mode. VMXOFF, VMLAUNCH
 or VMRESUME cause on vm86 exit due to #UD. The hypervisor then emulates these
 instruction and inject #GP to the guest instead of #UD.
 
 This patch creates a new group for these instructions and mark only VMCALL as
 an instruction which can be emulated.
 
 Signed-off-by: Nadav Amit na...@cs.technion.ac.il

Patch looks good, but where is the check that MOD == 3 in the case
RMExt?  Am I just not seeing it?

Paolo



Re: [PATCH] KVM: vmx: VMXOFF emulation in vm86 should cause #UD

2014-08-29 Thread Nadav Amit

On Aug 29, 2014, at 11:36 AM, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 29/08/2014 10:26, Nadav Amit ha scritto:
  Unlike VMCALL, the instructions VMXOFF, VMLAUNCH and VMRESUME should cause a
  #UD exception in real-mode or vm86.  However, the emulator considers all these
  instructions the same for the matter of mode checks, and emulation upon exit
  due to #UD exception.
  
  As a result, the hypervisor behaves incorrectly in vm86 mode. VMXOFF, VMLAUNCH
  or VMRESUME cause a vm86 exit due to #UD. The hypervisor then emulates these
  instructions and injects #GP to the guest instead of #UD.
  
  This patch creates a new group for these instructions and marks only VMCALL as
  an instruction which can be emulated.
 
 Signed-off-by: Nadav Amit na...@cs.technion.ac.il
 
 Patch looks good, but where is the check that MOD == 3 in the case
 RMExt?  Am I just not seeing it?
 
This seems to be part of the “case GroupDual”.


 ---
 arch/x86/kvm/emulate.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)
 
 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
 index e5bf130..a240fac 100644
 --- a/arch/x86/kvm/emulate.c
 +++ b/arch/x86/kvm/emulate.c
 @@ -3139,12 +3139,8 @@ static int em_clts(struct x86_emulate_ctxt *ctxt)
 
 static int em_vmcall(struct x86_emulate_ctxt *ctxt)
 {
  -int rc;
  -
  -if (ctxt->modrm_mod != 3 || ctxt->modrm_rm != 1)
  -return X86EMUL_UNHANDLEABLE;
  +int rc = ctxt->ops->fix_hypercall(ctxt);
  
  -rc = ctxt->ops->fix_hypercall(ctxt);
   if (rc != X86EMUL_CONTINUE)
   return rc;
  
  @@ -3562,6 +3558,12 @@ static int check_perm_out(struct x86_emulate_ctxt 
  *ctxt)
   F2bv(((_f) | DstReg | SrcMem | ModRM) & ~Lock, _e), \
   F2bv(((_f) & ~Lock) | DstAcc | SrcImm, _e)
 
 +static const struct opcode group7_rm0[] = {
 +N,
 +I(SrcNone | Priv | EmulateOnUD, em_vmcall),
 +N, N, N, N, N, N,
 +};
 +
 static const struct opcode group7_rm1[] = {
  DI(SrcNone | Priv, monitor),
  DI(SrcNone | Priv, mwait),
 @@ -3655,7 +3657,7 @@ static const struct group_dual group7 = { {
  II(SrcMem16 | Mov | Priv,   em_lmsw, lmsw),
  II(SrcMem | ByteOp | Priv | NoAccess,   em_invlpg, invlpg),
 }, {
 -I(SrcNone | Priv | EmulateOnUD, em_vmcall),
 +EXT(0, group7_rm0),
  EXT(0, group7_rm1),
  N, EXT(0, group7_rm3),
  II(SrcNone | DstMem | Mov,  em_smsw, smsw), N,
 
 



Re: [PATCH] KVM: vmx: VMXOFF emulation in vm86 should cause #UD

2014-08-29 Thread Paolo Bonzini
Il 29/08/2014 10:52, Nadav Amit ha scritto:
  Patch looks good, but where is the check that MOD == 3 in the case
  RMExt?  Am I just not seeing it?
 
 This seems to be part of the “case GroupDual”.

GroupDual handles it, but the EXT() macro you're using is exactly what 
you want:

#define RMExt   (4<<15) /* Opcode extension in ModRM r/m if mod == 3 */

I guess what's missing is

-- 8 --
Subject: [PATCH] Check ModRM for RMExt

Some group7 extensions are only defined for mod==3.  Check this and
reject emulation if mod!=3.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 56657b0bb3bb..d472e4d50e3c 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4360,6 +4360,8 @@ done_prefixes:
	opcode = opcode.u.gdual->mod012[goffset];
	break;
	case RMExt:
+	if ((ctxt->modrm >> 6) != 3)
+	return EMULATION_FAILED;
	goffset = ctxt->modrm & 7;
opcode = opcode.u.group[goffset];
break;

What do you think?

Paolo


Re: [PATCH v2 1/4] kvmtool: ARM: Use KVM_ARM_PREFERRED_TARGET vm ioctl to determine target cpu

2014-08-29 Thread Andre Przywara
(resent, that was the wrong account before ...)

Hi Anup,

On 26/08/14 10:22, Anup Patel wrote:
 Instead, of trying out each and every target type we should
 use KVM_ARM_PREFERRED_TARGET vm ioctl to determine target type
 for KVM ARM/ARM64.
 
 If KVM_ARM_PREFERRED_TARGET vm ioctl fails then we fallback to
 old method of trying all known target types.

So as the algorithm currently works, it does not give us much
improvement over the current behaviour. We still need to list each
supported MPIDR both in kvmtool and in the kernel.
Looking more closely at the code, beside the target id we only need the
kvm_target_arm[] list for the compatible string and the init() function.
The latter is (currently) the same for all supported type, so we could
use that as a standard fallback function.
The compatible string seems to be completely ignored by the ARM64
kernel, so we could as well pass arm,armv8 all the time.
In ARM(32) kernels we seem to not make any real use of it for CPUs which
we care for (with virtualisation extensions).

So what about the following:
We keep the list as it is, but do not extend it for future CPUs, except
those in need of a special compatible string or a specific init
function. Instead we rely on PREFERRED_TARGET for all current and
upcoming CPUs (meaning unsupported CPUs must use a 3.12 kernel or higher).
If PREFERRED_TARGET works, we scan the list anyway (to find CPUs needing
special treatment), but on failure of finding something in the list we
just go ahead:
- with the target ID the kernel returned,
- an arm,armv8 compatible string (for arm64, not sure about arm) and
- call the standard kvmtool init function

This should relief us from the burden of adding each supported CPU to
kvmtool.

Does that make sense or am I missing something?
I will hack something up to prove that it works.

Also there is now a race on big.LITTLE systems: if the PREFERRED_TARGET
ioctl is executed on one cluster, while the KVM_ARM_VCPU_INIT call is
done on another core with a different MPIDR, then the kernel will refuse
to init the CPU. I don't know of a good solution for this (except the
sledgehammer pinning with sched_setaffinity to the current core, which
is racy, too, but should at least work somehow ;-)
Any ideas?

 Signed-off-by: Pranavkumar Sawargaonkar pranavku...@linaro.org
 Signed-off-by: Anup Patel anup.pa...@linaro.org
 ---
  tools/kvm/arm/kvm-cpu.c |   46 +++---
  1 file changed, 35 insertions(+), 11 deletions(-)
 
 diff --git a/tools/kvm/arm/kvm-cpu.c b/tools/kvm/arm/kvm-cpu.c
 index aeaa4cf..c010e9c 100644
 --- a/tools/kvm/arm/kvm-cpu.c
 +++ b/tools/kvm/arm/kvm-cpu.c
 @@ -34,6 +34,7 @@ struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, 
 unsigned long cpu_id)
   struct kvm_cpu *vcpu;
   int coalesced_offset, mmap_size, err = -1;
   unsigned int i;
 + struct kvm_vcpu_init preferred_init;
   struct kvm_vcpu_init vcpu_init = {
   .features = ARM_VCPU_FEATURE_FLAGS(kvm, cpu_id)
   };
 @@ -55,20 +56,42 @@ struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, 
 unsigned long cpu_id)
 	if (vcpu->kvm_run == MAP_FAILED)
 	die("unable to mmap vcpu fd");
  
 - /* Find an appropriate target CPU type. */
 - for (i = 0; i < ARRAY_SIZE(kvm_arm_targets); ++i) {
 - if (!kvm_arm_targets[i])
 - continue;
 - target = kvm_arm_targets[i];
 - vcpu_init.target = target->id;
 + /*
 +  * If preferred target ioctl successful then use preferred target
 +  * else try each and every target type.
 +  */
 + err = ioctl(kvm->vm_fd, KVM_ARM_PREFERRED_TARGET, &preferred_init);
 + if (!err) {
 + /* Match preferred target CPU type. */
 + target = NULL;
 + for (i = 0; i < ARRAY_SIZE(kvm_arm_targets); ++i) {
 + if (!kvm_arm_targets[i])
 + continue;
 + if (kvm_arm_targets[i]->id == preferred_init.target) {
 + target = kvm_arm_targets[i];
 + break;
 + }
 + }
 +
 + vcpu_init.target = preferred_init.target;
   err = ioctl(vcpu->vcpu_fd, KVM_ARM_VCPU_INIT, &vcpu_init);
 - if (!err)
 - break;
 + if (err || target->init(vcpu))
 + die("Unable to initialise vcpu for preferred target");

So that segfaults if the CPU is not in kvmtool's list (as target is still
NULL). In the current implementation we should bail out (but better use
the algorithm described above).

Also these two line can be moved outside of the loop and joined with the
last two lines from the else clause ...

 + } else {
 + /* Find an appropriate target CPU type. */
 + for (i = 0; i < ARRAY_SIZE(kvm_arm_targets); ++i) {
 + if (!kvm_arm_targets[i])
 + continue;
 +   

Re: [PATCH v2 4/4] kvmtool: ARM/ARM64: Provide PSCI-0.2 to guest when KVM supports it

2014-08-29 Thread Andre Przywara
Hi Anup,

On 26/08/14 10:22, Anup Patel wrote:
 If in-kernel KVM supports PSCI-0.2 emulation then we should set the
 KVM_ARM_VCPU_PSCI_0_2 feature for each guest VCPU and also
 provide "arm,psci-0.2","arm,psci" as the PSCI compatible string.
 
 This patch updates kvm_cpu__arch_init() and setup_fdt() as
 per above.
 
 Signed-off-by: Pranavkumar Sawargaonkar pranavku...@linaro.org
 Signed-off-by: Anup Patel anup.pa...@linaro.org
 ---
  tools/kvm/arm/fdt.c |   39 +--
  tools/kvm/arm/kvm-cpu.c |5 +
  2 files changed, 38 insertions(+), 6 deletions(-)
 
 diff --git a/tools/kvm/arm/fdt.c b/tools/kvm/arm/fdt.c
 index 186a718..93849cf2 100644
 --- a/tools/kvm/arm/fdt.c
 +++ b/tools/kvm/arm/fdt.c
 @@ -13,6 +13,7 @@
  #include <linux/byteorder.h>
  #include <linux/kernel.h>
  #include <linux/sizes.h>
 +#include <linux/psci.h>
  
  static char kern_cmdline[COMMAND_LINE_SIZE];
  
 @@ -162,12 +163,38 @@ static int setup_fdt(struct kvm *kvm)
  
 	/* PSCI firmware */
 	_FDT(fdt_begin_node(fdt, "psci"));
 -	_FDT(fdt_property_string(fdt, "compatible", "arm,psci"));
 -	_FDT(fdt_property_string(fdt, "method", "hvc"));
 -	_FDT(fdt_property_cell(fdt, "cpu_suspend", KVM_PSCI_FN_CPU_SUSPEND));
 -	_FDT(fdt_property_cell(fdt, "cpu_off", KVM_PSCI_FN_CPU_OFF));
 -	_FDT(fdt_property_cell(fdt, "cpu_on", KVM_PSCI_FN_CPU_ON));
 -	_FDT(fdt_property_cell(fdt, "migrate", KVM_PSCI_FN_MIGRATE));
 +	if (kvm__supports_extension(kvm, KVM_CAP_ARM_PSCI_0_2)) {
 +		const char compatible[] = "arm,psci-0.2\0arm,psci";
 +		_FDT(fdt_property(fdt, "compatible",
 +				  compatible, sizeof(compatible)));
 +		_FDT(fdt_property_string(fdt, "method", "hvc"));
 +		if (kvm->cfg.arch.aarch32_guest) {
 +			_FDT(fdt_property_cell(fdt, "cpu_suspend",
 +					PSCI_0_2_FN_CPU_SUSPEND));
 +			_FDT(fdt_property_cell(fdt, "cpu_off",
 +					PSCI_0_2_FN_CPU_OFF));
 +			_FDT(fdt_property_cell(fdt, "cpu_on",
 +					PSCI_0_2_FN_CPU_ON));
 +			_FDT(fdt_property_cell(fdt, "migrate",
 +					PSCI_0_2_FN_MIGRATE));
 +		} else {
 +			_FDT(fdt_property_cell(fdt, "cpu_suspend",
 +					PSCI_0_2_FN64_CPU_SUSPEND));
 +			_FDT(fdt_property_cell(fdt, "cpu_off",
 +					PSCI_0_2_FN_CPU_OFF));
 +			_FDT(fdt_property_cell(fdt, "cpu_on",
 +					PSCI_0_2_FN64_CPU_ON));
 +			_FDT(fdt_property_cell(fdt, "migrate",
 +					PSCI_0_2_FN64_MIGRATE));
 +		}
 +	} else {
 +		_FDT(fdt_property_string(fdt, "compatible", "arm,psci"));
 +		_FDT(fdt_property_string(fdt, "method", "hvc"));
 +		_FDT(fdt_property_cell(fdt, "cpu_suspend",
 +				KVM_PSCI_FN_CPU_SUSPEND));
 +		_FDT(fdt_property_cell(fdt, "cpu_off", KVM_PSCI_FN_CPU_OFF));
 +		_FDT(fdt_property_cell(fdt, "cpu_on", KVM_PSCI_FN_CPU_ON));
 +		_FDT(fdt_property_cell(fdt, "migrate", KVM_PSCI_FN_MIGRATE));
 +	}

I guess this could be simplified much by defining three arrays with the
respective function IDs and setting a pointer to the right one here.
Then there would still be only one set of _FDT() calls, which reference
this pointer. Like:
uint32_t *psci_fn_ids;
...
if (KVM_CAP_ARM_PSCI_0_2) {
if (aarch32_guest)
psci_fn_ids = psci_0_2_fn_ids;
else
psci_fn_ids = psci_0_2_fn64_ids;
} else
psci_fn_ids = psci_0_1_fn_ids;
_FDT(fdt_property_cell(fdt, "cpu_suspend", psci_fn_ids[0]));
_FDT(fdt_property_cell(fdt, "cpu_off", psci_fn_ids[1]));
...

Also I wonder if we actually need those different IDs. The binding doc
says that Linux' PSCI 0.2 code ignores them altogether, only using them
if the arm,psci branch of the compatible string is actually used (on
kernels not supporting PSCI 0.2)
So can't we just always pass the PSCI 0.1 numbers in here?
That would restrict this whole patch to just changing the compatible
string, right?

Regards,
Andre.

   _FDT(fdt_end_node(fdt));
  
   /* Finalise. */
 diff --git a/tools/kvm/arm/kvm-cpu.c b/tools/kvm/arm/kvm-cpu.c
 index c010e9c..0637e9a 100644
 --- a/tools/kvm/arm/kvm-cpu.c
 +++ b/tools/kvm/arm/kvm-cpu.c
 @@ -56,6 +56,11 @@ struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, 
 unsigned long cpu_id)
 	if (vcpu->kvm_run == MAP_FAILED)
 	die("unable to mmap vcpu fd");
  
 +	/* Set KVM_ARM_VCPU_PSCI_0_2 if available */
 +	if (kvm__supports_extension(kvm, KVM_CAP_ARM_PSCI_0_2)) {
 +		vcpu_init.features[0] |= (1UL << KVM_ARM_VCPU_PSCI_0_2);
 + }
 +
   /*
* If preferred 

Re: [PATCH] KVM: vmx: VMXOFF emulation in vm86 should cause #UD

2014-08-29 Thread Nadav Amit

On Aug 29, 2014, at 11:57 AM, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 29/08/2014 10:52, Nadav Amit ha scritto:
 Patch looks good, but where is the check that MOD == 3 in the case
 RMExt?  Am I just not seeing it?
 
 This seems to be part of the “case GroupDual”.
 
 GroupDual handles it, but the EXT() macro you're using is exactly what 
 you want:
 
  #define RMExt   (4<<15) /* Opcode extension in ModRM r/m if mod == 3 */
 
 I guess what's missing is
 
 -- 8 --
 Subject: [PATCH] Check ModRM for RMExt
 
 Some group7 extensions are only defined for mod==3.  Check this and
 reject emulation if mod!=3.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 
 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
 index 56657b0bb3bb..d472e4d50e3c 100644
 --- a/arch/x86/kvm/emulate.c
 +++ b/arch/x86/kvm/emulate.c
 @@ -4360,6 +4360,8 @@ done_prefixes:
    opcode = opcode.u.gdual->mod012[goffset];
    break;
    case RMExt:
  +	if ((ctxt->modrm >> 6) != 3)
  +	return EMULATION_FAILED;
    goffset = ctxt->modrm & 7;
   opcode = opcode.u.group[goffset];
   break;
 
 What do you think?

I don’t know. I am looking at Intel SDM table A-6 (Opcode Extensions for One- 
and Two-byte Opcodes by Group Number).
According to the table, only group 7 needs RMExt, and in this case the “case 
GroupDual” makes the required checks, in the iteration prior to the “case 
RMExt”.
Therefore this code path, RMExt without GroupDual before it, should never 
occur. Nonetheless, if you want to avoid future bugs, perhaps it is good.

Nadav




Re: [PATCH] KVM: vmx: VMXOFF emulation in vm86 should cause #UD

2014-08-29 Thread Paolo Bonzini
Il 29/08/2014 11:12, Nadav Amit ha scritto:
 I don’t know. I am looking at Intel SDM table A-6 (Opcode Extensions
 for One- and Two-byte Opcodes by Group Number). According to the
 table, only group 7 needs RMExt, and in this case the “case
 GroupDual” makes the required checks, in the iteration prior to the
 “case RMExt”. Therefore this code path, RMExt without GroupDual
 before it, should never occur. Nonetheless, if you want to avoid
 future bugs, perhaps it is good.

Oh, now I understand what you mean.  Thanks,

Paolo


[GIT PULL 1/2] ARM/ARM64: KVM: Nuke Hyp-mode tlbs before enabling MMU

2014-08-29 Thread Christoffer Dall
From: Pranavkumar Sawargaonkar pranavku...@linaro.org

X-Gene u-boot runs in EL2 mode with MMU enabled, hence we might
have stale EL2 TLB entries when we enable the EL2 MMU on each host CPU.

This can happen on any ARM/ARM64 board running bootloader in
Hyp-mode (or EL2-mode) with MMU enabled.

This patch ensures that we flush all Hyp-mode (or EL2-mode) TLBs
on each host CPU before enabling Hyp-mode (or EL2-mode) MMU.

Cc: sta...@vger.kernel.org
Tested-by: Mark Rutland mark.rutl...@arm.com
Reviewed-by: Marc Zyngier marc.zyng...@arm.com
Signed-off-by: Pranavkumar Sawargaonkar pranavku...@linaro.org
Signed-off-by: Anup Patel anup.pa...@linaro.org
Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
---
 arch/arm/kvm/init.S   | 4 
 arch/arm64/kvm/hyp-init.S | 4 
 2 files changed, 8 insertions(+)

diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
index 991415d..3988e72 100644
--- a/arch/arm/kvm/init.S
+++ b/arch/arm/kvm/init.S
@@ -99,6 +99,10 @@ __do_hyp_init:
mrc p15, 0, r0, c10, c2, 1
mcr p15, 4, r0, c10, c2, 1
 
+   @ Invalidate the stale TLBs from Bootloader
+   mcr p15, 4, r0, c8, c7, 0   @ TLBIALLH
+   dsb ish
+
@ Set the HSCTLR to:
@  - ARM/THUMB exceptions: Kernel config (Thumb-2 kernel)
@  - Endianness: Kernel config
diff --git a/arch/arm64/kvm/hyp-init.S b/arch/arm64/kvm/hyp-init.S
index d968796..c319116 100644
--- a/arch/arm64/kvm/hyp-init.S
+++ b/arch/arm64/kvm/hyp-init.S
@@ -80,6 +80,10 @@ __do_hyp_init:
msr mair_el2, x4
isb
 
+   /* Invalidate the stale TLBs from Bootloader */
+   tlbialle2
+   dsb sy
+
mrs x4, sctlr_el2
and x4, x4, #SCTLR_EL2_EE   // preserve endianness of EL2
ldr x5, =SCTLR_EL2_FLAGS
-- 
2.0.0



[GIT PULL 0/2] KVM/ARM Fixes for v3.17-rc3

2014-08-29 Thread Christoffer Dall
Hi Paolo and Gleb,

The following changes since commit 30d1e0e806e5b2fadc297ba78f2d7afd6ba309cf:

  virt/kvm/assigned-dev.c: Set 'dev->irq_source_id' to '-1' after free it 
(2014-08-19 15:12:28 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
tags/kvm-arm-for-v3.17-rc3

for you to fetch changes up to 05e0127f9e362b36aa35f17b1a3d52bca9322a3a:

  arm/arm64: KVM: Complete WFI/WFE instructions (2014-08-29 11:53:53 +0200)

Thanks,
-Christoffer


These fixes fix two issues in KVM for arm/arm64:
 - hyp mode initialization issues on certain boards/bootloader combos.
 - incorrect return address from trapped WFI/WFE instructions, which
   breaks non-Linux guests.


Christoffer Dall (1):
  arm/arm64: KVM: Complete WFI/WFE instructions

Pranavkumar Sawargaonkar (1):
  ARM/ARM64: KVM: Nuke Hyp-mode tlbs before enabling MMU

 arch/arm/kvm/handle_exit.c   | 2 ++
 arch/arm/kvm/init.S  | 4 
 arch/arm64/kvm/handle_exit.c | 2 ++
 arch/arm64/kvm/hyp-init.S| 4 
 4 files changed, 12 insertions(+)


[GIT PULL 2/2] arm/arm64: KVM: Complete WFI/WFE instructions

2014-08-29 Thread Christoffer Dall
The architecture specifies that when the processor wakes up from a WFE
or WFI instruction, the instruction is considered complete, however we
currently return to EL1 (or EL0) at the WFI/WFE instruction itself.

While most guests may not be affected by this because their local
exception handler performs an exception return setting the event bit
or with an interrupt pending, some guests like UEFI will get wedged due
to this little mishap.

Simply skip the instruction when we have completed the emulation.

Cc: sta...@vger.kernel.org
Acked-by: Marc Zyngier marc.zyng...@arm.com
Cc: Ard Biesheuvel ard.biesheu...@linaro.org
Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
---
 arch/arm/kvm/handle_exit.c   | 2 ++
 arch/arm64/kvm/handle_exit.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
index 4c979d4..a96a804 100644
--- a/arch/arm/kvm/handle_exit.c
+++ b/arch/arm/kvm/handle_exit.c
@@ -93,6 +93,8 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
else
kvm_vcpu_block(vcpu);
 
+   kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
+
return 1;
 }
 
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index e28be51..34b8bd0 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -66,6 +66,8 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
else
kvm_vcpu_block(vcpu);
 
+   kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
+
return 1;
 }
 
-- 
2.0.0



Re: [PATCH v5] arm64: fix VTTBR_BADDR_MASK

2014-08-29 Thread Marc Zyngier
On Tue, Aug 26 2014 at  7:35:21 pm BST, Joel Schopp joel.sch...@amd.com wrote:
 diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
 index 5c7aa3c..73f6ff6 100644
 --- a/arch/arm/include/asm/kvm_mmu.h
 +++ b/arch/arm/include/asm/kvm_mmu.h
 @@ -166,6 +166,18 @@ static inline void
 coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,

  void stage2_flush_vm(struct kvm *kvm);

 +static inline int kvm_get_phys_addr_shift(void)
 +{
 +   return KVM_PHYS_SHIFT;
 +}
 +
 +static inline int set_vttbr_baddr_mask(void)
 +{
 +   vttbr_baddr_mask = VTTBR_BADDR_MASK;
 Have you tried compiling this?

 Apart from the obvious missing definition of the variable, I'm not fond
 of functions with side-effects hidden in an include file. What is wrong
 with just returning the mask and letting the common code setting it?
 I like that change, will do in v6.

 +#ifdef CONFIG_ARM64_64K_PAGES
 +static inline int t0sz_to_vttbr_x(int t0sz)
 +{
  +   if (t0sz < 16 || t0sz > 34) {
  +   kvm_err("Cannot support %d-bit address space\n", 64 - t0sz);
  +   return 0;
 0 is definitely a bad value for something that is an error
 case. Consider -EINVAL instead.
 OK.

 Also, what if we're in a range that only deals with more levels of page
 tables than the kernel can deal with (remember we use the kernel page
 table accessors)? See the new ARM64_VA_BITS and ARM64_PGTABLE_LEVELS
 symbols that are now available, and use them to validate the range you
 have.  
 With the simple current tests I can look at them and see they are
 correct, even if I can't make a scenario to test that they would fail. 
 However, if I add in more complicated checks I'd really like to test
 them catching the failure cases.  Can you describe a case where we can
 boot a kernel and then have the checks still fail in kvm? 

You're looking at T0SZ values that are outside of what a given
configuration of the kernel can handle (T0SZ == 16, for example, is only
valid with 3 levels of page tables, and the current implementation only
deals with 2). This is a case where things may explode, as you allow
input addresses that we simply cannot support, and in this case you want
to cap T0SZ to something sensible that is compatible with what the
kernel (and thus KVM) can express in terms of translations.

Testing is another thing, and I don't expect you to be able to test
every possible combination. Nonetheless, I do expect some reasonable
effort in the code to avoid deadly situations.


 +   }
 +   return 37 - t0sz;
 +}
 +#endif
 +static inline int kvm_get_phys_addr_shift(void)
 +{
  +   int pa_range = read_cpuid(ID_AA64MMFR0_EL1) & 0xf;
 +
 +   switch (pa_range) {
 +   case 0: return 32;
 +   case 1: return 36;
 +   case 2: return 40;
 +   case 3: return 42;
 +   case 4: return 44;
 +   case 5: return 48;
 +   default:
 +   BUG();
 +   return 0;
 +   }
 +}
 +
 +static u64 vttbr_baddr_mask;
 Now every compilation unit that includes kvm_mmu.h has an instance of
 this variable. I doubt that it is the intended effect.
 The change for the comment farther above to just return the mask and
 have the common code set it should resolve this as well.


 +
 +/**
 + * set_vttbr_baddr_mask - set mask value for vttbr base address
 + *
 + * In ARMv8, vttbr_baddr_mask cannot be determined in compile time since 
 the
 + * stage2 input address size depends on hardware capability. Thus, we first
 + * need to read ID_AA64MMFR0_EL1.PARange and then set vttbr_baddr_mask with
 + * consideration of both the granule size and the level of
 translation tables.
 + */
 +static inline int set_vttbr_baddr_mask(void)
 +{
 +   int t0sz, vttbr_x;
 +
 +   t0sz = VTCR_EL2_T0SZ(kvm_get_phys_addr_shift());
 +   vttbr_x = t0sz_to_vttbr_x(t0sz);
 +   if (!vttbr_x)
 +   return -EINVAL;
  +   vttbr_baddr_mask = (((1LLU << (48 - vttbr_x)) - 1) << (vttbr_x - 1));
 I think this can now be written as GENMASK_ULL(48, (vttbr_x - 1)).
 That does improve readability, I like it.


 +   return 0;
 +}
 +
  #endif /* __ASSEMBLY__ */
  #endif /* __ARM64_KVM_MMU_H__ */
 diff --git a/arch/arm64/kvm/hyp-init.S b/arch/arm64/kvm/hyp-init.S
 index d968796..c0f7634 100644
 --- a/arch/arm64/kvm/hyp-init.S
 +++ b/arch/arm64/kvm/hyp-init.S
 @@ -63,17 +63,21 @@ __do_hyp_init:
 mrs x4, tcr_el1
 ldr x5, =TCR_EL2_MASK
 and x4, x4, x5
 -   ldr x5, =TCR_EL2_FLAGS
 -   orr x4, x4, x5
 -   msr tcr_el2, x4
 -
 -   ldr x4, =VTCR_EL2_FLAGS
 /*
  * Read the PARange bits from ID_AA64MMFR0_EL1 and set the PS bits 
 in
 -* VTCR_EL2.
 +* TCR_EL2 and both PS bits and T0SZ bits in VTCR_EL2.
  */
 mrs x5, ID_AA64MMFR0_EL1
 bfi x4, x5, #16, #3
 +   msr tcr_el2, x4
 +
 +   ldr x4, =VTCR_EL2_FLAGS
 +   bfi x4, x5, #16, #3
 +   and x5, x5, #0xf
 +   

Re: [GIT PULL 0/2] KVM/ARM Fixes for v3.17-rc3

2014-08-29 Thread Paolo Bonzini
Il 29/08/2014 11:59, Christoffer Dall ha scritto:
 Hi Paolo and Gleb,
 
 The following changes since commit 30d1e0e806e5b2fadc297ba78f2d7afd6ba309cf:
 
  virt/kvm/assigned-dev.c: Set 'dev->irq_source_id' to '-1' after free it 
 (2014-08-19 15:12:28 +0200)
 
 are available in the git repository at:
 
   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
 tags/kvm-arm-for-v3.17-rc3
 
 for you to fetch changes up to 05e0127f9e362b36aa35f17b1a3d52bca9322a3a:
 
   arm/arm64: KVM: Complete WFI/WFE instructions (2014-08-29 11:53:53 +0200)
 
 Thanks,
 -Christoffer
 
 
 These fixes fix two issues in KVM for arm/arm64:
   - hyp mode initialization issues on certain boards/bootloader combos.
   - incorrect return address from trapped WFI/WFE instructions, which
 breaks non-Linux guests.
 
 
 Christoffer Dall (1):
   arm/arm64: KVM: Complete WFI/WFE instructions
 
 Pranavkumar Sawargaonkar (1):
   ARM/ARM64: KVM: Nuke Hyp-mode tlbs before enabling MMU
 
  arch/arm/kvm/handle_exit.c   | 2 ++
  arch/arm/kvm/init.S  | 4 
  arch/arm64/kvm/handle_exit.c | 2 ++
  arch/arm64/kvm/hyp-init.S| 4 
  4 files changed, 12 insertions(+)
 

Thanks, picked this up.

Paolo


[PATCH 3/3] kvm: x86: fix stale mmio cache bug

2014-08-29 Thread Paolo Bonzini
From: David Matlack dmatl...@google.com

The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:

(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.

(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.

(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache. Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.

This patch fixes the issue by using the memslot generation number
to validate the mmio cache.

[ xiaoguangrong: adjust the code to make it simpler for stable-tree fix. ]

Cc: sta...@vger.kernel.org
Signed-off-by: David Matlack dmatl...@google.com
Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.c  |  2 +-
 arch/x86/kvm/x86.h  | 20 +++-
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 031e1dcea5a6..a9dc893dcf1b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -481,6 +481,7 @@ struct kvm_vcpu_arch {
u64 mmio_gva;
unsigned access;
gfn_t mmio_gfn;
+   u64 mmio_gen;
 
struct kvm_pmu pmu;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 96515957ba82..1cd2a5fbde07 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3162,7 +3162,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
		return;
 
-	vcpu_clear_mmio_info(vcpu, ~0ul);
+	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
	kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
		hpa_t root = vcpu->arch.mmu.root_hpa;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 306a1b77581f..985fb2c006fa 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -88,15 +88,23 @@ static inline void vcpu_cache_mmio_info(struct kvm_vcpu 
*vcpu,
	vcpu->arch.mmio_gva = gva & PAGE_MASK;
	vcpu->arch.access = access;
	vcpu->arch.mmio_gfn = gfn;
+	vcpu->arch.mmio_gen = kvm_memslots(vcpu->kvm)->generation;
+}
+
+static inline bool vcpu_match_mmio_gen(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.mmio_gen == kvm_memslots(vcpu->kvm)->generation;
 }
 
 /*
- * Clear the mmio cache info for the given gva,
- * specially, if gva is ~0ul, we clear all mmio cache info.
+ * Clear the mmio cache info for the given gva. If gva is MMIO_GVA_ANY, we
+ * clear all mmio cache info.
  */
+#define MMIO_GVA_ANY (~(gva_t)0)
+
 static inline void vcpu_clear_mmio_info(struct kvm_vcpu *vcpu, gva_t gva)
 {
-	if (gva != (~0ul) && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
+	if (gva != MMIO_GVA_ANY && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
		return;
 
	vcpu->arch.mmio_gva = 0;
@@ -104,7 +112,8 @@ static inline void vcpu_clear_mmio_info(struct kvm_vcpu 
*vcpu, gva_t gva)
 
 static inline bool vcpu_match_mmio_gva(struct kvm_vcpu *vcpu, unsigned long 
gva)
 {
-	if (vcpu->arch.mmio_gva && vcpu->arch.mmio_gva == (gva & PAGE_MASK))
+	if (vcpu_match_mmio_gen(vcpu) && vcpu->arch.mmio_gva &&
+	      vcpu->arch.mmio_gva == (gva & PAGE_MASK))
return true;
 
return false;
@@ -112,7 +121,8 @@ static inline bool vcpu_match_mmio_gva(struct kvm_vcpu 
*vcpu, unsigned long gva)
 
 static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
 {
-	if (vcpu->arch.mmio_gfn && vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT)
+	if (vcpu_match_mmio_gen(vcpu) && vcpu->arch.mmio_gfn &&
+	      vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT)
return true;
 
return false;
-- 
1.8.3.1



[PATCH 1/3] KVM: do not bias the generation number in kvm_current_mmio_generation

2014-08-29 Thread Paolo Bonzini
The next patch will give a meaning (a la seqcount) to the low bit of the
generation number.  Ensure that it matches between kvm-memslots-generation
and kvm_current_mmio_generation().

Cc: sta...@vger.kernel.org
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/kvm/mmu.c  | 7 +--
 virt/kvm/kvm_main.c | 7 +++
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 931467881da7..323c3f5f5c84 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -236,12 +236,7 @@ static unsigned int get_mmio_spte_generation(u64 spte)
 
 static unsigned int kvm_current_mmio_generation(struct kvm *kvm)
 {
-   /*
-* Init kvm generation close to MMIO_MAX_GEN to easily test the
-* code of handling generation number wrap-around.
-*/
-	return (kvm_memslots(kvm)->generation +
-	      MMIO_MAX_GEN - 150) & MMIO_GEN_MASK;
+	return kvm_memslots(kvm)->generation & MMIO_GEN_MASK;
 }
 
 static void mark_mmio_spte(struct kvm *kvm, u64 *sptep, u64 gfn,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7176929a4cda..0bfdb673db26 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -477,6 +477,13 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	kvm->memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
 	if (!kvm->memslots)
 		goto out_err_no_srcu;
+
+   /*
+* Init kvm generation close to the maximum to easily test the
+* code of handling generation number wrap-around.
+*/
+	kvm->memslots->generation = -150;
+
kvm_init_memslots_id(kvm);
 	if (init_srcu_struct(&kvm->srcu))
goto out_err_no_srcu;
-- 
1.8.3.1




[PATCH 2/3] kvm: fix potentially corrupt mmio cache

2014-08-29 Thread Paolo Bonzini
From: David Matlack dmatl...@google.com

vcpu exits and memslot mutations can run concurrently as long as the
vcpu does not acquire the slots mutex. Thus it is theoretically possible
for memslots to change underneath a vcpu that is handling an exit.

If we increment the memslot generation number again after
synchronize_srcu_expedited(), vcpus can safely cache memslot generation
without maintaining a single rcu_dereference through an entire vm exit.
And much of the x86/kvm code does not maintain a single rcu_dereference
of the current memslots during each exit.

We can prevent the following case:

   vcpu (CPU 0) | thread (CPU 1)
+--
1  vm exit  |
2  srcu_read_unlock(&kvm->srcu) |
3  decide to cache something based on   |
 old memslots   |
4   | change memslots
| (increments generation)
5   | synchronize_srcu(&kvm->srcu);
6  retrieve generation # from new memslots  |
7  tag cache with new memslot generation|
8  srcu_read_unlock(&kvm->srcu) |
... |
   action based on cache occurs even   |
though the caching decision was based   |
on the old memslots|
... |
   action *continues* to occur until next  |
memslot generation change, which may|
be never   |
|

By incrementing the generation after synchronizing with kvm->srcu readers,
we ensure that the generation retrieved in (6) will become invalid soon
after (8).

Keeping the existing increment is not strictly necessary, but we
do keep it and just move it for consistency from update_memslots to
install_new_memslots.  It invalidates old cached MMIOs immediately,
instead of having to wait for the end of synchronize_srcu_expedited,
which makes the code more clearly correct in case CPU 1 is preempted
right after synchronize_srcu() returns.

To avoid halving the generation space in SPTEs, always presume that the
low bit of the generation is zero when reconstructing a generation number
out of an SPTE.  This effectively disables MMIO caching in SPTEs during
the call to synchronize_srcu_expedited.  Using the low bit this way is
somewhat like a seqcount---where the protected thing is a cache, and
instead of retrying we can simply punt if we observe the low bit to be 1.

Cc: sta...@vger.kernel.org
Cc: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
Signed-off-by: David Matlack dmatl...@google.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 Documentation/virtual/kvm/mmu.txt | 14 ++
 arch/x86/kvm/mmu.c| 20 
 virt/kvm/kvm_main.c   | 23 ---
 3 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index 290894176142..53838d9c6295 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -425,6 +425,20 @@ fault through the slow path.
 Since only 19 bits are used to store generation-number on mmio spte, all
 pages are zapped when there is an overflow.
 
+Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
+times, the last one happening when the generation number is retrieved and
+stored into the MMIO spte.  Thus, the MMIO spte might be created based on
+out-of-date information, but with an up-to-date generation number.
+
+To avoid this, the generation number is incremented again after synchronize_srcu
+returns; thus, the low bit of kvm_memslots(kvm)->generation is only 1 during a
+memslot update, while some SRCU readers might be using the old copy.  We do not
+want to use an MMIO sptes created with an odd generation number, and we can do
+this without losing a bit in the MMIO spte.  The low bit of the generation
+is not stored in MMIO spte, and presumed zero when it is extracted out of the
+spte.  If KVM is unlucky and creates an MMIO spte while the low bit is 1,
+the next access to the spte will always be a cache miss.
+
 
 Further reading
 ===
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 323c3f5f5c84..96515957ba82 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -199,16 +199,20 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 
 /*
- * spte bits of bit 3 ~ bit 11 are used as low 9 bits of generation number,
- * the bits of bits 52 ~ bit 61 are used as high 10 bits of generation
- * number.
+ * the low bit of the generation number is always presumed to be zero.
+ * This disables mmio caching during memslot updates.  The concept is
+ * similar to a seqcount but instead of retrying 

[PATCH 0/3] fix bugs with stale or corrupt MMIO caches

2014-08-29 Thread Paolo Bonzini
David and Xiao, here's my take on the MMIO generation patches.  Now
with documentation, too. :)  Please review!

David Matlack (2):
  kvm: fix potentially corrupt mmio cache
  kvm: x86: fix stale mmio cache bug

Paolo Bonzini (1):
  KVM: do not bias the generation number in kvm_current_mmio_generation

 Documentation/virtual/kvm/mmu.txt | 14 ++
 arch/x86/include/asm/kvm_host.h   |  1 +
 arch/x86/kvm/mmu.c| 29 ++---
 arch/x86/kvm/x86.h| 20 +++-
 virt/kvm/kvm_main.c   | 30 +++---
 5 files changed, 67 insertions(+), 27 deletions(-)

-- 
1.8.3.1



[PATCH] KVM: forward declare structs in kvm_types.h

2014-08-29 Thread Paolo Bonzini
Opaque KVM structs are useful for prototypes in asm/kvm_host.h, to avoid
'struct foo' declared inside parameter list warnings (and consequent
breakage due to conflicting types).

Move them from individual files to a generic place in linux/kvm_types.h.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/arm/include/asm/kvm_host.h |  7 ++-
 arch/arm64/include/asm/kvm_host.h   |  6 ++
 arch/ia64/include/asm/kvm_host.h|  3 ---
 arch/mips/include/asm/kvm_host.h|  5 -
 arch/powerpc/include/asm/kvm_host.h |  5 -
 arch/s390/include/asm/kvm_host.h|  5 +++--
 arch/x86/include/asm/kvm_host.h |  4 
 include/linux/kvm_types.h   | 11 +++
 8 files changed, 18 insertions(+), 28 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 84291feee9e1..aea259e9431f 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -19,6 +19,8 @@
 #ifndef __ARM_KVM_HOST_H__
 #define __ARM_KVM_HOST_H__
 
+#include <linux/types.h>
+#include <linux/kvm_types.h>
 #include <asm/kvm.h>
 #include <asm/kvm_asm.h>
 #include <asm/kvm_mmio.h>
@@ -40,7 +42,6 @@
 
 #include kvm/arm_vgic.h
 
-struct kvm_vcpu;
 u32 *kvm_vcpu_reg(struct kvm_vcpu *vcpu, u8 reg_num, u32 mode);
 int kvm_target_cpu(void);
 int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
@@ -149,20 +150,17 @@ struct kvm_vcpu_stat {
u32 halt_wakeup;
 };
 
-struct kvm_vcpu_init;
 int kvm_vcpu_set_target(struct kvm_vcpu *vcpu,
const struct kvm_vcpu_init *init);
 int kvm_vcpu_preferred_target(struct kvm_vcpu_init *init);
 unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
-struct kvm_one_reg;
 int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
 int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
 u64 kvm_call_hyp(void *hypfn, ...);
 void force_vm_exit(const cpumask_t *mask);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-struct kvm;
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_unmap_hva_range(struct kvm *kvm,
unsigned long start, unsigned long end);
@@ -187,7 +185,6 @@ struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
 
 int kvm_arm_copy_coproc_indices(struct kvm_vcpu *vcpu, u64 __user *uindices);
 unsigned long kvm_arm_num_coproc_regs(struct kvm_vcpu *vcpu);
-struct kvm_one_reg;
 int kvm_arm_coproc_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *);
 int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *);
 
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 94d8a3c9b644..b5045e3e05f8 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -22,6 +22,8 @@
 #ifndef __ARM64_KVM_HOST_H__
 #define __ARM64_KVM_HOST_H__
 
+#include <linux/types.h>
+#include <linux/kvm_types.h>
 #include <asm/kvm.h>
 #include <asm/kvm_asm.h>
 #include <asm/kvm_mmio.h>
@@ -41,7 +43,6 @@
 
 #define KVM_VCPU_MAX_FEATURES 3
 
-struct kvm_vcpu;
 int kvm_target_cpu(void);
 int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
 int kvm_arch_dev_ioctl_check_extension(long ext);
@@ -164,18 +165,15 @@ struct kvm_vcpu_stat {
u32 halt_wakeup;
 };
 
-struct kvm_vcpu_init;
 int kvm_vcpu_set_target(struct kvm_vcpu *vcpu,
const struct kvm_vcpu_init *init);
 int kvm_vcpu_preferred_target(struct kvm_vcpu_init *init);
 unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
-struct kvm_one_reg;
 int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
 int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-struct kvm;
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_unmap_hva_range(struct kvm *kvm,
unsigned long start, unsigned long end);
diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index 353167d95c66..4729752b7256 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -234,9 +234,6 @@ struct kvm_vm_data {
 #define KVM_REQ_PTC_G  32
 #define KVM_REQ_RESUME 33
 
-struct kvm;
-struct kvm_vcpu;
-
 struct kvm_mmio_req {
uint64_t addr;  /*  physical address*/
uint64_t size;  /*  size in bytes   */
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index b4d900acbdb9..0b24d6622ec1 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -96,11 +96,6 @@
 #define CAUSEB_DC  27
#define CAUSEF_DC  (_ULCAST_(1) << 27)
 
-struct kvm;
-struct kvm_run;
-struct kvm_vcpu;
-struct kvm_interrupt;
-
 extern atomic_t kvm_mips_instance;
 extern pfn_t(*kvm_mips_gfn_to_pfn) (struct kvm *kvm, gfn_t gfn);
 

[ANNOUNCE] Jailhouse 0.1 released

2014-08-29 Thread Jan Kiszka
After its publication about 10 months ago, the Jailhouse partitioning
hypervisor for Linux [1] reached an important first milestone: all major
features required to use Jailhouse on Intel x86 CPUs are now available.
We are marking this point with a first release tag, v0.1.

This release particularly means full exploitation of VT-d DMA and
interrupt remapping to isolate assigned PCI devices from the hypervisor
and foreign cells. Moreover, the usability of Jailhouse was greatly
improved by the introduction and continuous extension of a generator for
system configuration files. Finally, a framework for writing basic cell
applications is available now. With a few lines of C code you can set up
timer interrupts, read clocks or configure PCI devices for the use in
simple bare-metal real-time applications.

The new release can be downloaded from

https://github.com/siemens/jailhouse/archive/v0.1.tar.gz

It's easiest to try out in a virtual environment provided by QEMU/KVM,
see the included README. The braver ones can pick a real compatible
machine and let "jailhouse config create" provide a (generally) working
configuration. Be warned that real hardware tends to require some manual
post-processing of configuration files, for the demo cells or even the
system.
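For reference, a session roughly follows the shape below; the module path,
config file names and exact sub-commands are illustrative only, so consult
the included README for the steps matching this release.

```shell
# Load the Jailhouse driver, then enable the hypervisor with a
# system configuration (name is an example)
sudo insmod jailhouse.ko
sudo jailhouse enable configs/qemu-vm.cell

# On real hardware, generate a starting-point system configuration
# instead, then post-process it by hand as noted above
sudo jailhouse config create sysconfig.c
```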

Check the project homepage at

https://github.com/siemens/jailhouse

for the git repository, links to the mailing list and further
information. Don't hesitate to contact the development community on
questions, problems or suggestions.

There is still a bit of work ahead to reach a version 1.0. In the near
future, we will look into integrating recently published contributions
of new architectures like AMD64 [2] and ARM 32-bit [3]. An inter-cell
communication mechanism will also be merged soon. Several features
particularly important for the use in safety-critical scenarios have
been identified and are being developed now.

Enabling Jailhouse as a certifiable component in safety-related systems
is our primary goal, though we are not excluding other use cases in
telecommunication, high-speed real-time control or scenarios we haven't
even thought of yet.

Last but not least: Many thanks to all who contributed code, reviews,
comments or sponsoring to the project! Your input was already very
valuable for the progress of Jailhouse. Keep it up!

Jan

[1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/116825
[2] http://thread.gmane.org/gmane.linux.jailhouse/601
[3] http://thread.gmane.org/gmane.linux.jailhouse/779

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux



Re: [PATCH] KVM: forward declare structs in kvm_types.h

2014-08-29 Thread Radim Krčmář
2014-08-29 14:01+0200, Paolo Bonzini:
 Opaque KVM structs are useful for prototypes in asm/kvm_host.h, to avoid
 'struct foo' declared inside parameter list warnings (and consequent
 breakage due to conflicting types).
 
 Move them from individual files to a generic place in linux/kvm_types.h.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---

Reviewed-by: Radim Krčmář rkrc...@redhat.com

(Inclusion of linux/types.h seems to be piggybacking, but is a nice
 thing to do :)

And I'd apply these changes:  (Definitely the first hunk.)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 33d8d0a..e098dce 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -140,8 +140,6 @@ static inline bool is_error_page(struct page *page)
 #define KVM_USERSPACE_IRQ_SOURCE_ID0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID   1
 
-struct kvm;
-struct kvm_vcpu;
 extern struct kmem_cache *kvm_vcpu_cache;
 
 extern spinlock_t kvm_lock;
@@ -325,8 +323,6 @@ struct kvm_kernel_irq_routing_entry {
struct hlist_node link;
 };
 
-struct kvm_irq_routing_table;
-
 #ifndef KVM_PRIVATE_MEM_SLOTS
 #define KVM_PRIVATE_MEM_SLOTS 0
 #endif
@@ -1036,8 +1032,6 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
 
 extern bool kvm_rebooting;
 
-struct kvm_device_ops;
-
 struct kvm_device {
struct kvm_device_ops *ops;
struct kvm *kvm;
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 1d6daca..53c4f20 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -19,7 +19,9 @@
 
 struct kvm;
 struct kvm_async_pf;
+struct kvm_device_ops;
 struct kvm_interrupt;
+struct kvm_irq_routing_table;
 struct kvm_memory_slot;
 struct kvm_one_reg;
 struct kvm_run;


[PATCH] KVM: x86: remove Aligned bit from movntps/movntpd

2014-08-29 Thread Paolo Bonzini
These are not explicitly aligned, and do not require alignment on AVX.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/kvm/emulate.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 0892622f9258..20d91873d831 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3688,8 +3688,8 @@ static const struct gprefix pfx_0f_6f_0f_7f = {
I(Mmx, em_mov), I(Sse | Aligned, em_mov), N, I(Sse | Unaligned, em_mov),
 };
 
-static const struct gprefix pfx_vmovntpx = {
-   I(0, em_mov), N, N, N,
+static const struct gprefix pfx_0f_2b = {
+   I(0, em_mov), I(0, em_mov), N, N,
 };
 
 static const struct gprefix pfx_0f_28_0f_29 = {
@@ -3906,7 +3906,7 @@ static const struct opcode twobyte_table[256] = {
N, N, N, N,
GP(ModRM | DstReg | SrcMem | Mov | Sse, pfx_0f_28_0f_29),
GP(ModRM | DstMem | SrcReg | Mov | Sse, pfx_0f_28_0f_29),
-   N, GP(ModRM | DstMem | SrcReg | Sse | Mov | Aligned, pfx_vmovntpx),
+   N, GP(ModRM | DstMem | SrcReg | Mov | Sse, pfx_0f_2b),
N, N, N, N,
/* 0x30 - 0x3F */
II(ImplicitOps | Priv, em_wrmsr, wrmsr),
-- 
1.8.3.1



Re: [PATCH] KVM: x86 emulator: emulate MOVNTDQ

2014-08-29 Thread Paolo Bonzini
Il 11/07/2014 19:56, Alex Williamson ha scritto:
 Windows 8.1 guest with NVIDIA driver and GPU fails to boot with an
 emulation failure.  The KVM spew suggests the fault is with lack of
 movntdq emulation (courtesy of Paolo):
 
 Code=02 00 00 b8 08 00 00 00 f3 0f 6f 44 0a f0 f3 0f 6f 4c 0a e0 66 0f e7 41 f0 66 0f e7 49 e0 48 83 e9 40 f3 0f 6f 44 0a 10 f3 0f 6f 0c 0a 66 0f e7 41 10
 
 $ as -o a.out
 .section .text
 .byte 0x66, 0x0f, 0xe7, 0x41, 0xf0
 .byte 0x66, 0x0f, 0xe7, 0x49, 0xe0
 $ objdump -d a.out
 0:  66 0f e7 41 f0  movntdq %xmm0,-0x10(%rcx)
 5:  66 0f e7 49 e0  movntdq %xmm1,-0x20(%rcx)
 
 Add the necessary emulation.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 Cc: Paolo Bonzini pbonz...@redhat.com
 ---
 
 Hope I got all the flags correct from copying similar MOV ops, but it
 allows the guest to boot, so I suspect it's ok.
 
  arch/x86/kvm/emulate.c |7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)
 
 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
 index e4e833d..ae39f08 100644
 --- a/arch/x86/kvm/emulate.c
 +++ b/arch/x86/kvm/emulate.c
 @@ -3681,6 +3681,10 @@ static const struct gprefix pfx_0f_28_0f_29 = {
   I(Aligned, em_mov), I(Aligned, em_mov), N, N,
  };
  
 +static const struct gprefix pfx_0f_e7 = {
 + N, I(Sse, em_mov), N, N,
 +};
 +
  static const struct escape escape_d9 = { {
   N, N, N, N, N, N, N, I(DstMem, em_fnstcw),
  }, {
 @@ -3951,7 +3955,8 @@ static const struct opcode twobyte_table[256] = {
   /* 0xD0 - 0xDF */
   N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,
   /* 0xE0 - 0xEF */
 - N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,
 + N, N, N, N, N, N, N, GP(SrcReg | DstMem | ModRM | Mov, pfx_0f_e7),
 + N, N, N, N, N, N, N, N,
   /* 0xF0 - 0xFF */
   N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N
  };
 

This slipped through the cracks, I'm applying to kvm/queue now.

Paolo


[PATCH] kvm tools: arm: remove register accessor macros now that they are in uapi

2014-08-29 Thread Will Deacon
The kernel now exposes register accessor macros in the uapi/ headers
for arm and arm64, so use those instead (and avoid the compile failure
from the duplicate definitions).

Signed-off-by: Will Deacon will.dea...@arm.com
---

Pekka -- please take this as a fix, since merging the 3.16 sources has
 caused some breakage for ARM. Cheers!

 tools/kvm/arm/aarch32/kvm-cpu.c | 15 +--
 tools/kvm/arm/aarch64/kvm-cpu.c | 15 ---
 2 files changed, 1 insertion(+), 29 deletions(-)

diff --git a/tools/kvm/arm/aarch32/kvm-cpu.c b/tools/kvm/arm/aarch32/kvm-cpu.c
index 464b473dc936..95fb1da5ba3d 100644
--- a/tools/kvm/arm/aarch32/kvm-cpu.c
+++ b/tools/kvm/arm/aarch32/kvm-cpu.c
@@ -7,25 +7,12 @@
 #define ARM_CORE_REG(x)	(KVM_REG_ARM | KVM_REG_SIZE_U32 | KVM_REG_ARM_CORE | \
 KVM_REG_ARM_CORE_REG(x))
 
-#define ARM_CP15_REG_SHIFT_MASK(x,n)				\
-	(((x) << KVM_REG_ARM_ ## n ## _SHIFT) & KVM_REG_ARM_ ## n ## _MASK)
-
-#define __ARM_CP15_REG(op1,crn,crm,op2)				\
-	(KVM_REG_ARM | KVM_REG_SIZE_U32			|	\
-	 (15 << KVM_REG_ARM_COPROC_SHIFT)		|	\
-	 ARM_CP15_REG_SHIFT_MASK(op1, OPC1)		|	\
-	 ARM_CP15_REG_SHIFT_MASK(crn, 32_CRN)		|	\
-	 ARM_CP15_REG_SHIFT_MASK(crm, CRM)		|	\
-	 ARM_CP15_REG_SHIFT_MASK(op2, 32_OPC2))
-
-#define ARM_CP15_REG(...)	__ARM_CP15_REG(__VA_ARGS__)
-
 unsigned long kvm_cpu__get_vcpu_mpidr(struct kvm_cpu *vcpu)
 {
struct kvm_one_reg reg;
u32 mpidr;
 
-   reg.id = ARM_CP15_REG(ARM_CPU_ID, ARM_CPU_ID_MPIDR);
+   reg.id = ARM_CP15_REG32(ARM_CPU_ID, ARM_CPU_ID_MPIDR);
 	reg.addr = (u64)(unsigned long)&mpidr;
 	if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0)
 		die("KVM_GET_ONE_REG failed (get_mpidr vcpu%ld)", vcpu->cpu_id);
diff --git a/tools/kvm/arm/aarch64/kvm-cpu.c b/tools/kvm/arm/aarch64/kvm-cpu.c
index 71a2a3a7789d..1b293748efd6 100644
--- a/tools/kvm/arm/aarch64/kvm-cpu.c
+++ b/tools/kvm/arm/aarch64/kvm-cpu.c
@@ -15,21 +15,6 @@
 #define ARM64_CORE_REG(x)  (KVM_REG_ARM64 | KVM_REG_SIZE_U64 | \
 KVM_REG_ARM_CORE | KVM_REG_ARM_CORE_REG(x))
 
-#define ARM64_SYS_REG_SHIFT_MASK(x,n)			\
-	(((x) << KVM_REG_ARM64_SYSREG_ ## n ## _SHIFT) &	\
-	 KVM_REG_ARM64_SYSREG_ ## n ## _MASK)
-
-#define __ARM64_SYS_REG(op0,op1,crn,crm,op2)		\
-	(KVM_REG_ARM64 | KVM_REG_SIZE_U64	|	\
-	 KVM_REG_ARM64_SYSREG			|	\
-	 ARM64_SYS_REG_SHIFT_MASK(op0, OP0)	|	\
-	 ARM64_SYS_REG_SHIFT_MASK(op1, OP1)	|	\
-	 ARM64_SYS_REG_SHIFT_MASK(crn, CRN)	|	\
-	 ARM64_SYS_REG_SHIFT_MASK(crm, CRM)	|	\
-	 ARM64_SYS_REG_SHIFT_MASK(op2, OP2))
-
-#define ARM64_SYS_REG(...)	__ARM64_SYS_REG(__VA_ARGS__)
-
 unsigned long kvm_cpu__get_vcpu_mpidr(struct kvm_cpu *vcpu)
 {
struct kvm_one_reg reg;
-- 
2.1.0



Re: [PATCH] KVM: forward declare structs in kvm_types.h

2014-08-29 Thread Christian Borntraeger
On 29/08/14 14:01, Paolo Bonzini wrote:
 Opaque KVM structs are useful for prototypes in asm/kvm_host.h, to avoid
 'struct foo' declared inside parameter list warnings (and consequent
 breakage due to conflicting types).
 
 Move them from individual files to a generic place in linux/kvm_types.h.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com

I can confirm that s390 still builds and works.

 ---
  arch/arm/include/asm/kvm_host.h |  7 ++-
  arch/arm64/include/asm/kvm_host.h   |  6 ++
  arch/ia64/include/asm/kvm_host.h|  3 ---
  arch/mips/include/asm/kvm_host.h|  5 -
  arch/powerpc/include/asm/kvm_host.h |  5 -
  arch/s390/include/asm/kvm_host.h|  5 +++--
  arch/x86/include/asm/kvm_host.h |  4 
  include/linux/kvm_types.h   | 11 +++
  8 files changed, 18 insertions(+), 28 deletions(-)
 
 diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
 index 84291feee9e1..aea259e9431f 100644
 --- a/arch/arm/include/asm/kvm_host.h
 +++ b/arch/arm/include/asm/kvm_host.h
 @@ -19,6 +19,8 @@
  #ifndef __ARM_KVM_HOST_H__
  #define __ARM_KVM_HOST_H__
 
 +#include <linux/types.h>
 +#include <linux/kvm_types.h>
  #include <asm/kvm.h>
  #include <asm/kvm_asm.h>
  #include <asm/kvm_mmio.h>
 @@ -40,7 +42,6 @@
 
  #include kvm/arm_vgic.h
 
 -struct kvm_vcpu;
  u32 *kvm_vcpu_reg(struct kvm_vcpu *vcpu, u8 reg_num, u32 mode);
  int kvm_target_cpu(void);
  int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
 @@ -149,20 +150,17 @@ struct kvm_vcpu_stat {
   u32 halt_wakeup;
  };
 
 -struct kvm_vcpu_init;
  int kvm_vcpu_set_target(struct kvm_vcpu *vcpu,
   const struct kvm_vcpu_init *init);
  int kvm_vcpu_preferred_target(struct kvm_vcpu_init *init);
  unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
  int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 -struct kvm_one_reg;
  int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
  int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
  u64 kvm_call_hyp(void *hypfn, ...);
  void force_vm_exit(const cpumask_t *mask);
 
  #define KVM_ARCH_WANT_MMU_NOTIFIER
 -struct kvm;
  int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
  int kvm_unmap_hva_range(struct kvm *kvm,
   unsigned long start, unsigned long end);
 @@ -187,7 +185,6 @@ struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
 
  int kvm_arm_copy_coproc_indices(struct kvm_vcpu *vcpu, u64 __user *uindices);
  unsigned long kvm_arm_num_coproc_regs(struct kvm_vcpu *vcpu);
 -struct kvm_one_reg;
  int kvm_arm_coproc_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *);
  int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *);
 
 diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
 index 94d8a3c9b644..b5045e3e05f8 100644
 --- a/arch/arm64/include/asm/kvm_host.h
 +++ b/arch/arm64/include/asm/kvm_host.h
 @@ -22,6 +22,8 @@
  #ifndef __ARM64_KVM_HOST_H__
  #define __ARM64_KVM_HOST_H__
 
 +#include <linux/types.h>
 +#include <linux/kvm_types.h>
  #include <asm/kvm.h>
  #include <asm/kvm_asm.h>
  #include <asm/kvm_mmio.h>
 @@ -41,7 +43,6 @@
 
  #define KVM_VCPU_MAX_FEATURES 3
 
 -struct kvm_vcpu;
  int kvm_target_cpu(void);
  int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
  int kvm_arch_dev_ioctl_check_extension(long ext);
 @@ -164,18 +165,15 @@ struct kvm_vcpu_stat {
   u32 halt_wakeup;
  };
 
 -struct kvm_vcpu_init;
  int kvm_vcpu_set_target(struct kvm_vcpu *vcpu,
   const struct kvm_vcpu_init *init);
  int kvm_vcpu_preferred_target(struct kvm_vcpu_init *init);
  unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
  int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 -struct kvm_one_reg;
  int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
  int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
 
  #define KVM_ARCH_WANT_MMU_NOTIFIER
 -struct kvm;
  int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
  int kvm_unmap_hva_range(struct kvm *kvm,
   unsigned long start, unsigned long end);
 diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
 index 353167d95c66..4729752b7256 100644
 --- a/arch/ia64/include/asm/kvm_host.h
 +++ b/arch/ia64/include/asm/kvm_host.h
 @@ -234,9 +234,6 @@ struct kvm_vm_data {
  #define KVM_REQ_PTC_G	32
  #define KVM_REQ_RESUME	33
 
 -struct kvm;
 -struct kvm_vcpu;
 -
  struct kvm_mmio_req {
   uint64_t addr;  /*  physical address*/
   uint64_t size;  /*  size in bytes   */
 diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
 index b4d900acbdb9..0b24d6622ec1 100644
 --- a/arch/mips/include/asm/kvm_host.h
 +++ b/arch/mips/include/asm/kvm_host.h
 @@ -96,11 +96,6 @@
  #define CAUSEB_DC	27
  #define CAUSEF_DC   

Re: [PATCH] KVM: forward declare structs in kvm_types.h

2014-08-29 Thread Paolo Bonzini
Il 29/08/2014 14:46, Radim Krčmář ha scritto:
 2014-08-29 14:01+0200, Paolo Bonzini:
 Opaque KVM structs are useful for prototypes in asm/kvm_host.h, to avoid
 'struct foo' declared inside parameter list warnings (and consequent
 breakage due to conflicting types).

 Move them from individual files to a generic place in linux/kvm_types.h.

 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
 
 Reviewed-by: Radim Krčmář rkrc...@redhat.com
 
 (Inclusion of linux/types.h seems to be piggybacking, but is a nice
  thing to do :)

I just wasn't sure if including linux/kvm_types.h was enough alone.

 And I'd apply these changes:  (Definitely the first hunk.)
 
 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index 33d8d0a..e098dce 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -140,8 +140,6 @@ static inline bool is_error_page(struct page *page)
  #define KVM_USERSPACE_IRQ_SOURCE_ID  0
  #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID 1
  
 -struct kvm;
 -struct kvm_vcpu;
  extern struct kmem_cache *kvm_vcpu_cache;
  
  extern spinlock_t kvm_lock;
 @@ -325,8 +323,6 @@ struct kvm_kernel_irq_routing_entry {
   struct hlist_node link;
  };
  
 -struct kvm_irq_routing_table;
 -
  #ifndef KVM_PRIVATE_MEM_SLOTS
  #define KVM_PRIVATE_MEM_SLOTS 0
  #endif
 @@ -1036,8 +1032,6 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
  
  extern bool kvm_rebooting;
  
 -struct kvm_device_ops;
 -
  struct kvm_device {
   struct kvm_device_ops *ops;
   struct kvm *kvm;
 diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
 index 1d6daca..53c4f20 100644
 --- a/include/linux/kvm_types.h
 +++ b/include/linux/kvm_types.h
 @@ -19,7 +19,9 @@
  
  struct kvm;
  struct kvm_async_pf;
 +struct kvm_device_ops;
  struct kvm_interrupt;
 +struct kvm_irq_routing_table;
  struct kvm_memory_slot;
  struct kvm_one_reg;
  struct kvm_run;
 

Good idea, thanks.

Paolo


Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virito-serial

2014-08-29 Thread Amit Shah
On (Fri) 29 Aug 2014 [15:45:30], Zhang Haoyu wrote:
 Hi, all
 
 I start a VM with virtio-serial (default ports number: 31), and found that 
 virtio-blk performance degradation happened, about 25%, this problem can be 
 reproduced 100%.
 without virtio-serial:
 4k-read-random 1186 IOPS
 with virtio-serial:
 4k-read-random 871 IOPS
 
 but if I use the max_ports=2 option to limit the max number of virtio-serial
 ports, then the IO performance degradation is not so serious, about 5%.
 
 And, ide performance degradation does not happen with virtio-serial.

Pretty sure it's related to MSI vectors in use.  It's possible that
the virtio-serial device takes up all the available vectors in the guest,
leaving old-style irqs for the virtio-blk device.

If you restrict the number of vectors the virtio-serial device gets
(using the -device virtio-serial-pci,vectors= param), does that make
things better for you?
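For example, capping the vectors on the command line would look something
like the following; the surrounding options, disk image name and the chosen
vector/port counts are illustrative only.

```shell
# Cap the MSI-X vectors consumed by virtio-serial so virtio-blk can
# still get its own (4 ports / 4 vectors are example values)
qemu-system-x86_64 -enable-kvm -m 4096 \
    -drive file=disk.img,if=virtio \
    -device virtio-serial-pci,max_ports=4,vectors=4
```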


Amit


Re: [PATCH v2 1/4] kvmtool: ARM: Use KVM_ARM_PREFERRED_TARGET vm ioctl to determine target cpu

2014-08-29 Thread Will Deacon
On Fri, Aug 29, 2014 at 10:10:52AM +0100, Andre Przywara wrote:
 (resent, that was the wrong account before ...)

Aha, and now your true identity has been revealed to all! Nice try Andre...
or should I say, Rienhard?

Will


Re: [PATCH v2 1/4] kvmtool: ARM: Use KVM_ARM_PREFERRED_TARGET vm ioctl to determine target cpu

2014-08-29 Thread Andre Przywara

On 29/08/14 17:17, Will Deacon wrote:
 On Fri, Aug 29, 2014 at 10:10:52AM +0100, Andre Przywara wrote:
 (resent, that was the wrong account before ...)

 Aha, and now your true identity has been revealed to all! Nice try Andre...
 or should I say, Rienhard?

Psst, don't give Google funny ideas about this (now very secret)
relationship ;-)

Cheers,
Andre R P.




Re: [PATCH v2 3/3] KVM: x86: #GP when attempts to write reserved bits of Variable Range MTRRs

2014-08-29 Thread Paolo Bonzini
On 19/08/2014 11:04, Wanpeng Li wrote:
 Section 11.11.2.3 of the SDM mentions: "All other bits in the IA32_MTRR_PHYSBASEn 
 and IA32_MTRR_PHYSMASKn registers are reserved; the processor generates a 
 general-protection exception (#GP) if software attempts to write to them." 
 This patch does the same in KVM.
 
 Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com

This breaks if the guest maxphyaddr is higher than the host's (which 
sometimes happens depending on your hardware and how QEMU is 
configured).

You need to use cpuid_maxphyaddr, like this

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a375dfc42f6a..916e89515210 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1726,7 +1726,7 @@ static bool valid_mtrr_type(unsigned t)
 static bool mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
int i;
-   u64 mask = 0;
+   u64 mask;
 
if (!msr_mtrr_valid(msr))
return false;
@@ -1750,8 +1750,7 @@ static bool mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	/* variable MTRRs */
 	WARN_ON(!(msr >= 0x200 && msr < 0x200 + 2 * KVM_NR_VAR_MTRR));
 
-	for (i = 63; i > boot_cpu_data.x86_phys_bits; i--)
-		mask |= (1ULL << i);
+	mask = (~0ULL) << cpuid_maxphyaddr(vcpu);
 	if ((msr & 1) == 0) {
 		/* MTRR base */
 		if (!valid_mtrr_type(data & 0xff))


Jan, can you see if this patch fixes the SeaBIOS triple fault you reported?

Paolo

 ---
  arch/x86/kvm/x86.c | 18 +++---
  1 file changed, 15 insertions(+), 3 deletions(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index fb3ea7a..b85da5f 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1726,6 +1726,7 @@ static bool valid_mtrr_type(unsigned t)
  static bool mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
  {
   int i;
 + u64 mask = 0;
  
   if (!msr_mtrr_valid(msr))
   return false;
 @@ -1749,10 +1750,21 @@ static bool mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
  	/* variable MTRRs */
  	WARN_ON(!(msr >= 0x200 && msr < 0x200 + 2 * KVM_NR_VAR_MTRR));
  
 -	if ((msr & 1) == 0)
 +	for (i = 63; i > boot_cpu_data.x86_phys_bits; i--)
 +		mask |= (1ULL << i);
 +	if ((msr & 1) == 0) {
  		/* MTRR base */
 -		return valid_mtrr_type(data & 0xff);
 -	/* MTRR mask */
 +		if (!valid_mtrr_type(data & 0xff))
 +			return false;
 +		mask |= 0xf00;
 +	} else
 +		/* MTRR mask */
 +		mask |= 0x7ff;
 +	if (data & mask) {
 +		kvm_inject_gp(vcpu, 0);
 +		return false;
 +	}
 +
  	return true;
  }
  
 



Re: [PATCH v2 3/3] KVM: x86: #GP when attempts to write reserved bits of Variable Range MTRRs

2014-08-29 Thread Jan Kiszka
On 2014-08-29 18:47, Paolo Bonzini wrote:
 On 19/08/2014 11:04, Wanpeng Li wrote:
  Section 11.11.2.3 of the SDM mentions: "All other bits in the IA32_MTRR_PHYSBASEn 
  and IA32_MTRR_PHYSMASKn registers are reserved; the processor generates a 
  general-protection exception (#GP) if software attempts to write to them." 
  This patch does the same in KVM.

 Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
 
 This breaks if the guest maxphyaddr is higher than the host's (which 
 sometimes happens depending on your hardware and how QEMU is 
 configured).
 
 You need to use cpuid_maxphyaddr, like this
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index a375dfc42f6a..916e89515210 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1726,7 +1726,7 @@ static bool valid_mtrr_type(unsigned t)
  static bool mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
  {
   int i;
 - u64 mask = 0;
 + u64 mask;
  
   if (!msr_mtrr_valid(msr))
   return false;
 @@ -1750,8 +1750,7 @@ static bool mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
  	/* variable MTRRs */
  	WARN_ON(!(msr >= 0x200 && msr < 0x200 + 2 * KVM_NR_VAR_MTRR));
  
 -	for (i = 63; i > boot_cpu_data.x86_phys_bits; i--)
 -		mask |= (1ULL << i);
 +	mask = (~0ULL) << cpuid_maxphyaddr(vcpu);
  	if ((msr & 1) == 0) {
  		/* MTRR base */
  		if (!valid_mtrr_type(data & 0xff))
 
 
 Jan, can you see if this patch fixes the SeaBIOS triple fault you reported?

Yep, it does.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


[PATCH 1/2] KVM: PPC: e500mc: Add support for single threaded vcpus on e6500 core

2014-08-29 Thread Mihai Caraman
ePAPR represents hardware threads as cpu node properties in device tree.
So with existing QEMU, hardware threads are simply exposed as vcpus with
one hardware thread.

The e6500 core shares TLBs between hardware threads. Without a tlb write
conditional instruction, the Linux kernel uses per-core mechanisms to
protect against duplicate TLB entries.

The guest is unable to detect real sibling threads, so it can't use a
TLB protection mechanism. An alternative solution is to use the hypervisor
to allocate different lpids to the guest's vcpus running simultaneously on
real sibling threads. On systems with two threads per core this patch halves
the size of the lpid pool that the allocator sees and uses two lpids per VM.
Even numbers are used to speed up vcpu lpid computation, with consecutive
lpids per VM: vm1 will use lpids 2 and 3, vm2 lpids 4 and 5, and so on.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/include/asm/kvm_booke.h |  5 +++-
 arch/powerpc/kvm/e500.h  | 20 
 arch/powerpc/kvm/e500_mmu_host.c | 16 ++---
 arch/powerpc/kvm/e500mc.c| 46 ++--
 4 files changed, 64 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_booke.h 
b/arch/powerpc/include/asm/kvm_booke.h
index f7aa5cc..630134d 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -23,7 +23,10 @@
 #include <linux/types.h>
 #include <linux/kvm_host.h>
 
-/* LPIDs we support with this build -- runtime limit may be lower */
+/*
+ * Number of available lpids. Only the low-order 6 bits of the LPID register
+ * are implemented on e500mc+ cores.
+ */
 #define KVMPPC_NR_LPIDS	64
 
 #define KVMPPC_INST_EHPRIV 0x7c00021c
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index a326178..7b74453 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -22,6 +22,7 @@
 #include <linux/kvm_host.h>
 #include <asm/mmu-book3e.h>
 #include <asm/tlb.h>
+#include <asm/cputhreads.h>
 
 enum vcpu_ftr {
VCPU_FTR_MMU_V2
@@ -289,6 +290,25 @@ void kvmppc_e500_tlbil_all(struct kvmppc_vcpu_e500 *vcpu_e500);
 #define kvmppc_e500_get_tlb_stid(vcpu, gtlbe)	get_tlb_tid(gtlbe)
 #define get_tlbmiss_tid(vcpu)	get_cur_pid(vcpu)
 #define get_tlb_sts(gtlbe)	(gtlbe->mas1 & MAS1_TS)
+
+/*
+ * This function should be called with preemption disabled
+ * and the returned value is valid only in that context.
+ */
+static inline int get_thread_specific_lpid(int vm_lpid)
+{
+	int vcpu_lpid = vm_lpid;
+
+	if (threads_per_core == 2)
+		vcpu_lpid |= smp_processor_id() & 1;
+
+	return vcpu_lpid;
+}
+
+static inline int get_lpid(struct kvm_vcpu *vcpu)
+{
+	return get_thread_specific_lpid(vcpu->kvm->arch.lpid);
+}
 #else
 unsigned int kvmppc_e500_get_tlb_stid(struct kvm_vcpu *vcpu,
  struct kvm_book3e_206_tlb_entry *gtlbe);
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 08f14bb..5759608 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -69,7 +69,8 @@ static inline u32 e500_shadow_mas3_attrib(u32 mas3, int usermode)
  * writing shadow tlb entry to host TLB
  */
 static inline void __write_host_tlbe(struct kvm_book3e_206_tlb_entry *stlbe,
-				     uint32_t mas0)
+				     uint32_t mas0,
+				     uint32_t lpid)
 {
 	unsigned long flags;
 
@@ -80,7 +81,7 @@ static inline void __write_host_tlbe(struct kvm_book3e_206_tlb_entry *stlbe,
 	mtspr(SPRN_MAS3, (u32)stlbe->mas7_3);
 	mtspr(SPRN_MAS7, (u32)(stlbe->mas7_3 >> 32));
 #ifdef CONFIG_KVM_BOOKE_HV
-	mtspr(SPRN_MAS8, stlbe->mas8);
+	mtspr(SPRN_MAS8, MAS8_TGS | get_thread_specific_lpid(lpid));
 #endif
 	asm volatile("isync; tlbwe" : : : "memory");
 
@@ -129,11 +130,12 @@ static inline void write_host_tlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 
 	if (tlbsel == 0) {
 		mas0 = get_host_mas0(stlbe->mas2);
-		__write_host_tlbe(stlbe, mas0);
+		__write_host_tlbe(stlbe, mas0, vcpu_e500->vcpu.kvm->arch.lpid);
 	} else {
 		__write_host_tlbe(stlbe,
 				  MAS0_TLBSEL(1) |
-				  MAS0_ESEL(to_htlb1_esel(sesel)));
+				  MAS0_ESEL(to_htlb1_esel(sesel)),
+				  vcpu_e500->vcpu.kvm->arch.lpid);
 	}
 }
 
@@ -317,10 +319,6 @@ static void kvmppc_e500_setup_stlbe(
 	stlbe->mas2 = (gvaddr & MAS2_EPN) | (ref->flags & E500_TLB_MAS2_ATTR);
 	stlbe->mas7_3 = ((u64)pfn << PAGE_SHIFT) |
 			e500_shadow_mas3_attrib(gtlbe->mas7_3, pr);
-
-#ifdef CONFIG_KVM_BOOKE_HV
-	stlbe->mas8 = MAS8_TGS | vcpu->kvm->arch.lpid;
-#endif
 }
 
 static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 

[PATCH 2/2] KVM: PPC: Book3E: Enable e6500 core

2014-08-29 Thread Mihai Caraman
Now that AltiVec and hardware threading support are in place enable e6500 core.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/kvm/e500mc.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c
index bf8f99f..2fdc872 100644
--- a/arch/powerpc/kvm/e500mc.c
+++ b/arch/powerpc/kvm/e500mc.c
@@ -180,6 +180,16 @@ int kvmppc_core_check_processor_compat(void)
 		r = 0;
 	else if (strcmp(cur_cpu_spec->cpu_name, "e5500") == 0)
 		r = 0;
+#ifdef CONFIG_ALTIVEC
+	/*
+	 * Since guests have the privilege to enable AltiVec, we need AltiVec
+	 * support in the host to save/restore their context.
+	 * Don't use CPU_FTR_ALTIVEC to identify cores with an AltiVec unit,
+	 * because it's cleared in the absence of CONFIG_ALTIVEC!
+	 */
+	else if (strcmp(cur_cpu_spec->cpu_name, "e6500") == 0)
+		r = 0;
+#endif
 	else
 		r = -ENOTSUPP;
 
-- 
1.7.11.7



Re: kvm-unit-test failures

2014-08-29 Thread Chris J Arges
On 08/27/2014 05:05 PM, Paolo Bonzini wrote:
 
 
 ----- Original Message -----
 From: Chris J Arges chris.j.ar...@canonical.com
 To: Paolo Bonzini pbonz...@redhat.com, kvm@vger.kernel.org
 Sent: Wednesday, 27 August 2014 23:24:14
 Subject: kvm-unit-test failures (was: [PATCH 1/2 v3] add check parameter to 
 run_tests configuration)

 snip
 Thanks, looks good.  Are there more failures?

 Paolo


 Paolo,
 Thanks for applying those patches!

 I now only see the two failures on my machine:
 model name  : Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz

 I'm running with the tip of kvm master:
 0ac625df43ce9d085d4ff54c1f739611f4308b13 (Merge tag 'kvm-s390-20140825')

 sudo ./x86-run x86/apic.flat -smp 2 -cpu qemu64,+x2apic,+tsc-deadline |
 grep -v PASS
 qemu-system-x86_64 -enable-kvm -device pc-testdev -device
 isa-debug-exit,iobase=0xf4,iosize=0x4 -display none -serial stdio
 -device pci-testdev -kernel x86/apic.flat -smp 2 -cpu
 qemu64,+x2apic,+tsc-deadline
 enabling apic
 enabling apic
 paging enabled
 cr0 = 80010011
 cr3 = 7fff000
 cr4 = 20
 apic version: 1050014
 x2apic enabled
 FAIL: tsc deadline timer clearing
 tsc deadline timer enabled
 
 This is fixed in kvm/next (3.18).
 
 SUMMARY: 16 tests, 1 unexpected failures
 Return value from qemu: 3

 sudo ./x86-run x86/kvmclock_test.flat -smp 2 --append 1000 `date +%s`
 qemu-system-x86_64 -enable-kvm -device pc-testdev -device
 isa-debug-exit,iobase=0xf4,iosize=0x4 -display none -serial stdio
 -device pci-testdev -kernel x86/kvmclock_test.flat -smp 2 --append
 1000 1409174399
 enabling apic
 enabling apic
 kvm-clock: cpu 0, msr 0x:44d4c0
 kvm-clock: cpu 0, msr 0x:44d4c0
 Wallclock test, threshold 5
 Seconds get from host: 1409174399
 Seconds get from kvmclock: 1409173176
 Offset:-1223
 
 Weird, your clock in the VM is 20 minutes behind the host's.  Is the
 offset always around -1200?  What happens if you reboot?
 
 (I get 0, 1 or sometimes 2).
 
 Paolo
 

Hi Paolo,

Results building with kvm queue tree
(fd2752352bbc98850d83b5448a288d8991590317):
CPU:
model name  : Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz

I still get failures with the following test, I actually tested on
multiple machines with identical hardware and the same failure occurred.
In v3.13/v3.16 series kernels this passes. I'll look into which commit
changed this result for me. I suspect it was fairly recent.

./x86-run x86/kvmclock_test.flat -smp 2 --append 1000 `date +%s` |
grep -v PASS
qemu-system-x86_64 -enable-kvm -device pc-testdev -device
isa-debug-exit,iobase=0xf4,iosize=0x4 -display none -serial stdio
-device pci-testdev -kernel x86/kvmclock_test.flat -smp 2 --append
1000 1409160326
enabling apic
enabling apic
kvm-clock: cpu 0, msr 0x:44d520
kvm-clock: cpu 0, msr 0x:44d520
Wallclock test, threshold 5
Seconds get from host: 1409160326
Seconds get from kvmclock: 1409153484
Offset:-6842
offset too large!
Check the stability of raw cycle ...
Worst warp -6841795339348
Total vcpus: 2
Test  loops: 1000
Total warps:  1
Total stalls: 0
Worst warp:   -6841795339348
Raw cycle is not stable
Monotonic cycle test:
Worst warp -6836691572679
Total vcpus: 2
Test  loops: 1000
Total warps:  1
Total stalls: 0
Worst warp:   -6836691572679
Measure the performance of raw cycle ...
Total vcpus: 2
Test  loops: 1000
TSC cycles:  1098400654
Measure the performance of adjusted cycle ...
Total vcpus: 2
Test  loops: 1000
TSC cycles:  1106302952
Return value from qemu: 3

This is another test that fails or hangs, this passes in 3.13, but fails
on 3.16 with my testing. I'll dig into this more perhaps to find out
which commit changes things.

./x86-run x86/vmx.flat -smp 1 -cpu host,+vmx | grep -v PASS
qemu-system-x86_64 -enable-kvm -device pc-testdev -device
isa-debug-exit,iobase=0xf4,iosize=0x4 -display none -serial stdio
-device pci-testdev -kernel x86/vmx.flat -smp 1 -cpu host,+vmx
enabling apic
paging enabled
cr0 = 80010011
cr3 = 7fff000
cr4 = 20

Test suite: VMX capability reporting

Test suite: vmenter

Test suite: preemption timer

Test suite: control field PAT

Test suite: control field EFER

Test suite: CR shadowing

Test suite: I/O bitmap

Test suite: instruction intercept

Test suite: EPT framework
FAIL: EPT violation - paging structure

Test suite: interrupt
`ASS: running a guest with interrupt acknowledgement set

Test suite: debug controls

SUMMARY: 109 tests, 1 unexpected failures
Return value from qemu: 3

Here is the test case hanging:

./x86-run x86/vmx.flat -smp 1 -cpu host,+vmx | grep -v PASS
qemu-system-x86_64 -enable-kvm -device pc-testdev -device
isa-debug-exit,iobase=0xf4,iosize=0x4 -display none -serial stdio
-device pci-testdev -kernel x86/vmx.flat -smp 1 -cpu host,+vmx
enabling apic
paging enabled
cr0 = 80010011
cr3 = 7fff000
cr4 = 20

Test suite: VMX capability reporting

Test suite: vmenter

Test suite: preemption timer

Test suite: control field PAT

Test suite: control field EFER

Test suite: 

Re: [PATCH 06/16] KVM: Add KVM_EXIT_SYSTEM_EVENT to user space API header

2014-08-29 Thread Peter Maydell
On 25 May 2014 19:18, Christoffer Dall christoffer.d...@linaro.org wrote:
 From: Anup Patel anup.pa...@linaro.org

 Currently, we don't have an exit reason to notify user space about
 a system-level event (for e.g. system reset or shutdown) triggered
 by the VCPU. This patch adds exit reason KVM_EXIT_SYSTEM_EVENT for
 this purpose. We can also inform user space about the 'type' and
 architecture specific 'flags' of a system-level event using the
 kvm_run structure.

 This newly added KVM_EXIT_SYSTEM_EVENT will be used by KVM ARM/ARM64
 in-kernel PSCI v0.2 support to reset/shutdown VMs.

 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2740,6 +2740,21 @@ It gets triggered whenever both KVM_CAP_PPC_EPR are 
 enabled and an
  external interrupt has just been delivered into the guest. User space
  should put the acknowledged interrupt vector into the 'epr' field.

 +   /* KVM_EXIT_SYSTEM_EVENT */
 +   struct {
 +#define KVM_SYSTEM_EVENT_SHUTDOWN   1
 +#define KVM_SYSTEM_EVENT_RESET  2
 +   __u32 type;
 +   __u64 flags;
 +   } system_event;
 +
 +If exit_reason is KVM_EXIT_SYSTEM_EVENT then the vcpu has triggered
 +a system-level event using some architecture specific mechanism (hypercall
 +or some special instruction). In case of ARM/ARM64, this is triggered using
 +HVC instruction based PSCI call from the vcpu. The 'type' field describes
 +the system-level event type. The 'flags' field describes architecture
 +specific flags for the system-level event.

Talking with Ard I realised that there's actually a hole in the
specification of this new ABI. Did we intend these shutdown
and reset exits to be:
 (1) requests from the guest for the shutdown/reset to be
   scheduled in the near future (and we'll continue to execute
   the guest until the shutdown actually happens)
 (2) requests for shutdown/reset right now, with no further
   guest instructions to be executed

?

As currently implemented in QEMU we get behaviour (1),
but I think the kernel PSCI implementation assumes
behaviour (2). Who's right?

thanks
-- PMM


Re: [PATCH 2/2] kvm: x86: fix stale mmio cache bug

2014-08-29 Thread David Matlack
On Fri, Aug 29, 2014 at 12:58 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 28/08/2014 23:10, David Matlack wrote:
 Paolo,
 It seems like this patch ([PATCH 2/2] kvm: x86: fix stale mmio cache)
 is ready to go. Is there anything blocking it from being merged?

 (It should be fine to merge this on its own, independent of the fix
 discussed in [PATCH 1/2] KVM: fix cache stale memslot info with
 correct mmio generation number, https://lkml.org/lkml/2014/8/14/62.)

 I'll post the full series today.  Sorry, I've been swamped a bit.

Thanks, I'll start testing it. Hope things quiet down for you soon :)


 Paolo


Re: kvm-unit-test failures

2014-08-29 Thread Chris J Arges


On 08/29/2014 12:36 PM, Chris J Arges wrote:
 On 08/27/2014 05:05 PM, Paolo Bonzini wrote:


 ----- Original Message -----
 From: Chris J Arges chris.j.ar...@canonical.com
 To: Paolo Bonzini pbonz...@redhat.com, kvm@vger.kernel.org
 Sent: Wednesday, 27 August 2014 23:24:14
 Subject: kvm-unit-test failures (was: [PATCH 1/2 v3] add check parameter to 
 run_tests configuration)

 snip
 Thanks, looks good.  Are there more failures?

 Paolo


 Paolo,
 Thanks for applying those patches!

 I now only see the two failures on my machine:
 model name  : Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz

 I'm running with the tip of kvm master:
 0ac625df43ce9d085d4ff54c1f739611f4308b13 (Merge tag 'kvm-s390-20140825')

 sudo ./x86-run x86/apic.flat -smp 2 -cpu qemu64,+x2apic,+tsc-deadline |
 grep -v PASS
 qemu-system-x86_64 -enable-kvm -device pc-testdev -device
 isa-debug-exit,iobase=0xf4,iosize=0x4 -display none -serial stdio
 -device pci-testdev -kernel x86/apic.flat -smp 2 -cpu
 qemu64,+x2apic,+tsc-deadline
 enabling apic
 enabling apic
 paging enabled
 cr0 = 80010011
 cr3 = 7fff000
 cr4 = 20
 apic version: 1050014
 x2apic enabled
 FAIL: tsc deadline timer clearing
 tsc deadline timer enabled

 This is fixed in kvm/next (3.18).

 SUMMARY: 16 tests, 1 unexpected failures
 Return value from qemu: 3

 sudo ./x86-run x86/kvmclock_test.flat -smp 2 --append 1000 `date +%s`
 qemu-system-x86_64 -enable-kvm -device pc-testdev -device
 isa-debug-exit,iobase=0xf4,iosize=0x4 -display none -serial stdio
 -device pci-testdev -kernel x86/kvmclock_test.flat -smp 2 --append
 1000 1409174399
 enabling apic
 enabling apic
 kvm-clock: cpu 0, msr 0x:44d4c0
 kvm-clock: cpu 0, msr 0x:44d4c0
 Wallclock test, threshold 5
 Seconds get from host: 1409174399
 Seconds get from kvmclock: 1409173176
 Offset:-1223

 Weird, your clock in the VM is 20 minutes behind the host's.  Is the
 offset always around -1200?  What happens if you reboot?

 (I get 0, 1 or sometimes 2).

 Paolo

 
 Hi Paolo,
 
 Results building with kvm queue tree
 (fd2752352bbc98850d83b5448a288d8991590317):
 CPU:
 model name  : Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
 
 I still get failures with the following test, I actually tested on
 multiple machines with identical hardware and the same failure occurred.
 In v3.13/v3.16 series kernels this passes. I'll look into which commit
 changed this result for me. I suspect it was fairly recent.
 

I did a quick grep through the patches between v3.16 and the latest and
I found: (0d3da0d26e3c3515997c99451ce3b0ad1a69a36c KVM: x86: fix TSC
matching).

I reverted this patch and the test case passed for me. Looking at the
patch, I added an extra else statement like so:

if (!matched) {
	kvm->arch.nr_vcpus_matched_tsc = 0;
} else if (!already_matched) {
	kvm->arch.nr_vcpus_matched_tsc++;
} else {
	printk("kvm: do we get here?\n");
}

And indeed there is a condition where matched && already_matched are
both true. In this case we don't zero or increment nr_vcpus_matched_tsc.
Incrementing nr_vcpus_matched_tsc in that last else clause allows the
test to pass; however, this is identical to the logic before the patch.

Any suggestions for fixing this case?

Thanks,
--chris j arges

 ./x86-run x86/kvmclock_test.flat -smp 2 --append 1000 `date +%s` |
 grep -v PASS
 qemu-system-x86_64 -enable-kvm -device pc-testdev -device
 isa-debug-exit,iobase=0xf4,iosize=0x4 -display none -serial stdio
 -device pci-testdev -kernel x86/kvmclock_test.flat -smp 2 --append
 1000 1409160326
 enabling apic
 enabling apic
 kvm-clock: cpu 0, msr 0x:44d520
 kvm-clock: cpu 0, msr 0x:44d520
 Wallclock test, threshold 5
 Seconds get from host: 1409160326
 Seconds get from kvmclock: 1409153484
 Offset:-6842
 offset too large!
 Check the stability of raw cycle ...
 Worst warp -6841795339348
 Total vcpus: 2
 Test  loops: 1000
 Total warps:  1
 Total stalls: 0
 Worst warp:   -6841795339348
 Raw cycle is not stable
 Monotonic cycle test:
 Worst warp -6836691572679
 Total vcpus: 2
 Test  loops: 1000
 Total warps:  1
 Total stalls: 0
 Worst warp:   -6836691572679
 Measure the performance of raw cycle ...
 Total vcpus: 2
 Test  loops: 1000
 TSC cycles:  1098400654
 Measure the performance of adjusted cycle ...
 Total vcpus: 2
 Test  loops: 1000
 TSC cycles:  1106302952
 Return value from qemu: 3
 
 This is another test that fails or hangs, this passes in 3.13, but fails
 on 3.16 with my testing. I'll dig into this more perhaps to find out
 which commit changes things.
 
 ./x86-run x86/vmx.flat -smp 1 -cpu host,+vmx | grep -v PASS
 qemu-system-x86_64 -enable-kvm -device pc-testdev -device
 isa-debug-exit,iobase=0xf4,iosize=0x4 -display none -serial stdio
 -device pci-testdev -kernel x86/vmx.flat -smp 1 -cpu host,+vmx
 enabling apic
 paging enabled
 cr0 = 80010011
 cr3 = 7fff000
 cr4 = 20
 
 Test suite: VMX 

Re: kvm-unit-test failures

2014-08-29 Thread Paolo Bonzini
On 29/08/2014 19:36, Chris J Arges wrote:
 I still get failures with the following test, I actually tested on
 multiple machines with identical hardware and the same failure occurred.
 In v3.13/v3.16 series kernels this passes. I'll look into which commit
 changed this result for me. I suspect it was fairly recent.

I would try bisecting between 0e5ac3a8b100469ea154f87dd57b685fbdd356f6
(might be bad) and 41fa4215f8e8150bdc5d2a5f8704915f1b059fa8 (might be good).

Is this a laptop or desktop?  Any relationship between the Offset and
how long the machine has been up?

Paolo


Re: kvm-unit-test failures

2014-08-29 Thread Chris J Arges


On 08/29/2014 04:08 PM, Paolo Bonzini wrote:
 On 29/08/2014 19:36, Chris J Arges wrote:
 I still get failures with the following test, I actually tested on
 multiple machines with identical hardware and the same failure occurred.
 In v3.13/v3.16 series kernels this passes. I'll look into which commit
 changed this result for me. I suspect it was fairly recent.
 
 I would try bisecting between 0e5ac3a8b100469ea154f87dd57b685fbdd356f6
 (might be bad) and 41fa4215f8e8150bdc5d2a5f8704915f1b059fa8 (might be good).
 
Just sent the other email with my bisect results, the 'bad' commit for
me is:
0d3da0d26e3c3515997c99451ce3b0ad1a69a36c
A revert on this commit allows the tests to pass again.

 Is this a laptop or desktop?  Any relationship between the Offset and
 how long the machine has been up?

 Paolo
 
This is a server; and no, I've reboot the machine and re-run the test.
I've also run it after days of uptime.

--chris


Re: [PATCH v2 1/4] kvmtool: ARM: Use KVM_ARM_PREFERRED_TARGET vm ioctl to determine target cpu

2014-08-29 Thread Anup Patel
Hi Andre,

On 29 August 2014 14:40, Andre Przywara andre.przyw...@arm.com wrote:
 (resent, that was the wrong account before ...)

 Hi Anup,

 On 26/08/14 10:22, Anup Patel wrote:
 Instead, of trying out each and every target type we should
 use KVM_ARM_PREFERRED_TARGET vm ioctl to determine target type
 for KVM ARM/ARM64.

 If KVM_ARM_PREFERRED_TARGET vm ioctl fails then we fallback to
 old method of trying all known target types.

 So as the algorithm currently works, it does not give us much
 improvement over the current behaviour. We still need to list each
 supported MPIDR both in kvmtool and in the kernel.
 Looking more closely at the code, beside the target id we only need the
 kvm_target_arm[] list for the compatible string and the init() function.
 The latter is (currently) the same for all supported type, so we could
 use that as a standard fallback function.
 The compatible string seems to be completely ignored by the ARM64
 kernel, so we could as well pass arm,armv8 all the time.
 In ARM(32) kernels we seem to not make any real use of it for CPUs which
 we care for (with virtualisation extensions).

You are absolutely right here. I was just trying to keep the
kvmtool changes to a minimum.


 So what about the following:
 We keep the list as it is, but do not extend it for future CPUs, except
 those in need of a special compatible string or a specific init
 function. Instead we rely on PREFERRED_TARGET for all current and
 upcoming CPUs (meaning unsupported CPUs must use a 3.12 kernel or higher).
 If PREFERRED_TARGET works, we scan the list anyway (to find CPUs needing
 special treatment), but on failure of finding something in the list we
 just go ahead:
 - with the target ID the kernel returned,
 - an arm,armv8 compatible string (for arm64, not sure about arm) and
 - call the standard kvmtool init function

 This should relieve us of the burden of adding each supported CPU to
 kvmtool.

 Does that make sense of am I missing something?
 I will hack something up to prove that it works.

Yes, this makes sense. In fact, QEMU does something similar
for -cpu host -M virt command line options.

I think I should be less lazy on this one. I will rework this
and make it more like QEMU's -cpu host option.

Thanks,
Anup


 Also there is now a race on big.LITTLE systems: if the PREFERRED_TARGET
 ioctl is executed on one cluster, while the KVM_ARM_VCPU_INIT call is
 done on another core with a different MPIDR, then the kernel will refuse
 to init the CPU. I don't know of a good solution for this (except the
 sledgehammer pinning with sched_setaffinity to the current core, which
 is racy, too, but should at least work somehow ;-)
 Any ideas?

 Signed-off-by: Pranavkumar Sawargaonkar pranavku...@linaro.org
 Signed-off-by: Anup Patel anup.pa...@linaro.org
 ---
  tools/kvm/arm/kvm-cpu.c |   46 
 +++---
  1 file changed, 35 insertions(+), 11 deletions(-)

 diff --git a/tools/kvm/arm/kvm-cpu.c b/tools/kvm/arm/kvm-cpu.c
 index aeaa4cf..c010e9c 100644
 --- a/tools/kvm/arm/kvm-cpu.c
 +++ b/tools/kvm/arm/kvm-cpu.c
 @@ -34,6 +34,7 @@ struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, unsigned long cpu_id)
  	struct kvm_cpu *vcpu;
  	int coalesced_offset, mmap_size, err = -1;
  	unsigned int i;
 +	struct kvm_vcpu_init preferred_init;
  	struct kvm_vcpu_init vcpu_init = {
  		.features = ARM_VCPU_FEATURE_FLAGS(kvm, cpu_id)
  	};
 @@ -55,20 +56,42 @@ struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, unsigned long cpu_id)
  	if (vcpu->kvm_run == MAP_FAILED)
  		die("unable to mmap vcpu fd");
 
 -	/* Find an appropriate target CPU type. */
 -	for (i = 0; i < ARRAY_SIZE(kvm_arm_targets); ++i) {
 -		if (!kvm_arm_targets[i])
 -			continue;
 -		target = kvm_arm_targets[i];
 -		vcpu_init.target = target->id;
 +	/*
 +	 * If preferred target ioctl successful then use preferred target
 +	 * else try each and every target type.
 +	 */
 +	err = ioctl(kvm->vm_fd, KVM_ARM_PREFERRED_TARGET, &preferred_init);
 +	if (!err) {
 +		/* Match preferred target CPU type. */
 +		target = NULL;
 +		for (i = 0; i < ARRAY_SIZE(kvm_arm_targets); ++i) {
 +			if (!kvm_arm_targets[i])
 +				continue;
 +			if (kvm_arm_targets[i]->id == preferred_init.target) {
 +				target = kvm_arm_targets[i];
 +				break;
 +			}
 +		}
 +
 +		vcpu_init.target = preferred_init.target;
  		err = ioctl(vcpu->vcpu_fd, KVM_ARM_VCPU_INIT, &vcpu_init);
 -		if (!err)
 -			break;
 +		if (err || target->init(vcpu))
 +			die("Unable to initialise vcpu for preferred target");

 So that segfaults if the CPU is not in kvmtool's list (as target is still
 NULL). In the current 

Re: [PATCH v2 4/4] kvmtool: ARM/ARM64: Provide PSCI-0.2 to guest when KVM supports it

2014-08-29 Thread Anup Patel
Hi Andre,

On 29 August 2014 14:41, Andre Przywara andre.przyw...@arm.com wrote:
 Hi Anup,

 On 26/08/14 10:22, Anup Patel wrote:
 If in-kernel KVM support PSCI-0.2 emulation then we should set
 KVM_ARM_VCPU_PSCI_0_2 feature for each guest VCPU and also
 provide arm,psci-0.2,arm,psci as PSCI compatible string.

 This patch updates kvm_cpu__arch_init() and setup_fdt() as
 per above.

 Signed-off-by: Pranavkumar Sawargaonkar pranavku...@linaro.org
 Signed-off-by: Anup Patel anup.pa...@linaro.org
 ---
  tools/kvm/arm/fdt.c |   39 +--
  tools/kvm/arm/kvm-cpu.c |5 +
  2 files changed, 38 insertions(+), 6 deletions(-)

 diff --git a/tools/kvm/arm/fdt.c b/tools/kvm/arm/fdt.c
 index 186a718..93849cf2 100644
 --- a/tools/kvm/arm/fdt.c
 +++ b/tools/kvm/arm/fdt.c
 @@ -13,6 +13,7 @@
   #include <linux/byteorder.h>
   #include <linux/kernel.h>
   #include <linux/sizes.h>
  +#include <linux/psci.h>

  static char kern_cmdline[COMMAND_LINE_SIZE];

 @@ -162,12 +163,38 @@ static int setup_fdt(struct kvm *kvm)

   /* PSCI firmware */
    _FDT(fdt_begin_node(fdt, "psci"));
  - _FDT(fdt_property_string(fdt, "compatible", "arm,psci"));
  - _FDT(fdt_property_string(fdt, "method", "hvc"));
  - _FDT(fdt_property_cell(fdt, "cpu_suspend", KVM_PSCI_FN_CPU_SUSPEND));
  - _FDT(fdt_property_cell(fdt, "cpu_off", KVM_PSCI_FN_CPU_OFF));
  - _FDT(fdt_property_cell(fdt, "cpu_on", KVM_PSCI_FN_CPU_ON));
  - _FDT(fdt_property_cell(fdt, "migrate", KVM_PSCI_FN_MIGRATE));
 + if (kvm__supports_extension(kvm, KVM_CAP_ARM_PSCI_0_2)) {
  + const char compatible[] = "arm,psci-0.2\0arm,psci";
  + _FDT(fdt_property(fdt, "compatible",
  +   compatible, sizeof(compatible)));
  + _FDT(fdt_property_string(fdt, "method", "hvc"));
  + if (kvm->cfg.arch.aarch32_guest) {
  + _FDT(fdt_property_cell(fdt, "cpu_suspend",
  + PSCI_0_2_FN_CPU_SUSPEND));
  + _FDT(fdt_property_cell(fdt, "cpu_off",
  + PSCI_0_2_FN_CPU_OFF));
  + _FDT(fdt_property_cell(fdt, "cpu_on",
  + PSCI_0_2_FN_CPU_ON));
  + _FDT(fdt_property_cell(fdt, "migrate",
  + PSCI_0_2_FN_MIGRATE));
  + } else {
  + _FDT(fdt_property_cell(fdt, "cpu_suspend",
  + PSCI_0_2_FN64_CPU_SUSPEND));
  + _FDT(fdt_property_cell(fdt, "cpu_off",
  + PSCI_0_2_FN_CPU_OFF));
  + _FDT(fdt_property_cell(fdt, "cpu_on",
  + PSCI_0_2_FN64_CPU_ON));
  + _FDT(fdt_property_cell(fdt, "migrate",
  + PSCI_0_2_FN64_MIGRATE));
  + }
  + } else {
  + _FDT(fdt_property_string(fdt, "compatible", "arm,psci"));
  + _FDT(fdt_property_string(fdt, "method", "hvc"));
  + _FDT(fdt_property_cell(fdt, "cpu_suspend", KVM_PSCI_FN_CPU_SUSPEND));
  + _FDT(fdt_property_cell(fdt, "cpu_off", KVM_PSCI_FN_CPU_OFF));
  + _FDT(fdt_property_cell(fdt, "cpu_on", KVM_PSCI_FN_CPU_ON));
  + _FDT(fdt_property_cell(fdt, "migrate", KVM_PSCI_FN_MIGRATE));
  + }

 I guess this could be simplified much by defining three arrays with the
 respective function IDs and setting a pointer to the right one here.
 Then there would still be only one set of _FDT() calls, which reference
 this pointer. Like:
 uint32_t *psci_fn_ids;
 ...
 if (KVM_CAP_ARM_PSCI_0_2) {
 if (aarch32_guest)
 psci_fn_ids = psci_0_2_fn_ids;
 else
 psci_fn_ids = psci_0_2_fn64_ids;
 } else
 psci_fn_ids = psci_0_1_fn_ids;
  _FDT(fdt_property_cell(fdt, "cpu_suspend", psci_fn_ids[0]));
  _FDT(fdt_property_cell(fdt, "cpu_off", psci_fn_ids[1]));
 ...

 Also I wonder if we actually need those different IDs. The binding doc
 says that Linux's PSCI 0.2 code ignores them altogether, only using them
 if the "arm,psci" branch of the compatible string is actually used (on
 kernels not supporting PSCI 0.2).

Your suggestion looks good to me. I will rework this patch accordingly.

 So can't we just always pass the PSCI 0.1 numbers in here?
 That would restrict this whole patch to just changing the compatible
 string, right?

If we always pass PSCI 0.1 numbers irrespective of the compatible
string then it breaks the case where we have the latest host kernel with
PSCI 0.2, the latest KVMTOOL with PSCI 0.2, and an older guest kernel
with only PSCI 0.1 support. There was an issue in QEMU and Christoffer
had sent out a patch to fix this in QEMU.
(Refer, https://lists.nongnu.org/archive/html/qemu-devel/2014-08/msg01311.html)

Regards,
Anup


 Regards,
 Andre.

   _FDT(fdt_end_node(fdt));

   /* 

[PATCH 2/2] KVM: PPC: Book3E: Enable e6500 core

2014-08-29 Thread Mihai Caraman
Now that AltiVec and hardware threading support are in place, enable the e6500 core.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/kvm/e500mc.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c
index bf8f99f..2fdc872 100644
--- a/arch/powerpc/kvm/e500mc.c
+++ b/arch/powerpc/kvm/e500mc.c
@@ -180,6 +180,16 @@ int kvmppc_core_check_processor_compat(void)
r = 0;
else if (strcmp(cur_cpu_spec->cpu_name, "e5500") == 0)
r = 0;
+#ifdef CONFIG_ALTIVEC
+   /*
+* Since guests have the privilege to enable AltiVec, we need AltiVec
+* support in the host to save/restore their context.
+* Don't use CPU_FTR_ALTIVEC to identify cores with an AltiVec unit
+* because it's cleared in the absence of CONFIG_ALTIVEC!
+*/
+   else if (strcmp(cur_cpu_spec->cpu_name, "e6500") == 0)
+   r = 0;
+#endif
else
r = -ENOTSUPP;
 
-- 
1.7.11.7
