RE: [PATCH 1/2] VFIO-AER: Vfio-pci driver changes for supporting AER

2013-01-11 Thread Pandarathil, Vijaymohan R


 -Original Message-
 From: Alex Williamson [mailto:alex.william...@redhat.com]
 Sent: Wednesday, January 09, 2013 11:05 AM
 To: Pandarathil, Vijaymohan R
 Cc: Bjorn Helgaas; Gleb Natapov; kvm@vger.kernel.org; qemu-
 de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: [PATCH 1/2] VFIO-AER: Vfio-pci driver changes for supporting
 AER
 
 On Wed, 2013-01-09 at 10:52 -0700, Alex Williamson wrote:
  On Wed, 2013-01-09 at 06:26 +, Pandarathil, Vijaymohan R wrote:
 - New ioctl which is used to pass the eventfd that is signaled when
 an error occurs in the vfio_pci_device
  
 - Register pci_error_handler for the vfio_pci driver
  
 - When the device encounters an error, the error handler registered
 by
 the vfio_pci driver gets invoked by the AER infrastructure
  
 - In the error handler, signal the eventfd registered for the device.
  
 - This results in the qemu eventfd handler getting invoked and
 appropriate action taken for the guest.
  
   Signed-off-by: Vijay Mohan Pandarathil vijaymohan.pandarat...@hp.com
   ---
drivers/vfio/pci/vfio_pci.c | 29 +
drivers/vfio/pci/vfio_pci_private.h |  1 +
drivers/vfio/vfio.c |  8 
include/linux/vfio.h|  1 +
include/uapi/linux/vfio.h   |  9 +
5 files changed, 48 insertions(+)
  
   diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
   index 6c11994..4ae9526 100644
   --- a/drivers/vfio/pci/vfio_pci.c
   +++ b/drivers/vfio/pci/vfio_pci.c
   @@ -207,6 +207,8 @@ static long vfio_pci_ioctl(void *device_data,
 if (vdev-reset_works)
 info.flags |= VFIO_DEVICE_FLAGS_RESET;
  
   + info.flags |= VFIO_DEVICE_FLAGS_AER_NOTIFY;
   +
 
  This appears to be a PCI specific flag, so the name should include
  _PCI_.  We also support non-PCIe devices and it seems like it would be
  possible to not have AER support available, so shouldn't this be
  conditional?

Will do that.

 
 info.num_regions = VFIO_PCI_NUM_REGIONS;
 info.num_irqs = VFIO_PCI_NUM_IRQS;
  
   @@ -348,6 +350,19 @@ static long vfio_pci_ioctl(void *device_data,
  
 return ret;
  
   + } else if (cmd == VFIO_DEVICE_SET_ERRFD) {
   + int32_t fd = (int32_t)arg;
   +
   + if (fd  0)
   + return -EINVAL;
   +
   + vdev-err_trigger = eventfd_ctx_fdget(fd);
   +
   + if (IS_ERR(vdev-err_trigger))
   + return PTR_ERR(vdev-err_trigger);
   +
   + return 0;
   +
 
  I'm not sure why we wouldn't describe this as just another interrupt
  from the device and configure it via SET_IRQ.  This ioctl has very
  limited use and doesn't follow any of the conventions of all the other
  vfio ioctls.

I thought this was a fairly simple ioctl to implement this way. Moreover, 
there is no device interrupt involved. Let me know if you really want to 
model it as a SET_IRQ ioctl.

 
 } else if (cmd == VFIO_DEVICE_RESET)
 return vdev-reset_works ?
 pci_reset_function(vdev-pdev) : -EINVAL;
   @@ -527,11 +542,25 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 kfree(vdev);
}
  
   +static pci_ers_result_t vfio_err_detected(struct pci_dev *pdev,
   + pci_channel_state_t state)
   +{
   + struct vfio_pci_device *vdev = vfio_get_vdev(pdev-dev);
   +
   + eventfd_signal(vdev-err_trigger, 1);
   + return PCI_ERS_RESULT_CAN_RECOVER;
   +}
 
  What if err_trigger hasn't been set?

Will fix that.

 
   +
   +static const struct pci_error_handlers vfio_err_handlers = {
   + .error_detected = vfio_err_detected,
   +};
   +
static struct pci_driver vfio_pci_driver = {
 .name   = vfio-pci,
 .id_table   = NULL, /* only dynamic ids */
 .probe  = vfio_pci_probe,
 .remove = vfio_pci_remove,
   + .err_handler= vfio_err_handlers,
};
  
static void __exit vfio_pci_cleanup(void)
   diff --git a/drivers/vfio/pci/vfio_pci_private.h
 b/drivers/vfio/pci/vfio_pci_private.h
   index 611827c..daee62f 100644
   --- a/drivers/vfio/pci/vfio_pci_private.h
   +++ b/drivers/vfio/pci/vfio_pci_private.h
   @@ -55,6 +55,7 @@ struct vfio_pci_device {
 boolbardirty;
 struct pci_saved_state  *pci_saved_state;
 atomic_trefcnt;
   + struct eventfd_ctx  *err_trigger;
};
  
#define is_intx(vdev) (vdev-irq_type == VFIO_PCI_INTX_IRQ_INDEX)
   diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
   index 56097c6..5ed5a54 100644
   --- a/drivers/vfio/vfio.c
   +++ b/drivers/vfio/vfio.c
   @@ -693,6 +693,14 @@ void *vfio_del_group_dev(struct device *dev)
}
EXPORT_SYMBOL_GPL(vfio_del_group_dev);
  
   +void *vfio_get_vdev(struct device *dev)
   +{
   + struct vfio_device *device = dev_get_drvdata(dev);
   +
   + return 

RE: [PATCH 2/2] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-01-11 Thread Pandarathil, Vijaymohan R


 -Original Message-
 From: Alex Williamson [mailto:alex.william...@redhat.com]
 Sent: Wednesday, January 09, 2013 10:08 AM
 To: Pandarathil, Vijaymohan R
 Cc: Bjorn Helgaas; Gleb Natapov; kvm@vger.kernel.org; qemu-
 de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: [PATCH 2/2] QEMU-AER: Qemu changes to support AER for VFIO-PCI
 devices
 
 On Wed, 2013-01-09 at 06:26 +, Pandarathil, Vijaymohan R wrote:
  - Create eventfd per vfio device assigned to a guest and register an
event handler
 
  - This fd is passed to the vfio_pci driver through a new ioctl
 
  - When the device encounters an error, the eventfd is signaled
and the qemu eventfd handler gets invoked.
 
  - In the handler decide what action to take. Current action taken
is to terminate the guest.
 
  Signed-off-by: Vijay Mohan Pandarathil vijaymohan.pandarat...@hp.com
  ---
   hw/vfio_pci.c  | 56
 ++
   linux-headers/linux/vfio.h |  9 
   2 files changed, 65 insertions(+)
 
  diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
  index 28c8303..9c3c28b 100644
  --- a/hw/vfio_pci.c
  +++ b/hw/vfio_pci.c
  @@ -38,6 +38,7 @@
   #include qemu/error-report.h
   #include qemu/queue.h
   #include qemu/range.h
  +#include sysemu/sysemu.h
 
   /* #define DEBUG_VFIO */
   #ifdef DEBUG_VFIO
  @@ -130,6 +131,8 @@ typedef struct VFIODevice {
   QLIST_ENTRY(VFIODevice) next;
   struct VFIOGroup *group;
   bool reset_works;
  +EventNotifier errfd;
 
 Misleading name, it's an EventNotifier not an fd.

Will make the change.

 
  +__u32 dev_info_flags;
   } VFIODevice;
 
   typedef struct VFIOGroup {
  @@ -1805,6 +1808,8 @@ static int vfio_get_device(VFIOGroup *group, const
 char *name, VFIODevice *vdev)
   DPRINTF(Device %s flags: %u, regions: %u, irgs: %u\n, name,
   dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
 
  +vdev-dev_info_flags = dev_info.flags;
  +
 
 We test the VFIO_DEVICE_FLAGS_RESET elsewhere by caching the bit into a
 variable, why not do that here too?

I was wondering if there was any specific reason to cache the bit into a 
separate variable. Wouldn't a flags field which can contain the specific bit be 
more apt ?

 
   if (!(dev_info.flags  VFIO_DEVICE_FLAGS_PCI)) {
   error_report(vfio: Um, this isn't a PCI device\n);
   goto error;
  @@ -1900,6 +1905,55 @@ static void vfio_put_device(VFIODevice *vdev)
   }
   }
 
  +static void vfio_errfd_handler(void *opaque)
  +{
  +VFIODevice *vdev = opaque;
  +
  +if (!event_notifier_test_and_clear(vdev-errfd)) {
  +return;
  +}
  +
  +/*
  + * TBD. Retrieve the error details and decide what action
  + * needs to be taken. One of the actions could be to pass
  + * the error to the guest and have the guest driver recover
  + * the error. This requires that PCIe capabilities be
  + * exposed to the guest. At present, we just terminate the
  + * guest to contain the error.
  + */
  +error_report(%s(%04x:%02x:%02x.%x) 
  +Unrecoverable error detected... Terminating guest\n,
  +__func__, vdev-host.domain, vdev-host.bus, vdev-host.slot,
  +vdev-host.function);
  +
  +qemu_system_shutdown_request();
 
 I would have figured hw_error

Hw_error() is probably more appropriate. Will make the change.

 
  +return;
  +}
  +
  +static void vfio_register_errfd(VFIODevice *vdev)
  +{
  +int32_t pfd;
 
 pfd is used elsewhere in vfio as Pointer to FD, this is just a fd.

Will change.

 
  +int ret;
  +
  +if (!(vdev-dev_info_flags  VFIO_DEVICE_FLAGS_AER_NOTIFY)) {
  +error_report(vfio: Warning: Error notification not supported
 for the device\n);
 
 This should probably just be a debug print.

Okay.

 
  +return;
  +}
  +if (event_notifier_init(vdev-errfd, 0)) {
  +error_report(vfio: Warning: Unable to init event notifier for
 error detection\n);
  +return;
  +}
  +pfd = event_notifier_get_fd(vdev-errfd);
  +qemu_set_fd_handler(pfd, vfio_errfd_handler, NULL, vdev);
  +
  +ret = ioctl(vdev-fd, VFIO_DEVICE_SET_ERRFD, pfd);
  +if (ret) {
  +error_report(vfio: Warning: Failed to setup error fd: %d\n,
 ret);
  +qemu_set_fd_handler(pfd, NULL, NULL, vdev);
  +event_notifier_cleanup(vdev-errfd);
  +}
  +return;
  +}
   static int vfio_initfn(PCIDevice *pdev)
   {
   VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
  @@ -2010,6 +2064,8 @@ static int vfio_initfn(PCIDevice *pdev)
   }
   }
 
  +vfio_register_errfd(vdev);
  +
 
 No cleanup in exitfn?!  Thanks,

Missed that. Will fix.

Vijay

 
 Alex
 
   return 0;
 
   out_teardown:
  diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
  index 4758d1b..0ca4eeb 100644
  --- a/linux-headers/linux/vfio.h
  +++ 

RE: [Qemu-devel] [PATCH 2/2] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-01-11 Thread Pandarathil, Vijaymohan R


 -Original Message-
 From: Blue Swirl [mailto:blauwir...@gmail.com]
 Sent: Wednesday, January 09, 2013 1:23 PM
 To: Pandarathil, Vijaymohan R
 Cc: Alex Williamson; Bjorn Helgaas; Gleb Natapov; linux-
 p...@vger.kernel.org; qemu-de...@nongnu.org; kvm@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: [Qemu-devel] [PATCH 2/2] QEMU-AER: Qemu changes to support AER
 for VFIO-PCI devices
 
 On Wed, Jan 9, 2013 at 6:26 AM, Pandarathil, Vijaymohan R
 vijaymohan.pandarat...@hp.com wrote:
  - Create eventfd per vfio device assigned to a guest and register
 an
event handler
 
  - This fd is passed to the vfio_pci driver through a new ioctl
 
  - When the device encounters an error, the eventfd is signaled
and the qemu eventfd handler gets invoked.
 
  - In the handler decide what action to take. Current action taken
is to terminate the guest.
 
  Signed-off-by: Vijay Mohan Pandarathil vijaymohan.pandarat...@hp.com
  ---
   hw/vfio_pci.c  | 56
 ++
   linux-headers/linux/vfio.h |  9 
   2 files changed, 65 insertions(+)
 
  diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
  index 28c8303..9c3c28b 100644
  --- a/hw/vfio_pci.c
  +++ b/hw/vfio_pci.c
  @@ -38,6 +38,7 @@
   #include qemu/error-report.h
   #include qemu/queue.h
   #include qemu/range.h
  +#include sysemu/sysemu.h
 
   /* #define DEBUG_VFIO */
   #ifdef DEBUG_VFIO
  @@ -130,6 +131,8 @@ typedef struct VFIODevice {
   QLIST_ENTRY(VFIODevice) next;
   struct VFIOGroup *group;
   bool reset_works;
  +EventNotifier errfd;
  +__u32 dev_info_flags;
 
 QEMU is not kernel code, please use uint32_t.

As you saw later in the code, I was using the same type in the structure 
copied from the kernel. Will change this to uint32_t.

Thanks

Vijay

 
   } VFIODevice;
 
   typedef struct VFIOGroup {
  @@ -1805,6 +1808,8 @@ static int vfio_get_device(VFIOGroup *group, const
 char *name, VFIODevice *vdev)
   DPRINTF(Device %s flags: %u, regions: %u, irgs: %u\n, name,
   dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
 
  +vdev-dev_info_flags = dev_info.flags;
  +
   if (!(dev_info.flags  VFIO_DEVICE_FLAGS_PCI)) {
   error_report(vfio: Um, this isn't a PCI device\n);
   goto error;
  @@ -1900,6 +1905,55 @@ static void vfio_put_device(VFIODevice *vdev)
   }
   }
 
  +static void vfio_errfd_handler(void *opaque)
  +{
  +VFIODevice *vdev = opaque;
  +
  +if (!event_notifier_test_and_clear(vdev-errfd)) {
  +return;
  +}
  +
  +/*
  + * TBD. Retrieve the error details and decide what action
  + * needs to be taken. One of the actions could be to pass
  + * the error to the guest and have the guest driver recover
  + * the error. This requires that PCIe capabilities be
  + * exposed to the guest. At present, we just terminate the
  + * guest to contain the error.
  + */
  +error_report(%s(%04x:%02x:%02x.%x) 
  +Unrecoverable error detected... Terminating guest\n,
  +__func__, vdev-host.domain, vdev-host.bus, vdev-host.slot,
  +vdev-host.function);
  +
  +qemu_system_shutdown_request();
  +return;
  +}
  +
  +static void vfio_register_errfd(VFIODevice *vdev)
  +{
  +int32_t pfd;
  +int ret;
  +
  +if (!(vdev-dev_info_flags  VFIO_DEVICE_FLAGS_AER_NOTIFY)) {
  +error_report(vfio: Warning: Error notification not supported
 for the device\n);
  +return;
  +}
  +if (event_notifier_init(vdev-errfd, 0)) {
  +error_report(vfio: Warning: Unable to init event notifier for
 error detection\n);
  +return;
  +}
  +pfd = event_notifier_get_fd(vdev-errfd);
  +qemu_set_fd_handler(pfd, vfio_errfd_handler, NULL, vdev);
  +
  +ret = ioctl(vdev-fd, VFIO_DEVICE_SET_ERRFD, pfd);
  +if (ret) {
  +error_report(vfio: Warning: Failed to setup error fd: %d\n,
 ret);
  +qemu_set_fd_handler(pfd, NULL, NULL, vdev);
  +event_notifier_cleanup(vdev-errfd);
  +}
  +return;
  +}
   static int vfio_initfn(PCIDevice *pdev)
   {
   VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
  @@ -2010,6 +2064,8 @@ static int vfio_initfn(PCIDevice *pdev)
   }
   }
 
  +vfio_register_errfd(vdev);
  +
   return 0;
 
   out_teardown:
  diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
  index 4758d1b..0ca4eeb 100644
  --- a/linux-headers/linux/vfio.h
  +++ b/linux-headers/linux/vfio.h
  @@ -147,6 +147,7 @@ struct vfio_device_info {
  __u32   flags;
   #define VFIO_DEVICE_FLAGS_RESET(1  0)/* Device
 supports reset */
   #define VFIO_DEVICE_FLAGS_PCI  (1  1)/* vfio-pci device */
  +#define VFIO_DEVICE_FLAGS_AER_NOTIFY (1  2)   /* Supports aer
 notification */
  __u32   num_regions;/* Max region index + 1 */
  __u32   num_irqs;  

[PATCH 0/4] KVM: Clean up and optimize __kvm_set_memory_region() - part1

2013-01-11 Thread Takuya Yoshikawa
Patches 1 to 3 are trivial.

Patch 4 is the main cause of the increased lines, but I think the new
code makes it easier to understand why each condition in
__kvm_set_memory_region() is there.

If you don't agree with patch 4, please consider taking the rest of the
series at this time.

Takuya Yoshikawa (4):
  KVM: set_memory_region: Don't jump to out_free unnecessarily
  KVM: set_memory_region: Don't check for overlaps unless we create or move a 
slot
  KVM: set_memory_region: Remove unnecessary variable memslot
  KVM: set_memory_region: Identify the requested change explicitly

 virt/kvm/kvm_main.c |   94 ---
 1 files changed, 59 insertions(+), 35 deletions(-)

-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] KVM: set_memory_region: Don't jump to out_free unnecessarily

2013-01-11 Thread Takuya Yoshikawa
This makes the separation between the sanity checks and the rest of the
code a bit clearer.

Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp
---
 virt/kvm/kvm_main.c |7 +++
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e45c20c..d5e4bf9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -778,9 +778,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 */
r = -EINVAL;
if (npages  old.npages  npages != old.npages)
-   goto out_free;
+   goto out;
if (!npages  !old.npages)
-   goto out_free;
+   goto out;
 
/* Check for overlaps */
r = -EEXIST;
@@ -789,7 +789,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
continue;
if (!((base_gfn + npages = slot-base_gfn) ||
  (base_gfn = slot-base_gfn + slot-npages)))
-   goto out_free;
+   goto out;
}
 
/* Free page dirty bitmap if unneeded */
@@ -891,7 +891,6 @@ out_free:
kvm_free_physmem_slot(new, old);
 out:
return r;
-
 }
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] KVM: set_memory_region: Don't check for overlaps unless we create or move a slot

2013-01-11 Thread Takuya Yoshikawa
Don't need the check for deleting an existing slot or just modifiying
the flags.

Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp
---
 virt/kvm/kvm_main.c |   18 ++
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d5e4bf9..f6c8cdc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -782,14 +782,16 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (!npages  !old.npages)
goto out;
 
-   /* Check for overlaps */
-   r = -EEXIST;
-   kvm_for_each_memslot(slot, kvm-memslots) {
-   if (slot-id = KVM_USER_MEM_SLOTS || slot == memslot)
-   continue;
-   if (!((base_gfn + npages = slot-base_gfn) ||
- (base_gfn = slot-base_gfn + slot-npages)))
-   goto out;
+   if ((npages  !old.npages) || (base_gfn != old.base_gfn)) {
+   /* Check for overlaps */
+   r = -EEXIST;
+   kvm_for_each_memslot(slot, kvm-memslots) {
+   if (slot-id = KVM_USER_MEM_SLOTS || slot == memslot)
+   continue;
+   if (!((base_gfn + npages = slot-base_gfn) ||
+ (base_gfn = slot-base_gfn + slot-npages)))
+   goto out;
+   }
}
 
/* Free page dirty bitmap if unneeded */
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] KVM: set_memory_region: Remove unnecessary variable memslot

2013-01-11 Thread Takuya Yoshikawa
One such variable, slot, is enough for holding a pointer temporarily.
We also remove another local variable named slot, which is limited in
a block, since it is confusing to have the same name in this function.

Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp
---
 virt/kvm/kvm_main.c |   11 +--
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f6c8cdc..8cb4e71 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -728,7 +728,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
int r;
gfn_t base_gfn;
unsigned long npages;
-   struct kvm_memory_slot *memslot, *slot;
+   struct kvm_memory_slot *slot;
struct kvm_memory_slot old, new;
struct kvm_memslots *slots = NULL, *old_memslots;
 
@@ -754,7 +754,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (mem-guest_phys_addr + mem-memory_size  mem-guest_phys_addr)
goto out;
 
-   memslot = id_to_memslot(kvm-memslots, mem-slot);
+   slot = id_to_memslot(kvm-memslots, mem-slot);
base_gfn = mem-guest_phys_addr  PAGE_SHIFT;
npages = mem-memory_size  PAGE_SHIFT;
 
@@ -765,7 +765,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (!npages)
mem-flags = ~KVM_MEM_LOG_DIRTY_PAGES;
 
-   new = old = *memslot;
+   new = old = *slot;
 
new.id = mem-slot;
new.base_gfn = base_gfn;
@@ -786,7 +786,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
/* Check for overlaps */
r = -EEXIST;
kvm_for_each_memslot(slot, kvm-memslots) {
-   if (slot-id = KVM_USER_MEM_SLOTS || slot == memslot)
+   if ((slot-id = KVM_USER_MEM_SLOTS) ||
+   (slot-id == mem-slot))
continue;
if (!((base_gfn + npages = slot-base_gfn) ||
  (base_gfn = slot-base_gfn + slot-npages)))
@@ -823,8 +824,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
}
 
if (!npages || base_gfn != old.base_gfn) {
-   struct kvm_memory_slot *slot;
-
r = -ENOMEM;
slots = kmemdup(kvm-memslots, sizeof(struct kvm_memslots),
GFP_KERNEL);
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] KVM: set_memory_region: Identify the requested change explicitly

2013-01-11 Thread Takuya Yoshikawa
KVM_SET_USER_MEMORY_REGION forces __kvm_set_memory_region() to identify
what kind of change is being requested by checking the parameters.  The
current code does this checking at various points in code and each
condition used there is not easy to understand at first glance.

This patch consolidates these and introduces an enum to name the
possible changes to clean up the code.

Although this does not introduce any user visible changes, there are two
changes which optimize the code a bit:

  1. If we have nothing to change, the new code returns immediately.
  2. If we just modify the flags, no changes about the mapping, the
 new code does not call kvm_iommu_map_pages().

One interesting thing we noticed was about the case 1.  Although this
case seemed unimportant, QEMU was relying on the current behaviour:
when we returned a value other than 0, e.g. -EINVAL, live migration
failed at the final stage with a section mismatch error.

Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp
---
 virt/kvm/kvm_main.c |   64 +++
 1 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8cb4e71..5426e96 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -714,6 +714,24 @@ static struct kvm_memslots *install_new_memslots(struct 
kvm *kvm,
 }
 
 /*
+ * KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
+ * - create a new memory slot
+ * - delete an existing memory slot
+ * - modify an existing memory slot
+ *   -- move it in the guest physical memory space
+ *   -- just change its flags
+ *
+ * Since flags can be changed by some of these operations, the following
+ * differentiation is the best we can do for __kvm_set_memory_region():
+ */
+enum kvm_mr_change {
+   KVM_MR_CREATE,
+   KVM_MR_DELETE,
+   KVM_MR_MOVE,
+   KVM_MR_FLAGS_ONLY,
+};
+
+/*
  * Allocate some memory and give it an address in the guest physical address
  * space.
  *
@@ -731,6 +749,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
struct kvm_memory_slot *slot;
struct kvm_memory_slot old, new;
struct kvm_memslots *slots = NULL, *old_memslots;
+   enum kvm_mr_change change;
 
r = check_memory_region_flags(mem);
if (r)
@@ -772,17 +791,30 @@ int __kvm_set_memory_region(struct kvm *kvm,
new.npages = npages;
new.flags = mem-flags;
 
-   /*
-* Disallow changing a memory slot's size or changing anything about
-* zero sized slots that doesn't involve making them non-zero.
-*/
r = -EINVAL;
-   if (npages  old.npages  npages != old.npages)
-   goto out;
-   if (!npages  !old.npages)
+   if (npages) {
+   if (!old.npages)
+   change = KVM_MR_CREATE;
+   else { /* Modify an existing slot. */
+   if ((mem-userspace_addr != old.userspace_addr) ||
+   (npages != old.npages))
+   goto out;
+
+   if (base_gfn != old.base_gfn)
+   change = KVM_MR_MOVE;
+   else if (new.flags != old.flags)
+   change = KVM_MR_FLAGS_ONLY;
+   else { /* Nothing to change. */
+   r = 0;
+   goto out;
+   }
+   }
+   } else if (old.npages) {
+   change = KVM_MR_DELETE;
+   } else /* Modify a non-existent slot: disallowed. */
goto out;
 
-   if ((npages  !old.npages) || (base_gfn != old.base_gfn)) {
+   if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
/* Check for overlaps */
r = -EEXIST;
kvm_for_each_memslot(slot, kvm-memslots) {
@@ -800,20 +832,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
new.dirty_bitmap = NULL;
 
r = -ENOMEM;
-
-   /*
-* Allocate if a slot is being created.  If modifying a slot,
-* the userspace_addr cannot change.
-*/
-   if (!old.npages) {
+   if (change == KVM_MR_CREATE) {
new.user_alloc = user_alloc;
new.userspace_addr = mem-userspace_addr;
 
if (kvm_arch_create_memslot(new, npages))
goto out_free;
-   } else if (npages  mem-userspace_addr != old.userspace_addr) {
-   r = -EINVAL;
-   goto out_free;
}
 
/* Allocate page dirty bitmap if needed */
@@ -823,7 +847,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
/* destroy any largepage mappings for dirty tracking */
}
 
-   if (!npages || base_gfn != old.base_gfn) {
+   if ((change == KVM_MR_DELETE) || (change == KVM_MR_MOVE)) {
r = -ENOMEM;
slots = kmemdup(kvm-memslots, sizeof(struct 

Re: [RFC PATCH] virtio-net: introduce a new macaddr control

2013-01-11 Thread Stefan Hajnoczi
On Thu, Jan 10, 2013 at 10:51:57PM +0800, ak...@redhat.com wrote:
 @@ -349,6 +351,13 @@ static int virtio_net_handle_mac(VirtIONet *n, uint8_t 
 cmd,
  {
  struct virtio_net_ctrl_mac mac_data;
  
 +if (cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET  elem-out_num == 2) {
 +/* Set MAC address */
 +memcpy(n-mac, elem-out_sg[1].iov_base, elem-out_sg[1].iov_len);

We cannot trust the guest's iov_len, it could overflow n-mac.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 2/3] x86, apicv: add virtual x2apic support

2013-01-11 Thread Gleb Natapov
On Fri, Jan 11, 2013 at 02:37:15AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-01-10:
  On Thu, Jan 10, 2013 at 08:32:06AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2013-01-10:
  On Thu, Jan 10, 2013 at 03:26:07PM +0800, Yang Zhang wrote:
  From: Yang Zhang yang.z.zh...@intel.com
  
  basically to benefit from apicv, we need to enable virtualized x2apic 
  mode.
  Currently, we only enable it when guest is really using x2apic.
  
  Also, clear MSR bitmap for corresponding x2apic MSRs when guest enabled
  x2apic:
  0x800 - 0x8ff: no read intercept for apicv register virtualization,
  except APIC ID and TMCCT.
  APIC ID and TMCCT: need software's assistance to get right value.
  TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery.
  Signed-off-by: Kevin Tian kevin.t...@intel.com
  Signed-off-by: Yang Zhang yang.z.zh...@intel.com
  ---
   arch/x86/include/asm/kvm_host.h |2 + arch/x86/include/asm/vmx.h
 |1 + arch/x86/kvm/lapic.c|5 +-
 arch/x86/kvm/svm.c  |6 + arch/x86/kvm/vmx.c |  194
 +-- 5 files
  changed, 200
   insertions(+), 8 deletions(-)
  diff --git a/arch/x86/include/asm/kvm_host.h
  b/arch/x86/include/asm/kvm_host.h index c431b33..572a562 100644 ---
  a/arch/x86/include/asm/kvm_host.h +++
  b/arch/x86/include/asm/kvm_host.h @@ -697,6 +697,8 @@ struct
  kvm_x86_ops {
   void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
   void (*enable_irq_window)(struct kvm_vcpu *vcpu);
   void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, 
  int irr);
  +void (*enable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
  +void (*disable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
  Make one callback with enable/disable parameter. And do not forget SVM.
  
   int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
   int (*get_tdp_level)(void);
   u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool
  is_mmio);
  diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
  index 44c3f7e..0a54df0 100644
  --- a/arch/x86/include/asm/vmx.h
  +++ b/arch/x86/include/asm/vmx.h
  @@ -139,6 +139,7 @@
   #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x0001 #define
   SECONDARY_EXEC_ENABLE_EPT   0x0002 #define
   SECONDARY_EXEC_RDTSCP   0x0008 +#define
   SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE   0x0010 #define
   SECONDARY_EXEC_ENABLE_VPID  0x0020 #define
   SECONDARY_EXEC_WBINVD_EXITING   0x0040 #define
   SECONDARY_EXEC_UNRESTRICTED_GUEST   0x0080
  diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
  index 0664c13..ec38906 100644
  --- a/arch/x86/kvm/lapic.c
  +++ b/arch/x86/kvm/lapic.c
  @@ -1328,7 +1328,10 @@ void kvm_lapic_set_base(struct kvm_vcpu
  *vcpu,
  u64 value)
   u32 id = kvm_apic_id(apic);
   u32 ldr = ((id  4)  16) | (1  (id  0xf));
   kvm_apic_set_ldr(apic, ldr);
  -}
  +kvm_x86_ops-enable_virtual_x2apic_mode(vcpu);
  +} else
  +kvm_x86_ops-disable_virtual_x2apic_mode(vcpu);
  +
  You just broke SVM.
   apic-base_address = apic-vcpu-arch.apic_base 
MSR_IA32_APICBASE_BASE;
  diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
  index d29d3cd..0b82cb1 100644
  --- a/arch/x86/kvm/svm.c
  +++ b/arch/x86/kvm/svm.c
  @@ -3571,6 +3571,11 @@ static void update_cr8_intercept(struct
  kvm_vcpu
  *vcpu, int tpr, int irr)
   set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
   }
  +static void svm_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
  +{
  +return;
  +}
  +
   static int svm_nmi_allowed(struct kvm_vcpu *vcpu) { struct vcpu_svm
   *svm = to_svm(vcpu); @@ -4290,6 +4295,7 @@ static struct kvm_x86_ops
   svm_x86_ops = { .enable_nmi_window = enable_nmi_window,
   .enable_irq_window = enable_irq_window, 
  .update_cr8_intercept =
   update_cr8_intercept,
  +.enable_virtual_x2apic_mode = svm_enable_virtual_x2apic_mode,
  
   .set_tss_addr = svm_set_tss_addr,
   .get_tdp_level = get_npt_level,
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 688f43f..b203ce7 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -433,6 +433,8 @@ struct vcpu_vmx {
  
   bool rdtscp_enabled;
  +bool virtual_x2apic_enabled;
  +
   /* Support for a guest hypervisor (nested VMX) */
   struct nested_vmx nested;
   };
  @@ -767,12 +769,23 @@ static inline bool
  cpu_has_vmx_virtualize_apic_accesses(void)
   SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
   }
  +static inline bool cpu_has_vmx_virtualize_x2apic_mode(void)
  +{
  +return vmcs_config.cpu_based_2nd_exec_ctrl 
  +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
  +}
  +
  

[uq/master PATCH] vmxcap: bit 9 of VMX_PROCBASED_CTLS2 is 'virtual interrupt delivery'

2013-01-11 Thread Marcelo Tosatti

Bit 9 of MSR_IA32_VMX_PROCBASED_CTLS2 is
virtual interrupt delivery.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/scripts/kvm/vmxcap b/scripts/kvm/vmxcap
index cbe6440..0b23f77 100755
--- a/scripts/kvm/vmxcap
+++ b/scripts/kvm/vmxcap
@@ -147,6 +147,7 @@ controls = [
 5: 'Enable VPID',
 6: 'WBINVD exiting',
 7: 'Unrestricted guest',
+9: 'Virtual interrupt delivery',
 10: 'PAUSE-loop exiting',
 11: 'RDRAND exiting',
 12: 'Enable INVPCID',
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 5/5] KVM: x86: improve reexecute_instruction

2013-01-11 Thread Marcelo Tosatti
On Fri, Jan 11, 2013 at 04:18:22AM +0800, Xiao Guangrong wrote:
 On 01/11/2013 03:48 AM, Marcelo Tosatti wrote:
  On Fri, Jan 11, 2013 at 02:05:33AM +0800, Xiao Guangrong wrote:
  On 01/11/2013 01:26 AM, Marcelo Tosatti wrote:
  On Tue, Jan 08, 2013 at 02:38:36PM +0800, Xiao Guangrong wrote:
  The current reexecute_instruction can not well detect the failed 
  instruction
  emulation. It allows guest to retry all the instructions except it 
  accesses
  on error pfn
 
  For example, some cases are nested-write-protect - if the page we want to
  write is used as PDE but it chains to itself. Under this case, we should
  stop the emulation and report the case to userspace
 
  Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
  ---
   arch/x86/include/asm/kvm_host.h |7 +++
   arch/x86/kvm/paging_tmpl.h  |   27 ---
   arch/x86/kvm/x86.c  |8 +++-
   3 files changed, 34 insertions(+), 8 deletions(-)
 
  diff --git a/arch/x86/include/asm/kvm_host.h 
  b/arch/x86/include/asm/kvm_host.h
  index c431b33..d6ab8d2 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -502,6 +502,13 @@ struct kvm_vcpu_arch {
   u64 msr_val;
   struct gfn_to_hva_cache data;
   } pv_eoi;
  +
  +/*
  + * Indicate whether the access faults on its page table in guest
  + * which is set when fix page fault and used to detect 
  unhandeable
  + * instruction.
  + */
  +bool write_fault_to_shadow_pgtable;
   };
 
   struct kvm_lpage_info {
  diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
  index 67b390d..df50560 100644
  --- a/arch/x86/kvm/paging_tmpl.h
  +++ b/arch/x86/kvm/paging_tmpl.h
  @@ -497,26 +497,34 @@ out_gpte_changed:
* created when kvm establishes shadow page table that stop kvm using 
  large
* page size. Do it early can avoid unnecessary #PF and emulation.
*
  + * @write_fault_to_shadow_pgtable will return true if the fault gfn is
  + * currently used as its page table.
  + *
* Note: the PDPT page table is not checked for PAE-32 bit guest. It is 
  ok
* since the PDPT is always shadowed, that means, we can not use large 
  page
* size to map the gfn which is used as PDPT.
*/
   static bool
   FNAME(is_self_change_mapping)(struct kvm_vcpu *vcpu,
  -  struct guest_walker *walker, int 
  user_fault)
  +  struct guest_walker *walker, int 
  user_fault,
  +  bool *write_fault_to_shadow_pgtable)
   {
   int level;
   gfn_t mask = ~(KVM_PAGES_PER_HPAGE(walker-level) - 1);
  +bool self_changed = false;
 
   if (!(walker-pte_access  ACC_WRITE_MASK ||
 (!is_write_protection(vcpu)  !user_fault)))
   return false;
 
  -for (level = walker-level; level = walker-max_level; level++)
  -if (!((walker-gfn ^ walker-table_gfn[level - 1])  
  mask))
  -return true;
  +for (level = walker-level; level = walker-max_level; 
  level++) {
  +gfn_t gfn = walker-gfn ^ walker-table_gfn[level - 1];
  +
  +self_changed |= !(gfn  mask);
  +*write_fault_to_shadow_pgtable |= !gfn;
  +}
 
  -return false;
  +return self_changed;
   }
 
   /*
  @@ -544,7 +552,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, 
  gva_t addr, u32 error_code,
   int level = PT_PAGE_TABLE_LEVEL;
   int force_pt_level;
   unsigned long mmu_seq;
  -bool map_writable;
  +bool map_writable, is_self_change_mapping;
 
   pgprintk(%s: addr %lx err %x\n, __func__, addr, error_code);
 
  @@ -572,9 +580,14 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, 
  gva_t addr, u32 error_code,
   return 0;
   }
 
  +vcpu-arch.write_fault_to_shadow_pgtable = false;
  +
  +is_self_change_mapping = FNAME(is_self_change_mapping)(vcpu,
  +  walker, user_fault, 
  vcpu-arch.write_fault_to_shadow_pgtable);
  +
   if (walker.level = PT_DIRECTORY_LEVEL)
   force_pt_level = mapping_level_dirty_bitmap(vcpu, 
  walker.gfn)
  -   || FNAME(is_self_change_mapping)(vcpu, walker, 
  user_fault);
  +   || is_self_change_mapping;
   else
   force_pt_level = 1;
   if (!force_pt_level) {
  diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
  index 6f13e03..2957012 100644
  --- a/arch/x86/kvm/x86.c
  +++ b/arch/x86/kvm/x86.c
  @@ -4810,7 +4810,13 @@ static bool reexecute_instruction(struct kvm_vcpu 
  *vcpu, gva_t cr2)
* guest to let CPU execute the instruction.
*/
   kvm_mmu_unprotect_page(vcpu-kvm, gpa_to_gfn(gpa));
  -return true;
  +
  +/*
  + * If the 

Re: [PATCH v5 5/5] KVM: x86: improve reexecute_instruction

2013-01-11 Thread Marcelo Tosatti
On Fri, Jan 11, 2013 at 02:16:11AM +0800, Xiao Guangrong wrote:
 On 01/11/2013 01:30 AM, Marcelo Tosatti wrote:
  On Tue, Jan 08, 2013 at 02:38:36PM +0800, Xiao Guangrong wrote:
  The current reexecute_instruction can not well detect the failed 
  instruction
  emulation. It allows guest to retry all the instructions except it accesses
  on error pfn
 
  For example, some cases are nested-write-protect - if the page we want to
  write is used as PDE but it chains to itself. Under this case, we should
  stop the emulation and report the case to userspace
 
  Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
  ---
   arch/x86/include/asm/kvm_host.h |7 +++
   arch/x86/kvm/paging_tmpl.h  |   27 ---
   arch/x86/kvm/x86.c  |8 +++-
   3 files changed, 34 insertions(+), 8 deletions(-)
 
  diff --git a/arch/x86/include/asm/kvm_host.h 
  b/arch/x86/include/asm/kvm_host.h
  index c431b33..d6ab8d2 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -502,6 +502,13 @@ struct kvm_vcpu_arch {
 u64 msr_val;
 struct gfn_to_hva_cache data;
 } pv_eoi;
  +
  +  /*
  +   * Indicate whether the access faults on its page table in guest
  +   * which is set when fix page fault and used to detect unhandeable
  +   * instruction.
  +   */
  +  bool write_fault_to_shadow_pgtable;
   };
 
   struct kvm_lpage_info {
  diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
  index 67b390d..df50560 100644
  --- a/arch/x86/kvm/paging_tmpl.h
  +++ b/arch/x86/kvm/paging_tmpl.h
  @@ -497,26 +497,34 @@ out_gpte_changed:
* created when kvm establishes shadow page table that stop kvm using 
  large
* page size. Do it early can avoid unnecessary #PF and emulation.
*
  + * @write_fault_to_shadow_pgtable will return true if the fault gfn is
  + * currently used as its page table.
  + *
* Note: the PDPT page table is not checked for PAE-32 bit guest. It is ok
* since the PDPT is always shadowed, that means, we can not use large 
  page
* size to map the gfn which is used as PDPT.
*/
   static bool
   FNAME(is_self_change_mapping)(struct kvm_vcpu *vcpu,
  -struct guest_walker *walker, int user_fault)
  +struct guest_walker *walker, int user_fault,
  +bool *write_fault_to_shadow_pgtable)
   {
 int level;
 gfn_t mask = ~(KVM_PAGES_PER_HPAGE(walker-level) - 1);
  +  bool self_changed = false;
 
 if (!(walker-pte_access  ACC_WRITE_MASK ||
   (!is_write_protection(vcpu)  !user_fault)))
 return false;
 
  -  for (level = walker-level; level = walker-max_level; level++)
  -  if (!((walker-gfn ^ walker-table_gfn[level - 1])  mask))
  -  return true;
  +  for (level = walker-level; level = walker-max_level; level++) {
  +  gfn_t gfn = walker-gfn ^ walker-table_gfn[level - 1];
  +
  +  self_changed |= !(gfn  mask);
  +  *write_fault_to_shadow_pgtable |= !gfn;
  +  }
 
  -  return false;
  +  return self_changed;
   }
 
   /*
  @@ -544,7 +552,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, 
  gva_t addr, u32 error_code,
 int level = PT_PAGE_TABLE_LEVEL;
 int force_pt_level;
 unsigned long mmu_seq;
  -  bool map_writable;
  +  bool map_writable, is_self_change_mapping;
 
 pgprintk(%s: addr %lx err %x\n, __func__, addr, error_code);
 
  @@ -572,9 +580,14 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, 
  gva_t addr, u32 error_code,
 return 0;
 }
 
  +  vcpu-arch.write_fault_to_shadow_pgtable = false;
  +
  +  is_self_change_mapping = FNAME(is_self_change_mapping)(vcpu,
  +walker, user_fault, vcpu-arch.write_fault_to_shadow_pgtable);
  +
 if (walker.level = PT_DIRECTORY_LEVEL)
 force_pt_level = mapping_level_dirty_bitmap(vcpu, walker.gfn)
  - || FNAME(is_self_change_mapping)(vcpu, walker, user_fault);
  + || is_self_change_mapping;
 else
 force_pt_level = 1;
 if (!force_pt_level) {
  diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
  index 6f13e03..2957012 100644
  --- a/arch/x86/kvm/x86.c
  +++ b/arch/x86/kvm/x86.c
  @@ -4810,7 +4810,13 @@ static bool reexecute_instruction(struct kvm_vcpu 
  *vcpu, gva_t cr2)
  * guest to let CPU execute the instruction.
  */
 kvm_mmu_unprotect_page(vcpu-kvm, gpa_to_gfn(gpa));
  -  return true;
  +
  +  /*
  +   * If the access faults on its page table, it can not
  +   * be fixed by unprotecting shadow page and it should
  +   * be reported to userspace.
  +   */
  +  return !vcpu-arch.write_fault_to_shadow_pgtable;
   }
 
   static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
  
  Also should make sure vcpu-arch.write_fault_to_shadow_pgtable is never
  reused. Say, clean when exiting x86_emulate_instruction?
 
 Yes, it is more clear.
 
 But i am thinking if it is 

Re: [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support

2013-01-11 Thread Gleb Natapov
On Thu, Jan 10, 2013 at 07:36:28PM -0200, Marcelo Tosatti wrote:
 Hi,
 
 Getting into good shape.
 
 On Thu, Jan 10, 2013 at 03:26:08PM +0800, Yang Zhang wrote:
  From: Yang Zhang yang.z.zh...@intel.com
  
  Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
  manually, which is fully taken care of by the hardware. This needs
  some special awareness into existing interrupr injection path:
  
  - for pending interrupt, instead of direct injection, we may need
update architecture specific indicators before resuming to guest.
  
  - A pending interrupt, which is masked by ISR, should be also
considered in above update action, since hardware will decide
when to inject it at right time. Current has_interrupt and
get_interrupt only returns a valid vector from injection p.o.v.
  
  Signed-off-by: Kevin Tian kevin.t...@intel.com
  Signed-off-by: Yang Zhang yang.z.zh...@intel.com
  ---
   arch/x86/include/asm/kvm_host.h |5 +
   arch/x86/include/asm/vmx.h  |   11 +++
   arch/x86/kvm/irq.c  |   56 +++-
   arch/x86/kvm/lapic.c|   72 +--
   arch/x86/kvm/lapic.h|   23 +
   arch/x86/kvm/svm.c  |   18 
   arch/x86/kvm/vmx.c  |  191 
  +--
   arch/x86/kvm/x86.c  |   14 +++-
   include/linux/kvm_host.h|3 +
   virt/kvm/ioapic.c   |   18 
   virt/kvm/ioapic.h   |4 +
   virt/kvm/irq_comm.c |   22 +
   virt/kvm/kvm_main.c |5 +
   13 files changed, 399 insertions(+), 43 deletions(-)
  
 
   static void recalculate_apic_map(struct kvm *kvm)
   {
  struct kvm_apic_map *new, *old = NULL;
  @@ -236,12 +219,14 @@ static inline void kvm_apic_set_id(struct kvm_lapic 
  *apic, u8 id)
   {
  apic_set_reg(apic, APIC_ID, id  24);
  recalculate_apic_map(apic-vcpu-kvm);
  +   ioapic_update_eoi_exitmap(apic-vcpu-kvm);
   }
 
 Move ioapic_update_eoi_exitmap into recalculate_apic_map.
 
   static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
   {
  apic_set_reg(apic, APIC_LDR, id);
  recalculate_apic_map(apic-vcpu-kvm);
  +   ioapic_update_eoi_exitmap(apic-vcpu-kvm);
   }
   
 
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index b203ce7..990409a 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -434,6 +434,7 @@ struct vcpu_vmx {
  bool rdtscp_enabled;
   
  bool virtual_x2apic_enabled;
  +   unsigned long eoi_exit_bitmap[4];
 
 Use DECLARE_BITMAP (unsigned long is 4 bytes on 32-bit host).
 
  /* Support for a guest hypervisor (nested VMX) */
  struct nested_vmx nested;
  @@ -783,7 +784,8 @@ static inline bool cpu_has_vmx_apic_register_virt(void)
   
   static inline bool cpu_has_vmx_virtual_intr_delivery(void)
   {
  -   return false;
  +   return vmcs_config.cpu_based_2nd_exec_ctrl 
  +   SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
   }
   
   static inline bool cpu_has_vmx_flexpriority(void)
  @@ -2565,7 +2567,8 @@ static __init int setup_vmcs_config(struct 
  vmcs_config *vmcs_conf)
  SECONDARY_EXEC_PAUSE_LOOP_EXITING |
  SECONDARY_EXEC_RDTSCP |
  SECONDARY_EXEC_ENABLE_INVPCID |
  -   SECONDARY_EXEC_APIC_REGISTER_VIRT;
  +   SECONDARY_EXEC_APIC_REGISTER_VIRT |
  +   SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
  if (adjust_vmx_controls(min2, opt2,
  MSR_IA32_VMX_PROCBASED_CTLS2,
  _cpu_based_2nd_exec_control)  0)
  @@ -2579,7 +2582,8 @@ static __init int setup_vmcs_config(struct 
  vmcs_config *vmcs_conf)
   
  if (!(_cpu_based_exec_control  CPU_BASED_TPR_SHADOW))
  _cpu_based_2nd_exec_control = ~(
  -   SECONDARY_EXEC_APIC_REGISTER_VIRT);
  +   SECONDARY_EXEC_APIC_REGISTER_VIRT |
  +   SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
 
 Nevermind the earlier comment about ().
 
  +static void set_eoi_exitmap_one(struct kvm_vcpu *vcpu,
  +   u32 vector)
  +{
  +   struct vcpu_vmx *vmx = to_vmx(vcpu);
  +
  +   if (WARN_ONCE((vector  255),
  +   KVM VMX: vector (%d) out of range\n, vector))
  +   return;
  +   __set_bit(vector, vmx-eoi_exit_bitmap);
  +}
  +
  +void vmx_check_ioapic_entry(struct kvm_vcpu *vcpu, struct kvm_lapic_irq 
  *irq)
  +{
  +   struct kvm_lapic **dst;
  +   struct kvm_apic_map *map;
  +   unsigned long bitmap = 1;
  +   int i;
  +
  +   rcu_read_lock();
  +   map = rcu_dereference(vcpu-kvm-arch.apic_map);
  +
  +   if (unlikely(!map)) {
  +   set_eoi_exitmap_one(vcpu, irq-vector);
  +   goto out;
  +   }
  +
  +   if (irq-dest_mode == 0) { /* physical mode */
  +   if (irq-delivery_mode == APIC_DM_LOWEST ||
  +   irq-dest_id == 0xff) {
  +   

[PATCH v5 6/5] KVM: x86: clear write_fault_to_shadow_pgtable explicitly

2013-01-11 Thread Xiao Guangrong
On 01/11/2013 09:15 PM, Marcelo Tosatti wrote:

 
 This is cryptic. Its not obvious at all for someone modifying the code, 
 for example.
 
 Can you please clear it explicitly? 

Sure, this is the patch to apply your idea, is it good to you? :)


Subject: [PATCH 6/6] KVM: x86: clear write_fault_to_shadow_pgtable explicitly

Clear it explicitly when exiting x86_emulate_instruction to clarify the code,
it is suggested by Marcelo

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |   16 
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2957012..89d01a8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4753,7 +4753,8 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu)
return r;
 }

-static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2)
+static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
+ bool write_fault_to_shadow_pgtable)
 {
gpa_t gpa = cr2;
pfn_t pfn;
@@ -4816,7 +4817,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, 
gva_t cr2)
 * be fixed by unprotecting shadow page and it should
 * be reported to userspace.
 */
-   return !vcpu-arch.write_fault_to_shadow_pgtable;
+   return !write_fault_to_shadow_pgtable;
 }

 static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
@@ -4875,7 +4876,13 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
int r;
struct x86_emulate_ctxt *ctxt = vcpu-arch.emulate_ctxt;
bool writeback = true;
+   bool write_fault_to_spt = vcpu-arch.write_fault_to_shadow_pgtable;

+   /*
+* Clear write_fault_to_shadow_pgtable here to ensure it is
+* never reused.
+*/
+   vcpu-arch.write_fault_to_shadow_pgtable = false;
kvm_clear_exception_queue(vcpu);

if (!(emulation_type  EMULTYPE_NO_DECODE)) {
@@ -4894,7 +4901,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
if (r != EMULATION_OK)  {
if (emulation_type  EMULTYPE_TRAP_UD)
return EMULATE_FAIL;
-   if (reexecute_instruction(vcpu, cr2))
+   if (reexecute_instruction(vcpu, cr2,
+ write_fault_to_spt))
return EMULATE_DONE;
if (emulation_type  EMULTYPE_SKIP)
return EMULATE_FAIL;
@@ -4924,7 +4932,7 @@ restart:
return EMULATE_DONE;

if (r == EMULATION_FAILED) {
-   if (reexecute_instruction(vcpu, cr2))
+   if (reexecute_instruction(vcpu, cr2, write_fault_to_spt))
return EMULATE_DONE;

return handle_emulation_failure(vcpu);
-- 
1.7.7.6


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/2] make mac programming for virtio net more robust

2013-01-11 Thread John Fastabend

On 1/10/2013 11:46 PM, Michael S. Tsirkin wrote:

On Fri, Jan 11, 2013 at 12:53:07PM +1030, Rusty Russell wrote:

Michael S. Tsirkin m...@redhat.com writes:


On Thu, Jan 10, 2013 at 10:45:39PM +0800, ak...@redhat.com wrote:

From: Amos Kong ak...@redhat.com

Currenly mac is programmed byte by byte. This means that we
have an intermediate step where mac is wrong.

Second patch introduced a new vq control command to set mac
address in one time.


As you mention we could alternatively do it without
new commands, simply add a feature bit that says that MACs are
in the mac table.
This would be a much bigger patch, and I'm fine with either way.
Rusty what do you think?


Hmm, mac filtering and my mac address are not quite the same thing.  I
don't know if it matters for anyone: does it?
The mac address is abused
for things like identifying machines, etc.


I don't know either. I think net core differentiates between mac and
uc_list because linux has to know which mac to use when building
up packets, so at some level, I agree it might be useful to identify the
machine.

BTW netdev/davem should have been copied on this, Amos I think it's a
good idea to remember to do it next time you post.



If we keep it as a separate concept, Amos' patch seems to make sense.


Yes. It also keeps the patch small, I just thought I'd mention the
option.



Cheers,
Rusty.




Don't have the entire context here but if you implement the
ndo_fdb_dump() probably hooking it up to ndo_dflt_fdb_dump() you could
use the 'bridge' tool dump the uc_list.

Then use ndo_fdb_add() and ndo_fdb_del() to add and remove entries
from the uc_list. We do this today in macvlan and the ixgbe driver when
it is in SR-IOV mode and the embedded switch needs to be programmed.

fdb is forwarding database its a bit different then mac filtering
in that its telling the switch how to forward mac addresses, in
ixgbe and macvlan at least we have been overloading it a bit to also
stop filtering the mac address. I think this makes sense if you setup
forwarding to a port it doesn't make much sense to then drop them.

Maybe its not entirely applicable here just thought I would mention it.

Thanks,
John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FreeBSD-amd64 fails to start with SMP on quemu-kvm

2013-01-11 Thread Artur Samborski

W dniu 09.01.2013 11:15, Gleb Natapov pisze:

On Tue, Jan 08, 2013 at 09:45:35PM +0100, Artur Samborski wrote:

W dniu 08.01.2013 00:00, Marcelo Tosatti pisze:

On Mon, Jan 07, 2013 at 06:13:22PM +0100, Artur Samborski wrote:

Hello,

When i try to run FreeBSD-amd64 on more than 1 vcpu in quemu-kvm
(Fedora Core 17) eg. to run FreeBSD-9.0-RELEASE-amd64 with:

qemu-kvm -m 1024m -cpu host -smp 2 -cdrom
/storage/iso/FreeBSD-9.0-RELEASE-amd64-dvd1.iso

it freezes KVM with:

KVM internal error. Suberror: 1
emulation failure
RAX=80b0d4c0 RBX=0009f000 RCX=c080
RDX=
RSI=d238 RDI= RBP=
RSP=
R8 = R9 = R10=
R11=
R12= R13= R14=
R15=
RIP=0009f076 RFL=00010086 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   f300 DPL=3 DS16 [-WA]
CS =0008   00209900 DPL=0 CS64 [--A]
SS =9f00 0009f000  f300 DPL=3 DS16 [-WA]
DS =0018   00c09300 DPL=0 DS   [-WA]
FS =   f300 DPL=3 DS16 [-WA]
GS =   f300 DPL=3 DS16 [-WA]
LDT=   8200 DPL=0 LDT
TR =   8b00 DPL=0 TSS64-busy
GDT= 0009f080 0020
IDT=  
CR0=8011 CR2= CR3=0009c000 CR4=0030
DR0= DR1= DR2=
DR3=
DR6=0ff0 DR7=0400
EFER=0501
Code=00 00 00 80 0f 22 c0 ea 70 f0 09 00 08 00 48 b8 c0 d4 b0 80
ff ff ff ff ff e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 99 20 00 ff ff 00 00


Artur,

Can you check whether

https://patchwork-mail.kernel.org/patch/1942681/

fixes your problem



Hi, thanks for the reply.

Unfortunately, the patch does not help. Attempt to start FreeBSD
amd64 via quemu-kvm with -smp parameter greater than 1 fails in
exactly the same way as before.

The patch was applied to the kernel from the 3.6.11-1.fc17.src.rpm package.

Do I need some additional patches?



Can you try queue branch from kvm.git?



Thanks for the advice - I used kernel from kvm.git(queue) and was able 
to run FreeBSD-amd guest with SMP on my kvm-host server without previous 
problems. Unfortunately, after few hours server was hung up. Case 
requires further investigations, but generally speaking we went forward. 
Unfortunately, experiments on the server are difficult, because it is 
used for everyday work.



--
Artur
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS

2013-01-11 Thread Alexander Graf

On 11.01.2013, at 02:10, Scott Wood wrote:

 On 01/10/2013 06:35:02 PM, Marcelo Tosatti wrote:
 On Thu, Jan 10, 2013 at 04:40:12PM -0600, Scott Wood wrote:
  On 01/10/2013 04:28:01 PM, Marcelo Tosatti wrote:
  Or just have KVM_SET_PPC_DEVICE_ADDRESS. Is there a downside to that?
 
  Besides the above, and my original complaint that it shouldn't be
  specific to addresses?
 
  -Scott
 I did not really grasp that ('shouldnt be specific to addresses'), but
 anyway.
 
 A device may have other configuration parameters that need to be set,
 besides addresses.  PPC MPIC will require information about the vendor
 and version, for example.
 
 OK, can you write down your proposed improvements to the interface?
 In case you have something ready, otherwise there is time pressure
 to merge the ARM port.
 
 My original request was just to change the name to something like
 KVM_SET_DEVICE_CONFIG or KVM_SET_DEVICE_ATTR, and not make the id
 encoding architecture-specific (preferably, separate into a device id
 field and an attribute id field rather than using bitfields).  Actual
 values for device id could be architecture-specific (or there could be a
 global enumeration), and attribute id values would be device-specific.
 
 Alex suggested that an ideal interface might accept values larger than 64
 bits, though I think it's good enough -- there are currently no proposed
 uses that need more than 64 bits for a single attribute (unlike ONE_REG),
 and if it is needed, such configuration could be split up between
 multiple attributes, or the attribute could specify that value be a
 userspace pointer to the actual data (as with ONE_REG).
 
 Here's a writeup (the ARM details would go under ARM/vGIC-specific
 documentation):
 
 4.80 KVM_SET_DEVICE_ATTR
 
 Capability: KVM_CAP_SET_DEVICE_ATTR
 Type: vm ioctl
 Parameters: struct kvm_device_attr (in)
 Returns: 0 on success, -1 on error
 Errors:
  ENODEV: The device id is unknown
  ENXIO:  Device not supported on current system
  Other errors may be returned by specific devices and attributes.
 
 struct kvm_device_attr {
   __u32 device;

This needs some semantic specification. Is device a constant value? Is it the 
return value of CREATE_IRQCHIP?

   __u32 attr;
   __u64 value;
 };
 
 Specify an attribute of a device emulated or directly exposed by the
 kernel, which the host kernel needs to know about.  The device field is an
 architecture-specific identifier for a specific device.  The attr field
 is a device-specific identifier for a specific attribute.  Individual
 attributes may have particular requirements for when they can and cannot
 be set.
 
 That is, if you have interest/energy to spend in a possibly reusable
 interface, as long as that does not delay integration of the ARM code,
 i don't think the ARM people will mind that.
 
 The impression I've been given is that just about any change will delay
 the integration at this point.  If that's the case, and everyone's OK
 with having an interface that is deprecated on arrival, then fine.

To me it's perfectly fine to merge the patches with the arm specific interface 
and then push an interface like the above in in the next merge window.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH KVM v2 1/4] KVM: fix i8254 IRQ0 to be normally high

2013-01-11 Thread Gleb Natapov
On Thu, Jan 10, 2013 at 11:40:07PM -0700, Matthew Ogilvie wrote:
 On Tue, Jan 08, 2013 at 09:31:19AM +0200, Gleb Natapov wrote:
  On Mon, Jan 07, 2013 at 06:17:22PM -0600, mmogi...@miniinfo.net wrote:
   On Mon, 7 Jan 2013 11:39:18 +0200, Gleb Natapov g...@redhat.com wrote:
On Wed, Dec 26, 2012 at 10:39:53PM -0700, Matthew Ogilvie wrote:
Reading the spec, it is clear that most modes normally leave the IRQ
output line high, and only pulse it low to generate a leading edge.
Especially the most commonly used mode 2.

The KVM i8254 model does not try to emulate the duration of the pulse 
at
all, so just swap the high/low settings it to leave it high most of
the time.

This fix is a prerequisite to improving the i8259 model to handle
the trailing edge of an interupt request as indicated in its spec:
If it gets a trailing edge of an IRQ line before it starts to service
the interrupt, the request should be canceled.

See http://bochs.sourceforge.net/techspec/intel-82c54-timer.pdf.gz
or search the net for 23124406.pdf.

Risks:

There is a risk that migrating a running guest between versions
with and without this patch will lose or gain a single timer
interrupt during the migration process.  The only case where
Can you elaborate on how exactly this can happen? Do not see it.

   
   KVM 8254: In the corrected model, when the count expires, the model
   briefly pulses output low and then high again, with the low to high
   transition being what triggers the interrupt.  In the old model,
   when the count expires, the model expects the output line
   to already be low, and briefly pulses it high (triggering the
   interrupt) and then low again.  But if the line was already
   high (because it migrated from the corrected model),
   this won't generate a new leading edge (low to high) and won't
   trigger a new interrupt (the first post-back-migration pulse turns
   into a simple trailing edge instead of a pulse).
   
   Unless there is something I'm missing?
   
  No, I missed that pic-last_irr/ioapic-irr  will be migrated as 1. But
  this means that next interrupt after migration from new to old will
  always be lost.  What about clearing pit bit from last_irr/irr before
  migration? Should not affect new-new migration and should fix new-old
  one. The only problem is that we may need to consult irq routing table
  to know how pit is connected to ioapic.
 
 We should not clear both last_irr and irr.  That
 cancels the interrupt early if the CPU hasn't started servicing
 it yet:  If the guest CPU has interrupts disabled
 and hasn't begun to service the interrupt, a new-new migration could
 lose the interrupt much earlier in the countdown cycle than it otherwise
 might be lost.
 
I am talking about last_irr in pic and irr in ioapic. Of course we
shouldn't clear pic-irr on migration. ioapic uses irr to detect edge.

 Potentially we could clear the last_irr bit only.  This effectively
 makes migration behave like the old unfixed code.  But if this is
 considered acceptable, I would be more inclined to not fix IRQ0 at all,
If we do not fix IRQ0 next timer interrupt after migration will always be
lost.

 rather than make IRQ0 work subtly differently normally vs during
 migration.  One of my earlier patch series (qemu v7 dated Nov 25)
 attempts to use basically this strategy for the qemu model, at
 least in the short term (and then potentially fix it properly
 in the longer term), although the details of series might
 be tweaked.
 
 Or the minimal-risk strategy is to ONLY fix the cascade IRQ2's
 trailing edge [the original i825* problem that spawned this whole
 thing], leaving all other IRQs as-is.
 
  
  Still do not see how can we gain one interrupt.
 
 In most cases I don't think we get an extra interrupt from
 a direct fix.  But some kinds of attempts to work around missing
 interrupts during migration can cause cause extra interrupts in other
 cases.  In the old qemu model (but perhaps not kvm), the mode 2 leading
 edge occurs one clock tick earlier than it ought to.  So if you
 try to be tricky with adjusting levels during migration, you
 may introduce possible cases where it gets an interrupt in
 the old model, then migrates and gets another interrupt one tick
 later in the new model...
 
 Also, it occurs to me that maybe there might be an off-by-one issue in
 the kvm model of mode 2 that ought to be fixed as well?  Although the  
 zero length pulse in kvm suggests that one-off issues in counters do  
 not matter.  In the qemu model, the leading edge of OUT in mode 2
 moves by one tick...
 
  
   The qemu 8254 model actually models each edge at independent
   clock ticks instead of combining both into a very brief pulse at one time.
   I've found it handy to draw out old and new timing diagrams on paper
   (for each mode), and then carefully think about what happens with respect
   to levels and edges when you 

Re: [PATCH 1/2] VFIO-AER: Vfio-pci driver changes for supporting AER

2013-01-11 Thread Alex Williamson
On Fri, 2013-01-11 at 08:45 +, Pandarathil, Vijaymohan R wrote:
 
  -Original Message-
  From: Alex Williamson [mailto:alex.william...@redhat.com]
  Sent: Wednesday, January 09, 2013 11:05 AM
  To: Pandarathil, Vijaymohan R
  Cc: Bjorn Helgaas; Gleb Natapov; kvm@vger.kernel.org; qemu-
  de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org
  Subject: Re: [PATCH 1/2] VFIO-AER: Vfio-pci driver changes for supporting
  AER
  
  On Wed, 2013-01-09 at 10:52 -0700, Alex Williamson wrote:
   On Wed, 2013-01-09 at 06:26 +, Pandarathil, Vijaymohan R wrote:
- New ioctl which is used to pass the eventfd that is signaled 
when
  an error occurs in the vfio_pci_device
   
- Register pci_error_handler for the vfio_pci driver
   
- When the device encounters an error, the error handler 
registered
  by
  the vfio_pci driver gets invoked by the AER infrastructure
   
- In the error handler, signal the eventfd registered for the 
device.
   
- This results in the qemu eventfd handler getting invoked and
  appropriate action taken for the guest.
   
Signed-off-by: Vijay Mohan Pandarathil vijaymohan.pandarat...@hp.com
---
 drivers/vfio/pci/vfio_pci.c | 29 +
 drivers/vfio/pci/vfio_pci_private.h |  1 +
 drivers/vfio/vfio.c |  8 
 include/linux/vfio.h|  1 +
 include/uapi/linux/vfio.h   |  9 +
 5 files changed, 48 insertions(+)
   
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6c11994..4ae9526 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -207,6 +207,8 @@ static long vfio_pci_ioctl(void *device_data,
if (vdev-reset_works)
info.flags |= VFIO_DEVICE_FLAGS_RESET;
   
+   info.flags |= VFIO_DEVICE_FLAGS_AER_NOTIFY;
+
  
   This appears to be a PCI specific flag, so the name should include
   _PCI_.  We also support non-PCIe devices and it seems like it would be
   possible to not have AER support available, so shouldn't this be
   conditional?
 
 Will do that.
 
  
info.num_regions = VFIO_PCI_NUM_REGIONS;
info.num_irqs = VFIO_PCI_NUM_IRQS;
   
@@ -348,6 +350,19 @@ static long vfio_pci_ioctl(void *device_data,
   
return ret;
   
+   } else if (cmd == VFIO_DEVICE_SET_ERRFD) {
+   int32_t fd = (int32_t)arg;
+
+   if (fd  0)
+   return -EINVAL;
+
+   vdev-err_trigger = eventfd_ctx_fdget(fd);
+
+   if (IS_ERR(vdev-err_trigger))
+   return PTR_ERR(vdev-err_trigger);
+
+   return 0;
+
  
   I'm not sure why we wouldn't describe this as just another interrupt
   from the device and configure it via SET_IRQ.  This ioctl has very
   limited use and doesn't follow any of the conventions of all the other
   vfio ioctls.
 
 I thought this was a fairly simple ioctl to implement this way. Moreover, 
 there is no device interrupt involved. Let me know if you really want to 
 model it as a SET_IRQ ioctl.

I'd like to see something else for sure.  We're trying to do AER
piecemeal but this ioctl leaves no room for anything else.  Seems like a
soon to be wasted ioctl number.  Note that there isn't even a de-assign
interface, nor a flag to set to indicate de-assignment.  The SET_IRQ
ioctl already handles these in a very similar way to how we'd want to
handle it for error signaling.  It doesn't seem like that much of a
stretch to me to include it there, but I'd also entertain other ideas.
Thanks,

Alex

  
} else if (cmd == VFIO_DEVICE_RESET)
return vdev-reset_works ?
pci_reset_function(vdev-pdev) : -EINVAL;
@@ -527,11 +542,25 @@ static void vfio_pci_remove(struct pci_dev *pdev)
kfree(vdev);
 }
   
+static pci_ers_result_t vfio_err_detected(struct pci_dev *pdev,
+   pci_channel_state_t state)
+{
+   struct vfio_pci_device *vdev = vfio_get_vdev(pdev-dev);
+
+   eventfd_signal(vdev-err_trigger, 1);
+   return PCI_ERS_RESULT_CAN_RECOVER;
+}
  
   What if err_trigger hasn't been set?
 
 Will fix that.
 
  
+
+static const struct pci_error_handlers vfio_err_handlers = {
+   .error_detected = vfio_err_detected,
+};
+
 static struct pci_driver vfio_pci_driver = {
.name   = vfio-pci,
.id_table   = NULL, /* only dynamic ids */
.probe  = vfio_pci_probe,
.remove = vfio_pci_remove,
+   .err_handler= vfio_err_handlers,
 };
   
 static void __exit 

Re: [PATCH 2/2] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-01-11 Thread Alex Williamson
On Fri, 2013-01-11 at 08:53 +, Pandarathil, Vijaymohan R wrote:
 
  -Original Message-
  From: Alex Williamson [mailto:alex.william...@redhat.com]
  Sent: Wednesday, January 09, 2013 10:08 AM
  To: Pandarathil, Vijaymohan R
  Cc: Bjorn Helgaas; Gleb Natapov; kvm@vger.kernel.org; qemu-
  de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org
  Subject: Re: [PATCH 2/2] QEMU-AER: Qemu changes to support AER for VFIO-PCI
  devices
  
  On Wed, 2013-01-09 at 06:26 +, Pandarathil, Vijaymohan R wrote:
 - Create eventfd per vfio device assigned to a guest and register an
 event handler
  
 - This fd is passed to the vfio_pci driver through a new ioctl
  
 - When the device encounters an error, the eventfd is signaled
 and the qemu eventfd handler gets invoked.
  
 - In the handler decide what action to take. Current action taken
 is to terminate the guest.
  
   Signed-off-by: Vijay Mohan Pandarathil vijaymohan.pandarat...@hp.com
   ---
hw/vfio_pci.c  | 56
  ++
linux-headers/linux/vfio.h |  9 
2 files changed, 65 insertions(+)
  
   diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
   index 28c8303..9c3c28b 100644
   --- a/hw/vfio_pci.c
   +++ b/hw/vfio_pci.c
   @@ -38,6 +38,7 @@
#include qemu/error-report.h
#include qemu/queue.h
#include qemu/range.h
   +#include sysemu/sysemu.h
  
/* #define DEBUG_VFIO */
#ifdef DEBUG_VFIO
   @@ -130,6 +131,8 @@ typedef struct VFIODevice {
QLIST_ENTRY(VFIODevice) next;
struct VFIOGroup *group;
bool reset_works;
   +EventNotifier errfd;
  
  Misleading name, it's an EventNotifier not an fd.
 
 Will make the change.
 
  
   +__u32 dev_info_flags;
} VFIODevice;
  
typedef struct VFIOGroup {
   @@ -1805,6 +1808,8 @@ static int vfio_get_device(VFIOGroup *group, const
  char *name, VFIODevice *vdev)
DPRINTF(Device %s flags: %u, regions: %u, irgs: %u\n, name,
dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
  
   +vdev-dev_info_flags = dev_info.flags;
   +
  
  We test the VFIO_DEVICE_FLAGS_RESET elsewhere by caching the bit into a
  variable, why not do that here too?
 
 I was wondering if there was any specific reason to cache the bit into
 a separate variable. Wouldn't a flags field which can contain the
 specific bit be more apt ?

Then we have to mask it out ever time we want to reference it.  Qemu
doesn't seem to like macros or inlines for that, so using a new variable
keeps things neat.  Caching two fields to bools is still more space
efficient than saving the whole thing and we can easily switch to named
bits if we get enough to start caring about the space.  Thanks,

Alex
 
if (!(dev_info.flags  VFIO_DEVICE_FLAGS_PCI)) {
error_report(vfio: Um, this isn't a PCI device\n);
goto error;
   @@ -1900,6 +1905,55 @@ static void vfio_put_device(VFIODevice *vdev)
}
}
  
   +static void vfio_errfd_handler(void *opaque)
   +{
   +VFIODevice *vdev = opaque;
   +
   +if (!event_notifier_test_and_clear(vdev-errfd)) {
   +return;
   +}
   +
   +/*
   + * TBD. Retrieve the error details and decide what action
   + * needs to be taken. One of the actions could be to pass
   + * the error to the guest and have the guest driver recover
   + * the error. This requires that PCIe capabilities be
   + * exposed to the guest. At present, we just terminate the
   + * guest to contain the error.
   + */
   +error_report(%s(%04x:%02x:%02x.%x) 
   +Unrecoverable error detected... Terminating guest\n,
   +__func__, vdev-host.domain, vdev-host.bus, vdev-host.slot,
   +vdev-host.function);
   +
   +qemu_system_shutdown_request();
  
  I would have figured hw_error
 
 Hw_error() is probably more appropriate. Will make the change.
 
  
   +return;
   +}
   +
   +static void vfio_register_errfd(VFIODevice *vdev)
   +{
   +int32_t pfd;
  
  pfd is used elsewhere in vfio as Pointer to FD, this is just a fd.
 
 Will change.
 
  
   +int ret;
   +
   +if (!(vdev-dev_info_flags  VFIO_DEVICE_FLAGS_AER_NOTIFY)) {
   +error_report(vfio: Warning: Error notification not supported
  for the device\n);
  
  This should probably just be a debug print.
 
 Okay.
 
  
   +return;
   +}
   +if (event_notifier_init(vdev-errfd, 0)) {
   +error_report(vfio: Warning: Unable to init event notifier for
  error detection\n);
   +return;
   +}
   +pfd = event_notifier_get_fd(vdev-errfd);
   +qemu_set_fd_handler(pfd, vfio_errfd_handler, NULL, vdev);
   +
   +ret = ioctl(vdev-fd, VFIO_DEVICE_SET_ERRFD, pfd);
   +if (ret) {
   +error_report(vfio: Warning: Failed to setup error fd: %d\n,
  ret);
   +qemu_set_fd_handler(pfd, NULL, NULL, vdev);
   +

Re: [PATCH v9 2/3] x86, apicv: add virtual x2apic support

2013-01-11 Thread Gleb Natapov
On Fri, Jan 11, 2013 at 07:36:44AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-01-10:
  On Thu, Jan 10, 2013 at 12:22:51PM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2013-01-10:
  On Thu, Jan 10, 2013 at 11:54:21AM +, Zhang, Yang Z wrote:
  The right logic should be:
  When register virtualization enabled, read all apic msr(except apic id 
  reg
  and
  tmcct which have the hook) not cause vmexit and only write TPR not cause
  vmexit.
  When vid enabled, write to TPR, EOI and SELF IPI not cause vmexit.
  It's better to put the patch of envirtualize x2apic mode as first 
  patch.
  
  There is no point whatsoever to enable virtualize x2apic without
  allowing at least one non intercepted access to x2apic MSR range and
  this is what your patch is doing when run on cpu without vid capability.
  No, at least read/write TPR will not cause vmexit if virtualize x2apic 
  mode is
  enabled.
  For that you need to disable 808H MSR intercept, which your patch does not
  do.
  I want to do this in next patch.
  
  Then don't. There is no point in the patch that enables virtualize
  x2apic and does not allow at least one non-intercepted MSR access.
 As I said, read/write TPR will not cause vmexit if virtual x2apic is set.
 
From my reading of the spec you need to disable 808H intercept for that.
Is this not the case?

  I am not sure whether I understand your comments right in previous
  discussion, here is my thinking: 1. enable virtualize x2apic mode if
  guest is really using x2apic and clear TPR in msr read  and write
  bitmap. This will benefit reading TPR. 2. If APIC registers
  virtualization is enabled, clear all bit in rang
  0x800-0x8ff(except apic id reg and tmcct).
  Clear all read bits in the range.
  
  3. If vid is enabled, clear EOI and SELF IPI in msr write map.
  
  Looks OK.
  
  One concern you mentioned is that vid enabled and virtualize x2apic is
  disabled
  but guest still use x2apic. In this case, we still use MSR bitmap to
  intercept x2apic. This means the vEOI update will never happen. But we
  still can benefit from interrupt window.
  
  What interrupt windows? Without virtualized x2apic TPR/EOI
  virtualization will not happen and we rely on that in the code.
  Yes, but we can eliminate vmexit of interrupt window. Interrupt still can
  delivery to guest without vmexit when guest enables interrupt if vid is 
  enabled.
  See SDM vol 3, 29.2.2.
  
  What does it matter that you eliminated vmexit of interrupt window if
  you broke everything else? The vid patch assumes that apic emulation
  either entirely happens in a software when vid is disabled or in a CPU
  if vid is enabled. Mixed mode will not work (apic-isr_count = 1 trick
  will break things for instance). And it is not worth it to complicate
  the code to make it work.
 Yes, you are right. It too complicated.
 Another question? Why not to hide x2apic capability to guest when vid is 
 supported and virtual x2apic mode is not supported? It should be more 
 reasonable than disable vid when virtual x2apic mode is unavailable.
 
Because I think there will be no such HW. If such HW expected to be
common then we can go this route.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation

2013-01-11 Thread Russell King - ARM Linux
On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote:
 +int kvm_psci_call(struct kvm_vcpu *vcpu)
 +{
 + unsigned long psci_fn = *vcpu_reg(vcpu, 0)  ~((u32) 0);
 + unsigned long val;
 +
 + switch (psci_fn) {
 + case KVM_PSCI_FN_CPU_OFF:
 + kvm_psci_vcpu_off(vcpu);
 + val = KVM_PSCI_RET_SUCCESS;
 + break;
 + case KVM_PSCI_FN_CPU_ON:
 + val = kvm_psci_vcpu_on(vcpu);
 + break;
 + case KVM_PSCI_FN_CPU_SUSPEND:
 + case KVM_PSCI_FN_MIGRATE:
 + val = KVM_PSCI_RET_NI;
 + break;
 +
 + default:
 + return -1;
 + }
 +
 + *vcpu_reg(vcpu, 0) = val;
 + return 0;
 +}

We were discussing recently on #kernel about kernel APIs and the way that
our integer-returning functions pretty much use 0 for success, and -errno
for failures, whereas our pointer-returning functions are a mess.

And above we have something returning -1 to some other chunk of code outside
this compilation unit.  That doesn't sound particularly clever to me.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation

2013-01-11 Thread Marc Zyngier
On 11/01/13 17:12, Russell King - ARM Linux wrote:
 On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote:
 +int kvm_psci_call(struct kvm_vcpu *vcpu)
 +{
 +unsigned long psci_fn = *vcpu_reg(vcpu, 0)  ~((u32) 0);
 +unsigned long val;
 +
 +switch (psci_fn) {
 +case KVM_PSCI_FN_CPU_OFF:
 +kvm_psci_vcpu_off(vcpu);
 +val = KVM_PSCI_RET_SUCCESS;
 +break;
 +case KVM_PSCI_FN_CPU_ON:
 +val = kvm_psci_vcpu_on(vcpu);
 +break;
 +case KVM_PSCI_FN_CPU_SUSPEND:
 +case KVM_PSCI_FN_MIGRATE:
 +val = KVM_PSCI_RET_NI;
 +break;
 +
 +default:
 +return -1;
 +}
 +
 +*vcpu_reg(vcpu, 0) = val;
 +return 0;
 +}
 
 We were discussing recently on #kernel about kernel APIs and the way that
 our integer-returning functions pretty much use 0 for success, and -errno
 for failures, whereas our pointer-returning functions are a mess.
 
 And above we have something returning -1 to some other chunk of code outside
 this compilation unit.  That doesn't sound particularly clever to me.

The original code used to return -EINVAL, see:
https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html

Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert
the code to its original state though.

M.
-- 
Jazz is not dead. It just smells funny...

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation

2013-01-11 Thread Christoffer Dall
On Fri, Jan 11, 2013 at 12:24 PM, Marc Zyngier marc.zyng...@arm.com wrote:
 On 11/01/13 17:12, Russell King - ARM Linux wrote:
 On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote:
 +int kvm_psci_call(struct kvm_vcpu *vcpu)
 +{
 +unsigned long psci_fn = *vcpu_reg(vcpu, 0)  ~((u32) 0);
 +unsigned long val;
 +
 +switch (psci_fn) {
 +case KVM_PSCI_FN_CPU_OFF:
 +kvm_psci_vcpu_off(vcpu);
 +val = KVM_PSCI_RET_SUCCESS;
 +break;
 +case KVM_PSCI_FN_CPU_ON:
 +val = kvm_psci_vcpu_on(vcpu);
 +break;
 +case KVM_PSCI_FN_CPU_SUSPEND:
 +case KVM_PSCI_FN_MIGRATE:
 +val = KVM_PSCI_RET_NI;
 +break;
 +
 +default:
 +return -1;
 +}
 +
 +*vcpu_reg(vcpu, 0) = val;
 +return 0;
 +}

 We were discussing recently on #kernel about kernel APIs and the way that
 our integer-returning functions pretty much use 0 for success, and -errno
 for failures, whereas our pointer-returning functions are a mess.

 And above we have something returning -1 to some other chunk of code outside
 this compilation unit.  That doesn't sound particularly clever to me.

 The original code used to return -EINVAL, see:
 https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html

 Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert
 the code to its original state though.

I don't want to return -EINVAL, because for the rest of the KVM code
this would mean kill the guest.

The convention used in other archs of KVM as well as for ARM is that
the handle_exit functions return:

-ERRNO: Error, report this error to user space
0: Everything is fine, but return to user space to let it do I/O
emulation and whatever it wants to do
1: Everything is fine, return directly to the guest without going to user space

And then you do:
if (handle_something() == 0)
return 1;

which I thought was confusing, so I said make the function a bool, to
avoid the confusion, like Rusty did for all the coprocessor emulation
functions.

There are obviously other ways to handle the return 1 case, like
having an extra state that you carry around, and we can change all the
code to do that, but I just don't think it's worth it, as we are in
fact quite close to the existing kernel API.

-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation

2013-01-11 Thread Russell King - ARM Linux
On Fri, Jan 11, 2013 at 12:33:15PM -0500, Christoffer Dall wrote:
 On Fri, Jan 11, 2013 at 12:24 PM, Marc Zyngier marc.zyng...@arm.com wrote:
  On 11/01/13 17:12, Russell King - ARM Linux wrote:
  On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote:
  +int kvm_psci_call(struct kvm_vcpu *vcpu)
  +{
  +unsigned long psci_fn = *vcpu_reg(vcpu, 0)  ~((u32) 0);
  +unsigned long val;
  +
  +switch (psci_fn) {
  +case KVM_PSCI_FN_CPU_OFF:
  +kvm_psci_vcpu_off(vcpu);
  +val = KVM_PSCI_RET_SUCCESS;
  +break;
  +case KVM_PSCI_FN_CPU_ON:
  +val = kvm_psci_vcpu_on(vcpu);
  +break;
  +case KVM_PSCI_FN_CPU_SUSPEND:
  +case KVM_PSCI_FN_MIGRATE:
  +val = KVM_PSCI_RET_NI;
  +break;
  +
  +default:
  +return -1;
  +}
  +
  +*vcpu_reg(vcpu, 0) = val;
  +return 0;
  +}
 
  We were discussing recently on #kernel about kernel APIs and the way that
  our integer-returning functions pretty much use 0 for success, and -errno
  for failures, whereas our pointer-returning functions are a mess.
 
  And above we have something returning -1 to some other chunk of code 
  outside
  this compilation unit.  That doesn't sound particularly clever to me.
 
  The original code used to return -EINVAL, see:
  https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html
 
  Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert
  the code to its original state though.
 
 I don't want to return -EINVAL, because for the rest of the KVM code
 this would mean kill the guest.
 
 The convention used in other archs of KVM as well as for ARM is that
 the handle_exit functions return:
 
 -ERRNO: Error, report this error to user space
 0: Everything is fine, but return to user space to let it do I/O
 emulation and whatever it wants to do
 1: Everything is fine, return directly to the guest without going to user 
 space

Right, so the above return -1 _is_ doing the thing that I really hate,
which is it's actually doing a return -EPERM without anyone realising
that's what it's doing.

This is precisely why I hate (and pick up on, and have my mail reader to
highlight) any return -1.  It's mostly always a bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation

2013-01-11 Thread Marc Zyngier
On 11/01/13 17:33, Christoffer Dall wrote:
 On Fri, Jan 11, 2013 at 12:24 PM, Marc Zyngier marc.zyng...@arm.com wrote:
 On 11/01/13 17:12, Russell King - ARM Linux wrote:
 On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote:
 +int kvm_psci_call(struct kvm_vcpu *vcpu)
 +{
 +unsigned long psci_fn = *vcpu_reg(vcpu, 0)  ~((u32) 0);
 +unsigned long val;
 +
 +switch (psci_fn) {
 +case KVM_PSCI_FN_CPU_OFF:
 +kvm_psci_vcpu_off(vcpu);
 +val = KVM_PSCI_RET_SUCCESS;
 +break;
 +case KVM_PSCI_FN_CPU_ON:
 +val = kvm_psci_vcpu_on(vcpu);
 +break;
 +case KVM_PSCI_FN_CPU_SUSPEND:
 +case KVM_PSCI_FN_MIGRATE:
 +val = KVM_PSCI_RET_NI;
 +break;
 +
 +default:
 +return -1;
 +}
 +
 +*vcpu_reg(vcpu, 0) = val;
 +return 0;
 +}

 We were discussing recently on #kernel about kernel APIs and the way that
 our integer-returning functions pretty much use 0 for success, and -errno
 for failures, whereas our pointer-returning functions are a mess.

 And above we have something returning -1 to some other chunk of code outside
 this compilation unit.  That doesn't sound particularly clever to me.

 The original code used to return -EINVAL, see:
 https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html

 Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert
 the code to its original state though.

 I don't want to return -EINVAL, because for the rest of the KVM code
 this would mean kill the guest.
 
 The convention used in other archs of KVM as well as for ARM is that
 the handle_exit functions return:
 
 -ERRNO: Error, report this error to user space
 0: Everything is fine, but return to user space to let it do I/O
 emulation and whatever it wants to do
 1: Everything is fine, return directly to the guest without going to user 
 space

That is assuming we propagate the handle_exit convention down to the
leaf calls, and I object to that. The 3 possible values only apply to
handle_exit, and we should keep that convention as local as possible,
because this is the odd case.

 And then you do:
 if (handle_something() == 0)
 return 1;
 
 which I thought was confusing, so I said make the function a bool, to
 avoid the confusion, like Rusty did for all the coprocessor emulation
 functions.

I don't see a compelling reason to propagate this convention to areas
that do not require it. In the PSCI case, we have a basic
handled/not-handled state, the later indicating the reason. The exit
handling functions can convert the error codes to whatever the run loop
requires.

 There are obviously other ways to handle the return 1 case, like
 having an extra state that you carry around, and we can change all the
 code to do that, but I just don't think it's worth it, as we are in
 fact quite close to the existing kernel API.

-- 
Jazz is not dead. It just smells funny...

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation

2013-01-11 Thread Christoffer Dall
On Fri, Jan 11, 2013 at 12:43 PM, Marc Zyngier marc.zyng...@arm.com wrote:
 On 11/01/13 17:33, Christoffer Dall wrote:
 On Fri, Jan 11, 2013 at 12:24 PM, Marc Zyngier marc.zyng...@arm.com wrote:
 On 11/01/13 17:12, Russell King - ARM Linux wrote:
 On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote:
 +int kvm_psci_call(struct kvm_vcpu *vcpu)
 +{
 +unsigned long psci_fn = *vcpu_reg(vcpu, 0)  ~((u32) 0);
 +unsigned long val;
 +
 +switch (psci_fn) {
 +case KVM_PSCI_FN_CPU_OFF:
 +kvm_psci_vcpu_off(vcpu);
 +val = KVM_PSCI_RET_SUCCESS;
 +break;
 +case KVM_PSCI_FN_CPU_ON:
 +val = kvm_psci_vcpu_on(vcpu);
 +break;
 +case KVM_PSCI_FN_CPU_SUSPEND:
 +case KVM_PSCI_FN_MIGRATE:
 +val = KVM_PSCI_RET_NI;
 +break;
 +
 +default:
 +return -1;
 +}
 +
 +*vcpu_reg(vcpu, 0) = val;
 +return 0;
 +}

 We were discussing recently on #kernel about kernel APIs and the way that
 our integer-returning functions pretty much use 0 for success, and -errno
 for failures, whereas our pointer-returning functions are a mess.

 And above we have something returning -1 to some other chunk of code 
 outside
 this compilation unit.  That doesn't sound particularly clever to me.

 The original code used to return -EINVAL, see:
 https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html

 Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert
 the code to its original state though.

 I don't want to return -EINVAL, because for the rest of the KVM code
 this would mean kill the guest.

 The convention used in other archs of KVM as well as for ARM is that
 the handle_exit functions return:

 -ERRNO: Error, report this error to user space
 0: Everything is fine, but return to user space to let it do I/O
 emulation and whatever it wants to do
 1: Everything is fine, return directly to the guest without going to user 
 space

 That is assuming we propagate the handle_exit convention down to the
 leaf calls, and I object to that. The 3 possible values only apply to
 handle_exit, and we should keep that convention as local as possible,
 because this is the odd case.

I don't agree - it requires you to carefully follow a potentially deep
call trace to make sense of what it does, and worse, it's directly
misleading in the case of KVM/ARM. So either change it everywhere and
have a consistent calling convention or adhere to what we do elsewhere
and use the bool.


 And then you do:
 if (handle_something() == 0)
 return 1;

 which I thought was confusing, so I said make the function a bool, to
 avoid the confusion, like Rusty did for all the coprocessor emulation
 functions.

 I don't see a compelling reason to propagate this convention to areas
 that do not require it. In the PSCI case, we have a basic
 handled/not-handled state, the later indicating the reason. The exit
 handling functions can convert the error codes to whatever the run loop
 requires.


again, that's why I suggest returning a bool instead. You just said
it: it's a basic handled/not-handled state. Why do you want to return
-EINVAL if that's not propogated anywhere?

If the return codes are local and do not map reasonably to error codes
and more complicated than a bool, I would vote for a #define
HVC_HANDLE_SUCCESS, HVC_HANDLE_ERROR, HVC_HANDLE_XX, ...



 There are obviously other ways to handle the return 1 case, like
 having an extra state that you carry around, and we can change all the
 code to do that, but I just don't think it's worth it, as we are in
 fact quite close to the existing kernel API.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation

2013-01-11 Thread Russell King - ARM Linux
On Fri, Jan 11, 2013 at 12:48:45PM -0500, Christoffer Dall wrote:
 again, that's why I suggest returning a bool instead. You just said
 it: it's a basic handled/not-handled state. Why do you want to return
 -EINVAL if that's not propogated anywhere?

We have a well established principle throughout the kernel interfaces that
functions will return positive values for success and an appropriate -ve
errno for failure.

We *certainly* hate functions which return 0 for failure and non-zero
for success - it makes review a real pain because you start seeing code
doing this:

if (!function()) {
deal_with_failure();
}

and you have to then start looking at the function to properly understand
what it's return semantics are.

We have more than enough proof already that this doesn't work: people
don't care to understand what the return values from functions mean.
You only have to do an audit of a few of the IS_ERR_OR_NULL() uses to
find that out, and you quickly realise that people just use what they
_think_ is the right test and which happens to pass their very simple
testing at the time.

We've avoided major problems so far in the kernel by having most of the
integer-returning functions following the established principle, and
that's good.  I really really think that there must be a _very_ good
reason, and overwhelming reason to deviate from the established
principle in any large project.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support

2013-01-11 Thread Marcelo Tosatti
  
  Move inside irq_lock protection.
  

void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
   @@ -270,6 +291,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
 hlist_del_init_rcu(kian-link);
 mutex_unlock(kvm-irq_lock);
 synchronize_rcu();
   + ioapic_update_eoi_exitmap(kvm);
  
  Move both synchronize_rcu and ioapic_update_eoi_exitmap inside irq_lock
  protection.
 Why? (here and one above). If vcpu uses stale data during update it will
 find recalculate request during guest entry and will recalculate again.
 
 --
   Gleb.

Indeed, its not necessary and also not necessary to grab irq_lock during 
vcpu entry, while processing the KVM_REQ_ bit.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation

2013-01-11 Thread Christoffer Dall
On Fri, Jan 11, 2013 at 12:57 PM, Russell King - ARM Linux
li...@arm.linux.org.uk wrote:
 On Fri, Jan 11, 2013 at 12:48:45PM -0500, Christoffer Dall wrote:
 again, that's why I suggest returning a bool instead. You just said
 it: it's a basic handled/not-handled state. Why do you want to return
 -EINVAL if that's not propogated anywhere?

 We have a well established principle throughout the kernel interfaces that
 functions will return positive values for success and an appropriate -ve
 errno for failure.

 We *certainly* hate functions which return 0 for failure and non-zero
 for success - it makes review a real pain because you start seeing code
 doing this:

 if (!function()) {
 deal_with_failure();
 }


and you'd certainly not find that in any of the KVM/ARM - that would
be despicable.

 and you have to then start looking at the function to properly understand
 what it's return semantics are.

 We have more than enough proof already that this doesn't work: people
 don't care to understand what the return values from functions mean.
 You only have to do an audit of a few of the IS_ERR_OR_NULL() uses to
 find that out, and you quickly realise that people just use what they
 _think_ is the right test and which happens to pass their very simple
 testing at the time.

 We've avoided major problems so far in the kernel by having most of the
 integer-returning functions following the established principle, and
 that's good.  I really really think that there must be a _very_ good
 reason, and overwhelming reason to deviate from the established
 principle in any large project.

The _very_ good reason here, is that we have two success cases: return
to guest and return to user space. As I said, we can save this state
in another bit somewhere and change all the KVM/ARM code to do so, but
the KVM guys back then would like to use the same convention as other
KVM archs.

I would prefer not having to change all that KVM/ARM code at this
point, but of course, if you insists, I will have to do that.

BUT, returning -EINVAL should indicate an actual error. This is not
the case for the concrete psci example, the case is simply that the
number in r0 does not fall within the psci range, and therefore could
potentially be handled by something else, and using -EINVAL as a
fall-through to the next value is equally weird and misleading.

An alternative is to do something like foo(..., handled), but I think
using a bool as the function type is perfect here: what is the problem
with that again? Certainly it wouldn't be the only boolean function in
the kernel.

-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation

2013-01-11 Thread Marc Zyngier
On 11/01/13 17:48, Christoffer Dall wrote:
 On Fri, Jan 11, 2013 at 12:43 PM, Marc Zyngier marc.zyng...@arm.com wrote:
 On 11/01/13 17:33, Christoffer Dall wrote:
 On Fri, Jan 11, 2013 at 12:24 PM, Marc Zyngier marc.zyng...@arm.com wrote:
 On 11/01/13 17:12, Russell King - ARM Linux wrote:
 On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote:
 +int kvm_psci_call(struct kvm_vcpu *vcpu)
 +{
 +unsigned long psci_fn = *vcpu_reg(vcpu, 0)  ~((u32) 0);
 +unsigned long val;
 +
 +switch (psci_fn) {
 +case KVM_PSCI_FN_CPU_OFF:
 +kvm_psci_vcpu_off(vcpu);
 +val = KVM_PSCI_RET_SUCCESS;
 +break;
 +case KVM_PSCI_FN_CPU_ON:
 +val = kvm_psci_vcpu_on(vcpu);
 +break;
 +case KVM_PSCI_FN_CPU_SUSPEND:
 +case KVM_PSCI_FN_MIGRATE:
 +val = KVM_PSCI_RET_NI;
 +break;
 +
 +default:
 +return -1;
 +}
 +
 +*vcpu_reg(vcpu, 0) = val;
 +return 0;
 +}

 We were discussing recently on #kernel about kernel APIs and the way that
 our integer-returning functions pretty much use 0 for success, and -errno
 for failures, whereas our pointer-returning functions are a mess.

 And above we have something returning -1 to some other chunk of code 
 outside
 this compilation unit.  That doesn't sound particularly clever to me.

 The original code used to return -EINVAL, see:
 https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html

 Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert
 the code to its original state though.

 I don't want to return -EINVAL, because for the rest of the KVM code
 this would mean kill the guest.

 The convention used in other archs of KVM as well as for ARM is that
 the handle_exit functions return:

 -ERRNO: Error, report this error to user space
 0: Everything is fine, but return to user space to let it do I/O
 emulation and whatever it wants to do
 1: Everything is fine, return directly to the guest without going to user 
 space

 That is assuming we propagate the handle_exit convention down to the
 leaf calls, and I object to that. The 3 possible values only apply to
 handle_exit, and we should keep that convention as local as possible,
 because this is the odd case.
 
 I don't agree - it requires you to carefully follow a potentially deep
 call trace to make sense of what it does, and worse, it's directly
 misleading in the case of KVM/ARM. So either change it everywhere and
 have a consistent calling convention or adhere to what we do elsewhere
 and use the bool.

Sorry, but I do not buy this.

In this particular case, the meaning of the value returned has nothing
to do with the handle_exit convention. We never return to user space,
let alone signaling an error.

Trying to force all the code in this convention makes it actually harder
to review, because this is the exception in the kernel. This is why I
suggest keeping the handle_exit return convention at this exact point,
and not impose it on anything else.


 And then you do:
 if (handle_something() == 0)
 return 1;

 which I thought was confusing, so I said make the function a bool, to
 avoid the confusion, like Rusty did for all the coprocessor emulation
 functions.

 I don't see a compelling reason to propagate this convention to areas
 that do not require it. In the PSCI case, we have a basic
 handled/not-handled state, the later indicating the reason. The exit
 handling functions can convert the error codes to whatever the run loop
 requires.

 
 again, that's why I suggest returning a bool instead. You just said
 it: it's a basic handled/not-handled state. Why do you want to return
 -EINVAL if that's not propogated anywhere?
 
 If the return codes are local and do not map reasonably to error codes
 and more complicated than a bool, I would vote for a #define
 HVC_HANDLE_SUCCESS, HVC_HANDLE_ERROR, HVC_HANDLE_XX, ...
 
 
 
 There are obviously other ways to handle the return 1 case, like
 having an extra state that you carry around, and we can change all the
 code to do that, but I just don't think it's worth it, as we are in
 fact quite close to the existing kernel API.

 


-- 
Jazz is not dead. It just smells funny...

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation

2013-01-11 Thread Russell King - ARM Linux
On Fri, Jan 11, 2013 at 01:07:31PM -0500, Christoffer Dall wrote:
 The _very_ good reason here, is that we have two success cases: return
 to guest and return to user space. As I said, we can save this state
 in another bit somewhere and change all the KVM/ARM code to do so, but
 the KVM guys back then would like to use the same convention as other
 KVM archs.

Can you please credit me for not objecting to returning 0/1 to have
different success meanings.  What I'm merely objecting to is that
return -1 statement in the code (notice the negative sign.)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation

2013-01-11 Thread Christoffer Dall
On Fri, Jan 11, 2013 at 1:14 PM, Russell King - ARM Linux
li...@arm.linux.org.uk wrote:
 On Fri, Jan 11, 2013 at 01:07:31PM -0500, Christoffer Dall wrote:
 The _very_ good reason here, is that we have two success cases: return
 to guest and return to user space. As I said, we can save this state
 in another bit somewhere and change all the KVM/ARM code to do so, but
 the KVM guys back then would like to use the same convention as other
 KVM archs.

 Can you please credit me for not objecting to returning 0/1 to have
 different success meanings.  What I'm merely objecting to is that
 return -1 statement in the code (notice the negative sign.)

Sorry if I misunderstood you. Yes, the return -1 has to be changed.

-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation

2013-01-11 Thread Christoffer Dall
On Fri, Jan 11, 2013 at 1:09 PM, Marc Zyngier marc.zyng...@arm.com wrote:
 On 11/01/13 17:48, Christoffer Dall wrote:
 On Fri, Jan 11, 2013 at 12:43 PM, Marc Zyngier marc.zyng...@arm.com wrote:
 On 11/01/13 17:33, Christoffer Dall wrote:
 On Fri, Jan 11, 2013 at 12:24 PM, Marc Zyngier marc.zyng...@arm.com 
 wrote:
 On 11/01/13 17:12, Russell King - ARM Linux wrote:
 On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote:
 +int kvm_psci_call(struct kvm_vcpu *vcpu)
 +{
 +unsigned long psci_fn = *vcpu_reg(vcpu, 0)  ~((u32) 0);
 +unsigned long val;
 +
 +switch (psci_fn) {
 +case KVM_PSCI_FN_CPU_OFF:
 +kvm_psci_vcpu_off(vcpu);
 +val = KVM_PSCI_RET_SUCCESS;
 +break;
 +case KVM_PSCI_FN_CPU_ON:
 +val = kvm_psci_vcpu_on(vcpu);
 +break;
 +case KVM_PSCI_FN_CPU_SUSPEND:
 +case KVM_PSCI_FN_MIGRATE:
 +val = KVM_PSCI_RET_NI;
 +break;
 +
 +default:
 +return -1;
 +}
 +
 +*vcpu_reg(vcpu, 0) = val;
 +return 0;
 +}

 We were discussing recently on #kernel about kernel APIs and the way that
 our integer-returning functions pretty much use 0 for success, and -errno
 for failures, whereas our pointer-returning functions are a mess.

 And above we have something returning -1 to some other chunk of code 
 outside
 this compilation unit.  That doesn't sound particularly clever to me.

 The original code used to return -EINVAL, see:
 https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html

 Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert
 the code to its original state though.

 I don't want to return -EINVAL, because for the rest of the KVM code
 this would mean kill the guest.

 The convention used in other archs of KVM as well as for ARM is that
 the handle_exit functions return:

 -ERRNO: Error, report this error to user space
 0: Everything is fine, but return to user space to let it do I/O
 emulation and whatever it wants to do
 1: Everything is fine, return directly to the guest without going to user 
 space

 That is assuming we propagate the handle_exit convention down to the
 leaf calls, and I object to that. The 3 possible values only apply to
 handle_exit, and we should keep that convention as local as possible,
 because this is the odd case.

 I don't agree - it requires you to carefully follow a potentially deep
 call trace to make sense of what it does, and worse, it's directly
 misleading in the case of KVM/ARM. So either change it everywhere and
 have a consistent calling convention or adhere to what we do elsewhere
 and use the bool.

 Sorry, but I do not buy this.

 In this particular case, the meaning of the value returned has nothing
 to do with the handle_exit convention. We never return to user space,
 let alone signaling an error.


But that is by no means obvious.

 Trying to force all the code in this convention makes it actually harder
 to review, because this is the exception in the kernel. This is why I
 suggest keeping the handle_exit return convention at this exact point,
 and not impose it on anything else.


Ok, then come up with a nicer convention, but using -EINVAL to mean I
did not handle this one, maybe someone else will handle it, and it's
not an error is definitely not easy to read.  What then if another
function called from the same caller could actually return an error,
would you have a switch case on the return value checking for EINVAL
vs. EFAULT?

I repeat: what's the problem with a bool for this particular case,
which would make it look like all the other sub-calls of handle exit
functions - could we be consistent here?

-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] VFIO-AER: Vfio-pci driver changes for supporting AER

2013-01-11 Thread Marcelo Tosatti
On Fri, Jan 11, 2013 at 08:46:35AM -0700, Alex Williamson wrote:
 On Fri, 2013-01-11 at 08:45 +, Pandarathil, Vijaymohan R wrote:
  
   -Original Message-
   From: Alex Williamson [mailto:alex.william...@redhat.com]
   Sent: Wednesday, January 09, 2013 11:05 AM
   To: Pandarathil, Vijaymohan R
   Cc: Bjorn Helgaas; Gleb Natapov; kvm@vger.kernel.org; qemu-
   de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org
   Subject: Re: [PATCH 1/2] VFIO-AER: Vfio-pci driver changes for supporting
   AER
   
   On Wed, 2013-01-09 at 10:52 -0700, Alex Williamson wrote:
On Wed, 2013-01-09 at 06:26 +, Pandarathil, Vijaymohan R wrote:
   - New ioctl which is used to pass the eventfd that is signaled 
 when
   an error occurs in the vfio_pci_device

   - Register pci_error_handler for the vfio_pci driver

   - When the device encounters an error, the error handler 
 registered
   by
   the vfio_pci driver gets invoked by the AER infrastructure

   - In the error handler, signal the eventfd registered for the 
 device.

   - This results in the qemu eventfd handler getting invoked and
   appropriate action taken for the guest.

 Signed-off-by: Vijay Mohan Pandarathil vijaymohan.pandarat...@hp.com
 ---
  drivers/vfio/pci/vfio_pci.c | 29 
 +
  drivers/vfio/pci/vfio_pci_private.h |  1 +
  drivers/vfio/vfio.c |  8 
  include/linux/vfio.h|  1 +
  include/uapi/linux/vfio.h   |  9 +
  5 files changed, 48 insertions(+)

 diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
 index 6c11994..4ae9526 100644
 --- a/drivers/vfio/pci/vfio_pci.c
 +++ b/drivers/vfio/pci/vfio_pci.c
 @@ -207,6 +207,8 @@ static long vfio_pci_ioctl(void *device_data,
   if (vdev-reset_works)
   info.flags |= VFIO_DEVICE_FLAGS_RESET;

 + info.flags |= VFIO_DEVICE_FLAGS_AER_NOTIFY;
 +
   
This appears to be a PCI specific flag, so the name should include
_PCI_.  We also support non-PCIe devices and it seems like it would be
possible to not have AER support available, so shouldn't this be
conditional?
  
  Will do that.
  
   
   info.num_regions = VFIO_PCI_NUM_REGIONS;
   info.num_irqs = VFIO_PCI_NUM_IRQS;

 @@ -348,6 +350,19 @@ static long vfio_pci_ioctl(void *device_data,

   return ret;

 + } else if (cmd == VFIO_DEVICE_SET_ERRFD) {
 + int32_t fd = (int32_t)arg;
 +
 + if (fd  0)
 + return -EINVAL;
 +
 + vdev-err_trigger = eventfd_ctx_fdget(fd);
 +
 + if (IS_ERR(vdev-err_trigger))
 + return PTR_ERR(vdev-err_trigger);
 +
 + return 0;
 +
   
I'm not sure why we wouldn't describe this as just another interrupt
from the device and configure it via SET_IRQ.  This ioctl has very
limited use and doesn't follow any of the conventions of all the other
vfio ioctls.
  
  I thought this was a fairly simple ioctl to implement this way. Moreover, 
  there is no device interrupt involved. Let me know if you really want to 
  model it as a SET_IRQ ioctl.
 
 I'd like to see something else for sure.  We're trying to do AER
 piecemeal but this ioctl leaves no room for anything else.  Seems like a
 soon to be wasted ioctl number.  Note that there isn't even a de-assign
 interface, nor a flag to set to indicate de-assignment.  The SET_IRQ
 ioctl already handles these in a very similar way to how we'd want to
 handle it for error signaling.  It doesn't seem like that much of a
 stretch to me to include it there, but I'd also entertain other ideas.
 Thanks,
 
 Alex

Yes it would be good to have a picture of the dependencies necessary to
implement passthrough, before introducing an incompatible interface.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS

2013-01-11 Thread Marcelo Tosatti
On Fri, Jan 11, 2013 at 02:26:51AM -0500, Christoffer Dall wrote:
 On 10/01/2013, at 20.10, Scott Wood scottw...@freescale.com wrote:
 
  On 01/10/2013 06:35:02 PM, Marcelo Tosatti wrote:
  On Thu, Jan 10, 2013 at 04:40:12PM -0600, Scott Wood wrote:
   On 01/10/2013 04:28:01 PM, Marcelo Tosatti wrote:
   Or just have KVM_SET_PPC_DEVICE_ADDRESS. Is there a downside to that?
  
   Besides the above, and my original complaint that it shouldn't be
   specific to addresses?
  
   -Scott
  I did not really grasp that ('shouldnt be specific to addresses'), but
  anyway.
  
  A device may have other configuration parameters that need to be set,
  besides addresses.  PPC MPIC will require information about the vendor
  and version, for example.
  
  OK, can you write down your proposed improvements to the interface?
  In case you have something ready, otherwise there is time pressure
  to merge the ARM port.
  
  My original request was just to change the name to something like
  KVM_SET_DEVICE_CONFIG or KVM_SET_DEVICE_ATTR, and not make the id
  encoding architecture-specific (preferably, separate into a device id
  field and an attribute id field rather than using bitfields).  Actual
  values for device id could be architecture-specific (or there could be a
  global enumeration), and attribute id values would be device-specific.
  
  Alex suggested that an ideal interface might accept values larger than 64
  bits, though I think it's good enough -- there are currently no proposed
  uses that need more than 64 bits for a single attribute (unlike ONE_REG),
  and if it is needed, such configuration could be split up between
  multiple attributes, or the attribute could specify that value be a
  userspace pointer to the actual data (as with ONE_REG).
  
  Here's a writeup (the ARM details would go under ARM/vGIC-specific
  documentation):
  
  4.80 KVM_SET_DEVICE_ATTR
  
  Capability: KVM_CAP_SET_DEVICE_ATTR
  Type: vm ioctl
  Parameters: struct kvm_device_attr (in)
  Returns: 0 on success, -1 on error
  Errors:
   ENODEV: The device id is unknown
   ENXIO:  Device not supported on current system
   Other errors may be returned by specific devices and attributes.
  
  struct kvm_device_attr {
 __u32 device;
 __u32 attr;
 __u64 value;
  };
  
  Specify an attribute of a device emulated or directly exposed by the
  kernel, which the host kernel needs to know about.  The device field is an
  architecture-specific identifier for a specific device.  The attr field
  is a device-specific identifier for a specific attribute.  Individual
  attributes may have particular requirements for when they can and cannot
  be set.
  
  That is, if you have interest/energy to spend in a possibly reusable
  interface, as long as that does not delay integration of the ARM code,
  i don't think the ARM people will mind that.
  
  The impression I've been given is that just about any change will delay
  the integration at this point.  If that's the case, and everyone's OK
  with having an interface that is deprecated on arrival, then fine.
 
 
 That is not entirely the case, but there wasn't event agreement on this 
 revised API, and we didn't want to wait for weeks until a decision was made. 
 Alex suggested we use a DEV_REG API similar to the ONE_REG API, but I am 
 quite strongly against having such an API for this specific purpose as it is 
 too semantically distant to what we do on ARM. (Having a DEV_REG API for 
 other purposes might be fine though). 
 
 So I have no problem with your suggestion, although we could consider having 
 a size and addr field in the struct instead and be slightly more extensible. 
 But I don't feel strongly about it. 
 
 If we can agree on Scott's suggestion or with my modification (Alex, Gleb, 
 Marcelo ?) then I'll change the KVM/ARM patches to use this and resend them 
 right away. But we have to agree now!
 
 If not, I really think we should keep things as they are now, as we're really 
 discussing idealistic scenarios here, and the ARM patches have been out of 
 tree for long enough. As Marcelo pointed out, the benefits of the perfect API 
 are really minimal.
 
 -Christoffer--

Can you make KVM_SET_ARM_DEVICE address extensible? 
Add some reserved space and a flags field.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS

2013-01-11 Thread Alexander Graf

On 11.01.2013, at 19:39, Marcelo Tosatti wrote:

 On Fri, Jan 11, 2013 at 02:26:51AM -0500, Christoffer Dall wrote:
 On 10/01/2013, at 20.10, Scott Wood scottw...@freescale.com wrote:
 
 On 01/10/2013 06:35:02 PM, Marcelo Tosatti wrote:
 On Thu, Jan 10, 2013 at 04:40:12PM -0600, Scott Wood wrote:
 On 01/10/2013 04:28:01 PM, Marcelo Tosatti wrote:
 Or just have KVM_SET_PPC_DEVICE_ADDRESS. Is there a downside to that?
 
 Besides the above, and my original complaint that it shouldn't be
 specific to addresses?
 
 -Scott
 I did not really grasp that ('shouldnt be specific to addresses'), but
 anyway.
 
 A device may have other configuration parameters that need to be set,
 besides addresses.  PPC MPIC will require information about the vendor
 and version, for example.
 
 OK, can you write down your proposed improvements to the interface?
 In case you have something ready, otherwise there is time pressure
 to merge the ARM port.
 
 My original request was just to change the name to something like
 KVM_SET_DEVICE_CONFIG or KVM_SET_DEVICE_ATTR, and not make the id
 encoding architecture-specific (preferably, separate into a device id
 field and an attribute id field rather than using bitfields).  Actual
 values for device id could be architecture-specific (or there could be a
 global enumeration), and attribute id values would be device-specific.
 
 Alex suggested that an ideal interface might accept values larger than 64
 bits, though I think it's good enough -- there are currently no proposed
 uses that need more than 64 bits for a single attribute (unlike ONE_REG),
 and if it is needed, such configuration could be split up between
 multiple attributes, or the attribute could specify that value be a
 userspace pointer to the actual data (as with ONE_REG).
 
 Here's a writeup (the ARM details would go under ARM/vGIC-specific
 documentation):
 
 4.80 KVM_SET_DEVICE_ATTR
 
 Capability: KVM_CAP_SET_DEVICE_ATTR
 Type: vm ioctl
 Parameters: struct kvm_device_attr (in)
 Returns: 0 on success, -1 on error
 Errors:
 ENODEV: The device id is unknown
 ENXIO:  Device not supported on current system
 Other errors may be returned by specific devices and attributes.
 
 struct kvm_device_attr {
   __u32 device;
   __u32 attr;
   __u64 value;
 };
 
 Specify an attribute of a device emulated or directly exposed by the
 kernel, which the host kernel needs to know about.  The device field is an
 architecture-specific identifier for a specific device.  The attr field
 is a device-specific identifier for a specific attribute.  Individual
 attributes may have particular requirements for when they can and cannot
 be set.
 
 That is, if you have interest/energy to spend in a possibly reusable
 interface, as long as that does not delay integration of the ARM code,
 i don't think the ARM people will mind that.
 
 The impression I've been given is that just about any change will delay
 the integration at this point.  If that's the case, and everyone's OK
 with having an interface that is deprecated on arrival, then fine.
 
 
 That is not entirely the case, but there wasn't event agreement on this 
 revised API, and we didn't want to wait for weeks until a decision was made. 
 Alex suggested we use a DEV_REG API similar to the ONE_REG API, but I am 
 quite strongly against having such an API for this specific purpose as it is 
 too semantically distant to what we do on ARM. (Having a DEV_REG API for 
 other purposes might be fine though). 
 
 So I have no problem with your suggestion, although we could consider having 
 a size and addr field in the struct instead and be slightly more extensible. 
 But I don't feel strongly about it. 
 
 If we can agree on Scott's suggestion or with my modification (Alex, Gleb, 
 Marcelo ?) then I'll change the KVM/ARM patches to use this and resend them 
 right away. But we have to agree now!
 
 If not, I really think we should keep things as they are now, as we're 
 really discussing idealistic scenarios here, and the ARM patches have been 
 out of tree for long enough. As Marcelo pointed out, the benefits of the 
 perfect API are really minimal.
 
 -Christoffer--
 
 Can you make KVM_SET_ARM_DEVICE address extensible? 
 Add some reserved space and a flags field.

Can't we do this for the new ioctl that we all would agree on rather than the 
interim one that's only a short term solution for a greater problem?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS

2013-01-11 Thread Alexander Graf

On 10.01.2013, at 23:28, Marcelo Tosatti wrote:

 On Wed, Jan 09, 2013 at 10:37:20PM +0100, Alexander Graf wrote:
 
 
 We can start to introduce (and fix ARM) with a generic ioctl in the MPIC 
 patches then.
 
 The ioctl is already generic, except for its name.
 It's making a few wrong assumptions:
 * maximum size of value is u64
 
 This is tolerable IMHO.
 
 * combining device id (variable) with addr type id (const) into a single 
 field. It could just be split into multiple fields
 
 I agree, but that could be lived with as well.
 
 I get that there's a tradeoff between getting something in now, versus 
 waiting until the API is more refined.  Tagging it with a particular ISA 
 seems like an odd way of saying soon to be deprecated, though.  What 
 happens if we're still squabbling over the perfect replacement API when 
 we're trying to push PPC MPIC stuff in?
 
 As mentioned, i fail to see the benefit in sharing 0.0x% of the code (the
 interface), while the remaining code is not shared.

Important code sharing happens in user space.

 
 Then we're the ones who have to come up with a good interface.
 
 Or just have KVM_SET_PPC_DEVICE_ADDRESS. Is there a downside to that?

Yes. Harder to read code and a lack of commonality. Also it wouldn't cut it for 
us, as we need to set more than just addresses. But most of all I really hate 
to take a generic problem and solve it only for myself.

Btw, x86 has the same issue too, it just only somehow got away with not caring 
too much about it. The LAPIC has a version register that Mac OS X guests read. 
If the version is too old, it doesn't boot. So some time ago we just changed 
the version register.

The real, backwards compatible change however would've been to have user space 
tell us which version to emulate and set it accordingly. The way we did it 
basically broke backwards compatibility. We only got away with it because the 
register is rarely read.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS

2013-01-11 Thread Marcelo Tosatti
  Can you make KVM_SET_ARM_DEVICE address extensible? 
  Add some reserved space and a flags field.
 
 Can't we do this for the new ioctl that we all would agree on rather than the 
 interim one that's only a short term solution for a greater problem?
 
 
 Alex

OK.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS

2013-01-11 Thread Christoffer Dall
On Fri, Jan 11, 2013 at 2:18 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  Can you make KVM_SET_ARM_DEVICE address extensible?
  Add some reserved space and a flags field.

 Can't we do this for the new ioctl that we all would agree on rather than 
 the interim one that's only a short term solution for a greater problem?


 Alex

 OK.


Good. It's decided - the KVM/ARM patches remain in their current state
with an ARM specific ioctl, and will transition to something generic
post merge.

-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 6/5] KVM: x86: clear write_fault_to_shadow_pgtable explicitly

2013-01-11 Thread Marcelo Tosatti
On Fri, Jan 11, 2013 at 10:12:58PM +0800, Xiao Guangrong wrote:
 On 01/11/2013 09:15 PM, Marcelo Tosatti wrote:
 
  
  This is cryptic. Its not obvious at all for someone modifying the code, 
  for example.
  
  Can you please clear it explicitly? 
 
 Sure, this is the patch to apply your idea, is it good to you? :)
 
 
 Subject: [PATCH 6/6] KVM: x86: clear write_fault_to_shadow_pgtable explicitly
 
 Clear it explicitly when exiting x86_emulate_instruction to clarify the code,
 it is suggested by Marcelo
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/x86.c |   16 
  1 files changed, 12 insertions(+), 4 deletions(-)

Fine, please rebase against queue.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS

2013-01-11 Thread Scott Wood

On 01/11/2013 09:42:55 AM, Alexander Graf wrote:


On 11.01.2013, at 02:10, Scott Wood wrote:

 struct kvm_device_attr {
__u32 device;

This needs some semantic specification. Is device a constant value?  
Is it the return value of CREATE_IRQCHIP?


As proposed, it's up to the architecture to provide that  
specification.  In theory this could be used for things other than IRQ  
chips.  We could still say that device creation functions return a  
valid device ID (if the device has any attributes), as well as have  
other architecture-specific ways of describing device IDs (static  
enumeration).  Or we could have non-architecture-specific static  
enumeration.  Or just require that all devices be explicitly created by  
something that returns the ID.


Do you have a preferred approach?

-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS

2013-01-11 Thread Alexander Graf


Am 11.01.2013 um 21:11 schrieb Scott Wood scottw...@freescale.com:

 On 01/11/2013 09:42:55 AM, Alexander Graf wrote:
 On 11.01.2013, at 02:10, Scott Wood wrote:
  struct kvm_device_attr {
 __u32 device;
 This needs some semantic specification. Is device a constant value? Is it 
 the return value of CREATE_IRQCHIP?
 
 As proposed, it's up to the architecture to provide that specification.  In 
 theory this could be used for things other than IRQ chips.  We could still 
 say that device creation functions return a valid device ID (if the device 
 has any attributes), as well as have other architecture-specific ways of 
 describing device IDs (static enumeration).  Or we could have 
 non-architecture-specific static enumeration.  Or just require that all 
 devices be explicitly created by something that returns the ID.
 
 Do you have a preferred approach?

I am a fan of dynamically generated ids that get returned from the create 
functions (like open() and an fd).

I'm not sure if that would always work here though. It would mean that we 
explicitly have to create per-cpu interrupt controllers. Or append their state 
onto vcpu state.

Alex

 
 -Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html