RE: [PATCH 1/2] VFIO-AER: Vfio-pci driver changes for supporting AER
-Original Message- From: Alex Williamson [mailto:alex.william...@redhat.com] Sent: Wednesday, January 09, 2013 11:05 AM To: Pandarathil, Vijaymohan R Cc: Bjorn Helgaas; Gleb Natapov; kvm@vger.kernel.org; qemu- de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH 1/2] VFIO-AER: Vfio-pci driver changes for supporting AER On Wed, 2013-01-09 at 10:52 -0700, Alex Williamson wrote: On Wed, 2013-01-09 at 06:26 +, Pandarathil, Vijaymohan R wrote: - New ioctl which is used to pass the eventfd that is signaled when an error occurs in the vfio_pci_device - Register pci_error_handler for the vfio_pci driver - When the device encounters an error, the error handler registered by the vfio_pci driver gets invoked by the AER infrastructure - In the error handler, signal the eventfd registered for the device. - This results in the qemu eventfd handler getting invoked and appropriate action taken for the guest. Signed-off-by: Vijay Mohan Pandarathil vijaymohan.pandarat...@hp.com --- drivers/vfio/pci/vfio_pci.c | 29 + drivers/vfio/pci/vfio_pci_private.h | 1 + drivers/vfio/vfio.c | 8 include/linux/vfio.h| 1 + include/uapi/linux/vfio.h | 9 + 5 files changed, 48 insertions(+) diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 6c11994..4ae9526 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -207,6 +207,8 @@ static long vfio_pci_ioctl(void *device_data, if (vdev-reset_works) info.flags |= VFIO_DEVICE_FLAGS_RESET; + info.flags |= VFIO_DEVICE_FLAGS_AER_NOTIFY; + This appears to be a PCI specific flag, so the name should include _PCI_. We also support non-PCIe devices and it seems like it would be possible to not have AER support available, so shouldn't this be conditional? Will do that. info.num_regions = VFIO_PCI_NUM_REGIONS; info.num_irqs = VFIO_PCI_NUM_IRQS; @@ -348,6 +350,19 @@ static long vfio_pci_ioctl(void *device_data, return ret; + } else if (cmd == VFIO_DEVICE_SET_ERRFD) { + int32_t fd = (int32_t)arg; + + if (fd 0) + return -EINVAL; + + vdev-err_trigger = eventfd_ctx_fdget(fd); + + if (IS_ERR(vdev-err_trigger)) + return PTR_ERR(vdev-err_trigger); + + return 0; + I'm not sure why we wouldn't describe this as just another interrupt from the device and configure it via SET_IRQ. This ioctl has very limited use and doesn't follow any of the conventions of all the other vfio ioctls. I thought this was a fairly simple ioctl to implement this way. Moreover, there is no device interrupt involved. Let me know if you really want to model it as a SET_IRQ ioctl. } else if (cmd == VFIO_DEVICE_RESET) return vdev-reset_works ? pci_reset_function(vdev-pdev) : -EINVAL; @@ -527,11 +542,25 @@ static void vfio_pci_remove(struct pci_dev *pdev) kfree(vdev); } +static pci_ers_result_t vfio_err_detected(struct pci_dev *pdev, + pci_channel_state_t state) +{ + struct vfio_pci_device *vdev = vfio_get_vdev(pdev-dev); + + eventfd_signal(vdev-err_trigger, 1); + return PCI_ERS_RESULT_CAN_RECOVER; +} What if err_trigger hasn't been set? Will fix that. + +static const struct pci_error_handlers vfio_err_handlers = { + .error_detected = vfio_err_detected, +}; + static struct pci_driver vfio_pci_driver = { .name = vfio-pci, .id_table = NULL, /* only dynamic ids */ .probe = vfio_pci_probe, .remove = vfio_pci_remove, + .err_handler= vfio_err_handlers, }; static void __exit vfio_pci_cleanup(void) diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h index 611827c..daee62f 100644 --- a/drivers/vfio/pci/vfio_pci_private.h +++ b/drivers/vfio/pci/vfio_pci_private.h @@ -55,6 +55,7 @@ struct vfio_pci_device { boolbardirty; struct pci_saved_state *pci_saved_state; atomic_trefcnt; + struct eventfd_ctx *err_trigger; }; #define is_intx(vdev) (vdev-irq_type == VFIO_PCI_INTX_IRQ_INDEX) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index 56097c6..5ed5a54 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -693,6 +693,14 @@ void *vfio_del_group_dev(struct device *dev) } EXPORT_SYMBOL_GPL(vfio_del_group_dev); +void *vfio_get_vdev(struct device *dev) +{ + struct vfio_device *device = dev_get_drvdata(dev); + + return
RE: [PATCH 2/2] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
-Original Message- From: Alex Williamson [mailto:alex.william...@redhat.com] Sent: Wednesday, January 09, 2013 10:08 AM To: Pandarathil, Vijaymohan R Cc: Bjorn Helgaas; Gleb Natapov; kvm@vger.kernel.org; qemu- de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH 2/2] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices On Wed, 2013-01-09 at 06:26 +, Pandarathil, Vijaymohan R wrote: - Create eventfd per vfio device assigned to a guest and register an event handler - This fd is passed to the vfio_pci driver through a new ioctl - When the device encounters an error, the eventfd is signaled and the qemu eventfd handler gets invoked. - In the handler decide what action to take. Current action taken is to terminate the guest. Signed-off-by: Vijay Mohan Pandarathil vijaymohan.pandarat...@hp.com --- hw/vfio_pci.c | 56 ++ linux-headers/linux/vfio.h | 9 2 files changed, 65 insertions(+) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 28c8303..9c3c28b 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -38,6 +38,7 @@ #include qemu/error-report.h #include qemu/queue.h #include qemu/range.h +#include sysemu/sysemu.h /* #define DEBUG_VFIO */ #ifdef DEBUG_VFIO @@ -130,6 +131,8 @@ typedef struct VFIODevice { QLIST_ENTRY(VFIODevice) next; struct VFIOGroup *group; bool reset_works; +EventNotifier errfd; Misleading name, it's an EventNotifier not an fd. Will make the change. +__u32 dev_info_flags; } VFIODevice; typedef struct VFIOGroup { @@ -1805,6 +1808,8 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev) DPRINTF(Device %s flags: %u, regions: %u, irgs: %u\n, name, dev_info.flags, dev_info.num_regions, dev_info.num_irqs); +vdev-dev_info_flags = dev_info.flags; + We test the VFIO_DEVICE_FLAGS_RESET elsewhere by caching the bit into a variable, why not do that here too? I was wondering if there was any specific reason to cache the bit into a separate variable. Wouldn't a flags field which can contain the specific bit be more apt ? if (!(dev_info.flags VFIO_DEVICE_FLAGS_PCI)) { error_report(vfio: Um, this isn't a PCI device\n); goto error; @@ -1900,6 +1905,55 @@ static void vfio_put_device(VFIODevice *vdev) } } +static void vfio_errfd_handler(void *opaque) +{ +VFIODevice *vdev = opaque; + +if (!event_notifier_test_and_clear(vdev-errfd)) { +return; +} + +/* + * TBD. Retrieve the error details and decide what action + * needs to be taken. One of the actions could be to pass + * the error to the guest and have the guest driver recover + * the error. This requires that PCIe capabilities be + * exposed to the guest. At present, we just terminate the + * guest to contain the error. + */ +error_report(%s(%04x:%02x:%02x.%x) +Unrecoverable error detected... Terminating guest\n, +__func__, vdev-host.domain, vdev-host.bus, vdev-host.slot, +vdev-host.function); + +qemu_system_shutdown_request(); I would have figured hw_error Hw_error() is probably more appropriate. Will make the change. +return; +} + +static void vfio_register_errfd(VFIODevice *vdev) +{ +int32_t pfd; pfd is used elsewhere in vfio as Pointer to FD, this is just a fd. Will change. +int ret; + +if (!(vdev-dev_info_flags VFIO_DEVICE_FLAGS_AER_NOTIFY)) { +error_report(vfio: Warning: Error notification not supported for the device\n); This should probably just be a debug print. Okay. +return; +} +if (event_notifier_init(vdev-errfd, 0)) { +error_report(vfio: Warning: Unable to init event notifier for error detection\n); +return; +} +pfd = event_notifier_get_fd(vdev-errfd); +qemu_set_fd_handler(pfd, vfio_errfd_handler, NULL, vdev); + +ret = ioctl(vdev-fd, VFIO_DEVICE_SET_ERRFD, pfd); +if (ret) { +error_report(vfio: Warning: Failed to setup error fd: %d\n, ret); +qemu_set_fd_handler(pfd, NULL, NULL, vdev); +event_notifier_cleanup(vdev-errfd); +} +return; +} static int vfio_initfn(PCIDevice *pdev) { VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev); @@ -2010,6 +2064,8 @@ static int vfio_initfn(PCIDevice *pdev) } } +vfio_register_errfd(vdev); + No cleanup in exitfn?! Thanks, Missed that. Will fix. Vijay Alex return 0; out_teardown: diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h index 4758d1b..0ca4eeb 100644 --- a/linux-headers/linux/vfio.h +++
RE: [Qemu-devel] [PATCH 2/2] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
-Original Message- From: Blue Swirl [mailto:blauwir...@gmail.com] Sent: Wednesday, January 09, 2013 1:23 PM To: Pandarathil, Vijaymohan R Cc: Alex Williamson; Bjorn Helgaas; Gleb Natapov; linux- p...@vger.kernel.org; qemu-de...@nongnu.org; kvm@vger.kernel.org; linux- ker...@vger.kernel.org Subject: Re: [Qemu-devel] [PATCH 2/2] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices On Wed, Jan 9, 2013 at 6:26 AM, Pandarathil, Vijaymohan R vijaymohan.pandarat...@hp.com wrote: - Create eventfd per vfio device assigned to a guest and register an event handler - This fd is passed to the vfio_pci driver through a new ioctl - When the device encounters an error, the eventfd is signaled and the qemu eventfd handler gets invoked. - In the handler decide what action to take. Current action taken is to terminate the guest. Signed-off-by: Vijay Mohan Pandarathil vijaymohan.pandarat...@hp.com --- hw/vfio_pci.c | 56 ++ linux-headers/linux/vfio.h | 9 2 files changed, 65 insertions(+) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 28c8303..9c3c28b 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -38,6 +38,7 @@ #include qemu/error-report.h #include qemu/queue.h #include qemu/range.h +#include sysemu/sysemu.h /* #define DEBUG_VFIO */ #ifdef DEBUG_VFIO @@ -130,6 +131,8 @@ typedef struct VFIODevice { QLIST_ENTRY(VFIODevice) next; struct VFIOGroup *group; bool reset_works; +EventNotifier errfd; +__u32 dev_info_flags; QEMU is not kernel code, please use uint32_t. As you saw later in the code, I was using the same type in the structure copied from the kernel. Will change this to uint32_t. Thanks Vijay } VFIODevice; typedef struct VFIOGroup { @@ -1805,6 +1808,8 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev) DPRINTF(Device %s flags: %u, regions: %u, irgs: %u\n, name, dev_info.flags, dev_info.num_regions, dev_info.num_irqs); +vdev-dev_info_flags = dev_info.flags; + if (!(dev_info.flags VFIO_DEVICE_FLAGS_PCI)) { error_report(vfio: Um, this isn't a PCI device\n); goto error; @@ -1900,6 +1905,55 @@ static void vfio_put_device(VFIODevice *vdev) } } +static void vfio_errfd_handler(void *opaque) +{ +VFIODevice *vdev = opaque; + +if (!event_notifier_test_and_clear(vdev-errfd)) { +return; +} + +/* + * TBD. Retrieve the error details and decide what action + * needs to be taken. One of the actions could be to pass + * the error to the guest and have the guest driver recover + * the error. This requires that PCIe capabilities be + * exposed to the guest. At present, we just terminate the + * guest to contain the error. + */ +error_report(%s(%04x:%02x:%02x.%x) +Unrecoverable error detected... Terminating guest\n, +__func__, vdev-host.domain, vdev-host.bus, vdev-host.slot, +vdev-host.function); + +qemu_system_shutdown_request(); +return; +} + +static void vfio_register_errfd(VFIODevice *vdev) +{ +int32_t pfd; +int ret; + +if (!(vdev-dev_info_flags VFIO_DEVICE_FLAGS_AER_NOTIFY)) { +error_report(vfio: Warning: Error notification not supported for the device\n); +return; +} +if (event_notifier_init(vdev-errfd, 0)) { +error_report(vfio: Warning: Unable to init event notifier for error detection\n); +return; +} +pfd = event_notifier_get_fd(vdev-errfd); +qemu_set_fd_handler(pfd, vfio_errfd_handler, NULL, vdev); + +ret = ioctl(vdev-fd, VFIO_DEVICE_SET_ERRFD, pfd); +if (ret) { +error_report(vfio: Warning: Failed to setup error fd: %d\n, ret); +qemu_set_fd_handler(pfd, NULL, NULL, vdev); +event_notifier_cleanup(vdev-errfd); +} +return; +} static int vfio_initfn(PCIDevice *pdev) { VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev); @@ -2010,6 +2064,8 @@ static int vfio_initfn(PCIDevice *pdev) } } +vfio_register_errfd(vdev); + return 0; out_teardown: diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h index 4758d1b..0ca4eeb 100644 --- a/linux-headers/linux/vfio.h +++ b/linux-headers/linux/vfio.h @@ -147,6 +147,7 @@ struct vfio_device_info { __u32 flags; #define VFIO_DEVICE_FLAGS_RESET(1 0)/* Device supports reset */ #define VFIO_DEVICE_FLAGS_PCI (1 1)/* vfio-pci device */ +#define VFIO_DEVICE_FLAGS_AER_NOTIFY (1 2) /* Supports aer notification */ __u32 num_regions;/* Max region index + 1 */ __u32 num_irqs;
[PATCH 0/4] KVM: Clean up and optimize __kvm_set_memory_region() - part1
Patches 1 to 3 are trivial. Patch 4 is the main cause of the increased lines, but I think the new code makes it easier to understand why each condition in __kvm_set_memory_region() is there. If you don't agree with patch 4, please consider taking the rest of the series at this time. Takuya Yoshikawa (4): KVM: set_memory_region: Don't jump to out_free unnecessarily KVM: set_memory_region: Don't check for overlaps unless we create or move a slot KVM: set_memory_region: Remove unnecessary variable memslot KVM: set_memory_region: Identify the requested change explicitly virt/kvm/kvm_main.c | 94 --- 1 files changed, 59 insertions(+), 35 deletions(-) -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] KVM: set_memory_region: Don't jump to out_free unnecessarily
This makes the separation between the sanity checks and the rest of the code a bit clearer. Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp --- virt/kvm/kvm_main.c |7 +++ 1 files changed, 3 insertions(+), 4 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e45c20c..d5e4bf9 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -778,9 +778,9 @@ int __kvm_set_memory_region(struct kvm *kvm, */ r = -EINVAL; if (npages old.npages npages != old.npages) - goto out_free; + goto out; if (!npages !old.npages) - goto out_free; + goto out; /* Check for overlaps */ r = -EEXIST; @@ -789,7 +789,7 @@ int __kvm_set_memory_region(struct kvm *kvm, continue; if (!((base_gfn + npages = slot-base_gfn) || (base_gfn = slot-base_gfn + slot-npages))) - goto out_free; + goto out; } /* Free page dirty bitmap if unneeded */ @@ -891,7 +891,6 @@ out_free: kvm_free_physmem_slot(new, old); out: return r; - } EXPORT_SYMBOL_GPL(__kvm_set_memory_region); -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] KVM: set_memory_region: Don't check for overlaps unless we create or move a slot
Don't need the check for deleting an existing slot or just modifiying the flags. Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp --- virt/kvm/kvm_main.c | 18 ++ 1 files changed, 10 insertions(+), 8 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index d5e4bf9..f6c8cdc 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -782,14 +782,16 @@ int __kvm_set_memory_region(struct kvm *kvm, if (!npages !old.npages) goto out; - /* Check for overlaps */ - r = -EEXIST; - kvm_for_each_memslot(slot, kvm-memslots) { - if (slot-id = KVM_USER_MEM_SLOTS || slot == memslot) - continue; - if (!((base_gfn + npages = slot-base_gfn) || - (base_gfn = slot-base_gfn + slot-npages))) - goto out; + if ((npages !old.npages) || (base_gfn != old.base_gfn)) { + /* Check for overlaps */ + r = -EEXIST; + kvm_for_each_memslot(slot, kvm-memslots) { + if (slot-id = KVM_USER_MEM_SLOTS || slot == memslot) + continue; + if (!((base_gfn + npages = slot-base_gfn) || + (base_gfn = slot-base_gfn + slot-npages))) + goto out; + } } /* Free page dirty bitmap if unneeded */ -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] KVM: set_memory_region: Remove unnecessary variable memslot
One such variable, slot, is enough for holding a pointer temporarily. We also remove another local variable named slot, which is limited in a block, since it is confusing to have the same name in this function. Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp --- virt/kvm/kvm_main.c | 11 +-- 1 files changed, 5 insertions(+), 6 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index f6c8cdc..8cb4e71 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -728,7 +728,7 @@ int __kvm_set_memory_region(struct kvm *kvm, int r; gfn_t base_gfn; unsigned long npages; - struct kvm_memory_slot *memslot, *slot; + struct kvm_memory_slot *slot; struct kvm_memory_slot old, new; struct kvm_memslots *slots = NULL, *old_memslots; @@ -754,7 +754,7 @@ int __kvm_set_memory_region(struct kvm *kvm, if (mem-guest_phys_addr + mem-memory_size mem-guest_phys_addr) goto out; - memslot = id_to_memslot(kvm-memslots, mem-slot); + slot = id_to_memslot(kvm-memslots, mem-slot); base_gfn = mem-guest_phys_addr PAGE_SHIFT; npages = mem-memory_size PAGE_SHIFT; @@ -765,7 +765,7 @@ int __kvm_set_memory_region(struct kvm *kvm, if (!npages) mem-flags = ~KVM_MEM_LOG_DIRTY_PAGES; - new = old = *memslot; + new = old = *slot; new.id = mem-slot; new.base_gfn = base_gfn; @@ -786,7 +786,8 @@ int __kvm_set_memory_region(struct kvm *kvm, /* Check for overlaps */ r = -EEXIST; kvm_for_each_memslot(slot, kvm-memslots) { - if (slot-id = KVM_USER_MEM_SLOTS || slot == memslot) + if ((slot-id = KVM_USER_MEM_SLOTS) || + (slot-id == mem-slot)) continue; if (!((base_gfn + npages = slot-base_gfn) || (base_gfn = slot-base_gfn + slot-npages))) @@ -823,8 +824,6 @@ int __kvm_set_memory_region(struct kvm *kvm, } if (!npages || base_gfn != old.base_gfn) { - struct kvm_memory_slot *slot; - r = -ENOMEM; slots = kmemdup(kvm-memslots, sizeof(struct kvm_memslots), GFP_KERNEL); -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] KVM: set_memory_region: Identify the requested change explicitly
KVM_SET_USER_MEMORY_REGION forces __kvm_set_memory_region() to identify what kind of change is being requested by checking the parameters. The current code does this checking at various points in code and each condition used there is not easy to understand at first glance. This patch consolidates these and introduces an enum to name the possible changes to clean up the code. Although this does not introduce any user visible changes, there are two changes which optimize the code a bit: 1. If we have nothing to change, the new code returns immediately. 2. If we just modify the flags, no changes about the mapping, the new code does not call kvm_iommu_map_pages(). One interesting thing we noticed was about the case 1. Although this case seemed unimportant, QEMU was relying on the current behaviour: when we returned a value other than 0, e.g. -EINVAL, live migration failed at the final stage with a section mismatch error. Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp --- virt/kvm/kvm_main.c | 64 +++ 1 files changed, 44 insertions(+), 20 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 8cb4e71..5426e96 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -714,6 +714,24 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm, } /* + * KVM_SET_USER_MEMORY_REGION ioctl allows the following operations: + * - create a new memory slot + * - delete an existing memory slot + * - modify an existing memory slot + * -- move it in the guest physical memory space + * -- just change its flags + * + * Since flags can be changed by some of these operations, the following + * differentiation is the best we can do for __kvm_set_memory_region(): + */ +enum kvm_mr_change { + KVM_MR_CREATE, + KVM_MR_DELETE, + KVM_MR_MOVE, + KVM_MR_FLAGS_ONLY, +}; + +/* * Allocate some memory and give it an address in the guest physical address * space. * @@ -731,6 +749,7 @@ int __kvm_set_memory_region(struct kvm *kvm, struct kvm_memory_slot *slot; struct kvm_memory_slot old, new; struct kvm_memslots *slots = NULL, *old_memslots; + enum kvm_mr_change change; r = check_memory_region_flags(mem); if (r) @@ -772,17 +791,30 @@ int __kvm_set_memory_region(struct kvm *kvm, new.npages = npages; new.flags = mem-flags; - /* -* Disallow changing a memory slot's size or changing anything about -* zero sized slots that doesn't involve making them non-zero. -*/ r = -EINVAL; - if (npages old.npages npages != old.npages) - goto out; - if (!npages !old.npages) + if (npages) { + if (!old.npages) + change = KVM_MR_CREATE; + else { /* Modify an existing slot. */ + if ((mem-userspace_addr != old.userspace_addr) || + (npages != old.npages)) + goto out; + + if (base_gfn != old.base_gfn) + change = KVM_MR_MOVE; + else if (new.flags != old.flags) + change = KVM_MR_FLAGS_ONLY; + else { /* Nothing to change. */ + r = 0; + goto out; + } + } + } else if (old.npages) { + change = KVM_MR_DELETE; + } else /* Modify a non-existent slot: disallowed. */ goto out; - if ((npages !old.npages) || (base_gfn != old.base_gfn)) { + if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) { /* Check for overlaps */ r = -EEXIST; kvm_for_each_memslot(slot, kvm-memslots) { @@ -800,20 +832,12 @@ int __kvm_set_memory_region(struct kvm *kvm, new.dirty_bitmap = NULL; r = -ENOMEM; - - /* -* Allocate if a slot is being created. If modifying a slot, -* the userspace_addr cannot change. -*/ - if (!old.npages) { + if (change == KVM_MR_CREATE) { new.user_alloc = user_alloc; new.userspace_addr = mem-userspace_addr; if (kvm_arch_create_memslot(new, npages)) goto out_free; - } else if (npages mem-userspace_addr != old.userspace_addr) { - r = -EINVAL; - goto out_free; } /* Allocate page dirty bitmap if needed */ @@ -823,7 +847,7 @@ int __kvm_set_memory_region(struct kvm *kvm, /* destroy any largepage mappings for dirty tracking */ } - if (!npages || base_gfn != old.base_gfn) { + if ((change == KVM_MR_DELETE) || (change == KVM_MR_MOVE)) { r = -ENOMEM; slots = kmemdup(kvm-memslots, sizeof(struct
Re: [RFC PATCH] virtio-net: introduce a new macaddr control
On Thu, Jan 10, 2013 at 10:51:57PM +0800, ak...@redhat.com wrote: @@ -349,6 +351,13 @@ static int virtio_net_handle_mac(VirtIONet *n, uint8_t cmd, { struct virtio_net_ctrl_mac mac_data; +if (cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET elem-out_num == 2) { +/* Set MAC address */ +memcpy(n-mac, elem-out_sg[1].iov_base, elem-out_sg[1].iov_len); We cannot trust the guest's iov_len, it could overflow n-mac. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
On Fri, Jan 11, 2013 at 02:37:15AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-01-10: On Thu, Jan 10, 2013 at 08:32:06AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-01-10: On Thu, Jan 10, 2013 at 03:26:07PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com basically to benefit from apicv, we need to enable virtualized x2apic mode. Currently, we only enable it when guest is really using x2apic. Also, clear MSR bitmap for corresponding x2apic MSRs when guest enabled x2apic: 0x800 - 0x8ff: no read intercept for apicv register virtualization, except APIC ID and TMCCT. APIC ID and TMCCT: need software's assistance to get right value. TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery. Signed-off-by: Kevin Tian kevin.t...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/include/asm/kvm_host.h |2 + arch/x86/include/asm/vmx.h |1 + arch/x86/kvm/lapic.c|5 +- arch/x86/kvm/svm.c |6 + arch/x86/kvm/vmx.c | 194 +-- 5 files changed, 200 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c431b33..572a562 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -697,6 +697,8 @@ struct kvm_x86_ops { void (*enable_nmi_window)(struct kvm_vcpu *vcpu); void (*enable_irq_window)(struct kvm_vcpu *vcpu); void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr); +void (*enable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu); +void (*disable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu); Make one callback with enable/disable parameter. And do not forget SVM. int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); int (*get_tdp_level)(void); u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio); diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 44c3f7e..0a54df0 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -139,6 +139,7 @@ #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x0001 #define SECONDARY_EXEC_ENABLE_EPT 0x0002 #define SECONDARY_EXEC_RDTSCP 0x0008 +#define SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE 0x0010 #define SECONDARY_EXEC_ENABLE_VPID 0x0020 #define SECONDARY_EXEC_WBINVD_EXITING 0x0040 #define SECONDARY_EXEC_UNRESTRICTED_GUEST 0x0080 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 0664c13..ec38906 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1328,7 +1328,10 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) u32 id = kvm_apic_id(apic); u32 ldr = ((id 4) 16) | (1 (id 0xf)); kvm_apic_set_ldr(apic, ldr); -} +kvm_x86_ops-enable_virtual_x2apic_mode(vcpu); +} else +kvm_x86_ops-disable_virtual_x2apic_mode(vcpu); + You just broke SVM. apic-base_address = apic-vcpu-arch.apic_base MSR_IA32_APICBASE_BASE; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index d29d3cd..0b82cb1 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3571,6 +3571,11 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) set_cr_intercept(svm, INTERCEPT_CR8_WRITE); } +static void svm_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu) +{ +return; +} + static int svm_nmi_allowed(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -4290,6 +4295,7 @@ static struct kvm_x86_ops svm_x86_ops = { .enable_nmi_window = enable_nmi_window, .enable_irq_window = enable_irq_window, .update_cr8_intercept = update_cr8_intercept, +.enable_virtual_x2apic_mode = svm_enable_virtual_x2apic_mode, .set_tss_addr = svm_set_tss_addr, .get_tdp_level = get_npt_level, diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 688f43f..b203ce7 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -433,6 +433,8 @@ struct vcpu_vmx { bool rdtscp_enabled; +bool virtual_x2apic_enabled; + /* Support for a guest hypervisor (nested VMX) */ struct nested_vmx nested; }; @@ -767,12 +769,23 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void) SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; } +static inline bool cpu_has_vmx_virtualize_x2apic_mode(void) +{ +return vmcs_config.cpu_based_2nd_exec_ctrl +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE; +} +
[uq/master PATCH] vmxcap: bit 9 of VMX_PROCBASED_CTLS2 is 'virtual interrupt delivery'
Bit 9 of MSR_IA32_VMX_PROCBASED_CTLS2 is virtual interrupt delivery. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com diff --git a/scripts/kvm/vmxcap b/scripts/kvm/vmxcap index cbe6440..0b23f77 100755 --- a/scripts/kvm/vmxcap +++ b/scripts/kvm/vmxcap @@ -147,6 +147,7 @@ controls = [ 5: 'Enable VPID', 6: 'WBINVD exiting', 7: 'Unrestricted guest', +9: 'Virtual interrupt delivery', 10: 'PAUSE-loop exiting', 11: 'RDRAND exiting', 12: 'Enable INVPCID', -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 5/5] KVM: x86: improve reexecute_instruction
On Fri, Jan 11, 2013 at 04:18:22AM +0800, Xiao Guangrong wrote: On 01/11/2013 03:48 AM, Marcelo Tosatti wrote: On Fri, Jan 11, 2013 at 02:05:33AM +0800, Xiao Guangrong wrote: On 01/11/2013 01:26 AM, Marcelo Tosatti wrote: On Tue, Jan 08, 2013 at 02:38:36PM +0800, Xiao Guangrong wrote: The current reexecute_instruction can not well detect the failed instruction emulation. It allows guest to retry all the instructions except it accesses on error pfn For example, some cases are nested-write-protect - if the page we want to write is used as PDE but it chains to itself. Under this case, we should stop the emulation and report the case to userspace Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/include/asm/kvm_host.h |7 +++ arch/x86/kvm/paging_tmpl.h | 27 --- arch/x86/kvm/x86.c |8 +++- 3 files changed, 34 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c431b33..d6ab8d2 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -502,6 +502,13 @@ struct kvm_vcpu_arch { u64 msr_val; struct gfn_to_hva_cache data; } pv_eoi; + +/* + * Indicate whether the access faults on its page table in guest + * which is set when fix page fault and used to detect unhandeable + * instruction. + */ +bool write_fault_to_shadow_pgtable; }; struct kvm_lpage_info { diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 67b390d..df50560 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -497,26 +497,34 @@ out_gpte_changed: * created when kvm establishes shadow page table that stop kvm using large * page size. Do it early can avoid unnecessary #PF and emulation. * + * @write_fault_to_shadow_pgtable will return true if the fault gfn is + * currently used as its page table. + * * Note: the PDPT page table is not checked for PAE-32 bit guest. It is ok * since the PDPT is always shadowed, that means, we can not use large page * size to map the gfn which is used as PDPT. */ static bool FNAME(is_self_change_mapping)(struct kvm_vcpu *vcpu, - struct guest_walker *walker, int user_fault) + struct guest_walker *walker, int user_fault, + bool *write_fault_to_shadow_pgtable) { int level; gfn_t mask = ~(KVM_PAGES_PER_HPAGE(walker-level) - 1); +bool self_changed = false; if (!(walker-pte_access ACC_WRITE_MASK || (!is_write_protection(vcpu) !user_fault))) return false; -for (level = walker-level; level = walker-max_level; level++) -if (!((walker-gfn ^ walker-table_gfn[level - 1]) mask)) -return true; +for (level = walker-level; level = walker-max_level; level++) { +gfn_t gfn = walker-gfn ^ walker-table_gfn[level - 1]; + +self_changed |= !(gfn mask); +*write_fault_to_shadow_pgtable |= !gfn; +} -return false; +return self_changed; } /* @@ -544,7 +552,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, int level = PT_PAGE_TABLE_LEVEL; int force_pt_level; unsigned long mmu_seq; -bool map_writable; +bool map_writable, is_self_change_mapping; pgprintk(%s: addr %lx err %x\n, __func__, addr, error_code); @@ -572,9 +580,14 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, return 0; } +vcpu-arch.write_fault_to_shadow_pgtable = false; + +is_self_change_mapping = FNAME(is_self_change_mapping)(vcpu, + walker, user_fault, vcpu-arch.write_fault_to_shadow_pgtable); + if (walker.level = PT_DIRECTORY_LEVEL) force_pt_level = mapping_level_dirty_bitmap(vcpu, walker.gfn) - || FNAME(is_self_change_mapping)(vcpu, walker, user_fault); + || is_self_change_mapping; else force_pt_level = 1; if (!force_pt_level) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6f13e03..2957012 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4810,7 +4810,13 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2) * guest to let CPU execute the instruction. */ kvm_mmu_unprotect_page(vcpu-kvm, gpa_to_gfn(gpa)); -return true; + +/* + * If the
Re: [PATCH v5 5/5] KVM: x86: improve reexecute_instruction
On Fri, Jan 11, 2013 at 02:16:11AM +0800, Xiao Guangrong wrote: On 01/11/2013 01:30 AM, Marcelo Tosatti wrote: On Tue, Jan 08, 2013 at 02:38:36PM +0800, Xiao Guangrong wrote: The current reexecute_instruction can not well detect the failed instruction emulation. It allows guest to retry all the instructions except it accesses on error pfn For example, some cases are nested-write-protect - if the page we want to write is used as PDE but it chains to itself. Under this case, we should stop the emulation and report the case to userspace Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/include/asm/kvm_host.h |7 +++ arch/x86/kvm/paging_tmpl.h | 27 --- arch/x86/kvm/x86.c |8 +++- 3 files changed, 34 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c431b33..d6ab8d2 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -502,6 +502,13 @@ struct kvm_vcpu_arch { u64 msr_val; struct gfn_to_hva_cache data; } pv_eoi; + + /* + * Indicate whether the access faults on its page table in guest + * which is set when fix page fault and used to detect unhandeable + * instruction. + */ + bool write_fault_to_shadow_pgtable; }; struct kvm_lpage_info { diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 67b390d..df50560 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -497,26 +497,34 @@ out_gpte_changed: * created when kvm establishes shadow page table that stop kvm using large * page size. Do it early can avoid unnecessary #PF and emulation. * + * @write_fault_to_shadow_pgtable will return true if the fault gfn is + * currently used as its page table. + * * Note: the PDPT page table is not checked for PAE-32 bit guest. It is ok * since the PDPT is always shadowed, that means, we can not use large page * size to map the gfn which is used as PDPT. */ static bool FNAME(is_self_change_mapping)(struct kvm_vcpu *vcpu, -struct guest_walker *walker, int user_fault) +struct guest_walker *walker, int user_fault, +bool *write_fault_to_shadow_pgtable) { int level; gfn_t mask = ~(KVM_PAGES_PER_HPAGE(walker-level) - 1); + bool self_changed = false; if (!(walker-pte_access ACC_WRITE_MASK || (!is_write_protection(vcpu) !user_fault))) return false; - for (level = walker-level; level = walker-max_level; level++) - if (!((walker-gfn ^ walker-table_gfn[level - 1]) mask)) - return true; + for (level = walker-level; level = walker-max_level; level++) { + gfn_t gfn = walker-gfn ^ walker-table_gfn[level - 1]; + + self_changed |= !(gfn mask); + *write_fault_to_shadow_pgtable |= !gfn; + } - return false; + return self_changed; } /* @@ -544,7 +552,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, int level = PT_PAGE_TABLE_LEVEL; int force_pt_level; unsigned long mmu_seq; - bool map_writable; + bool map_writable, is_self_change_mapping; pgprintk(%s: addr %lx err %x\n, __func__, addr, error_code); @@ -572,9 +580,14 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, return 0; } + vcpu-arch.write_fault_to_shadow_pgtable = false; + + is_self_change_mapping = FNAME(is_self_change_mapping)(vcpu, +walker, user_fault, vcpu-arch.write_fault_to_shadow_pgtable); + if (walker.level = PT_DIRECTORY_LEVEL) force_pt_level = mapping_level_dirty_bitmap(vcpu, walker.gfn) - || FNAME(is_self_change_mapping)(vcpu, walker, user_fault); + || is_self_change_mapping; else force_pt_level = 1; if (!force_pt_level) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6f13e03..2957012 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4810,7 +4810,13 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2) * guest to let CPU execute the instruction. */ kvm_mmu_unprotect_page(vcpu-kvm, gpa_to_gfn(gpa)); - return true; + + /* + * If the access faults on its page table, it can not + * be fixed by unprotecting shadow page and it should + * be reported to userspace. + */ + return !vcpu-arch.write_fault_to_shadow_pgtable; } static bool retry_instruction(struct x86_emulate_ctxt *ctxt, Also should make sure vcpu-arch.write_fault_to_shadow_pgtable is never reused. Say, clean when exiting x86_emulate_instruction? Yes, it is more clear. But i am thinking if it is
Re: [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support
On Thu, Jan 10, 2013 at 07:36:28PM -0200, Marcelo Tosatti wrote: Hi, Getting into good shape. On Thu, Jan 10, 2013 at 03:26:08PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com Virtual interrupt delivery avoids KVM to inject vAPIC interrupts manually, which is fully taken care of by the hardware. This needs some special awareness into existing interrupr injection path: - for pending interrupt, instead of direct injection, we may need update architecture specific indicators before resuming to guest. - A pending interrupt, which is masked by ISR, should be also considered in above update action, since hardware will decide when to inject it at right time. Current has_interrupt and get_interrupt only returns a valid vector from injection p.o.v. Signed-off-by: Kevin Tian kevin.t...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/include/asm/kvm_host.h |5 + arch/x86/include/asm/vmx.h | 11 +++ arch/x86/kvm/irq.c | 56 +++- arch/x86/kvm/lapic.c| 72 +-- arch/x86/kvm/lapic.h| 23 + arch/x86/kvm/svm.c | 18 arch/x86/kvm/vmx.c | 191 +-- arch/x86/kvm/x86.c | 14 +++- include/linux/kvm_host.h|3 + virt/kvm/ioapic.c | 18 virt/kvm/ioapic.h |4 + virt/kvm/irq_comm.c | 22 + virt/kvm/kvm_main.c |5 + 13 files changed, 399 insertions(+), 43 deletions(-) static void recalculate_apic_map(struct kvm *kvm) { struct kvm_apic_map *new, *old = NULL; @@ -236,12 +219,14 @@ static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id) { apic_set_reg(apic, APIC_ID, id 24); recalculate_apic_map(apic-vcpu-kvm); + ioapic_update_eoi_exitmap(apic-vcpu-kvm); } Move ioapic_update_eoi_exitmap into recalculate_apic_map. static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id) { apic_set_reg(apic, APIC_LDR, id); recalculate_apic_map(apic-vcpu-kvm); + ioapic_update_eoi_exitmap(apic-vcpu-kvm); } diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index b203ce7..990409a 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -434,6 +434,7 @@ struct vcpu_vmx { bool rdtscp_enabled; bool virtual_x2apic_enabled; + unsigned long eoi_exit_bitmap[4]; Use DECLARE_BITMAP (unsigned long is 4 bytes on 32-bit host). /* Support for a guest hypervisor (nested VMX) */ struct nested_vmx nested; @@ -783,7 +784,8 @@ static inline bool cpu_has_vmx_apic_register_virt(void) static inline bool cpu_has_vmx_virtual_intr_delivery(void) { - return false; + return vmcs_config.cpu_based_2nd_exec_ctrl + SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY; } static inline bool cpu_has_vmx_flexpriority(void) @@ -2565,7 +2567,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) SECONDARY_EXEC_PAUSE_LOOP_EXITING | SECONDARY_EXEC_RDTSCP | SECONDARY_EXEC_ENABLE_INVPCID | - SECONDARY_EXEC_APIC_REGISTER_VIRT; + SECONDARY_EXEC_APIC_REGISTER_VIRT | + SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY; if (adjust_vmx_controls(min2, opt2, MSR_IA32_VMX_PROCBASED_CTLS2, _cpu_based_2nd_exec_control) 0) @@ -2579,7 +2582,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) if (!(_cpu_based_exec_control CPU_BASED_TPR_SHADOW)) _cpu_based_2nd_exec_control = ~( - SECONDARY_EXEC_APIC_REGISTER_VIRT); + SECONDARY_EXEC_APIC_REGISTER_VIRT | + SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); Nevermind the earlier comment about (). +static void set_eoi_exitmap_one(struct kvm_vcpu *vcpu, + u32 vector) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + + if (WARN_ONCE((vector 255), + KVM VMX: vector (%d) out of range\n, vector)) + return; + __set_bit(vector, vmx-eoi_exit_bitmap); +} + +void vmx_check_ioapic_entry(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq) +{ + struct kvm_lapic **dst; + struct kvm_apic_map *map; + unsigned long bitmap = 1; + int i; + + rcu_read_lock(); + map = rcu_dereference(vcpu-kvm-arch.apic_map); + + if (unlikely(!map)) { + set_eoi_exitmap_one(vcpu, irq-vector); + goto out; + } + + if (irq-dest_mode == 0) { /* physical mode */ + if (irq-delivery_mode == APIC_DM_LOWEST || + irq-dest_id == 0xff) { +
[PATCH v5 6/5] KVM: x86: clear write_fault_to_shadow_pgtable explicitly
On 01/11/2013 09:15 PM, Marcelo Tosatti wrote: This is cryptic. Its not obvious at all for someone modifying the code, for example. Can you please clear it explicitly? Sure, this is the patch to apply your idea, is it good to you? :) Subject: [PATCH 6/6] KVM: x86: clear write_fault_to_shadow_pgtable explicitly Clear it explicitly when exiting x86_emulate_instruction to clarify the code, it is suggested by Marcelo Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/x86.c | 16 1 files changed, 12 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2957012..89d01a8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4753,7 +4753,8 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu) return r; } -static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2) +static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2, + bool write_fault_to_shadow_pgtable) { gpa_t gpa = cr2; pfn_t pfn; @@ -4816,7 +4817,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2) * be fixed by unprotecting shadow page and it should * be reported to userspace. */ - return !vcpu-arch.write_fault_to_shadow_pgtable; + return !write_fault_to_shadow_pgtable; } static bool retry_instruction(struct x86_emulate_ctxt *ctxt, @@ -4875,7 +4876,13 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, int r; struct x86_emulate_ctxt *ctxt = vcpu-arch.emulate_ctxt; bool writeback = true; + bool write_fault_to_spt = vcpu-arch.write_fault_to_shadow_pgtable; + /* +* Clear write_fault_to_shadow_pgtable here to ensure it is +* never reused. +*/ + vcpu-arch.write_fault_to_shadow_pgtable = false; kvm_clear_exception_queue(vcpu); if (!(emulation_type EMULTYPE_NO_DECODE)) { @@ -4894,7 +4901,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, if (r != EMULATION_OK) { if (emulation_type EMULTYPE_TRAP_UD) return EMULATE_FAIL; - if (reexecute_instruction(vcpu, cr2)) + if (reexecute_instruction(vcpu, cr2, + write_fault_to_spt)) return EMULATE_DONE; if (emulation_type EMULTYPE_SKIP) return EMULATE_FAIL; @@ -4924,7 +4932,7 @@ restart: return EMULATE_DONE; if (r == EMULATION_FAILED) { - if (reexecute_instruction(vcpu, cr2)) + if (reexecute_instruction(vcpu, cr2, write_fault_to_spt)) return EMULATE_DONE; return handle_emulation_failure(vcpu); -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] make mac programming for virtio net more robust
On 1/10/2013 11:46 PM, Michael S. Tsirkin wrote: On Fri, Jan 11, 2013 at 12:53:07PM +1030, Rusty Russell wrote: Michael S. Tsirkin m...@redhat.com writes: On Thu, Jan 10, 2013 at 10:45:39PM +0800, ak...@redhat.com wrote: From: Amos Kong ak...@redhat.com Currenly mac is programmed byte by byte. This means that we have an intermediate step where mac is wrong. Second patch introduced a new vq control command to set mac address in one time. As you mention we could alternatively do it without new commands, simply add a feature bit that says that MACs are in the mac table. This would be a much bigger patch, and I'm fine with either way. Rusty what do you think? Hmm, mac filtering and my mac address are not quite the same thing. I don't know if it matters for anyone: does it? The mac address is abused for things like identifying machines, etc. I don't know either. I think net core differentiates between mac and uc_list because linux has to know which mac to use when building up packets, so at some level, I agree it might be useful to identify the machine. BTW netdev/davem should have been copied on this, Amos I think it's a good idea to remember to do it next time you post. If we keep it as a separate concept, Amos' patch seems to make sense. Yes. It also keeps the patch small, I just thought I'd mention the option. Cheers, Rusty. Don't have the entire context here but if you implement the ndo_fdb_dump() probably hooking it up to ndo_dflt_fdb_dump() you could use the 'bridge' tool dump the uc_list. Then use ndo_fdb_add() and ndo_fdb_del() to add and remove entries from the uc_list. We do this today in macvlan and the ixgbe driver when it is in SR-IOV mode and the embedded switch needs to be programmed. fdb is forwarding database its a bit different then mac filtering in that its telling the switch how to forward mac addresses, in ixgbe and macvlan at least we have been overloading it a bit to also stop filtering the mac address. I think this makes sense if you setup forwarding to a port it doesn't make much sense to then drop them. Maybe its not entirely applicable here just thought I would mention it. Thanks, John -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: FreeBSD-amd64 fails to start with SMP on quemu-kvm
W dniu 09.01.2013 11:15, Gleb Natapov pisze: On Tue, Jan 08, 2013 at 09:45:35PM +0100, Artur Samborski wrote: W dniu 08.01.2013 00:00, Marcelo Tosatti pisze: On Mon, Jan 07, 2013 at 06:13:22PM +0100, Artur Samborski wrote: Hello, When i try to run FreeBSD-amd64 on more than 1 vcpu in quemu-kvm (Fedora Core 17) eg. to run FreeBSD-9.0-RELEASE-amd64 with: qemu-kvm -m 1024m -cpu host -smp 2 -cdrom /storage/iso/FreeBSD-9.0-RELEASE-amd64-dvd1.iso it freezes KVM with: KVM internal error. Suberror: 1 emulation failure RAX=80b0d4c0 RBX=0009f000 RCX=c080 RDX= RSI=d238 RDI= RBP= RSP= R8 = R9 = R10= R11= R12= R13= R14= R15= RIP=0009f076 RFL=00010086 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES = f300 DPL=3 DS16 [-WA] CS =0008 00209900 DPL=0 CS64 [--A] SS =9f00 0009f000 f300 DPL=3 DS16 [-WA] DS =0018 00c09300 DPL=0 DS [-WA] FS = f300 DPL=3 DS16 [-WA] GS = f300 DPL=3 DS16 [-WA] LDT= 8200 DPL=0 LDT TR = 8b00 DPL=0 TSS64-busy GDT= 0009f080 0020 IDT= CR0=8011 CR2= CR3=0009c000 CR4=0030 DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER=0501 Code=00 00 00 80 0f 22 c0 ea 70 f0 09 00 08 00 48 b8 c0 d4 b0 80 ff ff ff ff ff e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 99 20 00 ff ff 00 00 Artur, Can you check whether https://patchwork-mail.kernel.org/patch/1942681/ fixes your problem Hi, thanks for the reply. Unfortunately, the patch does not help. Attempt to start FreeBSD amd64 via quemu-kvm with -smp parameter greater than 1 fails in exactly the same way as before. The patch was applied to the kernel from the 3.6.11-1.fc17.src.rpm package. Do I need some additional patches? Can you try queue branch from kvm.git? Thanks for the advice - I used kernel from kvm.git(queue) and was able to run FreeBSD-amd guest with SMP on my kvm-host server without previous problems. Unfortunately, after few hours server was hung up. Case requires further investigations, but generally speaking we went forward. Unfortunately, experiments on the server are difficult, because it is used for everyday work. -- Artur -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS
On 11.01.2013, at 02:10, Scott Wood wrote: On 01/10/2013 06:35:02 PM, Marcelo Tosatti wrote: On Thu, Jan 10, 2013 at 04:40:12PM -0600, Scott Wood wrote: On 01/10/2013 04:28:01 PM, Marcelo Tosatti wrote: Or just have KVM_SET_PPC_DEVICE_ADDRESS. Is there a downside to that? Besides the above, and my original complaint that it shouldn't be specific to addresses? -Scott I did not really grasp that ('shouldnt be specific to addresses'), but anyway. A device may have other configuration parameters that need to be set, besides addresses. PPC MPIC will require information about the vendor and version, for example. OK, can you write down your proposed improvements to the interface? In case you have something ready, otherwise there is time pressure to merge the ARM port. My original request was just to change the name to something like KVM_SET_DEVICE_CONFIG or KVM_SET_DEVICE_ATTR, and not make the id encoding architecture-specific (preferably, separate into a device id field and an attribute id field rather than using bitfields). Actual values for device id could be architecture-specific (or there could be a global enumeration), and attribute id values would be device-specific. Alex suggested that an ideal interface might accept values larger than 64 bits, though I think it's good enough -- there are currently no proposed uses that need more than 64 bits for a single attribute (unlike ONE_REG), and if it is needed, such configuration could be split up between multiple attributes, or the attribute could specify that value be a userspace pointer to the actual data (as with ONE_REG). Here's a writeup (the ARM details would go under ARM/vGIC-specific documentation): 4.80 KVM_SET_DEVICE_ATTR Capability: KVM_CAP_SET_DEVICE_ATTR Type: vm ioctl Parameters: struct kvm_device_attr (in) Returns: 0 on success, -1 on error Errors: ENODEV: The device id is unknown ENXIO: Device not supported on current system Other errors may be returned by specific devices and attributes. struct kvm_device_attr { __u32 device; This needs some semantic specification. Is device a constant value? Is it the return value of CREATE_IRQCHIP? __u32 attr; __u64 value; }; Specify an attribute of a device emulated or directly exposed by the kernel, which the host kernel needs to know about. The device field is an architecture-specific identifier for a specific device. The attr field is a device-specific identifier for a specific attribute. Individual attributes may have particular requirements for when they can and cannot be set. That is, if you have interest/energy to spend in a possibly reusable interface, as long as that does not delay integration of the ARM code, i don't think the ARM people will mind that. The impression I've been given is that just about any change will delay the integration at this point. If that's the case, and everyone's OK with having an interface that is deprecated on arrival, then fine. To me it's perfectly fine to merge the patches with the arm specific interface and then push an interface like the above in in the next merge window. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH KVM v2 1/4] KVM: fix i8254 IRQ0 to be normally high
On Thu, Jan 10, 2013 at 11:40:07PM -0700, Matthew Ogilvie wrote: On Tue, Jan 08, 2013 at 09:31:19AM +0200, Gleb Natapov wrote: On Mon, Jan 07, 2013 at 06:17:22PM -0600, mmogi...@miniinfo.net wrote: On Mon, 7 Jan 2013 11:39:18 +0200, Gleb Natapov g...@redhat.com wrote: On Wed, Dec 26, 2012 at 10:39:53PM -0700, Matthew Ogilvie wrote: Reading the spec, it is clear that most modes normally leave the IRQ output line high, and only pulse it low to generate a leading edge. Especially the most commonly used mode 2. The KVM i8254 model does not try to emulate the duration of the pulse at all, so just swap the high/low settings it to leave it high most of the time. This fix is a prerequisite to improving the i8259 model to handle the trailing edge of an interupt request as indicated in its spec: If it gets a trailing edge of an IRQ line before it starts to service the interrupt, the request should be canceled. See http://bochs.sourceforge.net/techspec/intel-82c54-timer.pdf.gz or search the net for 23124406.pdf. Risks: There is a risk that migrating a running guest between versions with and without this patch will lose or gain a single timer interrupt during the migration process. The only case where Can you elaborate on how exactly this can happen? Do not see it. KVM 8254: In the corrected model, when the count expires, the model briefly pulses output low and then high again, with the low to high transition being what triggers the interrupt. In the old model, when the count expires, the model expects the output line to already be low, and briefly pulses it high (triggering the interrupt) and then low again. But if the line was already high (because it migrated from the corrected model), this won't generate a new leading edge (low to high) and won't trigger a new interrupt (the first post-back-migration pulse turns into a simple trailing edge instead of a pulse). Unless there is something I'm missing? No, I missed that pic-last_irr/ioapic-irr will be migrated as 1. But this means that next interrupt after migration from new to old will always be lost. What about clearing pit bit from last_irr/irr before migration? Should not affect new-new migration and should fix new-old one. The only problem is that we may need to consult irq routing table to know how pit is connected to ioapic. We should not clear both last_irr and irr. That cancels the interrupt early if the CPU hasn't started servicing it yet: If the guest CPU has interrupts disabled and hasn't begun to service the interrupt, a new-new migration could lose the interrupt much earlier in the countdown cycle than it otherwise might be lost. I am talking about last_irr in pic and irr in ioapic. Of course we shouldn't clear pic-irr on migration. ioapic uses irr to detect edge. Potentially we could clear the last_irr bit only. This effectively makes migration behave like the old unfixed code. But if this is considered acceptable, I would be more inclined to not fix IRQ0 at all, If we do not fix IRQ0 next timer interrupt after migration will always be lost. rather than make IRQ0 work subtly differently normally vs during migration. One of my earlier patch series (qemu v7 dated Nov 25) attempts to use basically this strategy for the qemu model, at least in the short term (and then potentially fix it properly in the longer term), although the details of series might be tweaked. Or the minimal-risk strategy is to ONLY fix the cascade IRQ2's trailing edge [the original i825* problem that spawned this whole thing], leaving all other IRQs as-is. Still do not see how can we gain one interrupt. In most cases I don't think we get an extra interrupt from a direct fix. But some kinds of attempts to work around missing interrupts during migration can cause cause extra interrupts in other cases. In the old qemu model (but perhaps not kvm), the mode 2 leading edge occurs one clock tick earlier than it ought to. So if you try to be tricky with adjusting levels during migration, you may introduce possible cases where it gets an interrupt in the old model, then migrates and gets another interrupt one tick later in the new model... Also, it occurs to me that maybe there might be an off-by-one issue in the kvm model of mode 2 that ought to be fixed as well? Although the zero length pulse in kvm suggests that one-off issues in counters do not matter. In the qemu model, the leading edge of OUT in mode 2 moves by one tick... The qemu 8254 model actually models each edge at independent clock ticks instead of combining both into a very brief pulse at one time. I've found it handy to draw out old and new timing diagrams on paper (for each mode), and then carefully think about what happens with respect to levels and edges when you
Re: [PATCH 1/2] VFIO-AER: Vfio-pci driver changes for supporting AER
On Fri, 2013-01-11 at 08:45 +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Alex Williamson [mailto:alex.william...@redhat.com] Sent: Wednesday, January 09, 2013 11:05 AM To: Pandarathil, Vijaymohan R Cc: Bjorn Helgaas; Gleb Natapov; kvm@vger.kernel.org; qemu- de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH 1/2] VFIO-AER: Vfio-pci driver changes for supporting AER On Wed, 2013-01-09 at 10:52 -0700, Alex Williamson wrote: On Wed, 2013-01-09 at 06:26 +, Pandarathil, Vijaymohan R wrote: - New ioctl which is used to pass the eventfd that is signaled when an error occurs in the vfio_pci_device - Register pci_error_handler for the vfio_pci driver - When the device encounters an error, the error handler registered by the vfio_pci driver gets invoked by the AER infrastructure - In the error handler, signal the eventfd registered for the device. - This results in the qemu eventfd handler getting invoked and appropriate action taken for the guest. Signed-off-by: Vijay Mohan Pandarathil vijaymohan.pandarat...@hp.com --- drivers/vfio/pci/vfio_pci.c | 29 + drivers/vfio/pci/vfio_pci_private.h | 1 + drivers/vfio/vfio.c | 8 include/linux/vfio.h| 1 + include/uapi/linux/vfio.h | 9 + 5 files changed, 48 insertions(+) diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 6c11994..4ae9526 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -207,6 +207,8 @@ static long vfio_pci_ioctl(void *device_data, if (vdev-reset_works) info.flags |= VFIO_DEVICE_FLAGS_RESET; + info.flags |= VFIO_DEVICE_FLAGS_AER_NOTIFY; + This appears to be a PCI specific flag, so the name should include _PCI_. We also support non-PCIe devices and it seems like it would be possible to not have AER support available, so shouldn't this be conditional? Will do that. info.num_regions = VFIO_PCI_NUM_REGIONS; info.num_irqs = VFIO_PCI_NUM_IRQS; @@ -348,6 +350,19 @@ static long vfio_pci_ioctl(void *device_data, return ret; + } else if (cmd == VFIO_DEVICE_SET_ERRFD) { + int32_t fd = (int32_t)arg; + + if (fd 0) + return -EINVAL; + + vdev-err_trigger = eventfd_ctx_fdget(fd); + + if (IS_ERR(vdev-err_trigger)) + return PTR_ERR(vdev-err_trigger); + + return 0; + I'm not sure why we wouldn't describe this as just another interrupt from the device and configure it via SET_IRQ. This ioctl has very limited use and doesn't follow any of the conventions of all the other vfio ioctls. I thought this was a fairly simple ioctl to implement this way. Moreover, there is no device interrupt involved. Let me know if you really want to model it as a SET_IRQ ioctl. I'd like to see something else for sure. We're trying to do AER piecemeal but this ioctl leaves no room for anything else. Seems like a soon to be wasted ioctl number. Note that there isn't even a de-assign interface, nor a flag to set to indicate de-assignment. The SET_IRQ ioctl already handles these in a very similar way to how we'd want to handle it for error signaling. It doesn't seem like that much of a stretch to me to include it there, but I'd also entertain other ideas. Thanks, Alex } else if (cmd == VFIO_DEVICE_RESET) return vdev-reset_works ? pci_reset_function(vdev-pdev) : -EINVAL; @@ -527,11 +542,25 @@ static void vfio_pci_remove(struct pci_dev *pdev) kfree(vdev); } +static pci_ers_result_t vfio_err_detected(struct pci_dev *pdev, + pci_channel_state_t state) +{ + struct vfio_pci_device *vdev = vfio_get_vdev(pdev-dev); + + eventfd_signal(vdev-err_trigger, 1); + return PCI_ERS_RESULT_CAN_RECOVER; +} What if err_trigger hasn't been set? Will fix that. + +static const struct pci_error_handlers vfio_err_handlers = { + .error_detected = vfio_err_detected, +}; + static struct pci_driver vfio_pci_driver = { .name = vfio-pci, .id_table = NULL, /* only dynamic ids */ .probe = vfio_pci_probe, .remove = vfio_pci_remove, + .err_handler= vfio_err_handlers, }; static void __exit
Re: [PATCH 2/2] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
On Fri, 2013-01-11 at 08:53 +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Alex Williamson [mailto:alex.william...@redhat.com] Sent: Wednesday, January 09, 2013 10:08 AM To: Pandarathil, Vijaymohan R Cc: Bjorn Helgaas; Gleb Natapov; kvm@vger.kernel.org; qemu- de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH 2/2] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices On Wed, 2013-01-09 at 06:26 +, Pandarathil, Vijaymohan R wrote: - Create eventfd per vfio device assigned to a guest and register an event handler - This fd is passed to the vfio_pci driver through a new ioctl - When the device encounters an error, the eventfd is signaled and the qemu eventfd handler gets invoked. - In the handler decide what action to take. Current action taken is to terminate the guest. Signed-off-by: Vijay Mohan Pandarathil vijaymohan.pandarat...@hp.com --- hw/vfio_pci.c | 56 ++ linux-headers/linux/vfio.h | 9 2 files changed, 65 insertions(+) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 28c8303..9c3c28b 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -38,6 +38,7 @@ #include qemu/error-report.h #include qemu/queue.h #include qemu/range.h +#include sysemu/sysemu.h /* #define DEBUG_VFIO */ #ifdef DEBUG_VFIO @@ -130,6 +131,8 @@ typedef struct VFIODevice { QLIST_ENTRY(VFIODevice) next; struct VFIOGroup *group; bool reset_works; +EventNotifier errfd; Misleading name, it's an EventNotifier not an fd. Will make the change. +__u32 dev_info_flags; } VFIODevice; typedef struct VFIOGroup { @@ -1805,6 +1808,8 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev) DPRINTF(Device %s flags: %u, regions: %u, irgs: %u\n, name, dev_info.flags, dev_info.num_regions, dev_info.num_irqs); +vdev-dev_info_flags = dev_info.flags; + We test the VFIO_DEVICE_FLAGS_RESET elsewhere by caching the bit into a variable, why not do that here too? I was wondering if there was any specific reason to cache the bit into a separate variable. Wouldn't a flags field which can contain the specific bit be more apt ? Then we have to mask it out ever time we want to reference it. Qemu doesn't seem to like macros or inlines for that, so using a new variable keeps things neat. Caching two fields to bools is still more space efficient than saving the whole thing and we can easily switch to named bits if we get enough to start caring about the space. Thanks, Alex if (!(dev_info.flags VFIO_DEVICE_FLAGS_PCI)) { error_report(vfio: Um, this isn't a PCI device\n); goto error; @@ -1900,6 +1905,55 @@ static void vfio_put_device(VFIODevice *vdev) } } +static void vfio_errfd_handler(void *opaque) +{ +VFIODevice *vdev = opaque; + +if (!event_notifier_test_and_clear(vdev-errfd)) { +return; +} + +/* + * TBD. Retrieve the error details and decide what action + * needs to be taken. One of the actions could be to pass + * the error to the guest and have the guest driver recover + * the error. This requires that PCIe capabilities be + * exposed to the guest. At present, we just terminate the + * guest to contain the error. + */ +error_report(%s(%04x:%02x:%02x.%x) +Unrecoverable error detected... Terminating guest\n, +__func__, vdev-host.domain, vdev-host.bus, vdev-host.slot, +vdev-host.function); + +qemu_system_shutdown_request(); I would have figured hw_error Hw_error() is probably more appropriate. Will make the change. +return; +} + +static void vfio_register_errfd(VFIODevice *vdev) +{ +int32_t pfd; pfd is used elsewhere in vfio as Pointer to FD, this is just a fd. Will change. +int ret; + +if (!(vdev-dev_info_flags VFIO_DEVICE_FLAGS_AER_NOTIFY)) { +error_report(vfio: Warning: Error notification not supported for the device\n); This should probably just be a debug print. Okay. +return; +} +if (event_notifier_init(vdev-errfd, 0)) { +error_report(vfio: Warning: Unable to init event notifier for error detection\n); +return; +} +pfd = event_notifier_get_fd(vdev-errfd); +qemu_set_fd_handler(pfd, vfio_errfd_handler, NULL, vdev); + +ret = ioctl(vdev-fd, VFIO_DEVICE_SET_ERRFD, pfd); +if (ret) { +error_report(vfio: Warning: Failed to setup error fd: %d\n, ret); +qemu_set_fd_handler(pfd, NULL, NULL, vdev); +
Re: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
On Fri, Jan 11, 2013 at 07:36:44AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-01-10: On Thu, Jan 10, 2013 at 12:22:51PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-01-10: On Thu, Jan 10, 2013 at 11:54:21AM +, Zhang, Yang Z wrote: The right logic should be: When register virtualization enabled, read all apic msr(except apic id reg and tmcct which have the hook) not cause vmexit and only write TPR not cause vmexit. When vid enabled, write to TPR, EOI and SELF IPI not cause vmexit. It's better to put the patch of envirtualize x2apic mode as first patch. There is no point whatsoever to enable virtualize x2apic without allowing at least one non intercepted access to x2apic MSR range and this is what your patch is doing when run on cpu without vid capability. No, at least read/write TPR will not cause vmexit if virtualize x2apic mode is enabled. For that you need to disable 808H MSR intercept, which your patch does not do. I want to do this in next patch. Then don't. There is no point in the patch that enables virtualize x2apic and does not allow at least one non-intercepted MSR access. As I said, read/write TPR will not cause vmexit if virtual x2apic is set. From my reading of the spec you need to disable 808H intercept for that. Is this not the case? I am not sure whether I understand your comments right in previous discussion, here is my thinking: 1. enable virtualize x2apic mode if guest is really using x2apic and clear TPR in msr read and write bitmap. This will benefit reading TPR. 2. If APIC registers virtualization is enabled, clear all bit in rang 0x800-0x8ff(except apic id reg and tmcct). Clear all read bits in the range. 3. If vid is enabled, clear EOI and SELF IPI in msr write map. Looks OK. One concern you mentioned is that vid enabled and virtualize x2apic is disabled but guest still use x2apic. In this case, we still use MSR bitmap to intercept x2apic. This means the vEOI update will never happen. But we still can benefit from interrupt window. What interrupt windows? Without virtualized x2apic TPR/EOI virtualization will not happen and we rely on that in the code. Yes, but we can eliminate vmexit of interrupt window. Interrupt still can delivery to guest without vmexit when guest enables interrupt if vid is enabled. See SDM vol 3, 29.2.2. What does it matter that you eliminated vmexit of interrupt window if you broke everything else? The vid patch assumes that apic emulation either entirely happens in a software when vid is disabled or in a CPU if vid is enabled. Mixed mode will not work (apic-isr_count = 1 trick will break things for instance). And it is not worth it to complicate the code to make it work. Yes, you are right. It too complicated. Another question? Why not to hide x2apic capability to guest when vid is supported and virtual x2apic mode is not supported? It should be more reasonable than disable vid when virtual x2apic mode is unavailable. Because I think there will be no such HW. If such HW expected to be common then we can go this route. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation
On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote: +int kvm_psci_call(struct kvm_vcpu *vcpu) +{ + unsigned long psci_fn = *vcpu_reg(vcpu, 0) ~((u32) 0); + unsigned long val; + + switch (psci_fn) { + case KVM_PSCI_FN_CPU_OFF: + kvm_psci_vcpu_off(vcpu); + val = KVM_PSCI_RET_SUCCESS; + break; + case KVM_PSCI_FN_CPU_ON: + val = kvm_psci_vcpu_on(vcpu); + break; + case KVM_PSCI_FN_CPU_SUSPEND: + case KVM_PSCI_FN_MIGRATE: + val = KVM_PSCI_RET_NI; + break; + + default: + return -1; + } + + *vcpu_reg(vcpu, 0) = val; + return 0; +} We were discussing recently on #kernel about kernel APIs and the way that our integer-returning functions pretty much use 0 for success, and -errno for failures, whereas our pointer-returning functions are a mess. And above we have something returning -1 to some other chunk of code outside this compilation unit. That doesn't sound particularly clever to me. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation
On 11/01/13 17:12, Russell King - ARM Linux wrote: On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote: +int kvm_psci_call(struct kvm_vcpu *vcpu) +{ +unsigned long psci_fn = *vcpu_reg(vcpu, 0) ~((u32) 0); +unsigned long val; + +switch (psci_fn) { +case KVM_PSCI_FN_CPU_OFF: +kvm_psci_vcpu_off(vcpu); +val = KVM_PSCI_RET_SUCCESS; +break; +case KVM_PSCI_FN_CPU_ON: +val = kvm_psci_vcpu_on(vcpu); +break; +case KVM_PSCI_FN_CPU_SUSPEND: +case KVM_PSCI_FN_MIGRATE: +val = KVM_PSCI_RET_NI; +break; + +default: +return -1; +} + +*vcpu_reg(vcpu, 0) = val; +return 0; +} We were discussing recently on #kernel about kernel APIs and the way that our integer-returning functions pretty much use 0 for success, and -errno for failures, whereas our pointer-returning functions are a mess. And above we have something returning -1 to some other chunk of code outside this compilation unit. That doesn't sound particularly clever to me. The original code used to return -EINVAL, see: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert the code to its original state though. M. -- Jazz is not dead. It just smells funny... -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation
On Fri, Jan 11, 2013 at 12:24 PM, Marc Zyngier marc.zyng...@arm.com wrote: On 11/01/13 17:12, Russell King - ARM Linux wrote: On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote: +int kvm_psci_call(struct kvm_vcpu *vcpu) +{ +unsigned long psci_fn = *vcpu_reg(vcpu, 0) ~((u32) 0); +unsigned long val; + +switch (psci_fn) { +case KVM_PSCI_FN_CPU_OFF: +kvm_psci_vcpu_off(vcpu); +val = KVM_PSCI_RET_SUCCESS; +break; +case KVM_PSCI_FN_CPU_ON: +val = kvm_psci_vcpu_on(vcpu); +break; +case KVM_PSCI_FN_CPU_SUSPEND: +case KVM_PSCI_FN_MIGRATE: +val = KVM_PSCI_RET_NI; +break; + +default: +return -1; +} + +*vcpu_reg(vcpu, 0) = val; +return 0; +} We were discussing recently on #kernel about kernel APIs and the way that our integer-returning functions pretty much use 0 for success, and -errno for failures, whereas our pointer-returning functions are a mess. And above we have something returning -1 to some other chunk of code outside this compilation unit. That doesn't sound particularly clever to me. The original code used to return -EINVAL, see: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert the code to its original state though. I don't want to return -EINVAL, because for the rest of the KVM code this would mean kill the guest. The convention used in other archs of KVM as well as for ARM is that the handle_exit functions return: -ERRNO: Error, report this error to user space 0: Everything is fine, but return to user space to let it do I/O emulation and whatever it wants to do 1: Everything is fine, return directly to the guest without going to user space And then you do: if (handle_something() == 0) return 1; which I thought was confusing, so I said make the function a bool, to avoid the confusion, like Rusty did for all the coprocessor emulation functions. There are obviously other ways to handle the return 1 case, like having an extra state that you carry around, and we can change all the code to do that, but I just don't think it's worth it, as we are in fact quite close to the existing kernel API. -Christoffer -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation
On Fri, Jan 11, 2013 at 12:33:15PM -0500, Christoffer Dall wrote: On Fri, Jan 11, 2013 at 12:24 PM, Marc Zyngier marc.zyng...@arm.com wrote: On 11/01/13 17:12, Russell King - ARM Linux wrote: On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote: +int kvm_psci_call(struct kvm_vcpu *vcpu) +{ +unsigned long psci_fn = *vcpu_reg(vcpu, 0) ~((u32) 0); +unsigned long val; + +switch (psci_fn) { +case KVM_PSCI_FN_CPU_OFF: +kvm_psci_vcpu_off(vcpu); +val = KVM_PSCI_RET_SUCCESS; +break; +case KVM_PSCI_FN_CPU_ON: +val = kvm_psci_vcpu_on(vcpu); +break; +case KVM_PSCI_FN_CPU_SUSPEND: +case KVM_PSCI_FN_MIGRATE: +val = KVM_PSCI_RET_NI; +break; + +default: +return -1; +} + +*vcpu_reg(vcpu, 0) = val; +return 0; +} We were discussing recently on #kernel about kernel APIs and the way that our integer-returning functions pretty much use 0 for success, and -errno for failures, whereas our pointer-returning functions are a mess. And above we have something returning -1 to some other chunk of code outside this compilation unit. That doesn't sound particularly clever to me. The original code used to return -EINVAL, see: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert the code to its original state though. I don't want to return -EINVAL, because for the rest of the KVM code this would mean kill the guest. The convention used in other archs of KVM as well as for ARM is that the handle_exit functions return: -ERRNO: Error, report this error to user space 0: Everything is fine, but return to user space to let it do I/O emulation and whatever it wants to do 1: Everything is fine, return directly to the guest without going to user space Right, so the above return -1 _is_ doing the thing that I really hate, which is it's actually doing a return -EPERM without anyone realising that's what it's doing. This is precisely why I hate (and pick up on, and have my mail reader to highlight) any return -1. It's mostly always a bug. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation
On 11/01/13 17:33, Christoffer Dall wrote: On Fri, Jan 11, 2013 at 12:24 PM, Marc Zyngier marc.zyng...@arm.com wrote: On 11/01/13 17:12, Russell King - ARM Linux wrote: On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote: +int kvm_psci_call(struct kvm_vcpu *vcpu) +{ +unsigned long psci_fn = *vcpu_reg(vcpu, 0) ~((u32) 0); +unsigned long val; + +switch (psci_fn) { +case KVM_PSCI_FN_CPU_OFF: +kvm_psci_vcpu_off(vcpu); +val = KVM_PSCI_RET_SUCCESS; +break; +case KVM_PSCI_FN_CPU_ON: +val = kvm_psci_vcpu_on(vcpu); +break; +case KVM_PSCI_FN_CPU_SUSPEND: +case KVM_PSCI_FN_MIGRATE: +val = KVM_PSCI_RET_NI; +break; + +default: +return -1; +} + +*vcpu_reg(vcpu, 0) = val; +return 0; +} We were discussing recently on #kernel about kernel APIs and the way that our integer-returning functions pretty much use 0 for success, and -errno for failures, whereas our pointer-returning functions are a mess. And above we have something returning -1 to some other chunk of code outside this compilation unit. That doesn't sound particularly clever to me. The original code used to return -EINVAL, see: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert the code to its original state though. I don't want to return -EINVAL, because for the rest of the KVM code this would mean kill the guest. The convention used in other archs of KVM as well as for ARM is that the handle_exit functions return: -ERRNO: Error, report this error to user space 0: Everything is fine, but return to user space to let it do I/O emulation and whatever it wants to do 1: Everything is fine, return directly to the guest without going to user space That is assuming we propagate the handle_exit convention down to the leaf calls, and I object to that. The 3 possible values only apply to handle_exit, and we should keep that convention as local as possible, because this is the odd case. And then you do: if (handle_something() == 0) return 1; which I thought was confusing, so I said make the function a bool, to avoid the confusion, like Rusty did for all the coprocessor emulation functions. I don't see a compelling reason to propagate this convention to areas that do not require it. In the PSCI case, we have a basic handled/not-handled state, the later indicating the reason. The exit handling functions can convert the error codes to whatever the run loop requires. There are obviously other ways to handle the return 1 case, like having an extra state that you carry around, and we can change all the code to do that, but I just don't think it's worth it, as we are in fact quite close to the existing kernel API. -- Jazz is not dead. It just smells funny... -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation
On Fri, Jan 11, 2013 at 12:43 PM, Marc Zyngier marc.zyng...@arm.com wrote: On 11/01/13 17:33, Christoffer Dall wrote: On Fri, Jan 11, 2013 at 12:24 PM, Marc Zyngier marc.zyng...@arm.com wrote: On 11/01/13 17:12, Russell King - ARM Linux wrote: On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote: +int kvm_psci_call(struct kvm_vcpu *vcpu) +{ +unsigned long psci_fn = *vcpu_reg(vcpu, 0) ~((u32) 0); +unsigned long val; + +switch (psci_fn) { +case KVM_PSCI_FN_CPU_OFF: +kvm_psci_vcpu_off(vcpu); +val = KVM_PSCI_RET_SUCCESS; +break; +case KVM_PSCI_FN_CPU_ON: +val = kvm_psci_vcpu_on(vcpu); +break; +case KVM_PSCI_FN_CPU_SUSPEND: +case KVM_PSCI_FN_MIGRATE: +val = KVM_PSCI_RET_NI; +break; + +default: +return -1; +} + +*vcpu_reg(vcpu, 0) = val; +return 0; +} We were discussing recently on #kernel about kernel APIs and the way that our integer-returning functions pretty much use 0 for success, and -errno for failures, whereas our pointer-returning functions are a mess. And above we have something returning -1 to some other chunk of code outside this compilation unit. That doesn't sound particularly clever to me. The original code used to return -EINVAL, see: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert the code to its original state though. I don't want to return -EINVAL, because for the rest of the KVM code this would mean kill the guest. The convention used in other archs of KVM as well as for ARM is that the handle_exit functions return: -ERRNO: Error, report this error to user space 0: Everything is fine, but return to user space to let it do I/O emulation and whatever it wants to do 1: Everything is fine, return directly to the guest without going to user space That is assuming we propagate the handle_exit convention down to the leaf calls, and I object to that. The 3 possible values only apply to handle_exit, and we should keep that convention as local as possible, because this is the odd case. I don't agree - it requires you to carefully follow a potentially deep call trace to make sense of what it does, and worse, it's directly misleading in the case of KVM/ARM. So either change it everywhere and have a consistent calling convention or adhere to what we do elsewhere and use the bool. And then you do: if (handle_something() == 0) return 1; which I thought was confusing, so I said make the function a bool, to avoid the confusion, like Rusty did for all the coprocessor emulation functions. I don't see a compelling reason to propagate this convention to areas that do not require it. In the PSCI case, we have a basic handled/not-handled state, the later indicating the reason. The exit handling functions can convert the error codes to whatever the run loop requires. again, that's why I suggest returning a bool instead. You just said it: it's a basic handled/not-handled state. Why do you want to return -EINVAL if that's not propogated anywhere? If the return codes are local and do not map reasonably to error codes and more complicated than a bool, I would vote for a #define HVC_HANDLE_SUCCESS, HVC_HANDLE_ERROR, HVC_HANDLE_XX, ... There are obviously other ways to handle the return 1 case, like having an extra state that you carry around, and we can change all the code to do that, but I just don't think it's worth it, as we are in fact quite close to the existing kernel API. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation
On Fri, Jan 11, 2013 at 12:48:45PM -0500, Christoffer Dall wrote: again, that's why I suggest returning a bool instead. You just said it: it's a basic handled/not-handled state. Why do you want to return -EINVAL if that's not propogated anywhere? We have a well established principle throughout the kernel interfaces that functions will return positive values for success and an appropriate -ve errno for failure. We *certainly* hate functions which return 0 for failure and non-zero for success - it makes review a real pain because you start seeing code doing this: if (!function()) { deal_with_failure(); } and you have to then start looking at the function to properly understand what it's return semantics are. We have more than enough proof already that this doesn't work: people don't care to understand what the return values from functions mean. You only have to do an audit of a few of the IS_ERR_OR_NULL() uses to find that out, and you quickly realise that people just use what they _think_ is the right test and which happens to pass their very simple testing at the time. We've avoided major problems so far in the kernel by having most of the integer-returning functions following the established principle, and that's good. I really really think that there must be a _very_ good reason, and overwhelming reason to deviate from the established principle in any large project. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support
Move inside irq_lock protection. void kvm_unregister_irq_ack_notifier(struct kvm *kvm, @@ -270,6 +291,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm, hlist_del_init_rcu(kian-link); mutex_unlock(kvm-irq_lock); synchronize_rcu(); + ioapic_update_eoi_exitmap(kvm); Move both synchronize_rcu and ioapic_update_eoi_exitmap inside irq_lock protection. Why? (here and one above). If vcpu uses stale data during update it will find recalculate request during guest entry and will recalculate again. -- Gleb. Indeed, its not necessary and also not necessary to grab irq_lock during vcpu entry, while processing the KVM_REQ_ bit. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation
On Fri, Jan 11, 2013 at 12:57 PM, Russell King - ARM Linux li...@arm.linux.org.uk wrote: On Fri, Jan 11, 2013 at 12:48:45PM -0500, Christoffer Dall wrote: again, that's why I suggest returning a bool instead. You just said it: it's a basic handled/not-handled state. Why do you want to return -EINVAL if that's not propogated anywhere? We have a well established principle throughout the kernel interfaces that functions will return positive values for success and an appropriate -ve errno for failure. We *certainly* hate functions which return 0 for failure and non-zero for success - it makes review a real pain because you start seeing code doing this: if (!function()) { deal_with_failure(); } and you'd certainly not find that in any of the KVM/ARM - that would be despicable. and you have to then start looking at the function to properly understand what it's return semantics are. We have more than enough proof already that this doesn't work: people don't care to understand what the return values from functions mean. You only have to do an audit of a few of the IS_ERR_OR_NULL() uses to find that out, and you quickly realise that people just use what they _think_ is the right test and which happens to pass their very simple testing at the time. We've avoided major problems so far in the kernel by having most of the integer-returning functions following the established principle, and that's good. I really really think that there must be a _very_ good reason, and overwhelming reason to deviate from the established principle in any large project. The _very_ good reason here, is that we have two success cases: return to guest and return to user space. As I said, we can save this state in another bit somewhere and change all the KVM/ARM code to do so, but the KVM guys back then would like to use the same convention as other KVM archs. I would prefer not having to change all that KVM/ARM code at this point, but of course, if you insists, I will have to do that. BUT, returning -EINVAL should indicate an actual error. This is not the case for the concrete psci example, the case is simply that the number in r0 does not fall within the psci range, and therefore could potentially be handled by something else, and using -EINVAL as a fall-through to the next value is equally weird and misleading. An alternative is to do something like foo(..., handled), but I think using a bool as the function type is perfect here: what is the problem with that again? Certainly it wouldn't be the only boolean function in the kernel. -Christoffer -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation
On 11/01/13 17:48, Christoffer Dall wrote: On Fri, Jan 11, 2013 at 12:43 PM, Marc Zyngier marc.zyng...@arm.com wrote: On 11/01/13 17:33, Christoffer Dall wrote: On Fri, Jan 11, 2013 at 12:24 PM, Marc Zyngier marc.zyng...@arm.com wrote: On 11/01/13 17:12, Russell King - ARM Linux wrote: On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote: +int kvm_psci_call(struct kvm_vcpu *vcpu) +{ +unsigned long psci_fn = *vcpu_reg(vcpu, 0) ~((u32) 0); +unsigned long val; + +switch (psci_fn) { +case KVM_PSCI_FN_CPU_OFF: +kvm_psci_vcpu_off(vcpu); +val = KVM_PSCI_RET_SUCCESS; +break; +case KVM_PSCI_FN_CPU_ON: +val = kvm_psci_vcpu_on(vcpu); +break; +case KVM_PSCI_FN_CPU_SUSPEND: +case KVM_PSCI_FN_MIGRATE: +val = KVM_PSCI_RET_NI; +break; + +default: +return -1; +} + +*vcpu_reg(vcpu, 0) = val; +return 0; +} We were discussing recently on #kernel about kernel APIs and the way that our integer-returning functions pretty much use 0 for success, and -errno for failures, whereas our pointer-returning functions are a mess. And above we have something returning -1 to some other chunk of code outside this compilation unit. That doesn't sound particularly clever to me. The original code used to return -EINVAL, see: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert the code to its original state though. I don't want to return -EINVAL, because for the rest of the KVM code this would mean kill the guest. The convention used in other archs of KVM as well as for ARM is that the handle_exit functions return: -ERRNO: Error, report this error to user space 0: Everything is fine, but return to user space to let it do I/O emulation and whatever it wants to do 1: Everything is fine, return directly to the guest without going to user space That is assuming we propagate the handle_exit convention down to the leaf calls, and I object to that. The 3 possible values only apply to handle_exit, and we should keep that convention as local as possible, because this is the odd case. I don't agree - it requires you to carefully follow a potentially deep call trace to make sense of what it does, and worse, it's directly misleading in the case of KVM/ARM. So either change it everywhere and have a consistent calling convention or adhere to what we do elsewhere and use the bool. Sorry, but I do not buy this. In this particular case, the meaning of the value returned has nothing to do with the handle_exit convention. We never return to user space, let alone signaling an error. Trying to force all the code in this convention makes it actually harder to review, because this is the exception in the kernel. This is why I suggest keeping the handle_exit return convention at this exact point, and not impose it on anything else. And then you do: if (handle_something() == 0) return 1; which I thought was confusing, so I said make the function a bool, to avoid the confusion, like Rusty did for all the coprocessor emulation functions. I don't see a compelling reason to propagate this convention to areas that do not require it. In the PSCI case, we have a basic handled/not-handled state, the later indicating the reason. The exit handling functions can convert the error codes to whatever the run loop requires. again, that's why I suggest returning a bool instead. You just said it: it's a basic handled/not-handled state. Why do you want to return -EINVAL if that's not propogated anywhere? If the return codes are local and do not map reasonably to error codes and more complicated than a bool, I would vote for a #define HVC_HANDLE_SUCCESS, HVC_HANDLE_ERROR, HVC_HANDLE_XX, ... There are obviously other ways to handle the return 1 case, like having an extra state that you carry around, and we can change all the code to do that, but I just don't think it's worth it, as we are in fact quite close to the existing kernel API. -- Jazz is not dead. It just smells funny... -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation
On Fri, Jan 11, 2013 at 01:07:31PM -0500, Christoffer Dall wrote: The _very_ good reason here, is that we have two success cases: return to guest and return to user space. As I said, we can save this state in another bit somewhere and change all the KVM/ARM code to do so, but the KVM guys back then would like to use the same convention as other KVM archs. Can you please credit me for not objecting to returning 0/1 to have different success meanings. What I'm merely objecting to is that return -1 statement in the code (notice the negative sign.) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation
On Fri, Jan 11, 2013 at 1:14 PM, Russell King - ARM Linux li...@arm.linux.org.uk wrote: On Fri, Jan 11, 2013 at 01:07:31PM -0500, Christoffer Dall wrote: The _very_ good reason here, is that we have two success cases: return to guest and return to user space. As I said, we can save this state in another bit somewhere and change all the KVM/ARM code to do so, but the KVM guys back then would like to use the same convention as other KVM archs. Can you please credit me for not objecting to returning 0/1 to have different success meanings. What I'm merely objecting to is that return -1 statement in the code (notice the negative sign.) Sorry if I misunderstood you. Yes, the return -1 has to be changed. -Christoffer -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 2/2] ARM: KVM: Power State Coordination Interface implementation
On Fri, Jan 11, 2013 at 1:09 PM, Marc Zyngier marc.zyng...@arm.com wrote: On 11/01/13 17:48, Christoffer Dall wrote: On Fri, Jan 11, 2013 at 12:43 PM, Marc Zyngier marc.zyng...@arm.com wrote: On 11/01/13 17:33, Christoffer Dall wrote: On Fri, Jan 11, 2013 at 12:24 PM, Marc Zyngier marc.zyng...@arm.com wrote: On 11/01/13 17:12, Russell King - ARM Linux wrote: On Thu, Jan 10, 2013 at 04:06:45PM +, Marc Zyngier wrote: +int kvm_psci_call(struct kvm_vcpu *vcpu) +{ +unsigned long psci_fn = *vcpu_reg(vcpu, 0) ~((u32) 0); +unsigned long val; + +switch (psci_fn) { +case KVM_PSCI_FN_CPU_OFF: +kvm_psci_vcpu_off(vcpu); +val = KVM_PSCI_RET_SUCCESS; +break; +case KVM_PSCI_FN_CPU_ON: +val = kvm_psci_vcpu_on(vcpu); +break; +case KVM_PSCI_FN_CPU_SUSPEND: +case KVM_PSCI_FN_MIGRATE: +val = KVM_PSCI_RET_NI; +break; + +default: +return -1; +} + +*vcpu_reg(vcpu, 0) = val; +return 0; +} We were discussing recently on #kernel about kernel APIs and the way that our integer-returning functions pretty much use 0 for success, and -errno for failures, whereas our pointer-returning functions are a mess. And above we have something returning -1 to some other chunk of code outside this compilation unit. That doesn't sound particularly clever to me. The original code used to return -EINVAL, see: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-January/004509.html Christoffer (Cc-ed) didn't like this, hence the -1. I'm happy to revert the code to its original state though. I don't want to return -EINVAL, because for the rest of the KVM code this would mean kill the guest. The convention used in other archs of KVM as well as for ARM is that the handle_exit functions return: -ERRNO: Error, report this error to user space 0: Everything is fine, but return to user space to let it do I/O emulation and whatever it wants to do 1: Everything is fine, return directly to the guest without going to user space That is assuming we propagate the handle_exit convention down to the leaf calls, and I object to that. The 3 possible values only apply to handle_exit, and we should keep that convention as local as possible, because this is the odd case. I don't agree - it requires you to carefully follow a potentially deep call trace to make sense of what it does, and worse, it's directly misleading in the case of KVM/ARM. So either change it everywhere and have a consistent calling convention or adhere to what we do elsewhere and use the bool. Sorry, but I do not buy this. In this particular case, the meaning of the value returned has nothing to do with the handle_exit convention. We never return to user space, let alone signaling an error. But that is by no means obvious. Trying to force all the code in this convention makes it actually harder to review, because this is the exception in the kernel. This is why I suggest keeping the handle_exit return convention at this exact point, and not impose it on anything else. Ok, then come up with a nicer convention, but using -EINVAL to mean I did not handle this one, maybe someone else will handle it, and it's not an error is definitely not easy to read. What then if another function called from the same caller could actually return an error, would you have a switch case on the return value checking for EINVAL vs. EFAULT? I repeat: what's the problem with a bool for this particular case, which would make it look like all the other sub-calls of handle exit functions - could we be consistent here? -Christoffer -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] VFIO-AER: Vfio-pci driver changes for supporting AER
On Fri, Jan 11, 2013 at 08:46:35AM -0700, Alex Williamson wrote: On Fri, 2013-01-11 at 08:45 +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Alex Williamson [mailto:alex.william...@redhat.com] Sent: Wednesday, January 09, 2013 11:05 AM To: Pandarathil, Vijaymohan R Cc: Bjorn Helgaas; Gleb Natapov; kvm@vger.kernel.org; qemu- de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH 1/2] VFIO-AER: Vfio-pci driver changes for supporting AER On Wed, 2013-01-09 at 10:52 -0700, Alex Williamson wrote: On Wed, 2013-01-09 at 06:26 +, Pandarathil, Vijaymohan R wrote: - New ioctl which is used to pass the eventfd that is signaled when an error occurs in the vfio_pci_device - Register pci_error_handler for the vfio_pci driver - When the device encounters an error, the error handler registered by the vfio_pci driver gets invoked by the AER infrastructure - In the error handler, signal the eventfd registered for the device. - This results in the qemu eventfd handler getting invoked and appropriate action taken for the guest. Signed-off-by: Vijay Mohan Pandarathil vijaymohan.pandarat...@hp.com --- drivers/vfio/pci/vfio_pci.c | 29 + drivers/vfio/pci/vfio_pci_private.h | 1 + drivers/vfio/vfio.c | 8 include/linux/vfio.h| 1 + include/uapi/linux/vfio.h | 9 + 5 files changed, 48 insertions(+) diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 6c11994..4ae9526 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -207,6 +207,8 @@ static long vfio_pci_ioctl(void *device_data, if (vdev-reset_works) info.flags |= VFIO_DEVICE_FLAGS_RESET; + info.flags |= VFIO_DEVICE_FLAGS_AER_NOTIFY; + This appears to be a PCI specific flag, so the name should include _PCI_. We also support non-PCIe devices and it seems like it would be possible to not have AER support available, so shouldn't this be conditional? Will do that. info.num_regions = VFIO_PCI_NUM_REGIONS; info.num_irqs = VFIO_PCI_NUM_IRQS; @@ -348,6 +350,19 @@ static long vfio_pci_ioctl(void *device_data, return ret; + } else if (cmd == VFIO_DEVICE_SET_ERRFD) { + int32_t fd = (int32_t)arg; + + if (fd 0) + return -EINVAL; + + vdev-err_trigger = eventfd_ctx_fdget(fd); + + if (IS_ERR(vdev-err_trigger)) + return PTR_ERR(vdev-err_trigger); + + return 0; + I'm not sure why we wouldn't describe this as just another interrupt from the device and configure it via SET_IRQ. This ioctl has very limited use and doesn't follow any of the conventions of all the other vfio ioctls. I thought this was a fairly simple ioctl to implement this way. Moreover, there is no device interrupt involved. Let me know if you really want to model it as a SET_IRQ ioctl. I'd like to see something else for sure. We're trying to do AER piecemeal but this ioctl leaves no room for anything else. Seems like a soon to be wasted ioctl number. Note that there isn't even a de-assign interface, nor a flag to set to indicate de-assignment. The SET_IRQ ioctl already handles these in a very similar way to how we'd want to handle it for error signaling. It doesn't seem like that much of a stretch to me to include it there, but I'd also entertain other ideas. Thanks, Alex Yes it would be good to have a picture of the dependencies necessary to implement passthrough, before introducing an incompatible interface. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS
On Fri, Jan 11, 2013 at 02:26:51AM -0500, Christoffer Dall wrote: On 10/01/2013, at 20.10, Scott Wood scottw...@freescale.com wrote: On 01/10/2013 06:35:02 PM, Marcelo Tosatti wrote: On Thu, Jan 10, 2013 at 04:40:12PM -0600, Scott Wood wrote: On 01/10/2013 04:28:01 PM, Marcelo Tosatti wrote: Or just have KVM_SET_PPC_DEVICE_ADDRESS. Is there a downside to that? Besides the above, and my original complaint that it shouldn't be specific to addresses? -Scott I did not really grasp that ('shouldnt be specific to addresses'), but anyway. A device may have other configuration parameters that need to be set, besides addresses. PPC MPIC will require information about the vendor and version, for example. OK, can you write down your proposed improvements to the interface? In case you have something ready, otherwise there is time pressure to merge the ARM port. My original request was just to change the name to something like KVM_SET_DEVICE_CONFIG or KVM_SET_DEVICE_ATTR, and not make the id encoding architecture-specific (preferably, separate into a device id field and an attribute id field rather than using bitfields). Actual values for device id could be architecture-specific (or there could be a global enumeration), and attribute id values would be device-specific. Alex suggested that an ideal interface might accept values larger than 64 bits, though I think it's good enough -- there are currently no proposed uses that need more than 64 bits for a single attribute (unlike ONE_REG), and if it is needed, such configuration could be split up between multiple attributes, or the attribute could specify that value be a userspace pointer to the actual data (as with ONE_REG). Here's a writeup (the ARM details would go under ARM/vGIC-specific documentation): 4.80 KVM_SET_DEVICE_ATTR Capability: KVM_CAP_SET_DEVICE_ATTR Type: vm ioctl Parameters: struct kvm_device_attr (in) Returns: 0 on success, -1 on error Errors: ENODEV: The device id is unknown ENXIO: Device not supported on current system Other errors may be returned by specific devices and attributes. struct kvm_device_attr { __u32 device; __u32 attr; __u64 value; }; Specify an attribute of a device emulated or directly exposed by the kernel, which the host kernel needs to know about. The device field is an architecture-specific identifier for a specific device. The attr field is a device-specific identifier for a specific attribute. Individual attributes may have particular requirements for when they can and cannot be set. That is, if you have interest/energy to spend in a possibly reusable interface, as long as that does not delay integration of the ARM code, i don't think the ARM people will mind that. The impression I've been given is that just about any change will delay the integration at this point. If that's the case, and everyone's OK with having an interface that is deprecated on arrival, then fine. That is not entirely the case, but there wasn't event agreement on this revised API, and we didn't want to wait for weeks until a decision was made. Alex suggested we use a DEV_REG API similar to the ONE_REG API, but I am quite strongly against having such an API for this specific purpose as it is too semantically distant to what we do on ARM. (Having a DEV_REG API for other purposes might be fine though). So I have no problem with your suggestion, although we could consider having a size and addr field in the struct instead and be slightly more extensible. But I don't feel strongly about it. If we can agree on Scott's suggestion or with my modification (Alex, Gleb, Marcelo ?) then I'll change the KVM/ARM patches to use this and resend them right away. But we have to agree now! If not, I really think we should keep things as they are now, as we're really discussing idealistic scenarios here, and the ARM patches have been out of tree for long enough. As Marcelo pointed out, the benefits of the perfect API are really minimal. -Christoffer-- Can you make KVM_SET_ARM_DEVICE address extensible? Add some reserved space and a flags field. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS
On 11.01.2013, at 19:39, Marcelo Tosatti wrote: On Fri, Jan 11, 2013 at 02:26:51AM -0500, Christoffer Dall wrote: On 10/01/2013, at 20.10, Scott Wood scottw...@freescale.com wrote: On 01/10/2013 06:35:02 PM, Marcelo Tosatti wrote: On Thu, Jan 10, 2013 at 04:40:12PM -0600, Scott Wood wrote: On 01/10/2013 04:28:01 PM, Marcelo Tosatti wrote: Or just have KVM_SET_PPC_DEVICE_ADDRESS. Is there a downside to that? Besides the above, and my original complaint that it shouldn't be specific to addresses? -Scott I did not really grasp that ('shouldnt be specific to addresses'), but anyway. A device may have other configuration parameters that need to be set, besides addresses. PPC MPIC will require information about the vendor and version, for example. OK, can you write down your proposed improvements to the interface? In case you have something ready, otherwise there is time pressure to merge the ARM port. My original request was just to change the name to something like KVM_SET_DEVICE_CONFIG or KVM_SET_DEVICE_ATTR, and not make the id encoding architecture-specific (preferably, separate into a device id field and an attribute id field rather than using bitfields). Actual values for device id could be architecture-specific (or there could be a global enumeration), and attribute id values would be device-specific. Alex suggested that an ideal interface might accept values larger than 64 bits, though I think it's good enough -- there are currently no proposed uses that need more than 64 bits for a single attribute (unlike ONE_REG), and if it is needed, such configuration could be split up between multiple attributes, or the attribute could specify that value be a userspace pointer to the actual data (as with ONE_REG). Here's a writeup (the ARM details would go under ARM/vGIC-specific documentation): 4.80 KVM_SET_DEVICE_ATTR Capability: KVM_CAP_SET_DEVICE_ATTR Type: vm ioctl Parameters: struct kvm_device_attr (in) Returns: 0 on success, -1 on error Errors: ENODEV: The device id is unknown ENXIO: Device not supported on current system Other errors may be returned by specific devices and attributes. struct kvm_device_attr { __u32 device; __u32 attr; __u64 value; }; Specify an attribute of a device emulated or directly exposed by the kernel, which the host kernel needs to know about. The device field is an architecture-specific identifier for a specific device. The attr field is a device-specific identifier for a specific attribute. Individual attributes may have particular requirements for when they can and cannot be set. That is, if you have interest/energy to spend in a possibly reusable interface, as long as that does not delay integration of the ARM code, i don't think the ARM people will mind that. The impression I've been given is that just about any change will delay the integration at this point. If that's the case, and everyone's OK with having an interface that is deprecated on arrival, then fine. That is not entirely the case, but there wasn't event agreement on this revised API, and we didn't want to wait for weeks until a decision was made. Alex suggested we use a DEV_REG API similar to the ONE_REG API, but I am quite strongly against having such an API for this specific purpose as it is too semantically distant to what we do on ARM. (Having a DEV_REG API for other purposes might be fine though). So I have no problem with your suggestion, although we could consider having a size and addr field in the struct instead and be slightly more extensible. But I don't feel strongly about it. If we can agree on Scott's suggestion or with my modification (Alex, Gleb, Marcelo ?) then I'll change the KVM/ARM patches to use this and resend them right away. But we have to agree now! If not, I really think we should keep things as they are now, as we're really discussing idealistic scenarios here, and the ARM patches have been out of tree for long enough. As Marcelo pointed out, the benefits of the perfect API are really minimal. -Christoffer-- Can you make KVM_SET_ARM_DEVICE address extensible? Add some reserved space and a flags field. Can't we do this for the new ioctl that we all would agree on rather than the interim one that's only a short term solution for a greater problem? Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS
On 10.01.2013, at 23:28, Marcelo Tosatti wrote: On Wed, Jan 09, 2013 at 10:37:20PM +0100, Alexander Graf wrote: We can start to introduce (and fix ARM) with a generic ioctl in the MPIC patches then. The ioctl is already generic, except for its name. It's making a few wrong assumptions: * maximum size of value is u64 This is tolerable IMHO. * combining device id (variable) with addr type id (const) into a single field. It could just be split into multiple fields I agree, but that could be lived with as well. I get that there's a tradeoff between getting something in now, versus waiting until the API is more refined. Tagging it with a particular ISA seems like an odd way of saying soon to be deprecated, though. What happens if we're still squabbling over the perfect replacement API when we're trying to push PPC MPIC stuff in? As mentioned, i fail to see the benefit in sharing 0.0x% of the code (the interface), while the remaining code is not shared. Important code sharing happens in user space. Then we're the ones who have to come up with a good interface. Or just have KVM_SET_PPC_DEVICE_ADDRESS. Is there a downside to that? Yes. Harder to read code and a lack of commonality. Also it wouldn't cut it for us, as we need to set more than just addresses. But most of all I really hate to take a generic problem and solve it only for myself. Btw, x86 has the same issue too, it just only somehow got away with not caring too much about it. The LAPIC has a version register that Mac OS X guests read. If the version is too old, it doesn't boot. So some time ago we just changed the version register. The real, backwards compatible change however would've been to have user space tell us which version to emulate and set it accordingly. The way we did it basically broke backwards compatibility. We only got away with it because the register is rarely read. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS
Can you make KVM_SET_ARM_DEVICE address extensible? Add some reserved space and a flags field. Can't we do this for the new ioctl that we all would agree on rather than the interim one that's only a short term solution for a greater problem? Alex OK. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS
On Fri, Jan 11, 2013 at 2:18 PM, Marcelo Tosatti mtosa...@redhat.com wrote: Can you make KVM_SET_ARM_DEVICE address extensible? Add some reserved space and a flags field. Can't we do this for the new ioctl that we all would agree on rather than the interim one that's only a short term solution for a greater problem? Alex OK. Good. It's decided - the KVM/ARM patches remain in their current state with an ARM specific ioctl, and will transition to something generic post merge. -Christoffer -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 6/5] KVM: x86: clear write_fault_to_shadow_pgtable explicitly
On Fri, Jan 11, 2013 at 10:12:58PM +0800, Xiao Guangrong wrote: On 01/11/2013 09:15 PM, Marcelo Tosatti wrote: This is cryptic. Its not obvious at all for someone modifying the code, for example. Can you please clear it explicitly? Sure, this is the patch to apply your idea, is it good to you? :) Subject: [PATCH 6/6] KVM: x86: clear write_fault_to_shadow_pgtable explicitly Clear it explicitly when exiting x86_emulate_instruction to clarify the code, it is suggested by Marcelo Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/x86.c | 16 1 files changed, 12 insertions(+), 4 deletions(-) Fine, please rebase against queue. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS
On 01/11/2013 09:42:55 AM, Alexander Graf wrote: On 11.01.2013, at 02:10, Scott Wood wrote: struct kvm_device_attr { __u32 device; This needs some semantic specification. Is device a constant value? Is it the return value of CREATE_IRQCHIP? As proposed, it's up to the architecture to provide that specification. In theory this could be used for things other than IRQ chips. We could still say that device creation functions return a valid device ID (if the device has any attributes), as well as have other architecture-specific ways of describing device IDs (static enumeration). Or we could have non-architecture-specific static enumeration. Or just require that all devices be explicitly created by something that returns the ID. Do you have a preferred approach? -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v5.1 0/2] KVM: ARM: Rename KVM_SET_DEVICE_ADDRESS
Am 11.01.2013 um 21:11 schrieb Scott Wood scottw...@freescale.com: On 01/11/2013 09:42:55 AM, Alexander Graf wrote: On 11.01.2013, at 02:10, Scott Wood wrote: struct kvm_device_attr { __u32 device; This needs some semantic specification. Is device a constant value? Is it the return value of CREATE_IRQCHIP? As proposed, it's up to the architecture to provide that specification. In theory this could be used for things other than IRQ chips. We could still say that device creation functions return a valid device ID (if the device has any attributes), as well as have other architecture-specific ways of describing device IDs (static enumeration). Or we could have non-architecture-specific static enumeration. Or just require that all devices be explicitly created by something that returns the ID. Do you have a preferred approach? I am a fan of dynamically generated ids that get returned from the create functions (like open() and an fd). I'm not sure if that would always work here though. It would mean that we explicitly have to create per-cpu interrupt controllers. Or append their state onto vcpu state. Alex -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html