Re: [PATCH] vfio-pci: Add KVM INTx acceleration
On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote: This makes use of the new level irqfd support enabling bypass of qemu userspace both on INTx injection and unmask. This significantly boosts the performance of devices making use of legacy interrupts. Signed-off-by: Alex Williamson alex.william...@redhat.com --- My INTx routing workaround below will probably raise some eyebrows, but I don't feel it's worth subjecting users to core dumps if they want to try vfio-pci on new platforms. INTx routing is part of some larger plan, but until that plan materializes we have to try to avoid the API unless we think there's a good chance it might be there. I'll accept the maintenance of updating a whitelist in the interim. Thanks, Alex hw/vfio_pci.c | 224 + 1 file changed, 224 insertions(+) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 639371e..777a5f8 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len); static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled); /* + * PCI code refuses to make it possible to probe whether the chipset + * supports pci_device_route_intx_to_irq() and booby traps the call + * to assert if doesn't. For us, this is just an optimization, so + * only enable it when we know it's present. Unfortunately PCIBus is + * private, so we can't just look at the function pointer. + */ +static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev) +{ +#ifdef CONFIG_KVM +BusState *bus = qdev_get_parent_bus(pdev-qdev); +DeviceState *dev; + +if (!kvm_irqchip_in_kernel() || +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) { + return false; +} Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE? Also for KVM_IRQFD_FLAG_RESAMPLE. + +for (; bus-parent; bus = qdev_get_parent_bus(dev)) { + +dev = bus-parent; + +if (!strncmp(i440FX-pcihost, object_get_typename(OBJECT(dev)), 14)) { +return true; +} +} + +error_report(vfio-pci: VM chipset does not support INTx routing, + using slow INTx mode\n); When does this code trigger? It seems irqchip implies piix ATM - is this just dead code? +#endif +return false; +} + +static PCIINTxRoute vfio_pci_device_route_intx_to_irq(PCIDevice *pdev, int pin) +{ +if (!vfio_pci_bus_has_intx_route(pdev)) { +return (PCIINTxRoute) { .mode = PCI_INTX_DISABLED, .irq = -1 }; +} + +return pci_device_route_intx_to_irq(pdev, pin); +} + +static bool vfio_pci_intx_route_changed(PCIINTxRoute *old, PCIINTxRoute *new) +{ +return old-mode != new-mode || old-irq != new-irq; +} + Didn't you add an API for this? It's on pci branch but I can drop it if not needed. +/* * Common VFIO interrupt disable */ static void vfio_disable_irqindex(VFIODevice *vdev, int index) @@ -185,6 +232,21 @@ static void vfio_unmask_intx(VFIODevice *vdev) ioctl(vdev-fd, VFIO_DEVICE_SET_IRQS, irq_set); } +#ifdef CONFIG_KVM +static void vfio_mask_intx(VFIODevice *vdev) +{ +struct vfio_irq_set irq_set = { +.argsz = sizeof(irq_set), +.flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK, +.index = VFIO_PCI_INTX_IRQ_INDEX, +.start = 0, +.count = 1, +}; + +ioctl(vdev-fd, VFIO_DEVICE_SET_IRQS, irq_set); +} +#endif + /* * Disabling BAR mmaping can be slow, but toggling it around INTx can * also be a huge overhead. We try to get the best of both worlds by @@ -248,6 +310,161 @@ static void vfio_eoi(VFIODevice *vdev) vfio_unmask_intx(vdev); } +static void vfio_enable_intx_kvm(VFIODevice *vdev) +{ +#ifdef CONFIG_KVM +struct kvm_irqfd irqfd = { +.fd = event_notifier_get_fd(vdev-intx.interrupt), +.gsi = vdev-intx.route.irq, +.flags = KVM_IRQFD_FLAG_RESAMPLE, Should not kvm ioctl handling be localized in kvm-all.c? E.g. extend kvm_irqchip_add_irqfd_notifier in some way? Same question for KVM_CAP_IRQFD_RESAMPLE use above ... +}; +struct vfio_irq_set *irq_set; +int ret, argsz; +int32_t *pfd; + +if (!kvm_irqchip_in_kernel() || +vdev-intx.route.mode != PCI_INTX_ENABLED || +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) { +return; +} + +/* Get to a known interrupt state */ +qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev); +vfio_mask_intx(vdev); +vdev-intx.pending = false; +qemu_set_irq(vdev-pdev.irq[vdev-intx.pin], 0); + +/* Get an eventfd for resample/unmask */ +if (event_notifier_init(vdev-intx.unmask, 0)) { +error_report(vfio: Error: event_notifier_init failed eoi\n); +goto fail; +} + +/* KVM triggers it, VFIO listens for it */ +irqfd.resamplefd =
[Patch]KVM: enabling per domain PLE
Setting the same PLE parameter arbitrarily for different workloads is not a good solution. The solution enables per domain PLE which gives user ability to set PLE parameter for different domain for better performance. Signed-off-by: Xuekun Hu xuekun...@intel.com --- arch/x86/include/asm/kvm_host.h |6 ++ arch/x86/kvm/svm.c | 21 + arch/x86/kvm/vmx.c | 32 ++-- arch/x86/kvm/x86.c | 15 ++- include/linux/kvm.h |3 +++ 5 files changed, 74 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b2e11f4..b54f38b 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -569,6 +569,8 @@ struct kvm_arch { #ifdef CONFIG_KVM_MMU_AUDIT int audit_point; #endif + int ple_gap; + int ple_window; }; struct kvm_vm_stat { @@ -621,6 +623,10 @@ struct kvm_x86_ops { int (*hardware_setup)(void); /* __init */ void (*hardware_unsetup)(void);/* __exit */ bool (*cpu_has_accelerated_tpr)(void); + bool (*cpu_has_ple)(void); + void (*ple_setup)(struct kvm *kvm); /* __init */ + void (*set_ple_gap)(struct kvm *kvm, int value); + void (*set_ple_window)(struct kvm *kvm, int value); void (*cpuid_update)(struct kvm_vcpu *vcpu); /* Create, but do not attach this VCPU */ diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index d017df3..311ed96 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3976,6 +3976,23 @@ static bool svm_cpu_has_accelerated_tpr(void) return false; } +static bool cpu_has_svm_ple(void) +{ + return false; +} + +static void svm_ple_setup(struct kvm *kvm) +{ +} + +static void svm_set_ple_gap(struct kvm *kvm, int value) +{ +} + +static void svm_set_ple_window(struct kvm *kvm, int value) +{ +} + static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; @@ -4229,6 +4246,10 @@ static struct kvm_x86_ops svm_x86_ops = { .hardware_enable = svm_hardware_enable, .hardware_disable = svm_hardware_disable, .cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr, + .cpu_has_ple = cpu_has_svm_ple, + .ple_setup = svm_ple_setup, + .set_ple_gap = svm_set_ple_gap, + .set_ple_window = svm_set_ple_window, .vcpu_create = svm_create_vcpu, .vcpu_free = svm_free_vcpu, diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index ad6b1dd..b761669 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3805,6 +3805,29 @@ static void ept_set_mmio_spte_mask(void) kvm_mmu_set_mmio_spte_mask(0xffull 49 | 0x6ull); } +static void vmx_ple_setup(struct kvm *kvm) +{ + /* +* set up ple default value +*/ + if (ple_gap) { + kvm-arch.ple_gap = ple_gap; + kvm-arch.ple_window = ple_window; + } +} + +static void vmx_set_ple_gap(struct kvm *kvm, int value) +{ + if (ple_gap) + kvm-arch.ple_gap = value; +} + +static void vmx_set_ple_window(struct kvm *kvm, int value) +{ + if (ple_gap) + kvm-arch.ple_window = value; +} + /* * Sets up the vmcs for emulated real mode. */ @@ -3814,6 +3837,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) unsigned long a; #endif int i; + struct kvm *kvm = vmx-vcpu.kvm; /* I/O */ vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); @@ -3836,8 +3860,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) } if (ple_gap) { - vmcs_write32(PLE_GAP, ple_gap); - vmcs_write32(PLE_WINDOW, ple_window); + vmcs_write32(PLE_GAP, kvm-arch.ple_gap); + vmcs_write32(PLE_WINDOW, kvm-arch.ple_window); } vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0); @@ -7233,6 +7257,10 @@ static struct kvm_x86_ops vmx_x86_ops = { .hardware_enable = hardware_enable, .hardware_disable = hardware_disable, .cpu_has_accelerated_tpr = report_flexpriority, + .cpu_has_ple = cpu_has_vmx_ple, + .ple_setup = vmx_ple_setup, + .set_ple_gap = vmx_set_ple_gap, + .set_ple_window = vmx_set_ple_window, .vcpu_create = vmx_create_vcpu, .vcpu_free = vmx_free_vcpu, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1eefebe..89b8291 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2197,6 +2197,9 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_VAPIC: r = !kvm_x86_ops-cpu_has_accelerated_tpr(); break; + case KVM_CAP_PLE: + r = kvm_x86_ops-cpu_has_ple(); + break; case KVM_CAP_NR_VCPUS: r = KVM_SOFT_MAX_VCPUS; break; @@ -3462,7 +3465,16 @@ long
[Patch]QEMU: Add -ple-gap and -ple-window options for per domain PLE
QEMU: Add -ple-gap and -ple-window options for per domain PLE Signed-off-by: Xuekun Hu xuekun...@intel.com --- linux-headers/linux/kvm.h |3 +++ qemu-options.hx | 16 sysemu.h |2 ++ target-i386/kvm.c | 13 + vl.c | 29 + 5 files changed, 63 insertions(+), 0 deletions(-) diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h index 4b9e575..a32c68a 100644 --- a/linux-headers/linux/kvm.h +++ b/linux-headers/linux/kvm.h @@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_GET_SMMU_INFO 78 #define KVM_CAP_S390_COW 79 #define KVM_CAP_PPC_ALLOC_HTAB 80 +#define KVM_CAP_PLE 83 #ifdef KVM_CAP_IRQ_ROUTING @@ -769,6 +770,8 @@ struct kvm_msi { struct kvm_userspace_memory_region) #define KVM_SET_TSS_ADDR _IO(KVMIO, 0x47) #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO, 0x48, __u64) +#define KVM_SET_PLE_GAP _IOW(KVMIO, 0x49, __u32) +#define KVM_SET_PLE_WINDOW_IOW(KVMIO, 0x4a, __u32) /* enable ucontrol for s390 */ struct kvm_s390_ucas_mapping { diff --git a/qemu-options.hx b/qemu-options.hx index 3cd9243..c1c5c7b 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -98,6 +98,22 @@ given, the total number of CPUs @var{n} can be omitted. @var{maxcpus} specifies the maximum number of hotpluggable CPUs. ETEXI +DEF(ple-gap, HAS_ARG, QEMU_OPTION_ple_gap, +-ple-gap n Set ple_gap per vm\n, QEMU_ARCH_I386) +STEXI +@item -ple-gap @var{n} +@findex -ple-gap +Set ple_gap to vm. +ETEXI + +DEF(ple-window, HAS_ARG, QEMU_OPTION_ple_window, +-ple-window n Set ple_window per vm\n, QEMU_ARCH_I386) +STEXI +@item -ple-window @var{n} +@findex -ple-window +Set ple_gap to vm. +ETEXI + DEF(numa, HAS_ARG, QEMU_OPTION_numa, -numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]\n, QEMU_ARCH_ALL) STEXI diff --git a/sysemu.h b/sysemu.h index b38b1a3..e25d111 100644 --- a/sysemu.h +++ b/sysemu.h @@ -130,6 +130,8 @@ extern uint8_t *boot_splash_filedata; extern int boot_splash_filedata_size; extern uint8_t qemu_extra_params_fw[2]; extern QEMUClock *rtc_clock; +extern int32_t ple_gap; +extern int32_t ple_window; #define MAX_NODES 64 #define MAX_CPUMASK_BITS 255 diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 5b18383..12e0b2a 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -724,6 +724,19 @@ int kvm_arch_init(KVMState *s) } } } + +if (kvm_check_extension(s, KVM_CAP_PLE) (ple_gap != 0)) { +ret = kvm_vm_ioctl(s, KVM_SET_PLE_GAP, ple_gap); +if (ret 0) { +fprintf(stderr, kvm_set_ple_gap failed\n); +return ret; +} +ret = kvm_vm_ioctl(s, KVM_SET_PLE_WINDOW, ple_window); +if (ret 0) { +fprintf(stderr, kvm_set_ple_window failed\n); +return ret; +} +} return 0; } diff --git a/vl.c b/vl.c index a1c0aa7..96dbc4b 100644 --- a/vl.c +++ b/vl.c @@ -234,6 +234,9 @@ uint8_t *boot_splash_filedata; int boot_splash_filedata_size; uint8_t qemu_extra_params_fw[2]; +int32_t ple_gap = 0; +int32_t ple_window = 0; + typedef struct FWBootEntry FWBootEntry; struct FWBootEntry { @@ -2590,6 +2593,32 @@ int main(int argc, char **argv, char **envp) } } break; +case QEMU_OPTION_ple_gap: +{ +int32_t value; +char *end; + +value = strtol(optarg, end, 10); +if (value 0 || *end) { +fprintf(stderr, qemu: invalid ple_gap vaule: %s\n, optarg); +exit(1); +} +ple_gap = value; +} +break; +case QEMU_OPTION_ple_window: +{ +int32_t value; +char *end; + +value = strtol(optarg, end, 10); +if (value 0 || *end) { +fprintf(stderr, qemu: invalid ple_gap vaule: %s\n, optarg); +exit(1); +} +ple_window = value; +} +break; case QEMU_OPTION_numa: if (nb_numa_nodes = MAX_NODES) { fprintf(stderr, qemu: too many NUMA nodes\n); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: fix help output for run command
On Fri, 12 Oct 2012, William Dauchy wrote: This dummy patch remove tabs in help output. Introduced in commit: ae9ec23 kvm tools: generate command line options dynamically Signed-off-by: William Dauchy wdau...@gmail.com Applied, thanks William! -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] mm: compaction: Correct the nr_strict_isolated check for CMA
Thierry reported that the iron out patch for isolate_freepages_block() had problems due to the strict check being too strict with mm: compaction: Iron out isolate_freepages_block() and isolate_freepages_range() -fix1. It's possible that more pages than necessary are isolated but the check still fails and I missed that this fix was not picked up before RC1. This same problem has been identified in 3.7-RC1 by Tony Prisk and should be addressed by the following patch. Signed-off-by: Mel Gorman mgor...@suse.de Tested-by: Tony Prisk li...@prisktech.co.nz --- mm/compaction.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/compaction.c b/mm/compaction.c index 2c4ce17..9eef558 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -346,7 +346,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc, * pages requested were isolated. If there were any failures, 0 is * returned and CMA will fail. */ - if (strict nr_strict_required != total_isolated) + if (strict nr_strict_required total_isolated) total_isolated = 0; if (locked) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
kvm-clock clocksource efficiency versus tsc...
OS: Centos 6.2 KVM version: qemu-kvm-tools-0.12.1.2-2.209.el6_2.4.x86_64 qemu-kvm-0.12.1.2-2.209.el6_2.4.x86_64 uname -a: Linux myhost 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux Hi, I have been performance testing a time tracing utiliity for a Java enterprise application at work. The idea is that we measure time for different parts of our application and build time trees for this information. The time tree can be viewed and analysed in case of problems. It is also automatically output in case a certain operation is taking longer than a threshold. As part of these tests we found that it matters a lot what type of clocksource is used. In particular, with hpet and acpi_pm the execution is very slow (700ns per call (similar results using clock_gettime() in a C program). In addition, hpet and acpi_pm sycnrhonize the application. This is of course a disaster for server applications that tend to query the current time quite a lot. What works quite well is the tsc clocksource. In that case, the time drops to about 38ns on one of our systems and we can prove that there is no synchronization anymore which is good. When running inside a KVM VM the default clock source is kvm-clock. This clock takes about 160ns per call and also does not synchronise the application. However, using the tsc clock source delivers similar performance on the virtual machine as on the host. I now have the following questions: * can the performance of kvm-clock be optimized further to be (almost) identical to that of the host's clock source? * what are the consequences of using the tsc clock in combination with NTPD? Will this result in system instability or larger than usual clock skew? Cheers Erik -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM call agenda for 2012-10-16
-boot and -bootindex On 15.10.2012, at 12:17, Juan Quintela quint...@redhat.com wrote: Hi Please send in any agenda topics you are interested in. Later, Juan. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Secure migration of LVM based guests over WAN
Hey all, I have a question about a solution for migrate LVM based guests directly over the network. So the situation: Two KVM hosts with libvirt, multiple LVM based guests Want to do: Migrate a LVM based guest directly to the other host over an secure connection I know that migration is possible when the VM disks are stored on an NFS, GFS2 filer/cluster etc. So would it be possible to do an offline migration directly with netcat or something like that? Would be greate when someone could help me. Best Regards -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch]KVM: enabling per domain PLE
On 10/16/2012 08:53 AM, Hu, Xuekun wrote: Setting the same PLE parameter arbitrarily for different workloads is not a good solution. True. The solution enables per domain PLE which gives user ability to set PLE parameter for different domain for better performance. The problem with this is that it requires an administrator to understand the workload, not only of the guest, but also of other guests on the machine. With low overcommit, a high PLE window reduces unneeded exits, but with high overcommit we need those exits to reduce spinning. In addition, most kvm hosts don't have an administrator. They are controlled by a management system, which means we'll need some algorithm in userspace to control the PLE window. Taking the two together, we need a dynamic (for changing workloads) algorithm. There are threads discussing this dynamic algorithm, we are making slow progress because it's such a difficult problem, but I think this is much more useful than anything requiring user intervention. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Secure migration of LVM based guests over WAN
On 10/16/2012 11:12 AM, Lukas Laukamp wrote: Hey all, I have a question about a solution for migrate LVM based guests directly over the network. So the situation: Two KVM hosts with libvirt, multiple LVM based guests Want to do: Migrate a LVM based guest directly to the other host over an secure connection I know that migration is possible when the VM disks are stored on an NFS, GFS2 filer/cluster etc. So would it be possible to do an offline migration directly with netcat or something like that? If all you need is offline, you can use scp to copy each volume to the destination volume. Make sure the guests are shut down when you do that. It is also possible to do a live migration, but unless the destination and source are in the same IP subnet, the guests are going to lose connectivity. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm
On 10/16/2012 05:59 AM, Paul Mackerras wrote: The mmu_notifier_retry() function, used to test whether any page invalidations are in progress, currently takes a vcpu pointer, though the code only needs the VM's struct kvm pointer. Forthcoming patches to the powerpc Book3S HV code will need to test for retry within a VM ioctl, where a struct kvm pointer is available but a struct vcpu pointer isn't. Therefore this creates a variant of mmu_notifier_retry called kvm_mmu_notifier_retry that takes a struct kvm pointer, and implements mmu_notifier_retry in terms of it. Why not change mmu_notifier_retry() and all its callers? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Secure migration of LVM based guests over WAN
Am 16.10.2012 11:40, schrieb Avi Kivity: On 10/16/2012 11:12 AM, Lukas Laukamp wrote: Hey all, I have a question about a solution for migrate LVM based guests directly over the network. So the situation: Two KVM hosts with libvirt, multiple LVM based guests Want to do: Migrate a LVM based guest directly to the other host over an secure connection I know that migration is possible when the VM disks are stored on an NFS, GFS2 filer/cluster etc. So would it be possible to do an offline migration directly with netcat or something like that? If all you need is offline, you can use scp to copy each volume to the destination volume. Make sure the guests are shut down when you do that. It is also possible to do a live migration, but unless the destination and source are in the same IP subnet, the guests are going to lose connectivity. Hello Avi, so can I simply copy an logical volume to the path of the volume group with scp? For the live migration theme, it would be no problem when the guests looses connectivity, how could be done a live migration? Best Regards -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for 2012-10-16
On Mon, 15 Oct 2012 18:27:51 +0200 Igor Mammedov imamm...@redhat.com wrote: CPU as DEVICE http://lists.gnu.org/archive/html/qemu-devel/2012-10/msg00719.html latest known tree for testing: https://github.com/ehabkost/qemu-hacks/commits/work/cpu-devicestate-qdev-core may be we could agree on proposed RFC. Rebased series that applies to yesterday's master. http://www.mail-archive.com/qemu-devel@nongnu.org/msg136505.html git tree for testing: https://github.com/imammedo/qemu/tree/cpu_as_dev.v3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm
On 16.10.2012, at 11:44, Avi Kivity wrote: On 10/16/2012 05:59 AM, Paul Mackerras wrote: The mmu_notifier_retry() function, used to test whether any page invalidations are in progress, currently takes a vcpu pointer, though the code only needs the VM's struct kvm pointer. Forthcoming patches to the powerpc Book3S HV code will need to test for retry within a VM ioctl, where a struct kvm pointer is available but a struct vcpu pointer isn't. Therefore this creates a variant of mmu_notifier_retry called kvm_mmu_notifier_retry that takes a struct kvm pointer, and implements mmu_notifier_retry in terms of it. Why not change mmu_notifier_retry() and all its callers? Why not use Christoffer's patch? :) Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
On 10/16/2012 06:01 AM, Paul Mackerras wrote: A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor. Reads on this fd return the contents of the HPT (hashed page table), writes create and/or remove entries in the HPT. There is a new capability, KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl. The ioctl takes an argument structure with the index of the first HPT entry to read out and a set of flags. The flags indicate whether the user is intending to read or write the HPT, and whether to return all entries or only the bolted entries (those with the bolted bit, 0x10, set in the first doubleword). This is intended for use in implementing qemu's savevm/loadvm and for live migration. Therefore, on reads, the first pass returns information about all HPTEs (or all bolted HPTEs). When the first pass reaches the end of the HPT, it returns from the read. Subsequent reads only return information about HPTEs that have changed since they were last read. A read that finds no changed HPTEs in the HPT following where the last read finished will return 0 bytes. Copying people with interest in migration. +4.78 KVM_PPC_GET_HTAB_FD + +Capability: KVM_CAP_PPC_HTAB_FD +Architectures: powerpc +Type: vm ioctl +Parameters: Pointer to struct kvm_get_htab_fd (in) +Returns: file descriptor number (= 0) on success, -1 on error + +This returns a file descriptor that can be used either to read out the +entries in the guest's hashed page table (HPT), or to write entries to +initialize the HPT. The returned fd can only be written to if the +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and +can only be read if that bit is clear. The argument struct looks like +this: + +/* For KVM_PPC_GET_HTAB_FD */ +struct kvm_get_htab_fd { + __u64 flags; + __u64 start_index; +}; + +/* Values for kvm_get_htab_fd.flags */ +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) +#define KVM_GET_HTAB_WRITE ((__u64)0x2) + +The `start_index' field gives the index in the HPT of the entry at +which to start reading. It is ignored when writing. + +Reads on the fd will initially supply information about all +interesting HPT entries. Interesting entries are those with the +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise +all entries. When the end of the HPT is reached, the read() will +return. What happens if the read buffer is smaller than the HPT size? What happens if the read buffer size is not a multiple of entry size? Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry to change). If read() is called again on the fd, it will start again from +the beginning of the HPT, but will only return HPT entries that have +changed since they were last read. + +Data read or written is structured as a header (8 bytes) followed by a +series of valid HPT entries (16 bytes) each. The header indicates how +many valid HPT entries there are and how many invalid entries follow +the valid entries. The invalid entries are not represented explicitly +in the stream. The header format is: + +struct kvm_get_htab_header { + __u32 index; + __u16 n_valid; + __u16 n_invalid; +}; This structure forces the kernel to return entries sequentially. Will this block changing the data structure in the future? Or is the hardware spec sufficiently strict that such changes are not realistic? + +Writes to the fd create HPT entries starting at the index given in the +header; first `n_valid' valid entries with contents from the data +written, then `n_invalid' invalid entries, invalidating any previously +valid entries found. This scheme is a clever, original, and very interesting approach to live migration. That doesn't necessarily mean a NAK, we should see if it makes sense for other migration APIs as well (we currently have difficulties migrating very large/wide guests). What is the typical number of entries in the HPT? Do you have estimates of the change rate? Suppose new hardware arrives that supports nesting HPTs, so that kvm is no longer synchronously aware of the guest HPT (similar to how NPT/EPT made kvm unaware of guest virtual-physical translations on x86). How will we deal with that? But I guess this will be a non-guest-transparent and non-userspace-transparent change, unlike NPT/EPT, so a userspace ABI addition will be needed anyway). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Secure migration of LVM based guests over WAN
On 10/16/2012 11:48 AM, Lukas Laukamp wrote: Am 16.10.2012 11:40, schrieb Avi Kivity: On 10/16/2012 11:12 AM, Lukas Laukamp wrote: Hey all, I have a question about a solution for migrate LVM based guests directly over the network. So the situation: Two KVM hosts with libvirt, multiple LVM based guests Want to do: Migrate a LVM based guest directly to the other host over an secure connection I know that migration is possible when the VM disks are stored on an NFS, GFS2 filer/cluster etc. So would it be possible to do an offline migration directly with netcat or something like that? If all you need is offline, you can use scp to copy each volume to the destination volume. Make sure the guests are shut down when you do that. It is also possible to do a live migration, but unless the destination and source are in the same IP subnet, the guests are going to lose connectivity. Hello Avi, so can I simply copy an logical volume to the path of the volume group with scp? Yes. Best to enable compression to avoid sending zero blocks. For the live migration theme, it would be no problem when the guests looses connectivity, how could be done a live migration? See the -b option to the migrate command. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: Distangle eventfd code from irqchip
On 10/15/2012 02:02 PM, Alexander Graf wrote: The current eventfd code assumes that when we have eventfd, we also have irqfd for in-kernel interrupt delivery. This is not necessarily true. On PPC we don't have an in-kernel irqchip yet, but we can still support easily support eventfd. Don't you need, in addition, to reject PIO space for ioeventfd? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 16.10.2012, at 12:56, Avi Kivity wrote: On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this patch set is cleaning up: enable you to build ioeventfd support without in-kernel irqchip irqfd support. Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) in user space. That path works just fine with these patches applied. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 10/16/2012 12:59 PM, Alexander Graf wrote: On 16.10.2012, at 12:56, Avi Kivity wrote: On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this patch set is cleaning up: enable you to build ioeventfd support without in-kernel irqchip irqfd support. Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) in user space. That path works just fine with these patches applied. That's all true but it wasn't my point. I was asking whether you can enable irqfd without in-kernel irqchip. If your irq input is edge triggered then it's trivial. If it's level triggered then you can use the new resampling irqfd. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: Distangle eventfd code from irqchip
On 16.10.2012, at 12:57, Avi Kivity wrote: On 10/15/2012 02:02 PM, Alexander Graf wrote: The current eventfd code assumes that when we have eventfd, we also have irqfd for in-kernel interrupt delivery. This is not necessarily true. On PPC we don't have an in-kernel irqchip yet, but we can still support easily support eventfd. Don't you need, in addition, to reject PIO space for ioeventfd? Yeah, we could tell user space that it's doing something stupid. Not sure how critical it is - right now it would just register it and never get triggered, because the bus doesn't exist. But sanity checks are a good thing usually. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 16.10.2012, at 13:01, Avi Kivity wrote: On 10/16/2012 12:59 PM, Alexander Graf wrote: On 16.10.2012, at 12:56, Avi Kivity wrote: On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this patch set is cleaning up: enable you to build ioeventfd support without in-kernel irqchip irqfd support. Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) in user space. That path works just fine with these patches applied. That's all true but it wasn't my point. I was asking whether you can enable irqfd without in-kernel irqchip. If your irq input is edge triggered then it's trivial. If it's level triggered then you can use the new resampling irqfd. I'm not sure I fully grasp what you're trying to say :). We have a single interrupt line on the core. So whenever any external interrupt gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it which line is active. So yes, we could create a direct fd channel between vhost and the user space MPIC, but it wouldn't buy us anything. The interrupt injection path would be as long as it is with the current mechanism. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote: On 10/16/2012 06:01 AM, Paul Mackerras wrote: +4.78 KVM_PPC_GET_HTAB_FD + +Capability: KVM_CAP_PPC_HTAB_FD +Architectures: powerpc +Type: vm ioctl +Parameters: Pointer to struct kvm_get_htab_fd (in) +Returns: file descriptor number (= 0) on success, -1 on error + +This returns a file descriptor that can be used either to read out the +entries in the guest's hashed page table (HPT), or to write entries to +initialize the HPT. The returned fd can only be written to if the +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and +can only be read if that bit is clear. The argument struct looks like +this: + +/* For KVM_PPC_GET_HTAB_FD */ +struct kvm_get_htab_fd { + __u64 flags; + __u64 start_index; +}; + +/* Values for kvm_get_htab_fd.flags */ +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) +#define KVM_GET_HTAB_WRITE ((__u64)0x2) + +The `start_index' field gives the index in the HPT of the entry at +which to start reading. It is ignored when writing. + +Reads on the fd will initially supply information about all +interesting HPT entries. Interesting entries are those with the +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise +all entries. When the end of the HPT is reached, the read() will +return. What happens if the read buffer is smaller than the HPT size? That's fine; the read stops when it has filled the buffer and a subsequent read will continue from where the previous one finished. What happens if the read buffer size is not a multiple of entry size? Then we don't use the last few bytes of the buffer. The read() call returns the number of bytes that were filled in, of course. In any case, the header size is 8 bytes and the HPT entry size is 16 bytes, so the number of bytes filled in won't necessarily be a multiple of 16 bytes. Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry to change). No. If read() is called again on the fd, it will start again from +the beginning of the HPT, but will only return HPT entries that have +changed since they were last read. + +Data read or written is structured as a header (8 bytes) followed by a +series of valid HPT entries (16 bytes) each. The header indicates how +many valid HPT entries there are and how many invalid entries follow +the valid entries. The invalid entries are not represented explicitly +in the stream. The header format is: + +struct kvm_get_htab_header { + __u32 index; + __u16 n_valid; + __u16 n_invalid; +}; This structure forces the kernel to return entries sequentially. Will this block changing the data structure in the future? Or is the hardware spec sufficiently strict that such changes are not realistic? By data structure, do you mean the stream format on the file descriptor, or the HPT structure? If we want a new stream format, then we would define a bit in the flags field of struct kvm_get_htab_fd to mean I want the new stream format. The code fails the ioctl if any unknown flag bits are set, so a new userspace that wants to use the new format could then detect that it is running on an old kernel and fall back to the old format. The HPT entry format is very unlikely to change in size or basic layout (though the architects do redefine some of the bits occasionally). + +Writes to the fd create HPT entries starting at the index given in the +header; first `n_valid' valid entries with contents from the data +written, then `n_invalid' invalid entries, invalidating any previously +valid entries found. This scheme is a clever, original, and very interesting approach to live migration. That doesn't necessarily mean a NAK, we should see if it makes sense for other migration APIs as well (we currently have difficulties migrating very large/wide guests). What is the typical number of entries in the HPT? Do you have estimates of the change rate? Typically the HPT would have about a million entries, i.e. it would be 16MiB in size. The usual guideline is to make it about 1/64 of the maximum amount of RAM the guest could ever have, rounded up to a power of two, although we often run with less, say 1/128 or even 1/256. Because it is a hash table, updates tend to be scattered throughout the whole table, which is another reason why per-page dirty tracking and updates would be pretty inefficient. As for the change rate, it depends on the application of course, but basically every time the guest changes a PTE in its Linux page tables we do the corresponding change to the corresponding HPT entry, so the rate can be quite high. Workloads that do a lot of fork, exit, mmap, exec, etc. have a high rate of HPT updates. Suppose new hardware arrives that supports nesting HPTs, so that kvm is no longer synchronously aware of the guest HPT (similar to how NPT/EPT made
[PATCH v5 1/6] KVM: MMU: fix release noslot pfn
We can not directly call kvm_release_pfn_clean to release the pfn since we can meet noslot pfn which is used to cache mmio info into spte Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/mmu.c |3 +-- virt/kvm/kvm_main.c |4 +--- 2 files changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index d289fee..6f85fe0 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2497,8 +2497,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, } } - if (!is_error_pfn(pfn)) - kvm_release_pfn_clean(pfn); + kvm_release_pfn_clean(pfn); } static void nonpaging_new_cr3(struct kvm_vcpu *vcpu) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index c353b45..a65bc02 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1322,9 +1322,7 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean); void kvm_release_pfn_clean(pfn_t pfn) { - WARN_ON(is_error_pfn(pfn)); - - if (!kvm_is_mmio_pfn(pfn)) + if (!is_error_pfn(pfn) !kvm_is_mmio_pfn(pfn)) put_page(pfn_to_page(pfn)); } EXPORT_SYMBOL_GPL(kvm_release_pfn_clean); -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 2/6] KVM: MMU: remove mmu_is_invalid
Remove mmu_is_invalid and use is_invalid_pfn instead Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/mmu.c |5 - arch/x86/kvm/paging_tmpl.h |4 ++-- 2 files changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 6f85fe0..b8d13d7 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2699,11 +2699,6 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, } } -static bool mmu_invalid_pfn(pfn_t pfn) -{ - return unlikely(is_invalid_pfn(pfn)); -} - static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn, pfn_t pfn, unsigned access, int *ret_val) { diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 714e2c0..045d31a 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -340,7 +340,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, pte_access = sp-role.access gpte_access(vcpu, gpte); protect_clean_gpte(pte_access, gpte); pfn = gfn_to_pfn_atomic(vcpu-kvm, gpte_to_gfn(gpte)); - if (mmu_invalid_pfn(pfn)) + if (is_invalid_pfn(pfn)) return; /* @@ -416,7 +416,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, gfn = gpte_to_gfn(gpte); pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn, pte_access ACC_WRITE_MASK); - if (mmu_invalid_pfn(pfn)) + if (is_invalid_pfn(pfn)) break; mmu_set_spte(vcpu, spte, sp-role.access, pte_access, 0, 0, -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 3/6] KVM: MMU: cleanup FNAME(page_fault)
Let it return emulate state instead of spte like __direct_map Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/paging_tmpl.h | 32 +--- 1 files changed, 13 insertions(+), 19 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 045d31a..c32 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -427,21 +427,21 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, /* * Fetch a shadow pte for a specific level in the paging hierarchy. + * If the guest tries to write a write-protected page, we need to + * emulate this operation, return 1 to indicate this case. */ -static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, +static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, struct guest_walker *gw, int user_fault, int write_fault, int hlevel, -int *emulate, pfn_t pfn, bool map_writable, -bool prefault) +pfn_t pfn, bool map_writable, bool prefault) { - unsigned access = gw-pt_access; struct kvm_mmu_page *sp = NULL; - int top_level; - unsigned direct_access; struct kvm_shadow_walk_iterator it; + unsigned direct_access, access = gw-pt_access; + int top_level, emulate = 0; if (!is_present_gpte(gw-ptes[gw-level - 1])) - return NULL; + return 0; direct_access = gw-pte_access; @@ -505,17 +505,17 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, clear_sp_write_flooding_count(it.sptep); mmu_set_spte(vcpu, it.sptep, access, gw-pte_access, -user_fault, write_fault, emulate, it.level, +user_fault, write_fault, emulate, it.level, gw-gfn, pfn, prefault, map_writable); FNAME(pte_prefetch)(vcpu, gw, it.sptep); - return it.sptep; + return emulate; out_gpte_changed: if (sp) kvm_mmu_put_page(sp, it.sptep); kvm_release_pfn_clean(pfn); - return NULL; + return 0; } /* @@ -538,8 +538,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, int write_fault = error_code PFERR_WRITE_MASK; int user_fault = error_code PFERR_USER_MASK; struct guest_walker walker; - u64 *sptep; - int emulate = 0; int r; pfn_t pfn; int level = PT_PAGE_TABLE_LEVEL; @@ -601,17 +599,13 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, kvm_mmu_free_some_pages(vcpu); if (!force_pt_level) transparent_hugepage_adjust(vcpu, walker.gfn, pfn, level); - sptep = FNAME(fetch)(vcpu, addr, walker, user_fault, write_fault, -level, emulate, pfn, map_writable, prefault); - (void)sptep; - pgprintk(%s: shadow pte %p %llx emulate %d\n, __func__, -sptep, *sptep, emulate); - + r = FNAME(fetch)(vcpu, addr, walker, user_fault, write_fault, +level, pfn, map_writable, prefault); ++vcpu-stat.pf_fixed; kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT); spin_unlock(vcpu-kvm-mmu_lock); - return emulate; + return r; out_unlock: spin_unlock(vcpu-kvm-mmu_lock); -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 4/6] KVM: MMU: move prefetch_invalid_gpte out of pagaing_tmp.h
The function does not depend on guest mmu mode, move it out from paging_tmpl.h Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/mmu.c | 36 arch/x86/kvm/paging_tmpl.h | 26 +++--- 2 files changed, 31 insertions(+), 31 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index b8d13d7..56c0e39 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2505,6 +2505,14 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu) mmu_free_roots(vcpu); } +static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level) +{ + int bit7; + + bit7 = (gpte 7) 1; + return (gpte mmu-rsvd_bits_mask[bit7][level-1]) != 0; +} + static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, bool no_dirty_log) { @@ -2517,6 +2525,26 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, return gfn_to_pfn_memslot_atomic(slot, gfn); } +static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *sp, u64 *spte, + u64 gpte) +{ + if (is_rsvd_bits_set(vcpu-arch.mmu, gpte, PT_PAGE_TABLE_LEVEL)) + goto no_present; + + if (!is_present_gpte(gpte)) + goto no_present; + + if (!(gpte PT_ACCESSED_MASK)) + goto no_present; + + return false; + +no_present: + drop_spte(vcpu-kvm, spte); + return true; +} + static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, u64 *start, u64 *end) @@ -3394,14 +3422,6 @@ static void paging_free(struct kvm_vcpu *vcpu) nonpaging_free(vcpu); } -static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level) -{ - int bit7; - - bit7 = (gpte 7) 1; - return (gpte mmu-rsvd_bits_mask[bit7][level-1]) != 0; -} - static inline void protect_clean_gpte(unsigned *access, unsigned gpte) { unsigned mask; diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index c32..36a80ed 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -305,26 +305,6 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker, addr, access); } -static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu, - struct kvm_mmu_page *sp, u64 *spte, - pt_element_t gpte) -{ - if (is_rsvd_bits_set(vcpu-arch.mmu, gpte, PT_PAGE_TABLE_LEVEL)) - goto no_present; - - if (!is_present_gpte(gpte)) - goto no_present; - - if (!(gpte PT_ACCESSED_MASK)) - goto no_present; - - return false; - -no_present: - drop_spte(vcpu-kvm, spte); - return true; -} - static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, u64 *spte, const void *pte) { @@ -333,7 +313,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, pfn_t pfn; gpte = *(const pt_element_t *)pte; - if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte)) + if (prefetch_invalid_gpte(vcpu, sp, spte, gpte)) return; pgprintk(%s: gpte %llx spte %p\n, __func__, (u64)gpte, spte); @@ -408,7 +388,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, gpte = gptep[i]; - if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte)) + if (prefetch_invalid_gpte(vcpu, sp, spte, gpte)) continue; pte_access = sp-role.access gpte_access(vcpu, gpte); @@ -751,7 +731,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) sizeof(pt_element_t))) return -EINVAL; - if (FNAME(prefetch_invalid_gpte)(vcpu, sp, sp-spt[i], gpte)) { + if (prefetch_invalid_gpte(vcpu, sp, sp-spt[i], gpte)) { vcpu-kvm-tlbs_dirty++; continue; } -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 5/6] KVM: MMU: introduce FNAME(prefetch_gpte)
The only difference between FNAME(update_pte) and FNAME(pte_prefetch) is that the former is allowed to prefetch gfn from dirty logged slot, so introduce a common function to prefetch spte Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/paging_tmpl.h | 55 +++ 1 files changed, 24 insertions(+), 31 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 36a80ed..f887e4c 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -305,31 +305,43 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker, addr, access); } -static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, - u64 *spte, const void *pte) +static bool +FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, +u64 *spte, pt_element_t gpte, bool no_dirty_log) { - pt_element_t gpte; unsigned pte_access; + gfn_t gfn; pfn_t pfn; - gpte = *(const pt_element_t *)pte; if (prefetch_invalid_gpte(vcpu, sp, spte, gpte)) - return; + return false; pgprintk(%s: gpte %llx spte %p\n, __func__, (u64)gpte, spte); + + gfn = gpte_to_gfn(gpte); pte_access = sp-role.access gpte_access(vcpu, gpte); protect_clean_gpte(pte_access, gpte); - pfn = gfn_to_pfn_atomic(vcpu-kvm, gpte_to_gfn(gpte)); + pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn, + no_dirty_log (pte_access ACC_WRITE_MASK)); if (is_invalid_pfn(pfn)) - return; + return false; /* -* we call mmu_set_spte() with host_writable = true because that -* vcpu-arch.update_pte.pfn was fetched from get_user_pages(write = 1). +* we call mmu_set_spte() with host_writable = true because +* pte_prefetch_gfn_to_pfn always gets a writable pfn. */ mmu_set_spte(vcpu, spte, sp-role.access, pte_access, 0, 0, -NULL, PT_PAGE_TABLE_LEVEL, -gpte_to_gfn(gpte), pfn, true, true); +NULL, PT_PAGE_TABLE_LEVEL, gfn, pfn, true, true); + + return true; +} + +static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, + u64 *spte, const void *pte) +{ + pt_element_t gpte = *(const pt_element_t *)pte; + + FNAME(prefetch_gpte)(vcpu, sp, spte, gpte, false); } static bool FNAME(gpte_changed)(struct kvm_vcpu *vcpu, @@ -375,33 +387,14 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, spte = sp-spt + i; for (i = 0; i PTE_PREFETCH_NUM; i++, spte++) { - pt_element_t gpte; - unsigned pte_access; - gfn_t gfn; - pfn_t pfn; - if (spte == sptep) continue; if (is_shadow_present_pte(*spte)) continue; - gpte = gptep[i]; - - if (prefetch_invalid_gpte(vcpu, sp, spte, gpte)) - continue; - - pte_access = sp-role.access gpte_access(vcpu, gpte); - protect_clean_gpte(pte_access, gpte); - gfn = gpte_to_gfn(gpte); - pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn, - pte_access ACC_WRITE_MASK); - if (is_invalid_pfn(pfn)) + if (!FNAME(prefetch_gpte)(vcpu, sp, spte, gptep[i], true)) break; - - mmu_set_spte(vcpu, spte, sp-role.access, pte_access, 0, 0, -NULL, PT_PAGE_TABLE_LEVEL, gfn, -pfn, true, true); } } -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 6/6] KVM: do not treat noslot pfn as a error pfn
This patch filters noslot pfn out from error pfns based on Marcelo comment: noslot pfn is not a error pfn After this patch, - is_noslot_pfn indicates that the gfn is not in slot - is_error_pfn indicates that the gfn is in slot but the error is occurred when translate the gfn to pfn - is_error_noslot_pfn indicates that the pfn either it is error pfns or it is noslot pfn And is_invalid_pfn can be removed, it makes the code more clean [ The patch has been compiling tested for powerpc ] Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_32_mmu_host.c |2 +- arch/powerpc/kvm/book3s_64_mmu_host.c |2 +- arch/powerpc/kvm/e500_tlb.c |2 +- arch/x86/kvm/mmu.c|4 ++-- arch/x86/kvm/paging_tmpl.h|2 +- arch/x86/kvm/x86.c|2 +- include/linux/kvm_host.h | 28 virt/kvm/iommu.c |4 ++-- virt/kvm/kvm_main.c |6 +++--- 9 files changed, 32 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/kvm/book3s_32_mmu_host.c b/arch/powerpc/kvm/book3s_32_mmu_host.c index 9fac010..844d591 100644 --- a/arch/powerpc/kvm/book3s_32_mmu_host.c +++ b/arch/powerpc/kvm/book3s_32_mmu_host.c @@ -155,7 +155,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte) /* Get host physical address for gpa */ hpaddr = kvmppc_gfn_to_pfn(vcpu, orig_pte-raddr PAGE_SHIFT); - if (is_error_pfn(hpaddr)) { + if (is_error_noslot_pfn(hpaddr)) { printk(KERN_INFO Couldn't get guest page for gfn %lx!\n, orig_pte-eaddr); r = -EINVAL; diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c index 6b2c80e..53f390f 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_host.c +++ b/arch/powerpc/kvm/book3s_64_mmu_host.c @@ -92,7 +92,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte) /* Get host physical address for gpa */ hpaddr = kvmppc_gfn_to_pfn(vcpu, orig_pte-raddr PAGE_SHIFT); - if (is_error_pfn(hpaddr)) { + if (is_error_noslot_pfn(hpaddr)) { printk(KERN_INFO Couldn't get guest page for gfn %lx!\n, orig_pte-eaddr); r = -EINVAL; goto out; diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c index c733894..6305ee6 100644 --- a/arch/powerpc/kvm/e500_tlb.c +++ b/arch/powerpc/kvm/e500_tlb.c @@ -524,7 +524,7 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500, if (likely(!pfnmap)) { unsigned long tsize_pages = 1 (tsize + 10 - PAGE_SHIFT); pfn = gfn_to_pfn_memslot(slot, gfn); - if (is_error_pfn(pfn)) { + if (is_error_noslot_pfn(pfn)) { printk(KERN_ERR Couldn't get real page for gfn %lx!\n, (long)gfn); return; diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 56c0e39..54c3557 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2699,7 +2699,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, * PT_PAGE_TABLE_LEVEL and there would be no adjustment done * here. */ - if (!is_error_pfn(pfn) !kvm_is_mmio_pfn(pfn) + if (!is_error_noslot_pfn(pfn) !kvm_is_mmio_pfn(pfn) level == PT_PAGE_TABLE_LEVEL PageTransCompound(pfn_to_page(pfn)) !has_wrprotected_page(vcpu-kvm, gfn, PT_DIRECTORY_LEVEL)) { @@ -2733,7 +2733,7 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn, bool ret = true; /* The pfn is invalid, report the error! */ - if (unlikely(is_invalid_pfn(pfn))) { + if (unlikely(is_error_pfn(pfn))) { *ret_val = kvm_handle_bad_page(vcpu, gfn, pfn); goto exit; } diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index f887e4c..89f3241 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -323,7 +323,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, protect_clean_gpte(pte_access, gpte); pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn, no_dirty_log (pte_access ACC_WRITE_MASK)); - if (is_invalid_pfn(pfn)) + if (is_error_pfn(pfn)) return false; /* diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 841d09b..8b90dd5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4504,7 +4504,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t gva) * instruction - ... */ pfn = gfn_to_pfn(vcpu-kvm, gpa_to_gfn(gpa)); - if (!is_error_pfn(pfn)) { + if (!is_error_noslot_pfn(pfn)) {
Re: [PATCH] mm: compaction: Correct the nr_strict_isolated check for CMA
On 10/16/2012 04:39 AM, Mel Gorman wrote: Thierry reported that the iron out patch for isolate_freepages_block() had problems due to the strict check being too strict with mm: compaction: Iron out isolate_freepages_block() and isolate_freepages_range() -fix1. It's possible that more pages than necessary are isolated but the check still fails and I missed that this fix was not picked up before RC1. This same problem has been identified in 3.7-RC1 by Tony Prisk and should be addressed by the following patch. Signed-off-by: Mel Gorman mgor...@suse.de Tested-by: Tony Prisk li...@prisktech.co.nz Acked-by: Rik van Riel r...@redhat.com -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
On 10/16/2012 01:58 PM, Paul Mackerras wrote: On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote: On 10/16/2012 06:01 AM, Paul Mackerras wrote: +4.78 KVM_PPC_GET_HTAB_FD + +Capability: KVM_CAP_PPC_HTAB_FD +Architectures: powerpc +Type: vm ioctl +Parameters: Pointer to struct kvm_get_htab_fd (in) +Returns: file descriptor number (= 0) on success, -1 on error + +This returns a file descriptor that can be used either to read out the +entries in the guest's hashed page table (HPT), or to write entries to +initialize the HPT. The returned fd can only be written to if the +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and +can only be read if that bit is clear. The argument struct looks like +this: + +/* For KVM_PPC_GET_HTAB_FD */ +struct kvm_get_htab_fd { + __u64 flags; + __u64 start_index; +}; + +/* Values for kvm_get_htab_fd.flags */ +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) +#define KVM_GET_HTAB_WRITE((__u64)0x2) + +The `start_index' field gives the index in the HPT of the entry at +which to start reading. It is ignored when writing. + +Reads on the fd will initially supply information about all +interesting HPT entries. Interesting entries are those with the +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise +all entries. When the end of the HPT is reached, the read() will +return. What happens if the read buffer is smaller than the HPT size? That's fine; the read stops when it has filled the buffer and a subsequent read will continue from where the previous one finished. What happens if the read buffer size is not a multiple of entry size? Then we don't use the last few bytes of the buffer. The read() call returns the number of bytes that were filled in, of course. In any case, the header size is 8 bytes and the HPT entry size is 16 bytes, so the number of bytes filled in won't necessarily be a multiple of 16 bytes. That's sane and expected, but it should be documented. Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry to change). No. This forces userspace to dedicate a thread for the HPT. If read() is called again on the fd, it will start again from +the beginning of the HPT, but will only return HPT entries that have +changed since they were last read. + +Data read or written is structured as a header (8 bytes) followed by a +series of valid HPT entries (16 bytes) each. The header indicates how +many valid HPT entries there are and how many invalid entries follow +the valid entries. The invalid entries are not represented explicitly +in the stream. The header format is: + +struct kvm_get_htab_header { + __u32 index; + __u16 n_valid; + __u16 n_invalid; +}; This structure forces the kernel to return entries sequentially. Will this block changing the data structure in the future? Or is the hardware spec sufficiently strict that such changes are not realistic? By data structure, do you mean the stream format on the file descriptor, or the HPT structure? If we want a new stream format, then we would define a bit in the flags field of struct kvm_get_htab_fd to mean I want the new stream format. The code fails the ioctl if any unknown flag bits are set, so a new userspace that wants to use the new format could then detect that it is running on an old kernel and fall back to the old format. The HPT entry format is very unlikely to change in size or basic layout (though the architects do redefine some of the bits occasionally). I meant the internal data structure that holds HPT entries. I guess I don't understand the index. Do we expect changes to be in contiguous ranges? And invalid entries to be contiguous as well? That doesn't fit with how hash tables work. Does the index represent the position of the entry within the table, or something else? + +Writes to the fd create HPT entries starting at the index given in the +header; first `n_valid' valid entries with contents from the data +written, then `n_invalid' invalid entries, invalidating any previously +valid entries found. This scheme is a clever, original, and very interesting approach to live migration. That doesn't necessarily mean a NAK, we should see if it makes sense for other migration APIs as well (we currently have difficulties migrating very large/wide guests). What is the typical number of entries in the HPT? Do you have estimates of the change rate? Typically the HPT would have about a million entries, i.e. it would be 16MiB in size. The usual guideline is to make it about 1/64 of the maximum amount of RAM the guest could ever have, rounded up to a power of two, although we often run with less, say 1/128 or even 1/256. 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE. Does it warrant a live migration protocol? Because it is
Re: [PATCH] mm: compaction: Correct the nr_strict_isolated check for CMA
On Tue, Oct 16, 2012 at 09:39:27AM +0100, Mel Gorman wrote: Thierry reported that the iron out patch for isolate_freepages_block() had problems due to the strict check being too strict with mm: compaction: Iron out isolate_freepages_block() and isolate_freepages_range() -fix1. It's possible that more pages than necessary are isolated but the check still fails and I missed that this fix was not picked up before RC1. This same problem has been identified in 3.7-RC1 by Tony Prisk and should be addressed by the following patch. Signed-off-by: Mel Gorman mgor...@suse.de Tested-by: Tony Prisk li...@prisktech.co.nz Acked-by: Minchan Kim minc...@kernel.org -- Kind Regards, Minchan Kim -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Using PCI config space to indicate config location
Michael S. Tsirkin m...@redhat.com writes: On Fri, Oct 12, 2012 at 08:21:50PM +1030, Rusty Russell wrote: Michael S. Tsirkin m...@redhat.com writes: On Fri, Oct 12, 2012 at 08:59:36AM +1030, Rusty Russell wrote: For writes, the standard seems to be a commit latch. We could abuse the generation count for this: the driver writes to it to commit config changes. I think this will work. There are a couple of things that bother me: This assumes read accesses have no side effects, and these are sometimes handy. Also the semantics for write aren't very clear to me. I guess device must buffer data until generation count write? This assumes the device has a buffer to store writes, and it must track each byte written. I kind of dislike this tracking of accessed bytes. Also, device would need to resolve conflicts if any in some device specific way. It should be trivial to implement: you keep a scratch copy of the config space, and copy it to the master copy when they hit the latch. Implementation of this will show whether I've missed anything here, I think. What I refer to: what happens if driver does: - write offset 1 - write offset 3 - hit commit latch - nothing - nothing - effect of offset 1 and offset 3 writes OK so this means that you also need to track which bytes where written in order to know to skip byte 2. This is what I referred to. If instead we ask driver to specify offset/length explicitly device only needs to remember that. I was assuming the implementation would keep two complete copies of the config space: writes go to the scratch version, which gets copied to the master version upon latch write. But I do wonder if we should just skip this for now, since we don't have any immediate need. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Using PCI config space to indicate config location
On Tue, Oct 16, 2012 at 11:45:41PM +1030, Rusty Russell wrote: Michael S. Tsirkin m...@redhat.com writes: On Fri, Oct 12, 2012 at 08:21:50PM +1030, Rusty Russell wrote: Michael S. Tsirkin m...@redhat.com writes: On Fri, Oct 12, 2012 at 08:59:36AM +1030, Rusty Russell wrote: For writes, the standard seems to be a commit latch. We could abuse the generation count for this: the driver writes to it to commit config changes. I think this will work. There are a couple of things that bother me: This assumes read accesses have no side effects, and these are sometimes handy. Also the semantics for write aren't very clear to me. I guess device must buffer data until generation count write? This assumes the device has a buffer to store writes, and it must track each byte written. I kind of dislike this tracking of accessed bytes. Also, device would need to resolve conflicts if any in some device specific way. It should be trivial to implement: you keep a scratch copy of the config space, and copy it to the master copy when they hit the latch. Implementation of this will show whether I've missed anything here, I think. What I refer to: what happens if driver does: - write offset 1 - write offset 3 - hit commit latch - nothing - nothing - effect of offset 1 and offset 3 writes OK so this means that you also need to track which bytes where written in order to know to skip byte 2. This is what I referred to. If instead we ask driver to specify offset/length explicitly device only needs to remember that. I was assuming the implementation would keep two complete copies of the config space: writes go to the scratch version, which gets copied to the master version upon latch write. Yes but config space has some host modifiable registers too. So host needs to be careful to avoid overwriting these. If accesses have side effects that of course breaks too ... But I do wonder if we should just skip this for now, since we don't have any immediate need. Cheers, Rusty. MAC setting from guest needs this right now, no? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 10/16/2012 01:06 PM, Alexander Graf wrote: On 16.10.2012, at 13:01, Avi Kivity wrote: On 10/16/2012 12:59 PM, Alexander Graf wrote: On 16.10.2012, at 12:56, Avi Kivity wrote: On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this patch set is cleaning up: enable you to build ioeventfd support without in-kernel irqchip irqfd support. Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) in user space. That path works just fine with these patches applied. That's all true but it wasn't my point. I was asking whether you can enable irqfd without in-kernel irqchip. If your irq input is edge triggered then it's trivial. If it's level triggered then you can use the new resampling irqfd. I'm not sure I fully grasp what you're trying to say :). We have a single interrupt line on the core. So whenever any external interrupt gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it which line is active. Couldn't you attach that payload to the irqfd? On x86 an irqfd is associated with a gsi, and a gsi with extra information, including all that is needed to queue an MSI. So yes, we could create a direct fd channel between vhost and the user space MPIC, but it wouldn't buy us anything. The interrupt injection path would be as long as it is with the current mechanism. If there is a lot of prioritization and/or queuing logic, then yes. But what about MSI? Doesn't that have a direct path? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vfio-pci: Add KVM INTx acceleration
On Tue, 2012-10-16 at 08:39 +0200, Michael S. Tsirkin wrote: On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote: This makes use of the new level irqfd support enabling bypass of qemu userspace both on INTx injection and unmask. This significantly boosts the performance of devices making use of legacy interrupts. Signed-off-by: Alex Williamson alex.william...@redhat.com --- My INTx routing workaround below will probably raise some eyebrows, but I don't feel it's worth subjecting users to core dumps if they want to try vfio-pci on new platforms. INTx routing is part of some larger plan, but until that plan materializes we have to try to avoid the API unless we think there's a good chance it might be there. I'll accept the maintenance of updating a whitelist in the interim. Thanks, Alex hw/vfio_pci.c | 224 + 1 file changed, 224 insertions(+) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 639371e..777a5f8 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len); static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled); /* + * PCI code refuses to make it possible to probe whether the chipset + * supports pci_device_route_intx_to_irq() and booby traps the call + * to assert if doesn't. For us, this is just an optimization, so + * only enable it when we know it's present. Unfortunately PCIBus is + * private, so we can't just look at the function pointer. + */ +static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev) +{ +#ifdef CONFIG_KVM +BusState *bus = qdev_get_parent_bus(pdev-qdev); +DeviceState *dev; + +if (!kvm_irqchip_in_kernel() || +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) { + return false; +} Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE? Also for KVM_IRQFD_FLAG_RESAMPLE. I posted the patch for that separately yesterday. I'll only request a pull once that's in. + +for (; bus-parent; bus = qdev_get_parent_bus(dev)) { + +dev = bus-parent; + +if (!strncmp(i440FX-pcihost, object_get_typename(OBJECT(dev)), 14)) { +return true; +} +} + +error_report(vfio-pci: VM chipset does not support INTx routing, + using slow INTx mode\n); When does this code trigger? It seems irqchip implies piix ATM - is this just dead code? Unused, but not unnecessary. Another chipset is under development, which means very quickly irqchip will not imply piix. Likewise irqfd support is being added to other architectures, so I don't know how long the kvm specific tests will hold up. Testing for a specific chipset could of course be avoided if we were willing to support: bool pci_device_intx_route_supported(PCIDevice *pdev) or the NOROUTE option I posted previously. +#endif +return false; +} + +static PCIINTxRoute vfio_pci_device_route_intx_to_irq(PCIDevice *pdev, int pin) +{ +if (!vfio_pci_bus_has_intx_route(pdev)) { +return (PCIINTxRoute) { .mode = PCI_INTX_DISABLED, .irq = -1 }; +} + +return pci_device_route_intx_to_irq(pdev, pin); +} + +static bool vfio_pci_intx_route_changed(PCIINTxRoute *old, PCIINTxRoute *new) +{ +return old-mode != new-mode || old-irq != new-irq; +} + Didn't you add an API for this? It's on pci branch but I can drop it if not needed. I did and I'll switch to it when available, but I have no idea when that will be, so I've hedged my bets by re-implementing it here. 2 week+ turnover for a patch makes it difficult to coordinate dependent changes on short qemu release cycles. +/* * Common VFIO interrupt disable */ static void vfio_disable_irqindex(VFIODevice *vdev, int index) @@ -185,6 +232,21 @@ static void vfio_unmask_intx(VFIODevice *vdev) ioctl(vdev-fd, VFIO_DEVICE_SET_IRQS, irq_set); } +#ifdef CONFIG_KVM +static void vfio_mask_intx(VFIODevice *vdev) +{ +struct vfio_irq_set irq_set = { +.argsz = sizeof(irq_set), +.flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK, +.index = VFIO_PCI_INTX_IRQ_INDEX, +.start = 0, +.count = 1, +}; + +ioctl(vdev-fd, VFIO_DEVICE_SET_IRQS, irq_set); +} +#endif + /* * Disabling BAR mmaping can be slow, but toggling it around INTx can * also be a huge overhead. We try to get the best of both worlds by @@ -248,6 +310,161 @@ static void vfio_eoi(VFIODevice *vdev) vfio_unmask_intx(vdev); } +static void vfio_enable_intx_kvm(VFIODevice *vdev) +{ +#ifdef CONFIG_KVM +struct kvm_irqfd irqfd = { +.fd = event_notifier_get_fd(vdev-intx.interrupt), +.gsi = vdev-intx.route.irq, +.flags =
Re: Using PCI config space to indicate config location
Michael S. Tsirkin m...@redhat.com writes: On Tue, Oct 16, 2012 at 11:45:41PM +1030, Rusty Russell wrote: Michael S. Tsirkin m...@redhat.com writes: I was assuming the implementation would keep two complete copies of the config space: writes go to the scratch version, which gets copied to the master version upon latch write. Yes but config space has some host modifiable registers too. So host needs to be careful to avoid overwriting these. If accesses have side effects that of course breaks too ... Yes. But I do wonder if we should just skip this for now, since we don't have any immediate need. Cheers, Rusty. MAC setting from guest needs this right now, no? Ah, I missed that in my table: Driver Config Device changesDriver writes... after init? net YY NN block YY YY console YY NN rng NN NN balloon YY YY scsiYN YN 9p YN NN First line should be: net YY YN So we could add a new cvq command (eg. #define VIRTIO_NET_CTRL_MAC_SET 1) and a VIRTIO_NET_F_CVQ_MAC_SET feature. Or go ahead with the latching scheme (which doesn't really help other busses). Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vfio-pci: Add KVM INTx acceleration
On Tue, Oct 16, 2012 at 07:51:43AM -0600, Alex Williamson wrote: On Tue, 2012-10-16 at 08:39 +0200, Michael S. Tsirkin wrote: On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote: This makes use of the new level irqfd support enabling bypass of qemu userspace both on INTx injection and unmask. This significantly boosts the performance of devices making use of legacy interrupts. Signed-off-by: Alex Williamson alex.william...@redhat.com --- My INTx routing workaround below will probably raise some eyebrows, but I don't feel it's worth subjecting users to core dumps if they want to try vfio-pci on new platforms. INTx routing is part of some larger plan, but until that plan materializes we have to try to avoid the API unless we think there's a good chance it might be there. I'll accept the maintenance of updating a whitelist in the interim. Thanks, Alex hw/vfio_pci.c | 224 + 1 file changed, 224 insertions(+) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 639371e..777a5f8 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len); static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled); /* + * PCI code refuses to make it possible to probe whether the chipset + * supports pci_device_route_intx_to_irq() and booby traps the call + * to assert if doesn't. For us, this is just an optimization, so + * only enable it when we know it's present. Unfortunately PCIBus is + * private, so we can't just look at the function pointer. + */ +static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev) +{ +#ifdef CONFIG_KVM +BusState *bus = qdev_get_parent_bus(pdev-qdev); +DeviceState *dev; + +if (!kvm_irqchip_in_kernel() || +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) { + return false; +} Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE? Also for KVM_IRQFD_FLAG_RESAMPLE. I posted the patch for that separately yesterday. I'll only request a pull once that's in. OK missed that. In the future, might be a good idea to note dependencies in the patchset or repost them as part of patchset with appropriate tagging. + +for (; bus-parent; bus = qdev_get_parent_bus(dev)) { + +dev = bus-parent; + +if (!strncmp(i440FX-pcihost, object_get_typename(OBJECT(dev)), 14)) { +return true; +} +} + +error_report(vfio-pci: VM chipset does not support INTx routing, + using slow INTx mode\n); When does this code trigger? It seems irqchip implies piix ATM - is this just dead code? Unused, but not unnecessary. Another chipset is under development, which means very quickly irqchip will not imply piix. So this is for purposes of an out of tree stuff, let's keep these hacks out of tree too. My guess is q35 can just be added with pci_device_route_intx straight away. Likewise irqfd support is being added to other architectures, so I don't know how long the kvm specific tests will hold up. Testing for a specific chipset could of course be avoided if we were willing to support: bool pci_device_intx_route_supported(PCIDevice *pdev) or the NOROUTE option I posted previously. This is just moving the pain to users who will get bad performance and will have to debug it. Injecting through userspace is slow, new architectures should simply add irqfd straight away instead of adding work arounds in userspace. This is exactly why we have the assert in pci core. +#endif +return false; +} + +static PCIINTxRoute vfio_pci_device_route_intx_to_irq(PCIDevice *pdev, int pin) +{ +if (!vfio_pci_bus_has_intx_route(pdev)) { +return (PCIINTxRoute) { .mode = PCI_INTX_DISABLED, .irq = -1 }; +} + +return pci_device_route_intx_to_irq(pdev, pin); +} + +static bool vfio_pci_intx_route_changed(PCIINTxRoute *old, PCIINTxRoute *new) +{ +return old-mode != new-mode || old-irq != new-irq; +} + Didn't you add an API for this? It's on pci branch but I can drop it if not needed. I did and I'll switch to it when available, but I have no idea when that will be, so I've hedged my bets by re-implementing it here. 2 week+ turnover for a patch makes it difficult to coordinate dependent changes on short qemu release cycles. It's available on pci branch, please base on that instead of master. Yes I merge at about 2 week intervals but the point is using subtrees. You do not need to block waiting for stuff to land in master. +/* * Common VFIO interrupt disable */ static void vfio_disable_irqindex(VFIODevice *vdev, int index) @@ -185,6 +232,21 @@ static void
Re: [PATCH] vfio-pci: Add KVM INTx acceleration
On Tue, 2012-10-16 at 16:14 +0200, Michael S. Tsirkin wrote: On Tue, Oct 16, 2012 at 07:51:43AM -0600, Alex Williamson wrote: On Tue, 2012-10-16 at 08:39 +0200, Michael S. Tsirkin wrote: On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote: This makes use of the new level irqfd support enabling bypass of qemu userspace both on INTx injection and unmask. This significantly boosts the performance of devices making use of legacy interrupts. Signed-off-by: Alex Williamson alex.william...@redhat.com --- My INTx routing workaround below will probably raise some eyebrows, but I don't feel it's worth subjecting users to core dumps if they want to try vfio-pci on new platforms. INTx routing is part of some larger plan, but until that plan materializes we have to try to avoid the API unless we think there's a good chance it might be there. I'll accept the maintenance of updating a whitelist in the interim. Thanks, Alex hw/vfio_pci.c | 224 + 1 file changed, 224 insertions(+) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 639371e..777a5f8 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len); static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled); /* + * PCI code refuses to make it possible to probe whether the chipset + * supports pci_device_route_intx_to_irq() and booby traps the call + * to assert if doesn't. For us, this is just an optimization, so + * only enable it when we know it's present. Unfortunately PCIBus is + * private, so we can't just look at the function pointer. + */ +static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev) +{ +#ifdef CONFIG_KVM +BusState *bus = qdev_get_parent_bus(pdev-qdev); +DeviceState *dev; + +if (!kvm_irqchip_in_kernel() || +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) { + return false; +} Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE? Also for KVM_IRQFD_FLAG_RESAMPLE. I posted the patch for that separately yesterday. I'll only request a pull once that's in. OK missed that. In the future, might be a good idea to note dependencies in the patchset or repost them as part of patchset with appropriate tagging. + +for (; bus-parent; bus = qdev_get_parent_bus(dev)) { + +dev = bus-parent; + +if (!strncmp(i440FX-pcihost, object_get_typename(OBJECT(dev)), 14)) { +return true; +} +} + +error_report(vfio-pci: VM chipset does not support INTx routing, + using slow INTx mode\n); When does this code trigger? It seems irqchip implies piix ATM - is this just dead code? Unused, but not unnecessary. Another chipset is under development, which means very quickly irqchip will not imply piix. So this is for purposes of an out of tree stuff, let's keep these hacks out of tree too. My guess is q35 can just be added with pci_device_route_intx straight away. Likewise irqfd support is being added to other architectures, so I don't know how long the kvm specific tests will hold up. Testing for a specific chipset could of course be avoided if we were willing to support: bool pci_device_intx_route_supported(PCIDevice *pdev) or the NOROUTE option I posted previously. This is just moving the pain to users who will get bad performance and will have to debug it. Injecting through userspace is slow, new architectures should simply add irqfd straight away instead of adding work arounds in userspace. This is exactly why we have the assert in pci core. Let's compare user experience: As coded here: - Error message, only runs slower if the driver actually uses INTx. Result: file bug, continue using vfio-pci, maybe do something useful, maybe find new bugs to file. Blindly calling PCI code w/ assert: - Assert. Result: file bug, full stop. Maybe I do too much kernel programming, but I don't take asserts lightly and prefer they be saved for something is really broken and I can't safely continue. This is not such a case. +#endif +return false; +} + +static PCIINTxRoute vfio_pci_device_route_intx_to_irq(PCIDevice *pdev, int pin) +{ +if (!vfio_pci_bus_has_intx_route(pdev)) { +return (PCIINTxRoute) { .mode = PCI_INTX_DISABLED, .irq = -1 }; +} + +return pci_device_route_intx_to_irq(pdev, pin); +} + +static bool vfio_pci_intx_route_changed(PCIINTxRoute *old, PCIINTxRoute *new) +{ +return old-mode != new-mode || old-irq != new-irq; +} + Didn't you add an API for this? It's
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 10/16/2012 03:47 PM, Avi Kivity wrote: On 10/16/2012 01:06 PM, Alexander Graf wrote: On 16.10.2012, at 13:01, Avi Kivity wrote: On 10/16/2012 12:59 PM, Alexander Graf wrote: On 16.10.2012, at 12:56, Avi Kivity wrote: On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this patch set is cleaning up: enable you to build ioeventfd support without in-kernel irqchip irqfd support. Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) in user space. That path works just fine with these patches applied. That's all true but it wasn't my point. I was asking whether you can enable irqfd without in-kernel irqchip. If your irq input is edge triggered then it's trivial. If it's level triggered then you can use the new resampling irqfd. I'm not sure I fully grasp what you're trying to say :). We have a single interrupt line on the core. So whenever any external interrupt gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it which line is active. Couldn't you attach that payload to the irqfd? On x86 an irqfd is associated with a gsi, and a gsi with extra information, including all that is needed to queue an MSI. So yes, we could create a direct fd channel between vhost and the user space MPIC, but it wouldn't buy us anything. The interrupt injection path would be as long as it is with the current mechanism. If there is a lot of prioritization and/or queuing logic, then yes. But what about MSI? Doesn't that have a direct path? Nope. Well, yes, in a certain special case where the MPIC pushes the interrupt vector on interrupt delivery into a special register. But not for the normal case. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call minutes for 2012-10-16
2012-10-16 -- - cpu as dev making qdev available to all parts of qemu is the best solution (aliguory) how does linux-user people think about it? Eduardo will follow Objections: * conceptually it makes no sense to have devices on *-user * desire of linux-user to maintain the code relativelly small * really they want less dependencies Who is going to merge it? either anthony or andreas through cpu tree - boot and -bootindex (agraf) should this overwrote CMOS on PC (anthony) Always writing nvram makes other things work. -boot c : means boot with legacy device with uefi there is a boot variable -boot c don't make sense out of x86 * make the list of bootable devices visible to the user and store in nvram? At this point I got lost, and Anthony/Alex needs to fill the details of the proposals. - Command line compatibility patches (marcelo) merge then directly or through qdev -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM ept flush
On 10/16/2012 01:57 PM, Rohan Sharma wrote: Is there a way to flush ept entries in qemu-kvm. No. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vfio-pci: Add KVM INTx acceleration
On Tue, Oct 16, 2012 at 08:48:04AM -0600, Alex Williamson wrote: On Tue, 2012-10-16 at 16:14 +0200, Michael S. Tsirkin wrote: On Tue, Oct 16, 2012 at 07:51:43AM -0600, Alex Williamson wrote: On Tue, 2012-10-16 at 08:39 +0200, Michael S. Tsirkin wrote: On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote: This makes use of the new level irqfd support enabling bypass of qemu userspace both on INTx injection and unmask. This significantly boosts the performance of devices making use of legacy interrupts. Signed-off-by: Alex Williamson alex.william...@redhat.com --- My INTx routing workaround below will probably raise some eyebrows, but I don't feel it's worth subjecting users to core dumps if they want to try vfio-pci on new platforms. INTx routing is part of some larger plan, but until that plan materializes we have to try to avoid the API unless we think there's a good chance it might be there. I'll accept the maintenance of updating a whitelist in the interim. Thanks, Alex hw/vfio_pci.c | 224 + 1 file changed, 224 insertions(+) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 639371e..777a5f8 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len); static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled); /* + * PCI code refuses to make it possible to probe whether the chipset + * supports pci_device_route_intx_to_irq() and booby traps the call + * to assert if doesn't. For us, this is just an optimization, so + * only enable it when we know it's present. Unfortunately PCIBus is + * private, so we can't just look at the function pointer. + */ +static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev) +{ +#ifdef CONFIG_KVM +BusState *bus = qdev_get_parent_bus(pdev-qdev); +DeviceState *dev; + +if (!kvm_irqchip_in_kernel() || +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) { + return false; +} Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE? Also for KVM_IRQFD_FLAG_RESAMPLE. I posted the patch for that separately yesterday. I'll only request a pull once that's in. OK missed that. In the future, might be a good idea to note dependencies in the patchset or repost them as part of patchset with appropriate tagging. + +for (; bus-parent; bus = qdev_get_parent_bus(dev)) { + +dev = bus-parent; + +if (!strncmp(i440FX-pcihost, object_get_typename(OBJECT(dev)), 14)) { +return true; +} +} + +error_report(vfio-pci: VM chipset does not support INTx routing, + using slow INTx mode\n); When does this code trigger? It seems irqchip implies piix ATM - is this just dead code? Unused, but not unnecessary. Another chipset is under development, which means very quickly irqchip will not imply piix. So this is for purposes of an out of tree stuff, let's keep these hacks out of tree too. My guess is q35 can just be added with pci_device_route_intx straight away. Likewise irqfd support is being added to other architectures, so I don't know how long the kvm specific tests will hold up. Testing for a specific chipset could of course be avoided if we were willing to support: bool pci_device_intx_route_supported(PCIDevice *pdev) or the NOROUTE option I posted previously. This is just moving the pain to users who will get bad performance and will have to debug it. Injecting through userspace is slow, new architectures should simply add irqfd straight away instead of adding work arounds in userspace. This is exactly why we have the assert in pci core. Let's compare user experience: As coded here: - Error message, only runs slower if the driver actually uses INTx. Result: file bug, continue using vfio-pci, maybe do something useful, maybe find new bugs to file. Blindly calling PCI code w/ assert: - Assert. Result: file bug, full stop. Maybe I do too much kernel programming, but I don't take asserts lightly and prefer they be saved for something is really broken and I can't safely continue. This is not such a case. There's no chance we ship e.g. q35 by mistake without this API: since there is no way this specific assert can be missed in even basic testing: So I see it differently: As coded here: chipset authors get lazy and do not implement API. bad performance for all users. With assert: chipset authors implement necessary API. good performance for all
Re: [PATCH] vfio-pci: Add KVM INTx acceleration
On Tue, 2012-10-16 at 17:08 +0200, Michael S. Tsirkin wrote: On Tue, Oct 16, 2012 at 08:48:04AM -0600, Alex Williamson wrote: On Tue, 2012-10-16 at 16:14 +0200, Michael S. Tsirkin wrote: On Tue, Oct 16, 2012 at 07:51:43AM -0600, Alex Williamson wrote: On Tue, 2012-10-16 at 08:39 +0200, Michael S. Tsirkin wrote: On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote: This makes use of the new level irqfd support enabling bypass of qemu userspace both on INTx injection and unmask. This significantly boosts the performance of devices making use of legacy interrupts. Signed-off-by: Alex Williamson alex.william...@redhat.com --- My INTx routing workaround below will probably raise some eyebrows, but I don't feel it's worth subjecting users to core dumps if they want to try vfio-pci on new platforms. INTx routing is part of some larger plan, but until that plan materializes we have to try to avoid the API unless we think there's a good chance it might be there. I'll accept the maintenance of updating a whitelist in the interim. Thanks, Alex hw/vfio_pci.c | 224 + 1 file changed, 224 insertions(+) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 639371e..777a5f8 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len); static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled); /* + * PCI code refuses to make it possible to probe whether the chipset + * supports pci_device_route_intx_to_irq() and booby traps the call + * to assert if doesn't. For us, this is just an optimization, so + * only enable it when we know it's present. Unfortunately PCIBus is + * private, so we can't just look at the function pointer. + */ +static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev) +{ +#ifdef CONFIG_KVM +BusState *bus = qdev_get_parent_bus(pdev-qdev); +DeviceState *dev; + +if (!kvm_irqchip_in_kernel() || +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) { + return false; +} Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE? Also for KVM_IRQFD_FLAG_RESAMPLE. I posted the patch for that separately yesterday. I'll only request a pull once that's in. OK missed that. In the future, might be a good idea to note dependencies in the patchset or repost them as part of patchset with appropriate tagging. + +for (; bus-parent; bus = qdev_get_parent_bus(dev)) { + +dev = bus-parent; + +if (!strncmp(i440FX-pcihost, object_get_typename(OBJECT(dev)), 14)) { +return true; +} +} + +error_report(vfio-pci: VM chipset does not support INTx routing, + using slow INTx mode\n); When does this code trigger? It seems irqchip implies piix ATM - is this just dead code? Unused, but not unnecessary. Another chipset is under development, which means very quickly irqchip will not imply piix. So this is for purposes of an out of tree stuff, let's keep these hacks out of tree too. My guess is q35 can just be added with pci_device_route_intx straight away. Likewise irqfd support is being added to other architectures, so I don't know how long the kvm specific tests will hold up. Testing for a specific chipset could of course be avoided if we were willing to support: bool pci_device_intx_route_supported(PCIDevice *pdev) or the NOROUTE option I posted previously. This is just moving the pain to users who will get bad performance and will have to debug it. Injecting through userspace is slow, new architectures should simply add irqfd straight away instead of adding work arounds in userspace. This is exactly why we have the assert in pci core. Let's compare user experience: As coded here: - Error message, only runs slower if the driver actually uses INTx. Result: file bug, continue using vfio-pci, maybe do something useful, maybe find new bugs to file. Blindly calling PCI code w/ assert: - Assert. Result: file bug, full stop. Maybe I do too much kernel programming, but I don't take asserts lightly and prefer they be saved for something is really broken and I can't safely continue. This is not such a case. There's no chance we ship e.g. q35 by mistake without this API: since there is no way this specific assert can be missed in even basic testing: So I see it differently:
Re: [PATCH] vfio-pci: Add KVM INTx acceleration
On Tue, Oct 16, 2012 at 09:13:15AM -0600, Alex Williamson wrote: There's no chance we ship e.g. q35 by mistake without this API: since there is no way this specific assert can be missed in even basic testing: So I see it differently: As coded here: chipset authors get lazy and do not implement API. bad performance for all users. With assert: chipset authors implement necessary API. good performance for all users. I prefer a carrot, not a whip. Thanks, Alex It's not just that. Problem is performance testing/fixing is hard. Catching and fixing asserts is easy. So working around buggy qemu code really backfires as it reverses the motivation for writing well performing code. History proves me right: for each API change where we implemented a fallback old code stayed around for years. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I/O errors in guest OS after repeated migration
Hi, I'm experiencing I/O errors in a guest machine after migrating it from one host to another, and then back to the original host. After doing this, I find the following in the dmesg output of the guest machine: [ 345.390543] end_request: I/O error, dev vda, sector 273871 [ 345.391125] end_request: I/O error, dev vda, sector 273871 [ 345.391705] end_request: I/O error, dev vda, sector 273871 [ 345.394796] end_request: I/O error, dev vda, sector 1745983 [ 345.396005] end_request: I/O error, dev vda, sector 1745983 [ 346.083160] end_request: I/O error, dev vdb, sector 54528008 [ 346.083179] Buffer I/O error on device dm-0, logical block 6815745 [ 346.083181] lost page write due to I/O error on dm-0 [ 346.083193] end_request: I/O error, dev vdb, sector 54528264 [ 346.083195] Buffer I/O error on device dm-0, logical block 6815777 [ 346.083197] lost page write due to I/O error on dm-0 [ 346.083201] end_request: I/O error, dev vdb, sector 2056 [ 346.083204] Buffer I/O error on device dm-0, logical block 1 [ 346.083206] lost page write due to I/O error on dm-0 [ 346.083209] Buffer I/O error on device dm-0, logical block 2 [ 346.083211] lost page write due to I/O error on dm-0 [ 346.083215] end_request: I/O error, dev vdb, sector 10248 [ 346.083217] Buffer I/O error on device dm-0, logical block 1025 [ 346.083219] lost page write due to I/O error on dm-0 [ 346.091499] end_request: I/O error, dev vdb, sector 76240 [ 346.091506] Buffer I/O error on device dm-0, logical block 9274 [ 346.091508] lost page write due to I/O error on dm-0 [ 346.091572] JBD2: Detected IO errors while flushing file data on dm-0-8 [ 346.091915] end_request: I/O error, dev vdb, sector 38017360 [ 346.091956] Aborting journal on device dm-0-8. [ 346.092557] end_request: I/O error, dev vdb, sector 38012928 [ 346.092566] Buffer I/O error on device dm-0, logical block 4751360 [ 346.092569] lost page write due to I/O error on dm-0 [ 346.092624] JBD2: I/O error detected when updating journal superblock for dm-0-8. [ 346.100940] end_request: I/O error, dev vdb, sector 2048 [ 346.100948] Buffer I/O error on device dm-0, logical block 0 [ 346.100952] lost page write due to I/O error on dm-0 [ 346.101027] EXT4-fs error (device dm-0): ext4_journal_start_sb:327: Detected aborted journal [ 346.101038] EXT4-fs (dm-0): Remounting filesystem read-only [ 346.101051] EXT4-fs (dm-0): previous I/O error to superblock detected [ 346.101836] end_request: I/O error, dev vdb, sector 2048 [ 346.101845] Buffer I/O error on device dm-0, logical block 0 [ 346.101849] lost page write due to I/O error on dm-0 [ 373.006680] end_request: I/O error, dev vda, sector 624319 [ 373.007543] end_request: I/O error, dev vda, sector 624319 [ 373.008327] end_request: I/O error, dev vda, sector 624319 [ 374.886674] end_request: I/O error, dev vda, sector 624319 [ 374.887563] end_request: I/O error, dev vda, sector 624319 The hosts are both running Fedora 17 with qemu-kvm-1.0.1-1.fc17.x86_64. The guest machine has been started and migrated using libvirt (0.9.11). Kernel version is 3.5.6-1.fc17.x86_64 on the first host and 3.5.5-2.fc17.x86_64 on the second. The guest machine is on Kernel 3.3.8 and uses ext4 on its disks. The commandline, as generated by libvirtd, looks like this: LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -S -M pc-0.15 -enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name migratetest2 -uuid ddbf11e9-387e-902b-4849-8c3067dc42a2 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/migratetest2.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-reboot -no- shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/data/migratetest2_system,if=none,id=drive-virtio- disk0,format=qcow2,cache=none -device virtio-blk- pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio- disk0,bootindex=1 -drive file=/data/migratetest2_data-1,if=none,id=drive- virtio-disk1,format=qcow2,cache=none -device virtio-blk- pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 - netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net- pci,netdev=hostnet0,id=net0,mac=02:00:00:00:00:0c,bus=pci.0,addr=0x3 -vnc 127.0.0.1:2,password -k de -vga cirrus -incoming tcp:0.0.0.0:49153 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 The second host has an ext4 filesystem mounted under /data, which it exports using NFSv3 over TCP to the first host, which also mounts it under /data. So far, the problem seems reproducible: When I start another guest machine and do the same thing with it, the same problem happens. Can anybody help me with this problem? Guido -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vfio-pci: Add KVM INTx acceleration
On Tue, 2012-10-16 at 17:23 +0200, Michael S. Tsirkin wrote: On Tue, Oct 16, 2012 at 09:13:15AM -0600, Alex Williamson wrote: There's no chance we ship e.g. q35 by mistake without this API: since there is no way this specific assert can be missed in even basic testing: So I see it differently: As coded here: chipset authors get lazy and do not implement API. bad performance for all users. With assert: chipset authors implement necessary API. good performance for all users. I prefer a carrot, not a whip. Thanks, Alex It's not just that. Problem is performance testing/fixing is hard. Getting an error_report from the driver saying it's using a slow path and why makes that significantly easier. Catching and fixing asserts is easy. Easy for who? The user trying to test a feature? Probably not. Me, who may not have access to the chipset documentation or understand the platform? Maybe, maybe not. So working around buggy qemu code really backfires as it reverses the motivation for writing well performing code. History proves me right: for each API change where we implemented a fallback old code stayed around for years. Does that necessarily mean it was wrong? How many of those API changes added new features that may have been abandoned if the developer was required to make sweeping changes to get their code accepted? If not abandoned, how much delayed? How many land mines might we have in the code for changes that were done incorrectly or missed? I don't understand why adding robustness to the API is such a contentious point, but it's your prerogative, just as it's mine to avoid using that API arbitrarily. Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vfio-pci: Add KVM INTx acceleration
On Tue, Oct 16, 2012 at 10:49:38AM -0600, Alex Williamson wrote: On Tue, 2012-10-16 at 17:23 +0200, Michael S. Tsirkin wrote: On Tue, Oct 16, 2012 at 09:13:15AM -0600, Alex Williamson wrote: There's no chance we ship e.g. q35 by mistake without this API: since there is no way this specific assert can be missed in even basic testing: So I see it differently: As coded here: chipset authors get lazy and do not implement API. bad performance for all users. With assert: chipset authors implement necessary API. good performance for all users. I prefer a carrot, not a whip. Thanks, Alex It's not just that. Problem is performance testing/fixing is hard. Getting an error_report from the driver saying it's using a slow path and why makes that significantly easier. Catching and fixing asserts is easy. Easy for who? The user trying to test a feature? Probably not. Me, who may not have access to the chipset documentation or understand the platform? Maybe, maybe not. So working around buggy qemu code really backfires as it reverses the motivation for writing well performing code. History proves me right: for each API change where we implemented a fallback old code stayed around for years. Does that necessarily mean it was wrong? How many of those API changes added new features that may have been abandoned if the developer was required to make sweeping changes to get their code accepted? If not abandoned, how much delayed? How many land mines might we have in the code for changes that were done incorrectly or missed? I don't understand why adding robustness to the API is such a contentious point, but it's your prerogative, just as it's mine to avoid using that API arbitrarily. Thanks, Alex Yea. All I say is, I intend to fix things so you don't need to probe for this API. I think it does not make sense to add a temporary API as a stopgap for this since it will solve no actual problem. If e.g. Jason finds it hard to add this to q35, we could add a stopgap solution for vfio. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: I/O errors in guest OS after repeated migration
On Tuesday, October 16, 2012 11:33:44 AM Guido Winkelmann wrote: Hi, I'm experiencing I/O errors in a guest machine after migrating it from one host to another, and then back to the original host. After doing this, I find the following in the dmesg output of the guest machine: [ 345.390543] end_request: I/O error, dev vda, sector 273871 [ 345.391125] end_request: I/O error, dev vda, sector 273871 [ 345.391705] end_request: I/O error, dev vda, sector 273871 [ 345.394796] end_request: I/O error, dev vda, sector 1745983 [ 345.396005] end_request: I/O error, dev vda, sector 1745983 [ 346.083160] end_request: I/O error, dev vdb, sector 54528008 [ 346.083179] Buffer I/O error on device dm-0, logical block 6815745 [ 346.083181] lost page write due to I/O error on dm-0 [ 346.083193] end_request: I/O error, dev vdb, sector 54528264 [ 346.083195] Buffer I/O error on device dm-0, logical block 6815777 [ 346.083197] lost page write due to I/O error on dm-0 [ 346.083201] end_request: I/O error, dev vdb, sector 2056 [ 346.083204] Buffer I/O error on device dm-0, logical block 1 [ 346.083206] lost page write due to I/O error on dm-0 [ 346.083209] Buffer I/O error on device dm-0, logical block 2 [ 346.083211] lost page write due to I/O error on dm-0 [ 346.083215] end_request: I/O error, dev vdb, sector 10248 [ 346.083217] Buffer I/O error on device dm-0, logical block 1025 [ 346.083219] lost page write due to I/O error on dm-0 [ 346.091499] end_request: I/O error, dev vdb, sector 76240 [ 346.091506] Buffer I/O error on device dm-0, logical block 9274 [ 346.091508] lost page write due to I/O error on dm-0 [ 346.091572] JBD2: Detected IO errors while flushing file data on dm-0-8 [ 346.091915] end_request: I/O error, dev vdb, sector 38017360 [ 346.091956] Aborting journal on device dm-0-8. [ 346.092557] end_request: I/O error, dev vdb, sector 38012928 [ 346.092566] Buffer I/O error on device dm-0, logical block 4751360 [ 346.092569] lost page write due to I/O error on dm-0 [ 346.092624] JBD2: I/O error detected when updating journal superblock for dm-0-8. [ 346.100940] end_request: I/O error, dev vdb, sector 2048 [ 346.100948] Buffer I/O error on device dm-0, logical block 0 [ 346.100952] lost page write due to I/O error on dm-0 [ 346.101027] EXT4-fs error (device dm-0): ext4_journal_start_sb:327: Detected aborted journal [ 346.101038] EXT4-fs (dm-0): Remounting filesystem read-only [ 346.101051] EXT4-fs (dm-0): previous I/O error to superblock detected [ 346.101836] end_request: I/O error, dev vdb, sector 2048 [ 346.101845] Buffer I/O error on device dm-0, logical block 0 [ 346.101849] lost page write due to I/O error on dm-0 [ 373.006680] end_request: I/O error, dev vda, sector 624319 [ 373.007543] end_request: I/O error, dev vda, sector 624319 [ 373.008327] end_request: I/O error, dev vda, sector 624319 [ 374.886674] end_request: I/O error, dev vda, sector 624319 [ 374.887563] end_request: I/O error, dev vda, sector 624319 The hosts are both running Fedora 17 with qemu-kvm-1.0.1-1.fc17.x86_64. The guest machine has been started and migrated using libvirt (0.9.11). Kernel version is 3.5.6-1.fc17.x86_64 on the first host and 3.5.5-2.fc17.x86_64 on the second. The guest machine is on Kernel 3.3.8 and uses ext4 on its disks. The commandline, as generated by libvirtd, looks like this: LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -S -M pc-0.15 -enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name migratetest2 -uuid ddbf11e9-387e-902b-4849-8c3067dc42a2 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/migratetest2.monitor,serve r,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-reboot -no- shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/data/migratetest2_system,if=none,id=drive-virtio- disk0,format=qcow2,cache=none -device virtio-blk- pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio- disk0,bootindex=1 -drive file=/data/migratetest2_data-1,if=none,id=drive- virtio-disk1,format=qcow2,cache=none -device virtio-blk- pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 - netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net- pci,netdev=hostnet0,id=net0,mac=02:00:00:00:00:0c,bus=pci.0,addr=0x3 -vnc 127.0.0.1:2,password -k de -vga cirrus -incoming tcp:0.0.0.0:49153 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 I see qcow2 in there. Live migration of qcow2 was a new feature in 1.0. Have you tried other formats or different qemu/kvm versions? The second host has an ext4 filesystem mounted under /data, which it exports using NFSv3 over TCP to the first host, which also mounts it under /data. So far, the problem seems reproducible: When I start another guest machine and do the same thing with
[patch 08/15] x86: pvclock: generic pvclock vsyscall initialization
Originally from Jeremy Fitzhardinge. Introduce generic, non hypervisor specific, pvclock initialization routines. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c +++ vsyscall/arch/x86/kernel/pvclock.c @@ -17,6 +17,10 @@ #include linux/kernel.h #include linux/percpu.h +#include linux/notifier.h +#include linux/sched.h +#include linux/gfp.h +#include linux/bootmem.h #include asm/pvclock.h static u8 valid_flags __read_mostly = 0; @@ -122,3 +126,70 @@ void pvclock_read_wallclock(struct pvclo set_normalized_timespec(ts, now.tv_sec, now.tv_nsec); } + +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL + +static aligned_pvti_t *pvclock_vdso_info; + +static struct pvclock_vsyscall_time_info *pvclock_get_vsyscall_user_time_info(int cpu) +{ + if (pvclock_vdso_info == NULL) { + BUG(); + return NULL; + } + + return pvclock_vdso_info[cpu].info; +} + +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu) +{ + return pvclock_get_vsyscall_user_time_info(cpu)-pvti; +} + +int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, void *v) +{ + struct task_migration_notifier *mn = v; + struct pvclock_vsyscall_time_info *pvti; + + pvti = pvclock_get_vsyscall_user_time_info(mn-from_cpu); + + if (pvti == NULL) + return NOTIFY_DONE; + + pvti-migrate_count++; + + return NOTIFY_DONE; +} + +static struct notifier_block pvclock_migrate = { + .notifier_call = pvclock_task_migrate, +}; + +/* + * Initialize the generic pvclock vsyscall state. This will allocate + * a/some page(s) for the per-vcpu pvclock information, set up a + * fixmap mapping for the page(s) + */ +int __init pvclock_init_vsyscall(void) +{ + int idx; + unsigned int size = PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE; + + pvclock_vdso_info = __alloc_bootmem(size, PAGE_SIZE, 0); + if (!pvclock_vdso_info) + return -ENOMEM; + + memset(pvclock_vdso_info, 0, size); + + for (idx = 0; idx = (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) { + __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx, +__pa_symbol(pvclock_vdso_info) + (idx*PAGE_SIZE), +PAGE_KERNEL_VVAR); + } + + register_task_migration_notifier(pvclock_migrate); + + return 0; +} + +#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */ Index: vsyscall/arch/x86/include/asm/fixmap.h === --- vsyscall.orig/arch/x86/include/asm/fixmap.h +++ vsyscall/arch/x86/include/asm/fixmap.h @@ -19,6 +19,7 @@ #include asm/acpi.h #include asm/apicdef.h #include asm/page.h +#include asm/pvclock.h #ifdef CONFIG_X86_32 #include linux/threads.h #include asm/kmap_types.h @@ -81,6 +82,10 @@ enum fixed_addresses { VVAR_PAGE, VSYSCALL_HPET, #endif +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL + PVCLOCK_FIXMAP_BEGIN, + PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1, +#endif FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT Index: vsyscall/arch/x86/include/asm/pvclock.h === --- vsyscall.orig/arch/x86/include/asm/pvclock.h +++ vsyscall/arch/x86/include/asm/pvclock.h @@ -13,6 +13,8 @@ void pvclock_read_wallclock(struct pvclo struct pvclock_vcpu_time_info *vcpu, struct timespec *ts); void pvclock_resume(void); +int __init pvclock_init_vsyscall(void); +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu); /* * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction, @@ -85,4 +87,24 @@ unsigned __pvclock_read_cycles(const str return version; } +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL + +struct pvclock_vsyscall_time_info { + struct pvclock_vcpu_time_info pvti; + u32 migrate_count; +}; + +typedef union { + struct pvclock_vsyscall_time_info info; + char pad[SMP_CACHE_BYTES]; +} aligned_pvti_t cacheline_aligned; + +#define PVTI_SIZE sizeof(aligned_pvti_t) +#if NR_CPUS == 1 +#define PVCLOCK_VSYSCALL_NR_PAGES 1 +#else +#define PVCLOCK_VSYSCALL_NR_PAGES ((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1 +#endif +#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */ + #endif /* _ASM_X86_PVCLOCK_H */ Index: vsyscall/arch/x86/include/asm/clocksource.h === --- vsyscall.orig/arch/x86/include/asm/clocksource.h +++ vsyscall/arch/x86/include/asm/clocksource.h @@ -8,6 +8,7 @@ #define VCLOCK_NONE 0 /* No vDSO clock available. */ #define VCLOCK_TSC 1 /* vDSO should use vread_tsc. */ #define VCLOCK_HPET 2 /* vDSO should use vread_hpet. */ +#define VCLOCK_PVCLOCK
[patch 06/15] x86: pvclock: introduce helper to read flags
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c +++ vsyscall/arch/x86/kernel/pvclock.c @@ -45,6 +45,19 @@ void pvclock_resume(void) atomic64_set(last_value, 0); } +u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src) +{ + unsigned version; + cycle_t ret; + u8 flags; + + do { + version = __pvclock_read_cycles(src, ret, flags); + } while ((src-version 1) || version != src-version); + + return flags valid_flags; +} + cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) { unsigned version; Index: vsyscall/arch/x86/include/asm/pvclock.h === --- vsyscall.orig/arch/x86/include/asm/pvclock.h +++ vsyscall/arch/x86/include/asm/pvclock.h @@ -6,6 +6,7 @@ /* some helper functions for xen and kvm pv clock sources */ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src); +u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src); void pvclock_set_flags(u8 flags); unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src); void pvclock_read_wallclock(struct pvclock_wall_clock *wall, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 13/15] KVM: x86: pass host_tsc to read_l1_tsc
Allow the caller to pass host tsc value to kvm_x86_ops-read_l1_tsc(). Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/include/asm/kvm_host.h === --- vsyscall.orig/arch/x86/include/asm/kvm_host.h +++ vsyscall/arch/x86/include/asm/kvm_host.h @@ -703,7 +703,7 @@ struct kvm_x86_ops { void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset); u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc); - u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu); + u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc); void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2); Index: vsyscall/arch/x86/kvm/lapic.c === --- vsyscall.orig/arch/x86/kvm/lapic.c +++ vsyscall/arch/x86/kvm/lapic.c @@ -1011,7 +1011,7 @@ static void start_apic_timer(struct kvm_ local_irq_save(flags); now = apic-lapic_timer.timer.base-get_time(); - guest_tsc = kvm_x86_ops-read_l1_tsc(vcpu); + guest_tsc = kvm_x86_ops-read_l1_tsc(vcpu, native_read_tsc()); if (likely(tscdeadline guest_tsc)) { ns = (tscdeadline - guest_tsc) * 100ULL; do_div(ns, this_tsc_khz); Index: vsyscall/arch/x86/kvm/svm.c === --- vsyscall.orig/arch/x86/kvm/svm.c +++ vsyscall/arch/x86/kvm/svm.c @@ -3008,11 +3008,11 @@ static int cr8_write_interception(struct return 0; } -u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu) +u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc) { struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu)); return vmcb-control.tsc_offset + - svm_scale_tsc(vcpu, native_read_tsc()); + svm_scale_tsc(vcpu, host_tsc); } static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data) Index: vsyscall/arch/x86/kvm/vmx.c === --- vsyscall.orig/arch/x86/kvm/vmx.c +++ vsyscall/arch/x86/kvm/vmx.c @@ -1839,11 +1839,10 @@ static u64 guest_read_tsc(void) * Like guest_read_tsc, but always returns L1's notion of the timestamp * counter, even if a nested guest (L2) is currently running. */ -u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu) +u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc) { - u64 host_tsc, tsc_offset; + u64 tsc_offset; - rdtscll(host_tsc); tsc_offset = is_guest_mode(vcpu) ? to_vmx(vcpu)-nested.vmcs01_tsc_offset : vmcs_read64(TSC_OFFSET); Index: vsyscall/arch/x86/kvm/x86.c === --- vsyscall.orig/arch/x86/kvm/x86.c +++ vsyscall/arch/x86/kvm/x86.c @@ -1175,7 +1175,7 @@ static int kvm_guest_time_update(struct /* Keep irq disabled to prevent changes to the clock */ local_irq_save(flags); - tsc_timestamp = kvm_x86_ops-read_l1_tsc(v); + tsc_timestamp = kvm_x86_ops-read_l1_tsc(v, native_read_tsc()); kernel_ns = get_kernel_ns(); this_tsc_khz = __get_cpu_var(cpu_tsc_khz); if (unlikely(this_tsc_khz == 0)) { @@ -5429,7 +5429,8 @@ static int vcpu_enter_guest(struct kvm_v if (hw_breakpoint_active()) hw_breakpoint_restore(); - vcpu-arch.last_guest_tsc = kvm_x86_ops-read_l1_tsc(vcpu); + vcpu-arch.last_guest_tsc = kvm_x86_ops-read_l1_tsc(vcpu, + native_read_tsc()); vcpu-mode = OUTSIDE_GUEST_MODE; smp_wmb(); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 15/15] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
KVM added a global variable to guarantee monotonicity in the guest. It is necessary because the time between 1. ktime_get_ts(timespec); 2. rdtscll(tsc); Is variable. That is, given a host with stable TSC, suppose that two VCPUs read the same time via ktime_get_ts() above. The time required to execute 2. is not the same on those two instances executing in different VCPUS (cache misses, interrupts...). If the TSC value that is used by the host to interpolate when calculating the monotonic time is the same value used to calculate the tsc_timestamp value stored in the pvclock data structure, then this problem disappears. Monotonicity is then guaranteed by the synchronicity of the host TSCs. Set TSC stable pvclock flag in that case, allowing the guest to read clock from userspace. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kvm/x86.c === --- vsyscall.orig/arch/x86/kvm/x86.c +++ vsyscall/arch/x86/kvm/x86.c @@ -46,6 +46,7 @@ #include linux/uaccess.h #include linux/hash.h #include linux/pci.h +#include linux/pvclock_gtod.h #include trace/events/kvm.h #define CREATE_TRACE_POINTS @@ -1135,8 +1136,91 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu EXPORT_SYMBOL_GPL(kvm_write_tsc); +static cycle_t read_tsc(void) +{ + cycle_t ret; + u64 last; + + /* +* Empirically, a fence (of type that depends on the CPU) +* before rdtsc is enough to ensure that rdtsc is ordered +* with respect to loads. The various CPU manuals are unclear +* as to whether rdtsc can be reordered with later loads, +* but no one has ever seen it happen. +*/ + rdtsc_barrier(); + ret = (cycle_t)vget_cycles(); + + last = pvclock_gtod_data.clock.cycle_last; + + if (likely(ret = last)) + return ret; + + /* +* GCC likes to generate cmov here, but this branch is extremely +* predictable (it's just a funciton of time and the likely is +* very likely) and there's a data dependence, so force GCC +* to generate a branch instead. I don't barrier() because +* we don't actually need a barrier, and if this function +* ever gets inlined it will generate worse code. +*/ + asm volatile (); + return last; +} + +static inline u64 vgettsc(cycle_t *cycle_now) +{ + long v; + struct pvclock_gtod_data *gtod = pvclock_gtod_data; + + *cycle_now = read_tsc(); + + v = (*cycle_now - gtod-clock.cycle_last) gtod-clock.mask; + return v * gtod-clock.mult; +} + +static int do_monotonic(struct timespec *ts, cycle_t *cycle_now) +{ + unsigned long seq; + u64 ns; + int mode; + struct pvclock_gtod_data *gtod = pvclock_gtod_data; + + ts-tv_nsec = 0; + do { + seq = read_seqcount_begin(gtod-seq); + mode = gtod-clock.vclock_mode; + ts-tv_sec = gtod-monotonic_time_sec; + ns = gtod-monotonic_time_snsec; + ns += vgettsc(cycle_now); + ns = gtod-clock.shift; + } while (unlikely(read_seqcount_retry(gtod-seq, seq))); + timespec_add_ns(ts, ns); + + return mode; +} + +/* returns true if host is using tsc clocksource */ +static bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now) +{ + struct timespec ts; + + /* checked again under seqlock below */ + if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC) + return false; + + if (do_monotonic(ts, cycle_now) != VCLOCK_TSC) + return false; + + monotonic_to_bootbased(ts); + *kernel_ns = timespec_to_ns(ts); + + return true; +} + static void kvm_write_pvtime(struct kvm_vcpu *v, struct page *page, -unsigned int offset_in_page, gpa_t gpa) +unsigned int offset_in_page, gpa_t gpa, +bool host_tsc_clocksource) { struct kvm_vcpu_arch *vcpu = v-arch; void *shared_kaddr; @@ -1155,6 +1239,10 @@ static void kvm_write_pvtime(struct kvm_ vcpu-pvclock_set_guest_stopped_request = false; } + /* If the host uses TSC clocksource, then it is stable */ + if (host_tsc_clocksource) + pvclock_flags |= PVCLOCK_TSC_STABLE_BIT; + vcpu-hv_clock.flags = pvclock_flags; memcpy(shared_kaddr + offset_in_page, vcpu-hv_clock, @@ -1172,11 +1260,12 @@ static int kvm_guest_time_update(struct unsigned long this_tsc_khz; s64 kernel_ns, max_kernel_ns; u64 tsc_timestamp; + cycle_t cycle_now; + u64 host_tsc; + bool host_tsc_clocksource; /* Keep irq disabled to prevent changes to the clock */ local_irq_save(flags); - tsc_timestamp = kvm_x86_ops-read_l1_tsc(v, native_read_tsc()); - kernel_ns = get_kernel_ns();
[patch 14/15] time: export time information for KVM pvclock
As suggested by John, export time data similarly to how its by vsyscall support. This allows KVM to retrieve necessary information to implement vsyscall support in KVM guests. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/include/linux/pvclock_gtod.h === --- /dev/null +++ vsyscall/include/linux/pvclock_gtod.h @@ -0,0 +1,23 @@ +#ifndef _PVCLOCK_GTOD_H +#define _PVCLOCK_GTOD_H + +#include linux/clocksource.h + +struct pvclock_gtod_data { + seqcount_t seq; + + struct { /* extract of a clocksource struct */ + int vclock_mode; + cycle_t cycle_last; + cycle_t mask; + u32 mult; + u32 shift; + } clock; + + /* open coded 'struct timespec' */ + u64 monotonic_time_snsec; + time_t monotonic_time_sec; +}; +extern struct pvclock_gtod_data pvclock_gtod_data; + +#endif /* _PVCLOCK_GTOD_H */ Index: vsyscall/kernel/time/timekeeping.c === --- vsyscall.orig/kernel/time/timekeeping.c +++ vsyscall/kernel/time/timekeeping.c @@ -21,6 +21,7 @@ #include linux/time.h #include linux/tick.h #include linux/stop_machine.h +#include linux/pvclock_gtod.h static struct timekeeper timekeeper; @@ -180,6 +181,37 @@ static inline s64 timekeeping_get_ns_raw return nsec + arch_gettimeoffset(); } +struct pvclock_gtod_data pvclock_gtod_data; +EXPORT_SYMBOL_GPL(pvclock_gtod_data); + +static void update_pvclock_gtod(struct timekeeper *tk) +{ + struct pvclock_gtod_data *vdata = pvclock_gtod_data; + + write_seqcount_begin(vdata-seq); + + /* copy vsyscall data */ + vdata-clock.vclock_mode= tk-clock-archdata.vclock_mode; + vdata-clock.cycle_last = tk-clock-cycle_last; + vdata-clock.mask = tk-clock-mask; + vdata-clock.mult = tk-mult; + vdata-clock.shift = tk-shift; + + vdata-monotonic_time_sec = tk-xtime_sec + + tk-wall_to_monotonic.tv_sec; + vdata-monotonic_time_snsec = tk-xtime_nsec + + (tk-wall_to_monotonic.tv_nsec +tk-shift); + while (vdata-monotonic_time_snsec = + (((u64)NSEC_PER_SEC) tk-shift)) { + vdata-monotonic_time_snsec -= + ((u64)NSEC_PER_SEC) tk-shift; + vdata-monotonic_time_sec++; + } + + write_seqcount_end(vdata-seq); +} + /* must hold write on timekeeper.lock */ static void timekeeping_update(struct timekeeper *tk, bool clearntp) { @@ -188,6 +220,7 @@ static void timekeeping_update(struct ti ntp_clear(); } update_vsyscall(tk); + update_pvclock_gtod(tk); } /** -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 03/15] x86: pvclock: remove pvclock_shadow_time
Originally from Jeremy Fitzhardinge. We can copy the information directly from struct pvclock_vcpu_time_info, remove pvclock_shadow_time. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c +++ vsyscall/arch/x86/kernel/pvclock.c @@ -19,21 +19,6 @@ #include linux/percpu.h #include asm/pvclock.h -/* - * These are perodically updated - *xen: magic shared_info page - *kvm: gpa registered via msr - * and then copied here. - */ -struct pvclock_shadow_time { - u64 tsc_timestamp; /* TSC at last update of time vals. */ - u64 system_timestamp; /* Time, in nanosecs, since boot.*/ - u32 tsc_to_nsec_mul; - int tsc_shift; - u32 version; - u8 flags; -}; - static u8 valid_flags __read_mostly = 0; void pvclock_set_flags(u8 flags) @@ -41,32 +26,11 @@ void pvclock_set_flags(u8 flags) valid_flags = flags; } -static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) +static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src) { - u64 delta = native_read_tsc() - shadow-tsc_timestamp; - return pvclock_scale_delta(delta, shadow-tsc_to_nsec_mul, - shadow-tsc_shift); -} - -/* - * Reads a consistent set of time-base values from hypervisor, - * into a shadow data area. - */ -static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst, - struct pvclock_vcpu_time_info *src) -{ - do { - dst-version = src-version; - rmb(); /* fetch version before data */ - dst-tsc_timestamp = src-tsc_timestamp; - dst-system_timestamp = src-system_time; - dst-tsc_to_nsec_mul = src-tsc_to_system_mul; - dst-tsc_shift = src-tsc_shift; - dst-flags = src-flags; - rmb(); /* test version after fetching data */ - } while ((src-version 1) || (dst-version != src-version)); - - return dst-version; + u64 delta = native_read_tsc() - src-tsc_timestamp; + return pvclock_scale_delta(delta, src-tsc_to_system_mul, + src-tsc_shift); } unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src) @@ -90,21 +54,20 @@ void pvclock_resume(void) cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) { - struct pvclock_shadow_time shadow; unsigned version; cycle_t ret, offset; u64 last; do { - version = pvclock_get_time_values(shadow, src); + version = src-version; rdtsc_barrier(); - offset = pvclock_get_nsec_offset(shadow); - ret = shadow.system_timestamp + offset; + offset = pvclock_get_nsec_offset(src); + ret = src-system_time + offset; rdtsc_barrier(); - } while (version != src-version); + } while ((src-version 1) || version != src-version); if ((valid_flags PVCLOCK_TSC_STABLE_BIT) - (shadow.flags PVCLOCK_TSC_STABLE_BIT)) + (src-flags PVCLOCK_TSC_STABLE_BIT)) return ret; /* -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 05/15] x86: pvclock: fix flags usage race
Validity of values returned by pvclock (including flags) is guaranteed by version checks. That is, read of src-flags outside version check protection can refer to a different paravirt clock update by the hypervisor. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/include/asm/pvclock.h === --- vsyscall.orig/arch/x86/include/asm/pvclock.h +++ vsyscall/arch/x86/include/asm/pvclock.h @@ -66,18 +66,21 @@ u64 pvclock_get_nsec_offset(const struct static __always_inline unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, - cycle_t *cycles) + cycle_t *cycles, u8 *flags) { unsigned version; cycle_t ret, offset; + u8 ret_flags; version = src-version; rdtsc_barrier(); offset = pvclock_get_nsec_offset(src); ret = src-system_time + offset; + ret_flags = src-flags; rdtsc_barrier(); *cycles = ret; + *flags = ret_flags; return version; } Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c +++ vsyscall/arch/x86/kernel/pvclock.c @@ -50,13 +50,14 @@ cycle_t pvclock_clocksource_read(struct unsigned version; cycle_t ret; u64 last; + u8 flags; do { - version = __pvclock_read_cycles(src, ret); + version = __pvclock_read_cycles(src, ret, flags); } while ((src-version 1) || version != src-version); if ((valid_flags PVCLOCK_TSC_STABLE_BIT) - (src-flags PVCLOCK_TSC_STABLE_BIT)) + (flags PVCLOCK_TSC_STABLE_BIT)) return ret; /* -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 11/15] x86: vdso: pvclock gettime support
Improve performance of time system calls when using Linux pvclock, by reading time info from fixmap visible copy of pvclock data. Originally from Jeremy Fitzhardinge. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/vdso/vclock_gettime.c === --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c +++ vsyscall/arch/x86/vdso/vclock_gettime.c @@ -22,6 +22,7 @@ #include asm/hpet.h #include asm/unistd.h #include asm/io.h +#include asm/pvclock.h #define gtod (VVAR(vsyscall_gtod_data)) @@ -62,6 +63,69 @@ static notrace cycle_t vread_hpet(void) return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0); } +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL + +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) +{ + const aligned_pvti_t *pvti_base; + int idx = cpu / (PAGE_SIZE/PVTI_SIZE); + int offset = cpu % (PAGE_SIZE/PVTI_SIZE); + + BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx PVCLOCK_FIXMAP_END); + + pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx); + + return pvti_base[offset].info; +} + +static notrace cycle_t vread_pvclock(int *mode) +{ + const struct pvclock_vsyscall_time_info *pvti; + cycle_t ret; + u64 last; + u32 version; + u32 migrate_count; + u8 flags; + unsigned cpu, cpu1; + + + /* +* When looping to get a consistent (time-info, tsc) pair, we +* also need to deal with the possibility we can switch vcpus, +* so make sure we always re-fetch time-info for the current vcpu. +*/ + do { + cpu = __getcpu() 0xfff; + pvti = get_pvti(cpu); + + migrate_count = pvti-migrate_count; + + version = __pvclock_read_cycles(pvti-pvti, ret, flags); + + /* +* Test we're still on the cpu as well as the version. +* We could have been migrated just after the first +* vgetcpu but before fetching the version, so we +* wouldn't notice a version change. +*/ + cpu1 = __getcpu() 0xfff; + } while (unlikely(cpu != cpu1 || + (pvti-pvti.version 1) || + pvti-pvti.version != version || + pvti-migrate_count != migrate_count)); + + if (unlikely(!(flags PVCLOCK_TSC_STABLE_BIT))) + *mode = VCLOCK_NONE; + + last = VVAR(vsyscall_gtod_data).clock.cycle_last; + + if (likely(ret = last)) + return ret; + + return last; +} +#endif + notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) { long ret; @@ -88,6 +152,8 @@ notrace static inline u64 vgetsns(int *m cycles = vread_tsc(); else if (gtod-clock.vclock_mode == VCLOCK_HPET) cycles = vread_hpet(); + else if (gtod-clock.vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); else return 0; v = (cycles - gtod-clock.cycle_last) gtod-clock.mask; Index: vsyscall/arch/x86/include/asm/vsyscall.h === --- vsyscall.orig/arch/x86/include/asm/vsyscall.h +++ vsyscall/arch/x86/include/asm/vsyscall.h @@ -33,6 +33,21 @@ extern void map_vsyscall(void); */ extern bool emulate_vsyscall(struct pt_regs *regs, unsigned long address); +static inline unsigned int __getcpu(void) +{ + unsigned int p; + + if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) { + /* Load per CPU data from RDTSCP */ + native_read_tscp(p); + } else { + /* Load per CPU data from GDT */ + asm(lsl %1,%0 : =r (p) : r (__PER_CPU_SEG)); + } + + return p; +} + #endif /* __KERNEL__ */ #endif /* _ASM_X86_VSYSCALL_H */ Index: vsyscall/arch/x86/vdso/vgetcpu.c === --- vsyscall.orig/arch/x86/vdso/vgetcpu.c +++ vsyscall/arch/x86/vdso/vgetcpu.c @@ -17,13 +17,8 @@ __vdso_getcpu(unsigned *cpu, unsigned *n { unsigned int p; - if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) { - /* Load per CPU data from RDTSCP */ - native_read_tscp(p); - } else { - /* Load per CPU data from GDT */ - asm(lsl %1,%0 : =r (p) : r (__PER_CPU_SEG)); - } + p = __getcpu(); + if (cpu) *cpu = p 0xfff; if (node) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 09/15] x86: kvm guest: pvclock vsyscall support
Allow hypervisor to update userspace visible copy of pvclock data. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/kvmclock.c === --- vsyscall.orig/arch/x86/kernel/kvmclock.c +++ vsyscall/arch/x86/kernel/kvmclock.c @@ -31,6 +31,9 @@ static int kvmclock = 1; static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME; static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK; +/* set when the generic vsyscall pvclock elements are setup */ +bool vsyscall_clock_initializable = false; + static int parse_no_kvmclock(char *arg) { kvmclock = 0; @@ -151,6 +154,28 @@ int kvm_register_clock(char *txt) return ret; } +static int kvm_register_vsyscall_clock(char *txt) +{ +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL + int cpu = smp_processor_id(); + int low, high, ret; + struct pvclock_vcpu_time_info *info; + + info = pvclock_get_vsyscall_time_info(cpu); + + low = (int)__pa(info) | 1; + high = ((u64)__pa(per_cpu(hv_clock, cpu)) 32); + ret = native_write_msr_safe(MSR_KVM_USERSPACE_TIME, low, high); + printk(KERN_INFO kvm-clock: cpu %d, msr %x:%x, %s\n, + cpu, high, low, txt); + + return ret; +#else + return 0; +#endif +} + + static void kvm_save_sched_clock_state(void) { } @@ -158,6 +183,8 @@ static void kvm_save_sched_clock_state(v static void kvm_restore_sched_clock_state(void) { kvm_register_clock(primary cpu clock, resume); + if (vsyscall_clock_initializable) + kvm_register_vsyscall_clock(primary cpu vsyscall clock, resume); } #ifdef CONFIG_X86_LOCAL_APIC @@ -168,6 +195,8 @@ static void __cpuinit kvm_setup_secondar * we shouldn't fail. */ WARN_ON(kvm_register_clock(secondary cpu clock)); + if (vsyscall_clock_initializable) + kvm_register_vsyscall_clock(secondary cpu vsyscall clock); } #endif @@ -182,6 +211,8 @@ static void __cpuinit kvm_setup_secondar #ifdef CONFIG_KEXEC static void kvm_crash_shutdown(struct pt_regs *regs) { + if (vsyscall_clock_initializable) + native_write_msr(MSR_KVM_USERSPACE_TIME, 0, 0); native_write_msr(msr_kvm_system_time, 0, 0); kvm_disable_steal_time(); native_machine_crash_shutdown(regs); @@ -190,6 +221,8 @@ static void kvm_crash_shutdown(struct pt static void kvm_shutdown(void) { + if (vsyscall_clock_initializable) + native_write_msr(MSR_KVM_USERSPACE_TIME, 0, 0); native_write_msr(msr_kvm_system_time, 0, 0); kvm_disable_steal_time(); native_machine_shutdown(); @@ -233,3 +266,27 @@ void __init kvmclock_init(void) if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT); } + +int kvm_setup_vsyscall_timeinfo(void) +{ +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL + int ret; + struct pvclock_vcpu_time_info *vcpu_time; + u8 flags; + + vcpu_time = get_cpu_var(hv_clock); + flags = pvclock_read_flags(vcpu_time); + put_cpu_var(hv_clock); + + if (!(flags PVCLOCK_TSC_STABLE_BIT)) + return 1; + + if ((ret = pvclock_init_vsyscall())) + return ret; + + kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK; + vsyscall_clock_initialized = true; +#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */ + return 0; +} + Index: vsyscall/arch/x86/kernel/kvm.c === --- vsyscall.orig/arch/x86/kernel/kvm.c +++ vsyscall/arch/x86/kernel/kvm.c @@ -42,6 +42,7 @@ #include asm/apic.h #include asm/apicdef.h #include asm/hypervisor.h +#include asm/kvm_guest.h static int kvmapf = 1; @@ -468,6 +469,9 @@ void __init kvm_guest_init(void) if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) apic_set_eoi_write(kvm_guest_apic_eoi_write); + if (kvm_para_has_feature(KVM_FEATURE_USERSPACE_CLOCKSOURCE)) + kvm_setup_vsyscall_timeinfo(); + #ifdef CONFIG_SMP smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu; register_cpu_notifier(kvm_cpu_notifier); Index: vsyscall/arch/x86/include/asm/kvm_guest.h === --- /dev/null +++ vsyscall/arch/x86/include/asm/kvm_guest.h @@ -0,0 +1,8 @@ +#ifndef _ASM_X86_KVM_GUEST_H +#define _ASM_X86_KVM_GUEST_H + +extern bool vsyscall_clock_initializable; + +int kvm_setup_vsyscall_timeinfo(void); + +#endif /* _ASM_X86_KVM_GUEST_H */ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 07/15] sched: add notifier for cross-cpu migrations
Originally from Jeremy Fitzhardinge. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/include/linux/sched.h === --- vsyscall.orig/include/linux/sched.h +++ vsyscall/include/linux/sched.h @@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void) extern void calc_global_load(unsigned long ticks); extern void update_cpu_load_nohz(void); +/* Notifier for when a task gets migrated to a new CPU */ +struct task_migration_notifier { + struct task_struct *task; + int from_cpu; + int to_cpu; +}; +extern void register_task_migration_notifier(struct notifier_block *n); + extern unsigned long get_parent_ip(unsigned long addr); struct seq_file; Index: vsyscall/kernel/sched/core.c === --- vsyscall.orig/kernel/sched/core.c +++ vsyscall/kernel/sched/core.c @@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s rq-skip_clock_update = 1; } +static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); + +void register_task_migration_notifier(struct notifier_block *n) +{ + atomic_notifier_chain_register(task_migration_notifier, n); +} + #ifdef CONFIG_SMP void set_task_cpu(struct task_struct *p, unsigned int new_cpu) { @@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p, trace_sched_migrate_task(p, new_cpu); if (task_cpu(p) != new_cpu) { + struct task_migration_notifier tmn; + p-se.nr_migrations++; perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0); + + tmn.task = p; + tmn.from_cpu = task_cpu(p); + tmn.to_cpu = new_cpu; + + atomic_notifier_call_chain(task_migration_notifier, 0, tmn); } __set_task_cpu(p, new_cpu); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 12/15] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
Allow a guest to register a second location for the VCPU time info structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW). This is intended to allow the guest kernel to map this information into a usermode accessible page, so that usermode can efficiently calculate system time from the TSC without having to make a syscall. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/include/asm/kvm_para.h === --- vsyscall.orig/arch/x86/include/asm/kvm_para.h +++ vsyscall/arch/x86/include/asm/kvm_para.h @@ -23,6 +23,7 @@ #define KVM_FEATURE_ASYNC_PF 4 #define KVM_FEATURE_STEAL_TIME 5 #define KVM_FEATURE_PV_EOI 6 +#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. @@ -39,6 +40,7 @@ #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 #define MSR_KVM_STEAL_TIME 0x4b564d03 #define MSR_KVM_PV_EOI_EN 0x4b564d04 +#define MSR_KVM_USERSPACE_TIME 0x4b564d05 struct kvm_steal_time { __u64 steal; Index: vsyscall/Documentation/virtual/kvm/msr.txt === --- vsyscall.orig/Documentation/virtual/kvm/msr.txt +++ vsyscall/Documentation/virtual/kvm/msr.txt @@ -125,6 +125,22 @@ MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01 Availability of this MSR must be checked via bit 3 in 0x401 cpuid leaf prior to usage. +MSR_KVM_USERSPACE_TIME: 0x4b564d05 + +Allow a guest to register a second location for the VCPU time info +structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW). +This is intended to allow the guest kernel to map this information +into a usermode accessible page, so that usermode can efficiently +calculate system time from the TSC without having to make a syscall. + +Relationship with master copy (MSR_KVM_SYSTEM_TIME_NEW): + +- This MSR must be enabled only when the master is enabled. +- Disabling updates to the master automatically disables +updates for this copy. + +Availability of this MSR must be checked via bit 7 in 0x401 cpuid +leaf prior to usage. MSR_KVM_WALL_CLOCK: 0x11 Index: vsyscall/arch/x86/include/asm/kvm_host.h === --- vsyscall.orig/arch/x86/include/asm/kvm_host.h +++ vsyscall/arch/x86/include/asm/kvm_host.h @@ -415,10 +415,13 @@ struct kvm_vcpu_arch { int (*complete_userspace_io)(struct kvm_vcpu *vcpu); gpa_t time; + gpa_t uspace_time; struct pvclock_vcpu_time_info hv_clock; unsigned int hw_tsc_khz; unsigned int time_offset; + unsigned int uspace_time_offset; struct page *time_page; + struct page *uspace_time_page; /* set guest stopped flag in pvclock flags field */ bool pvclock_set_guest_stopped_request; Index: vsyscall/arch/x86/kvm/x86.c === --- vsyscall.orig/arch/x86/kvm/x86.c +++ vsyscall/arch/x86/kvm/x86.c @@ -809,13 +809,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc); * kvm-specific. Those are put in the beginning of the list. */ -#define KVM_SAVE_MSRS_BEGIN10 +#define KVM_SAVE_MSRS_BEGIN11 static u32 msrs_to_save[] = { MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW, HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, - MSR_KVM_PV_EOI_EN, + MSR_KVM_PV_EOI_EN, MSR_KVM_USERSPACE_TIME, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, MSR_STAR, #ifdef CONFIG_X86_64 @@ -1135,16 +1135,43 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu EXPORT_SYMBOL_GPL(kvm_write_tsc); +static void kvm_write_pvtime(struct kvm_vcpu *v, struct page *page, +unsigned int offset_in_page, gpa_t gpa) +{ + struct kvm_vcpu_arch *vcpu = v-arch; + void *shared_kaddr; + struct pvclock_vcpu_time_info *guest_hv_clock; + u8 pvclock_flags; + + shared_kaddr = kmap_atomic(page); + + guest_hv_clock = shared_kaddr + offset_in_page; + + /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */ + pvclock_flags = (guest_hv_clock-flags PVCLOCK_GUEST_STOPPED); + + if (vcpu-pvclock_set_guest_stopped_request) { + pvclock_flags |= PVCLOCK_GUEST_STOPPED; + vcpu-pvclock_set_guest_stopped_request = false; + } + + vcpu-hv_clock.flags = pvclock_flags; + + memcpy(shared_kaddr + offset_in_page, vcpu-hv_clock, + sizeof(vcpu-hv_clock)); + + kunmap_atomic(shared_kaddr); + + mark_page_dirty(v-kvm, gpa PAGE_SHIFT); +} + static int kvm_guest_time_update(struct kvm_vcpu *v) { unsigned long flags; struct kvm_vcpu_arch *vcpu = v-arch; -
[patch 00/15] pvclock vsyscall support + KVM hypervisor support
This patchset, based on earlier work by Jeremy Fitzhardinge, implements paravirtual clock vsyscall support. It should be possible to implement Xen support relatively easily. It reduces clock_gettime from 500 cycles to 200 cycles on my testbox (including an mfence, that measurement). NOTE: exporting pvclock tsc stable bit if the guest TSCs are not synchronized is incorrect, next version will have that fixed. Please review. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 10/15] x86: vsyscall: pass mode to gettime backend
Required by next patch. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/vdso/vclock_gettime.c === --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c +++ vsyscall/arch/x86/vdso/vclock_gettime.c @@ -80,7 +80,7 @@ notrace static long vdso_fallback_gtod(s } -notrace static inline u64 vgetsns(void) +notrace static inline u64 vgetsns(int *mode) { long v; cycles_t cycles; @@ -107,7 +107,7 @@ notrace static int __always_inline do_re mode = gtod-clock.vclock_mode; ts-tv_sec = gtod-wall_time_sec; ns = gtod-wall_time_snsec; - ns += vgetsns(); + ns += vgetsns(mode); ns = gtod-clock.shift; } while (unlikely(read_seqcount_retry(gtod-seq, seq))); @@ -127,7 +127,7 @@ notrace static int do_monotonic(struct t mode = gtod-clock.vclock_mode; ts-tv_sec = gtod-monotonic_time_sec; ns = gtod-monotonic_time_snsec; - ns += vgetsns(); + ns += vgetsns(mode); ns = gtod-clock.shift; } while (unlikely(read_seqcount_retry(gtod-seq, seq))); timespec_add_ns(ts, ns); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 02/15] x86: pvclock: make sure rdtsc doesnt speculate out of region
Originally from Jeremy Fitzhardinge. pvclock_get_time_values, which contains the memory barriers will be removed by next patch. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c +++ vsyscall/arch/x86/kernel/pvclock.c @@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct do { version = pvclock_get_time_values(shadow, src); - barrier(); + rdtsc_barrier(); offset = pvclock_get_nsec_offset(shadow); ret = shadow.system_timestamp + offset; - barrier(); + rdtsc_barrier(); } while (version != src-version); if ((valid_flags PVCLOCK_TSC_STABLE_BIT) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 04/15] x86: pvclock: create helper for pvclock data retrieval
Originally from Jeremy Fitzhardinge. So code can be reused. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c +++ vsyscall/arch/x86/kernel/pvclock.c @@ -26,13 +26,6 @@ void pvclock_set_flags(u8 flags) valid_flags = flags; } -static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src) -{ - u64 delta = native_read_tsc() - src-tsc_timestamp; - return pvclock_scale_delta(delta, src-tsc_to_system_mul, - src-tsc_shift); -} - unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src) { u64 pv_tsc_khz = 100ULL 32; @@ -55,15 +48,11 @@ void pvclock_resume(void) cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) { unsigned version; - cycle_t ret, offset; + cycle_t ret; u64 last; do { - version = src-version; - rdtsc_barrier(); - offset = pvclock_get_nsec_offset(src); - ret = src-system_time + offset; - rdtsc_barrier(); + version = __pvclock_read_cycles(src, ret); } while ((src-version 1) || version != src-version); if ((valid_flags PVCLOCK_TSC_STABLE_BIT) Index: vsyscall/arch/x86/include/asm/pvclock.h === --- vsyscall.orig/arch/x86/include/asm/pvclock.h +++ vsyscall/arch/x86/include/asm/pvclock.h @@ -56,4 +56,29 @@ static inline u64 pvclock_scale_delta(u6 return product; } +static __always_inline +u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src) +{ + u64 delta = __native_read_tsc() - src-tsc_timestamp; + return pvclock_scale_delta(delta, src-tsc_to_system_mul, + src-tsc_shift); +} + +static __always_inline +unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, + cycle_t *cycles) +{ + unsigned version; + cycle_t ret, offset; + + version = src-version; + rdtsc_barrier(); + offset = pvclock_get_nsec_offset(src); + ret = src-system_time + offset; + rdtsc_barrier(); + + *cycles = ret; + return version; +} + #endif /* _ASM_X86_PVCLOCK_H */ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 01/15] KVM: x86: retain pvclock guest stopped bit in guest memory
Otherwise its possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU migration) to clear the bit. Noticed by Paolo Bonzini. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kvm/x86.c === --- vsyscall.orig/arch/x86/kvm/x86.c +++ vsyscall/arch/x86/kvm/x86.c @@ -1143,6 +1143,7 @@ static int kvm_guest_time_update(struct unsigned long this_tsc_khz; s64 kernel_ns, max_kernel_ns; u64 tsc_timestamp; + struct pvclock_vcpu_time_info *guest_hv_clock; u8 pvclock_flags; /* Keep irq disabled to prevent changes to the clock */ @@ -1226,13 +1227,6 @@ static int kvm_guest_time_update(struct vcpu-last_kernel_ns = kernel_ns; vcpu-last_guest_tsc = tsc_timestamp; - pvclock_flags = 0; - if (vcpu-pvclock_set_guest_stopped_request) { - pvclock_flags |= PVCLOCK_GUEST_STOPPED; - vcpu-pvclock_set_guest_stopped_request = false; - } - - vcpu-hv_clock.flags = pvclock_flags; /* * The interface expects us to write an even number signaling that the @@ -1243,6 +1237,18 @@ static int kvm_guest_time_update(struct shared_kaddr = kmap_atomic(vcpu-time_page); + guest_hv_clock = shared_kaddr + vcpu-time_offset; + + /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */ + pvclock_flags = (guest_hv_clock-flags PVCLOCK_GUEST_STOPPED); + + if (vcpu-pvclock_set_guest_stopped_request) { + pvclock_flags |= PVCLOCK_GUEST_STOPPED; + vcpu-pvclock_set_guest_stopped_request = false; + } + + vcpu-hv_clock.flags = pvclock_flags; + memcpy(shared_kaddr + vcpu-time_offset, vcpu-hv_clock, sizeof(vcpu-hv_clock)); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM ept flush
Thanks for the reply. I have one more question. If I do munmap of the RAM allocated in qemu, will the changes be reflected in KVM Ept. I guess there is some mmu notifier which ensures that entries of EPT are synced with the host entries. On Tue, Oct 16, 2012 at 8:27 PM, Avi Kivity a...@redhat.com wrote: On 10/16/2012 01:57 PM, Rohan Sharma wrote: Is there a way to flush ept entries in qemu-kvm. No. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
Avi Kivity a...@redhat.com writes: On 10/16/2012 01:58 PM, Paul Mackerras wrote: On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote: On 10/16/2012 06:01 AM, Paul Mackerras wrote: +4.78 KVM_PPC_GET_HTAB_FD + +Capability: KVM_CAP_PPC_HTAB_FD +Architectures: powerpc +Type: vm ioctl +Parameters: Pointer to struct kvm_get_htab_fd (in) +Returns: file descriptor number (= 0) on success, -1 on error + +This returns a file descriptor that can be used either to read out the +entries in the guest's hashed page table (HPT), or to write entries to +initialize the HPT. The returned fd can only be written to if the +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and +can only be read if that bit is clear. The argument struct looks like +this: + +/* For KVM_PPC_GET_HTAB_FD */ +struct kvm_get_htab_fd { + __u64 flags; + __u64 start_index; +}; + +/* Values for kvm_get_htab_fd.flags */ +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) +#define KVM_GET_HTAB_WRITE ((__u64)0x2) + +The `start_index' field gives the index in the HPT of the entry at +which to start reading. It is ignored when writing. + +Reads on the fd will initially supply information about all +interesting HPT entries. Interesting entries are those with the +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise +all entries. When the end of the HPT is reached, the read() will +return. What happens if the read buffer is smaller than the HPT size? That's fine; the read stops when it has filled the buffer and a subsequent read will continue from where the previous one finished. What happens if the read buffer size is not a multiple of entry size? Then we don't use the last few bytes of the buffer. The read() call returns the number of bytes that were filled in, of course. In any case, the header size is 8 bytes and the HPT entry size is 16 bytes, so the number of bytes filled in won't necessarily be a multiple of 16 bytes. That's sane and expected, but it should be documented. Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry to change). No. This forces userspace to dedicate a thread for the HPT. If no changes are available, does read return a size 0? I don't think it's necessary to support polling. The kernel should always be able to respond to userspace here. The only catch is whether to return !0 read sizes when there are no changes. At any case, I can't see why a dedicated thread is needed. QEMU is going to poll HPT based on how fast we can send data over the wire. If read() is called again on the fd, it will start again from +the beginning of the HPT, but will only return HPT entries that have +changed since they were last read. + +Data read or written is structured as a header (8 bytes) followed by a +series of valid HPT entries (16 bytes) each. The header indicates how +many valid HPT entries there are and how many invalid entries follow +the valid entries. The invalid entries are not represented explicitly +in the stream. The header format is: + +struct kvm_get_htab_header { + __u32 index; + __u16 n_valid; + __u16 n_invalid; +}; This structure forces the kernel to return entries sequentially. Will this block changing the data structure in the future? Or is the hardware spec sufficiently strict that such changes are not realistic? By data structure, do you mean the stream format on the file descriptor, or the HPT structure? If we want a new stream format, then we would define a bit in the flags field of struct kvm_get_htab_fd to mean I want the new stream format. The code fails the ioctl if any unknown flag bits are set, so a new userspace that wants to use the new format could then detect that it is running on an old kernel and fall back to the old format. The HPT entry format is very unlikely to change in size or basic layout (though the architects do redefine some of the bits occasionally). I meant the internal data structure that holds HPT entries. I guess I don't understand the index. Do we expect changes to be in contiguous ranges? And invalid entries to be contiguous as well? That doesn't fit with how hash tables work. Does the index represent the position of the entry within the table, or something else? + +Writes to the fd create HPT entries starting at the index given in the +header; first `n_valid' valid entries with contents from the data +written, then `n_invalid' invalid entries, invalidating any previously +valid entries found. This scheme is a clever, original, and very interesting approach to live migration. That doesn't necessarily mean a NAK, we should see if it makes sense for other migration APIs as well (we currently have difficulties migrating very large/wide guests). What is the typical number of entries in the HPT? Do you have estimates of the
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
On Tue, Oct 16, 2012 at 03:06:33PM +0200, Avi Kivity wrote: On 10/16/2012 01:58 PM, Paul Mackerras wrote: On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote: Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry to change). No. This forces userspace to dedicate a thread for the HPT. Why? Reads never block in any case. If read() is called again on the fd, it will start again from +the beginning of the HPT, but will only return HPT entries that have +changed since they were last read. + +Data read or written is structured as a header (8 bytes) followed by a +series of valid HPT entries (16 bytes) each. The header indicates how +many valid HPT entries there are and how many invalid entries follow +the valid entries. The invalid entries are not represented explicitly +in the stream. The header format is: + +struct kvm_get_htab_header { +__u32 index; +__u16 n_valid; +__u16 n_invalid; +}; This structure forces the kernel to return entries sequentially. Will this block changing the data structure in the future? Or is the hardware spec sufficiently strict that such changes are not realistic? By data structure, do you mean the stream format on the file descriptor, or the HPT structure? If we want a new stream format, then we would define a bit in the flags field of struct kvm_get_htab_fd to mean I want the new stream format. The code fails the ioctl if any unknown flag bits are set, so a new userspace that wants to use the new format could then detect that it is running on an old kernel and fall back to the old format. The HPT entry format is very unlikely to change in size or basic layout (though the architects do redefine some of the bits occasionally). I meant the internal data structure that holds HPT entries. Oh, that's just an array, and userspace already knows how big it is. I guess I don't understand the index. Do we expect changes to be in contiguous ranges? And invalid entries to be contiguous as well? That doesn't fit with how hash tables work. Does the index represent the position of the entry within the table, or something else? The index is just the position in the array. Typically, in each group of 8 it will tend to be the low-numbered ones that are valid, since creating an entry usually uses the first empty slot. So I expect that on the first pass, most of the records will represent 8 HPTEs. On subsequent passes, probably most records will represent a single HPTE. + +Writes to the fd create HPT entries starting at the index given in the +header; first `n_valid' valid entries with contents from the data +written, then `n_invalid' invalid entries, invalidating any previously +valid entries found. This scheme is a clever, original, and very interesting approach to live migration. That doesn't necessarily mean a NAK, we should see if it makes sense for other migration APIs as well (we currently have difficulties migrating very large/wide guests). What is the typical number of entries in the HPT? Do you have estimates of the change rate? Typically the HPT would have about a million entries, i.e. it would be 16MiB in size. The usual guideline is to make it about 1/64 of the maximum amount of RAM the guest could ever have, rounded up to a power of two, although we often run with less, say 1/128 or even 1/256. 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE. Does it warrant a live migration protocol? The qemu people I talked to seemed to think so. Because it is a hash table, updates tend to be scattered throughout the whole table, which is another reason why per-page dirty tracking and updates would be pretty inefficient. This suggests a stream format that includes the index in every entry. That would amount to dropping the n_valid and n_invalid fields from the current header format. That would be less efficient for the initial pass (assuming we achieve an average n_valid of at least 2 on the initial pass), and probably less efficient for the incremental updates, since a newly-invalidated entry would have to be represented as 16 zero bytes rather than just an 8-byte header with n_valid=0 and n_invalid=1. I'm assuming here that the initial pass would omit invalid entries. As for the change rate, it depends on the application of course, but basically every time the guest changes a PTE in its Linux page tables we do the corresponding change to the corresponding HPT entry, so the rate can be quite high. Workloads that do a lot of fork, exit, mmap, exec, etc. have a high rate of HPT updates. If the rate is high enough, then there's no point in a live update. True, but doesn't that argument apply to memory pages as well? Paul. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to
[PATCH RFC V2 1/5] Alter the amount of steal time reported by the guest.
Modify the amount of stealtime that the kernel reports via the /proc interface. Steal time will now be broken down into steal_time and consigned_time. Consigned_time will represent the amount of time that is expected to be lost due to overcommitment of the physical cpu or by using cpu capping. The amount consigned_time will be passed in using an ioctl. The time will be expressed in the number of nanoseconds to be lost in during the fixed period. The fixed period is currently 1/10th of a second. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- fs/proc/stat.c |9 +++-- include/linux/kernel_stat.h |1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e296572..cb7fe80 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v) int i, j; unsigned long jif; u64 user, nice, system, idle, iowait, irq, softirq, steal; - u64 guest, guest_nice; + u64 guest, guest_nice, consign; u64 sum = 0; u64 sum_softirq = 0; unsigned int per_softirq_sums[NR_SOFTIRQS] = {0}; @@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v) user = nice = system = idle = iowait = irq = softirq = steal = 0; - guest = guest_nice = 0; + guest = guest_nice = consign = 0; getboottime(boottime); jif = boottime.tv_sec; + for_each_possible_cpu(i) { user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE]; @@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v) steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; sum += kstat_cpu_irqs_sum(i); sum += arch_irq_stat_cpu(i); @@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); for_each_online_cpu(i) { @@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v) steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; seq_printf(p, cpu%d, i); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice)); @@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); } seq_printf(p, intr %llu, (unsigned long long)sum); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 36d12f0..c0b0095 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -27,6 +27,7 @@ enum cpu_usage_stat { CPUTIME_STEAL, CPUTIME_GUEST, CPUTIME_GUEST_NICE, + CPUTIME_CONSIGN, NR_STATS, }; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC V2 3/5] Add the code to send the consigned time from the host to the guest
Add the code to send the consigned time from the host to the guest Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/include/asm/kvm_para.h |3 ++- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/kernel/kvm.c |3 ++- arch/x86/kvm/x86.c |2 ++ include/linux/kernel_stat.h |1 + kernel/sched/cputime.c | 21 +++-- kernel/sched/sched.h|2 ++ 8 files changed, 31 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1eaa6b0..bd4e412 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -409,6 +409,7 @@ struct kvm_vcpu_arch { u64 msr_val; u64 last_steal; u64 accum_steal; + u64 accum_consigned; struct gfn_to_hva_cache stime; struct kvm_steal_time steal; } st; diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index 2f7712e..debe72e 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; + __u64 consigned; __u32 version; __u32 flags; - __u32 pad[12]; + __u32 pad[10]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a5f9f30..d39e8d0 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { - PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); + PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 91b3b2a..4e5582a 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -368,7 +368,7 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static u64 kvm_steal_clock(int cpu, u64 *steal) +static u64 kvm_steal_clock(int cpu, u64 *steal, u64 *consigned) { struct kvm_steal_time *src; int version; @@ -378,6 +378,7 @@ static u64 kvm_steal_clock(int cpu, u64 *steal) version = src-version; rmb(); *steal = src-steal; + *consigned = src-consigned; rmb(); } while ((version 1) || (version != src-version)); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1f09552..801cfa8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1554,8 +1554,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu) return; vcpu-arch.st.steal.steal += vcpu-arch.st.accum_steal; + vcpu-arch.st.steal.consigned += vcpu-arch.st.accum_consigned; vcpu-arch.st.steal.version += 2; vcpu-arch.st.accum_steal = 0; + vcpu-arch.st.accum_consigned = 0; kvm_write_guest_cached(vcpu-kvm, vcpu-arch.st.stime, vcpu-arch.st.steal, sizeof(struct kvm_steal_time)); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index c0b0095..253fdce 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -125,6 +125,7 @@ extern unsigned long long task_delta_exec(struct task_struct *); extern void account_user_time(struct task_struct *, cputime_t, cputime_t); extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t); extern void account_steal_time(cputime_t); +extern void account_consigned_time(cputime_t); extern void account_idle_time(cputime_t); extern void account_process_tick(struct task_struct *, int user); diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index dd3fd46..bf2025a 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int hardirq_offset, } /* + * This accounts for the time that is split out of steal time. + * Consigned time represents the amount of time that the cpu was + * expected to be somewhere else. + */ +void account_consigned_time(cputime_t cputime) +{ + u64 *cpustat = kcpustat_this_cpu-cpustat; + + cpustat[CPUTIME_CONSIGN] += (__force u64) cputime; +} + +/* * Account for involuntary wait time. * @cputime: the cpu time spent in involuntary wait */ @@ -274,15 +286,20 @@ static __always_inline bool steal_account_process_tick(void) #ifdef CONFIG_PARAVIRT if (static_key_false(paravirt_steal_enabled)) { u64 steal, st = 0; + u64 consigned, cs = 0; -
[PATCH RFC V2 5/5] Add an ioctl to communicate the consign limit to the host.
Add an ioctl to communicate the consign limit to the host. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- arch/x86/kvm/x86.c |6 ++ include/linux/kvm.h |2 ++ include/linux/kvm_host.h |2 ++ virt/kvm/kvm_main.c |7 +++ 4 files changed, 17 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 469e748..5a4b8db 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5928,6 +5928,12 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, return 0; } +int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, long entitlement) +{ + vcpu-arch.consigned_limit = entitlement; + return 0; +} + int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) { struct i387_fxsave_struct *fxsave = diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 2ce09aa..6b211fb 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -904,6 +904,8 @@ struct kvm_s390_ucas_mapping { #define KVM_SET_ONE_REG _IOW(KVMIO, 0xac, struct kvm_one_reg) /* VM is being stopped by host */ #define KVM_KVMCLOCK_CTRL_IO(KVMIO, 0xad) +/* Set the consignment limit which will be used to separete steal time */ +#define KVM_SET_ENTITLEMENT _IOW(KVMIO, 0xae, unsigned long) #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 0) #define KVM_DEV_ASSIGN_PCI_2_3 (1 1) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 8a59e0a..9a0a791 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -538,6 +538,8 @@ void kvm_arch_hardware_unsetup(void); void kvm_arch_check_processor_compat(void *rtn); int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu); +int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, + long entitlement); void kvm_free_physmem(struct kvm *kvm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index d617f69..ccf0aec 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1936,6 +1936,13 @@ out_free2: r = 0; break; } + case KVM_SET_ENTITLEMENT: { + r = kvm_arch_vcpu_ioctl_set_entitlement(vcpu, arg); + if (r) + goto out; + r = 0; + break; + } default: r = kvm_arch_vcpu_ioctl(filp, ioctl, arg); } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Added call parameter to track whether invocation originated with guest or elsewhere
Signed-off-by: Will Auld will.a...@intel.com --- Resending to full list Marcelo, This patch is what I believe you ask for as foundational for later patches to address IA32_TSC_ADJUST. Thanks, Will arch/x86/include/asm/kvm_host.h | 8 arch/x86/kvm/svm.c | 18 ++ arch/x86/kvm/vmx.c | 18 ++ arch/x86/kvm/x86.c | 18 ++ arch/x86/kvm/x86.h | 2 +- 5 files changed, 35 insertions(+), 29 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 09155d6..c06f0d1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -621,7 +621,7 @@ struct kvm_x86_ops { void (*set_guest_debug)(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg); int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata); - int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); + int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data, bool guest_initiated); u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg); void (*get_segment)(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); @@ -684,7 +684,7 @@ struct kvm_x86_ops { bool (*has_wbinvd_exit)(void); void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale); - void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset); + void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset, bool guest_initiated); u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc); u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu); @@ -772,7 +772,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu, void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); +int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data, bool guest_initiated); struct x86_emulate_ctxt; @@ -799,7 +799,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l); int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr); int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata); -int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data); +int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool guest_initiated); unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu); void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index baead95..424be27 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1012,7 +1012,8 @@ static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale) svm-tsc_ratio = ratio; } -static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) +static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset, + bool guest_initiated) { struct vcpu_svm *svm = to_svm(vcpu); u64 g_tsc_offset = 0; @@ -1255,7 +1256,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) svm-vmcb_pa = page_to_pfn(page) PAGE_SHIFT; svm-asid_generation = 0; init_vmcb(svm); - kvm_write_tsc(svm-vcpu, 0); + kvm_write_tsc(svm-vcpu, 0, false /*Not Guest Initiated*/); err = fx_init(svm-vcpu); if (err) @@ -3147,13 +3148,14 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 data) return 0; } -static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data) +static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data, + bool guest_initiated) { struct vcpu_svm *svm = to_svm(vcpu); switch (ecx) { case MSR_IA32_TSC: - kvm_write_tsc(vcpu, data); + kvm_write_tsc(vcpu, data, guest_initiated); break; case MSR_STAR: svm-vmcb-save.star = data; @@ -3208,12 +3210,12 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data) vcpu_unimpl(vcpu, unimplemented wrmsr: 0x%x data 0x%llx\n, ecx, data); break; default: - return kvm_set_msr_common(vcpu, ecx, data); + return kvm_set_msr_common(vcpu, ecx, data, guest_initiated); } return 0; } -static int wrmsr_interception(struct vcpu_svm *svm) +static int wrmsr_interception(struct vcpu_svm *svm, bool guest_initiated) { u32 ecx = svm-vcpu.arch.regs[VCPU_REGS_RCX]; u64 data = (svm-vcpu.arch.regs[VCPU_REGS_RAX] -1u) @@ -3221,7 +3223,7 @@ static int wrmsr_interception(struct vcpu_svm *svm) svm-next_rip = kvm_rip_read(svm-vcpu) + 2; - if (svm_set_msr(svm-vcpu, ecx, data)) { + if (svm_set_msr(svm-vcpu, ecx, data, guest_initiated)) {
[PATCH RFC V2 4/5] Add a timer to allow the separation of consigned from steal time.
Add a timer to the host. This will define the period. During a period the first n ticks will go into the consigned bucket. Any other ticks that occur within the period will be placed in the stealtime bucket. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- arch/x86/include/asm/kvm_host.h | 10 + arch/x86/include/asm/paravirt.h |2 +- arch/x86/kvm/x86.c | 42 ++- 3 files changed, 52 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index bd4e412..d700850 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -41,6 +41,8 @@ #define KVM_PIO_PAGE_OFFSET 1 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2 +#define KVM_STEAL_TIMER_DELAY 1UL + #define CR0_RESERVED_BITS \ (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \ | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \ @@ -339,6 +341,14 @@ struct kvm_vcpu_arch { bool tpr_access_reporting; /* +* timer used to determine if the time should be counted as +* steal time or consigned time. +*/ + struct hrtimer steal_timer; + u64 current_consigned; + u64 consigned_limit; + + /* * Paging state of the vcpu * * If the vcpu runs in guest mode with two level paging this still saves diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index d39e8d0..6db79f9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,7 +196,7 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) +static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 801cfa8..469e748 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1535,13 +1535,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) static void accumulate_steal_time(struct kvm_vcpu *vcpu) { u64 delta; + u64 steal_delta; + u64 consigned_delta; if (!(vcpu-arch.st.msr_val KVM_MSR_ENABLED)) return; delta = current-sched_info.run_delay - vcpu-arch.st.last_steal; vcpu-arch.st.last_steal = current-sched_info.run_delay; - vcpu-arch.st.accum_steal = delta; + + /* split the delta into steal and consigned */ + if (vcpu-arch.current_consigned vcpu-arch.consigned_limit) { + vcpu-arch.current_consigned += delta; + if (vcpu-arch.current_consigned vcpu-arch.consigned_limit) { + steal_delta = vcpu-arch.current_consigned + - vcpu-arch.consigned_limit; + consigned_delta = delta - steal_delta; + } else { + consigned_delta = delta; + steal_delta = 0; + } + } else { + consigned_delta = 0; + steal_delta = delta; + } + vcpu-arch.st.accum_steal = steal_delta; + vcpu-arch.st.accum_consigned = consigned_delta; } static void record_steal_time(struct kvm_vcpu *vcpu) @@ -6187,11 +6206,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu) return irqchip_in_kernel(vcpu-kvm) == (vcpu-arch.apic != NULL); } +enum hrtimer_restart steal_timer_fn(struct hrtimer *data) +{ + struct kvm_vcpu *vcpu; + ktime_t now; + + vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer); + vcpu-arch.current_consigned = 0; + now = ktime_get(); + hrtimer_forward(vcpu-arch.steal_timer, now, + ktime_set(0, KVM_STEAL_TIMER_DELAY)); + return HRTIMER_RESTART; +} + int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) { struct page *page; struct kvm *kvm; int r; + ktime_t ktime; BUG_ON(vcpu-kvm == NULL); kvm = vcpu-kvm; @@ -6234,6 +6267,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) kvm_async_pf_hash_reset(vcpu); kvm_pmu_init(vcpu); + /* Initialize and start a timer to capture steal and consigned time */ + hrtimer_init(vcpu-arch.steal_timer, CLOCK_MONOTONIC, + HRTIMER_MODE_REL); + vcpu-arch.steal_timer.function = steal_timer_fn; + ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY); + hrtimer_start(vcpu-arch.steal_timer, ktime, HRTIMER_MODE_REL); return 0; fail_free_mce_banks: @@ -6252,6 +6291,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { int idx; + hrtimer_cancel(vcpu-arch.steal_timer); kvm_pmu_destroy(vcpu);
[PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.
In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. TODO: * Change native_clock to take params and not return a value * Change update_rq_clock_task Changes from V1: * Removed the steal time allowed percentage from the guest * Moved the separation of consigned (expected steal) and steal time to the host. * No longer include a sysctl interface. --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. arch/x86/include/asm/kvm_host.h | 11 + arch/x86/include/asm/kvm_para.h |3 ++ arch/x86/include/asm/paravirt.h |4 ++- arch/x86/kernel/kvm.c |8 ++ arch/x86/kvm/x86.c | 50 ++- fs/proc/stat.c |9 +-- include/linux/kernel_stat.h |2 ++ include/linux/kvm.h |2 ++ include/linux/kvm_host.h|2 ++ kernel/sched/cputime.c | 21 +++- kernel/sched/sched.h|2 ++ virt/kvm/kvm_main.c |7 + 12 files changed, 108 insertions(+), 13 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC V2 2/5] Expand the steal time msr to also contain the consigned time.
Add a consigned field. This field will hold the time lost due to capping or overcommit. The rest of the time will still show up in the steal-time field. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/kernel/kvm.c |7 ++- kernel/sched/cputime.c |2 +- 3 files changed, 5 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a0facf3..a5f9f30 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal) { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index c1d61ee..91b3b2a 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -368,9 +368,8 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static u64 kvm_steal_clock(int cpu) +static u64 kvm_steal_clock(int cpu, u64 *steal) { - u64 steal; struct kvm_steal_time *src; int version; @@ -378,11 +377,9 @@ static u64 kvm_steal_clock(int cpu) do { version = src-version; rmb(); - steal = src-steal; + *steal = src-steal; rmb(); } while ((version 1) || (version != src-version)); - - return steal; } void kvm_disable_steal_time(void) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 81b763b..dd3fd46 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void) if (static_key_false(paravirt_steal_enabled)) { u64 steal, st = 0; - steal = paravirt_steal_clock(smp_processor_id()); + paravirt_steal_clock(smp_processor_id(), steal); steal -= this_rq()-prev_steal_time; st = steal_ticks(steal); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: KVM_MAX_VCPUS
For pCPU/core and VCPUS/logical cpu mapping, It should be 8 multiple. 254 is reasonable. Or something I miss? -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Vinod, Chegu Sent: Sunday, October 14, 2012 9:43 PM To: Gleb Natapov Cc: Sasha Levin; KVM Subject: Re: KVM_MAX_VCPUS On 10/14/2012 2:08 AM, Gleb Natapov wrote: On Sat, Oct 13, 2012 at 10:32:13PM -0400, Sasha Levin wrote: On 10/13/2012 06:29 PM, Chegu Vinod wrote: Hello, Wanted to get a clarification about KVM_MAX_VCPUS(currently set to 254) in kvm_host.h file. The kvm_vcpu *vcpus array is sized based on KVM_MAX_VCPUS. (i.e. a max of 254 elements in the array). An 8bit APIC id should allow for 256 ID's. Reserving one for Broadcast should leave 255 ID's. Is there one more ID reserved for some other purpose ? (hence leading to KVM_MAX_VCPUS being set to 254 and not 255). Another ID goes to the IO-APIC. This is not really needed on KVM. We can enlarge KVM_MAX_VCPUS to 255. Thanks for clarification! ( We did suspect the IO-APIC...but weren't quite sure). Vinod -- Gleb. . -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] KVM: VMX: report internal error for the unhandleable event
VM exits during Event Delivery is really unexpected if it is not caused by Exceptions/EPT-VIOLATION/TASK_SWITCH, we'd better to report an internal and freeze the guest, the VMM has the chance to check the guest Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/vmx.c | 19 +++ include/linux/kvm.h |8 ++-- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index ad6b1dd..b8a0841 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5979,13 +5979,24 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) return 0; } + /* +* Note: +* Do not try to fix EXIT_REASON_EPT_MISCONFIG if it caused by +* delivery event since it indicates guest is accessing MMIO. +* The vm-exit can be triggered again after return to guest that +* will cause infinite loop. +*/ if ((vectoring_info VECTORING_INFO_VALID_MASK) (exit_reason != EXIT_REASON_EXCEPTION_NMI exit_reason != EXIT_REASON_EPT_VIOLATION - exit_reason != EXIT_REASON_TASK_SWITCH)) - printk(KERN_WARNING %s: unexpected, valid vectoring info - (0x%x) and exit reason is 0x%x\n, - __func__, vectoring_info, exit_reason); + exit_reason != EXIT_REASON_TASK_SWITCH)) { + vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR; + vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_DELIVERY_EV; + vcpu-run-internal.ndata = 2; + vcpu-run-internal.data[0] = vectoring_info; + vcpu-run-internal.data[1] = exit_reason; + return 0; + } if (unlikely(!cpu_has_virtual_nmis() vmx-soft_vnmi_blocked !(is_guest_mode(vcpu) nested_cpu_has_virtual_nmis( diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 65ad5c6..494a84c 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -170,8 +170,12 @@ struct kvm_pit_config { #define KVM_EXIT_WATCHDOG 21 /* For KVM_EXIT_INTERNAL_ERROR */ -#define KVM_INTERNAL_ERROR_EMULATION 1 -#define KVM_INTERNAL_ERROR_SIMUL_EX 2 +/* Emulate instruction failed. */ +#define KVM_INTERNAL_ERROR_EMULATION 1 +/* Encounter unexpected simultaneous exceptions. */ +#define KVM_INTERNAL_ERROR_SIMUL_EX2 +/* Encounter unexpected vm-exit due to delivery event. */ +#define KVM_INTERNAL_ERROR_DELIVERY_EV 3 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */ struct kvm_run { -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] KVM: VMX: report internal error for MMIO #PF due to delivery event
The #PF with PFEC.RSV = 1 indicates that the guest is accessing MMIO, we can not fix it if it is caused by delivery event. Reporting internal error for this case Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/vmx.c | 26 -- 1 files changed, 16 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index b8a0841..6599e45 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4287,16 +4287,6 @@ static int handle_exception(struct kvm_vcpu *vcpu) if (is_machine_check(intr_info)) return handle_machine_check(vcpu); - if ((vect_info VECTORING_INFO_VALID_MASK) - !is_page_fault(intr_info)) { - vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR; - vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_SIMUL_EX; - vcpu-run-internal.ndata = 2; - vcpu-run-internal.data[0] = vect_info; - vcpu-run-internal.data[1] = intr_info; - return 0; - } - if ((intr_info INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR) return 1; /* already handled by vmx_vcpu_run() */ @@ -4315,6 +4305,22 @@ static int handle_exception(struct kvm_vcpu *vcpu) error_code = 0; if (intr_info INTR_INFO_DELIVER_CODE_MASK) error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE); + + /* +* The #PF with PFEC.RSVD = 1 indicates the guest is accessing +* MMIO, it is better to report an internal error. +* See the comments in vmx_handle_exit. +*/ + if ((vect_info VECTORING_INFO_VALID_MASK) + !(is_page_fault(intr_info) !(error_code PFERR_RSVD_MASK))) { + vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR; + vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_SIMUL_EX; + vcpu-run-internal.ndata = 2; + vcpu-run-internal.data[0] = vect_info; + vcpu-run-internal.data[1] = intr_info; + return 0; + } + if (is_page_fault(intr_info)) { /* EPT won't cause page fault directly */ BUG_ON(enable_ept); -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm/powerpc: Handle errors in secondary thread grabbing
On Tue, 2012-10-16 at 14:13 +1100, Paul Mackerras wrote: Michael, On Tue, Oct 16, 2012 at 11:15:50AM +1100, Michael Ellerman wrote: In the Book3s HV code, kvmppc_run_core() has logic to grab the secondary threads of the physical core. If for some reason a thread is stuck, kvmppc_grab_hwthread() can fail, but currently we ignore the failure and continue into the guest. If the stuck thread is in the kernel badness ensues. Instead we should check for failure and bail out. I've moved the grabbing prior to the startup of runnable threads, to simplify the error case. AFAICS this is harmless, but I could be missing something subtle. Thanks for looking at this - but in fact this is fixed by my patch entitled KVM: PPC: Book3S HV: Fix some races in starting secondary threads submitted back on August 28. OK thanks. It seems that patch didn't make 3.7 ? I don't see it in kvm-ppc-next either. cheers -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/8] Various Book3s HV fixes that haven't been picked up yet
On 16.10.2012, at 05:08, Paul Mackerras pau...@samba.org wrote: On Mon, Oct 15, 2012 at 02:00:54PM +0200, Alexander Graf wrote: Sorry, I can't accept patches that haven't shown up on kvm@vger. Please send this patch set again with CC to kvm@vger. Done; I didn't cc kvm-ppc this time since the patches haven't changed. Thanks :) By the way, what is the purpose of kvm-ppc@vger.kernel.org? It's basically a tag so people can filter based on it. For me, mails to kvm-ppc@vger land straight in my inbox, while mails to kvm@vger go to a folder I only read occasionally. Alex Paul. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm
On 10/16/2012 05:59 AM, Paul Mackerras wrote: The mmu_notifier_retry() function, used to test whether any page invalidations are in progress, currently takes a vcpu pointer, though the code only needs the VM's struct kvm pointer. Forthcoming patches to the powerpc Book3S HV code will need to test for retry within a VM ioctl, where a struct kvm pointer is available but a struct vcpu pointer isn't. Therefore this creates a variant of mmu_notifier_retry called kvm_mmu_notifier_retry that takes a struct kvm pointer, and implements mmu_notifier_retry in terms of it. Why not change mmu_notifier_retry() and all its callers? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm
On 16.10.2012, at 11:44, Avi Kivity wrote: On 10/16/2012 05:59 AM, Paul Mackerras wrote: The mmu_notifier_retry() function, used to test whether any page invalidations are in progress, currently takes a vcpu pointer, though the code only needs the VM's struct kvm pointer. Forthcoming patches to the powerpc Book3S HV code will need to test for retry within a VM ioctl, where a struct kvm pointer is available but a struct vcpu pointer isn't. Therefore this creates a variant of mmu_notifier_retry called kvm_mmu_notifier_retry that takes a struct kvm pointer, and implements mmu_notifier_retry in terms of it. Why not change mmu_notifier_retry() and all its callers? Why not use Christoffer's patch? :) Alex -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
On 10/16/2012 06:01 AM, Paul Mackerras wrote: A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor. Reads on this fd return the contents of the HPT (hashed page table), writes create and/or remove entries in the HPT. There is a new capability, KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl. The ioctl takes an argument structure with the index of the first HPT entry to read out and a set of flags. The flags indicate whether the user is intending to read or write the HPT, and whether to return all entries or only the bolted entries (those with the bolted bit, 0x10, set in the first doubleword). This is intended for use in implementing qemu's savevm/loadvm and for live migration. Therefore, on reads, the first pass returns information about all HPTEs (or all bolted HPTEs). When the first pass reaches the end of the HPT, it returns from the read. Subsequent reads only return information about HPTEs that have changed since they were last read. A read that finds no changed HPTEs in the HPT following where the last read finished will return 0 bytes. Copying people with interest in migration. +4.78 KVM_PPC_GET_HTAB_FD + +Capability: KVM_CAP_PPC_HTAB_FD +Architectures: powerpc +Type: vm ioctl +Parameters: Pointer to struct kvm_get_htab_fd (in) +Returns: file descriptor number (= 0) on success, -1 on error + +This returns a file descriptor that can be used either to read out the +entries in the guest's hashed page table (HPT), or to write entries to +initialize the HPT. The returned fd can only be written to if the +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and +can only be read if that bit is clear. The argument struct looks like +this: + +/* For KVM_PPC_GET_HTAB_FD */ +struct kvm_get_htab_fd { + __u64 flags; + __u64 start_index; +}; + +/* Values for kvm_get_htab_fd.flags */ +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) +#define KVM_GET_HTAB_WRITE ((__u64)0x2) + +The `start_index' field gives the index in the HPT of the entry at +which to start reading. It is ignored when writing. + +Reads on the fd will initially supply information about all +interesting HPT entries. Interesting entries are those with the +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise +all entries. When the end of the HPT is reached, the read() will +return. What happens if the read buffer is smaller than the HPT size? What happens if the read buffer size is not a multiple of entry size? Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry to change). If read() is called again on the fd, it will start again from +the beginning of the HPT, but will only return HPT entries that have +changed since they were last read. + +Data read or written is structured as a header (8 bytes) followed by a +series of valid HPT entries (16 bytes) each. The header indicates how +many valid HPT entries there are and how many invalid entries follow +the valid entries. The invalid entries are not represented explicitly +in the stream. The header format is: + +struct kvm_get_htab_header { + __u32 index; + __u16 n_valid; + __u16 n_invalid; +}; This structure forces the kernel to return entries sequentially. Will this block changing the data structure in the future? Or is the hardware spec sufficiently strict that such changes are not realistic? + +Writes to the fd create HPT entries starting at the index given in the +header; first `n_valid' valid entries with contents from the data +written, then `n_invalid' invalid entries, invalidating any previously +valid entries found. This scheme is a clever, original, and very interesting approach to live migration. That doesn't necessarily mean a NAK, we should see if it makes sense for other migration APIs as well (we currently have difficulties migrating very large/wide guests). What is the typical number of entries in the HPT? Do you have estimates of the change rate? Suppose new hardware arrives that supports nesting HPTs, so that kvm is no longer synchronously aware of the guest HPT (similar to how NPT/EPT made kvm unaware of guest virtual-physical translations on x86). How will we deal with that? But I guess this will be a non-guest-transparent and non-userspace-transparent change, unlike NPT/EPT, so a userspace ABI addition will be needed anyway). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 10/16/2012 12:59 PM, Alexander Graf wrote: On 16.10.2012, at 12:56, Avi Kivity wrote: On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this patch set is cleaning up: enable you to build ioeventfd support without in-kernel irqchip irqfd support. Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) in user space. That path works just fine with these patches applied. That's all true but it wasn't my point. I was asking whether you can enable irqfd without in-kernel irqchip. If your irq input is edge triggered then it's trivial. If it's level triggered then you can use the new resampling irqfd. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: Distangle eventfd code from irqchip
On 16.10.2012, at 12:57, Avi Kivity wrote: On 10/15/2012 02:02 PM, Alexander Graf wrote: The current eventfd code assumes that when we have eventfd, we also have irqfd for in-kernel interrupt delivery. This is not necessarily true. On PPC we don't have an in-kernel irqchip yet, but we can still support easily support eventfd. Don't you need, in addition, to reject PIO space for ioeventfd? Yeah, we could tell user space that it's doing something stupid. Not sure how critical it is - right now it would just register it and never get triggered, because the bus doesn't exist. But sanity checks are a good thing usually. Alex -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 16.10.2012, at 13:01, Avi Kivity wrote: On 10/16/2012 12:59 PM, Alexander Graf wrote: On 16.10.2012, at 12:56, Avi Kivity wrote: On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this patch set is cleaning up: enable you to build ioeventfd support without in-kernel irqchip irqfd support. Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) in user space. That path works just fine with these patches applied. That's all true but it wasn't my point. I was asking whether you can enable irqfd without in-kernel irqchip. If your irq input is edge triggered then it's trivial. If it's level triggered then you can use the new resampling irqfd. I'm not sure I fully grasp what you're trying to say :). We have a single interrupt line on the core. So whenever any external interrupt gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it which line is active. So yes, we could create a direct fd channel between vhost and the user space MPIC, but it wouldn't buy us anything. The interrupt injection path would be as long as it is with the current mechanism. Alex -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote: On 10/16/2012 06:01 AM, Paul Mackerras wrote: +4.78 KVM_PPC_GET_HTAB_FD + +Capability: KVM_CAP_PPC_HTAB_FD +Architectures: powerpc +Type: vm ioctl +Parameters: Pointer to struct kvm_get_htab_fd (in) +Returns: file descriptor number (= 0) on success, -1 on error + +This returns a file descriptor that can be used either to read out the +entries in the guest's hashed page table (HPT), or to write entries to +initialize the HPT. The returned fd can only be written to if the +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and +can only be read if that bit is clear. The argument struct looks like +this: + +/* For KVM_PPC_GET_HTAB_FD */ +struct kvm_get_htab_fd { + __u64 flags; + __u64 start_index; +}; + +/* Values for kvm_get_htab_fd.flags */ +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) +#define KVM_GET_HTAB_WRITE ((__u64)0x2) + +The `start_index' field gives the index in the HPT of the entry at +which to start reading. It is ignored when writing. + +Reads on the fd will initially supply information about all +interesting HPT entries. Interesting entries are those with the +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise +all entries. When the end of the HPT is reached, the read() will +return. What happens if the read buffer is smaller than the HPT size? That's fine; the read stops when it has filled the buffer and a subsequent read will continue from where the previous one finished. What happens if the read buffer size is not a multiple of entry size? Then we don't use the last few bytes of the buffer. The read() call returns the number of bytes that were filled in, of course. In any case, the header size is 8 bytes and the HPT entry size is 16 bytes, so the number of bytes filled in won't necessarily be a multiple of 16 bytes. Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry to change). No. If read() is called again on the fd, it will start again from +the beginning of the HPT, but will only return HPT entries that have +changed since they were last read. + +Data read or written is structured as a header (8 bytes) followed by a +series of valid HPT entries (16 bytes) each. The header indicates how +many valid HPT entries there are and how many invalid entries follow +the valid entries. The invalid entries are not represented explicitly +in the stream. The header format is: + +struct kvm_get_htab_header { + __u32 index; + __u16 n_valid; + __u16 n_invalid; +}; This structure forces the kernel to return entries sequentially. Will this block changing the data structure in the future? Or is the hardware spec sufficiently strict that such changes are not realistic? By data structure, do you mean the stream format on the file descriptor, or the HPT structure? If we want a new stream format, then we would define a bit in the flags field of struct kvm_get_htab_fd to mean I want the new stream format. The code fails the ioctl if any unknown flag bits are set, so a new userspace that wants to use the new format could then detect that it is running on an old kernel and fall back to the old format. The HPT entry format is very unlikely to change in size or basic layout (though the architects do redefine some of the bits occasionally). + +Writes to the fd create HPT entries starting at the index given in the +header; first `n_valid' valid entries with contents from the data +written, then `n_invalid' invalid entries, invalidating any previously +valid entries found. This scheme is a clever, original, and very interesting approach to live migration. That doesn't necessarily mean a NAK, we should see if it makes sense for other migration APIs as well (we currently have difficulties migrating very large/wide guests). What is the typical number of entries in the HPT? Do you have estimates of the change rate? Typically the HPT would have about a million entries, i.e. it would be 16MiB in size. The usual guideline is to make it about 1/64 of the maximum amount of RAM the guest could ever have, rounded up to a power of two, although we often run with less, say 1/128 or even 1/256. Because it is a hash table, updates tend to be scattered throughout the whole table, which is another reason why per-page dirty tracking and updates would be pretty inefficient. As for the change rate, it depends on the application of course, but basically every time the guest changes a PTE in its Linux page tables we do the corresponding change to the corresponding HPT entry, so the rate can be quite high. Workloads that do a lot of fork, exit, mmap, exec, etc. have a high rate of HPT updates. Suppose new hardware arrives that supports nesting HPTs, so that kvm is no longer synchronously aware of the guest HPT (similar to how NPT/EPT made
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
On 10/16/2012 01:58 PM, Paul Mackerras wrote: On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote: On 10/16/2012 06:01 AM, Paul Mackerras wrote: +4.78 KVM_PPC_GET_HTAB_FD + +Capability: KVM_CAP_PPC_HTAB_FD +Architectures: powerpc +Type: vm ioctl +Parameters: Pointer to struct kvm_get_htab_fd (in) +Returns: file descriptor number (= 0) on success, -1 on error + +This returns a file descriptor that can be used either to read out the +entries in the guest's hashed page table (HPT), or to write entries to +initialize the HPT. The returned fd can only be written to if the +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and +can only be read if that bit is clear. The argument struct looks like +this: + +/* For KVM_PPC_GET_HTAB_FD */ +struct kvm_get_htab_fd { + __u64 flags; + __u64 start_index; +}; + +/* Values for kvm_get_htab_fd.flags */ +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) +#define KVM_GET_HTAB_WRITE((__u64)0x2) + +The `start_index' field gives the index in the HPT of the entry at +which to start reading. It is ignored when writing. + +Reads on the fd will initially supply information about all +interesting HPT entries. Interesting entries are those with the +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise +all entries. When the end of the HPT is reached, the read() will +return. What happens if the read buffer is smaller than the HPT size? That's fine; the read stops when it has filled the buffer and a subsequent read will continue from where the previous one finished. What happens if the read buffer size is not a multiple of entry size? Then we don't use the last few bytes of the buffer. The read() call returns the number of bytes that were filled in, of course. In any case, the header size is 8 bytes and the HPT entry size is 16 bytes, so the number of bytes filled in won't necessarily be a multiple of 16 bytes. That's sane and expected, but it should be documented. Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry to change). No. This forces userspace to dedicate a thread for the HPT. If read() is called again on the fd, it will start again from +the beginning of the HPT, but will only return HPT entries that have +changed since they were last read. + +Data read or written is structured as a header (8 bytes) followed by a +series of valid HPT entries (16 bytes) each. The header indicates how +many valid HPT entries there are and how many invalid entries follow +the valid entries. The invalid entries are not represented explicitly +in the stream. The header format is: + +struct kvm_get_htab_header { + __u32 index; + __u16 n_valid; + __u16 n_invalid; +}; This structure forces the kernel to return entries sequentially. Will this block changing the data structure in the future? Or is the hardware spec sufficiently strict that such changes are not realistic? By data structure, do you mean the stream format on the file descriptor, or the HPT structure? If we want a new stream format, then we would define a bit in the flags field of struct kvm_get_htab_fd to mean I want the new stream format. The code fails the ioctl if any unknown flag bits are set, so a new userspace that wants to use the new format could then detect that it is running on an old kernel and fall back to the old format. The HPT entry format is very unlikely to change in size or basic layout (though the architects do redefine some of the bits occasionally). I meant the internal data structure that holds HPT entries. I guess I don't understand the index. Do we expect changes to be in contiguous ranges? And invalid entries to be contiguous as well? That doesn't fit with how hash tables work. Does the index represent the position of the entry within the table, or something else? + +Writes to the fd create HPT entries starting at the index given in the +header; first `n_valid' valid entries with contents from the data +written, then `n_invalid' invalid entries, invalidating any previously +valid entries found. This scheme is a clever, original, and very interesting approach to live migration. That doesn't necessarily mean a NAK, we should see if it makes sense for other migration APIs as well (we currently have difficulties migrating very large/wide guests). What is the typical number of entries in the HPT? Do you have estimates of the change rate? Typically the HPT would have about a million entries, i.e. it would be 16MiB in size. The usual guideline is to make it about 1/64 of the maximum amount of RAM the guest could ever have, rounded up to a power of two, although we often run with less, say 1/128 or even 1/256. 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE. Does it warrant a live migration protocol? Because it is
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 10/16/2012 01:06 PM, Alexander Graf wrote: On 16.10.2012, at 13:01, Avi Kivity wrote: On 10/16/2012 12:59 PM, Alexander Graf wrote: On 16.10.2012, at 12:56, Avi Kivity wrote: On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this patch set is cleaning up: enable you to build ioeventfd support without in-kernel irqchip irqfd support. Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) in user space. That path works just fine with these patches applied. That's all true but it wasn't my point. I was asking whether you can enable irqfd without in-kernel irqchip. If your irq input is edge triggered then it's trivial. If it's level triggered then you can use the new resampling irqfd. I'm not sure I fully grasp what you're trying to say :). We have a single interrupt line on the core. So whenever any external interrupt gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it which line is active. Couldn't you attach that payload to the irqfd? On x86 an irqfd is associated with a gsi, and a gsi with extra information, including all that is needed to queue an MSI. So yes, we could create a direct fd channel between vhost and the user space MPIC, but it wouldn't buy us anything. The interrupt injection path would be as long as it is with the current mechanism. If there is a lot of prioritization and/or queuing logic, then yes. But what about MSI? Doesn't that have a direct path? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] KVM: PPC: Support ioeventfd
On 10/16/2012 03:47 PM, Avi Kivity wrote: On 10/16/2012 01:06 PM, Alexander Graf wrote: On 16.10.2012, at 13:01, Avi Kivity wrote: On 10/16/2012 12:59 PM, Alexander Graf wrote: On 16.10.2012, at 12:56, Avi Kivity wrote: On 10/15/2012 02:02 PM, Alexander Graf wrote: In order to support vhost, we need to be able to support ioeventfd. This patch set adds support for ioeventfd to PPC and makes it possible to do so without implementing irqfd along the way, as it requires an in-kernel irqchip which we don't have yet. It's not strictly required. You have an interrupt line leading to the core, no? You could have your irqfd trigger that. The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this patch set is cleaning up: enable you to build ioeventfd support without in-kernel irqchip irqfd support. Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) in user space. That path works just fine with these patches applied. That's all true but it wasn't my point. I was asking whether you can enable irqfd without in-kernel irqchip. If your irq input is edge triggered then it's trivial. If it's level triggered then you can use the new resampling irqfd. I'm not sure I fully grasp what you're trying to say :). We have a single interrupt line on the core. So whenever any external interrupt gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it which line is active. Couldn't you attach that payload to the irqfd? On x86 an irqfd is associated with a gsi, and a gsi with extra information, including all that is needed to queue an MSI. So yes, we could create a direct fd channel between vhost and the user space MPIC, but it wouldn't buy us anything. The interrupt injection path would be as long as it is with the current mechanism. If there is a lot of prioritization and/or queuing logic, then yes. But what about MSI? Doesn't that have a direct path? Nope. Well, yes, in a certain special case where the MPIC pushes the interrupt vector on interrupt delivery into a special register. But not for the normal case. Alex -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm/powerpc: Handle errors in secondary thread grabbing
On Tue, 2012-10-16 at 17:00 +1100, Michael Ellerman wrote: Thanks for looking at this - but in fact this is fixed by my patch entitled KVM: PPC: Book3S HV: Fix some races in starting secondary threads submitted back on August 28. OK thanks. It seems that patch didn't make 3.7 ? I don't see it in kvm-ppc-next either. Alex, WTF ? Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
Avi Kivity a...@redhat.com writes: On 10/16/2012 01:58 PM, Paul Mackerras wrote: On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote: On 10/16/2012 06:01 AM, Paul Mackerras wrote: +4.78 KVM_PPC_GET_HTAB_FD + +Capability: KVM_CAP_PPC_HTAB_FD +Architectures: powerpc +Type: vm ioctl +Parameters: Pointer to struct kvm_get_htab_fd (in) +Returns: file descriptor number (= 0) on success, -1 on error + +This returns a file descriptor that can be used either to read out the +entries in the guest's hashed page table (HPT), or to write entries to +initialize the HPT. The returned fd can only be written to if the +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and +can only be read if that bit is clear. The argument struct looks like +this: + +/* For KVM_PPC_GET_HTAB_FD */ +struct kvm_get_htab_fd { + __u64 flags; + __u64 start_index; +}; + +/* Values for kvm_get_htab_fd.flags */ +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) +#define KVM_GET_HTAB_WRITE ((__u64)0x2) + +The `start_index' field gives the index in the HPT of the entry at +which to start reading. It is ignored when writing. + +Reads on the fd will initially supply information about all +interesting HPT entries. Interesting entries are those with the +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise +all entries. When the end of the HPT is reached, the read() will +return. What happens if the read buffer is smaller than the HPT size? That's fine; the read stops when it has filled the buffer and a subsequent read will continue from where the previous one finished. What happens if the read buffer size is not a multiple of entry size? Then we don't use the last few bytes of the buffer. The read() call returns the number of bytes that were filled in, of course. In any case, the header size is 8 bytes and the HPT entry size is 16 bytes, so the number of bytes filled in won't necessarily be a multiple of 16 bytes. That's sane and expected, but it should be documented. Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry to change). No. This forces userspace to dedicate a thread for the HPT. If no changes are available, does read return a size 0? I don't think it's necessary to support polling. The kernel should always be able to respond to userspace here. The only catch is whether to return !0 read sizes when there are no changes. At any case, I can't see why a dedicated thread is needed. QEMU is going to poll HPT based on how fast we can send data over the wire. If read() is called again on the fd, it will start again from +the beginning of the HPT, but will only return HPT entries that have +changed since they were last read. + +Data read or written is structured as a header (8 bytes) followed by a +series of valid HPT entries (16 bytes) each. The header indicates how +many valid HPT entries there are and how many invalid entries follow +the valid entries. The invalid entries are not represented explicitly +in the stream. The header format is: + +struct kvm_get_htab_header { + __u32 index; + __u16 n_valid; + __u16 n_invalid; +}; This structure forces the kernel to return entries sequentially. Will this block changing the data structure in the future? Or is the hardware spec sufficiently strict that such changes are not realistic? By data structure, do you mean the stream format on the file descriptor, or the HPT structure? If we want a new stream format, then we would define a bit in the flags field of struct kvm_get_htab_fd to mean I want the new stream format. The code fails the ioctl if any unknown flag bits are set, so a new userspace that wants to use the new format could then detect that it is running on an old kernel and fall back to the old format. The HPT entry format is very unlikely to change in size or basic layout (though the architects do redefine some of the bits occasionally). I meant the internal data structure that holds HPT entries. I guess I don't understand the index. Do we expect changes to be in contiguous ranges? And invalid entries to be contiguous as well? That doesn't fit with how hash tables work. Does the index represent the position of the entry within the table, or something else? + +Writes to the fd create HPT entries starting at the index given in the +header; first `n_valid' valid entries with contents from the data +written, then `n_invalid' invalid entries, invalidating any previously +valid entries found. This scheme is a clever, original, and very interesting approach to live migration. That doesn't necessarily mean a NAK, we should see if it makes sense for other migration APIs as well (we currently have difficulties migrating very large/wide guests). What is the typical number of entries in the HPT? Do you have estimates of the
Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
On Tue, Oct 16, 2012 at 03:06:33PM +0200, Avi Kivity wrote: On 10/16/2012 01:58 PM, Paul Mackerras wrote: On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote: Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry to change). No. This forces userspace to dedicate a thread for the HPT. Why? Reads never block in any case. If read() is called again on the fd, it will start again from +the beginning of the HPT, but will only return HPT entries that have +changed since they were last read. + +Data read or written is structured as a header (8 bytes) followed by a +series of valid HPT entries (16 bytes) each. The header indicates how +many valid HPT entries there are and how many invalid entries follow +the valid entries. The invalid entries are not represented explicitly +in the stream. The header format is: + +struct kvm_get_htab_header { +__u32 index; +__u16 n_valid; +__u16 n_invalid; +}; This structure forces the kernel to return entries sequentially. Will this block changing the data structure in the future? Or is the hardware spec sufficiently strict that such changes are not realistic? By data structure, do you mean the stream format on the file descriptor, or the HPT structure? If we want a new stream format, then we would define a bit in the flags field of struct kvm_get_htab_fd to mean I want the new stream format. The code fails the ioctl if any unknown flag bits are set, so a new userspace that wants to use the new format could then detect that it is running on an old kernel and fall back to the old format. The HPT entry format is very unlikely to change in size or basic layout (though the architects do redefine some of the bits occasionally). I meant the internal data structure that holds HPT entries. Oh, that's just an array, and userspace already knows how big it is. I guess I don't understand the index. Do we expect changes to be in contiguous ranges? And invalid entries to be contiguous as well? That doesn't fit with how hash tables work. Does the index represent the position of the entry within the table, or something else? The index is just the position in the array. Typically, in each group of 8 it will tend to be the low-numbered ones that are valid, since creating an entry usually uses the first empty slot. So I expect that on the first pass, most of the records will represent 8 HPTEs. On subsequent passes, probably most records will represent a single HPTE. + +Writes to the fd create HPT entries starting at the index given in the +header; first `n_valid' valid entries with contents from the data +written, then `n_invalid' invalid entries, invalidating any previously +valid entries found. This scheme is a clever, original, and very interesting approach to live migration. That doesn't necessarily mean a NAK, we should see if it makes sense for other migration APIs as well (we currently have difficulties migrating very large/wide guests). What is the typical number of entries in the HPT? Do you have estimates of the change rate? Typically the HPT would have about a million entries, i.e. it would be 16MiB in size. The usual guideline is to make it about 1/64 of the maximum amount of RAM the guest could ever have, rounded up to a power of two, although we often run with less, say 1/128 or even 1/256. 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE. Does it warrant a live migration protocol? The qemu people I talked to seemed to think so. Because it is a hash table, updates tend to be scattered throughout the whole table, which is another reason why per-page dirty tracking and updates would be pretty inefficient. This suggests a stream format that includes the index in every entry. That would amount to dropping the n_valid and n_invalid fields from the current header format. That would be less efficient for the initial pass (assuming we achieve an average n_valid of at least 2 on the initial pass), and probably less efficient for the incremental updates, since a newly-invalidated entry would have to be represented as 16 zero bytes rather than just an 8-byte header with n_valid=0 and n_invalid=1. I'm assuming here that the initial pass would omit invalid entries. As for the change rate, it depends on the application of course, but basically every time the guest changes a PTE in its Linux page tables we do the corresponding change to the corresponding HPT entry, so the rate can be quite high. Workloads that do a lot of fork, exit, mmap, exec, etc. have a high rate of HPT updates. If the rate is high enough, then there's no point in a live update. True, but doesn't that argument apply to memory pages as well? Paul. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to