date:20121016

Re: [PATCH] vfio-pci: Add KVM INTx acceleration

2012-10-16 Thread Michael S. Tsirkin

On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote:
 This makes use of the new level irqfd support enabling bypass of
 qemu userspace both on INTx injection and unmask.  This significantly
 boosts the performance of devices making use of legacy interrupts.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
 My INTx routing workaround below will probably raise some eyebrows,
 but I don't feel it's worth subjecting users to core dumps if they
 want to try vfio-pci on new platforms.  INTx routing is part of some
 larger plan, but until that plan materializes we have to try to avoid
 the API unless we think there's a good chance it might be there.
 I'll accept the maintenance of updating a whitelist in the interim.
 Thanks,
 
 Alex
 
  hw/vfio_pci.c |  224 
 +
  1 file changed, 224 insertions(+)
 
 diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
 index 639371e..777a5f8 100644
 --- a/hw/vfio_pci.c
 +++ b/hw/vfio_pci.c
 @@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, 
 uint32_t addr, int len);
  static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled);
  
  /*
 + * PCI code refuses to make it possible to probe whether the chipset
 + * supports pci_device_route_intx_to_irq() and booby traps the call
 + * to assert if doesn't.  For us, this is just an optimization, so
 + * only enable it when we know it's present.  Unfortunately PCIBus is
 + * private, so we can't just look at the function pointer.
 + */
 +static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev)
 +{
 +#ifdef CONFIG_KVM
 +BusState *bus = qdev_get_parent_bus(pdev-qdev);
 +DeviceState *dev;
 +
 +if (!kvm_irqchip_in_kernel() ||
 +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
 + return false;
 +}


Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE?
Also for KVM_IRQFD_FLAG_RESAMPLE.

 +
 +for (; bus-parent; bus = qdev_get_parent_bus(dev)) {
 +
 +dev = bus-parent;
 +
 +if (!strncmp(i440FX-pcihost, object_get_typename(OBJECT(dev)), 
 14)) {
 +return true;
 +}
 +}
 +
 +error_report(vfio-pci: VM chipset does not support INTx routing, 
 + using slow INTx mode\n);

When does this code trigger? It seems irqchip implies piix ATM -
is this just dead code?


 +#endif
 +return false;
 +}
 +
 +static PCIINTxRoute vfio_pci_device_route_intx_to_irq(PCIDevice *pdev, int 
 pin)
 +{
 +if (!vfio_pci_bus_has_intx_route(pdev)) {
 +return (PCIINTxRoute) { .mode = PCI_INTX_DISABLED, .irq = -1 };
 +}
 +
 +return pci_device_route_intx_to_irq(pdev, pin);
 +}
 +
 +static bool vfio_pci_intx_route_changed(PCIINTxRoute *old, PCIINTxRoute *new)
 +{
 +return old-mode != new-mode || old-irq != new-irq;
 +}
 +

Didn't you add an API for this? It's on pci branch but I can drop
it if not needed.

 +/*
   * Common VFIO interrupt disable
   */
  static void vfio_disable_irqindex(VFIODevice *vdev, int index)
 @@ -185,6 +232,21 @@ static void vfio_unmask_intx(VFIODevice *vdev)
  ioctl(vdev-fd, VFIO_DEVICE_SET_IRQS, irq_set);
  }
  
 +#ifdef CONFIG_KVM
 +static void vfio_mask_intx(VFIODevice *vdev)
 +{
 +struct vfio_irq_set irq_set = {
 +.argsz = sizeof(irq_set),
 +.flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK,
 +.index = VFIO_PCI_INTX_IRQ_INDEX,
 +.start = 0,
 +.count = 1,
 +};
 +
 +ioctl(vdev-fd, VFIO_DEVICE_SET_IRQS, irq_set);
 +}
 +#endif
 +
  /*
   * Disabling BAR mmaping can be slow, but toggling it around INTx can
   * also be a huge overhead.  We try to get the best of both worlds by
 @@ -248,6 +310,161 @@ static void vfio_eoi(VFIODevice *vdev)
  vfio_unmask_intx(vdev);
  }
  
 +static void vfio_enable_intx_kvm(VFIODevice *vdev)
 +{
 +#ifdef CONFIG_KVM
 +struct kvm_irqfd irqfd = {
 +.fd = event_notifier_get_fd(vdev-intx.interrupt),
 +.gsi = vdev-intx.route.irq,
 +.flags = KVM_IRQFD_FLAG_RESAMPLE,


Should not kvm ioctl handling be localized in kvm-all.c?
E.g. extend kvm_irqchip_add_irqfd_notifier in
some way? Same question for KVM_CAP_IRQFD_RESAMPLE use above ...


 +};
 +struct vfio_irq_set *irq_set;
 +int ret, argsz;
 +int32_t *pfd;
 +
 +if (!kvm_irqchip_in_kernel() ||
 +vdev-intx.route.mode != PCI_INTX_ENABLED ||
 +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
 +return;
 +}
 +
 +/* Get to a known interrupt state */
 +qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
 +vfio_mask_intx(vdev);
 +vdev-intx.pending = false;
 +qemu_set_irq(vdev-pdev.irq[vdev-intx.pin], 0);
 +
 +/* Get an eventfd for resample/unmask */
 +if (event_notifier_init(vdev-intx.unmask, 0)) {
 +error_report(vfio: Error: event_notifier_init failed eoi\n);
 +goto fail;
 +}
 +
 +/* KVM triggers it, VFIO listens for it */
 +irqfd.resamplefd =

[Patch]KVM: enabling per domain PLE

2012-10-16 Thread Hu, Xuekun

Setting the same PLE parameter arbitrarily for different
workloads is not a good solution. The solution enables
per domain PLE which gives user ability to set PLE parameter
for different domain for better performance.

Signed-off-by: Xuekun Hu xuekun...@intel.com
---
 arch/x86/include/asm/kvm_host.h |6 ++
 arch/x86/kvm/svm.c  |   21 +
 arch/x86/kvm/vmx.c  |   32 ++--
 arch/x86/kvm/x86.c  |   15 ++-
 include/linux/kvm.h |3 +++
 5 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b2e11f4..b54f38b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -569,6 +569,8 @@ struct kvm_arch {
#ifdef CONFIG_KVM_MMU_AUDIT
int audit_point;
#endif
+   int ple_gap;
+   int ple_window;
 };
 
 struct kvm_vm_stat {
@@ -621,6 +623,10 @@ struct kvm_x86_ops {
int (*hardware_setup)(void);   /* __init */
void (*hardware_unsetup)(void);/* __exit */
bool (*cpu_has_accelerated_tpr)(void);
+   bool (*cpu_has_ple)(void);
+   void (*ple_setup)(struct kvm *kvm);   /* __init */
+   void (*set_ple_gap)(struct kvm *kvm, int value);
+   void (*set_ple_window)(struct kvm *kvm, int value);
void (*cpuid_update)(struct kvm_vcpu *vcpu);
 
/* Create, but do not attach this VCPU */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d017df3..311ed96 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3976,6 +3976,23 @@ static bool svm_cpu_has_accelerated_tpr(void)
return false;
 }
 
+static bool cpu_has_svm_ple(void)
+{
+   return false;
+}
+
+static void svm_ple_setup(struct kvm *kvm)
+{
+}
+
+static void svm_set_ple_gap(struct kvm *kvm, int value)
+{
+}
+
+static void svm_set_ple_window(struct kvm *kvm, int value)
+{
+}
+
 static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
return 0;
@@ -4229,6 +4246,10 @@ static struct kvm_x86_ops svm_x86_ops = {
.hardware_enable = svm_hardware_enable,
.hardware_disable = svm_hardware_disable,
.cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr,
+   .cpu_has_ple = cpu_has_svm_ple,
+   .ple_setup = svm_ple_setup,
+   .set_ple_gap = svm_set_ple_gap,
+   .set_ple_window = svm_set_ple_window,
 
.vcpu_create = svm_create_vcpu,
.vcpu_free = svm_free_vcpu,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ad6b1dd..b761669 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3805,6 +3805,29 @@ static void ept_set_mmio_spte_mask(void)
kvm_mmu_set_mmio_spte_mask(0xffull  49 | 0x6ull);
 }
 
+static void vmx_ple_setup(struct kvm *kvm)
+{
+   /*
+* set up ple default value
+*/
+   if (ple_gap) {
+   kvm-arch.ple_gap = ple_gap;
+   kvm-arch.ple_window = ple_window;
+   }
+}
+
+static void vmx_set_ple_gap(struct kvm *kvm, int value)
+{
+   if (ple_gap)
+   kvm-arch.ple_gap = value;
+}
+
+static void vmx_set_ple_window(struct kvm *kvm, int value)
+{
+   if (ple_gap)
+   kvm-arch.ple_window = value;
+}
+
 /*
  * Sets up the vmcs for emulated real mode.
  */
@@ -3814,6 +3837,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
unsigned long a;
 #endif
int i;
+   struct kvm *kvm = vmx-vcpu.kvm;
 
/* I/O */
vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a));
@@ -3836,8 +3860,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
}
 
if (ple_gap) {
-   vmcs_write32(PLE_GAP, ple_gap);
-   vmcs_write32(PLE_WINDOW, ple_window);
+   vmcs_write32(PLE_GAP, kvm-arch.ple_gap);
+   vmcs_write32(PLE_WINDOW, kvm-arch.ple_window);
}
 
vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
@@ -7233,6 +7257,10 @@ static struct kvm_x86_ops vmx_x86_ops = {
.hardware_enable = hardware_enable,
.hardware_disable = hardware_disable,
.cpu_has_accelerated_tpr = report_flexpriority,
+   .cpu_has_ple = cpu_has_vmx_ple,
+   .ple_setup = vmx_ple_setup,
+   .set_ple_gap = vmx_set_ple_gap,
+   .set_ple_window = vmx_set_ple_window,
 
.vcpu_create = vmx_create_vcpu,
.vcpu_free = vmx_free_vcpu,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1eefebe..89b8291 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2197,6 +2197,9 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_VAPIC:
r = !kvm_x86_ops-cpu_has_accelerated_tpr();
break;
+   case KVM_CAP_PLE:
+   r = kvm_x86_ops-cpu_has_ple();
+   break;
case KVM_CAP_NR_VCPUS:
r = KVM_SOFT_MAX_VCPUS;
break;
@@ -3462,7 +3465,16 @@ long

[Patch]QEMU: Add -ple-gap and -ple-window options for per domain PLE

2012-10-16 Thread Hu, Xuekun

QEMU: Add -ple-gap and -ple-window options for per domain PLE

Signed-off-by: Xuekun Hu xuekun...@intel.com
---
 linux-headers/linux/kvm.h |3 +++
 qemu-options.hx   |   16 
 sysemu.h  |2 ++
 target-i386/kvm.c |   13 +
 vl.c  |   29 +
 5 files changed, 63 insertions(+), 0 deletions(-)

diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 4b9e575..a32c68a 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_GET_SMMU_INFO 78
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
+#define KVM_CAP_PLE 83
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -769,6 +770,8 @@ struct kvm_msi {
struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR  _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
+#define KVM_SET_PLE_GAP   _IOW(KVMIO,  0x49, __u32)
+#define KVM_SET_PLE_WINDOW_IOW(KVMIO,  0x4a, __u32)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
diff --git a/qemu-options.hx b/qemu-options.hx
index 3cd9243..c1c5c7b 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -98,6 +98,22 @@ given, the total number of CPUs @var{n} can be omitted. 
@var{maxcpus}
 specifies the maximum number of hotpluggable CPUs.
 ETEXI
 
+DEF(ple-gap, HAS_ARG, QEMU_OPTION_ple_gap,
+-ple-gap n Set ple_gap per vm\n, QEMU_ARCH_I386)
+STEXI
+@item -ple-gap @var{n}
+@findex -ple-gap
+Set ple_gap to vm.
+ETEXI
+
+DEF(ple-window, HAS_ARG, QEMU_OPTION_ple_window,
+-ple-window n Set ple_window per vm\n, QEMU_ARCH_I386)
+STEXI
+@item -ple-window @var{n}
+@findex -ple-window
+Set ple_gap to vm.
+ETEXI
+
 DEF(numa, HAS_ARG, QEMU_OPTION_numa,
 -numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]\n, QEMU_ARCH_ALL)
 STEXI
diff --git a/sysemu.h b/sysemu.h
index b38b1a3..e25d111 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -130,6 +130,8 @@ extern uint8_t *boot_splash_filedata;
 extern int boot_splash_filedata_size;
 extern uint8_t qemu_extra_params_fw[2];
 extern QEMUClock *rtc_clock;
+extern int32_t ple_gap;
+extern int32_t ple_window;
 
 #define MAX_NODES 64
 #define MAX_CPUMASK_BITS 255
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 5b18383..12e0b2a 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -724,6 +724,19 @@ int kvm_arch_init(KVMState *s)
 }
 }
 }
+
+if (kvm_check_extension(s, KVM_CAP_PLE)  (ple_gap != 0)) {
+ret = kvm_vm_ioctl(s, KVM_SET_PLE_GAP, ple_gap);
+if (ret  0) {
+fprintf(stderr, kvm_set_ple_gap failed\n);
+return ret;
+}
+ret = kvm_vm_ioctl(s, KVM_SET_PLE_WINDOW, ple_window);
+if (ret  0) {
+fprintf(stderr, kvm_set_ple_window failed\n);
+return ret;
+}
+}
 return 0;
 }
 
diff --git a/vl.c b/vl.c
index a1c0aa7..96dbc4b 100644
--- a/vl.c
+++ b/vl.c
@@ -234,6 +234,9 @@ uint8_t *boot_splash_filedata;
 int boot_splash_filedata_size;
 uint8_t qemu_extra_params_fw[2];
 
+int32_t ple_gap = 0;
+int32_t ple_window = 0;
+
 typedef struct FWBootEntry FWBootEntry;
 
 struct FWBootEntry {
@@ -2590,6 +2593,32 @@ int main(int argc, char **argv, char **envp)
 }
 }
 break;
+case QEMU_OPTION_ple_gap:
+{
+int32_t value;
+char *end;
+
+value = strtol(optarg, end, 10);
+if (value  0 || *end) {
+fprintf(stderr, qemu: invalid ple_gap vaule: %s\n, 
optarg);
+exit(1);
+}
+ple_gap = value;
+}
+break;
+case QEMU_OPTION_ple_window:
+{
+int32_t value;
+char *end;
+
+value = strtol(optarg, end, 10);
+if (value  0 || *end) {
+fprintf(stderr, qemu: invalid ple_gap vaule: %s\n, 
optarg);
+exit(1);
+}
+ple_window = value;
+}
+break;
 case QEMU_OPTION_numa:
 if (nb_numa_nodes = MAX_NODES) {
 fprintf(stderr, qemu: too many NUMA nodes\n);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm tools: fix help output for run command

2012-10-16 Thread Pekka Enberg

On Fri, 12 Oct 2012, William Dauchy wrote:
 This dummy patch remove tabs in help output.
 Introduced in commit:
 ae9ec23 kvm tools: generate command line options dynamically
 
 Signed-off-by: William Dauchy wdau...@gmail.com

Applied, thanks William!
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] mm: compaction: Correct the nr_strict_isolated check for CMA

2012-10-16 Thread Mel Gorman

Thierry reported that the iron out patch for isolate_freepages_block()
had problems due to the strict check being too strict with mm: compaction:
Iron out isolate_freepages_block() and isolate_freepages_range() -fix1.
It's possible that more pages than necessary are isolated but the check
still fails and I missed that this fix was not picked up before RC1. This
same problem has been identified in 3.7-RC1 by Tony Prisk and should be
addressed by the following patch.

Signed-off-by: Mel Gorman mgor...@suse.de
Tested-by: Tony Prisk li...@prisktech.co.nz
---
 mm/compaction.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 2c4ce17..9eef558 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -346,7 +346,7 @@ static unsigned long isolate_freepages_block(struct 
compact_control *cc,
 * pages requested were isolated. If there were any failures, 0 is
 * returned and CMA will fail.
 */
-   if (strict  nr_strict_required != total_isolated)
+   if (strict  nr_strict_required  total_isolated)
total_isolated = 0;
 
if (locked)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

kvm-clock clocksource efficiency versus tsc...

2012-10-16 Thread Erik Brakkee

OS: Centos 6.2
KVM version: qemu-kvm-tools-0.12.1.2-2.209.el6_2.4.x86_64
 qemu-kvm-0.12.1.2-2.209.el6_2.4.x86_64
uname -a:
Linux myhost 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012
x86_64 x86_64 x86_64 GNU/Linux

Hi,


I have been performance testing a time tracing utiliity for a Java
enterprise application at work. The idea is that we measure time for
different parts of our application and build time trees for this
information. The time tree can be viewed and analysed in case of problems.
It is also automatically output in case a certain operation is taking
longer than a threshold.

As part of these tests we found that it matters a lot what type of
clocksource is used. In particular, with hpet and acpi_pm the execution is
very slow (700ns per call (similar results using clock_gettime() in a C
program). In addition, hpet and acpi_pm sycnrhonize the application. This
is of course a disaster for server applications that tend to query the
current time quite a lot.

What works quite well is the tsc clocksource. In that case, the time drops
to about 38ns on one of our systems and we can prove that there is no
synchronization anymore which is good.

When running inside a KVM VM the default clock source is kvm-clock. This
clock takes about 160ns per call and also does not synchronise the
application. However, using the tsc clock source delivers similar
performance on the virtual machine as on the host.

I now have the following questions:
* can the performance of kvm-clock be optimized further to be (almost)
identical to that of the host's clock source?
* what are the consequences of using the tsc clock in combination with
NTPD? Will this result in system instability or larger than usual clock
skew?

Cheers
  Erik



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM call agenda for 2012-10-16

2012-10-16 Thread Alexander Graf

-boot and -bootindex

On 15.10.2012, at 12:17, Juan Quintela quint...@redhat.com wrote:

 
 Hi
 
 Please send in any agenda topics you are interested in.
 
 Later, Juan.
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Secure migration of LVM based guests over WAN

2012-10-16 Thread Lukas Laukamp


Hey all,

I have a question about a solution for migrate LVM based guests directly 
over the network.


So the situation: Two KVM hosts with libvirt, multiple LVM based guests
Want to do: Migrate a LVM based guest directly to the other host over an 
secure connection


I know that migration is possible when the VM disks are stored on an 
NFS, GFS2 filer/cluster etc.


So would it be possible to do an offline migration directly with netcat 
or something like that?


Would be greate when someone could help me.

Best Regards
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Patch]KVM: enabling per domain PLE

2012-10-16 Thread Avi Kivity

On 10/16/2012 08:53 AM, Hu, Xuekun wrote:
 Setting the same PLE parameter arbitrarily for different
 workloads is not a good solution. 

True.

 The solution enables
 per domain PLE which gives user ability to set PLE parameter
 for different domain for better performance.

The problem with this is that it requires an administrator to understand
the workload, not only of the guest, but also of other guests on the
machine.  With low overcommit, a high PLE window reduces unneeded exits,
but with high overcommit we need those exits to reduce spinning.

In addition, most kvm hosts don't have an administrator.  They are
controlled by a management system, which means we'll need some algorithm
in userspace to control the PLE window.  Taking the two together, we
need a dynamic (for changing workloads) algorithm.

There are threads discussing this dynamic algorithm, we are making slow
progress because it's such a difficult problem, but I think this is much
more useful than anything requiring user intervention.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Secure migration of LVM based guests over WAN

2012-10-16 Thread Avi Kivity

On 10/16/2012 11:12 AM, Lukas Laukamp wrote:
 Hey all,
 
 I have a question about a solution for migrate LVM based guests directly
 over the network.
 
 So the situation: Two KVM hosts with libvirt, multiple LVM based guests
 Want to do: Migrate a LVM based guest directly to the other host over an
 secure connection
 
 I know that migration is possible when the VM disks are stored on an
 NFS, GFS2 filer/cluster etc.
 
 So would it be possible to do an offline migration directly with netcat
 or something like that?
 

If all you need is offline, you can use scp to copy each volume to the
destination volume.  Make sure the guests are shut down when you do that.

It is also possible to do a live migration, but unless the destination
and source are in the same IP subnet, the guests are going to lose
connectivity.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm

2012-10-16 Thread Avi Kivity

On 10/16/2012 05:59 AM, Paul Mackerras wrote:
 The mmu_notifier_retry() function, used to test whether any page
 invalidations are in progress, currently takes a vcpu pointer, though
 the code only needs the VM's struct kvm pointer.  Forthcoming patches
 to the powerpc Book3S HV code will need to test for retry within a VM
 ioctl, where a struct kvm pointer is available but a struct vcpu
 pointer isn't.  Therefore this creates a variant of mmu_notifier_retry
 called kvm_mmu_notifier_retry that takes a struct kvm pointer, and
 implements mmu_notifier_retry in terms of it.

Why not change mmu_notifier_retry() and all its callers?


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Secure migration of LVM based guests over WAN

2012-10-16 Thread Lukas Laukamp


Am 16.10.2012 11:40, schrieb Avi Kivity:

On 10/16/2012 11:12 AM, Lukas Laukamp wrote:

Hey all,

I have a question about a solution for migrate LVM based guests directly
over the network.

So the situation: Two KVM hosts with libvirt, multiple LVM based guests
Want to do: Migrate a LVM based guest directly to the other host over an
secure connection

I know that migration is possible when the VM disks are stored on an
NFS, GFS2 filer/cluster etc.

So would it be possible to do an offline migration directly with netcat
or something like that?


If all you need is offline, you can use scp to copy each volume to the
destination volume.  Make sure the guests are shut down when you do that.

It is also possible to do a live migration, but unless the destination
and source are in the same IP subnet, the guests are going to lose
connectivity.




Hello Avi,

so can I simply copy an logical volume to the path of the volume group 
with scp?


For the live migration theme, it would be no problem when the guests 
looses connectivity, how could be done a live migration?


Best Regards
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] KVM call agenda for 2012-10-16

2012-10-16 Thread Igor Mammedov

On Mon, 15 Oct 2012 18:27:51 +0200
Igor Mammedov imamm...@redhat.com wrote:

 CPU as DEVICE
 http://lists.gnu.org/archive/html/qemu-devel/2012-10/msg00719.html
 latest known tree for testing:
 
 https://github.com/ehabkost/qemu-hacks/commits/work/cpu-devicestate-qdev-core
 may be we could agree on proposed RFC.

Rebased series that applies to yesterday's master.
http://www.mail-archive.com/qemu-devel@nongnu.org/msg136505.html
git tree for testing: https://github.com/imammedo/qemu/tree/cpu_as_dev.v3
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm

2012-10-16 Thread Alexander Graf


On 16.10.2012, at 11:44, Avi Kivity wrote:

 On 10/16/2012 05:59 AM, Paul Mackerras wrote:
 The mmu_notifier_retry() function, used to test whether any page
 invalidations are in progress, currently takes a vcpu pointer, though
 the code only needs the VM's struct kvm pointer.  Forthcoming patches
 to the powerpc Book3S HV code will need to test for retry within a VM
 ioctl, where a struct kvm pointer is available but a struct vcpu
 pointer isn't.  Therefore this creates a variant of mmu_notifier_retry
 called kvm_mmu_notifier_retry that takes a struct kvm pointer, and
 implements mmu_notifier_retry in terms of it.
 
 Why not change mmu_notifier_retry() and all its callers?

Why not use Christoffer's patch? :)

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Avi Kivity

On 10/16/2012 06:01 AM, Paul Mackerras wrote:
 A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor.  Reads on
 this fd return the contents of the HPT (hashed page table), writes
 create and/or remove entries in the HPT.  There is a new capability,
 KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl.  The ioctl
 takes an argument structure with the index of the first HPT entry to
 read out and a set of flags.  The flags indicate whether the user is
 intending to read or write the HPT, and whether to return all entries
 or only the bolted entries (those with the bolted bit, 0x10, set in
 the first doubleword).
 
 This is intended for use in implementing qemu's savevm/loadvm and for
 live migration.  Therefore, on reads, the first pass returns information
 about all HPTEs (or all bolted HPTEs).  When the first pass reaches the
 end of the HPT, it returns from the read.  Subsequent reads only return
 information about HPTEs that have changed since they were last read.
 A read that finds no changed HPTEs in the HPT following where the last
 read finished will return 0 bytes.

Copying people with interest in migration.

 +4.78 KVM_PPC_GET_HTAB_FD
 +
 +Capability: KVM_CAP_PPC_HTAB_FD
 +Architectures: powerpc
 +Type: vm ioctl
 +Parameters: Pointer to struct kvm_get_htab_fd (in)
 +Returns: file descriptor number (= 0) on success, -1 on error
 +
 +This returns a file descriptor that can be used either to read out the
 +entries in the guest's hashed page table (HPT), or to write entries to
 +initialize the HPT.  The returned fd can only be written to if the
 +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
 +can only be read if that bit is clear.  The argument struct looks like
 +this:
 +
 +/* For KVM_PPC_GET_HTAB_FD */
 +struct kvm_get_htab_fd {
 + __u64   flags;
 + __u64   start_index;
 +};
 +
 +/* Values for kvm_get_htab_fd.flags */
 +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1)
 +#define KVM_GET_HTAB_WRITE   ((__u64)0x2)
 +
 +The `start_index' field gives the index in the HPT of the entry at
 +which to start reading.  It is ignored when writing.
 +
 +Reads on the fd will initially supply information about all
 +interesting HPT entries.  Interesting entries are those with the
 +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
 +all entries.  When the end of the HPT is reached, the read() will
 +return.  

What happens if the read buffer is smaller than the HPT size?

What happens if the read buffer size is not a multiple of entry size?

Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
to change).

 If read() is called again on the fd, it will start again from
 +the beginning of the HPT, but will only return HPT entries that have
 +changed since they were last read.
 +
 +Data read or written is structured as a header (8 bytes) followed by a
 +series of valid HPT entries (16 bytes) each.  The header indicates how
 +many valid HPT entries there are and how many invalid entries follow
 +the valid entries.  The invalid entries are not represented explicitly
 +in the stream.  The header format is:
 +
 +struct kvm_get_htab_header {
 + __u32   index;
 + __u16   n_valid;
 + __u16   n_invalid;
 +};

This structure forces the kernel to return entries sequentially.  Will
this block changing the data structure in the future?  Or is the
hardware spec sufficiently strict that such changes are not realistic?

 +
 +Writes to the fd create HPT entries starting at the index given in the
 +header; first `n_valid' valid entries with contents from the data
 +written, then `n_invalid' invalid entries, invalidating any previously
 +valid entries found.

This scheme is a clever, original, and very interesting approach to live
migration.  That doesn't necessarily mean a NAK, we should see if it
makes sense for other migration APIs as well (we currently have
difficulties migrating very large/wide guests).

What is the typical number of entries in the HPT?  Do you have estimates
of the change rate?

Suppose new hardware arrives that supports nesting HPTs, so that kvm is
no longer synchronously aware of the guest HPT (similar to how NPT/EPT
made kvm unaware of guest virtual-physical translations on x86).  How
will we deal with that?  But I guess this will be a
non-guest-transparent and non-userspace-transparent change, unlike
NPT/EPT, so a userspace ABI addition will be needed anyway).


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Secure migration of LVM based guests over WAN

2012-10-16 Thread Avi Kivity

On 10/16/2012 11:48 AM, Lukas Laukamp wrote:
 Am 16.10.2012 11:40, schrieb Avi Kivity:
 On 10/16/2012 11:12 AM, Lukas Laukamp wrote:
 Hey all,

 I have a question about a solution for migrate LVM based guests directly
 over the network.

 So the situation: Two KVM hosts with libvirt, multiple LVM based guests
 Want to do: Migrate a LVM based guest directly to the other host over an
 secure connection

 I know that migration is possible when the VM disks are stored on an
 NFS, GFS2 filer/cluster etc.

 So would it be possible to do an offline migration directly with netcat
 or something like that?

 If all you need is offline, you can use scp to copy each volume to the
 destination volume.  Make sure the guests are shut down when you do that.

 It is also possible to do a live migration, but unless the destination
 and source are in the same IP subnet, the guests are going to lose
 connectivity.


 
 Hello Avi,
 
 so can I simply copy an logical volume to the path of the volume group
 with scp?

Yes.  Best to enable compression to avoid sending zero blocks.

 
 For the live migration theme, it would be no problem when the guests
 looses connectivity, how could be done a live migration?
 

See the -b option to the migrate command.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Avi Kivity

On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an in-kernel
 irqchip which we don't have yet.

It's not strictly required.  You have an interrupt line leading to the
core, no?  You could have your irqfd trigger that.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] KVM: Distangle eventfd code from irqchip

2012-10-16 Thread Avi Kivity

On 10/15/2012 02:02 PM, Alexander Graf wrote:
 The current eventfd code assumes that when we have eventfd, we also have
 irqfd for in-kernel interrupt delivery. This is not necessarily true. On
 PPC we don't have an in-kernel irqchip yet, but we can still support easily
 support eventfd.
 

Don't you need, in addition, to reject PIO space for ioeventfd?


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Alexander Graf


On 16.10.2012, at 12:56, Avi Kivity wrote:

 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an in-kernel
 irqchip which we don't have yet.
 
 It's not strictly required.  You have an interrupt line leading to the
 core, no?  You could have your irqfd trigger that.

The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this 
patch set is cleaning up: enable you to build ioeventfd support without 
in-kernel irqchip irqfd support.

Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) 
in user space. That path works just fine with these patches applied.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Avi Kivity

On 10/16/2012 12:59 PM, Alexander Graf wrote:
 
 On 16.10.2012, at 12:56, Avi Kivity wrote:
 
 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an in-kernel
 irqchip which we don't have yet.
 
 It's not strictly required.  You have an interrupt line leading to the
 core, no?  You could have your irqfd trigger that.
 
 The irqfd code in KVM directly triggers the in-kernel irqchip. That's what 
 this patch set is cleaning up: enable you to build ioeventfd support without 
 in-kernel irqchip irqfd support.
 
 Vhost can work without in-kernel irqchip too, by listening to an eventfd 
 (iiuc) in user space. That path works just fine with these patches applied.

That's all true but it wasn't my point.  I was asking whether you can
enable irqfd without in-kernel irqchip.  If your irq input is edge
triggered then it's trivial.  If it's level triggered then you can use
the new resampling irqfd.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] KVM: Distangle eventfd code from irqchip

2012-10-16 Thread Alexander Graf


On 16.10.2012, at 12:57, Avi Kivity wrote:

 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 The current eventfd code assumes that when we have eventfd, we also have
 irqfd for in-kernel interrupt delivery. This is not necessarily true. On
 PPC we don't have an in-kernel irqchip yet, but we can still support easily
 support eventfd.
 
 
 Don't you need, in addition, to reject PIO space for ioeventfd?

Yeah, we could tell user space that it's doing something stupid. Not sure how 
critical it is - right now it would just register it and never get triggered, 
because the bus doesn't exist. But sanity checks are a good thing usually.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Alexander Graf


On 16.10.2012, at 13:01, Avi Kivity wrote:

 On 10/16/2012 12:59 PM, Alexander Graf wrote:
 
 On 16.10.2012, at 12:56, Avi Kivity wrote:
 
 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an in-kernel
 irqchip which we don't have yet.
 
 It's not strictly required.  You have an interrupt line leading to the
 core, no?  You could have your irqfd trigger that.
 
 The irqfd code in KVM directly triggers the in-kernel irqchip. That's what 
 this patch set is cleaning up: enable you to build ioeventfd support without 
 in-kernel irqchip irqfd support.
 
 Vhost can work without in-kernel irqchip too, by listening to an eventfd 
 (iiuc) in user space. That path works just fine with these patches applied.
 
 That's all true but it wasn't my point.  I was asking whether you can
 enable irqfd without in-kernel irqchip.  If your irq input is edge
 triggered then it's trivial.  If it's level triggered then you can use
 the new resampling irqfd.

I'm not sure I fully grasp what you're trying to say :).

We have a single interrupt line on the core. So whenever any external interrupt 
gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it 
which line is active.

So yes, we could create a direct fd channel between vhost and the user space 
MPIC, but it wouldn't buy us anything. The interrupt injection path would be as 
long as it is with the current mechanism.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Paul Mackerras

On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
 On 10/16/2012 06:01 AM, Paul Mackerras wrote:
  +4.78 KVM_PPC_GET_HTAB_FD
  +
  +Capability: KVM_CAP_PPC_HTAB_FD
  +Architectures: powerpc
  +Type: vm ioctl
  +Parameters: Pointer to struct kvm_get_htab_fd (in)
  +Returns: file descriptor number (= 0) on success, -1 on error
  +
  +This returns a file descriptor that can be used either to read out the
  +entries in the guest's hashed page table (HPT), or to write entries to
  +initialize the HPT.  The returned fd can only be written to if the
  +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
  +can only be read if that bit is clear.  The argument struct looks like
  +this:
  +
  +/* For KVM_PPC_GET_HTAB_FD */
  +struct kvm_get_htab_fd {
  +   __u64   flags;
  +   __u64   start_index;
  +};
  +
  +/* Values for kvm_get_htab_fd.flags */
  +#define KVM_GET_HTAB_BOLTED_ONLY   ((__u64)0x1)
  +#define KVM_GET_HTAB_WRITE ((__u64)0x2)
  +
  +The `start_index' field gives the index in the HPT of the entry at
  +which to start reading.  It is ignored when writing.
  +
  +Reads on the fd will initially supply information about all
  +interesting HPT entries.  Interesting entries are those with the
  +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
  +all entries.  When the end of the HPT is reached, the read() will
  +return.  
 
 What happens if the read buffer is smaller than the HPT size?

That's fine; the read stops when it has filled the buffer and a
subsequent read will continue from where the previous one finished.

 What happens if the read buffer size is not a multiple of entry size?

Then we don't use the last few bytes of the buffer.  The read() call
returns the number of bytes that were filled in, of course.  In any
case, the header size is 8 bytes and the HPT entry size is 16 bytes,
so the number of bytes filled in won't necessarily be a multiple of 16
bytes.

 Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
 to change).

No.

  If read() is called again on the fd, it will start again from
  +the beginning of the HPT, but will only return HPT entries that have
  +changed since they were last read.
  +
  +Data read or written is structured as a header (8 bytes) followed by a
  +series of valid HPT entries (16 bytes) each.  The header indicates how
  +many valid HPT entries there are and how many invalid entries follow
  +the valid entries.  The invalid entries are not represented explicitly
  +in the stream.  The header format is:
  +
  +struct kvm_get_htab_header {
  +   __u32   index;
  +   __u16   n_valid;
  +   __u16   n_invalid;
  +};
 
 This structure forces the kernel to return entries sequentially.  Will
 this block changing the data structure in the future?  Or is the
 hardware spec sufficiently strict that such changes are not realistic?

By data structure, do you mean the stream format on the file
descriptor, or the HPT structure?  If we want a new stream format,
then we would define a bit in the flags field of struct
kvm_get_htab_fd to mean I want the new stream format.  The code
fails the ioctl if any unknown flag bits are set, so a new userspace
that wants to use the new format could then detect that it is running
on an old kernel and fall back to the old format.

The HPT entry format is very unlikely to change in size or basic
layout (though the architects do redefine some of the bits
occasionally).

  +
  +Writes to the fd create HPT entries starting at the index given in the
  +header; first `n_valid' valid entries with contents from the data
  +written, then `n_invalid' invalid entries, invalidating any previously
  +valid entries found.
 
 This scheme is a clever, original, and very interesting approach to live
 migration.  That doesn't necessarily mean a NAK, we should see if it
 makes sense for other migration APIs as well (we currently have
 difficulties migrating very large/wide guests).
 
 What is the typical number of entries in the HPT?  Do you have estimates
 of the change rate?

Typically the HPT would have about a million entries, i.e. it would be
16MiB in size.  The usual guideline is to make it about 1/64 of the
maximum amount of RAM the guest could ever have, rounded up to a power
of two, although we often run with less, say 1/128 or even 1/256.

Because it is a hash table, updates tend to be scattered throughout
the whole table, which is another reason why per-page dirty tracking
and updates would be pretty inefficient.

As for the change rate, it depends on the application of course, but
basically every time the guest changes a PTE in its Linux page tables
we do the corresponding change to the corresponding HPT entry, so the
rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
exec, etc. have a high rate of HPT updates.

 Suppose new hardware arrives that supports nesting HPTs, so that kvm is
 no longer synchronously aware of the guest HPT (similar to how NPT/EPT
 made

[PATCH v5 1/6] KVM: MMU: fix release noslot pfn

2012-10-16 Thread Xiao Guangrong

We can not directly call kvm_release_pfn_clean to release the pfn
since we can meet noslot pfn which is used to cache mmio info into
spte

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c  |3 +--
 virt/kvm/kvm_main.c |4 +---
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d289fee..6f85fe0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2497,8 +2497,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
}
}

-   if (!is_error_pfn(pfn))
-   kvm_release_pfn_clean(pfn);
+   kvm_release_pfn_clean(pfn);
 }

 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c353b45..a65bc02 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1322,9 +1322,7 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);

 void kvm_release_pfn_clean(pfn_t pfn)
 {
-   WARN_ON(is_error_pfn(pfn));
-
-   if (!kvm_is_mmio_pfn(pfn))
+   if (!is_error_pfn(pfn)  !kvm_is_mmio_pfn(pfn))
put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v5 2/6] KVM: MMU: remove mmu_is_invalid

2012-10-16 Thread Xiao Guangrong

Remove mmu_is_invalid and use is_invalid_pfn instead

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |5 -
 arch/x86/kvm/paging_tmpl.h |4 ++--
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6f85fe0..b8d13d7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2699,11 +2699,6 @@ static void transparent_hugepage_adjust(struct kvm_vcpu 
*vcpu,
}
 }

-static bool mmu_invalid_pfn(pfn_t pfn)
-{
-   return unlikely(is_invalid_pfn(pfn));
-}
-
 static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
pfn_t pfn, unsigned access, int *ret_val)
 {
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 714e2c0..045d31a 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -340,7 +340,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp,
pte_access = sp-role.access  gpte_access(vcpu, gpte);
protect_clean_gpte(pte_access, gpte);
pfn = gfn_to_pfn_atomic(vcpu-kvm, gpte_to_gfn(gpte));
-   if (mmu_invalid_pfn(pfn))
+   if (is_invalid_pfn(pfn))
return;

/*
@@ -416,7 +416,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, 
struct guest_walker *gw,
gfn = gpte_to_gfn(gpte);
pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
  pte_access  ACC_WRITE_MASK);
-   if (mmu_invalid_pfn(pfn))
+   if (is_invalid_pfn(pfn))
break;

mmu_set_spte(vcpu, spte, sp-role.access, pte_access, 0, 0,
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v5 3/6] KVM: MMU: cleanup FNAME(page_fault)

2012-10-16 Thread Xiao Guangrong

Let it return emulate state instead of spte like __direct_map

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/paging_tmpl.h |   32 +---
 1 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 045d31a..c32 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -427,21 +427,21 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, 
struct guest_walker *gw,

 /*
  * Fetch a shadow pte for a specific level in the paging hierarchy.
+ * If the guest tries to write a write-protected page, we need to
+ * emulate this operation, return 1 to indicate this case.
  */
-static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
+static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 struct guest_walker *gw,
 int user_fault, int write_fault, int hlevel,
-int *emulate, pfn_t pfn, bool map_writable,
-bool prefault)
+pfn_t pfn, bool map_writable, bool prefault)
 {
-   unsigned access = gw-pt_access;
struct kvm_mmu_page *sp = NULL;
-   int top_level;
-   unsigned direct_access;
struct kvm_shadow_walk_iterator it;
+   unsigned direct_access, access = gw-pt_access;
+   int top_level, emulate = 0;

if (!is_present_gpte(gw-ptes[gw-level - 1]))
-   return NULL;
+   return 0;

direct_access = gw-pte_access;

@@ -505,17 +505,17 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t 
addr,

clear_sp_write_flooding_count(it.sptep);
mmu_set_spte(vcpu, it.sptep, access, gw-pte_access,
-user_fault, write_fault, emulate, it.level,
+user_fault, write_fault, emulate, it.level,
 gw-gfn, pfn, prefault, map_writable);
FNAME(pte_prefetch)(vcpu, gw, it.sptep);

-   return it.sptep;
+   return emulate;

 out_gpte_changed:
if (sp)
kvm_mmu_put_page(sp, it.sptep);
kvm_release_pfn_clean(pfn);
-   return NULL;
+   return 0;
 }

 /*
@@ -538,8 +538,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
addr, u32 error_code,
int write_fault = error_code  PFERR_WRITE_MASK;
int user_fault = error_code  PFERR_USER_MASK;
struct guest_walker walker;
-   u64 *sptep;
-   int emulate = 0;
int r;
pfn_t pfn;
int level = PT_PAGE_TABLE_LEVEL;
@@ -601,17 +599,13 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
addr, u32 error_code,
kvm_mmu_free_some_pages(vcpu);
if (!force_pt_level)
transparent_hugepage_adjust(vcpu, walker.gfn, pfn, level);
-   sptep = FNAME(fetch)(vcpu, addr, walker, user_fault, write_fault,
-level, emulate, pfn, map_writable, prefault);
-   (void)sptep;
-   pgprintk(%s: shadow pte %p %llx emulate %d\n, __func__,
-sptep, *sptep, emulate);
-
+   r = FNAME(fetch)(vcpu, addr, walker, user_fault, write_fault,
+level, pfn, map_writable, prefault);
++vcpu-stat.pf_fixed;
kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
spin_unlock(vcpu-kvm-mmu_lock);

-   return emulate;
+   return r;

 out_unlock:
spin_unlock(vcpu-kvm-mmu_lock);
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v5 4/6] KVM: MMU: move prefetch_invalid_gpte out of pagaing_tmp.h

2012-10-16 Thread Xiao Guangrong

The function does not depend on guest mmu mode, move it out from
paging_tmpl.h

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   36 
 arch/x86/kvm/paging_tmpl.h |   26 +++---
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b8d13d7..56c0e39 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2505,6 +2505,14 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
mmu_free_roots(vcpu);
 }

+static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
+{
+   int bit7;
+
+   bit7 = (gpte  7)  1;
+   return (gpte  mmu-rsvd_bits_mask[bit7][level-1]) != 0;
+}
+
 static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 bool no_dirty_log)
 {
@@ -2517,6 +2525,26 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu 
*vcpu, gfn_t gfn,
return gfn_to_pfn_memslot_atomic(slot, gfn);
 }

+static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *sp, u64 *spte,
+ u64 gpte)
+{
+   if (is_rsvd_bits_set(vcpu-arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
+   goto no_present;
+
+   if (!is_present_gpte(gpte))
+   goto no_present;
+
+   if (!(gpte  PT_ACCESSED_MASK))
+   goto no_present;
+
+   return false;
+
+no_present:
+   drop_spte(vcpu-kvm, spte);
+   return true;
+}
+
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp,
u64 *start, u64 *end)
@@ -3394,14 +3422,6 @@ static void paging_free(struct kvm_vcpu *vcpu)
nonpaging_free(vcpu);
 }

-static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
-{
-   int bit7;
-
-   bit7 = (gpte  7)  1;
-   return (gpte  mmu-rsvd_bits_mask[bit7][level-1]) != 0;
-}
-
 static inline void protect_clean_gpte(unsigned *access, unsigned gpte)
 {
unsigned mask;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index c32..36a80ed 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -305,26 +305,6 @@ static int FNAME(walk_addr_nested)(struct guest_walker 
*walker,
addr, access);
 }

-static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
-   struct kvm_mmu_page *sp, u64 *spte,
-   pt_element_t gpte)
-{
-   if (is_rsvd_bits_set(vcpu-arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
-   goto no_present;
-
-   if (!is_present_gpte(gpte))
-   goto no_present;
-
-   if (!(gpte  PT_ACCESSED_MASK))
-   goto no_present;
-
-   return false;
-
-no_present:
-   drop_spte(vcpu-kvm, spte);
-   return true;
-}
-
 static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
  u64 *spte, const void *pte)
 {
@@ -333,7 +313,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp,
pfn_t pfn;

gpte = *(const pt_element_t *)pte;
-   if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
+   if (prefetch_invalid_gpte(vcpu, sp, spte, gpte))
return;

pgprintk(%s: gpte %llx spte %p\n, __func__, (u64)gpte, spte);
@@ -408,7 +388,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, 
struct guest_walker *gw,

gpte = gptep[i];

-   if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
+   if (prefetch_invalid_gpte(vcpu, sp, spte, gpte))
continue;

pte_access = sp-role.access  gpte_access(vcpu, gpte);
@@ -751,7 +731,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
  sizeof(pt_element_t)))
return -EINVAL;

-   if (FNAME(prefetch_invalid_gpte)(vcpu, sp, sp-spt[i], gpte)) {
+   if (prefetch_invalid_gpte(vcpu, sp, sp-spt[i], gpte)) {
vcpu-kvm-tlbs_dirty++;
continue;
}
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v5 5/6] KVM: MMU: introduce FNAME(prefetch_gpte)

2012-10-16 Thread Xiao Guangrong

The only difference between FNAME(update_pte) and FNAME(pte_prefetch)
is that the former is allowed to prefetch gfn from dirty logged slot,
so introduce a common function to prefetch spte

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/paging_tmpl.h |   55 +++
 1 files changed, 24 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 36a80ed..f887e4c 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -305,31 +305,43 @@ static int FNAME(walk_addr_nested)(struct guest_walker 
*walker,
addr, access);
 }

-static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
- u64 *spte, const void *pte)
+static bool
+FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+u64 *spte, pt_element_t gpte, bool no_dirty_log)
 {
-   pt_element_t gpte;
unsigned pte_access;
+   gfn_t gfn;
pfn_t pfn;

-   gpte = *(const pt_element_t *)pte;
if (prefetch_invalid_gpte(vcpu, sp, spte, gpte))
-   return;
+   return false;

pgprintk(%s: gpte %llx spte %p\n, __func__, (u64)gpte, spte);
+
+   gfn = gpte_to_gfn(gpte);
pte_access = sp-role.access  gpte_access(vcpu, gpte);
protect_clean_gpte(pte_access, gpte);
-   pfn = gfn_to_pfn_atomic(vcpu-kvm, gpte_to_gfn(gpte));
+   pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
+   no_dirty_log  (pte_access  ACC_WRITE_MASK));
if (is_invalid_pfn(pfn))
-   return;
+   return false;

/*
-* we call mmu_set_spte() with host_writable = true because that
-* vcpu-arch.update_pte.pfn was fetched from get_user_pages(write = 1).
+* we call mmu_set_spte() with host_writable = true because
+* pte_prefetch_gfn_to_pfn always gets a writable pfn.
 */
mmu_set_spte(vcpu, spte, sp-role.access, pte_access, 0, 0,
-NULL, PT_PAGE_TABLE_LEVEL,
-gpte_to_gfn(gpte), pfn, true, true);
+NULL, PT_PAGE_TABLE_LEVEL, gfn, pfn, true, true);
+
+   return true;
+}
+
+static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ u64 *spte, const void *pte)
+{
+   pt_element_t gpte = *(const pt_element_t *)pte;
+
+   FNAME(prefetch_gpte)(vcpu, sp, spte, gpte, false);
 }

 static bool FNAME(gpte_changed)(struct kvm_vcpu *vcpu,
@@ -375,33 +387,14 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, 
struct guest_walker *gw,
spte = sp-spt + i;

for (i = 0; i  PTE_PREFETCH_NUM; i++, spte++) {
-   pt_element_t gpte;
-   unsigned pte_access;
-   gfn_t gfn;
-   pfn_t pfn;
-
if (spte == sptep)
continue;

if (is_shadow_present_pte(*spte))
continue;

-   gpte = gptep[i];
-
-   if (prefetch_invalid_gpte(vcpu, sp, spte, gpte))
-   continue;
-
-   pte_access = sp-role.access  gpte_access(vcpu, gpte);
-   protect_clean_gpte(pte_access, gpte);
-   gfn = gpte_to_gfn(gpte);
-   pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
- pte_access  ACC_WRITE_MASK);
-   if (is_invalid_pfn(pfn))
+   if (!FNAME(prefetch_gpte)(vcpu, sp, spte, gptep[i], true))
break;
-
-   mmu_set_spte(vcpu, spte, sp-role.access, pte_access, 0, 0,
-NULL, PT_PAGE_TABLE_LEVEL, gfn,
-pfn, true, true);
}
 }

-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v5 6/6] KVM: do not treat noslot pfn as a error pfn

2012-10-16 Thread Xiao Guangrong

This patch filters noslot pfn out from error pfns based on Marcelo comment:
noslot pfn is not a error pfn

After this patch,
- is_noslot_pfn indicates that the gfn is not in slot
- is_error_pfn indicates that the gfn is in slot but the error is occurred
  when translate the gfn to pfn
- is_error_noslot_pfn indicates that the pfn either it is error pfns or it
  is noslot pfn
And is_invalid_pfn can be removed, it makes the code more clean

[ The patch has been compiling tested for powerpc ]

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_32_mmu_host.c |2 +-
 arch/powerpc/kvm/book3s_64_mmu_host.c |2 +-
 arch/powerpc/kvm/e500_tlb.c   |2 +-
 arch/x86/kvm/mmu.c|4 ++--
 arch/x86/kvm/paging_tmpl.h|2 +-
 arch/x86/kvm/x86.c|2 +-
 include/linux/kvm_host.h  |   28 
 virt/kvm/iommu.c  |4 ++--
 virt/kvm/kvm_main.c   |6 +++---
 9 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_32_mmu_host.c 
b/arch/powerpc/kvm/book3s_32_mmu_host.c
index 9fac010..844d591 100644
--- a/arch/powerpc/kvm/book3s_32_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_32_mmu_host.c
@@ -155,7 +155,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct 
kvmppc_pte *orig_pte)

/* Get host physical address for gpa */
hpaddr = kvmppc_gfn_to_pfn(vcpu, orig_pte-raddr  PAGE_SHIFT);
-   if (is_error_pfn(hpaddr)) {
+   if (is_error_noslot_pfn(hpaddr)) {
printk(KERN_INFO Couldn't get guest page for gfn %lx!\n,
 orig_pte-eaddr);
r = -EINVAL;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c 
b/arch/powerpc/kvm/book3s_64_mmu_host.c
index 6b2c80e..53f390f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -92,7 +92,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct 
kvmppc_pte *orig_pte)

/* Get host physical address for gpa */
hpaddr = kvmppc_gfn_to_pfn(vcpu, orig_pte-raddr  PAGE_SHIFT);
-   if (is_error_pfn(hpaddr)) {
+   if (is_error_noslot_pfn(hpaddr)) {
printk(KERN_INFO Couldn't get guest page for gfn %lx!\n, 
orig_pte-eaddr);
r = -EINVAL;
goto out;
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index c733894..6305ee6 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -524,7 +524,7 @@ static inline void kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,
if (likely(!pfnmap)) {
unsigned long tsize_pages = 1  (tsize + 10 - PAGE_SHIFT);
pfn = gfn_to_pfn_memslot(slot, gfn);
-   if (is_error_pfn(pfn)) {
+   if (is_error_noslot_pfn(pfn)) {
printk(KERN_ERR Couldn't get real page for gfn %lx!\n,
(long)gfn);
return;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 56c0e39..54c3557 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2699,7 +2699,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu 
*vcpu,
 * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
 * here.
 */
-   if (!is_error_pfn(pfn)  !kvm_is_mmio_pfn(pfn) 
+   if (!is_error_noslot_pfn(pfn)  !kvm_is_mmio_pfn(pfn) 
level == PT_PAGE_TABLE_LEVEL 
PageTransCompound(pfn_to_page(pfn)) 
!has_wrprotected_page(vcpu-kvm, gfn, PT_DIRECTORY_LEVEL)) {
@@ -2733,7 +2733,7 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, 
gva_t gva, gfn_t gfn,
bool ret = true;

/* The pfn is invalid, report the error! */
-   if (unlikely(is_invalid_pfn(pfn))) {
+   if (unlikely(is_error_pfn(pfn))) {
*ret_val = kvm_handle_bad_page(vcpu, gfn, pfn);
goto exit;
}
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index f887e4c..89f3241 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -323,7 +323,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp,
protect_clean_gpte(pte_access, gpte);
pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
no_dirty_log  (pte_access  ACC_WRITE_MASK));
-   if (is_invalid_pfn(pfn))
+   if (is_error_pfn(pfn))
return false;

/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 841d09b..8b90dd5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4504,7 +4504,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, 
gva_t gva)
 * instruction - ...
 */
pfn = gfn_to_pfn(vcpu-kvm, gpa_to_gfn(gpa));
-   if (!is_error_pfn(pfn)) {
+   if (!is_error_noslot_pfn(pfn)) {

Re: [PATCH] mm: compaction: Correct the nr_strict_isolated check for CMA

2012-10-16 Thread Rik van Riel


On 10/16/2012 04:39 AM, Mel Gorman wrote:

Thierry reported that the iron out patch for isolate_freepages_block()
had problems due to the strict check being too strict with mm: compaction:
Iron out isolate_freepages_block() and isolate_freepages_range() -fix1.
It's possible that more pages than necessary are isolated but the check
still fails and I missed that this fix was not picked up before RC1. This
same problem has been identified in 3.7-RC1 by Tony Prisk and should be
addressed by the following patch.

Signed-off-by: Mel Gorman mgor...@suse.de
Tested-by: Tony Prisk li...@prisktech.co.nz


Acked-by: Rik van Riel r...@redhat.com

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Avi Kivity

On 10/16/2012 01:58 PM, Paul Mackerras wrote:
 On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
 On 10/16/2012 06:01 AM, Paul Mackerras wrote:
  +4.78 KVM_PPC_GET_HTAB_FD
  +
  +Capability: KVM_CAP_PPC_HTAB_FD
  +Architectures: powerpc
  +Type: vm ioctl
  +Parameters: Pointer to struct kvm_get_htab_fd (in)
  +Returns: file descriptor number (= 0) on success, -1 on error
  +
  +This returns a file descriptor that can be used either to read out the
  +entries in the guest's hashed page table (HPT), or to write entries to
  +initialize the HPT.  The returned fd can only be written to if the
  +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
  +can only be read if that bit is clear.  The argument struct looks like
  +this:
  +
  +/* For KVM_PPC_GET_HTAB_FD */
  +struct kvm_get_htab_fd {
  +  __u64   flags;
  +  __u64   start_index;
  +};
  +
  +/* Values for kvm_get_htab_fd.flags */
  +#define KVM_GET_HTAB_BOLTED_ONLY  ((__u64)0x1)
  +#define KVM_GET_HTAB_WRITE((__u64)0x2)
  +
  +The `start_index' field gives the index in the HPT of the entry at
  +which to start reading.  It is ignored when writing.
  +
  +Reads on the fd will initially supply information about all
  +interesting HPT entries.  Interesting entries are those with the
  +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
  +all entries.  When the end of the HPT is reached, the read() will
  +return.  
 
 What happens if the read buffer is smaller than the HPT size?
 
 That's fine; the read stops when it has filled the buffer and a
 subsequent read will continue from where the previous one finished.
 
 What happens if the read buffer size is not a multiple of entry size?
 
 Then we don't use the last few bytes of the buffer.  The read() call
 returns the number of bytes that were filled in, of course.  In any
 case, the header size is 8 bytes and the HPT entry size is 16 bytes,
 so the number of bytes filled in won't necessarily be a multiple of 16
 bytes.

That's sane and expected, but it should be documented.

 
 Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
 to change).
 
 No.

This forces userspace to dedicate a thread for the HPT.

 
  If read() is called again on the fd, it will start again from
  +the beginning of the HPT, but will only return HPT entries that have
  +changed since they were last read.
  +
  +Data read or written is structured as a header (8 bytes) followed by a
  +series of valid HPT entries (16 bytes) each.  The header indicates how
  +many valid HPT entries there are and how many invalid entries follow
  +the valid entries.  The invalid entries are not represented explicitly
  +in the stream.  The header format is:
  +
  +struct kvm_get_htab_header {
  +  __u32   index;
  +  __u16   n_valid;
  +  __u16   n_invalid;
  +};
 
 This structure forces the kernel to return entries sequentially.  Will
 this block changing the data structure in the future?  Or is the
 hardware spec sufficiently strict that such changes are not realistic?
 
 By data structure, do you mean the stream format on the file
 descriptor, or the HPT structure?  If we want a new stream format,
 then we would define a bit in the flags field of struct
 kvm_get_htab_fd to mean I want the new stream format.  The code
 fails the ioctl if any unknown flag bits are set, so a new userspace
 that wants to use the new format could then detect that it is running
 on an old kernel and fall back to the old format.
 
 The HPT entry format is very unlikely to change in size or basic
 layout (though the architects do redefine some of the bits
 occasionally).

I meant the internal data structure that holds HPT entries.

I guess I don't understand the index.  Do we expect changes to be in
contiguous ranges?  And invalid entries to be contiguous as well?  That
doesn't fit with how hash tables work.  Does the index represent the
position of the entry within the table, or something else?


 
  +
  +Writes to the fd create HPT entries starting at the index given in the
  +header; first `n_valid' valid entries with contents from the data
  +written, then `n_invalid' invalid entries, invalidating any previously
  +valid entries found.
 
 This scheme is a clever, original, and very interesting approach to live
 migration.  That doesn't necessarily mean a NAK, we should see if it
 makes sense for other migration APIs as well (we currently have
 difficulties migrating very large/wide guests).
 
 What is the typical number of entries in the HPT?  Do you have estimates
 of the change rate?
 
 Typically the HPT would have about a million entries, i.e. it would be
 16MiB in size.  The usual guideline is to make it about 1/64 of the
 maximum amount of RAM the guest could ever have, rounded up to a power
 of two, although we often run with less, say 1/128 or even 1/256.

16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
it warrant a live migration protocol?

 Because it is

Re: [PATCH] mm: compaction: Correct the nr_strict_isolated check for CMA

2012-10-16 Thread Minchan Kim

On Tue, Oct 16, 2012 at 09:39:27AM +0100, Mel Gorman wrote:
 Thierry reported that the iron out patch for isolate_freepages_block()
 had problems due to the strict check being too strict with mm: compaction:
 Iron out isolate_freepages_block() and isolate_freepages_range() -fix1.
 It's possible that more pages than necessary are isolated but the check
 still fails and I missed that this fix was not picked up before RC1. This
 same problem has been identified in 3.7-RC1 by Tony Prisk and should be
 addressed by the following patch.
 
 Signed-off-by: Mel Gorman mgor...@suse.de
 Tested-by: Tony Prisk li...@prisktech.co.nz
Acked-by: Minchan Kim minc...@kernel.org

-- 
Kind Regards,
Minchan Kim
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Using PCI config space to indicate config location

2012-10-16 Thread Rusty Russell

Michael S. Tsirkin m...@redhat.com writes:
 On Fri, Oct 12, 2012 at 08:21:50PM +1030, Rusty Russell wrote:
 Michael S. Tsirkin m...@redhat.com writes:
  On Fri, Oct 12, 2012 at 08:59:36AM +1030, Rusty Russell wrote:
   For writes, the standard seems to be a commit latch.  We could abuse 
   the
   generation count for this: the driver writes to it to commit config
   changes.
  
   I think this will work. There are a couple of things that bother me:
  
   This assumes read accesses have no side effects, and these are 
   sometimes handy.
   Also the semantics for write aren't very clear to me.
   I guess device must buffer data until generation count write?
   This assumes the device has a buffer to store writes,
   and it must track each byte written. I kind of dislike this
   tracking of accessed bytes. Also, device would need to resolve conflicts
   if any in some device specific way.
  
  It should be trivial to implement: you keep a scratch copy of the config
  space, and copy it to the master copy when they hit the latch.
  
  Implementation of this will show whether I've missed anything here, I
  think.
 
  What I refer to: what happens if driver does:
  - write offset 1
  - write offset 3
  - hit commit latch
 
 - nothing
 - nothing
 - effect of offset 1 and offset 3 writes

 OK so this means that you also need to track which bytes where written
 in order to know to skip byte 2.
 This is what I referred to. If instead we ask driver to specify
 offset/length explicitly device only needs to remember that.

I was assuming the implementation would keep two complete copies of the
config space: writes go to the scratch version, which gets copied to the
master version upon latch write.

But I do wonder if we should just skip this for now, since we don't
have any immediate need.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Using PCI config space to indicate config location

2012-10-16 Thread Michael S. Tsirkin

On Tue, Oct 16, 2012 at 11:45:41PM +1030, Rusty Russell wrote:
 Michael S. Tsirkin m...@redhat.com writes:
  On Fri, Oct 12, 2012 at 08:21:50PM +1030, Rusty Russell wrote:
  Michael S. Tsirkin m...@redhat.com writes:
   On Fri, Oct 12, 2012 at 08:59:36AM +1030, Rusty Russell wrote:
For writes, the standard seems to be a commit latch.  We could abuse 
the
generation count for this: the driver writes to it to commit config
changes.
   
I think this will work. There are a couple of things that bother me:
   
This assumes read accesses have no side effects, and these are 
sometimes handy.
Also the semantics for write aren't very clear to me.
I guess device must buffer data until generation count write?
This assumes the device has a buffer to store writes,
and it must track each byte written. I kind of dislike this
tracking of accessed bytes. Also, device would need to resolve 
conflicts
if any in some device specific way.
   
   It should be trivial to implement: you keep a scratch copy of the config
   space, and copy it to the master copy when they hit the latch.
   
   Implementation of this will show whether I've missed anything here, I
   think.
  
   What I refer to: what happens if driver does:
   - write offset 1
   - write offset 3
   - hit commit latch
  
  - nothing
  - nothing
  - effect of offset 1 and offset 3 writes
 
  OK so this means that you also need to track which bytes where written
  in order to know to skip byte 2.
  This is what I referred to. If instead we ask driver to specify
  offset/length explicitly device only needs to remember that.
 
 I was assuming the implementation would keep two complete copies of the
 config space: writes go to the scratch version, which gets copied to the
 master version upon latch write.

Yes but config space has some host modifiable registers too.
So host needs to be careful to avoid overwriting these.

If accesses have side effects that of course breaks too ...

 But I do wonder if we should just skip this for now, since we don't
 have any immediate need.
 
 Cheers,
 Rusty.

MAC setting from guest needs this right now, no?

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Avi Kivity

On 10/16/2012 01:06 PM, Alexander Graf wrote:
 
 On 16.10.2012, at 13:01, Avi Kivity wrote:
 
 On 10/16/2012 12:59 PM, Alexander Graf wrote:
 
 On 16.10.2012, at 12:56, Avi Kivity wrote:
 
 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an 
 in-kernel
 irqchip which we don't have yet.
 
 It's not strictly required.  You have an interrupt line leading to the
 core, no?  You could have your irqfd trigger that.
 
 The irqfd code in KVM directly triggers the in-kernel irqchip. That's what 
 this patch set is cleaning up: enable you to build ioeventfd support 
 without in-kernel irqchip irqfd support.
 
 Vhost can work without in-kernel irqchip too, by listening to an eventfd 
 (iiuc) in user space. That path works just fine with these patches applied.
 
 That's all true but it wasn't my point.  I was asking whether you can
 enable irqfd without in-kernel irqchip.  If your irq input is edge
 triggered then it's trivial.  If it's level triggered then you can use
 the new resampling irqfd.
 
 I'm not sure I fully grasp what you're trying to say :).
 
 We have a single interrupt line on the core. So whenever any external 
 interrupt gets injected (same thing for MSI), we need to go to the MPIC / 
 XICS and ask it which line is active.

Couldn't you attach that payload to the irqfd?  On x86 an irqfd is
associated with a gsi, and a gsi with extra information, including all
that is needed to queue an MSI.

 
 So yes, we could create a direct fd channel between vhost and the user space 
 MPIC, but it wouldn't buy us anything. The interrupt injection path would be 
 as long as it is with the current mechanism.
 

If there is a lot of prioritization and/or queuing logic, then yes.  But
what about MSI?  Doesn't that have a direct path?

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] vfio-pci: Add KVM INTx acceleration

2012-10-16 Thread Alex Williamson

On Tue, 2012-10-16 at 08:39 +0200, Michael S. Tsirkin wrote:
 On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote:
  This makes use of the new level irqfd support enabling bypass of
  qemu userspace both on INTx injection and unmask.  This significantly
  boosts the performance of devices making use of legacy interrupts.
  
  Signed-off-by: Alex Williamson alex.william...@redhat.com
  ---
  
  My INTx routing workaround below will probably raise some eyebrows,
  but I don't feel it's worth subjecting users to core dumps if they
  want to try vfio-pci on new platforms.  INTx routing is part of some
  larger plan, but until that plan materializes we have to try to avoid
  the API unless we think there's a good chance it might be there.
  I'll accept the maintenance of updating a whitelist in the interim.
  Thanks,
  
  Alex
  
   hw/vfio_pci.c |  224 
  +
   1 file changed, 224 insertions(+)
  
  diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
  index 639371e..777a5f8 100644
  --- a/hw/vfio_pci.c
  +++ b/hw/vfio_pci.c
  @@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, 
  uint32_t addr, int len);
   static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled);
   
   /*
  + * PCI code refuses to make it possible to probe whether the chipset
  + * supports pci_device_route_intx_to_irq() and booby traps the call
  + * to assert if doesn't.  For us, this is just an optimization, so
  + * only enable it when we know it's present.  Unfortunately PCIBus is
  + * private, so we can't just look at the function pointer.
  + */
  +static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev)
  +{
  +#ifdef CONFIG_KVM
  +BusState *bus = qdev_get_parent_bus(pdev-qdev);
  +DeviceState *dev;
  +
  +if (!kvm_irqchip_in_kernel() ||
  +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
  +   return false;
  +}
 
 
 Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE?
 Also for KVM_IRQFD_FLAG_RESAMPLE.

I posted the patch for that separately yesterday.  I'll only request a
pull once that's in.

  +
  +for (; bus-parent; bus = qdev_get_parent_bus(dev)) {
  +
  +dev = bus-parent;
  +
  +if (!strncmp(i440FX-pcihost, object_get_typename(OBJECT(dev)), 
  14)) {
  +return true;
  +}
  +}
  +
  +error_report(vfio-pci: VM chipset does not support INTx routing, 
  + using slow INTx mode\n);
 
 When does this code trigger? It seems irqchip implies piix ATM -
 is this just dead code?

Unused, but not unnecessary.  Another chipset is under development,
which means very quickly irqchip will not imply piix.  Likewise irqfd
support is being added to other architectures, so I don't know how long
the kvm specific tests will hold up.  Testing for a specific chipset
could of course be avoided if we were willing to support:

bool pci_device_intx_route_supported(PCIDevice *pdev)

or the NOROUTE option I posted previously.

  +#endif
  +return false;
  +}
  +
  +static PCIINTxRoute vfio_pci_device_route_intx_to_irq(PCIDevice *pdev, int 
  pin)
  +{
  +if (!vfio_pci_bus_has_intx_route(pdev)) {
  +return (PCIINTxRoute) { .mode = PCI_INTX_DISABLED, .irq = -1 };
  +}
  +
  +return pci_device_route_intx_to_irq(pdev, pin);
  +}
  +
  +static bool vfio_pci_intx_route_changed(PCIINTxRoute *old, PCIINTxRoute 
  *new)
  +{
  +return old-mode != new-mode || old-irq != new-irq;
  +}
  +
 
 Didn't you add an API for this? It's on pci branch but I can drop
 it if not needed.

I did and I'll switch to it when available, but I have no idea when that
will be, so I've hedged my bets by re-implementing it here.  2 week+
turnover for a patch makes it difficult to coordinate dependent changes
on short qemu release cycles.

  +/*
* Common VFIO interrupt disable
*/
   static void vfio_disable_irqindex(VFIODevice *vdev, int index)
  @@ -185,6 +232,21 @@ static void vfio_unmask_intx(VFIODevice *vdev)
   ioctl(vdev-fd, VFIO_DEVICE_SET_IRQS, irq_set);
   }
   
  +#ifdef CONFIG_KVM
  +static void vfio_mask_intx(VFIODevice *vdev)
  +{
  +struct vfio_irq_set irq_set = {
  +.argsz = sizeof(irq_set),
  +.flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK,
  +.index = VFIO_PCI_INTX_IRQ_INDEX,
  +.start = 0,
  +.count = 1,
  +};
  +
  +ioctl(vdev-fd, VFIO_DEVICE_SET_IRQS, irq_set);
  +}
  +#endif
  +
   /*
* Disabling BAR mmaping can be slow, but toggling it around INTx can
* also be a huge overhead.  We try to get the best of both worlds by
  @@ -248,6 +310,161 @@ static void vfio_eoi(VFIODevice *vdev)
   vfio_unmask_intx(vdev);
   }
   
  +static void vfio_enable_intx_kvm(VFIODevice *vdev)
  +{
  +#ifdef CONFIG_KVM
  +struct kvm_irqfd irqfd = {
  +.fd = event_notifier_get_fd(vdev-intx.interrupt),
  +.gsi = vdev-intx.route.irq,
  +.flags =

Re: Using PCI config space to indicate config location

2012-10-16 Thread Rusty Russell

Michael S. Tsirkin m...@redhat.com writes:
 On Tue, Oct 16, 2012 at 11:45:41PM +1030, Rusty Russell wrote:
 Michael S. Tsirkin m...@redhat.com writes:
 I was assuming the implementation would keep two complete copies of the
 config space: writes go to the scratch version, which gets copied to the
 master version upon latch write.

 Yes but config space has some host modifiable registers too.
 So host needs to be careful to avoid overwriting these.

 If accesses have side effects that of course breaks too ...

Yes.


 But I do wonder if we should just skip this for now, since we don't
 have any immediate need.
 
 Cheers,
 Rusty.

 MAC setting from guest needs this right now, no?

Ah, I missed that in my table:

Driver  Config   Device changesDriver writes... after init?
net YY NN
block   YY YY
console YY NN
rng NN NN
balloon YY YY
scsiYN YN
9p  YN NN

First line should be:

net YY YN

So we could add a new cvq command (eg. #define VIRTIO_NET_CTRL_MAC_SET
1) and a VIRTIO_NET_F_CVQ_MAC_SET feature.  Or go ahead with the
latching scheme (which doesn't really help other busses).

Cheers,
Rusty.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] vfio-pci: Add KVM INTx acceleration

2012-10-16 Thread Michael S. Tsirkin

On Tue, Oct 16, 2012 at 07:51:43AM -0600, Alex Williamson wrote:
 On Tue, 2012-10-16 at 08:39 +0200, Michael S. Tsirkin wrote:
  On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote:
   This makes use of the new level irqfd support enabling bypass of
   qemu userspace both on INTx injection and unmask.  This significantly
   boosts the performance of devices making use of legacy interrupts.
   
   Signed-off-by: Alex Williamson alex.william...@redhat.com
   ---
   
   My INTx routing workaround below will probably raise some eyebrows,
   but I don't feel it's worth subjecting users to core dumps if they
   want to try vfio-pci on new platforms.  INTx routing is part of some
   larger plan, but until that plan materializes we have to try to avoid
   the API unless we think there's a good chance it might be there.
   I'll accept the maintenance of updating a whitelist in the interim.
   Thanks,
   
   Alex
   
hw/vfio_pci.c |  224 
   +
1 file changed, 224 insertions(+)
   
   diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
   index 639371e..777a5f8 100644
   --- a/hw/vfio_pci.c
   +++ b/hw/vfio_pci.c
   @@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice 
   *pdev, uint32_t addr, int len);
static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled);

/*
   + * PCI code refuses to make it possible to probe whether the chipset
   + * supports pci_device_route_intx_to_irq() and booby traps the call
   + * to assert if doesn't.  For us, this is just an optimization, so
   + * only enable it when we know it's present.  Unfortunately PCIBus is
   + * private, so we can't just look at the function pointer.
   + */
   +static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev)
   +{
   +#ifdef CONFIG_KVM
   +BusState *bus = qdev_get_parent_bus(pdev-qdev);
   +DeviceState *dev;
   +
   +if (!kvm_irqchip_in_kernel() ||
   +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
   + return false;
   +}
  
  
  Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE?
  Also for KVM_IRQFD_FLAG_RESAMPLE.
 
 I posted the patch for that separately yesterday.  I'll only request a
 pull once that's in.

OK missed that. In the future, might be a good idea to note dependencies
in the patchset or repost them as part of patchset with appropriate
tagging.

   +
   +for (; bus-parent; bus = qdev_get_parent_bus(dev)) {
   +
   +dev = bus-parent;
   +
   +if (!strncmp(i440FX-pcihost, object_get_typename(OBJECT(dev)), 
   14)) {
   +return true;
   +}
   +}
   +
   +error_report(vfio-pci: VM chipset does not support INTx routing, 
   + using slow INTx mode\n);
  
  When does this code trigger? It seems irqchip implies piix ATM -
  is this just dead code?
 
 Unused, but not unnecessary.  Another chipset is under development,
 which means very quickly irqchip will not imply piix.

So this is for purposes of an out of tree stuff, let's
keep these hacks out of tree too. My guess is
q35 can just be added with pci_device_route_intx straight away.

  Likewise irqfd
 support is being added to other architectures, so I don't know how long
 the kvm specific tests will hold up.  Testing for a specific chipset
 could of course be avoided if we were willing to support:
 
 bool pci_device_intx_route_supported(PCIDevice *pdev)
 
 or the NOROUTE option I posted previously.

This is just moving the pain to users who will
get bad performance and will have to debug it. Injecting
through userspace is slow, new architectures should
simply add irqfd straight away instead of adding
work arounds in userspace.

This is exactly why we have the assert in pci core.

   +#endif
   +return false;
   +}
   +
   +static PCIINTxRoute vfio_pci_device_route_intx_to_irq(PCIDevice *pdev, 
   int pin)
   +{
   +if (!vfio_pci_bus_has_intx_route(pdev)) {
   +return (PCIINTxRoute) { .mode = PCI_INTX_DISABLED, .irq = -1 };
   +}
   +
   +return pci_device_route_intx_to_irq(pdev, pin);
   +}
   +
   +static bool vfio_pci_intx_route_changed(PCIINTxRoute *old, PCIINTxRoute 
   *new)
   +{
   +return old-mode != new-mode || old-irq != new-irq;
   +}
   +
  
  Didn't you add an API for this? It's on pci branch but I can drop
  it if not needed.
 
 I did and I'll switch to it when available, but I have no idea when that
 will be, so I've hedged my bets by re-implementing it here.  2 week+
 turnover for a patch makes it difficult to coordinate dependent changes
 on short qemu release cycles.

It's available on pci branch, please base on that instead of master.
Yes I merge at about 2 week intervals but the point is using
subtrees. You do not need to block waiting for stuff to land
in master.

   +/*
 * Common VFIO interrupt disable
 */
static void vfio_disable_irqindex(VFIODevice *vdev, int index)
   @@ -185,6 +232,21 @@ static void

Re: [PATCH] vfio-pci: Add KVM INTx acceleration

2012-10-16 Thread Alex Williamson

On Tue, 2012-10-16 at 16:14 +0200, Michael S. Tsirkin wrote:
 On Tue, Oct 16, 2012 at 07:51:43AM -0600, Alex Williamson wrote:
  On Tue, 2012-10-16 at 08:39 +0200, Michael S. Tsirkin wrote:
   On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote:
This makes use of the new level irqfd support enabling bypass of
qemu userspace both on INTx injection and unmask.  This significantly
boosts the performance of devices making use of legacy interrupts.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

My INTx routing workaround below will probably raise some eyebrows,
but I don't feel it's worth subjecting users to core dumps if they
want to try vfio-pci on new platforms.  INTx routing is part of some
larger plan, but until that plan materializes we have to try to avoid
the API unless we think there's a good chance it might be there.
I'll accept the maintenance of updating a whitelist in the interim.
Thanks,

Alex

 hw/vfio_pci.c |  224 
+
 1 file changed, 224 insertions(+)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 639371e..777a5f8 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice 
*pdev, uint32_t addr, int len);
 static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled);
 
 /*
+ * PCI code refuses to make it possible to probe whether the chipset
+ * supports pci_device_route_intx_to_irq() and booby traps the call
+ * to assert if doesn't.  For us, this is just an optimization, so
+ * only enable it when we know it's present.  Unfortunately PCIBus is
+ * private, so we can't just look at the function pointer.
+ */
+static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev)
+{
+#ifdef CONFIG_KVM
+BusState *bus = qdev_get_parent_bus(pdev-qdev);
+DeviceState *dev;
+
+if (!kvm_irqchip_in_kernel() ||
+!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
+   return false;
+}
   
   
   Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE?
   Also for KVM_IRQFD_FLAG_RESAMPLE.
  
  I posted the patch for that separately yesterday.  I'll only request a
  pull once that's in.
 
 OK missed that. In the future, might be a good idea to note dependencies
 in the patchset or repost them as part of patchset with appropriate
 tagging.
 
+
+for (; bus-parent; bus = qdev_get_parent_bus(dev)) {
+
+dev = bus-parent;
+
+if (!strncmp(i440FX-pcihost, 
object_get_typename(OBJECT(dev)), 14)) {
+return true;
+}
+}
+
+error_report(vfio-pci: VM chipset does not support INTx routing, 
+ using slow INTx mode\n);
   
   When does this code trigger? It seems irqchip implies piix ATM -
   is this just dead code?
  
  Unused, but not unnecessary.  Another chipset is under development,
  which means very quickly irqchip will not imply piix.
 
 So this is for purposes of an out of tree stuff, let's
 keep these hacks out of tree too. My guess is
 q35 can just be added with pci_device_route_intx straight away.
 
   Likewise irqfd
  support is being added to other architectures, so I don't know how long
  the kvm specific tests will hold up.  Testing for a specific chipset
  could of course be avoided if we were willing to support:
  
  bool pci_device_intx_route_supported(PCIDevice *pdev)
  
  or the NOROUTE option I posted previously.
 
 This is just moving the pain to users who will
 get bad performance and will have to debug it. Injecting
 through userspace is slow, new architectures should
 simply add irqfd straight away instead of adding
 work arounds in userspace.
 
 This is exactly why we have the assert in pci core.

Let's compare user experience:

As coded here:

- Error message, only runs slower if the driver actually uses INTx.
Result: file bug, continue using vfio-pci, maybe do something useful,
maybe find new bugs to file.

Blindly calling PCI code w/ assert:

- Assert.  Result: file bug, full stop.

Maybe I do too much kernel programming, but I don't take asserts lightly
and prefer they be saved for something is really broken and I can't
safely continue.  This is not such a case.

+#endif
+return false;
+}
+
+static PCIINTxRoute vfio_pci_device_route_intx_to_irq(PCIDevice *pdev, 
int pin)
+{
+if (!vfio_pci_bus_has_intx_route(pdev)) {
+return (PCIINTxRoute) { .mode = PCI_INTX_DISABLED, .irq = -1 };
+}
+
+return pci_device_route_intx_to_irq(pdev, pin);
+}
+
+static bool vfio_pci_intx_route_changed(PCIINTxRoute *old, 
PCIINTxRoute *new)
+{
+return old-mode != new-mode || old-irq != new-irq;
+}
+
   
   Didn't you add an API for this? It's

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Alexander Graf


On 10/16/2012 03:47 PM, Avi Kivity wrote:

On 10/16/2012 01:06 PM, Alexander Graf wrote:

On 16.10.2012, at 13:01, Avi Kivity wrote:


On 10/16/2012 12:59 PM, Alexander Graf wrote:

On 16.10.2012, at 12:56, Avi Kivity wrote:


On 10/15/2012 02:02 PM, Alexander Graf wrote:

In order to support vhost, we need to be able to support ioeventfd.

This patch set adds support for ioeventfd to PPC and makes it possible to
do so without implementing irqfd along the way, as it requires an in-kernel
irqchip which we don't have yet.

It's not strictly required.  You have an interrupt line leading to the
core, no?  You could have your irqfd trigger that.

The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this 
patch set is cleaning up: enable you to build ioeventfd support without 
in-kernel irqchip irqfd support.

Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) 
in user space. That path works just fine with these patches applied.

That's all true but it wasn't my point.  I was asking whether you can
enable irqfd without in-kernel irqchip.  If your irq input is edge
triggered then it's trivial.  If it's level triggered then you can use
the new resampling irqfd.

I'm not sure I fully grasp what you're trying to say :).

We have a single interrupt line on the core. So whenever any external interrupt 
gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it 
which line is active.

Couldn't you attach that payload to the irqfd?  On x86 an irqfd is
associated with a gsi, and a gsi with extra information, including all
that is needed to queue an MSI.


So yes, we could create a direct fd channel between vhost and the user space 
MPIC, but it wouldn't buy us anything. The interrupt injection path would be as 
long as it is with the current mechanism.


If there is a lot of prioritization and/or queuing logic, then yes.  But
what about MSI?  Doesn't that have a direct path?


Nope. Well, yes, in a certain special case where the MPIC pushes the 
interrupt vector on interrupt delivery into a special register. But not 
for the normal case.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

KVM call minutes for 2012-10-16

2012-10-16 Thread Juan Quintela


2012-10-16
--

- cpu as dev
  making qdev available to all parts of qemu is the best solution (aliguory)
  how does linux-user people think about it?
  Eduardo will follow
  Objections:
  * conceptually it makes no sense to have devices on *-user
  * desire of linux-user to maintain the code relativelly small
* really they want less dependencies
  Who is going to merge it?
   either anthony or andreas through cpu tree

- boot and -bootindex (agraf)
  should this overwrote CMOS on PC (anthony)
  Always writing nvram makes other things work.
  -boot c : means boot with legacy device
  with uefi there is a boot variable
  -boot c don't make sense out of x86
   * make the list of bootable devices visible to the user
 and store in nvram?

  At this point I got lost, and Anthony/Alex needs to fill the details
  of the proposals.

- Command line compatibility patches (marcelo)
  merge then directly or through qdev
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM ept flush

2012-10-16 Thread Avi Kivity

On 10/16/2012 01:57 PM, Rohan Sharma wrote:
 Is there a way to flush ept entries in qemu-kvm.

No.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] vfio-pci: Add KVM INTx acceleration

2012-10-16 Thread Michael S. Tsirkin

On Tue, Oct 16, 2012 at 08:48:04AM -0600, Alex Williamson wrote:
 On Tue, 2012-10-16 at 16:14 +0200, Michael S. Tsirkin wrote:
  On Tue, Oct 16, 2012 at 07:51:43AM -0600, Alex Williamson wrote:
   On Tue, 2012-10-16 at 08:39 +0200, Michael S. Tsirkin wrote:
On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote:
 This makes use of the new level irqfd support enabling bypass of
 qemu userspace both on INTx injection and unmask.  This significantly
 boosts the performance of devices making use of legacy interrupts.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
 My INTx routing workaround below will probably raise some eyebrows,
 but I don't feel it's worth subjecting users to core dumps if they
 want to try vfio-pci on new platforms.  INTx routing is part of some
 larger plan, but until that plan materializes we have to try to avoid
 the API unless we think there's a good chance it might be there.
 I'll accept the maintenance of updating a whitelist in the interim.
 Thanks,
 
 Alex
 
  hw/vfio_pci.c |  224 
 +
  1 file changed, 224 insertions(+)
 
 diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
 index 639371e..777a5f8 100644
 --- a/hw/vfio_pci.c
 +++ b/hw/vfio_pci.c
 @@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice 
 *pdev, uint32_t addr, int len);
  static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled);
  
  /*
 + * PCI code refuses to make it possible to probe whether the chipset
 + * supports pci_device_route_intx_to_irq() and booby traps the call
 + * to assert if doesn't.  For us, this is just an optimization, so
 + * only enable it when we know it's present.  Unfortunately PCIBus is
 + * private, so we can't just look at the function pointer.
 + */
 +static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev)
 +{
 +#ifdef CONFIG_KVM
 +BusState *bus = qdev_get_parent_bus(pdev-qdev);
 +DeviceState *dev;
 +
 +if (!kvm_irqchip_in_kernel() ||
 +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
 + return false;
 +}


Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE?
Also for KVM_IRQFD_FLAG_RESAMPLE.
   
   I posted the patch for that separately yesterday.  I'll only request a
   pull once that's in.
  
  OK missed that. In the future, might be a good idea to note dependencies
  in the patchset or repost them as part of patchset with appropriate
  tagging.
  
 +
 +for (; bus-parent; bus = qdev_get_parent_bus(dev)) {
 +
 +dev = bus-parent;
 +
 +if (!strncmp(i440FX-pcihost, 
 object_get_typename(OBJECT(dev)), 14)) {
 +return true;
 +}
 +}
 +
 +error_report(vfio-pci: VM chipset does not support INTx 
 routing, 
 + using slow INTx mode\n);

When does this code trigger? It seems irqchip implies piix ATM -
is this just dead code?
   
   Unused, but not unnecessary.  Another chipset is under development,
   which means very quickly irqchip will not imply piix.
  
  So this is for purposes of an out of tree stuff, let's
  keep these hacks out of tree too. My guess is
  q35 can just be added with pci_device_route_intx straight away.
  
Likewise irqfd
   support is being added to other architectures, so I don't know how long
   the kvm specific tests will hold up.  Testing for a specific chipset
   could of course be avoided if we were willing to support:
   
   bool pci_device_intx_route_supported(PCIDevice *pdev)
   
   or the NOROUTE option I posted previously.
  
  This is just moving the pain to users who will
  get bad performance and will have to debug it. Injecting
  through userspace is slow, new architectures should
  simply add irqfd straight away instead of adding
  work arounds in userspace.
  
  This is exactly why we have the assert in pci core.
 
 Let's compare user experience:
 
 As coded here:
 
 - Error message, only runs slower if the driver actually uses INTx.
 Result: file bug, continue using vfio-pci, maybe do something useful,
 maybe find new bugs to file.
 
 Blindly calling PCI code w/ assert:
 
 - Assert.  Result: file bug, full stop.
 
 Maybe I do too much kernel programming, but I don't take asserts lightly
 and prefer they be saved for something is really broken and I can't
 safely continue.  This is not such a case.

There's no chance we ship e.g. q35 by mistake without this API: since
there is no way this specific assert can be missed in even basic
testing:

So I see it differently:

As coded here:
chipset authors get lazy and do not implement API.
bad performance for all users.

With assert:
chipset authors implement necessary API.
good performance for all

Re: [PATCH] vfio-pci: Add KVM INTx acceleration

2012-10-16 Thread Alex Williamson

On Tue, 2012-10-16 at 17:08 +0200, Michael S. Tsirkin wrote:
 On Tue, Oct 16, 2012 at 08:48:04AM -0600, Alex Williamson wrote:
  On Tue, 2012-10-16 at 16:14 +0200, Michael S. Tsirkin wrote:
   On Tue, Oct 16, 2012 at 07:51:43AM -0600, Alex Williamson wrote:
On Tue, 2012-10-16 at 08:39 +0200, Michael S. Tsirkin wrote:
 On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote:
  This makes use of the new level irqfd support enabling bypass of
  qemu userspace both on INTx injection and unmask.  This 
  significantly
  boosts the performance of devices making use of legacy interrupts.
  
  Signed-off-by: Alex Williamson alex.william...@redhat.com
  ---
  
  My INTx routing workaround below will probably raise some eyebrows,
  but I don't feel it's worth subjecting users to core dumps if they
  want to try vfio-pci on new platforms.  INTx routing is part of some
  larger plan, but until that plan materializes we have to try to 
  avoid
  the API unless we think there's a good chance it might be there.
  I'll accept the maintenance of updating a whitelist in the interim.
  Thanks,
  
  Alex
  
   hw/vfio_pci.c |  224 
  +
   1 file changed, 224 insertions(+)
  
  diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
  index 639371e..777a5f8 100644
  --- a/hw/vfio_pci.c
  +++ b/hw/vfio_pci.c
  @@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice 
  *pdev, uint32_t addr, int len);
   static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled);
   
   /*
  + * PCI code refuses to make it possible to probe whether the 
  chipset
  + * supports pci_device_route_intx_to_irq() and booby traps the call
  + * to assert if doesn't.  For us, this is just an optimization, so
  + * only enable it when we know it's present.  Unfortunately PCIBus 
  is
  + * private, so we can't just look at the function pointer.
  + */
  +static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev)
  +{
  +#ifdef CONFIG_KVM
  +BusState *bus = qdev_get_parent_bus(pdev-qdev);
  +DeviceState *dev;
  +
  +if (!kvm_irqchip_in_kernel() ||
  +!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
  +   return false;
  +}
 
 
 Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE?
 Also for KVM_IRQFD_FLAG_RESAMPLE.

I posted the patch for that separately yesterday.  I'll only request a
pull once that's in.
   
   OK missed that. In the future, might be a good idea to note dependencies
   in the patchset or repost them as part of patchset with appropriate
   tagging.
   
  +
  +for (; bus-parent; bus = qdev_get_parent_bus(dev)) {
  +
  +dev = bus-parent;
  +
  +if (!strncmp(i440FX-pcihost, 
  object_get_typename(OBJECT(dev)), 14)) {
  +return true;
  +}
  +}
  +
  +error_report(vfio-pci: VM chipset does not support INTx 
  routing, 
  + using slow INTx mode\n);
 
 When does this code trigger? It seems irqchip implies piix ATM -
 is this just dead code?

Unused, but not unnecessary.  Another chipset is under development,
which means very quickly irqchip will not imply piix.
   
   So this is for purposes of an out of tree stuff, let's
   keep these hacks out of tree too. My guess is
   q35 can just be added with pci_device_route_intx straight away.
   
 Likewise irqfd
support is being added to other architectures, so I don't know how long
the kvm specific tests will hold up.  Testing for a specific chipset
could of course be avoided if we were willing to support:

bool pci_device_intx_route_supported(PCIDevice *pdev)

or the NOROUTE option I posted previously.
   
   This is just moving the pain to users who will
   get bad performance and will have to debug it. Injecting
   through userspace is slow, new architectures should
   simply add irqfd straight away instead of adding
   work arounds in userspace.
   
   This is exactly why we have the assert in pci core.
  
  Let's compare user experience:
  
  As coded here:
  
  - Error message, only runs slower if the driver actually uses INTx.
  Result: file bug, continue using vfio-pci, maybe do something useful,
  maybe find new bugs to file.
  
  Blindly calling PCI code w/ assert:
  
  - Assert.  Result: file bug, full stop.
  
  Maybe I do too much kernel programming, but I don't take asserts lightly
  and prefer they be saved for something is really broken and I can't
  safely continue.  This is not such a case.
 
 There's no chance we ship e.g. q35 by mistake without this API: since
 there is no way this specific assert can be missed in even basic
 testing:
 
 So I see it differently:

Re: [PATCH] vfio-pci: Add KVM INTx acceleration

2012-10-16 Thread Michael S. Tsirkin

On Tue, Oct 16, 2012 at 09:13:15AM -0600, Alex Williamson wrote:
  There's no chance we ship e.g. q35 by mistake without this API: since
  there is no way this specific assert can be missed in even basic
  testing:
  
  So I see it differently:
  
  As coded here:
  chipset authors get lazy and do not implement API.
  bad performance for all users.
  
  With assert:
  chipset authors implement necessary API.
  good performance for all users.
 
 I prefer a carrot, not a whip.  Thanks,
 
 Alex
 

It's not just that.
Problem is performance testing/fixing is hard.
Catching and fixing asserts is easy.
So working around buggy qemu code really backfires
as it reverses the motivation for writing well performing
code. History proves me right: for each API change where
we implemented a fallback old code stayed around for years.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

I/O errors in guest OS after repeated migration

2012-10-16 Thread Guido Winkelmann

Hi,

I'm experiencing I/O errors in a guest machine after migrating it from one 
host to another, and then back to the original host. After doing this, I find 
the following in the dmesg output of the guest machine:

[  345.390543] end_request: I/O error, dev vda, sector 273871
[  345.391125] end_request: I/O error, dev vda, sector 273871
[  345.391705] end_request: I/O error, dev vda, sector 273871
[  345.394796] end_request: I/O error, dev vda, sector 1745983
[  345.396005] end_request: I/O error, dev vda, sector 1745983
[  346.083160] end_request: I/O error, dev vdb, sector 54528008
[  346.083179] Buffer I/O error on device dm-0, logical block 6815745
[  346.083181] lost page write due to I/O error on dm-0
[  346.083193] end_request: I/O error, dev vdb, sector 54528264
[  346.083195] Buffer I/O error on device dm-0, logical block 6815777
[  346.083197] lost page write due to I/O error on dm-0
[  346.083201] end_request: I/O error, dev vdb, sector 2056
[  346.083204] Buffer I/O error on device dm-0, logical block 1
[  346.083206] lost page write due to I/O error on dm-0
[  346.083209] Buffer I/O error on device dm-0, logical block 2
[  346.083211] lost page write due to I/O error on dm-0
[  346.083215] end_request: I/O error, dev vdb, sector 10248
[  346.083217] Buffer I/O error on device dm-0, logical block 1025
[  346.083219] lost page write due to I/O error on dm-0
[  346.091499] end_request: I/O error, dev vdb, sector 76240
[  346.091506] Buffer I/O error on device dm-0, logical block 9274
[  346.091508] lost page write due to I/O error on dm-0
[  346.091572] JBD2: Detected IO errors while flushing file data on dm-0-8
[  346.091915] end_request: I/O error, dev vdb, sector 38017360
[  346.091956] Aborting journal on device dm-0-8.
[  346.092557] end_request: I/O error, dev vdb, sector 38012928
[  346.092566] Buffer I/O error on device dm-0, logical block 4751360
[  346.092569] lost page write due to I/O error on dm-0
[  346.092624] JBD2: I/O error detected when updating journal superblock for 
dm-0-8.
[  346.100940] end_request: I/O error, dev vdb, sector 2048
[  346.100948] Buffer I/O error on device dm-0, logical block 0
[  346.100952] lost page write due to I/O error on dm-0
[  346.101027] EXT4-fs error (device dm-0): ext4_journal_start_sb:327: 
Detected aborted journal
[  346.101038] EXT4-fs (dm-0): Remounting filesystem read-only
[  346.101051] EXT4-fs (dm-0): previous I/O error to superblock detected
[  346.101836] end_request: I/O error, dev vdb, sector 2048
[  346.101845] Buffer I/O error on device dm-0, logical block 0
[  346.101849] lost page write due to I/O error on dm-0
[  373.006680] end_request: I/O error, dev vda, sector 624319
[  373.007543] end_request: I/O error, dev vda, sector 624319
[  373.008327] end_request: I/O error, dev vda, sector 624319
[  374.886674] end_request: I/O error, dev vda, sector 624319
[  374.887563] end_request: I/O error, dev vda, sector 624319

The hosts are both running Fedora 17 with qemu-kvm-1.0.1-1.fc17.x86_64. The 
guest machine has been started and migrated using libvirt (0.9.11). Kernel 
version is 3.5.6-1.fc17.x86_64 on the first host and 3.5.5-2.fc17.x86_64 on 
the second.
The guest machine is on Kernel 3.3.8 and uses ext4 on its disks.

The commandline, as generated by libvirtd, looks like this:

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 
QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -S -M pc-0.15 -enable-kvm -m 1024 -smp 
1,sockets=1,cores=1,threads=1 -name migratetest2 -uuid 
ddbf11e9-387e-902b-4849-8c3067dc42a2 -nodefconfig -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/migratetest2.monitor,server,nowait
 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-reboot -no-
shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
file=/data/migratetest2_system,if=none,id=drive-virtio-
disk0,format=qcow2,cache=none -device virtio-blk-
pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-
disk0,bootindex=1 -drive file=/data/migratetest2_data-1,if=none,id=drive-
virtio-disk1,format=qcow2,cache=none -device virtio-blk-
pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -
netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-
pci,netdev=hostnet0,id=net0,mac=02:00:00:00:00:0c,bus=pci.0,addr=0x3 -vnc 
127.0.0.1:2,password -k de -vga cirrus -incoming tcp:0.0.0.0:49153 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

The second host has an ext4 filesystem mounted under /data, which it exports 
using NFSv3 over TCP to the first host, which also mounts it under /data.

So far, the problem seems reproducible: When I start another guest machine and 
do the same thing with it, the same problem happens.

Can anybody help me with this problem?

Guido

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] vfio-pci: Add KVM INTx acceleration

2012-10-16 Thread Alex Williamson

On Tue, 2012-10-16 at 17:23 +0200, Michael S. Tsirkin wrote:
 On Tue, Oct 16, 2012 at 09:13:15AM -0600, Alex Williamson wrote:
   There's no chance we ship e.g. q35 by mistake without this API: since
   there is no way this specific assert can be missed in even basic
   testing:
   
   So I see it differently:
   
   As coded here:
 chipset authors get lazy and do not implement API.
 bad performance for all users.
   
   With assert:
 chipset authors implement necessary API.
 good performance for all users.
  
  I prefer a carrot, not a whip.  Thanks,
  
  Alex
  
 
 It's not just that.
 Problem is performance testing/fixing is hard.

Getting an error_report from the driver saying it's using a slow path
and why makes that significantly easier.

 Catching and fixing asserts is easy.

Easy for who?  The user trying to test a feature?  Probably not.  Me,
who may not have access to the chipset documentation or understand the
platform?  Maybe, maybe not.

 So working around buggy qemu code really backfires
 as it reverses the motivation for writing well performing
 code. History proves me right: for each API change where
 we implemented a fallback old code stayed around for years.

Does that necessarily mean it was wrong?  How many of those API changes
added new features that may have been abandoned if the developer was
required to make sweeping changes to get their code accepted?  If not
abandoned, how much delayed?  How many land mines might we have in the
code for changes that were done incorrectly or missed?  I don't
understand why adding robustness to the API is such a contentious point,
but it's your prerogative, just as it's mine to avoid using that API
arbitrarily.  Thanks,

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] vfio-pci: Add KVM INTx acceleration

2012-10-16 Thread Michael S. Tsirkin

On Tue, Oct 16, 2012 at 10:49:38AM -0600, Alex Williamson wrote:
 On Tue, 2012-10-16 at 17:23 +0200, Michael S. Tsirkin wrote:
  On Tue, Oct 16, 2012 at 09:13:15AM -0600, Alex Williamson wrote:
There's no chance we ship e.g. q35 by mistake without this API: since
there is no way this specific assert can be missed in even basic
testing:

So I see it differently:

As coded here:
chipset authors get lazy and do not implement API.
bad performance for all users.

With assert:
chipset authors implement necessary API.
good performance for all users.
   
   I prefer a carrot, not a whip.  Thanks,
   
   Alex
   
  
  It's not just that.
  Problem is performance testing/fixing is hard.
 
 Getting an error_report from the driver saying it's using a slow path
 and why makes that significantly easier.
 
  Catching and fixing asserts is easy.
 
 Easy for who?  The user trying to test a feature?  Probably not.  Me,
 who may not have access to the chipset documentation or understand the
 platform?  Maybe, maybe not.
 
  So working around buggy qemu code really backfires
  as it reverses the motivation for writing well performing
  code. History proves me right: for each API change where
  we implemented a fallback old code stayed around for years.
 
 Does that necessarily mean it was wrong?  How many of those API changes
 added new features that may have been abandoned if the developer was
 required to make sweeping changes to get their code accepted?  If not
 abandoned, how much delayed?  How many land mines might we have in the
 code for changes that were done incorrectly or missed?  I don't
 understand why adding robustness to the API is such a contentious point,
 but it's your prerogative, just as it's mine to avoid using that API
 arbitrarily.  Thanks,
 
 Alex

Yea. All I say is, I intend to fix things so you don't need
to probe for this API. I think it does not make sense to
add a temporary API as a stopgap for this since it will solve no
actual problem. If e.g. Jason finds it hard to add this to
q35, we could add a stopgap solution for vfio.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: I/O errors in guest OS after repeated migration

2012-10-16 Thread Brian Jackson

On Tuesday, October 16, 2012 11:33:44 AM Guido Winkelmann wrote:
 Hi,
 
 I'm experiencing I/O errors in a guest machine after migrating it from one
 host to another, and then back to the original host. After doing this, I
 find the following in the dmesg output of the guest machine:
 
 [  345.390543] end_request: I/O error, dev vda, sector 273871
 [  345.391125] end_request: I/O error, dev vda, sector 273871
 [  345.391705] end_request: I/O error, dev vda, sector 273871
 [  345.394796] end_request: I/O error, dev vda, sector 1745983
 [  345.396005] end_request: I/O error, dev vda, sector 1745983
 [  346.083160] end_request: I/O error, dev vdb, sector 54528008
 [  346.083179] Buffer I/O error on device dm-0, logical block 6815745
 [  346.083181] lost page write due to I/O error on dm-0
 [  346.083193] end_request: I/O error, dev vdb, sector 54528264
 [  346.083195] Buffer I/O error on device dm-0, logical block 6815777
 [  346.083197] lost page write due to I/O error on dm-0
 [  346.083201] end_request: I/O error, dev vdb, sector 2056
 [  346.083204] Buffer I/O error on device dm-0, logical block 1
 [  346.083206] lost page write due to I/O error on dm-0
 [  346.083209] Buffer I/O error on device dm-0, logical block 2
 [  346.083211] lost page write due to I/O error on dm-0
 [  346.083215] end_request: I/O error, dev vdb, sector 10248
 [  346.083217] Buffer I/O error on device dm-0, logical block 1025
 [  346.083219] lost page write due to I/O error on dm-0
 [  346.091499] end_request: I/O error, dev vdb, sector 76240
 [  346.091506] Buffer I/O error on device dm-0, logical block 9274
 [  346.091508] lost page write due to I/O error on dm-0
 [  346.091572] JBD2: Detected IO errors while flushing file data on dm-0-8
 [  346.091915] end_request: I/O error, dev vdb, sector 38017360
 [  346.091956] Aborting journal on device dm-0-8.
 [  346.092557] end_request: I/O error, dev vdb, sector 38012928
 [  346.092566] Buffer I/O error on device dm-0, logical block 4751360
 [  346.092569] lost page write due to I/O error on dm-0
 [  346.092624] JBD2: I/O error detected when updating journal superblock
 for dm-0-8.
 [  346.100940] end_request: I/O error, dev vdb, sector 2048
 [  346.100948] Buffer I/O error on device dm-0, logical block 0
 [  346.100952] lost page write due to I/O error on dm-0
 [  346.101027] EXT4-fs error (device dm-0): ext4_journal_start_sb:327:
 Detected aborted journal
 [  346.101038] EXT4-fs (dm-0): Remounting filesystem read-only
 [  346.101051] EXT4-fs (dm-0): previous I/O error to superblock detected
 [  346.101836] end_request: I/O error, dev vdb, sector 2048
 [  346.101845] Buffer I/O error on device dm-0, logical block 0
 [  346.101849] lost page write due to I/O error on dm-0
 [  373.006680] end_request: I/O error, dev vda, sector 624319
 [  373.007543] end_request: I/O error, dev vda, sector 624319
 [  373.008327] end_request: I/O error, dev vda, sector 624319
 [  374.886674] end_request: I/O error, dev vda, sector 624319
 [  374.887563] end_request: I/O error, dev vda, sector 624319
 
 The hosts are both running Fedora 17 with qemu-kvm-1.0.1-1.fc17.x86_64. The
 guest machine has been started and migrated using libvirt (0.9.11). Kernel
 version is 3.5.6-1.fc17.x86_64 on the first host and 3.5.5-2.fc17.x86_64 on
 the second.
 The guest machine is on Kernel 3.3.8 and uses ext4 on its disks.
 
 The commandline, as generated by libvirtd, looks like this:
 
 LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
 QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -S -M pc-0.15 -enable-kvm -m 1024
 -smp 1,sockets=1,cores=1,threads=1 -name migratetest2 -uuid
 ddbf11e9-387e-902b-4849-8c3067dc42a2 -nodefconfig -nodefaults -chardev
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/migratetest2.monitor,serve
 r,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
 -no-reboot -no- shutdown -device
 piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
 file=/data/migratetest2_system,if=none,id=drive-virtio-
 disk0,format=qcow2,cache=none -device virtio-blk-
 pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-
 disk0,bootindex=1 -drive file=/data/migratetest2_data-1,if=none,id=drive-
 virtio-disk1,format=qcow2,cache=none -device virtio-blk-
 pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -
 netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-
 pci,netdev=hostnet0,id=net0,mac=02:00:00:00:00:0c,bus=pci.0,addr=0x3 -vnc
 127.0.0.1:2,password -k de -vga cirrus -incoming tcp:0.0.0.0:49153 -device
 virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6


I see qcow2 in there. Live migration of qcow2 was a new feature in 1.0. Have 
you tried other formats or different qemu/kvm versions?


 
 The second host has an ext4 filesystem mounted under /data, which it
 exports using NFSv3 over TCP to the first host, which also mounts it under
 /data.
 
 So far, the problem seems reproducible: When I start another guest machine
 and do the same thing with

[patch 08/15] x86: pvclock: generic pvclock vsyscall initialization

2012-10-16 Thread Marcelo Tosatti

Originally from Jeremy Fitzhardinge.

Introduce generic, non hypervisor specific, pvclock initialization 
routines.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/pvclock.c
===
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -17,6 +17,10 @@
 
 #include linux/kernel.h
 #include linux/percpu.h
+#include linux/notifier.h
+#include linux/sched.h
+#include linux/gfp.h
+#include linux/bootmem.h
 #include asm/pvclock.h
 
 static u8 valid_flags __read_mostly = 0;
@@ -122,3 +126,70 @@ void pvclock_read_wallclock(struct pvclo
 
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+
+static aligned_pvti_t *pvclock_vdso_info;
+
+static struct pvclock_vsyscall_time_info 
*pvclock_get_vsyscall_user_time_info(int cpu)
+{
+   if (pvclock_vdso_info == NULL) {
+   BUG();
+   return NULL;
+   }
+
+   return pvclock_vdso_info[cpu].info;
+}
+
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
+{
+   return pvclock_get_vsyscall_user_time_info(cpu)-pvti;
+}
+
+int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, void *v)
+{
+   struct task_migration_notifier *mn = v;
+   struct pvclock_vsyscall_time_info *pvti;
+
+   pvti = pvclock_get_vsyscall_user_time_info(mn-from_cpu);
+
+   if (pvti == NULL)
+   return NOTIFY_DONE;
+
+   pvti-migrate_count++;
+
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block pvclock_migrate = {
+   .notifier_call = pvclock_task_migrate,
+};
+
+/*
+ * Initialize the generic pvclock vsyscall state.  This will allocate
+ * a/some page(s) for the per-vcpu pvclock information, set up a
+ * fixmap mapping for the page(s)
+ */
+int __init pvclock_init_vsyscall(void)
+{
+   int idx;
+   unsigned int size = PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE;
+
+   pvclock_vdso_info = __alloc_bootmem(size, PAGE_SIZE, 0);
+   if (!pvclock_vdso_info)
+   return -ENOMEM;
+
+   memset(pvclock_vdso_info, 0, size);
+
+   for (idx = 0; idx = (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
+   __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
+__pa_symbol(pvclock_vdso_info) + (idx*PAGE_SIZE),
+PAGE_KERNEL_VVAR);
+   }
+
+   register_task_migration_notifier(pvclock_migrate);
+
+   return 0;
+}
+
+#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */
Index: vsyscall/arch/x86/include/asm/fixmap.h
===
--- vsyscall.orig/arch/x86/include/asm/fixmap.h
+++ vsyscall/arch/x86/include/asm/fixmap.h
@@ -19,6 +19,7 @@
 #include asm/acpi.h
 #include asm/apicdef.h
 #include asm/page.h
+#include asm/pvclock.h
 #ifdef CONFIG_X86_32
 #include linux/threads.h
 #include asm/kmap_types.h
@@ -81,6 +82,10 @@ enum fixed_addresses {
VVAR_PAGE,
VSYSCALL_HPET,
 #endif
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+   PVCLOCK_FIXMAP_BEGIN,
+   PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
+#endif
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
 #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
Index: vsyscall/arch/x86/include/asm/pvclock.h
===
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -13,6 +13,8 @@ void pvclock_read_wallclock(struct pvclo
struct pvclock_vcpu_time_info *vcpu,
struct timespec *ts);
 void pvclock_resume(void);
+int __init pvclock_init_vsyscall(void);
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
 
 /*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
@@ -85,4 +87,24 @@ unsigned __pvclock_read_cycles(const str
return version;
 }
 
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+
+struct pvclock_vsyscall_time_info {
+   struct pvclock_vcpu_time_info pvti;
+   u32 migrate_count;
+};
+
+typedef union {
+   struct pvclock_vsyscall_time_info info;
+   char pad[SMP_CACHE_BYTES];
+} aligned_pvti_t cacheline_aligned;
+
+#define PVTI_SIZE sizeof(aligned_pvti_t)
+#if NR_CPUS == 1
+#define PVCLOCK_VSYSCALL_NR_PAGES 1
+#else
+#define PVCLOCK_VSYSCALL_NR_PAGES ((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1
+#endif
+#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */
+
 #endif /* _ASM_X86_PVCLOCK_H */
Index: vsyscall/arch/x86/include/asm/clocksource.h
===
--- vsyscall.orig/arch/x86/include/asm/clocksource.h
+++ vsyscall/arch/x86/include/asm/clocksource.h
@@ -8,6 +8,7 @@
 #define VCLOCK_NONE 0  /* No vDSO clock available. */
 #define VCLOCK_TSC  1  /* vDSO should use vread_tsc.   */
 #define VCLOCK_HPET 2  /* vDSO should use vread_hpet.  */
+#define VCLOCK_PVCLOCK

[patch 06/15] x86: pvclock: introduce helper to read flags

2012-10-16 Thread Marcelo Tosatti

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com


Index: vsyscall/arch/x86/kernel/pvclock.c
===
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -45,6 +45,19 @@ void pvclock_resume(void)
atomic64_set(last_value, 0);
 }
 
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
+{
+   unsigned version;
+   cycle_t ret;
+   u8 flags;
+
+   do {
+   version = __pvclock_read_cycles(src, ret, flags);
+   } while ((src-version  1) || version != src-version);
+
+   return flags  valid_flags;
+}
+
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
unsigned version;
Index: vsyscall/arch/x86/include/asm/pvclock.h
===
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
 void pvclock_set_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 13/15] KVM: x86: pass host_tsc to read_l1_tsc

2012-10-16 Thread Marcelo Tosatti

Allow the caller to pass host tsc value to kvm_x86_ops-read_l1_tsc().

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/include/asm/kvm_host.h
===
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -703,7 +703,7 @@ struct kvm_x86_ops {
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
-   u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
+   u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
 
void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2);
 
Index: vsyscall/arch/x86/kvm/lapic.c
===
--- vsyscall.orig/arch/x86/kvm/lapic.c
+++ vsyscall/arch/x86/kvm/lapic.c
@@ -1011,7 +1011,7 @@ static void start_apic_timer(struct kvm_
local_irq_save(flags);
 
now = apic-lapic_timer.timer.base-get_time();
-   guest_tsc = kvm_x86_ops-read_l1_tsc(vcpu);
+   guest_tsc = kvm_x86_ops-read_l1_tsc(vcpu, native_read_tsc());
if (likely(tscdeadline  guest_tsc)) {
ns = (tscdeadline - guest_tsc) * 100ULL;
do_div(ns, this_tsc_khz);
Index: vsyscall/arch/x86/kvm/svm.c
===
--- vsyscall.orig/arch/x86/kvm/svm.c
+++ vsyscall/arch/x86/kvm/svm.c
@@ -3008,11 +3008,11 @@ static int cr8_write_interception(struct
return 0;
 }
 
-u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
+u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
return vmcb-control.tsc_offset +
-   svm_scale_tsc(vcpu, native_read_tsc());
+   svm_scale_tsc(vcpu, host_tsc);
 }
 
 static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
Index: vsyscall/arch/x86/kvm/vmx.c
===
--- vsyscall.orig/arch/x86/kvm/vmx.c
+++ vsyscall/arch/x86/kvm/vmx.c
@@ -1839,11 +1839,10 @@ static u64 guest_read_tsc(void)
  * Like guest_read_tsc, but always returns L1's notion of the timestamp
  * counter, even if a nested guest (L2) is currently running.
  */
-u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
+u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
-   u64 host_tsc, tsc_offset;
+   u64 tsc_offset;
 
-   rdtscll(host_tsc);
tsc_offset = is_guest_mode(vcpu) ?
to_vmx(vcpu)-nested.vmcs01_tsc_offset :
vmcs_read64(TSC_OFFSET);
Index: vsyscall/arch/x86/kvm/x86.c
===
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1175,7 +1175,7 @@ static int kvm_guest_time_update(struct 
 
/* Keep irq disabled to prevent changes to the clock */
local_irq_save(flags);
-   tsc_timestamp = kvm_x86_ops-read_l1_tsc(v);
+   tsc_timestamp = kvm_x86_ops-read_l1_tsc(v, native_read_tsc());
kernel_ns = get_kernel_ns();
this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
if (unlikely(this_tsc_khz == 0)) {
@@ -5429,7 +5429,8 @@ static int vcpu_enter_guest(struct kvm_v
if (hw_breakpoint_active())
hw_breakpoint_restore();
 
-   vcpu-arch.last_guest_tsc = kvm_x86_ops-read_l1_tsc(vcpu);
+   vcpu-arch.last_guest_tsc = kvm_x86_ops-read_l1_tsc(vcpu,
+  native_read_tsc());
 
vcpu-mode = OUTSIDE_GUEST_MODE;
smp_wmb();


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 15/15] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag

2012-10-16 Thread Marcelo Tosatti

KVM added a global variable to guarantee monotonicity in the guest. 
It is necessary because the time between

1. ktime_get_ts(timespec);
2. rdtscll(tsc);

Is variable. That is, given a host with stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above.

The time required to execute 2. is not the same on those two instances 
executing in different VCPUS (cache misses, interrupts...).

If the TSC value that is used by the host to interpolate when 
calculating the monotonic time is the same value used to calculate
the tsc_timestamp value stored in the pvclock data structure, then
this problem disappears. Monotonicity is then guaranteed by the
synchronicity of the host TSCs.

Set TSC stable pvclock flag in that case, allowing the guest to read
clock from userspace.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kvm/x86.c
===
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -46,6 +46,7 @@
 #include linux/uaccess.h
 #include linux/hash.h
 #include linux/pci.h
+#include linux/pvclock_gtod.h
 #include trace/events/kvm.h
 
 #define CREATE_TRACE_POINTS
@@ -1135,8 +1136,91 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
+static cycle_t read_tsc(void)
+{
+   cycle_t ret;
+   u64 last;
+
+   /*
+* Empirically, a fence (of type that depends on the CPU)
+* before rdtsc is enough to ensure that rdtsc is ordered
+* with respect to loads.  The various CPU manuals are unclear
+* as to whether rdtsc can be reordered with later loads,
+* but no one has ever seen it happen.
+*/
+   rdtsc_barrier();
+   ret = (cycle_t)vget_cycles();
+
+   last = pvclock_gtod_data.clock.cycle_last;
+
+   if (likely(ret = last))
+   return ret;
+
+   /*
+* GCC likes to generate cmov here, but this branch is extremely
+* predictable (it's just a funciton of time and the likely is
+* very likely) and there's a data dependence, so force GCC
+* to generate a branch instead.  I don't barrier() because
+* we don't actually need a barrier, and if this function
+* ever gets inlined it will generate worse code.
+*/
+   asm volatile ();
+   return last;
+}
+
+static inline u64 vgettsc(cycle_t *cycle_now)
+{
+   long v;
+   struct pvclock_gtod_data *gtod = pvclock_gtod_data;
+
+   *cycle_now = read_tsc();
+
+   v = (*cycle_now - gtod-clock.cycle_last)  gtod-clock.mask;
+   return v * gtod-clock.mult;
+}
+
+static int do_monotonic(struct timespec *ts, cycle_t *cycle_now)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+   struct pvclock_gtod_data *gtod = pvclock_gtod_data;
+
+   ts-tv_nsec = 0;
+   do {
+   seq = read_seqcount_begin(gtod-seq);
+   mode = gtod-clock.vclock_mode;
+   ts-tv_sec = gtod-monotonic_time_sec;
+   ns = gtod-monotonic_time_snsec;
+   ns += vgettsc(cycle_now);
+   ns = gtod-clock.shift;
+   } while (unlikely(read_seqcount_retry(gtod-seq, seq)));
+   timespec_add_ns(ts, ns);
+
+   return mode;
+}
+
+/* returns true if host is using tsc clocksource */
+static bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
+{
+   struct timespec ts;
+
+   /* checked again under seqlock below */
+   if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
+   return false;
+
+   if (do_monotonic(ts, cycle_now) != VCLOCK_TSC)
+   return false;
+
+   monotonic_to_bootbased(ts);
+   *kernel_ns = timespec_to_ns(ts);
+
+   return true;
+}
+
 static void kvm_write_pvtime(struct kvm_vcpu *v, struct page *page,
-unsigned int offset_in_page, gpa_t gpa)
+unsigned int offset_in_page, gpa_t gpa,
+bool host_tsc_clocksource)
 {
struct kvm_vcpu_arch *vcpu = v-arch;
void *shared_kaddr;
@@ -1155,6 +1239,10 @@ static void kvm_write_pvtime(struct kvm_
vcpu-pvclock_set_guest_stopped_request = false;
}
 
+   /* If the host uses TSC clocksource, then it is stable */
+   if (host_tsc_clocksource)
+   pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
+
vcpu-hv_clock.flags = pvclock_flags;
 
memcpy(shared_kaddr + offset_in_page, vcpu-hv_clock,
@@ -1172,11 +1260,12 @@ static int kvm_guest_time_update(struct 
unsigned long this_tsc_khz;
s64 kernel_ns, max_kernel_ns;
u64 tsc_timestamp;
+   cycle_t cycle_now;
+   u64 host_tsc;
+   bool host_tsc_clocksource;
 
/* Keep irq disabled to prevent changes to the clock */
local_irq_save(flags);
-   tsc_timestamp = kvm_x86_ops-read_l1_tsc(v, native_read_tsc());
-   kernel_ns = get_kernel_ns();

[patch 14/15] time: export time information for KVM pvclock

2012-10-16 Thread Marcelo Tosatti

As suggested by John, export time data similarly to how its
by vsyscall support. This allows KVM to retrieve necessary
information to implement vsyscall support in KVM guests.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/include/linux/pvclock_gtod.h
===
--- /dev/null
+++ vsyscall/include/linux/pvclock_gtod.h
@@ -0,0 +1,23 @@
+#ifndef _PVCLOCK_GTOD_H
+#define _PVCLOCK_GTOD_H
+
+#include linux/clocksource.h
+
+struct pvclock_gtod_data {
+   seqcount_t  seq;
+
+   struct { /* extract of a clocksource struct */
+   int vclock_mode;
+   cycle_t cycle_last;
+   cycle_t mask;
+   u32 mult;
+   u32 shift;
+   } clock;
+
+   /* open coded 'struct timespec' */
+   u64 monotonic_time_snsec;
+   time_t  monotonic_time_sec;
+};
+extern struct pvclock_gtod_data pvclock_gtod_data;
+
+#endif /* _PVCLOCK_GTOD_H */
Index: vsyscall/kernel/time/timekeeping.c
===
--- vsyscall.orig/kernel/time/timekeeping.c
+++ vsyscall/kernel/time/timekeeping.c
@@ -21,6 +21,7 @@
 #include linux/time.h
 #include linux/tick.h
 #include linux/stop_machine.h
+#include linux/pvclock_gtod.h
 
 
 static struct timekeeper timekeeper;
@@ -180,6 +181,37 @@ static inline s64 timekeeping_get_ns_raw
return nsec + arch_gettimeoffset();
 }
 
+struct pvclock_gtod_data pvclock_gtod_data;
+EXPORT_SYMBOL_GPL(pvclock_gtod_data);
+
+static void update_pvclock_gtod(struct timekeeper *tk)
+{
+   struct pvclock_gtod_data *vdata = pvclock_gtod_data;
+
+   write_seqcount_begin(vdata-seq);
+
+   /* copy vsyscall data */
+   vdata-clock.vclock_mode= tk-clock-archdata.vclock_mode;
+   vdata-clock.cycle_last = tk-clock-cycle_last;
+   vdata-clock.mask   = tk-clock-mask;
+   vdata-clock.mult   = tk-mult;
+   vdata-clock.shift  = tk-shift;
+
+   vdata-monotonic_time_sec   = tk-xtime_sec
+   + tk-wall_to_monotonic.tv_sec;
+   vdata-monotonic_time_snsec = tk-xtime_nsec
+   + (tk-wall_to_monotonic.tv_nsec
+tk-shift);
+   while (vdata-monotonic_time_snsec =
+   (((u64)NSEC_PER_SEC)  tk-shift)) {
+   vdata-monotonic_time_snsec -=
+   ((u64)NSEC_PER_SEC)  tk-shift;
+   vdata-monotonic_time_sec++;
+   }
+
+   write_seqcount_end(vdata-seq);
+}
+
 /* must hold write on timekeeper.lock */
 static void timekeeping_update(struct timekeeper *tk, bool clearntp)
 {
@@ -188,6 +220,7 @@ static void timekeeping_update(struct ti
ntp_clear();
}
update_vsyscall(tk);
+   update_pvclock_gtod(tk);
 }
 
 /**


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 03/15] x86: pvclock: remove pvclock_shadow_time

2012-10-16 Thread Marcelo Tosatti

Originally from Jeremy Fitzhardinge.

We can copy the information directly from struct pvclock_vcpu_time_info, 
remove pvclock_shadow_time.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/pvclock.c
===
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -19,21 +19,6 @@
 #include linux/percpu.h
 #include asm/pvclock.h
 
-/*
- * These are perodically updated
- *xen: magic shared_info page
- *kvm: gpa registered via msr
- * and then copied here.
- */
-struct pvclock_shadow_time {
-   u64 tsc_timestamp; /* TSC at last update of time vals.  */
-   u64 system_timestamp;  /* Time, in nanosecs, since boot.*/
-   u32 tsc_to_nsec_mul;
-   int tsc_shift;
-   u32 version;
-   u8  flags;
-};
-
 static u8 valid_flags __read_mostly = 0;
 
 void pvclock_set_flags(u8 flags)
@@ -41,32 +26,11 @@ void pvclock_set_flags(u8 flags)
valid_flags = flags;
 }
 
-static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
+static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
 {
-   u64 delta = native_read_tsc() - shadow-tsc_timestamp;
-   return pvclock_scale_delta(delta, shadow-tsc_to_nsec_mul,
-  shadow-tsc_shift);
-}
-
-/*
- * Reads a consistent set of time-base values from hypervisor,
- * into a shadow data area.
- */
-static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
-   struct pvclock_vcpu_time_info *src)
-{
-   do {
-   dst-version = src-version;
-   rmb();  /* fetch version before data */
-   dst-tsc_timestamp = src-tsc_timestamp;
-   dst-system_timestamp  = src-system_time;
-   dst-tsc_to_nsec_mul   = src-tsc_to_system_mul;
-   dst-tsc_shift = src-tsc_shift;
-   dst-flags = src-flags;
-   rmb();  /* test version after fetching data */
-   } while ((src-version  1) || (dst-version != src-version));
-
-   return dst-version;
+   u64 delta = native_read_tsc() - src-tsc_timestamp;
+   return pvclock_scale_delta(delta, src-tsc_to_system_mul,
+  src-tsc_shift);
 }
 
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
@@ -90,21 +54,20 @@ void pvclock_resume(void)
 
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
-   struct pvclock_shadow_time shadow;
unsigned version;
cycle_t ret, offset;
u64 last;
 
do {
-   version = pvclock_get_time_values(shadow, src);
+   version = src-version;
rdtsc_barrier();
-   offset = pvclock_get_nsec_offset(shadow);
-   ret = shadow.system_timestamp + offset;
+   offset = pvclock_get_nsec_offset(src);
+   ret = src-system_time + offset;
rdtsc_barrier();
-   } while (version != src-version);
+   } while ((src-version  1) || version != src-version);
 
if ((valid_flags  PVCLOCK_TSC_STABLE_BIT) 
-   (shadow.flags  PVCLOCK_TSC_STABLE_BIT))
+   (src-flags  PVCLOCK_TSC_STABLE_BIT))
return ret;
 
/*


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 05/15] x86: pvclock: fix flags usage race

2012-10-16 Thread Marcelo Tosatti

Validity of values returned by pvclock (including flags) is guaranteed by 
version 
checks.

That is, read of src-flags outside version check protection can refer
to a different paravirt clock update by the hypervisor.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/include/asm/pvclock.h
===
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -66,18 +66,21 @@ u64 pvclock_get_nsec_offset(const struct
 
 static __always_inline
 unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
-  cycle_t *cycles)
+  cycle_t *cycles, u8 *flags)
 {
unsigned version;
cycle_t ret, offset;
+   u8 ret_flags;
 
version = src-version;
rdtsc_barrier();
offset = pvclock_get_nsec_offset(src);
ret = src-system_time + offset;
+   ret_flags = src-flags;
rdtsc_barrier();
 
*cycles = ret;
+   *flags = ret_flags;
return version;
 }
 
Index: vsyscall/arch/x86/kernel/pvclock.c
===
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -50,13 +50,14 @@ cycle_t pvclock_clocksource_read(struct 
unsigned version;
cycle_t ret;
u64 last;
+   u8 flags;
 
do {
-   version = __pvclock_read_cycles(src, ret);
+   version = __pvclock_read_cycles(src, ret, flags);
} while ((src-version  1) || version != src-version);
 
if ((valid_flags  PVCLOCK_TSC_STABLE_BIT) 
-   (src-flags  PVCLOCK_TSC_STABLE_BIT))
+   (flags  PVCLOCK_TSC_STABLE_BIT))
return ret;
 
/*


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 11/15] x86: vdso: pvclock gettime support

2012-10-16 Thread Marcelo Tosatti

Improve performance of time system calls when using Linux pvclock, 
by reading time info from fixmap visible copy of pvclock data.

Originally from Jeremy Fitzhardinge.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/vdso/vclock_gettime.c
===
--- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
+++ vsyscall/arch/x86/vdso/vclock_gettime.c
@@ -22,6 +22,7 @@
 #include asm/hpet.h
 #include asm/unistd.h
 #include asm/io.h
+#include asm/pvclock.h
 
 #define gtod (VVAR(vsyscall_gtod_data))
 
@@ -62,6 +63,69 @@ static notrace cycle_t vread_hpet(void)
return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
 }
 
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+
+static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+{
+   const aligned_pvti_t *pvti_base;
+   int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
+   int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
+
+   BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx  PVCLOCK_FIXMAP_END);
+
+   pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
+
+   return pvti_base[offset].info;
+}
+
+static notrace cycle_t vread_pvclock(int *mode)
+{
+   const struct pvclock_vsyscall_time_info *pvti;
+   cycle_t ret;
+   u64 last;
+   u32 version;
+   u32 migrate_count;
+   u8 flags;
+   unsigned cpu, cpu1;
+
+
+   /*
+* When looping to get a consistent (time-info, tsc) pair, we
+* also need to deal with the possibility we can switch vcpus,
+* so make sure we always re-fetch time-info for the current vcpu.
+*/
+   do {
+   cpu = __getcpu()  0xfff;
+   pvti = get_pvti(cpu);
+
+   migrate_count = pvti-migrate_count;
+
+   version = __pvclock_read_cycles(pvti-pvti, ret, flags);
+
+   /*
+* Test we're still on the cpu as well as the version.
+* We could have been migrated just after the first
+* vgetcpu but before fetching the version, so we
+* wouldn't notice a version change.
+*/
+   cpu1 = __getcpu()  0xfff;
+   } while (unlikely(cpu != cpu1 ||
+ (pvti-pvti.version  1) ||
+ pvti-pvti.version != version ||
+ pvti-migrate_count != migrate_count));
+
+   if (unlikely(!(flags  PVCLOCK_TSC_STABLE_BIT)))
+   *mode = VCLOCK_NONE;
+
+   last = VVAR(vsyscall_gtod_data).clock.cycle_last;
+
+   if (likely(ret = last))
+   return ret;
+
+   return last;
+}
+#endif
+
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
long ret;
@@ -88,6 +152,8 @@ notrace static inline u64 vgetsns(int *m
cycles = vread_tsc();
else if (gtod-clock.vclock_mode == VCLOCK_HPET)
cycles = vread_hpet();
+   else if (gtod-clock.vclock_mode == VCLOCK_PVCLOCK)
+   cycles = vread_pvclock(mode);
else
return 0;
v = (cycles - gtod-clock.cycle_last)  gtod-clock.mask;
Index: vsyscall/arch/x86/include/asm/vsyscall.h
===
--- vsyscall.orig/arch/x86/include/asm/vsyscall.h
+++ vsyscall/arch/x86/include/asm/vsyscall.h
@@ -33,6 +33,21 @@ extern void map_vsyscall(void);
  */
 extern bool emulate_vsyscall(struct pt_regs *regs, unsigned long address);
 
+static inline unsigned int __getcpu(void)
+{
+   unsigned int p;
+
+   if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
+   /* Load per CPU data from RDTSCP */
+   native_read_tscp(p);
+   } else {
+   /* Load per CPU data from GDT */
+   asm(lsl %1,%0 : =r (p) : r (__PER_CPU_SEG));
+   }
+
+   return p;
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_VSYSCALL_H */
Index: vsyscall/arch/x86/vdso/vgetcpu.c
===
--- vsyscall.orig/arch/x86/vdso/vgetcpu.c
+++ vsyscall/arch/x86/vdso/vgetcpu.c
@@ -17,13 +17,8 @@ __vdso_getcpu(unsigned *cpu, unsigned *n
 {
unsigned int p;
 
-   if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
-   /* Load per CPU data from RDTSCP */
-   native_read_tscp(p);
-   } else {
-   /* Load per CPU data from GDT */
-   asm(lsl %1,%0 : =r (p) : r (__PER_CPU_SEG));
-   }
+   p = __getcpu();
+
if (cpu)
*cpu = p  0xfff;
if (node)


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 09/15] x86: kvm guest: pvclock vsyscall support

2012-10-16 Thread Marcelo Tosatti

Allow hypervisor to update userspace visible copy of
pvclock data.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/kvmclock.c
===
--- vsyscall.orig/arch/x86/kernel/kvmclock.c
+++ vsyscall/arch/x86/kernel/kvmclock.c
@@ -31,6 +31,9 @@ static int kvmclock = 1;
 static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
 static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
 
+/* set when the generic vsyscall pvclock elements are setup */
+bool vsyscall_clock_initializable = false;
+
 static int parse_no_kvmclock(char *arg)
 {
kvmclock = 0;
@@ -151,6 +154,28 @@ int kvm_register_clock(char *txt)
return ret;
 }
 
+static int kvm_register_vsyscall_clock(char *txt)
+{
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+   int cpu = smp_processor_id();
+   int low, high, ret;
+   struct pvclock_vcpu_time_info *info;
+
+   info = pvclock_get_vsyscall_time_info(cpu);
+
+   low = (int)__pa(info) | 1;
+   high = ((u64)__pa(per_cpu(hv_clock, cpu))  32);
+   ret = native_write_msr_safe(MSR_KVM_USERSPACE_TIME, low, high);
+   printk(KERN_INFO kvm-clock: cpu %d, msr %x:%x, %s\n,
+  cpu, high, low, txt);
+
+   return ret;
+#else
+   return 0;
+#endif
+}
+
+
 static void kvm_save_sched_clock_state(void)
 {
 }
@@ -158,6 +183,8 @@ static void kvm_save_sched_clock_state(v
 static void kvm_restore_sched_clock_state(void)
 {
kvm_register_clock(primary cpu clock, resume);
+   if (vsyscall_clock_initializable)
+   kvm_register_vsyscall_clock(primary cpu vsyscall clock, 
resume);
 }
 
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -168,6 +195,8 @@ static void __cpuinit kvm_setup_secondar
 * we shouldn't fail.
 */
WARN_ON(kvm_register_clock(secondary cpu clock));
+   if (vsyscall_clock_initializable)
+   kvm_register_vsyscall_clock(secondary cpu vsyscall clock);
 }
 #endif
 
@@ -182,6 +211,8 @@ static void __cpuinit kvm_setup_secondar
 #ifdef CONFIG_KEXEC
 static void kvm_crash_shutdown(struct pt_regs *regs)
 {
+   if (vsyscall_clock_initializable)
+   native_write_msr(MSR_KVM_USERSPACE_TIME, 0, 0);
native_write_msr(msr_kvm_system_time, 0, 0);
kvm_disable_steal_time();
native_machine_crash_shutdown(regs);
@@ -190,6 +221,8 @@ static void kvm_crash_shutdown(struct pt
 
 static void kvm_shutdown(void)
 {
+   if (vsyscall_clock_initializable)
+   native_write_msr(MSR_KVM_USERSPACE_TIME, 0, 0);
native_write_msr(msr_kvm_system_time, 0, 0);
kvm_disable_steal_time();
native_machine_shutdown();
@@ -233,3 +266,27 @@ void __init kvmclock_init(void)
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 }
+
+int kvm_setup_vsyscall_timeinfo(void)
+{
+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
+   int ret;
+   struct pvclock_vcpu_time_info *vcpu_time;
+   u8 flags;
+
+   vcpu_time = get_cpu_var(hv_clock);
+   flags = pvclock_read_flags(vcpu_time);
+   put_cpu_var(hv_clock);
+
+   if (!(flags  PVCLOCK_TSC_STABLE_BIT))
+   return 1;
+
+   if ((ret = pvclock_init_vsyscall()))
+   return ret;
+
+   kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
+   vsyscall_clock_initialized = true;
+#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */
+   return 0;
+}
+
Index: vsyscall/arch/x86/kernel/kvm.c
===
--- vsyscall.orig/arch/x86/kernel/kvm.c
+++ vsyscall/arch/x86/kernel/kvm.c
@@ -42,6 +42,7 @@
 #include asm/apic.h
 #include asm/apicdef.h
 #include asm/hypervisor.h
+#include asm/kvm_guest.h
 
 static int kvmapf = 1;
 
@@ -468,6 +469,9 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
+   if (kvm_para_has_feature(KVM_FEATURE_USERSPACE_CLOCKSOURCE))
+   kvm_setup_vsyscall_timeinfo();
+
 #ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
register_cpu_notifier(kvm_cpu_notifier);
Index: vsyscall/arch/x86/include/asm/kvm_guest.h
===
--- /dev/null
+++ vsyscall/arch/x86/include/asm/kvm_guest.h
@@ -0,0 +1,8 @@
+#ifndef _ASM_X86_KVM_GUEST_H
+#define _ASM_X86_KVM_GUEST_H
+
+extern bool vsyscall_clock_initializable;
+
+int kvm_setup_vsyscall_timeinfo(void);
+
+#endif /* _ASM_X86_KVM_GUEST_H */


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 07/15] sched: add notifier for cross-cpu migrations

2012-10-16 Thread Marcelo Tosatti

Originally from Jeremy Fitzhardinge.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/include/linux/sched.h
===
--- vsyscall.orig/include/linux/sched.h
+++ vsyscall/include/linux/sched.h
@@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void)
 extern void calc_global_load(unsigned long ticks);
 extern void update_cpu_load_nohz(void);
 
+/* Notifier for when a task gets migrated to a new CPU */
+struct task_migration_notifier {
+   struct task_struct *task;
+   int from_cpu;
+   int to_cpu;
+};
+extern void register_task_migration_notifier(struct notifier_block *n);
+
 extern unsigned long get_parent_ip(unsigned long addr);
 
 struct seq_file;
Index: vsyscall/kernel/sched/core.c
===
--- vsyscall.orig/kernel/sched/core.c
+++ vsyscall/kernel/sched/core.c
@@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s
rq-skip_clock_update = 1;
 }
 
+static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
+
+void register_task_migration_notifier(struct notifier_block *n)
+{
+   atomic_notifier_chain_register(task_migration_notifier, n);
+}
+
 #ifdef CONFIG_SMP
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
@@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p,
trace_sched_migrate_task(p, new_cpu);
 
if (task_cpu(p) != new_cpu) {
+   struct task_migration_notifier tmn;
+
p-se.nr_migrations++;
perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
+
+   tmn.task = p;
+   tmn.from_cpu = task_cpu(p);
+   tmn.to_cpu = new_cpu;
+
+   atomic_notifier_call_chain(task_migration_notifier, 0, tmn);
}
 
__set_task_cpu(p, new_cpu);


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 12/15] KVM: x86: introduce facility to support vsyscall pvclock, via MSR

2012-10-16 Thread Marcelo Tosatti

Allow a guest to register a second location for the VCPU time info

structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
This is intended to allow the guest kernel to map this information
into a usermode accessible page, so that usermode can efficiently
calculate system time from the TSC without having to make a syscall.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/include/asm/kvm_para.h
===
--- vsyscall.orig/arch/x86/include/asm/kvm_para.h
+++ vsyscall/arch/x86/include/asm/kvm_para.h
@@ -23,6 +23,7 @@
 #define KVM_FEATURE_ASYNC_PF   4
 #define KVM_FEATURE_STEAL_TIME 5
 #define KVM_FEATURE_PV_EOI 6
+#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -39,6 +40,7 @@
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
 #define MSR_KVM_PV_EOI_EN  0x4b564d04
+#define MSR_KVM_USERSPACE_TIME  0x4b564d05
 
 struct kvm_steal_time {
__u64 steal;
Index: vsyscall/Documentation/virtual/kvm/msr.txt
===
--- vsyscall.orig/Documentation/virtual/kvm/msr.txt
+++ vsyscall/Documentation/virtual/kvm/msr.txt
@@ -125,6 +125,22 @@ MSR_KVM_SYSTEM_TIME_NEW:  0x4b564d01
Availability of this MSR must be checked via bit 3 in 0x401 cpuid
leaf prior to usage.
 
+MSR_KVM_USERSPACE_TIME:  0x4b564d05
+
+Allow a guest to register a second location for the VCPU time info
+structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
+This is intended to allow the guest kernel to map this information
+into a usermode accessible page, so that usermode can efficiently
+calculate system time from the TSC without having to make a syscall.
+
+Relationship with master copy (MSR_KVM_SYSTEM_TIME_NEW):
+
+- This MSR must be enabled only when the master is enabled.
+- Disabling updates to the master automatically disables
+updates for this copy.
+
+Availability of this MSR must be checked via bit 7 in 0x401 cpuid
+leaf prior to usage.
 
 MSR_KVM_WALL_CLOCK:  0x11
 
Index: vsyscall/arch/x86/include/asm/kvm_host.h
===
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -415,10 +415,13 @@ struct kvm_vcpu_arch {
int (*complete_userspace_io)(struct kvm_vcpu *vcpu);
 
gpa_t time;
+   gpa_t uspace_time;
struct pvclock_vcpu_time_info hv_clock;
unsigned int hw_tsc_khz;
unsigned int time_offset;
+   unsigned int uspace_time_offset;
struct page *time_page;
+   struct page *uspace_time_page;
/* set guest stopped flag in pvclock flags field */
bool pvclock_set_guest_stopped_request;
 
Index: vsyscall/arch/x86/kvm/x86.c
===
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -809,13 +809,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN10
+#define KVM_SAVE_MSRS_BEGIN11
 static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
-   MSR_KVM_PV_EOI_EN,
+   MSR_KVM_PV_EOI_EN, MSR_KVM_USERSPACE_TIME,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1135,16 +1135,43 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
+static void kvm_write_pvtime(struct kvm_vcpu *v, struct page *page,
+unsigned int offset_in_page, gpa_t gpa)
+{
+   struct kvm_vcpu_arch *vcpu = v-arch;
+   void *shared_kaddr;
+   struct pvclock_vcpu_time_info *guest_hv_clock;
+   u8 pvclock_flags;
+
+   shared_kaddr = kmap_atomic(page);
+
+   guest_hv_clock = shared_kaddr + offset_in_page;
+
+   /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
+   pvclock_flags = (guest_hv_clock-flags  PVCLOCK_GUEST_STOPPED);
+
+   if (vcpu-pvclock_set_guest_stopped_request) {
+   pvclock_flags |= PVCLOCK_GUEST_STOPPED;
+   vcpu-pvclock_set_guest_stopped_request = false;
+   }
+
+   vcpu-hv_clock.flags = pvclock_flags;
+
+   memcpy(shared_kaddr + offset_in_page, vcpu-hv_clock,
+  sizeof(vcpu-hv_clock));
+
+   kunmap_atomic(shared_kaddr);
+
+   mark_page_dirty(v-kvm, gpa  PAGE_SHIFT);
+}
+
 static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
unsigned long flags;
struct kvm_vcpu_arch *vcpu = v-arch;
-

[patch 00/15] pvclock vsyscall support + KVM hypervisor support

2012-10-16 Thread Marcelo Tosatti

This patchset, based on earlier work by Jeremy Fitzhardinge, implements
paravirtual clock vsyscall support.

It should be possible to implement Xen support relatively easily.

It reduces clock_gettime from 500 cycles to 200 cycles 
on my testbox (including an mfence, that measurement).

NOTE: exporting pvclock tsc stable bit if the guest TSCs are
not synchronized is incorrect, next version will have that fixed.

Please review.



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 10/15] x86: vsyscall: pass mode to gettime backend

2012-10-16 Thread Marcelo Tosatti

Required by next patch.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/vdso/vclock_gettime.c
===
--- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
+++ vsyscall/arch/x86/vdso/vclock_gettime.c
@@ -80,7 +80,7 @@ notrace static long vdso_fallback_gtod(s
 }
 
 
-notrace static inline u64 vgetsns(void)
+notrace static inline u64 vgetsns(int *mode)
 {
long v;
cycles_t cycles;
@@ -107,7 +107,7 @@ notrace static int __always_inline do_re
mode = gtod-clock.vclock_mode;
ts-tv_sec = gtod-wall_time_sec;
ns = gtod-wall_time_snsec;
-   ns += vgetsns();
+   ns += vgetsns(mode);
ns = gtod-clock.shift;
} while (unlikely(read_seqcount_retry(gtod-seq, seq)));
 
@@ -127,7 +127,7 @@ notrace static int do_monotonic(struct t
mode = gtod-clock.vclock_mode;
ts-tv_sec = gtod-monotonic_time_sec;
ns = gtod-monotonic_time_snsec;
-   ns += vgetsns();
+   ns += vgetsns(mode);
ns = gtod-clock.shift;
} while (unlikely(read_seqcount_retry(gtod-seq, seq)));
timespec_add_ns(ts, ns);


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 02/15] x86: pvclock: make sure rdtsc doesnt speculate out of region

2012-10-16 Thread Marcelo Tosatti

Originally from Jeremy Fitzhardinge.

pvclock_get_time_values, which contains the memory barriers
will be removed by next patch.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/pvclock.c
===
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
 
do {
version = pvclock_get_time_values(shadow, src);
-   barrier();
+   rdtsc_barrier();
offset = pvclock_get_nsec_offset(shadow);
ret = shadow.system_timestamp + offset;
-   barrier();
+   rdtsc_barrier();
} while (version != src-version);
 
if ((valid_flags  PVCLOCK_TSC_STABLE_BIT) 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 04/15] x86: pvclock: create helper for pvclock data retrieval

2012-10-16 Thread Marcelo Tosatti

Originally from Jeremy Fitzhardinge.

So code can be reused.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/pvclock.c
===
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -26,13 +26,6 @@ void pvclock_set_flags(u8 flags)
valid_flags = flags;
 }
 
-static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
-{
-   u64 delta = native_read_tsc() - src-tsc_timestamp;
-   return pvclock_scale_delta(delta, src-tsc_to_system_mul,
-  src-tsc_shift);
-}
-
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
 {
u64 pv_tsc_khz = 100ULL  32;
@@ -55,15 +48,11 @@ void pvclock_resume(void)
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
unsigned version;
-   cycle_t ret, offset;
+   cycle_t ret;
u64 last;
 
do {
-   version = src-version;
-   rdtsc_barrier();
-   offset = pvclock_get_nsec_offset(src);
-   ret = src-system_time + offset;
-   rdtsc_barrier();
+   version = __pvclock_read_cycles(src, ret);
} while ((src-version  1) || version != src-version);
 
if ((valid_flags  PVCLOCK_TSC_STABLE_BIT) 
Index: vsyscall/arch/x86/include/asm/pvclock.h
===
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -56,4 +56,29 @@ static inline u64 pvclock_scale_delta(u6
return product;
 }
 
+static __always_inline
+u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
+{
+   u64 delta = __native_read_tsc() - src-tsc_timestamp;
+   return pvclock_scale_delta(delta, src-tsc_to_system_mul,
+  src-tsc_shift);
+}
+
+static __always_inline
+unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
+  cycle_t *cycles)
+{
+   unsigned version;
+   cycle_t ret, offset;
+
+   version = src-version;
+   rdtsc_barrier();
+   offset = pvclock_get_nsec_offset(src);
+   ret = src-system_time + offset;
+   rdtsc_barrier();
+
+   *cycles = ret;
+   return version;
+}
+
 #endif /* _ASM_X86_PVCLOCK_H */


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 01/15] KVM: x86: retain pvclock guest stopped bit in guest memory

2012-10-16 Thread Marcelo Tosatti

Otherwise its possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU
migration) to clear the bit.

Noticed by Paolo Bonzini.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kvm/x86.c
===
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1143,6 +1143,7 @@ static int kvm_guest_time_update(struct 
unsigned long this_tsc_khz;
s64 kernel_ns, max_kernel_ns;
u64 tsc_timestamp;
+   struct pvclock_vcpu_time_info *guest_hv_clock;
u8 pvclock_flags;
 
/* Keep irq disabled to prevent changes to the clock */
@@ -1226,13 +1227,6 @@ static int kvm_guest_time_update(struct 
vcpu-last_kernel_ns = kernel_ns;
vcpu-last_guest_tsc = tsc_timestamp;
 
-   pvclock_flags = 0;
-   if (vcpu-pvclock_set_guest_stopped_request) {
-   pvclock_flags |= PVCLOCK_GUEST_STOPPED;
-   vcpu-pvclock_set_guest_stopped_request = false;
-   }
-
-   vcpu-hv_clock.flags = pvclock_flags;
 
/*
 * The interface expects us to write an even number signaling that the
@@ -1243,6 +1237,18 @@ static int kvm_guest_time_update(struct 
 
shared_kaddr = kmap_atomic(vcpu-time_page);
 
+   guest_hv_clock = shared_kaddr + vcpu-time_offset;
+
+   /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
+   pvclock_flags = (guest_hv_clock-flags  PVCLOCK_GUEST_STOPPED);
+
+   if (vcpu-pvclock_set_guest_stopped_request) {
+   pvclock_flags |= PVCLOCK_GUEST_STOPPED;
+   vcpu-pvclock_set_guest_stopped_request = false;
+   }
+
+   vcpu-hv_clock.flags = pvclock_flags;
+
memcpy(shared_kaddr + vcpu-time_offset, vcpu-hv_clock,
   sizeof(vcpu-hv_clock));
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM ept flush

2012-10-16 Thread Rohan Sharma

Thanks for the reply.
I have one more question.
If I do munmap of the RAM allocated in qemu,
will the changes be reflected in KVM Ept.
I guess there is some mmu notifier which ensures that entries of EPT
are synced with the host entries.

On Tue, Oct 16, 2012 at 8:27 PM, Avi Kivity a...@redhat.com wrote:
 On 10/16/2012 01:57 PM, Rohan Sharma wrote:
 Is there a way to flush ept entries in qemu-kvm.

 No.


 --
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Anthony Liguori

Avi Kivity a...@redhat.com writes:

 On 10/16/2012 01:58 PM, Paul Mackerras wrote:
 On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
 On 10/16/2012 06:01 AM, Paul Mackerras wrote:
  +4.78 KVM_PPC_GET_HTAB_FD
  +
  +Capability: KVM_CAP_PPC_HTAB_FD
  +Architectures: powerpc
  +Type: vm ioctl
  +Parameters: Pointer to struct kvm_get_htab_fd (in)
  +Returns: file descriptor number (= 0) on success, -1 on error
  +
  +This returns a file descriptor that can be used either to read out the
  +entries in the guest's hashed page table (HPT), or to write entries to
  +initialize the HPT.  The returned fd can only be written to if the
  +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
  +can only be read if that bit is clear.  The argument struct looks like
  +this:
  +
  +/* For KVM_PPC_GET_HTAB_FD */
  +struct kvm_get_htab_fd {
  + __u64   flags;
  + __u64   start_index;
  +};
  +
  +/* Values for kvm_get_htab_fd.flags */
  +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1)
  +#define KVM_GET_HTAB_WRITE   ((__u64)0x2)
  +
  +The `start_index' field gives the index in the HPT of the entry at
  +which to start reading.  It is ignored when writing.
  +
  +Reads on the fd will initially supply information about all
  +interesting HPT entries.  Interesting entries are those with the
  +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
  +all entries.  When the end of the HPT is reached, the read() will
  +return.  
 
 What happens if the read buffer is smaller than the HPT size?
 
 That's fine; the read stops when it has filled the buffer and a
 subsequent read will continue from where the previous one finished.
 
 What happens if the read buffer size is not a multiple of entry size?
 
 Then we don't use the last few bytes of the buffer.  The read() call
 returns the number of bytes that were filled in, of course.  In any
 case, the header size is 8 bytes and the HPT entry size is 16 bytes,
 so the number of bytes filled in won't necessarily be a multiple of 16
 bytes.

 That's sane and expected, but it should be documented.

 
 Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
 to change).
 
 No.

 This forces userspace to dedicate a thread for the HPT.

If no changes are available, does read return a size  0?  I don't think
it's necessary to support polling.  The kernel should always be able to
respond to userspace here.  The only catch is whether to return !0 read
sizes when there are no changes.

At any case, I can't see why a dedicated thread is needed.  QEMU is
going to poll HPT based on how fast we can send data over the wire.

  If read() is called again on the fd, it will start again from
  +the beginning of the HPT, but will only return HPT entries that have
  +changed since they were last read.
  +
  +Data read or written is structured as a header (8 bytes) followed by a
  +series of valid HPT entries (16 bytes) each.  The header indicates how
  +many valid HPT entries there are and how many invalid entries follow
  +the valid entries.  The invalid entries are not represented explicitly
  +in the stream.  The header format is:
  +
  +struct kvm_get_htab_header {
  + __u32   index;
  + __u16   n_valid;
  + __u16   n_invalid;
  +};
 
 This structure forces the kernel to return entries sequentially.  Will
 this block changing the data structure in the future?  Or is the
 hardware spec sufficiently strict that such changes are not realistic?
 
 By data structure, do you mean the stream format on the file
 descriptor, or the HPT structure?  If we want a new stream format,
 then we would define a bit in the flags field of struct
 kvm_get_htab_fd to mean I want the new stream format.  The code
 fails the ioctl if any unknown flag bits are set, so a new userspace
 that wants to use the new format could then detect that it is running
 on an old kernel and fall back to the old format.
 
 The HPT entry format is very unlikely to change in size or basic
 layout (though the architects do redefine some of the bits
 occasionally).

 I meant the internal data structure that holds HPT entries.

 I guess I don't understand the index.  Do we expect changes to be in
 contiguous ranges?  And invalid entries to be contiguous as well?  That
 doesn't fit with how hash tables work.  Does the index represent the
 position of the entry within the table, or something else?


 
  +
  +Writes to the fd create HPT entries starting at the index given in the
  +header; first `n_valid' valid entries with contents from the data
  +written, then `n_invalid' invalid entries, invalidating any previously
  +valid entries found.
 
 This scheme is a clever, original, and very interesting approach to live
 migration.  That doesn't necessarily mean a NAK, we should see if it
 makes sense for other migration APIs as well (we currently have
 difficulties migrating very large/wide guests).
 
 What is the typical number of entries in the HPT?  Do you have estimates
 of the

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Paul Mackerras

On Tue, Oct 16, 2012 at 03:06:33PM +0200, Avi Kivity wrote:
 On 10/16/2012 01:58 PM, Paul Mackerras wrote:
  On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
  Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
  to change).
  
  No.
 
 This forces userspace to dedicate a thread for the HPT.

Why? Reads never block in any case.

   If read() is called again on the fd, it will start again from
   +the beginning of the HPT, but will only return HPT entries that have
   +changed since they were last read.
   +
   +Data read or written is structured as a header (8 bytes) followed by a
   +series of valid HPT entries (16 bytes) each.  The header indicates how
   +many valid HPT entries there are and how many invalid entries follow
   +the valid entries.  The invalid entries are not represented explicitly
   +in the stream.  The header format is:
   +
   +struct kvm_get_htab_header {
   +__u32   index;
   +__u16   n_valid;
   +__u16   n_invalid;
   +};
  
  This structure forces the kernel to return entries sequentially.  Will
  this block changing the data structure in the future?  Or is the
  hardware spec sufficiently strict that such changes are not realistic?
  
  By data structure, do you mean the stream format on the file
  descriptor, or the HPT structure?  If we want a new stream format,
  then we would define a bit in the flags field of struct
  kvm_get_htab_fd to mean I want the new stream format.  The code
  fails the ioctl if any unknown flag bits are set, so a new userspace
  that wants to use the new format could then detect that it is running
  on an old kernel and fall back to the old format.
  
  The HPT entry format is very unlikely to change in size or basic
  layout (though the architects do redefine some of the bits
  occasionally).
 
 I meant the internal data structure that holds HPT entries.

Oh, that's just an array, and userspace already knows how big it is.

 I guess I don't understand the index.  Do we expect changes to be in
 contiguous ranges?  And invalid entries to be contiguous as well?  That
 doesn't fit with how hash tables work.  Does the index represent the
 position of the entry within the table, or something else?

The index is just the position in the array.  Typically, in each group
of 8 it will tend to be the low-numbered ones that are valid, since
creating an entry usually uses the first empty slot.  So I expect that
on the first pass, most of the records will represent 8 HPTEs.  On
subsequent passes, probably most records will represent a single HPTE.

   +
   +Writes to the fd create HPT entries starting at the index given in the
   +header; first `n_valid' valid entries with contents from the data
   +written, then `n_invalid' invalid entries, invalidating any previously
   +valid entries found.
  
  This scheme is a clever, original, and very interesting approach to live
  migration.  That doesn't necessarily mean a NAK, we should see if it
  makes sense for other migration APIs as well (we currently have
  difficulties migrating very large/wide guests).
  
  What is the typical number of entries in the HPT?  Do you have estimates
  of the change rate?
  
  Typically the HPT would have about a million entries, i.e. it would be
  16MiB in size.  The usual guideline is to make it about 1/64 of the
  maximum amount of RAM the guest could ever have, rounded up to a power
  of two, although we often run with less, say 1/128 or even 1/256.
 
 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
 it warrant a live migration protocol?

The qemu people I talked to seemed to think so.

  Because it is a hash table, updates tend to be scattered throughout
  the whole table, which is another reason why per-page dirty tracking
  and updates would be pretty inefficient.
 
 This suggests a stream format that includes the index in every entry.

That would amount to dropping the n_valid and n_invalid fields from
the current header format.  That would be less efficient for the
initial pass (assuming we achieve an average n_valid of at least 2 on
the initial pass), and probably less efficient for the incremental
updates, since a newly-invalidated entry would have to be represented
as 16 zero bytes rather than just an 8-byte header with n_valid=0 and
n_invalid=1.  I'm assuming here that the initial pass would omit
invalid entries.

  
  As for the change rate, it depends on the application of course, but
  basically every time the guest changes a PTE in its Linux page tables
  we do the corresponding change to the corresponding HPT entry, so the
  rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
  exec, etc. have a high rate of HPT updates.
 
 If the rate is high enough, then there's no point in a live update.

True, but doesn't that argument apply to memory pages as well?

Paul.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to

[PATCH RFC V2 1/5] Alter the amount of steal time reported by the guest.

2012-10-16 Thread Michael Wolf

Modify the amount of stealtime that the kernel reports via the /proc interface.
Steal time will now be broken down into steal_time and consigned_time.
Consigned_time will represent the amount of time that is expected to be lost
due to overcommitment of the physical cpu or by using cpu capping.  The amount
consigned_time will be passed in using an ioctl.  The time will be expressed in
the number of nanoseconds to be lost in during the fixed period.  The fixed 
period
is currently 1/10th of a second.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 fs/proc/stat.c  |9 +++--
 include/linux/kernel_stat.h |1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..cb7fe80 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v)
int i, j;
unsigned long jif;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
-   u64 guest, guest_nice;
+   u64 guest, guest_nice, consign;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v)
 
user = nice = system = idle = iowait =
irq = softirq = steal = 0;
-   guest = guest_nice = 0;
+   guest = guest_nice = consign = 0;
getboottime(boottime);
jif = boottime.tv_sec;
 
+
for_each_possible_cpu(i) {
user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
@@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v)
steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
sum += kstat_cpu_irqs_sum(i);
sum += arch_irq_stat_cpu(i);
 
@@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
 
for_each_online_cpu(i) {
@@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v)
steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
seq_printf(p, cpu%d, i);
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
}
seq_printf(p, intr %llu, (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 36d12f0..c0b0095 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -27,6 +27,7 @@ enum cpu_usage_stat {
CPUTIME_STEAL,
CPUTIME_GUEST,
CPUTIME_GUEST_NICE,
+   CPUTIME_CONSIGN,
NR_STATS,
 };
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC V2 3/5] Add the code to send the consigned time from the host to the guest

2012-10-16 Thread Michael Wolf

Add the code to send the consigned time from the host to the guest

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/include/asm/kvm_para.h |3 ++-
 arch/x86/include/asm/paravirt.h |4 ++--
 arch/x86/kernel/kvm.c   |3 ++-
 arch/x86/kvm/x86.c  |2 ++
 include/linux/kernel_stat.h |1 +
 kernel/sched/cputime.c  |   21 +++--
 kernel/sched/sched.h|2 ++
 8 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1eaa6b0..bd4e412 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -409,6 +409,7 @@ struct kvm_vcpu_arch {
u64 msr_val;
u64 last_steal;
u64 accum_steal;
+   u64 accum_consigned;
struct gfn_to_hva_cache stime;
struct kvm_steal_time steal;
} st;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 2f7712e..debe72e 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -42,9 +42,10 @@
 
 struct kvm_steal_time {
__u64 steal;
+   __u64 consigned;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u32 pad[10];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a5f9f30..d39e8d0 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
+   PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 91b3b2a..4e5582a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -368,7 +368,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu, u64 *steal)
+static u64 kvm_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
struct kvm_steal_time *src;
int version;
@@ -378,6 +378,7 @@ static u64 kvm_steal_clock(int cpu, u64 *steal)
version = src-version;
rmb();
*steal = src-steal;
+   *consigned = src-consigned;
rmb();
} while ((version  1) || (version != src-version));
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1f09552..801cfa8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1554,8 +1554,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
return;
 
vcpu-arch.st.steal.steal += vcpu-arch.st.accum_steal;
+   vcpu-arch.st.steal.consigned += vcpu-arch.st.accum_consigned;
vcpu-arch.st.steal.version += 2;
vcpu-arch.st.accum_steal = 0;
+   vcpu-arch.st.accum_consigned = 0;
 
kvm_write_guest_cached(vcpu-kvm, vcpu-arch.st.stime,
vcpu-arch.st.steal, sizeof(struct kvm_steal_time));
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index c0b0095..253fdce 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -125,6 +125,7 @@ extern unsigned long long task_delta_exec(struct 
task_struct *);
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, 
cputime_t);
 extern void account_steal_time(cputime_t);
+extern void account_consigned_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
 extern void account_process_tick(struct task_struct *, int user);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index dd3fd46..bf2025a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int 
hardirq_offset,
 }
 
 /*
+ * This accounts for the time that is split out of steal time.
+ * Consigned time represents the amount of time that the cpu was
+ * expected to be somewhere else.
+ */
+void account_consigned_time(cputime_t cputime)
+{
+   u64 *cpustat =  kcpustat_this_cpu-cpustat;
+
+   cpustat[CPUTIME_CONSIGN] += (__force u64) cputime;
+}
+
+/*
  * Account for involuntary wait time.
  * @cputime: the cpu time spent in involuntary wait
  */
@@ -274,15 +286,20 @@ static __always_inline bool 
steal_account_process_tick(void)
 #ifdef CONFIG_PARAVIRT
if (static_key_false(paravirt_steal_enabled)) {
u64 steal, st = 0;
+   u64 consigned, cs = 0;
 
-

[PATCH RFC V2 5/5] Add an ioctl to communicate the consign limit to the host.

2012-10-16 Thread Michael Wolf

Add an ioctl to communicate the consign limit to the host.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c   |6 ++
 include/linux/kvm.h  |2 ++
 include/linux/kvm_host.h |2 ++
 virt/kvm/kvm_main.c  |7 +++
 4 files changed, 17 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 469e748..5a4b8db 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5928,6 +5928,12 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
return 0;
 }
 
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, long 
entitlement)
+{
+   vcpu-arch.consigned_limit = entitlement;
+   return 0;
+}
+
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
struct i387_fxsave_struct *fxsave =
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 2ce09aa..6b211fb 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -904,6 +904,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_SET_ONE_REG  _IOW(KVMIO,  0xac, struct kvm_one_reg)
 /* VM is being stopped by host */
 #define KVM_KVMCLOCK_CTRL_IO(KVMIO,   0xad)
+/* Set the consignment limit which will be used to separete steal time */
+#define KVM_SET_ENTITLEMENT  _IOW(KVMIO, 0xae, unsigned long)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1  0)
 #define KVM_DEV_ASSIGN_PCI_2_3 (1  1)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8a59e0a..9a0a791 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -538,6 +538,8 @@ void kvm_arch_hardware_unsetup(void);
 void kvm_arch_check_processor_compat(void *rtn);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu,
+   long entitlement);
 
 void kvm_free_physmem(struct kvm *kvm);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d617f69..ccf0aec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1936,6 +1936,13 @@ out_free2:
r = 0;
break;
}
+   case KVM_SET_ENTITLEMENT: {
+   r = kvm_arch_vcpu_ioctl_set_entitlement(vcpu, arg);
+   if (r)
+   goto out;
+   r = 0;
+   break;
+   }
default:
r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
}

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Added call parameter to track whether invocation originated with guest or elsewhere

2012-10-16 Thread Will Auld

Signed-off-by: Will Auld will.a...@intel.com
---

Resending to full list

Marcelo,

This patch is what I believe you ask for as foundational for later
patches to address IA32_TSC_ADJUST. 

Thanks,

Will

 arch/x86/include/asm/kvm_host.h |  8 
 arch/x86/kvm/svm.c  | 18 ++
 arch/x86/kvm/vmx.c  | 18 ++
 arch/x86/kvm/x86.c  | 18 ++
 arch/x86/kvm/x86.h  |  2 +-
 5 files changed, 35 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09155d6..c06f0d1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -621,7 +621,7 @@ struct kvm_x86_ops {
void (*set_guest_debug)(struct kvm_vcpu *vcpu,
struct kvm_guest_debug *dbg);
int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
-   int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+   int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data, bool 
guest_initiated);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
void (*get_segment)(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
@@ -684,7 +684,7 @@ struct kvm_x86_ops {
bool (*has_wbinvd_exit)(void);
 
void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool 
scale);
-   void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
+   void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset, bool 
guest_initiated);
 
u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
@@ -772,7 +772,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu,
 
 void kvm_enable_efer_bits(u64);
 int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data, bool 
guest_initiated);
 
 struct x86_emulate_ctxt;
 
@@ -799,7 +799,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, 
int *l);
 int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr);
 
 int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
-int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
+int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool 
guest_initiated);
 
 unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);
 void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index baead95..424be27 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1012,7 +1012,8 @@ static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 
user_tsc_khz, bool scale)
svm-tsc_ratio = ratio;
 }
 
-static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset, 
+   bool guest_initiated)
 {
struct vcpu_svm *svm = to_svm(vcpu);
u64 g_tsc_offset = 0;
@@ -1255,7 +1256,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, 
unsigned int id)
svm-vmcb_pa = page_to_pfn(page)  PAGE_SHIFT;
svm-asid_generation = 0;
init_vmcb(svm);
-   kvm_write_tsc(svm-vcpu, 0);
+   kvm_write_tsc(svm-vcpu, 0, false /*Not Guest Initiated*/);
 
err = fx_init(svm-vcpu);
if (err)
@@ -3147,13 +3148,14 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 
data)
return 0;
 }
 
-static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
+static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data, 
+   bool guest_initiated)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
switch (ecx) {
case MSR_IA32_TSC:
-   kvm_write_tsc(vcpu, data);
+   kvm_write_tsc(vcpu, data, guest_initiated);
break;
case MSR_STAR:
svm-vmcb-save.star = data;
@@ -3208,12 +3210,12 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned 
ecx, u64 data)
vcpu_unimpl(vcpu, unimplemented wrmsr: 0x%x data 0x%llx\n, 
ecx, data);
break;
default:
-   return kvm_set_msr_common(vcpu, ecx, data);
+   return kvm_set_msr_common(vcpu, ecx, data, guest_initiated);
}
return 0;
 }
 
-static int wrmsr_interception(struct vcpu_svm *svm)
+static int wrmsr_interception(struct vcpu_svm *svm, bool guest_initiated)
 {
u32 ecx = svm-vcpu.arch.regs[VCPU_REGS_RCX];
u64 data = (svm-vcpu.arch.regs[VCPU_REGS_RAX]  -1u)
@@ -3221,7 +3223,7 @@ static int wrmsr_interception(struct vcpu_svm *svm)
 

svm-next_rip = kvm_rip_read(svm-vcpu) + 2;
-   if (svm_set_msr(svm-vcpu, ecx, data)) {
+   if (svm_set_msr(svm-vcpu, ecx, data, guest_initiated)) {

[PATCH RFC V2 4/5] Add a timer to allow the separation of consigned from steal time.

2012-10-16 Thread Michael Wolf

Add a timer to the host.  This will define the period.  During a period
the first n ticks will go into the consigned bucket.  Any other ticks that
occur within the period will be placed in the stealtime bucket.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |   10 +
 arch/x86/include/asm/paravirt.h |2 +-
 arch/x86/kvm/x86.c  |   42 ++-
 3 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bd4e412..d700850 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -41,6 +41,8 @@
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
 
+#define KVM_STEAL_TIMER_DELAY 1UL
+
 #define CR0_RESERVED_BITS   \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
  | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
@@ -339,6 +341,14 @@ struct kvm_vcpu_arch {
bool tpr_access_reporting;
 
/*
+* timer used to determine if the time should be counted as
+* steal time or consigned time.
+*/
+   struct hrtimer steal_timer;
+   u64 current_consigned;
+   u64 consigned_limit;
+
+   /*
 * Paging state of the vcpu
 *
 * If the vcpu runs in guest mode with two level paging this still saves
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d39e8d0..6db79f9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,7 +196,7 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
+static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 801cfa8..469e748 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1535,13 +1535,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 {
u64 delta;
+   u64 steal_delta;
+   u64 consigned_delta;
 
if (!(vcpu-arch.st.msr_val  KVM_MSR_ENABLED))
return;
 
delta = current-sched_info.run_delay - vcpu-arch.st.last_steal;
vcpu-arch.st.last_steal = current-sched_info.run_delay;
-   vcpu-arch.st.accum_steal = delta;
+
+   /* split the delta into steal and consigned */
+   if (vcpu-arch.current_consigned  vcpu-arch.consigned_limit) {
+   vcpu-arch.current_consigned += delta;
+   if (vcpu-arch.current_consigned  vcpu-arch.consigned_limit) {
+   steal_delta = vcpu-arch.current_consigned
+   -  vcpu-arch.consigned_limit;
+   consigned_delta = delta - steal_delta;
+   } else {
+   consigned_delta = delta;
+   steal_delta = 0;
+   }
+   } else {
+   consigned_delta = 0;
+   steal_delta = delta;
+   }
+   vcpu-arch.st.accum_steal = steal_delta;
+   vcpu-arch.st.accum_consigned = consigned_delta;
 }
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
@@ -6187,11 +6206,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu)
return irqchip_in_kernel(vcpu-kvm) == (vcpu-arch.apic != NULL);
 }
 
+enum hrtimer_restart steal_timer_fn(struct hrtimer *data)
+{
+   struct kvm_vcpu *vcpu;
+   ktime_t now;
+
+   vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer);
+   vcpu-arch.current_consigned = 0;
+   now = ktime_get();
+   hrtimer_forward(vcpu-arch.steal_timer, now,
+   ktime_set(0, KVM_STEAL_TIMER_DELAY));
+   return HRTIMER_RESTART;
+}
+
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 {
struct page *page;
struct kvm *kvm;
int r;
+   ktime_t ktime;
 
BUG_ON(vcpu-kvm == NULL);
kvm = vcpu-kvm;
@@ -6234,6 +6267,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 
kvm_async_pf_hash_reset(vcpu);
kvm_pmu_init(vcpu);
+   /* Initialize and start a timer to capture steal and consigned time */
+   hrtimer_init(vcpu-arch.steal_timer, CLOCK_MONOTONIC,
+   HRTIMER_MODE_REL);
+   vcpu-arch.steal_timer.function = steal_timer_fn;
+   ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY);
+   hrtimer_start(vcpu-arch.steal_timer, ktime, HRTIMER_MODE_REL);
 
return 0;
 fail_free_mce_banks:
@@ -6252,6 +6291,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
int idx;
 
+   hrtimer_cancel(vcpu-arch.steal_timer);
kvm_pmu_destroy(vcpu);

[PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.

2012-10-16 Thread Michael Wolf

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  The consignment limit passed to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.

TODO:
* Change native_clock to take params and not return a value
* Change update_rq_clock_task

Changes from V1:
* Removed the steal time allowed percentage from the guest
* Moved the separation of consigned (expected steal) and steal time to the
  host.
* No longer include a sysctl interface.

---

Michael Wolf (5):
  Alter the amount of steal time reported by the guest.
  Expand the steal time msr to also contain the consigned time.
  Add the code to send the consigned time from the host to the guest
  Add a timer to allow the separation of consigned from steal time.
  Add an ioctl to communicate the consign limit to the host.


 arch/x86/include/asm/kvm_host.h |   11 +
 arch/x86/include/asm/kvm_para.h |3 ++
 arch/x86/include/asm/paravirt.h |4 ++-
 arch/x86/kernel/kvm.c   |8 ++
 arch/x86/kvm/x86.c  |   50 ++-
 fs/proc/stat.c  |9 +--
 include/linux/kernel_stat.h |2 ++
 include/linux/kvm.h |2 ++
 include/linux/kvm_host.h|2 ++
 kernel/sched/cputime.c  |   21 +++-
 kernel/sched/sched.h|2 ++
 virt/kvm/kvm_main.c |7 +
 12 files changed, 108 insertions(+), 13 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC V2 2/5] Expand the steal time msr to also contain the consigned time.

2012-10-16 Thread Michael Wolf

Add a consigned field.  This field will hold the time lost due to capping or 
overcommit.
The rest of the time will still show up in the steal-time field.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/paravirt.h |4 ++--
 arch/x86/kernel/kvm.c   |7 ++-
 kernel/sched/cputime.c  |2 +-
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a0facf3..a5f9f30 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
 {
-   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c1d61ee..91b3b2a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -368,9 +368,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu)
+static u64 kvm_steal_clock(int cpu, u64 *steal)
 {
-   u64 steal;
struct kvm_steal_time *src;
int version;
 
@@ -378,11 +377,9 @@ static u64 kvm_steal_clock(int cpu)
do {
version = src-version;
rmb();
-   steal = src-steal;
+   *steal = src-steal;
rmb();
} while ((version  1) || (version != src-version));
-
-   return steal;
 }
 
 void kvm_disable_steal_time(void)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 81b763b..dd3fd46 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
if (static_key_false(paravirt_steal_enabled)) {
u64 steal, st = 0;
 
-   steal = paravirt_steal_clock(smp_processor_id());
+   paravirt_steal_clock(smp_processor_id(), steal);
steal -= this_rq()-prev_steal_time;
 
st = steal_ticks(steal);

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: KVM_MAX_VCPUS

2012-10-16 Thread Wei, Bing (WeiBing, MCXS-SH)

For pCPU/core and VCPUS/logical cpu mapping, It should be 8 multiple. 254 is 
reasonable. Or something I miss?

-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of 
Vinod, Chegu
Sent: Sunday, October 14, 2012 9:43 PM
To: Gleb Natapov
Cc: Sasha Levin; KVM
Subject: Re: KVM_MAX_VCPUS

On 10/14/2012 2:08 AM, Gleb Natapov wrote:
 On Sat, Oct 13, 2012 at 10:32:13PM -0400, Sasha Levin wrote:
 On 10/13/2012 06:29 PM, Chegu Vinod wrote:
 Hello,

 Wanted to get a clarification about KVM_MAX_VCPUS(currently set to 254)
   in kvm_host.h file. The kvm_vcpu *vcpus array is sized based on 
 KVM_MAX_VCPUS.
 (i.e. a max of 254 elements in the array).

 An 8bit APIC id should allow for 256 ID's. Reserving one for Broadcast 
 should
 leave 255 ID's.  Is there one more ID reserved for some other purpose ? 
 (hence
 leading to KVM_MAX_VCPUS being set to 254 and not 255).
 Another ID goes to the IO-APIC.

 This is not really needed on KVM. We can enlarge KVM_MAX_VCPUS to 255.

Thanks for clarification!  ( We did suspect the IO-APIC...but weren't 
quite sure).

Vinod

 --
   Gleb.
 .

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] KVM: VMX: report internal error for the unhandleable event

2012-10-16 Thread Xiao Guangrong

VM exits during Event Delivery is really unexpected if it is not caused
by Exceptions/EPT-VIOLATION/TASK_SWITCH, we'd better to report an internal
and freeze the guest, the VMM has the chance to check the guest

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/vmx.c  |   19 +++
 include/linux/kvm.h |8 ++--
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ad6b1dd..b8a0841 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5979,13 +5979,24 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
return 0;
}

+   /*
+* Note:
+* Do not try to fix EXIT_REASON_EPT_MISCONFIG if it caused by
+* delivery event since it indicates guest is accessing MMIO.
+* The vm-exit can be triggered again after return to guest that
+* will cause infinite loop.
+*/
if ((vectoring_info  VECTORING_INFO_VALID_MASK) 
(exit_reason != EXIT_REASON_EXCEPTION_NMI 
exit_reason != EXIT_REASON_EPT_VIOLATION 
-   exit_reason != EXIT_REASON_TASK_SWITCH))
-   printk(KERN_WARNING %s: unexpected, valid vectoring info 
-  (0x%x) and exit reason is 0x%x\n,
-  __func__, vectoring_info, exit_reason);
+   exit_reason != EXIT_REASON_TASK_SWITCH)) {
+   vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR;
+   vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_DELIVERY_EV;
+   vcpu-run-internal.ndata = 2;
+   vcpu-run-internal.data[0] = vectoring_info;
+   vcpu-run-internal.data[1] = exit_reason;
+   return 0;
+   }

if (unlikely(!cpu_has_virtual_nmis()  vmx-soft_vnmi_blocked 
!(is_guest_mode(vcpu)  nested_cpu_has_virtual_nmis(
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 65ad5c6..494a84c 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -170,8 +170,12 @@ struct kvm_pit_config {
 #define KVM_EXIT_WATCHDOG 21

 /* For KVM_EXIT_INTERNAL_ERROR */
-#define KVM_INTERNAL_ERROR_EMULATION 1
-#define KVM_INTERNAL_ERROR_SIMUL_EX 2
+/* Emulate instruction failed. */
+#define KVM_INTERNAL_ERROR_EMULATION   1
+/* Encounter unexpected simultaneous exceptions. */
+#define KVM_INTERNAL_ERROR_SIMUL_EX2
+/* Encounter unexpected vm-exit due to delivery event. */
+#define KVM_INTERNAL_ERROR_DELIVERY_EV 3

 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] KVM: VMX: report internal error for MMIO #PF due to delivery event

2012-10-16 Thread Xiao Guangrong

The #PF with PFEC.RSV = 1 indicates that the guest is accessing MMIO, we
can not fix it if it is caused by delivery event. Reporting internal error
for this case

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/vmx.c |   26 --
 1 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b8a0841..6599e45 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4287,16 +4287,6 @@ static int handle_exception(struct kvm_vcpu *vcpu)
if (is_machine_check(intr_info))
return handle_machine_check(vcpu);

-   if ((vect_info  VECTORING_INFO_VALID_MASK) 
-   !is_page_fault(intr_info)) {
-   vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR;
-   vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_SIMUL_EX;
-   vcpu-run-internal.ndata = 2;
-   vcpu-run-internal.data[0] = vect_info;
-   vcpu-run-internal.data[1] = intr_info;
-   return 0;
-   }
-
if ((intr_info  INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR)
return 1;  /* already handled by vmx_vcpu_run() */

@@ -4315,6 +4305,22 @@ static int handle_exception(struct kvm_vcpu *vcpu)
error_code = 0;
if (intr_info  INTR_INFO_DELIVER_CODE_MASK)
error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+
+   /*
+* The #PF with PFEC.RSVD = 1 indicates the guest is accessing
+* MMIO, it is better to report an internal error.
+* See the comments in vmx_handle_exit.
+*/
+   if ((vect_info  VECTORING_INFO_VALID_MASK) 
+   !(is_page_fault(intr_info)  !(error_code  PFERR_RSVD_MASK))) {
+   vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR;
+   vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_SIMUL_EX;
+   vcpu-run-internal.ndata = 2;
+   vcpu-run-internal.data[0] = vect_info;
+   vcpu-run-internal.data[1] = intr_info;
+   return 0;
+   }
+
if (is_page_fault(intr_info)) {
/* EPT won't cause page fault directly */
BUG_ON(enable_ept);
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm/powerpc: Handle errors in secondary thread grabbing

2012-10-16 Thread Michael Ellerman

On Tue, 2012-10-16 at 14:13 +1100, Paul Mackerras wrote:
 Michael,
 
 On Tue, Oct 16, 2012 at 11:15:50AM +1100, Michael Ellerman wrote:
  In the Book3s HV code, kvmppc_run_core() has logic to grab the secondary
  threads of the physical core.
  
  If for some reason a thread is stuck, kvmppc_grab_hwthread() can fail,
  but currently we ignore the failure and continue into the guest. If the
  stuck thread is in the kernel badness ensues.
  
  Instead we should check for failure and bail out.
  
  I've moved the grabbing prior to the startup of runnable threads, to 
  simplify
  the error case. AFAICS this is harmless, but I could be missing something
  subtle.
 
 Thanks for looking at this - but in fact this is fixed by my patch
 entitled KVM: PPC: Book3S HV: Fix some races in starting secondary
 threads submitted back on August 28.

OK thanks. It seems that patch didn't make 3.7 ?

I don't see it in kvm-ppc-next either.

cheers

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/8] Various Book3s HV fixes that haven't been picked up yet

2012-10-16 Thread Alexander Graf



On 16.10.2012, at 05:08, Paul Mackerras pau...@samba.org wrote:

 On Mon, Oct 15, 2012 at 02:00:54PM +0200, Alexander Graf wrote:
 
 Sorry, I can't accept patches that haven't shown up on kvm@vger. Please send 
 this patch set again with CC to kvm@vger.
 
 Done; I didn't cc kvm-ppc this time since the patches haven't changed.

Thanks :)

 
 By the way, what is the purpose of kvm-ppc@vger.kernel.org?

It's basically a tag so people can filter based on it. For me, mails to 
kvm-ppc@vger land straight in my inbox, while mails to kvm@vger go to a folder 
I only read occasionally.

Alex

 
 Paul.
 --
 To unsubscribe from this list: send the line unsubscribe kvm-ppc in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm

2012-10-16 Thread Avi Kivity

On 10/16/2012 05:59 AM, Paul Mackerras wrote:
 The mmu_notifier_retry() function, used to test whether any page
 invalidations are in progress, currently takes a vcpu pointer, though
 the code only needs the VM's struct kvm pointer.  Forthcoming patches
 to the powerpc Book3S HV code will need to test for retry within a VM
 ioctl, where a struct kvm pointer is available but a struct vcpu
 pointer isn't.  Therefore this creates a variant of mmu_notifier_retry
 called kvm_mmu_notifier_retry that takes a struct kvm pointer, and
 implements mmu_notifier_retry in terms of it.

Why not change mmu_notifier_retry() and all its callers?


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm

2012-10-16 Thread Alexander Graf


On 16.10.2012, at 11:44, Avi Kivity wrote:

 On 10/16/2012 05:59 AM, Paul Mackerras wrote:
 The mmu_notifier_retry() function, used to test whether any page
 invalidations are in progress, currently takes a vcpu pointer, though
 the code only needs the VM's struct kvm pointer.  Forthcoming patches
 to the powerpc Book3S HV code will need to test for retry within a VM
 ioctl, where a struct kvm pointer is available but a struct vcpu
 pointer isn't.  Therefore this creates a variant of mmu_notifier_retry
 called kvm_mmu_notifier_retry that takes a struct kvm pointer, and
 implements mmu_notifier_retry in terms of it.
 
 Why not change mmu_notifier_retry() and all its callers?

Why not use Christoffer's patch? :)

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Avi Kivity

On 10/16/2012 06:01 AM, Paul Mackerras wrote:
 A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor.  Reads on
 this fd return the contents of the HPT (hashed page table), writes
 create and/or remove entries in the HPT.  There is a new capability,
 KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl.  The ioctl
 takes an argument structure with the index of the first HPT entry to
 read out and a set of flags.  The flags indicate whether the user is
 intending to read or write the HPT, and whether to return all entries
 or only the bolted entries (those with the bolted bit, 0x10, set in
 the first doubleword).
 
 This is intended for use in implementing qemu's savevm/loadvm and for
 live migration.  Therefore, on reads, the first pass returns information
 about all HPTEs (or all bolted HPTEs).  When the first pass reaches the
 end of the HPT, it returns from the read.  Subsequent reads only return
 information about HPTEs that have changed since they were last read.
 A read that finds no changed HPTEs in the HPT following where the last
 read finished will return 0 bytes.

Copying people with interest in migration.

 +4.78 KVM_PPC_GET_HTAB_FD
 +
 +Capability: KVM_CAP_PPC_HTAB_FD
 +Architectures: powerpc
 +Type: vm ioctl
 +Parameters: Pointer to struct kvm_get_htab_fd (in)
 +Returns: file descriptor number (= 0) on success, -1 on error
 +
 +This returns a file descriptor that can be used either to read out the
 +entries in the guest's hashed page table (HPT), or to write entries to
 +initialize the HPT.  The returned fd can only be written to if the
 +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
 +can only be read if that bit is clear.  The argument struct looks like
 +this:
 +
 +/* For KVM_PPC_GET_HTAB_FD */
 +struct kvm_get_htab_fd {
 + __u64   flags;
 + __u64   start_index;
 +};
 +
 +/* Values for kvm_get_htab_fd.flags */
 +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1)
 +#define KVM_GET_HTAB_WRITE   ((__u64)0x2)
 +
 +The `start_index' field gives the index in the HPT of the entry at
 +which to start reading.  It is ignored when writing.
 +
 +Reads on the fd will initially supply information about all
 +interesting HPT entries.  Interesting entries are those with the
 +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
 +all entries.  When the end of the HPT is reached, the read() will
 +return.  

What happens if the read buffer is smaller than the HPT size?

What happens if the read buffer size is not a multiple of entry size?

Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
to change).

 If read() is called again on the fd, it will start again from
 +the beginning of the HPT, but will only return HPT entries that have
 +changed since they were last read.
 +
 +Data read or written is structured as a header (8 bytes) followed by a
 +series of valid HPT entries (16 bytes) each.  The header indicates how
 +many valid HPT entries there are and how many invalid entries follow
 +the valid entries.  The invalid entries are not represented explicitly
 +in the stream.  The header format is:
 +
 +struct kvm_get_htab_header {
 + __u32   index;
 + __u16   n_valid;
 + __u16   n_invalid;
 +};

This structure forces the kernel to return entries sequentially.  Will
this block changing the data structure in the future?  Or is the
hardware spec sufficiently strict that such changes are not realistic?

 +
 +Writes to the fd create HPT entries starting at the index given in the
 +header; first `n_valid' valid entries with contents from the data
 +written, then `n_invalid' invalid entries, invalidating any previously
 +valid entries found.

This scheme is a clever, original, and very interesting approach to live
migration.  That doesn't necessarily mean a NAK, we should see if it
makes sense for other migration APIs as well (we currently have
difficulties migrating very large/wide guests).

What is the typical number of entries in the HPT?  Do you have estimates
of the change rate?

Suppose new hardware arrives that supports nesting HPTs, so that kvm is
no longer synchronously aware of the guest HPT (similar to how NPT/EPT
made kvm unaware of guest virtual-physical translations on x86).  How
will we deal with that?  But I guess this will be a
non-guest-transparent and non-userspace-transparent change, unlike
NPT/EPT, so a userspace ABI addition will be needed anyway).


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Avi Kivity

On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an in-kernel
 irqchip which we don't have yet.

It's not strictly required.  You have an interrupt line leading to the
core, no?  You could have your irqfd trigger that.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Avi Kivity

On 10/16/2012 12:59 PM, Alexander Graf wrote:
 
 On 16.10.2012, at 12:56, Avi Kivity wrote:
 
 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an in-kernel
 irqchip which we don't have yet.
 
 It's not strictly required.  You have an interrupt line leading to the
 core, no?  You could have your irqfd trigger that.
 
 The irqfd code in KVM directly triggers the in-kernel irqchip. That's what 
 this patch set is cleaning up: enable you to build ioeventfd support without 
 in-kernel irqchip irqfd support.
 
 Vhost can work without in-kernel irqchip too, by listening to an eventfd 
 (iiuc) in user space. That path works just fine with these patches applied.

That's all true but it wasn't my point.  I was asking whether you can
enable irqfd without in-kernel irqchip.  If your irq input is edge
triggered then it's trivial.  If it's level triggered then you can use
the new resampling irqfd.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] KVM: Distangle eventfd code from irqchip

2012-10-16 Thread Alexander Graf


On 16.10.2012, at 12:57, Avi Kivity wrote:

 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 The current eventfd code assumes that when we have eventfd, we also have
 irqfd for in-kernel interrupt delivery. This is not necessarily true. On
 PPC we don't have an in-kernel irqchip yet, but we can still support easily
 support eventfd.
 
 
 Don't you need, in addition, to reject PIO space for ioeventfd?

Yeah, we could tell user space that it's doing something stupid. Not sure how 
critical it is - right now it would just register it and never get triggered, 
because the bus doesn't exist. But sanity checks are a good thing usually.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Alexander Graf


On 16.10.2012, at 13:01, Avi Kivity wrote:

 On 10/16/2012 12:59 PM, Alexander Graf wrote:
 
 On 16.10.2012, at 12:56, Avi Kivity wrote:
 
 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an in-kernel
 irqchip which we don't have yet.
 
 It's not strictly required.  You have an interrupt line leading to the
 core, no?  You could have your irqfd trigger that.
 
 The irqfd code in KVM directly triggers the in-kernel irqchip. That's what 
 this patch set is cleaning up: enable you to build ioeventfd support without 
 in-kernel irqchip irqfd support.
 
 Vhost can work without in-kernel irqchip too, by listening to an eventfd 
 (iiuc) in user space. That path works just fine with these patches applied.
 
 That's all true but it wasn't my point.  I was asking whether you can
 enable irqfd without in-kernel irqchip.  If your irq input is edge
 triggered then it's trivial.  If it's level triggered then you can use
 the new resampling irqfd.

I'm not sure I fully grasp what you're trying to say :).

We have a single interrupt line on the core. So whenever any external interrupt 
gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it 
which line is active.

So yes, we could create a direct fd channel between vhost and the user space 
MPIC, but it wouldn't buy us anything. The interrupt injection path would be as 
long as it is with the current mechanism.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Paul Mackerras

On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
 On 10/16/2012 06:01 AM, Paul Mackerras wrote:
  +4.78 KVM_PPC_GET_HTAB_FD
  +
  +Capability: KVM_CAP_PPC_HTAB_FD
  +Architectures: powerpc
  +Type: vm ioctl
  +Parameters: Pointer to struct kvm_get_htab_fd (in)
  +Returns: file descriptor number (= 0) on success, -1 on error
  +
  +This returns a file descriptor that can be used either to read out the
  +entries in the guest's hashed page table (HPT), or to write entries to
  +initialize the HPT.  The returned fd can only be written to if the
  +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
  +can only be read if that bit is clear.  The argument struct looks like
  +this:
  +
  +/* For KVM_PPC_GET_HTAB_FD */
  +struct kvm_get_htab_fd {
  +   __u64   flags;
  +   __u64   start_index;
  +};
  +
  +/* Values for kvm_get_htab_fd.flags */
  +#define KVM_GET_HTAB_BOLTED_ONLY   ((__u64)0x1)
  +#define KVM_GET_HTAB_WRITE ((__u64)0x2)
  +
  +The `start_index' field gives the index in the HPT of the entry at
  +which to start reading.  It is ignored when writing.
  +
  +Reads on the fd will initially supply information about all
  +interesting HPT entries.  Interesting entries are those with the
  +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
  +all entries.  When the end of the HPT is reached, the read() will
  +return.  
 
 What happens if the read buffer is smaller than the HPT size?

That's fine; the read stops when it has filled the buffer and a
subsequent read will continue from where the previous one finished.

 What happens if the read buffer size is not a multiple of entry size?

Then we don't use the last few bytes of the buffer.  The read() call
returns the number of bytes that were filled in, of course.  In any
case, the header size is 8 bytes and the HPT entry size is 16 bytes,
so the number of bytes filled in won't necessarily be a multiple of 16
bytes.

 Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
 to change).

No.

  If read() is called again on the fd, it will start again from
  +the beginning of the HPT, but will only return HPT entries that have
  +changed since they were last read.
  +
  +Data read or written is structured as a header (8 bytes) followed by a
  +series of valid HPT entries (16 bytes) each.  The header indicates how
  +many valid HPT entries there are and how many invalid entries follow
  +the valid entries.  The invalid entries are not represented explicitly
  +in the stream.  The header format is:
  +
  +struct kvm_get_htab_header {
  +   __u32   index;
  +   __u16   n_valid;
  +   __u16   n_invalid;
  +};
 
 This structure forces the kernel to return entries sequentially.  Will
 this block changing the data structure in the future?  Or is the
 hardware spec sufficiently strict that such changes are not realistic?

By data structure, do you mean the stream format on the file
descriptor, or the HPT structure?  If we want a new stream format,
then we would define a bit in the flags field of struct
kvm_get_htab_fd to mean I want the new stream format.  The code
fails the ioctl if any unknown flag bits are set, so a new userspace
that wants to use the new format could then detect that it is running
on an old kernel and fall back to the old format.

The HPT entry format is very unlikely to change in size or basic
layout (though the architects do redefine some of the bits
occasionally).

  +
  +Writes to the fd create HPT entries starting at the index given in the
  +header; first `n_valid' valid entries with contents from the data
  +written, then `n_invalid' invalid entries, invalidating any previously
  +valid entries found.
 
 This scheme is a clever, original, and very interesting approach to live
 migration.  That doesn't necessarily mean a NAK, we should see if it
 makes sense for other migration APIs as well (we currently have
 difficulties migrating very large/wide guests).
 
 What is the typical number of entries in the HPT?  Do you have estimates
 of the change rate?

Typically the HPT would have about a million entries, i.e. it would be
16MiB in size.  The usual guideline is to make it about 1/64 of the
maximum amount of RAM the guest could ever have, rounded up to a power
of two, although we often run with less, say 1/128 or even 1/256.

Because it is a hash table, updates tend to be scattered throughout
the whole table, which is another reason why per-page dirty tracking
and updates would be pretty inefficient.

As for the change rate, it depends on the application of course, but
basically every time the guest changes a PTE in its Linux page tables
we do the corresponding change to the corresponding HPT entry, so the
rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
exec, etc. have a high rate of HPT updates.

 Suppose new hardware arrives that supports nesting HPTs, so that kvm is
 no longer synchronously aware of the guest HPT (similar to how NPT/EPT
 made

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Avi Kivity

On 10/16/2012 01:58 PM, Paul Mackerras wrote:
 On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
 On 10/16/2012 06:01 AM, Paul Mackerras wrote:
  +4.78 KVM_PPC_GET_HTAB_FD
  +
  +Capability: KVM_CAP_PPC_HTAB_FD
  +Architectures: powerpc
  +Type: vm ioctl
  +Parameters: Pointer to struct kvm_get_htab_fd (in)
  +Returns: file descriptor number (= 0) on success, -1 on error
  +
  +This returns a file descriptor that can be used either to read out the
  +entries in the guest's hashed page table (HPT), or to write entries to
  +initialize the HPT.  The returned fd can only be written to if the
  +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
  +can only be read if that bit is clear.  The argument struct looks like
  +this:
  +
  +/* For KVM_PPC_GET_HTAB_FD */
  +struct kvm_get_htab_fd {
  +  __u64   flags;
  +  __u64   start_index;
  +};
  +
  +/* Values for kvm_get_htab_fd.flags */
  +#define KVM_GET_HTAB_BOLTED_ONLY  ((__u64)0x1)
  +#define KVM_GET_HTAB_WRITE((__u64)0x2)
  +
  +The `start_index' field gives the index in the HPT of the entry at
  +which to start reading.  It is ignored when writing.
  +
  +Reads on the fd will initially supply information about all
  +interesting HPT entries.  Interesting entries are those with the
  +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
  +all entries.  When the end of the HPT is reached, the read() will
  +return.  
 
 What happens if the read buffer is smaller than the HPT size?
 
 That's fine; the read stops when it has filled the buffer and a
 subsequent read will continue from where the previous one finished.
 
 What happens if the read buffer size is not a multiple of entry size?
 
 Then we don't use the last few bytes of the buffer.  The read() call
 returns the number of bytes that were filled in, of course.  In any
 case, the header size is 8 bytes and the HPT entry size is 16 bytes,
 so the number of bytes filled in won't necessarily be a multiple of 16
 bytes.

That's sane and expected, but it should be documented.

 
 Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
 to change).
 
 No.

This forces userspace to dedicate a thread for the HPT.

 
  If read() is called again on the fd, it will start again from
  +the beginning of the HPT, but will only return HPT entries that have
  +changed since they were last read.
  +
  +Data read or written is structured as a header (8 bytes) followed by a
  +series of valid HPT entries (16 bytes) each.  The header indicates how
  +many valid HPT entries there are and how many invalid entries follow
  +the valid entries.  The invalid entries are not represented explicitly
  +in the stream.  The header format is:
  +
  +struct kvm_get_htab_header {
  +  __u32   index;
  +  __u16   n_valid;
  +  __u16   n_invalid;
  +};
 
 This structure forces the kernel to return entries sequentially.  Will
 this block changing the data structure in the future?  Or is the
 hardware spec sufficiently strict that such changes are not realistic?
 
 By data structure, do you mean the stream format on the file
 descriptor, or the HPT structure?  If we want a new stream format,
 then we would define a bit in the flags field of struct
 kvm_get_htab_fd to mean I want the new stream format.  The code
 fails the ioctl if any unknown flag bits are set, so a new userspace
 that wants to use the new format could then detect that it is running
 on an old kernel and fall back to the old format.
 
 The HPT entry format is very unlikely to change in size or basic
 layout (though the architects do redefine some of the bits
 occasionally).

I meant the internal data structure that holds HPT entries.

I guess I don't understand the index.  Do we expect changes to be in
contiguous ranges?  And invalid entries to be contiguous as well?  That
doesn't fit with how hash tables work.  Does the index represent the
position of the entry within the table, or something else?


 
  +
  +Writes to the fd create HPT entries starting at the index given in the
  +header; first `n_valid' valid entries with contents from the data
  +written, then `n_invalid' invalid entries, invalidating any previously
  +valid entries found.
 
 This scheme is a clever, original, and very interesting approach to live
 migration.  That doesn't necessarily mean a NAK, we should see if it
 makes sense for other migration APIs as well (we currently have
 difficulties migrating very large/wide guests).
 
 What is the typical number of entries in the HPT?  Do you have estimates
 of the change rate?
 
 Typically the HPT would have about a million entries, i.e. it would be
 16MiB in size.  The usual guideline is to make it about 1/64 of the
 maximum amount of RAM the guest could ever have, rounded up to a power
 of two, although we often run with less, say 1/128 or even 1/256.

16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
it warrant a live migration protocol?

 Because it is

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Avi Kivity

On 10/16/2012 01:06 PM, Alexander Graf wrote:
 
 On 16.10.2012, at 13:01, Avi Kivity wrote:
 
 On 10/16/2012 12:59 PM, Alexander Graf wrote:
 
 On 16.10.2012, at 12:56, Avi Kivity wrote:
 
 On 10/15/2012 02:02 PM, Alexander Graf wrote:
 In order to support vhost, we need to be able to support ioeventfd.
 
 This patch set adds support for ioeventfd to PPC and makes it possible to
 do so without implementing irqfd along the way, as it requires an 
 in-kernel
 irqchip which we don't have yet.
 
 It's not strictly required.  You have an interrupt line leading to the
 core, no?  You could have your irqfd trigger that.
 
 The irqfd code in KVM directly triggers the in-kernel irqchip. That's what 
 this patch set is cleaning up: enable you to build ioeventfd support 
 without in-kernel irqchip irqfd support.
 
 Vhost can work without in-kernel irqchip too, by listening to an eventfd 
 (iiuc) in user space. That path works just fine with these patches applied.
 
 That's all true but it wasn't my point.  I was asking whether you can
 enable irqfd without in-kernel irqchip.  If your irq input is edge
 triggered then it's trivial.  If it's level triggered then you can use
 the new resampling irqfd.
 
 I'm not sure I fully grasp what you're trying to say :).
 
 We have a single interrupt line on the core. So whenever any external 
 interrupt gets injected (same thing for MSI), we need to go to the MPIC / 
 XICS and ask it which line is active.

Couldn't you attach that payload to the irqfd?  On x86 an irqfd is
associated with a gsi, and a gsi with extra information, including all
that is needed to queue an MSI.

 
 So yes, we could create a direct fd channel between vhost and the user space 
 MPIC, but it wouldn't buy us anything. The interrupt injection path would be 
 as long as it is with the current mechanism.
 

If there is a lot of prioritization and/or queuing logic, then yes.  But
what about MSI?  Doesn't that have a direct path?

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/2] KVM: PPC: Support ioeventfd

2012-10-16 Thread Alexander Graf


On 10/16/2012 03:47 PM, Avi Kivity wrote:

On 10/16/2012 01:06 PM, Alexander Graf wrote:

On 16.10.2012, at 13:01, Avi Kivity wrote:


On 10/16/2012 12:59 PM, Alexander Graf wrote:

On 16.10.2012, at 12:56, Avi Kivity wrote:


On 10/15/2012 02:02 PM, Alexander Graf wrote:

In order to support vhost, we need to be able to support ioeventfd.

This patch set adds support for ioeventfd to PPC and makes it possible to
do so without implementing irqfd along the way, as it requires an in-kernel
irqchip which we don't have yet.

It's not strictly required.  You have an interrupt line leading to the
core, no?  You could have your irqfd trigger that.

The irqfd code in KVM directly triggers the in-kernel irqchip. That's what this 
patch set is cleaning up: enable you to build ioeventfd support without 
in-kernel irqchip irqfd support.

Vhost can work without in-kernel irqchip too, by listening to an eventfd (iiuc) 
in user space. That path works just fine with these patches applied.

That's all true but it wasn't my point.  I was asking whether you can
enable irqfd without in-kernel irqchip.  If your irq input is edge
triggered then it's trivial.  If it's level triggered then you can use
the new resampling irqfd.

I'm not sure I fully grasp what you're trying to say :).

We have a single interrupt line on the core. So whenever any external interrupt 
gets injected (same thing for MSI), we need to go to the MPIC / XICS and ask it 
which line is active.

Couldn't you attach that payload to the irqfd?  On x86 an irqfd is
associated with a gsi, and a gsi with extra information, including all
that is needed to queue an MSI.


So yes, we could create a direct fd channel between vhost and the user space 
MPIC, but it wouldn't buy us anything. The interrupt injection path would be as 
long as it is with the current mechanism.


If there is a lot of prioritization and/or queuing logic, then yes.  But
what about MSI?  Doesn't that have a direct path?


Nope. Well, yes, in a certain special case where the MPIC pushes the 
interrupt vector on interrupt delivery into a special register. But not 
for the normal case.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm/powerpc: Handle errors in secondary thread grabbing

2012-10-16 Thread Benjamin Herrenschmidt

On Tue, 2012-10-16 at 17:00 +1100, Michael Ellerman wrote:
  Thanks for looking at this - but in fact this is fixed by my patch
  entitled KVM: PPC: Book3S HV: Fix some races in starting secondary
  threads submitted back on August 28.
 
 OK thanks. It seems that patch didn't make 3.7 ?
 
 I don't see it in kvm-ppc-next either.

Alex, WTF ?

Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Anthony Liguori

Avi Kivity a...@redhat.com writes:

 On 10/16/2012 01:58 PM, Paul Mackerras wrote:
 On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
 On 10/16/2012 06:01 AM, Paul Mackerras wrote:
  +4.78 KVM_PPC_GET_HTAB_FD
  +
  +Capability: KVM_CAP_PPC_HTAB_FD
  +Architectures: powerpc
  +Type: vm ioctl
  +Parameters: Pointer to struct kvm_get_htab_fd (in)
  +Returns: file descriptor number (= 0) on success, -1 on error
  +
  +This returns a file descriptor that can be used either to read out the
  +entries in the guest's hashed page table (HPT), or to write entries to
  +initialize the HPT.  The returned fd can only be written to if the
  +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
  +can only be read if that bit is clear.  The argument struct looks like
  +this:
  +
  +/* For KVM_PPC_GET_HTAB_FD */
  +struct kvm_get_htab_fd {
  + __u64   flags;
  + __u64   start_index;
  +};
  +
  +/* Values for kvm_get_htab_fd.flags */
  +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1)
  +#define KVM_GET_HTAB_WRITE   ((__u64)0x2)
  +
  +The `start_index' field gives the index in the HPT of the entry at
  +which to start reading.  It is ignored when writing.
  +
  +Reads on the fd will initially supply information about all
  +interesting HPT entries.  Interesting entries are those with the
  +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
  +all entries.  When the end of the HPT is reached, the read() will
  +return.  
 
 What happens if the read buffer is smaller than the HPT size?
 
 That's fine; the read stops when it has filled the buffer and a
 subsequent read will continue from where the previous one finished.
 
 What happens if the read buffer size is not a multiple of entry size?
 
 Then we don't use the last few bytes of the buffer.  The read() call
 returns the number of bytes that were filled in, of course.  In any
 case, the header size is 8 bytes and the HPT entry size is 16 bytes,
 so the number of bytes filled in won't necessarily be a multiple of 16
 bytes.

 That's sane and expected, but it should be documented.

 
 Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
 to change).
 
 No.

 This forces userspace to dedicate a thread for the HPT.

If no changes are available, does read return a size  0?  I don't think
it's necessary to support polling.  The kernel should always be able to
respond to userspace here.  The only catch is whether to return !0 read
sizes when there are no changes.

At any case, I can't see why a dedicated thread is needed.  QEMU is
going to poll HPT based on how fast we can send data over the wire.

  If read() is called again on the fd, it will start again from
  +the beginning of the HPT, but will only return HPT entries that have
  +changed since they were last read.
  +
  +Data read or written is structured as a header (8 bytes) followed by a
  +series of valid HPT entries (16 bytes) each.  The header indicates how
  +many valid HPT entries there are and how many invalid entries follow
  +the valid entries.  The invalid entries are not represented explicitly
  +in the stream.  The header format is:
  +
  +struct kvm_get_htab_header {
  + __u32   index;
  + __u16   n_valid;
  + __u16   n_invalid;
  +};
 
 This structure forces the kernel to return entries sequentially.  Will
 this block changing the data structure in the future?  Or is the
 hardware spec sufficiently strict that such changes are not realistic?
 
 By data structure, do you mean the stream format on the file
 descriptor, or the HPT structure?  If we want a new stream format,
 then we would define a bit in the flags field of struct
 kvm_get_htab_fd to mean I want the new stream format.  The code
 fails the ioctl if any unknown flag bits are set, so a new userspace
 that wants to use the new format could then detect that it is running
 on an old kernel and fall back to the old format.
 
 The HPT entry format is very unlikely to change in size or basic
 layout (though the architects do redefine some of the bits
 occasionally).

 I meant the internal data structure that holds HPT entries.

 I guess I don't understand the index.  Do we expect changes to be in
 contiguous ranges?  And invalid entries to be contiguous as well?  That
 doesn't fit with how hash tables work.  Does the index represent the
 position of the entry within the table, or something else?


 
  +
  +Writes to the fd create HPT entries starting at the index given in the
  +header; first `n_valid' valid entries with contents from the data
  +written, then `n_invalid' invalid entries, invalidating any previously
  +valid entries found.
 
 This scheme is a clever, original, and very interesting approach to live
 migration.  That doesn't necessarily mean a NAK, we should see if it
 makes sense for other migration APIs as well (we currently have
 difficulties migrating very large/wide guests).
 
 What is the typical number of entries in the HPT?  Do you have estimates
 of the

Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-10-16 Thread Paul Mackerras

On Tue, Oct 16, 2012 at 03:06:33PM +0200, Avi Kivity wrote:
 On 10/16/2012 01:58 PM, Paul Mackerras wrote:
  On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
  Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
  to change).
  
  No.
 
 This forces userspace to dedicate a thread for the HPT.

Why? Reads never block in any case.

   If read() is called again on the fd, it will start again from
   +the beginning of the HPT, but will only return HPT entries that have
   +changed since they were last read.
   +
   +Data read or written is structured as a header (8 bytes) followed by a
   +series of valid HPT entries (16 bytes) each.  The header indicates how
   +many valid HPT entries there are and how many invalid entries follow
   +the valid entries.  The invalid entries are not represented explicitly
   +in the stream.  The header format is:
   +
   +struct kvm_get_htab_header {
   +__u32   index;
   +__u16   n_valid;
   +__u16   n_invalid;
   +};
  
  This structure forces the kernel to return entries sequentially.  Will
  this block changing the data structure in the future?  Or is the
  hardware spec sufficiently strict that such changes are not realistic?
  
  By data structure, do you mean the stream format on the file
  descriptor, or the HPT structure?  If we want a new stream format,
  then we would define a bit in the flags field of struct
  kvm_get_htab_fd to mean I want the new stream format.  The code
  fails the ioctl if any unknown flag bits are set, so a new userspace
  that wants to use the new format could then detect that it is running
  on an old kernel and fall back to the old format.
  
  The HPT entry format is very unlikely to change in size or basic
  layout (though the architects do redefine some of the bits
  occasionally).
 
 I meant the internal data structure that holds HPT entries.

Oh, that's just an array, and userspace already knows how big it is.

 I guess I don't understand the index.  Do we expect changes to be in
 contiguous ranges?  And invalid entries to be contiguous as well?  That
 doesn't fit with how hash tables work.  Does the index represent the
 position of the entry within the table, or something else?

The index is just the position in the array.  Typically, in each group
of 8 it will tend to be the low-numbered ones that are valid, since
creating an entry usually uses the first empty slot.  So I expect that
on the first pass, most of the records will represent 8 HPTEs.  On
subsequent passes, probably most records will represent a single HPTE.

   +
   +Writes to the fd create HPT entries starting at the index given in the
   +header; first `n_valid' valid entries with contents from the data
   +written, then `n_invalid' invalid entries, invalidating any previously
   +valid entries found.
  
  This scheme is a clever, original, and very interesting approach to live
  migration.  That doesn't necessarily mean a NAK, we should see if it
  makes sense for other migration APIs as well (we currently have
  difficulties migrating very large/wide guests).
  
  What is the typical number of entries in the HPT?  Do you have estimates
  of the change rate?
  
  Typically the HPT would have about a million entries, i.e. it would be
  16MiB in size.  The usual guideline is to make it about 1/64 of the
  maximum amount of RAM the guest could ever have, rounded up to a power
  of two, although we often run with less, say 1/128 or even 1/256.
 
 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
 it warrant a live migration protocol?

The qemu people I talked to seemed to think so.

  Because it is a hash table, updates tend to be scattered throughout
  the whole table, which is another reason why per-page dirty tracking
  and updates would be pretty inefficient.
 
 This suggests a stream format that includes the index in every entry.

That would amount to dropping the n_valid and n_invalid fields from
the current header format.  That would be less efficient for the
initial pass (assuming we achieve an average n_valid of at least 2 on
the initial pass), and probably less efficient for the incremental
updates, since a newly-invalidated entry would have to be represented
as 16 zero bytes rather than just an 8-byte header with n_valid=0 and
n_invalid=1.  I'm assuming here that the initial pass would omit
invalid entries.

  
  As for the change rate, it depends on the application of course, but
  basically every time the guest changes a PTE in its Linux page tables
  we do the corresponding change to the corresponding HPT entry, so the
  rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
  exec, etc. have a high rate of HPT updates.
 
 If the rate is high enough, then there's no point in a live update.

True, but doesn't that argument apply to memory pages as well?

Paul.
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to

94 matches

Mail list logo