Re: [patch 02/18] x86: kvmclock: allocate pvclock shared memory area

2012-11-15 Thread Glauber Costa
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
 We want to expose the pvclock shared memory areas, which 
 the hypervisor periodically updates, to userspace.
 
 For a linear mapping from userspace, it is necessary that
 entire page sized regions are used for array of pvclock 
 structures.
 
 There is no such guarantee with per cpu areas, therefore move
 to memblock_alloc based allocation.
Ok.

 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
For the concept:

Acked-by: Glauber Costa glom...@parallels.com

I do have one comment.

 
 Index: vsyscall/arch/x86/kernel/kvmclock.c
 ===
 --- vsyscall.orig/arch/x86/kernel/kvmclock.c
 +++ vsyscall/arch/x86/kernel/kvmclock.c
 @@ -23,6 +23,7 @@
  #include <asm/apic.h>
  #include <linux/percpu.h>
  #include <linux/hardirq.h>
 +#include <linux/memblock.h>
  
  #include <asm/x86_init.h>
  #include <asm/reboot.h>
 @@ -39,7 +40,11 @@ static int parse_no_kvmclock(char *arg)
  early_param("no-kvmclock", parse_no_kvmclock);
  
  /* The hypervisor will put information about time periodically here */
 -static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, 
 hv_clock);
 +struct pvclock_aligned_vcpu_time_info {
 + struct pvclock_vcpu_time_info clock;
 +} __attribute__((__aligned__(SMP_CACHE_BYTES)));
 +
 +static struct pvclock_aligned_vcpu_time_info *hv_clock;
  static struct pvclock_wall_clock wall_clock;
  
  /*
 @@ -52,15 +57,20 @@ static unsigned long kvm_get_wallclock(v
   struct pvclock_vcpu_time_info *vcpu_time;
   struct timespec ts;
   int low, high;
 + int cpu;
 +
 + preempt_disable();
 + cpu = smp_processor_id();
  
   low = (int)__pa_symbol(&wall_clock);
   high = ((u64)__pa_symbol(&wall_clock) >> 32);
  
   native_write_msr(msr_kvm_wall_clock, low, high);
  
  - vcpu_time = &get_cpu_var(hv_clock);
  + vcpu_time = &hv_clock[cpu].clock;


You are leaving preemption disabled for a lot longer than it should be. In
particular, you'll have a round trip to the hypervisor in the middle. It
doesn't matter *that* much because wallclock is mostly a one-time thing,
so I won't oppose on this basis alone.

But if you have the chance, either fix it, or tell us why you need
preemption disabled for longer than it was before.
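
For reference, one way to keep the window small (a rough, untested sketch
based on the code above; the hypervisor round trip stays outside the
critical section):

static unsigned long kvm_get_wallclock(void)
{
	struct pvclock_vcpu_time_info *vcpu_time;
	struct timespec ts;
	int low, high, cpu;

	low = (int)__pa_symbol(&wall_clock);
	high = ((u64)__pa_symbol(&wall_clock) >> 32);

	/* MSR write (hypervisor round trip) with preemption still enabled */
	native_write_msr(msr_kvm_wall_clock, low, high);

	/* only the per-cpu lookup and the read need preemption disabled */
	preempt_disable();
	cpu = smp_processor_id();
	vcpu_time = &hv_clock[cpu].clock;
	pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
	preempt_enable();

	return ts.tv_sec;
}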

  void __init kvmclock_init(void)
  {
 + unsigned long mem;
 +
   if (!kvm_para_available())
   return;
  
 + mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * 
 NR_CPUS,
 +  PAGE_SIZE);
 + if (!mem)
 + return;
 + hv_clock = __va(mem);
 +
   if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
   msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
   msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
 

If you don't have kvmclock enabled, which the next line here tests for,
you will exit this function with the memory still allocated. How about
the following?

if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
	msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
	msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
	return;

mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info)
		     * NR_CPUS, PAGE_SIZE);
if (!mem)
	return;
hv_clock = __va(mem);

printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
       msr_kvm_system_time, msr_kvm_wall_clock);

if (kvm_register_clock("boot clock")) {
	memblock_free();
	return;
}






[Question] Intercept read access in KVM

2012-11-15 Thread R
Hi, everyone

I am studying the KVM code, and I am trying to intercept a guest's read
access to a page using VMX and EPT support.

According to the Intel manual, the lowest bit of an EPT page table entry
controls read access to the page.

I modified the rmap_write_protect function to remove the
VMX_EPT_READABLE_MASK of a spte. This is accomplished by changing the
new_spte parameter of the mmu_spte_update function.

But in KVM, it seems this bit is also used to indicate whether a page is
present or not, so this triggers the WARN_ON(!is_rmap_spte(new_spte)) and
causes the host to panic.
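
For illustration only (a simplified sketch, not the actual KVM helpers):
bit 0 of an EPT entry is the read-permission bit, so clearing it removes
read access, but once the low permission bits are gone KVM's rmap code no
longer considers the spte present, which is why the WARN_ON fires.

#define VMX_EPT_READABLE_MASK	0x1ull	/* EPT bit 0: read permission */

/* what the modification effectively does to the spte */
static inline u64 ept_clear_read(u64 spte)
{
	return spte & ~VMX_EPT_READABLE_MASK;
}
/* Roughly what trips the warning: with the read bit cleared, helpers
 * along the lines of is_rmap_spte()/is_shadow_present_pte() classify
 * the entry as not present rather than as read-protected. */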

Why can't rmap_write_protect be used to set the spte to non-present?

Is the drop_spte function the only way to intercept a read access to a
guest page?

I am using kvm-kmod-3.1 and Linux kernel 3.1. Can anyone tell me whether I
am doing something wrong, or whether there is a more elegant way to do this?

Thank you for answering.

--
Thanks
Rui Wu


Re: [patch 05/18] x86: pvclock: create helper for pvclock data retrieval

2012-11-15 Thread Glauber Costa
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
 Originally from Jeremy Fitzhardinge.
 
 So code can be reused.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
I thought I had acked this one already?

But maybe I didn't...

Acked-by: Glauber Costa glom...@parallels.com


Re: [patch 06/18] x86: pvclock: introduce helper to read flags

2012-11-15 Thread Glauber Costa
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 
 Index: vsyscall/arch/x86/kernel/pvclock.c
 ===
 --- vsyscall.orig/arch/x86/kernel/pvclock.c
Acked-by: Glauber Costa glom...@parallels.com



Re: [patch 07/18] x86: pvclock: add note about rdtsc barriers

2012-11-15 Thread Glauber Costa
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
 As noted by Gleb, not advertising SSE2 support implies
 no RDTSC barriers.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

And this gets a separate patch because?

 Index: vsyscall/arch/x86/include/asm/pvclock.h
 ===
 --- vsyscall.orig/arch/x86/include/asm/pvclock.h
 +++ vsyscall/arch/x86/include/asm/pvclock.h
 @@ -74,6 +74,9 @@ unsigned __pvclock_read_cycles(const str
   u8 ret_flags;
  
   version = src->version;
 + /* Note: emulated platforms which do not advertise SSE2 support
 +  * result in kvmclock not using the necessary RDTSC barriers.
 +  */

And the expected effects are? Will it work in that case? Any precautions
one must take? Is it safe for Xen? Is it safe for KVM?

Those are the types of things I'd expect to see in a comment for
something as subtle as this.
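
For context, a sketch of the read sequence those barriers protect (loosely
modeled on the pvclock read path; not the exact upstream code):

/* Sketch only: mirrors the structure of the pvclock read path. */
static u64 pvclock_read_sketch(struct pvclock_vcpu_time_info *src)
{
	u32 version;
	u64 ret;

	do {
		version = src->version;	/* odd = update in progress */
		rdtsc_barrier();	/* keep RDTSC inside the section */
		ret = src->system_time + pvclock_get_nsec_offset(src);
		rdtsc_barrier();	/* without the barriers, RDTSC may be
					 * reordered outside the versioned
					 * section and break monotonicity */
	} while ((version & 1) || version != src->version);

	return ret;
}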

   rdtsc_barrier();
   offset = pvclock_get_nsec_offset(src);
   ret = src->system_time + offset;
 



Re: [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration

2012-11-15 Thread Glauber Costa
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
 As requested by Glauber, do not update kvmclock area on vcpu-pcpu 
 migration, in case the host has stable TSC. 
 
 This is to reduce cacheline bouncing.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
This looks fine, but it can always get tricky...
Assuming you tested this change in at least one stable tsc and one
unstable tsc system:

Acked-by: Glauber Costa glom...@parallels.com



RE: PROBLEM: compilation issue, inline assembly arch/x86/kvm/emulate.c fails at -O0

2012-11-15 Thread Blower, Melanie
Thanks for your reply. As you agree there is an actual bug in this code, would 
you kindly be able to tell me when a fix would be available in the Linux trunk?
Thanks and best regards, Melanie Blower

-Original Message-
From: H. Peter Anvin [mailto:h...@zytor.com] 
Sent: Wednesday, November 14, 2012 7:14 PM
To: Blower, Melanie
Cc: t...@linutronix.de; mi...@redhat.com; a...@redhat.com; x...@kernel.org; 
kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: PROBLEM: compilation issue, inline assembly arch/x86/kvm/emulate.c 
fails at -O0

On 11/14/2012 11:45 AM, Blower, Melanie wrote:
 [1.] gcc -O0 assembly arch/x86/kvm/emulate.c gets compilation failure 
 -- incorrect register restrictions [2.] Full description of the 
 problem/report:
 I'm trying to compile this file at -O0, but gcc chokes in register allocation 
 at the inline assembly.

 In the ordinary Linux build, this file compiles with gcc at -O2, without 
 compilation errors.

Compiling with -O0 is not really expected to work (although -O1 *is*), although
what you are reporting is an actual bug ("+a" : "a" should either be "+a" or
"=a" : "a").
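
A minimal illustration of the constraint problem (hypothetical snippet, not
the emulate.c code itself): listing the same register both as a read-write
output and as a separate input over-constrains the register allocator,
which is what breaks at -O0.

/* Over-constrained: "+a" already means x is both input and output in
 * EAX, so also passing x as an "a" input asks for EAX twice. */
static inline int bad(int x)
{
	asm("addl $1, %0" : "+a"(x) : "a"(x));
	return x;
}

/* Either of these forms is well-formed: */
static inline int good_plus(int x)
{
	asm("addl $1, %0" : "+a"(x));		/* "+a" alone */
	return x;
}

static inline int good_eq(int x)
{
	int out;
	asm("addl $1, %0" : "=a"(out) : "a"(x));	/* "=a" output, "a" input */
	return out;
}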

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center I work for Intel.  I don't 
speak on their behalf.



RE: virtio + vhost-net performance issue - preadv ?

2012-11-15 Thread Ben Clay
David-

Thanks for the followup.  That is disappointing, and I wish I knew why the
performance is so poor.  With the kernel and qemu replaced, I don't know
where the limitation is - raising the MTU makes no difference, and I also
tried a few different kernels inside the guest.  The network stack can
natively support high throughput (~23Gbps on IP over Infiniband on these
nodes), so I'm at a loss.  Maybe it's the lack of preadv?
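
If it helps narrow things down, here is a rough standalone check for
whether the build environment actually provides preadv() (an illustrative
sketch, not part of qemu's configure; file path is arbitrary):

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <sys/uio.h>

int main(void)
{
	char buf[16];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("/etc/passwd", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* glibc without preadv (e.g. stock CentOS 5) fails at link time */
	if (preadv(fd, &iov, 1, 0) < 0) {
		perror("preadv");
		return 1;
	}
	printf("preadv works\n");
	return 0;
}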

Ben Clay
rbc...@ncsu.edu


-Original Message-
From: David Cruz [mailto:david.c...@gigas.com] 
Sent: Wednesday, November 14, 2012 1:27 AM
To: Ben Clay
Subject: Re: virtio + vhost-net performance issue - preadv ?

Got the same results here last year when we were testing.

In the end, we use only CentOS6. And even more, we changed the kernel to
3.5.5 due to unstable Windows virtualization when using several Windows
Server in the same Hypervisor.

2-4 GBit/s in Centos5 is acceptable. I think that's the max you can get in
that version.

David

2012/11/14 Ben Clay rbc...@ncsu.edu:
 I have a working copy of libvirt 0.10.2 + qemu 1.2 installed on a 
 vanilla up-to-date (2.6.32-279.9.1) CentOS 6 host, and get very good 
 VM - VM network performance (both running on the same host) using 
 virtio.  I have cgroups set to cap the VMs at 10Gbps and iperf shows 
 I'm getting exactly 10Gbps.

 I copied these VMs to a CentOS 5 host and installed libvirt 1.0 + qemu
1.2.
 However, the best performance I can get in between the VMs (again 
 running on the same host) is ~2Gbps.  In both cases, this is over a 
 bridged interface with static IPs assigned to each VM.  I've also 
 tried virtual networking with NAT or routing, yielding the same results.

 I figured it was due to vhost-net missing on the older CentOS 5 
 kernel, so I installed 2.6.39-4.2 from ELRepo and got the 
 /dev/vhost-net device and vhost processes associated with each VM:

 ]$ lsmod | grep vhost
 vhost_net  28446  2
 tun23888  7 vhost_net

 ]$ ps aux | grep vhost-
 root  9628  0.0  0.0  0 0 ?S17:57   0:00
 [vhost-9626]
 root  9671  0.0  0.0  0 0 ?S17:57   0:00
 [vhost-9670]

 ]$ ls /dev/vhost-net -al
 crw--- 1 root root 10, 58 Nov 13 15:19 /dev/vhost-net

 After installing the new kernel, I also tried rebuilding libvirt and 
 qemu, to no avail.  I also disabled cgroups, just in case it was 
 getting in the way, as well as iptables.  I can see the virtio_net 
 module loaded inside the guest, and using virtio raises my performance 
 from 400Mbps to 2Gbps, so it does make some improvement.

 The only differences between the two physical hosts that I can find are:

 - qemu on the CentOS 5 host builds without preadv support - would this 
 make such a huge performance difference?  CentOS5 only comes with an 
 old version of glibc, which is missing preadv
 - qemu on the CentOS 5 host builds without PIE
 - libvirt 1.0 was required on the CentOS 5 host, since 0.10.2 had a 
 build bug. This shouldn't matter I don't think.
 - I haven't tried rebuilding the VMs from scratch on the CentOS5 host, 
 which I guess is worth a try.

 The qemu process is being started with virtio + vhost:

 /usr/bin/qemu-system-x86_64 -name vmname -S -M pc-1.2 -enable-kvm -m 
 4096 -smp 8,sockets=8,cores=1,threads=1 -uuid
 212915ed-a34a-4d6d-68f5-2216083a7693 -no-user-config -nodefaults 
 -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/vmname.monitor,server
 ,nowai t -mon chardev=charmonitor,id=monitor,mode=control -rtc 
 base=utc -no-shutdown -device 
 piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
 file=/mnt/vmname/disk.img,if=none,id=drive-virtio-disk0,format=raw,cac
 he=non
 e -device
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id
 =virti
 o-disk0,bootindex=1 -netdev tap,fd=16,id=hostnet0,vhost=on,vhostfd=18
 -device
 virtio-net-pci,netdev=hostnet0,id=net0,mac=00:11:22:33:44:55,bus=pci.0
 ,addr=
 0x3 -chardev pty,id=charserial0 -device
 isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 
 -vnc
 127.0.0.1:1 -vga cirrus -device
 virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

 The relevant part of my libvirt config, of which I've tried omitting 
 the target, alias and address elements with no difference in performance:

 <interface type='bridge'>
   <mac address='00:11:22:33:44:55'/>
   <source bridge='br0'/>
   <target dev='vnet0'/>
   <model type='virtio'/>
   <alias name='net0'/>
   <address type='pci' domain='0x0000' bus='0x00' slot='0x03'
    function='0x0'/>
 </interface>

 Is there something else which could be getting in the way here?

 Thanks!

 Ben Clay
 rbc...@ncsu.edu




[PATCH] vfio-pci: Re-order device reset

2012-11-15 Thread Alex Williamson
Move the device reset to the end of our disable path, the device
should already be stopped from pci_disable_device().  This also allows
us to manipulate the save/restore to avoid the save/reset/restore +
save/restore that we had before.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

Hannes, I think this does the move you're looking for, please let
me know if it works better for you.  This is also in my next branch:

git://github.com/awilliam/linux-vfio.git next

Thanks,

Alex

 drivers/vfio/pci/vfio_pci.c |   43 +++
 1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index a4dc21b..b179f5a 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -92,9 +92,10 @@ out:
 
 static void vfio_pci_disable(struct vfio_pci_device *vdev)
 {
+	struct pci_dev *pdev = vdev->pdev;
 	int bar;
 
-	pci_disable_device(vdev->pdev);
+	pci_disable_device(pdev);
 
vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
VFIO_IRQ_SET_ACTION_TRIGGER,
@@ -104,22 +105,40 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 
vfio_config_free(vdev);
 
-	pci_reset_function(vdev->pdev);
-
-	if (pci_load_and_free_saved_state(vdev->pdev,
-					  &vdev->pci_saved_state) == 0)
-		pci_restore_state(vdev->pdev);
-	else
-		pr_info("%s: Couldn't reload %s saved state\n",
-			__func__, dev_name(&vdev->pdev->dev));
-
 	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
 		if (!vdev->barmap[bar])
 			continue;
-		pci_iounmap(vdev->pdev, vdev->barmap[bar]);
-		pci_release_selected_regions(vdev->pdev, 1 << bar);
+		pci_iounmap(pdev, vdev->barmap[bar]);
+		pci_release_selected_regions(pdev, 1 << bar);
 		vdev->barmap[bar] = NULL;
}
+
+   /*
+* If we have saved state, restore it.  If we can reset the device,
+* even better.  Resetting with current state seems better than
+* nothing, but saving and restoring current state without reset
+* is just busy work.
+*/
+	if (pci_load_and_free_saved_state(pdev, &vdev->pci_saved_state)) {
+		pr_info("%s: Couldn't reload %s saved state\n",
+			__func__, dev_name(&pdev->dev));
+
+		if (!vdev->reset_works)
+   return;
+
+   pci_save_state(pdev);
+   }
+
+   /*
+* Disable INTx and MSI, presumably to avoid spurious interrupts
+* during reset.  Stolen from pci_reset_function()
+*/
+   pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
+
+	if (vdev->reset_works)
+   __pci_reset_function(pdev);
+
+   pci_restore_state(pdev);
 }
 
 static void vfio_pci_release(void *device_data)



[PATCH v2 3/3] VFIO: fix out of order labels for error recovery in vfio_pci_init()

2012-11-15 Thread Jiang Liu
The two labels for error recovery in function vfio_pci_init() are out of
order, so fix them.

Signed-off-by: Jiang Liu jiang@huawei.com
---
 drivers/vfio/pci/vfio_pci.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6968b72..d07fb7e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -563,9 +563,9 @@ static int __init vfio_pci_init(void)
 
return 0;
 
-out_virqfd:
-   vfio_pci_virqfd_exit();
 out_driver:
+   vfio_pci_virqfd_exit();
+out_virqfd:
vfio_pci_uninit_perm_bits();
return ret;
 }
-- 
1.7.9.5



[PATCH v2 2/3] VFIO: use ACCESS_ONCE() to guard access to dev-driver

2012-11-15 Thread Jiang Liu
Quoting the comment from dev_driver_string():
/* dev->driver can change to NULL underneath us because of unbinding,
 * so be careful about accessing it.
 */

So use ACCESS_ONCE() to guard access to the dev->driver field.

Signed-off-by: Jiang Liu jiang@huawei.com
---
 drivers/vfio/vfio.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 3359ec2..0090212 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -465,8 +465,9 @@ static int vfio_dev_viable(struct device *dev, void *data)
 {
struct vfio_group *group = data;
struct vfio_device *device;
+   struct device_driver *drv = ACCESS_ONCE(dev-driver);
 
-	if (!dev->driver || vfio_whitelisted_driver(dev->driver))
+	if (!drv || vfio_whitelisted_driver(drv))
return 0;
 
device = vfio_group_get_device(group, dev);
-- 
1.7.9.5



[PATCH v2 1/3] VFIO: unregister IOMMU notifier on error recovery path

2012-11-15 Thread Jiang Liu
On error recovery path in function vfio_create_group(), it should
unregister the IOMMU notifier for the new VFIO group. Otherwise it may
cause invalid memory access later when handling bus notifications.

Signed-off-by: Jiang Liu jiang@huawei.com
---
 drivers/vfio/vfio.c |   31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 17830c9..3359ec2 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -191,6 +191,17 @@ static void vfio_container_put(struct vfio_container 
*container)
 	kref_put(&container->kref, vfio_container_release);
 }
 
+static void vfio_group_unlock_and_free(struct vfio_group *group)
+{
+	mutex_unlock(&vfio.group_lock);
+	/*
+	 * Unregister outside of lock.  A spurious callback is harmless now
+	 * that the group is no longer in vfio.group_list.
+	 */
+	iommu_group_unregister_notifier(group->iommu_group, &group->nb);
+	kfree(group);
+}
+
 /**
  * Group objects - create, release, get, put, search
  */
@@ -229,8 +240,7 @@ static struct vfio_group *vfio_create_group(struct 
iommu_group *iommu_group)
 
 	minor = vfio_alloc_group_minor(group);
 	if (minor < 0) {
-		mutex_unlock(&vfio.group_lock);
-		kfree(group);
+		vfio_group_unlock_and_free(group);
 		return ERR_PTR(minor);
 	}
 
@@ -239,8 +249,7 @@ static struct vfio_group *vfio_create_group(struct 
iommu_group *iommu_group)
 		if (tmp->iommu_group == iommu_group) {
 			vfio_group_get(tmp);
 			vfio_free_group_minor(minor);
-			mutex_unlock(&vfio.group_lock);
-			kfree(group);
+			vfio_group_unlock_and_free(group);
 			return tmp;
 		}
 	}
@@ -249,8 +258,7 @@ static struct vfio_group *vfio_create_group(struct 
iommu_group *iommu_group)
 			    group, "%d", iommu_group_id(iommu_group));
 	if (IS_ERR(dev)) {
 		vfio_free_group_minor(minor);
-		mutex_unlock(&vfio.group_lock);
-		kfree(group);
+		vfio_group_unlock_and_free(group);
 		return (struct vfio_group *)dev; /* ERR_PTR */
 	}
 
@@ -274,16 +282,7 @@ static void vfio_group_release(struct kref *kref)
 	device_destroy(vfio.class, MKDEV(MAJOR(vfio.devt), group->minor));
 	list_del(&group->vfio_next);
 	vfio_free_group_minor(group->minor);
-
-	mutex_unlock(&vfio.group_lock);
-
-	/*
-	 * Unregister outside of lock.  A spurious callback is harmless now
-	 * that the group is no longer in vfio.group_list.
-	 */
-	iommu_group_unregister_notifier(group->iommu_group, &group->nb);
-
-	kfree(group);
+	vfio_group_unlock_and_free(group);
 }
 
 static void vfio_group_put(struct vfio_group *group)
-- 
1.7.9.5



Re: [PATCH v2 1/3] VFIO: unregister IOMMU notifier on error recovery path

2012-11-15 Thread Alex Williamson
On Fri, 2012-11-16 at 00:43 +0800, Jiang Liu wrote:
 On error recovery path in function vfio_create_group(), it should
 unregister the IOMMU notifier for the new VFIO group. Otherwise it may
 cause invalid memory access later when handling bus notifications.
 
 Signed-off-by: Jiang Liu jiang@huawei.com
 ---
  drivers/vfio/vfio.c |   31 +++
  1 file changed, 15 insertions(+), 16 deletions(-)


Applied all 3, thanks!

Alex

 diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
 index 17830c9..3359ec2 100644
 --- a/drivers/vfio/vfio.c
 +++ b/drivers/vfio/vfio.c
 @@ -191,6 +191,17 @@ static void vfio_container_put(struct vfio_container 
 *container)
   kref_put(container-kref, vfio_container_release);
  }
  
 +static void vfio_group_unlock_and_free(struct vfio_group *group)
 +{
 + mutex_unlock(vfio.group_lock);
 + /*
 +  * Unregister outside of lock.  A spurious callback is harmless now
 +  * that the group is no longer in vfio.group_list.
 +  */
 + iommu_group_unregister_notifier(group-iommu_group, group-nb);
 + kfree(group);
 +}
 +
  /**
   * Group objects - create, release, get, put, search
   */
 @@ -229,8 +240,7 @@ static struct vfio_group *vfio_create_group(struct 
 iommu_group *iommu_group)
  
   minor = vfio_alloc_group_minor(group);
   if (minor  0) {
 - mutex_unlock(vfio.group_lock);
 - kfree(group);
 + vfio_group_unlock_and_free(group);
   return ERR_PTR(minor);
   }
  
 @@ -239,8 +249,7 @@ static struct vfio_group *vfio_create_group(struct 
 iommu_group *iommu_group)
   if (tmp-iommu_group == iommu_group) {
   vfio_group_get(tmp);
   vfio_free_group_minor(minor);
 - mutex_unlock(vfio.group_lock);
 - kfree(group);
 + vfio_group_unlock_and_free(group);
   return tmp;
   }
   }
 @@ -249,8 +258,7 @@ static struct vfio_group *vfio_create_group(struct 
 iommu_group *iommu_group)
   group, %d, iommu_group_id(iommu_group));
   if (IS_ERR(dev)) {
   vfio_free_group_minor(minor);
 - mutex_unlock(vfio.group_lock);
 - kfree(group);
 + vfio_group_unlock_and_free(group);
   return (struct vfio_group *)dev; /* ERR_PTR */
   }
  
 @@ -274,16 +282,7 @@ static void vfio_group_release(struct kref *kref)
   device_destroy(vfio.class, MKDEV(MAJOR(vfio.devt), group-minor));
   list_del(group-vfio_next);
   vfio_free_group_minor(group-minor);
 -
 - mutex_unlock(vfio.group_lock);
 -
 - /*
 -  * Unregister outside of lock.  A spurious callback is harmless now
 -  * that the group is no longer in vfio.group_list.
 -  */
 - iommu_group_unregister_notifier(group-iommu_group, group-nb);
 -
 - kfree(group);
 + vfio_group_unlock_and_free(group);
  }
  
  static void vfio_group_put(struct vfio_group *group)





[PATCH 0/2] Resend V2 of patch set to enable guest use of TSC_ADJUST functionality

2012-11-15 Thread Will Auld
Adding Gleb.

This is a resend of the patches for TSC_ADJUST functionality. The two KVM 
patches and an additional QEMU-KVM patch 
together provide this support. 

Will Auld (2):
  Add code to track call origin for msr assignment.
  Enabling IA32_TSC_ADJUST for KVM guest VM support

 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_host.h   | 16 ---
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/kvm/cpuid.c  |  4 
 arch/x86/kvm/svm.c| 27 --
 arch/x86/kvm/vmx.c| 41 ---
 arch/x86/kvm/x86.c| 40 --
 arch/x86/kvm/x86.h|  2 +-
 8 files changed, 109 insertions(+), 23 deletions(-)

-- 
1.8.0.rc0





[PATCH 1/2] Add code to track call origin for msr assignment.

2012-11-15 Thread Will Auld
In order to track who initiated the call (host or guest) to modify an msr
value I have changed function call parameters along the call path. The
specific change is to add a struct pointer parameter that points to (index,
data, caller) information rather than having this information passed as
individual parameters.

The initial use of this capability is for updating the IA32_TSC_ADJUST
MSR while setting the TSC value. It is anticipated that this capability
will be useful for other tasks as well.

Signed-off-by: Will Auld will.a...@intel.com
---
 arch/x86/include/asm/kvm_host.h | 12 +---
 arch/x86/kvm/svm.c  | 21 +++--
 arch/x86/kvm/vmx.c  | 24 +---
 arch/x86/kvm/x86.c  | 23 +--
 arch/x86/kvm/x86.h  |  2 +-
 5 files changed, 59 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09155d6..da34027 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -598,6 +598,12 @@ struct kvm_vcpu_stat {
 
 struct x86_instruction_info;
 
+struct msr_data {
+bool host_initiated;
+u32 index;
+u64 data;
+};
+
 struct kvm_x86_ops {
int (*cpu_has_kvm_support)(void);  /* __init */
int (*disabled_by_bios)(void); /* __init */
@@ -621,7 +627,7 @@ struct kvm_x86_ops {
void (*set_guest_debug)(struct kvm_vcpu *vcpu,
struct kvm_guest_debug *dbg);
int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
-   int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+   int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
void (*get_segment)(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
@@ -772,7 +778,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu,
 
 void kvm_enable_efer_bits(u64);
 int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 struct x86_emulate_ctxt;
 
@@ -799,7 +805,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, 
int *l);
 int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr);
 
 int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
-int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
+int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);
 void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index baead95..5ac11f0 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, 
unsigned int id)
struct page *msrpm_pages;
struct page *hsave_page;
struct page *nested_msrpm_pages;
+   struct msr_data msr;
int err;
 
svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
@@ -1255,7 +1256,10 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, 
unsigned int id)
 	svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
 	svm->asid_generation = 0;
 	init_vmcb(svm);
-	kvm_write_tsc(&svm->vcpu, 0);
+	msr.data = 0x0;
+	msr.index = MSR_IA32_TSC;
+	msr.host_initiated = true;
+	kvm_write_tsc(&svm->vcpu, &msr);
 
 	err = fx_init(&svm->vcpu);
if (err)
@@ -3147,13 +3151,15 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 
data)
return 0;
 }
 
-static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
+static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	u32 ecx = msr->index;
+	u64 data = msr->data;
 	switch (ecx) {
 	case MSR_IA32_TSC:
-		kvm_write_tsc(vcpu, data);
+		kvm_write_tsc(vcpu, msr);
 		break;
 	case MSR_STAR:
 		svm->vmcb->save.star = data;
@@ -3208,20 +3214,23 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned 
ecx, u64 data)
 		vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n",
 			    ecx, data);
break;
default:
-   return kvm_set_msr_common(vcpu, ecx, data);
+   return kvm_set_msr_common(vcpu, msr);
}
return 0;
 }
 
 static int wrmsr_interception(struct vcpu_svm *svm)
 {
+   struct msr_data msr;
 	u32 ecx = svm->vcpu.arch.regs[VCPU_REGS_RCX];
 	u64 data = (svm->vcpu.arch.regs[VCPU_REGS_RAX] & -1u)
 		| ((u64)(svm->vcpu.arch.regs[VCPU_REGS_RDX] & -1u) << 32);
 
-
+   msr.data = data;
+   msr.index = ecx;
+   msr.host_initiated = false;
 	svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
-	if (svm_set_msr(&svm->vcpu, 

[PATCH 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support

2012-11-15 Thread Will Auld
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to a guest
vcpu specific location to store the value of the emulated MSR while adding
the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will
be included in all reads to the TSC MSR whether through rdmsr or rdtsc. This
is of course as long as the "use TSC counter offsetting" VM-execution
control is enabled, as well as the IA32_TSC_ADJUST control.

However, because hardware will only return TSC + IA32_TSC_ADJUST + vmcs
tsc_offset for a guest process when it does a rdtsc (with the correct
settings), the value of our virtualized IA32_TSC_ADJUST must be stored in
one of these three locations. The argument against storing it in the actual
MSR is performance: this is likely to be seldom used, while the save/restore
would be required on every transition. IA32_TSC_ADJUST was created as a way
to solve some issues with writing TSC itself, so that is not an option
either. The remaining option, defined above as our solution, has the problem
of returning incorrect vmcs tsc_offset values (unless we intercept and fix,
not done here) as mentioned above. However, more problematic is that storing
the data in vmcs tsc_offset will have a different semantic effect on the
system than does using the actual MSR. This is illustrated in the following
example: the hypervisor sets IA32_TSC_ADJUST, then the guest sets it, and a
guest process performs a rdtsc. In this case the guest process will get
TSC + IA32_TSC_ADJUST_hypervisor + vmcs tsc_offset including
IA32_TSC_ADJUST_guest. While the total system semantics changed, the
semantics as seen by the guest do not, and hence this will not cause a
problem.
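
Roughly, the emulation folds a guest write to the MSR into the vmcs
tsc_offset; a sketch of the idea (not the exact patch code), using the
field and callback introduced below:

/* Guest writes IA32_TSC_ADJUST: keep the architectural value per vcpu
 * and apply only the delta to the hardware tsc_offset, so guest rdtsc
 * and rdmsr both observe TSC + IA32_TSC_ADJUST. */
static void emulate_tsc_adjust_write(struct kvm_vcpu *vcpu, s64 data)
{
	s64 delta = data - vcpu->arch.ia32_tsc_adjust_msr;

	kvm_x86_ops->update_ia32_tsc_adjust_msr(vcpu, delta);
	vcpu->arch.ia32_tsc_adjust_msr = data;
}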

Signed-off-by: Will Auld will.a...@intel.com
---
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_host.h   |  4 
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/kvm/cpuid.c  |  4 
 arch/x86/kvm/svm.c|  6 ++
 arch/x86/kvm/vmx.c| 17 +
 arch/x86/kvm/x86.c| 17 +
 7 files changed, 50 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h 
b/arch/x86/include/asm/cpufeature.h
index 6b7ee5f..e574d81 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -199,6 +199,7 @@
 
 /* Intel-defined CPU features, CPUID level 0x0007:0 (ebx), word 9 */
 #define X86_FEATURE_FSGSBASE   (9*32+ 0) /* {RD/WR}{FS/GS}BASE instructions*/
+#define X86_FEATURE_TSC_ADJUST  (9*32+ 1) /* TSC adjustment MSR 0x3b */
 #define X86_FEATURE_BMI1   (9*32+ 3) /* 1st group bit manipulation 
extensions */
 #define X86_FEATURE_HLE(9*32+ 4) /* Hardware Lock Elision */
 #define X86_FEATURE_AVX2   (9*32+ 5) /* AVX2 instructions */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da34027..0a9ac5c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -442,6 +442,8 @@ struct kvm_vcpu_arch {
u32 virtual_tsc_mult;
u32 virtual_tsc_khz;
 
+   s64 ia32_tsc_adjust_msr;
+
atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
unsigned nmi_pending; /* NMI queued after currently running handler */
bool nmi_injected;/* Trying to inject an NMI this entry */
@@ -693,6 +695,8 @@ struct kvm_x86_ops {
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
+   bool (*have_ia32_tsc_adjust_msr)(void);
+   void (*update_ia32_tsc_adjust_msr)(struct kvm_vcpu *vcpu, s64 offset);
u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
 
void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 957ec87..6486569 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -231,6 +231,7 @@
 #define MSR_IA32_EBL_CR_POWERON		0x002a
 #define MSR_EBC_FREQUENCY_ID			0x002c
 #define MSR_IA32_FEATURE_CONTROL		0x003a
+#define MSR_IA32_TSC_ADJUST			0x003b
 
 #define FEATURE_CONTROL_LOCKED				(1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX	(1<<1)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0595f13..3007fbd 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -320,6 +320,10 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, 
u32 function,
 		if (index == 0) {
 			entry->ebx = kvm_supported_word9_x86_features;
 			cpuid_mask(&entry->ebx, 9);
+			if (kvm_x86_ops->have_ia32_tsc_adjust_msr()) {
+				// TSC_ADJUST is emulated
+				entry->ebx |= F(TSC_ADJUST);
+			}
} 

Resend [PATCH] Enabling IA32_TSC_ADJUST for Qemu KVM guest VMs

2012-11-15 Thread Will Auld
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to the
hypervisor vcpu specific locations to store the value of the emulated MSRs.
In this way the IA32_TSC_ADJUST value will be included in all reads to
the TSC MSR whether through rdmsr or rdtsc.

As this is a new MSR that the guest may access and modify, its value needs
to be migrated along with the other MSRs. The changes here are specifically
for recognizing when IA32_TSC_ADJUST is enabled in CPUID, plus code added
for migrating its value.

Signed-off-by: Will Auld will.a...@intel.com
---
 target-i386/cpu.h |  4 +++-
 target-i386/kvm.c | 15 +++
 target-i386/machine.c | 21 +
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index aabf993..7ca99c0 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -284,6 +284,7 @@
 #define MSR_IA32_APICBASE_BSP           (1<<8)
 #define MSR_IA32_APICBASE_ENABLE        (1<<11)
 #define MSR_IA32_APICBASE_BASE          (0xfffff<<12)
+#define MSR_TSC_ADJUST                  0x003b
 #define MSR_IA32_TSCDEADLINE            0x6e0
 
 #define MSR_MTRRcap                     0xfe
@@ -701,6 +702,7 @@ typedef struct CPUX86State {
 uint64_t async_pf_en_msr;
 
 uint64_t tsc;
+uint64_t tsc_adjust;
 uint64_t tsc_deadline;
 
 uint64_t mcg_status;
@@ -979,7 +981,7 @@ static inline CPUX86State *cpu_init(const char *cpu_model)
 #define cpu_list_id x86_cpu_list
 #define cpudef_setup   x86_cpudef_setup
 
-#define CPU_SAVE_VERSION 12
+#define CPU_SAVE_VERSION 13
 
 /* MMU modes definitions */
 #define MMU_MODE0_SUFFIX _kernel
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 696b14a..e974c42 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -61,6 +61,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
 
 static bool has_msr_star;
 static bool has_msr_hsave_pa;
+static bool has_msr_tsc_adjust;
 static bool has_msr_tsc_deadline;
 static bool has_msr_async_pf_en;
 static bool has_msr_misc_enable;
@@ -641,6 +642,10 @@ static int kvm_get_supported_msrs(KVMState *s)
             has_msr_hsave_pa = true;
             continue;
         }
+        if (kvm_msr_list->indices[i] == MSR_TSC_ADJUST) {
+            has_msr_tsc_adjust = true;
+            continue;
+        }
         if (kvm_msr_list->indices[i] == MSR_IA32_TSCDEADLINE) {
             has_msr_tsc_deadline = true;
             continue;
@@ -978,6 +983,10 @@ static int kvm_put_msrs(CPUX86State *env, int level)
     if (has_msr_hsave_pa) {
         kvm_msr_entry_set(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
     }
+    if (has_msr_tsc_adjust) {
+        kvm_msr_entry_set(&msrs[n++],
+                          MSR_TSC_ADJUST, env->tsc_adjust);
+    }
     if (has_msr_tsc_deadline) {
         kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSCDEADLINE, env->tsc_deadline);
     }
@@ -1234,6 +1243,9 @@ static int kvm_get_msrs(CPUX86State *env)
 if (has_msr_hsave_pa) {
 msrs[n++].index = MSR_VM_HSAVE_PA;
 }
+if (has_msr_tsc_adjust) {
+msrs[n++].index = MSR_TSC_ADJUST;
+}
 if (has_msr_tsc_deadline) {
 msrs[n++].index = MSR_IA32_TSCDEADLINE;
 }
@@ -1308,6 +1320,9 @@ static int kvm_get_msrs(CPUX86State *env)
         case MSR_IA32_TSC:
             env->tsc = msrs[i].data;
             break;
+        case MSR_TSC_ADJUST:
+            env->tsc_adjust = msrs[i].data;
+            break;
         case MSR_IA32_TSCDEADLINE:
             env->tsc_deadline = msrs[i].data;
             break;
diff --git a/target-i386/machine.c b/target-i386/machine.c
index a8be058..95bda9b 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -310,6 +310,24 @@ static const VMStateDescription vmstate_fpop_ip_dp = {
 }
 };
 
+static bool tsc_adjust_needed(void *opaque)
+{
+    CPUX86State *cpu = opaque;
+
+    return cpu->tsc_adjust != 0;
+}
+
+static const VMStateDescription vmstate_msr_tsc_adjust = {
+    .name = "cpu/msr_tsc_adjust",
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT64(tsc_adjust, CPUX86State),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static bool tscdeadline_needed(void *opaque)
 {
 CPUX86State *env = opaque;
@@ -457,6 +475,9 @@ static const VMStateDescription vmstate_cpu = {
             .vmsd = &vmstate_fpop_ip_dp,
             .needed = fpop_ip_dp_needed,
         }, {
+            .vmsd = &vmstate_msr_tsc_adjust,
+            .needed = tsc_adjust_needed,
+        }, {
             .vmsd = &vmstate_msr_tscdeadline,
             .needed = tscdeadline_needed,
 }, {
-- 
1.8.0.rc0




[patch 07/18] x86: pvclock: add note about rdtsc barriers (v2)

2012-11-15 Thread Marcelo Tosatti

As noted by Gleb, not advertising SSE2 support implies
no RDTSC barriers.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/include/asm/pvclock.h
===
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -74,6 +74,12 @@ unsigned __pvclock_read_cycles(const str
u8 ret_flags;
 
	version = src->version;
+   /* Note: emulated platforms which do not advertise SSE2 support
+* result in kvmclock not using the necessary RDTSC barriers.
+* Without barriers, it is possible that RDTSC instruction reads from
+* the time stamp counter outside rdtsc_barrier protected section
+* below, resulting in violation of monotonicity.
+*/
rdtsc_barrier();
offset = pvclock_get_nsec_offset(src);
	ret = src->system_time + offset;


[patch 02/18] x86: kvmclock: allocate pvclock shared memory area (v2)

2012-11-15 Thread Marcelo Tosatti

We want to expose the pvclock shared memory areas, which 
the hypervisor periodically updates, to userspace.

For a linear mapping from userspace, it is necessary that
entire page sized regions are used for array of pvclock 
structures.

There is no such guarantee with per cpu areas, therefore move
to memblock_alloc based allocation.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/kvmclock.c
===
--- vsyscall.orig/arch/x86/kernel/kvmclock.c
+++ vsyscall/arch/x86/kernel/kvmclock.c
@@ -23,6 +23,7 @@
 #include <asm/apic.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/memblock.h>
 
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
@@ -39,7 +40,11 @@ static int parse_no_kvmclock(char *arg)
 early_param("no-kvmclock", parse_no_kvmclock);
 
 /* The hypervisor will put information about time periodically here */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
+struct pvclock_aligned_vcpu_time_info {
+   struct pvclock_vcpu_time_info clock;
+} __attribute__((__aligned__(SMP_CACHE_BYTES)));
+
+static struct pvclock_aligned_vcpu_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
 /*
@@ -52,15 +57,20 @@ static unsigned long kvm_get_wallclock(v
struct pvclock_vcpu_time_info *vcpu_time;
struct timespec ts;
int low, high;
+   int cpu;
 
 	low = (int)__pa_symbol(&wall_clock);
 	high = ((u64)__pa_symbol(&wall_clock) >> 32);
 
 	native_write_msr(msr_kvm_wall_clock, low, high);
 
-	vcpu_time = &get_cpu_var(hv_clock);
+	preempt_disable();
+	cpu = smp_processor_id();
+
+	vcpu_time = &hv_clock[cpu].clock;
 	pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
-   put_cpu_var(hv_clock);
+
+   preempt_enable();
 
return ts.tv_sec;
 }
@@ -74,9 +84,11 @@ static cycle_t kvm_clock_read(void)
 {
struct pvclock_vcpu_time_info *src;
cycle_t ret;
+   int cpu;
 
preempt_disable_notrace();
-	src = &__get_cpu_var(hv_clock);
+	cpu = smp_processor_id();
+	src = &hv_clock[cpu].clock;
ret = pvclock_clocksource_read(src);
preempt_enable_notrace();
return ret;
@@ -99,8 +111,15 @@ static cycle_t kvm_clock_get_cycles(stru
 static unsigned long kvm_get_tsc_khz(void)
 {
struct pvclock_vcpu_time_info *src;
-	src = &per_cpu(hv_clock, 0);
-   return pvclock_tsc_khz(src);
+   int cpu;
+   unsigned long tsc_khz;
+
+   preempt_disable();
+   cpu = smp_processor_id();
+	src = &hv_clock[cpu].clock;
+   tsc_khz = pvclock_tsc_khz(src);
+   preempt_enable();
+   return tsc_khz;
 }
 
 static void kvm_get_preset_lpj(void)
@@ -119,10 +138,14 @@ bool kvm_check_and_clear_guest_paused(vo
 {
bool ret = false;
struct pvclock_vcpu_time_info *src;
+   int cpu = smp_processor_id();
 
-	src = &__get_cpu_var(hv_clock);
+	if (!hv_clock)
+		return ret;
+
+	src = &hv_clock[cpu].clock;
 	if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
-		__this_cpu_and(hv_clock.flags, ~PVCLOCK_GUEST_STOPPED);
+		src->flags &= ~PVCLOCK_GUEST_STOPPED;
ret = true;
}
 
@@ -141,9 +164,10 @@ int kvm_register_clock(char *txt)
 {
int cpu = smp_processor_id();
int low, high, ret;
+	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].clock;
 
-	low = (int)__pa(&per_cpu(hv_clock, cpu)) | 1;
-	high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
+	low = (int)__pa(src) | 1;
+	high = ((u64)__pa(src) >> 32);
 	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
 	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
   cpu, high, low, txt);
@@ -197,6 +221,8 @@ static void kvm_shutdown(void)
 
 void __init kvmclock_init(void)
 {
+   unsigned long mem;
+
if (!kvm_para_available())
return;
 
@@ -209,8 +235,18 @@ void __init kvmclock_init(void)
 	printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);
 
-	if (kvm_register_clock("boot clock"))
+   mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * 
NR_CPUS,
+PAGE_SIZE);
+   if (!mem)
return;
+   hv_clock = __va(mem);
+
+	if (kvm_register_clock("boot clock")) {
+   hv_clock = NULL;
+   memblock_free(mem,
+   sizeof(struct pvclock_aligned_vcpu_time_info)*NR_CPUS);
+   return;
+   }
pv_time_ops.sched_clock = kvm_clock_read;
x86_platform.calibrate_tsc = kvm_get_tsc_khz;
x86_platform.get_wallclock = kvm_get_wallclock;

Re: [PATCH] KVM: MMU: lazily drop large spte

2012-11-15 Thread Marcelo Tosatti
On Thu, Nov 15, 2012 at 07:17:15AM +0800, Xiao Guangrong wrote:
 On 11/14/2012 10:37 PM, Marcelo Tosatti wrote:
  On Tue, Nov 13, 2012 at 04:26:16PM +0800, Xiao Guangrong wrote:
  Hi Marcelo,
 
  On 11/13/2012 07:10 AM, Marcelo Tosatti wrote:
  On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
  Do not drop large spte until it can be insteaded by small pages so that
  the guest can happliy read memory through it
 
  The idea is from Avi:
  | As I mentioned before, write-protecting a large spte is a good idea,
  | since it moves some work from protect-time to fault-time, so it reduces
  | jitter.  This removes the need for the return value.
 
  Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
  ---
   arch/x86/kvm/mmu.c |   34 +-
   1 files changed, 9 insertions(+), 25 deletions(-)
 
  Its likely that other 4k pages are mapped read-write in the 2mb range 
  covered by a read-only 2mb map. Therefore its not entirely useful to
  map read-only. 
 
 
  It needs a page fault to install a pte even if it is the read access.
  After the change, the page fault can be avoided.
 
  Can you measure an improvement with this change?
 
  I have a test case to measure the read time which has been attached.
  It maps 4k pages at first (dirt-loggged), then switch to large sptes
  (stop dirt-logging), at the last, measure the read access time after write
  protect sptes.
 
  Before: 23314111 nsAfter: 11404197 ns
  
  Ok, i'm concerned about cases similar to e49146dce8c3dc6f44 (with shadow),
  that is:
  
  - large page must be destroyed when write protecting due to 
  shadowed page.
  - with shadow, it does not make sense to write protect 
  large sptes as mentioned earlier.
  
 
 This case is removed now, the code when e49146dce8c3dc6f44 was applied is:
 |
 |pt = sp-spt;
 |for (i = 0; i  PT64_ENT_PER_PAGE; ++i)
 |/* avoid RMW */
 |if (is_writable_pte(pt[i]))
 |update_spte(pt[i], pt[i]  
 ~PT_WRITABLE_MASK);
 |}
 
 The real problem in this code is it would write-protect the spte even if
 it is not a last spte that caused the middle-level shadow page table was
 write-protected. So e49146dce8c3dc6f44 added this code:
 |if (sp-role.level != PT_PAGE_TABLE_LEVEL)
 |continue;
 |
 was good to fix this problem.
 
 Now, the current code is:
 | for (i = 0; i  PT64_ENT_PER_PAGE; ++i) {
 | if (!is_shadow_present_pte(pt[i]) ||
 |   !is_last_spte(pt[i], sp-role.level))
 | continue;
 |
 | spte_write_protect(kvm, pt[i], flush, false);
 | }
 It only write-protect the last spte. So, it allows large spte existent.
 (the large spte can be broken by drop_large_spte() on the page-fault path.)
 
  So i wonder why is this part from your patch
  
  -   if (level > PT_PAGE_TABLE_LEVEL &&
  -       has_wrprotected_page(vcpu->kvm, gfn, level)) {
  -           ret = 1;
  -           drop_spte(vcpu->kvm, sptep);
  -           goto done;
  -   }
  
  necessary (assuming EPT is in use).
 
 This is safe, we change these code to:
 
 - if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
 + if ((level > PT_PAGE_TABLE_LEVEL &&
 +      has_wrprotected_page(vcpu->kvm, gfn, level)) ||
 +     mmu_need_write_protect(vcpu, gfn, can_unsync)) {
   pgprintk(%s: found shadow page for %llx, marking ro\n,
__func__, gfn);
   ret = 1;
 
 The spte become read-only which can ensure the shadow gfn can not be changed.
 
 Btw, the original code allows creating a readonly spte in this case if
 !(pte_access & WRITEABLE)

Regarding shadow: it should be fine as long as fault path always deletes
large mappings, when shadowed pages are present in the region.

Ah, unshadowing from reexecute_instruction does not handle
large pages. I suppose that is what simplification refers 
to.



Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset

2012-11-15 Thread Marcelo Tosatti
On Wed, Nov 14, 2012 at 10:36:21AM +0900, Yoshihiro YUNOMAE wrote:
 Hi All,
 
 The following patch set can make disordered trace data of a guest and a host
 sorted in chronological order.
 
 In a virtualization environment, it is difficult to analyze performance
 problems, such as a delay of I/O request on a guest. This is because multiple
 guests operate on the host. One of approaches for solving such kind of 
 problems
 is to sort trace data of guests and the host in chronological order.
 
 After we applied the patch set(https://lkml.org/lkml/2012/11/13/588), raw TSC
 can be chosen as a timestamp of ftrace. TSC is useful for merging trace data
 in chronological order by two reasons. One of the reasons is that guests can
 directly read raw TSC from the CPU using rdtsc operation. This means that raw
 TSC value is not software clock like sched_clock, so we don't need to consider
 about how the timestamp is calculated. The other is that TSC of recent x86 
 CPUs
 is constantly incremented. This means that we don't need to worry about pace 
 of
 the timestamp. Therefore, choosing TSC as a timestamp for tracing is 
 reasonable
 to integrate trace data of guests and a host.
 
 Here, we need to consider about just one matter for using TSC on guests. TSC
 value on a guest is always the host TSC plus the guest's TSC offset. In 
 other
 words, to merge trace data using TSC as timestamp in chronological order, we
 need to consider TSC offset of the guest.
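
In other words, a sketch of the arithmetic the merge tool has to do
(64-bit wrap-around keeps a negative offset representable as a huge
unsigned value, as in the example below):

#include <stdint.h>

/* guest_tsc = host_tsc + tsc_offset (mod 2^64), so to place a guest
 * timestamp on the host timeline the merge tool computes: */
static inline uint64_t guest_to_host_tsc(uint64_t guest_tsc,
					 uint64_t tsc_offset)
{
	return guest_tsc - tsc_offset;	/* unsigned wrap-around is intended */
}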
 
 However, only the host kernel can read the TSC offset from VMCS and TSC offset
 is not output in anywhere now. In other words, tools in userland cannot get
 the TSC offset value, so we cannot merge trace data of guest and the host in
 chronological order. Therefore, the TSC offset should be exported for userland
 tools.
 
 In this patch set, TSC offset is exported by printk() on the host. I also
 attached a tool for merging trace data of a guest and a host in chronological
 order.
 
 Example
 We assume that wakeup-latency for a command is big on a guest. Normally
 we will use ftrace's wakeup-latency tracer or event tracer on the guest, but 
 we
 may not be able to solve this problem. This is because guests often exit to
 the host for several reasons. In the next, we will use TSC as ftrace's 
 timestamp
 and record the trace data on the guest and the host. Then, we get following
 data:
 
  /* guest data */
 comm-3826  [000] d...49836825726903: sched_wakeup: [detail]
 comm-3826  [000] d...49836832225344: sched_switch: [detail]
  /* host data */
 qemu-kvm-2687  [003] d...50550079203669: kvm_exit: [detail]
 qemu-kvm-2687  [003] d...50550079206816: kvm_entry: [detail]
 qemu-kvm-2687  [003] d...50550079240656: kvm_exit: [detail]
 qemu-kvm-2687  [003] d...50550079243467: kvm_entry: [detail]
 qemu-kvm-2687  [003] d...50550079256103: kvm_exit: [detail]
 qemu-kvm-2687  [003] d...50550079268391: kvm_entry: [detail]
 qemu-kvm-2687  [003] d...50550079280829: kvm_exit: [detail]
 qemu-kvm-2687  [003] d...50550079286028: kvm_entry: [detail]
 
 Since TSC offset is not considered, these data cannot be merged. If this trace
 data is shown like as follows, we will be able to understand the reason:
 
 qemu-kvm-2687  [003] d...50550079203669: kvm_exit: [detail]
 qemu-kvm-2687  [003] d...50550079206816: kvm_entry: [detail]
 comm-3826  [000] d.h.49836825726903: sched_wakeup: [detail] =
 qemu-kvm-2687  [003] d...50550079240656: kvm_exit: [detail]
 qemu-kvm-2687  [003] d...50550079243467: kvm_entry: [detail]
 qemu-kvm-2687  [003] d...50550079256103: kvm_exit: [detail]
 qemu-kvm-2687  [003] d...50550079268391: kvm_entry: [detail]
 comm-3826  [000] d...49836832225344: sched_switch: [detail] =
 qemu-kvm-2687  [003] d...50550079280829: kvm_exit: [detail]
 qemu-kvm-2687  [003] d...50550079286028: kvm_entry: [detail]
 
 In this case, we can understand wakeup-latency was big due to exit to host
 twice. Getting this data sorted in chronological order is our goal.
 
 To merge the data like previous pattern, we apply this patch set. Then, we can
 get TSC offset of the guest as follows:
 
 $ dmesg | grep kvm
 [   57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
     |
  PID TSC offset |
HOST TSC value --+ 
 
 We use this TSC offset value to a merge script and obtain the following data:
 
 $ ./trace-merge.pl 18446743360465545001 host.data guest.data
 hqemu-kvm-2687  [003] d...50550079203669: kvm_exit: [detail]
 hqemu-kvm-2687  [003] d...50550079206816: kvm_entry: [detail]
 gcomm-3826  [000] d.h.50550079226331: sched_wakeup: [detail] =
 hqemu-kvm-2687  [003] d...50550079240656: 

Re: [PATCH] KVM: MMU: lazily drop large spte

2012-11-15 Thread Xiao Guangrong
On 11/16/2012 11:02 AM, Marcelo Tosatti wrote:
 On Thu, Nov 15, 2012 at 07:17:15AM +0800, Xiao Guangrong wrote:
 On 11/14/2012 10:37 PM, Marcelo Tosatti wrote:
 On Tue, Nov 13, 2012 at 04:26:16PM +0800, Xiao Guangrong wrote:
 Hi Marcelo,

 On 11/13/2012 07:10 AM, Marcelo Tosatti wrote:
 On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
 Do not drop large spte until it can be insteaded by small pages so that
 the guest can happliy read memory through it

 The idea is from Avi:
 | As I mentioned before, write-protecting a large spte is a good idea,
 | since it moves some work from protect-time to fault-time, so it reduces
 | jitter.  This removes the need for the return value.

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   34 +-
  1 files changed, 9 insertions(+), 25 deletions(-)

 Its likely that other 4k pages are mapped read-write in the 2mb range 
 covered by a read-only 2mb map. Therefore its not entirely useful to
 map read-only. 


 It needs a page fault to install a pte even if it is the read access.
 After the change, the page fault can be avoided.

 Can you measure an improvement with this change?

 I have a test case to measure the read time which has been attached.
 It maps 4k pages at first (dirt-loggged), then switch to large sptes
 (stop dirt-logging), at the last, measure the read access time after write
 protect sptes.

 Before: 23314111 nsAfter: 11404197 ns

 Ok, i'm concerned about cases similar to e49146dce8c3dc6f44 (with shadow),
 that is:

 - large page must be destroyed when write protecting due to 
 shadowed page.
 - with shadow, it does not make sense to write protect 
 large sptes as mentioned earlier.


 This case is removed now, the code when e49146dce8c3dc6f44 was applied is:
 |
 |pt = sp-spt;
 |for (i = 0; i  PT64_ENT_PER_PAGE; ++i)
 |/* avoid RMW */
 |if (is_writable_pte(pt[i]))
 |update_spte(pt[i], pt[i]  
 ~PT_WRITABLE_MASK);
 |}

 The real problem in this code is it would write-protect the spte even if
 it is not a last spte that caused the middle-level shadow page table was
 write-protected. So e49146dce8c3dc6f44 added this code:
 |if (sp-role.level != PT_PAGE_TABLE_LEVEL)
 |continue;
 |
 was good to fix this problem.

 Now, the current code is:
 |for (i = 0; i  PT64_ENT_PER_PAGE; ++i) {
 |if (!is_shadow_present_pte(pt[i]) ||
 |  !is_last_spte(pt[i], sp-role.level))
 |continue;
 |
 |spte_write_protect(kvm, pt[i], flush, false);
 |}
 It only write-protect the last spte. So, it allows large spte existent.
 (the large spte can be broken by drop_large_spte() on the page-fault path.)

 So i wonder why is this part from your patch

 -   if (level > PT_PAGE_TABLE_LEVEL &&
 -       has_wrprotected_page(vcpu->kvm, gfn, level)) {
 -           ret = 1;
 -           drop_spte(vcpu->kvm, sptep);
 -           goto done;
 -   }

 necessary (assuming EPT is in use).

 This is safe, we change these code to:

 -if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
 +if ((level > PT_PAGE_TABLE_LEVEL &&
 +     has_wrprotected_page(vcpu->kvm, gfn, level)) ||
 +    mmu_need_write_protect(vcpu, gfn, can_unsync)) {
  pgprintk(%s: found shadow page for %llx, marking ro\n,
   __func__, gfn);
  ret = 1;

 The spte become read-only which can ensure the shadow gfn can not be changed.

 Btw, the origin code allows to create readonly spte under this case if 
 !(pte_access  WRITEABBLE)
 
 Regarding shadow: it should be fine as long as fault path always deletes
 large mappings, when shadowed pages are present in the region.

For hard mmu is also safe, in this patch i added these code:

@@ -2635,6 +2617,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
break;
}

+   drop_large_spte(vcpu, iterator.sptep);
+

It can delete large mappings like soft mmu does.

Anything I missed?

 
 Ah, unshadowing from reexecute_instruction does not handle
 large pages. I suppose that is what simplification refers 
 to.

reexecute_instruction does not directly handle the last spte; it just
removes all shadow pages and then lets the cpu retry the instruction. The
page can become writable when it hits the #PF again, so a large spte is
fine in this case.
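
As a rough illustration of that retry path (a sketch only; kvm_mmu_unprotect_page()
and gpa_to_gfn() are existing helpers, but the function below is simplified and
ignores the tdp_enabled check and error handling):

static bool retry_after_unshadow(struct kvm_vcpu *vcpu, gpa_t gpa)
{
	/* Zap every shadow page that maps this gfn ... */
	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
	/* ... and let the guest re-execute the instruction; if the gfn is
	 * no longer shadowed, the next #PF can simply make it writable. */
	return true;
}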

(Out of this thread: I notice reexecute_instruction allows retrying the
 instruction only if tdp_enabled == 0, but with nested npt there are also
 pages write-protected by shadow pages. Maybe we need to improve this
 restriction.
)


Re: [PATCH] KVM: MMU: lazily drop large spte

2012-11-15 Thread Marcelo Tosatti
On Fri, Nov 16, 2012 at 11:39:12AM +0800, Xiao Guangrong wrote:
 On 11/16/2012 11:02 AM, Marcelo Tosatti wrote:
  On Thu, Nov 15, 2012 at 07:17:15AM +0800, Xiao Guangrong wrote:
  On 11/14/2012 10:37 PM, Marcelo Tosatti wrote:
  On Tue, Nov 13, 2012 at 04:26:16PM +0800, Xiao Guangrong wrote:
  Hi Marcelo,
 
  On 11/13/2012 07:10 AM, Marcelo Tosatti wrote:
  On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
  Do not drop a large spte until it can be replaced by small pages, so that
  the guest can happily read memory through it
 
  The idea is from Avi:
  | As I mentioned before, write-protecting a large spte is a good idea,
  | since it moves some work from protect-time to fault-time, so it reduces
  | jitter.  This removes the need for the return value.
 
  Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
  ---
   arch/x86/kvm/mmu.c |   34 +-
   1 files changed, 9 insertions(+), 25 deletions(-)
 
  It's likely that other 4k pages are mapped read-write in the 2mb range
  covered by a read-only 2mb map. Therefore it's not entirely useful to
  map it read-only.
 
 
  A page fault is needed to install a pte even for a read access.
  After the change, that page fault can be avoided.
 
  Can you measure an improvement with this change?
 
  I have a test case to measure the read time which has been attached.
  It maps 4k pages at first (dirty-logged), then switches to large sptes
  (stop dirty-logging), and at the end measures the read access time after
  write-protecting the sptes.
 
  Before: 23314111 ns  After: 11404197 ns
 
  Ok, I'm concerned about cases similar to e49146dce8c3dc6f44 (with shadow),
  that is:

  - a large page must be destroyed when write-protecting due to a
  shadowed page.
  - with shadow, it does not make sense to write-protect
  large sptes, as mentioned earlier.
 
 
  This case is removed now. The code at the time e49146dce8c3dc6f44 was applied was:
  |
  |        pt = sp->spt;
  |        for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
  |                /* avoid RMW */
  |                if (is_writable_pte(pt[i]))
  |                        update_spte(pt[i], pt[i] & ~PT_WRITABLE_MASK);
  |        }

  The real problem in this code is that it would write-protect the spte even if
  it was not a last spte, which caused middle-level shadow page tables to be
  write-protected as well. So e49146dce8c3dc6f44 added this code:
  |        if (sp->role.level != PT_PAGE_TABLE_LEVEL)
  |                continue;
  |
  which was good to fix this problem.

  Now, the current code is:
  |        for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
  |                if (!is_shadow_present_pte(pt[i]) ||
  |                    !is_last_spte(pt[i], sp->role.level))
  |                        continue;
  |
  |                spte_write_protect(kvm, &pt[i], &flush, false);
  |        }
  It only write-protects the last sptes, so it allows large sptes to exist.
  (A large spte can be broken by drop_large_spte() on the page-fault path.)

  So I wonder why this part of your patch

  -	if (level > PT_PAGE_TABLE_LEVEL &&
  -	    has_wrprotected_page(vcpu->kvm, gfn, level)) {
  -		ret = 1;
  -		drop_spte(vcpu->kvm, sptep);
  -		goto done;
  -	}

  is necessary (assuming EPT is in use).

  This is safe; we change that code to:

  -	if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
  +	if ((level > PT_PAGE_TABLE_LEVEL &&
  +	      has_wrprotected_page(vcpu->kvm, gfn, level)) ||
  +	      mmu_need_write_protect(vcpu, gfn, can_unsync)) {
  		pgprintk("%s: found shadow page for %llx, marking ro\n",
  			 __func__, gfn);
  		ret = 1;

  The spte becomes read-only, which ensures the shadowed gfn cannot be changed.

  Btw, the original code allows creating a read-only spte in this case if
  !(pte_access & WRITABLE).
  
  Regarding shadow: it should be fine as long as the fault path always deletes
  large mappings when shadowed pages are present in the region.
 
 The hard mmu is also safe; in this patch I added this code:

 @@ -2635,6 +2617,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			break;
 		}

 +		drop_large_spte(vcpu, iterator.sptep);
 +

 It can delete large mappings like the soft mmu does.
 
 Anything I missed?
 
  
  Ah, unshadowing from reexecute_instruction does not handle
  large pages. I suppose that is what simplification refers 
  to.
 
 reexecute_instruction does not directly handle the last spte; it just
 removes all shadow pages and then lets the cpu retry the instruction. The
 page can become writable when it hits the #PF again, so a large spte is
 fine in this case.

While searching for a given gpa, you don't find large gfn which is
mapping it, right? (that is, searching for 

Re: [PATCH] KVM: MMU: lazily drop large spte

2012-11-15 Thread Xiao Guangrong
On 11/16/2012 11:56 AM, Marcelo Tosatti wrote:
 On Fri, Nov 16, 2012 at 11:39:12AM +0800, Xiao Guangrong wrote:
 On 11/16/2012 11:02 AM, Marcelo Tosatti wrote:
 On Thu, Nov 15, 2012 at 07:17:15AM +0800, Xiao Guangrong wrote:
 On 11/14/2012 10:37 PM, Marcelo Tosatti wrote:
 On Tue, Nov 13, 2012 at 04:26:16PM +0800, Xiao Guangrong wrote:
 Hi Marcelo,

 On 11/13/2012 07:10 AM, Marcelo Tosatti wrote:
 On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
 Do not drop a large spte until it can be replaced by small pages, so that
 the guest can happily read memory through it

 The idea is from Avi:
 | As I mentioned before, write-protecting a large spte is a good idea,
 | since it moves some work from protect-time to fault-time, so it reduces
 | jitter.  This removes the need for the return value.

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   34 +-
  1 files changed, 9 insertions(+), 25 deletions(-)

 It's likely that other 4k pages are mapped read-write in the 2mb range
 covered by a read-only 2mb map. Therefore it's not entirely useful to
 map it read-only.


 A page fault is needed to install a pte even for a read access.
 After the change, that page fault can be avoided.

 Can you measure an improvement with this change?

 I have a test case to measure the read time which has been attached.
 It maps 4k pages at first (dirty-logged), then switches to large sptes
 (stop dirty-logging), and at the end measures the read access time after
 write-protecting the sptes.

 Before: 23314111 ns  After: 11404197 ns

 Ok, I'm concerned about cases similar to e49146dce8c3dc6f44 (with shadow),
 that is:

 - a large page must be destroyed when write-protecting due to a
 shadowed page.
 - with shadow, it does not make sense to write-protect
 large sptes, as mentioned earlier.


 This case is removed now. The code at the time e49146dce8c3dc6f44 was applied was:
 |
 |        pt = sp->spt;
 |        for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
 |                /* avoid RMW */
 |                if (is_writable_pte(pt[i]))
 |                        update_spte(pt[i], pt[i] & ~PT_WRITABLE_MASK);
 |        }

 The real problem in this code is that it would write-protect the spte even if
 it was not a last spte, which caused middle-level shadow page tables to be
 write-protected as well. So e49146dce8c3dc6f44 added this code:
 |        if (sp->role.level != PT_PAGE_TABLE_LEVEL)
 |                continue;
 |
 which was good to fix this problem.

 Now, the current code is:
 |        for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
 |                if (!is_shadow_present_pte(pt[i]) ||
 |                    !is_last_spte(pt[i], sp->role.level))
 |                        continue;
 |
 |                spte_write_protect(kvm, &pt[i], &flush, false);
 |        }
 It only write-protects the last sptes, so it allows large sptes to exist.
 (A large spte can be broken by drop_large_spte() on the page-fault path.)

 So I wonder why this part of your patch

 -	if (level > PT_PAGE_TABLE_LEVEL &&
 -	    has_wrprotected_page(vcpu->kvm, gfn, level)) {
 -		ret = 1;
 -		drop_spte(vcpu->kvm, sptep);
 -		goto done;
 -	}

 is necessary (assuming EPT is in use).

 This is safe; we change that code to:

 -	if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
 +	if ((level > PT_PAGE_TABLE_LEVEL &&
 +	      has_wrprotected_page(vcpu->kvm, gfn, level)) ||
 +	      mmu_need_write_protect(vcpu, gfn, can_unsync)) {
 		pgprintk("%s: found shadow page for %llx, marking ro\n",
 			 __func__, gfn);
 		ret = 1;

 The spte becomes read-only, which ensures the shadowed gfn cannot be changed.

 Btw, the original code allows creating a read-only spte in this case if
 !(pte_access & WRITABLE).

 Regarding shadow: it should be fine as long as the fault path always deletes
 large mappings when shadowed pages are present in the region.

 The hard mmu is also safe; in this patch I added this code:

 @@ -2635,6 +2617,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			break;
 		}

 +		drop_large_spte(vcpu, iterator.sptep);
 +

 It can delete large mappings like the soft mmu does.

 Anything I missed?


 Ah, unshadowing from reexecute_instruction does not handle
 large pages. I suppose that is what simplification refers 
 to.

 reexecute_instruction does not directly handle the last spte; it just
 removes all shadow pages and then lets the cpu retry the instruction. The
 page can become writable when it hits the #PF again, so a large spte is
 fine in this case.
 
 While searching for a given gpa, you don't find large gfn which is
 mapping it, right? (that is, searching for gfn 4 fails to find large
 read-only gfn 0). Unshadowing gfn 4 will 

Re: [patch 02/18] x86: kvmclock: allocate pvclock shared memory area (v2)

2012-11-15 Thread Glauber Costa
On 11/16/2012 06:07 AM, Marcelo Tosatti wrote:
 
 We want to expose the pvclock shared memory areas, which 
 the hypervisor periodically updates, to userspace.
 
 For a linear mapping from userspace, it is necessary that
 entire page sized regions are used for array of pvclock 
 structures.
 
 There is no such guarantee with per cpu areas, therefore move
 to memblock_alloc based allocation.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
No further objections.

Acked-by: Glauber Costa glom...@parallels.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html