Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)

2012-05-09 Thread Andre Przywara

On 05/07/2012 08:21 PM, Eduardo Habkost wrote:


Andre? Are you able to help to answer the question below?


Sorry for the delay, the easy answers first:


I would like to clarify what's the expected behavior of -cpu host to
be able to continue working on it.


The purpose of -cpu host is to let guests use ISA features that the host 
CPU provides. Those would not need any KVM/QEMU intervention, because 
they work out of the box. Things like AES or SSE4.2, which are used by 
Linux and glibc, for instance. Users would expect those to be usable, 
and actually we only need to set the bits and are done.
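
[Editor's note: a minimal sketch of what "work out of the box" means here.
A guest can detect such ISA features itself via CPUID once the bits are set;
the bit positions below are the architectural ones for leaf 1 ECX
(SSE4.2 = bit 20, AES = bit 25).]

#include <stdio.h>
#include <cpuid.h>   /* GCC helper header providing __get_cpuid() */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1 reports the basic feature bits. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;

    printf("SSE4.2: %s\n", (ecx & (1u << 20)) ? "yes" : "no");
    printf("AES:    %s\n", (ecx & (1u << 25)) ? "yes" : "no");
    return 0;
}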


A second goal is to get rid of the awkward and artificial 
family/model/stepping settings and just let the guest see the actual CPU 
model. This has some strings attached, though, but other virtualization 
solutions do it the same way and they can cope with it.


A third thing currently not really addressed is the more informational 
bits like cache size and organization and TLB layout. Those are 
properties of the (core) CPU (except the shared L3 cache, maybe) and apply 
to guests as well. I don't see any reason why this should not be exposed 
to the guest. Off the top of my head I don't know any prominent user 
(just glibc reading the cache size, but not really using it), but I 
guess there are applications which benefit from it.


What clearly is not the intention of -cpu host is to provide access to 
every feature a certain CPU model provides. So things which require 
emulation or hypervisor intervention are not in the primary use case. 
That does not mean that they don't need to be implemented, but that was not 
the purpose of -cpu host and they should be handled independently.


Regards,
Andre.

I believe the code will need to be

fixed on either case, but first we need to figure out what are the
expectations/requirements, to know _which_ changes will be needed.


On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote:

(CCing Andre Przywara, in case he can help to clarify what's the
expected meaning of -cpu host)


[...]

I am not sure I understand what you are proposing. Let me explain the
use case I am thinking about:

- Feature FOO is of type (A) (e.g. just a new instruction set that
   doesn't require additional userspace support)
- User has a Qemu version that doesn't know anything about feature FOO
- User gets a new CPU that supports feature FOO
- User gets a new kernel that supports feature FOO (i.e. has FOO in
   GET_SUPPORTED_CPUID)
- User does _not_ upgrade Qemu.
- User expects to get feature FOO enabled if using -cpu host, without
   upgrading Qemu.

The problem here is: to support the above use-case, userspace needs a
probing mechanism that can differentiate _new_ (previously unknown)
features that are in group (A) (safe to blindly enable) from features
that are in group (B) (that can't be enabled without a userspace
upgrade).

In short, it becomes a problem if we consider the following case:

- Feature BAR is of type (B) (it can't be enabled without extra
   userspace support)
- User has a Qemu version that doesn't know anything about feature BAR
- User gets a new CPU that supports feature BAR
- User gets a new kernel that supports feature BAR (i.e. has BAR in
   GET_SUPPORTED_CPUID)
- User does _not_ upgrade Qemu.
- User simply shouldn't get feature BAR enabled, even if using -cpu
   host, otherwise Qemu would break.

If userspace always limited itself to features it knows about, it would
be really easy to implement the feature without any new probing
mechanism from the kernel. But that's not how I think users expect -cpu
host to work. Maybe I am wrong, I don't know. I am CCing Andre, who
introduced the -cpu host feature, in case he can explain what's the
expected semantics on the cases above.
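
[Editor's note: a minimal sketch, not from the thread, of the "userspace
limits itself to features it knows about" approach: fetch
KVM_GET_SUPPORTED_CPUID and mask it with a compiled-in whitelist.
KNOWN_SAFE_ECX is a placeholder for whatever set of type-(A) bits this
hypothetical userspace was built to trust.]

#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define MAX_ENTRIES 100
#define KNOWN_SAFE_ECX ((1u << 20) | (1u << 25))   /* placeholder whitelist */

struct kvm_cpuid2 *get_filtered_cpuid(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    struct kvm_cpuid2 *cpuid = calloc(1, sizeof(*cpuid) +
                  MAX_ENTRIES * sizeof(struct kvm_cpuid_entry2));
    unsigned int i;

    cpuid->nent = MAX_ENTRIES;
    if (kvm < 0 || ioctl(kvm, KVM_GET_SUPPORTED_CPUID, cpuid) < 0)
        return NULL;

    for (i = 0; i < cpuid->nent; i++) {
        if (cpuid->entries[i].function == 1)
            /* Drop every bit this userspace was not built to know. */
            cpuid->entries[i].ecx &= KNOWN_SAFE_ECX;
    }
    /* The filtered entries would then be fed to KVM_SET_CPUID2. */
    return cpuid;
}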






--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany



Re: [PATCH] qemu-kvm: Build fixes for ppc64-softmmu

2012-05-09 Thread Avi Kivity
On 05/09/2012 05:59 AM, Benjamin Herrenschmidt wrote:
 This is a small collection of build fixes that allow the
 qemu-kvm tree to build a ppc64-softmmu target with kvm
 enabled (provided device-assignment is disabled).


Applied, thanks.

-- 
error compiling committee.c: too many arguments to function



Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)

2012-05-09 Thread Gleb Natapov
On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote:
 
 On 08.05.2012, at 22:14, Eduardo Habkost wrote:
 
  On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote:
  On 07.05.2012, at 20:21, Eduardo Habkost wrote:
  
  
  Andre? Are you able to help to answer the question below?
  
  I would like to clarify what's the expected behavior of -cpu host to
  be able to continue working on it. I believe the code will need to be
  fixed on either case, but first we need to figure out what are the
  expectations/requirements, to know _which_ changes will be needed.
  
  
  On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote:
  (CCing Andre Przywara, in case he can help to clarify what's the
  expected meaning of -cpu host)
  
  [...]
  I am not sure I understand what you are proposing. Let me explain the
  use case I am thinking about:
  
  - Feature FOO is of type (A) (e.g. just a new instruction set that
  doesn't require additional userspace support)
  - User has a Qemu version that doesn't know anything about feature FOO
  - User gets a new CPU that supports feature FOO
  - User gets a new kernel that supports feature FOO (i.e. has FOO in
  GET_SUPPORTED_CPUID)
  - User does _not_ upgrade Qemu.
  - User expects to get feature FOO enabled if using -cpu host, without
  upgrading Qemu.
  
  The problem here is: to support the above use-case, userspace needs a
  probing mechanism that can differentiate _new_ (previously unknown)
  features that are in group (A) (safe to blindly enable) from features
  that are in group (B) (that can't be enabled without a userspace
  upgrade).
  
  In short, it becomes a problem if we consider the following case:
  
  - Feature BAR is of type (B) (it can't be enabled without extra
  userspace support)
  - User has a Qemu version that doesn't know anything about feature BAR
  - User gets a new CPU that supports feature BAR
  - User gets a new kernel that supports feature BAR (i.e. has BAR in
  GET_SUPPORTED_CPUID)
  - User does _not_ upgrade Qemu.
  - User simply shouldn't get feature BAR enabled, even if using -cpu
  host, otherwise Qemu would break.
  
  If userspace always limited itself to features it knows about, it would
  be really easy to implement the feature without any new probing
  mechanism from the kernel. But that's not how I think users expect -cpu
  host to work. Maybe I am wrong, I don't know. I am CCing Andre, who
  introduced the -cpu host feature, in case he can explain what's the
  expected semantics on the cases above.
  
  Can you think of any feature that'd go into category B?
  
  - TSC-deadline: can't be enabled unless userspace takes care to enable
   the in-kernel irqchip.
 
 The kernel can check if in-kernel irqchip has it enabled and otherwise mask 
 it out, no?
 
How kernel should know that userspace does not emulate it?

  - x2apic: ditto.
 
 Same here. For user space irqchip the kernel side doesn't care. If in-kernel 
 APIC is enabled, check for its capabilities.
 
Same here.

Well, technically both of those features can't be implemented in
userspace right now since MSRs are terminated in the kernel, but I
wouldn't make it into ABI.


--
Gleb.


Re: [PATCH unit-tests] Add async page fault test

2012-05-09 Thread Avi Kivity
On 05/08/2012 02:24 PM, Gleb Natapov wrote:
 Signed-off-by: Gleb Natapov g...@redhat.com

Please describe the regression you're testing for.  We could even link
it to the fix with the commit hash.


  void vfree(void *mem)
  {
  unsigned long size = ((unsigned long *)mem)[-1];
 diff --git a/lib/x86/vm.h b/lib/x86/vm.h
 index 71ab4a8..ff4842f 100644
 --- a/lib/x86/vm.h
 +++ b/lib/x86/vm.h
 @@ -22,6 +22,7 @@ void vfree(void *mem);
  void *vmap(unsigned long long phys, unsigned long size);
  void *alloc_vpage(void);
  void *alloc_vpages(ulong nr);
 +unsigned long virt_to_phys_cr3(void *mem);

uint64_t.

 @@ -0,0 +1,98 @@
 +/*
 + * Async PF test. For the test to actually do anywathing it ineeds to be 
 started
 + * in memory cgroup with 512M of memory and with more then 1G memory 
 provided 
 + * to the guest.
 + */

Please include instructions or a script on how to do that.

Alternative ways of doing this:
- file-backed memory using FUSE to control paging
- add madvise(MADV_DONTNEED) support to testdev, and have the guest
trigger page-in itself.

I'm not asking to change this test, just providing ideas for the future
in case fine-grained control is needed.  It also doesn't thrash the disk.

 +#include "x86/msr.h"
 +#include "x86/processor.h"
 +#include "x86/apic-defs.h"
 +#include "x86/apic.h"
 +#include "x86/desc.h"
 +#include "x86/isr.h"
 +#include "x86/vm.h"
 +
 +#include "libcflat.h"
 +#include <stdint.h>
 +
 +#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
 +#define KVM_PV_REASON_PAGE_READY 2
 +
 +#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 +
 +#define KVM_ASYNC_PF_ENABLED (1 << 0)
 +#define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1)
 +
 +volatile uint32_t apf_reason __attribute__((aligned(64)));
 +char *buf;
 +volatile uint64_t  i;
 +volatile unsigned long phys;

uint64_t.

 +
 +static void pf_isr(struct ex_regs *r)
 +{
 + void* virt = (void*)((ulong)(buf+i) & ~4095ul);
 +
 + switch (get_apf_reason()) {
 + case 0:

default:

 + printf("unexpected #PF at %p\n", read_cr2());
 + fail = true;
 + break;
 + case KVM_PV_REASON_PAGE_NOT_PRESENT:
 + phys = virt_to_phys_cr3(virt);
 + install_pte(phys_to_virt(read_cr3()), 1, virt, phys, 0);
 + write_cr3(read_cr3());

What's the point of these?

 + printf("Got not present #PF token %x virt addr %p phys addr %p\n", read_cr2(), virt, phys);
 + while(phys) {
 + irq_enable();
 + halt();

Racy... you need safe_halt() here.

 + irq_disable();
 + }
 + break;
 + case KVM_PV_REASON_PAGE_READY:
 + printf("Got present #PF token %x\n", read_cr2());
 + if ((uint32_t)read_cr2() == ~0)
 + break;
 + install_pte(phys_to_virt(read_cr3()), 1, virt, phys | 
 PTE_PRESENT | PTE_WRITE, 0);
 + write_cr3(read_cr3());
 + phys = 0;
 + break;
 + }
 +}
 +


-- 
error compiling committee.c: too many arguments to function



Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)

2012-05-09 Thread Alexander Graf


On 09.05.2012, at 10:14, Gleb Natapov g...@redhat.com wrote:

 On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote:
 
 On 08.05.2012, at 22:14, Eduardo Habkost wrote:
 
 On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote:
 On 07.05.2012, at 20:21, Eduardo Habkost wrote:
 
 
 Andre? Are you able to help to answer the question below?
 
 I would like to clarify what's the expected behavior of -cpu host to
 be able to continue working on it. I believe the code will need to be
 fixed on either case, but first we need to figure out what are the
 expectations/requirements, to know _which_ changes will be needed.
 
 
 On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote:
 (CCing Andre Przywara, in case he can help to clarify what's the
 expected meaning of -cpu host)
 
 [...]
 I am not sure I understand what you are proposing. Let me explain the
 use case I am thinking about:
 
 - Feature FOO is of type (A) (e.g. just a new instruction set that
 doesn't require additional userspace support)
  - User has a Qemu version that doesn't know anything about feature FOO
 - User gets a new CPU that supports feature FOO
 - User gets a new kernel that supports feature FOO (i.e. has FOO in
 GET_SUPPORTED_CPUID)
 - User does _not_ upgrade Qemu.
 - User expects to get feature FOO enabled if using -cpu host, without
 upgrading Qemu.
 
  The problem here is: to support the above use-case, userspace needs a
 probing mechanism that can differentiate _new_ (previously unknown)
 features that are in group (A) (safe to blindly enable) from features
  that are in group (B) (that can't be enabled without a userspace
 upgrade).
 
 In short, it becomes a problem if we consider the following case:
 
 - Feature BAR is of type (B) (it can't be enabled without extra
 userspace support)
 - User has a Qemu version that doesn't know anything about feature BAR
 - User gets a new CPU that supports feature BAR
 - User gets a new kernel that supports feature BAR (i.e. has BAR in
 GET_SUPPORTED_CPUID)
 - User does _not_ upgrade Qemu.
 - User simply shouldn't get feature BAR enabled, even if using -cpu
 host, otherwise Qemu would break.
 
 If userspace always limited itself to features it knows about, it would
 be really easy to implement the feature without any new probing
 mechanism from the kernel. But that's not how I think users expect -cpu
 host to work. Maybe I am wrong, I don't know. I am CCing Andre, who
 introduced the -cpu host feature, in case he can explain what's the
 expected semantics on the cases above.
 
 Can you think of any feature that'd go into category B?
 
 - TSC-deadline: can't be enabled unless userspace takes care to enable
 the in-kernel irqchip.
 
 The kernel can check if in-kernel irqchip has it enabled and otherwise mask 
 it out, no?
 
 How kernel should know that userspace does not emulate it?

You have to enable the in-kernel apic to use it, at which point the kernel 
knows it's in use, right?

 
 - x2apic: ditto.
 
 Same here. For user space irqchip the kernel side doesn't care. If in-kernel 
 APIC is enabled, check for its capabilities.
 
 Same here.
 
 Well, technically both of those features can't be implemented in
 userspace right now since MSRs are terminated in the kernel, but I

Doesn't sound like the greatest design - unless you deprecate the non-in-kernel 
apic case.

 wouldn't make it into ABI.
 
 
 --
Gleb.


Re: [PATCH unit-tests] Add async page fault test

2012-05-09 Thread Gleb Natapov
On Wed, May 09, 2012 at 11:29:11AM +0300, Avi Kivity wrote:
 On 05/08/2012 02:24 PM, Gleb Natapov wrote:
  Signed-off-by: Gleb Natapov g...@redhat.com
 
 Please describe the regression you're testing for.  We could even link
 it to the fix with the commit hash.
 
The test does not really test for a regression, because there wasn't
one. It tests that prefault does not generate spurious #PFs. Will write
that.

 
   void vfree(void *mem)
   {
   unsigned long size = ((unsigned long *)mem)[-1];
  diff --git a/lib/x86/vm.h b/lib/x86/vm.h
  index 71ab4a8..ff4842f 100644
  --- a/lib/x86/vm.h
  +++ b/lib/x86/vm.h
  @@ -22,6 +22,7 @@ void vfree(void *mem);
   void *vmap(unsigned long long phys, unsigned long size);
   void *alloc_vpage(void);
   void *alloc_vpages(ulong nr);
  +unsigned long virt_to_phys_cr3(void *mem);
 
 uint64_t.
virt_to_phys() is also unsigned long. And so is get_pte(), which
virt_to_phys_cr3() uses. I guess the code is not ready for more than 2^32
of memory in a 32-bit VM.

 
  @@ -0,0 +1,98 @@
  +/*
  + * Async PF test. For the test to actually do anywathing it ineeds to be 
  started
  + * in memory cgroup with 512M of memory and with more then 1G memory 
  provided 
  + * to the guest.
  + */
 
 Please include instructions or a script on how to do that.
 
OK.

 Alternative ways of doing this:
 - file-backed memory using FUSE to control paging
Not sure how that can be done.

 - add madvise(MADV_DONTNEED) support to testdev, and have the guest
 trigger page-in itself.
MADV_DONTNEED will drop page, not swap it out.

 
 I'm not asking to change this test, just providing ideas for the future
 in case fine-grained control is needed.  It also doesn't thrash the disk.
 
  +#include "x86/msr.h"
  +#include "x86/processor.h"
  +#include "x86/apic-defs.h"
  +#include "x86/apic.h"
  +#include "x86/desc.h"
  +#include "x86/isr.h"
  +#include "x86/vm.h"
  +
  +#include "libcflat.h"
  +#include <stdint.h>
  +
  +#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
  +#define KVM_PV_REASON_PAGE_READY 2
  +
  +#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
  +
  +#define KVM_ASYNC_PF_ENABLED (1 << 0)
  +#define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1)
  +
  +volatile uint32_t apf_reason __attribute__((aligned(64)));
  +char *buf;
  +volatile uint64_t  i;
  +volatile unsigned long phys;
 
 uint64_t.
 
  +
  +static void pf_isr(struct ex_regs *r)
  +{
  +   void* virt = (void*)((ulong)(buf+i) & ~4095ul);
  +
  +   switch (get_apf_reason()) {
  +   case 0:
 
 default:
I'd rather make default: fail the test. It shouldn't happen.

 
  +   printf("unexpected #PF at %p\n", read_cr2());
  +   fail = true;
  +   break;
  +   case KVM_PV_REASON_PAGE_NOT_PRESENT:
  +   phys = virt_to_phys_cr3(virt);
  +   install_pte(phys_to_virt(read_cr3()), 1, virt, phys, 0);
  +   write_cr3(read_cr3());
 
 What's the point of these?
 
Shouldn't we reload page tables after changing them?

  +   printf("Got not present #PF token %x virt addr %p phys addr %p\n", read_cr2(), virt, phys);
  +   while(phys) {
  +   irq_enable();
  +   halt();
 
 Racy... you need safe_halt() here.
The code generated is sti; hlt; cli. But I may as well add safe_halt() to
do sti; hlt explicitly.
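
[Editor's note: the race Avi points at is the window between irq_enable()
and halt(). "sti" keeps interrupts inhibited for one more instruction (the
STI interrupt shadow), so the combined "sti; hlt" cannot take the wakeup
interrupt before halting, while the split version can:]

static inline void racy_halt(void)
{
    /* Racy: the wakeup interrupt can arrive and be handled between
     * the two instructions, after which hlt sleeps with nothing left
     * to wake it (until some unrelated interrupt). */
    asm volatile("sti");
    asm volatile("hlt");
}

static inline void safe_halt(void)
{
    /* Safe: sti inhibits interrupts until after the next instruction,
     * so hlt is always entered before any interrupt can be taken. */
    asm volatile("sti; hlt");
}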

 
  +   irq_disable();
  +   }
  +   break;
  +   case KVM_PV_REASON_PAGE_READY:
  +   printf("Got present #PF token %x\n", read_cr2());
  +   if ((uint32_t)read_cr2() == ~0)
  +   break;
  +   install_pte(phys_to_virt(read_cr3()), 1, virt, phys | 
  PTE_PRESENT | PTE_WRITE, 0);
  +   write_cr3(read_cr3());
  +   phys = 0;
  +   break;
  +   }
  +}
  +
 
 
 -- 
 error compiling committee.c: too many arguments to function

--
Gleb.


Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)

2012-05-09 Thread Gleb Natapov
On Wed, May 09, 2012 at 10:42:26AM +0200, Alexander Graf wrote:
 
 
 On 09.05.2012, at 10:14, Gleb Natapov g...@redhat.com wrote:
 
  On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote:
  
  On 08.05.2012, at 22:14, Eduardo Habkost wrote:
  
  On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote:
  On 07.05.2012, at 20:21, Eduardo Habkost wrote:
  
  
  Andre? Are you able to help to answer the question below?
  
  I would like to clarify what's the expected behavior of -cpu host to
  be able to continue working on it. I believe the code will need to be
  fixed on either case, but first we need to figure out what are the
  expectations/requirements, to know _which_ changes will be needed.
  
  
  On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote:
  (CCing Andre Przywara, in case he can help to clarify what's the
  expected meaning of -cpu host)
  
  [...]
  I am not sure I understand what you are proposing. Let me explain the
  use case I am thinking about:
  
  - Feature FOO is of type (A) (e.g. just a new instruction set that
  doesn't require additional userspace support)
   - User has a Qemu version that doesn't know anything about feature FOO
  - User gets a new CPU that supports feature FOO
  - User gets a new kernel that supports feature FOO (i.e. has FOO in
  GET_SUPPORTED_CPUID)
  - User does _not_ upgrade Qemu.
  - User expects to get feature FOO enabled if using -cpu host, without
  upgrading Qemu.
  
   The problem here is: to support the above use-case, userspace needs a
  probing mechanism that can differentiate _new_ (previously unknown)
  features that are in group (A) (safe to blindly enable) from features
   that are in group (B) (that can't be enabled without a userspace
  upgrade).
  
  In short, it becomes a problem if we consider the following case:
  
  - Feature BAR is of type (B) (it can't be enabled without extra
  userspace support)
  - User has a Qemu version that doesn't know anything about feature BAR
  - User gets a new CPU that supports feature BAR
  - User gets a new kernel that supports feature BAR (i.e. has BAR in
  GET_SUPPORTED_CPUID)
  - User does _not_ upgrade Qemu.
  - User simply shouldn't get feature BAR enabled, even if using -cpu
  host, otherwise Qemu would break.
  
  If userspace always limited itself to features it knows about, it would
  be really easy to implement the feature without any new probing
  mechanism from the kernel. But that's not how I think users expect 
  -cpu
  host to work. Maybe I am wrong, I don't know. I am CCing Andre, who
  introduced the -cpu host feature, in case he can explain what's the
  expected semantics on the cases above.
  
  Can you think of any feature that'd go into category B?
  
  - TSC-deadline: can't be enabled unless userspace takes care to enable
  the in-kernel irqchip.
  
  The kernel can check if in-kernel irqchip has it enabled and otherwise 
  mask it out, no?
  
  How kernel should know that userspace does not emulate it?
 
 You have to enable the in-kernel apic to use it, at which point the kernel 
 knows it's in use, right?
 
  
  - x2apic: ditto.
  
  Same here. For user space irqchip the kernel side doesn't care. If 
  in-kernel APIC is enabled, check for its capabilities.
  
  Same here.
  
  Well, technically both of those features can't be implemented in
  userspace right now since MSRs are terminated in the kernel, but I
 
 Doesn't sound like the greatest design - unless you deprecate the 
 non-in-kernel apic case.
 
You mean terminating MSRs in kernel does not sound like the greatest
design? I do not disagree. That is why IMO kernel can't filter out
TSC-deadline and x2apic like you suggest.

  wouldn't make it into ABI.
  
  
  --
 Gleb.

--
Gleb.


Re: [PATCH unit-tests] Add async page fault test

2012-05-09 Thread Avi Kivity
On 05/09/2012 11:41 AM, Gleb Natapov wrote:
  
void vfree(void *mem)
{
unsigned long size = ((unsigned long *)mem)[-1];
   diff --git a/lib/x86/vm.h b/lib/x86/vm.h
   index 71ab4a8..ff4842f 100644
   --- a/lib/x86/vm.h
   +++ b/lib/x86/vm.h
   @@ -22,6 +22,7 @@ void vfree(void *mem);
void *vmap(unsigned long long phys, unsigned long size);
void *alloc_vpage(void);
void *alloc_vpages(ulong nr);
   +unsigned long virt_to_phys_cr3(void *mem);
  
  uint64_t.
 virt_to_phys() also unsigned long. And get_pte() that virt_to_phys_cr3()
 uses also. I guess the code is not ready for more then 2^32 memory in
 32bit VM.

It's certainly not enterprise quality yet.  But let's not add more problems.

  Alternative ways of doing this:
  - file-backed memory using FUSE to control paging
 Not sure how that can be done.

  - add madvise(MADV_DONTNEED) support to testdev, and have the guest
  trigger page-in itself.
 MADV_DONTNEED will drop page, not swap it out.

Right, but it will have to be reloaded from disk (it has to be
file-backed for this to work).  If it's dirty, sync it first.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH unit-tests] Add async page fault test

2012-05-09 Thread Gleb Natapov
On Wed, May 09, 2012 at 11:52:41AM +0300, Avi Kivity wrote:
 On 05/09/2012 11:41 AM, Gleb Natapov wrote:
   
 void vfree(void *mem)
 {
 unsigned long size = ((unsigned long *)mem)[-1];
diff --git a/lib/x86/vm.h b/lib/x86/vm.h
index 71ab4a8..ff4842f 100644
--- a/lib/x86/vm.h
+++ b/lib/x86/vm.h
@@ -22,6 +22,7 @@ void vfree(void *mem);
 void *vmap(unsigned long long phys, unsigned long size);
 void *alloc_vpage(void);
 void *alloc_vpages(ulong nr);
+unsigned long virt_to_phys_cr3(void *mem);
   
   uint64_t.
  virt_to_phys() also unsigned long. And get_pte() that virt_to_phys_cr3()
  uses also. I guess the code is not ready for more then 2^32 memory in
  32bit VM.
 
 It's certainly not enterprise quality yet.  But let's not add more problems.
 
Okay.

   Alternative ways of doing this:
   - file-backed memory using FUSE to control paging
  Not sure how that can be done.
 
   - add madvise(MADV_DONTNEED) support to testdev, and have the guest
   trigger page-in itself.
  MADV_DONTNEED will drop page, not swap it out.
 
 Right, but it will have to be reloaded from disk (it has to be
 file-backed for this to work).  If it's dirty, sync it first.
 
Hmm, yes, if it is file-backed it may work. Setting up qemu to use
file-backed memory is one more complication while running the test though.
I haven't checked, but I am not sure that MADV_DONTNEED will drop the page
immediately. It probably puts it on some list to be freed later.
Hmm, actually, looking at the comments it seems like this is what happens:

/*
 * Application no longer needs these pages.  If the pages are dirty,
 * it's OK to just throw them away.  The app will be more careful about
 * data it wants to keep.  Be sure to free swap resources too.  The
 * zap_page_range call sets things up for shrink_active_list to actually
 * free
 * these pages later if no one else has touched them in the meantime,
 * although we could add these pages to a global reuse list for
 * shrink_active_list to pick up before reclaiming other pages.
 */
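
[Editor's note: a small standalone illustration, not from the thread, of
the behavior discussed above: for a MAP_SHARED file-backed mapping,
MADV_DONTNEED drops the resident pages and the next access faults them
back in from the file. "backing.img" is an arbitrary scratch file.]

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4096;
    int fd = open("backing.img", O_RDWR | O_CREAT, 0600);
    char *p;

    if (fd < 0 || ftruncate(fd, len) < 0)
        return 1;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 0xaa, len);           /* dirty the page */
    msync(p, len, MS_SYNC);         /* "if it's dirty, sync it first" */
    madvise(p, len, MADV_DONTNEED); /* drop the resident copy */

    /* This read faults the page back in from backing.img. */
    printf("first byte after drop: %#x\n", (unsigned char)p[0]);
    return 0;
}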

--
Gleb.


[PATCH 07/22] ppc: avoid buffer overrun: use pstrcpy, not strncpy

2012-05-09 Thread Jim Meyering
From: Jim Meyering meyer...@redhat.com

A terminating NUL is required by the caller's use of strchr.
It's better not to use strncpy at all, since there is no need
to zero out hundreds of trailing bytes for each iteration.
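
[Editor's note: for readers unfamiliar with QEMU's helper, a sketch of the
pstrcpy-style semantics relied on here (not QEMU's exact source): copy at
most buf_size - 1 bytes, always NUL-terminate, and do not zero-pad the
tail the way strncpy does.]

/* Bounded copy that always leaves a terminating NUL. */
void pstrcpy_sketch(char *buf, int buf_size, const char *str)
{
    int i;

    if (buf_size <= 0)
        return;

    for (i = 0; i < buf_size - 1 && str[i]; i++)
        buf[i] = str[i];
    buf[i] = '\0';   /* guarantees the caller's strchr terminates */
}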

Signed-off-by: Jim Meyering meyer...@redhat.com
---
 target-ppc/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index c09cc39..fb79e9f 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -587,7 +587,7 @@ static int read_cpuinfo(const char *field, char *value, int 
len)
 break;
 }
 if (!strncmp(line, field, field_len)) {
-strncpy(value, line, len);
+pstrcpy(value, len, line);
 ret = 0;
 break;
 }
-- 
1.7.10.1.487.ga3935e6


Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)

2012-05-09 Thread Gleb Natapov
On Wed, May 09, 2012 at 11:05:58AM +0200, Alexander Graf wrote:
 
 On 09.05.2012, at 10:51, Gleb Natapov wrote:
 
  On Wed, May 09, 2012 at 10:42:26AM +0200, Alexander Graf wrote:
  
  
  On 09.05.2012, at 10:14, Gleb Natapov g...@redhat.com wrote:
  
  On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote:
  
  On 08.05.2012, at 22:14, Eduardo Habkost wrote:
  
  On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote:
  On 07.05.2012, at 20:21, Eduardo Habkost wrote:
  
  
  Andre? Are you able to help to answer the question below?
  
  I would like to clarify what's the expected behavior of -cpu host to
  be able to continue working on it. I believe the code will need to be
  fixed on either case, but first we need to figure out what are the
  expectations/requirements, to know _which_ changes will be needed.
  
  
  On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote:
  (CCing Andre Przywara, in case he can help to clarify what's the
  expected meaning of -cpu host)
  
  [...]
  I am not sure I understand what you are proposing. Let me explain the
  use case I am thinking about:
  
  - Feature FOO is of type (A) (e.g. just a new instruction set that
  doesn't require additional userspace support)
   - User has a Qemu version that doesn't know anything about feature FOO
  - User gets a new CPU that supports feature FOO
  - User gets a new kernel that supports feature FOO (i.e. has FOO in
  GET_SUPPORTED_CPUID)
  - User does _not_ upgrade Qemu.
  - User expects to get feature FOO enabled if using -cpu host, 
  without
  upgrading Qemu.
  
   The problem here is: to support the above use-case, userspace needs a
  probing mechanism that can differentiate _new_ (previously unknown)
  features that are in group (A) (safe to blindly enable) from features
   that are in group (B) (that can't be enabled without a userspace
  upgrade).
  
  In short, it becomes a problem if we consider the following case:
  
  - Feature BAR is of type (B) (it can't be enabled without extra
  userspace support)
  - User has a Qemu version that doesn't know anything about feature 
  BAR
  - User gets a new CPU that supports feature BAR
  - User gets a new kernel that supports feature BAR (i.e. has BAR in
  GET_SUPPORTED_CPUID)
  - User does _not_ upgrade Qemu.
  - User simply shouldn't get feature BAR enabled, even if using -cpu
  host, otherwise Qemu would break.
  
  If userspace always limited itself to features it knows about, it 
  would
  be really easy to implement the feature without any new probing
  mechanism from the kernel. But that's not how I think users expect 
  -cpu
  host to work. Maybe I am wrong, I don't know. I am CCing Andre, who
  introduced the -cpu host feature, in case he can explain what's the
  expected semantics on the cases above.
  
  Can you think of any feature that'd go into category B?
  
  - TSC-deadline: can't be enabled unless userspace takes care to enable
  the in-kernel irqchip.
  
  The kernel can check if in-kernel irqchip has it enabled and otherwise 
  mask it out, no?
  
  How kernel should know that userspace does not emulate it?
  
  You have to enable the in-kernel apic to use it, at which point the kernel 
  knows it's in use, right?
  
  
  - x2apic: ditto.
  
  Same here. For user space irqchip the kernel side doesn't care. If 
  in-kernel APIC is enabled, check for its capabilities.
  
  Same here.
  
  Well, technically both of those features can't be implemented in
  userspace right now since MSRs are terminated in the kernel, but I
  
  Doesn't sound like the greatest design - unless you deprecate the 
  non-in-kernel apic case.
  
  You mean terminating MSRs in kernel does not sound like the greatest
  design? I do not disagree. That is why IMO kernel can't filter out
  TSC-deadline and x2apic like you suggest.
 
 I still don't see why it can't.
 
 Imagine we would filter TSC-deadline and x2apic by default in the kernel - 
 they are not known to exist yet.
 Now, we implement TSC-deadline in the kernel. We still filter TSC-deadline 
 out in GET_SUPPORTED_CPUID in the kernel. But we provide an interface to user 
 space that says "call me to enable TSC-deadline CPUID, but only if you're 
 using the in-kernel apic".
 New user space calls that ioctl when it's using the in-kernel apic, it 
 doesn't when it's using the user space apic.
 Old user space doesn't call that ioctl.
First of all, we already have TSC-deadline in GET_SUPPORTED_CPUID without
any additional ioctls. And second, I do not see why we need additional
ioctls here.

Hmm, so maybe I misunderstood you. You propose to mask TSC-deadline and
x2apic out from GET_SUPPORTED_CPUID if the irq chip is not in kernel, not
from KVM_SET_CPUID? For those two features it may make sense indeed. Not
sure there won't be others that are not dependent on irq chip presence.
You propose to add additional ioctls to enable them if they appear?
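
[Editor's note: to make the opt-in design being debated concrete, a purely
hypothetical sketch; KVM_ENABLE_TSC_DEADLINE_CPUID is an invented ioctl
name, not part of KVM. The idea: GET_SUPPORTED_CPUID omits the bit by
default, and aware userspace opts in only with the in-kernel apic.]

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical -- this ioctl does not exist. */
#define KVM_ENABLE_TSC_DEADLINE_CPUID _IO(KVMIO, 0xff)

void setup_cpu_features(int vm_fd, bool in_kernel_apic)
{
    /* Old userspace never reaches here and never sees the bit,
     * because the kernel filters it out of GET_SUPPORTED_CPUID. */
    if (in_kernel_apic)
        ioctl(vm_fd, KVM_ENABLE_TSC_DEADLINE_CPUID, 0);
}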

 
 So at the end all bits in GET_SUPPORTED_CPUID are consistent with what user 
 

Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)

2012-05-09 Thread Alexander Graf

On 05/09/2012 11:38 AM, Gleb Natapov wrote:

On Wed, May 09, 2012 at 11:05:58AM +0200, Alexander Graf wrote:

On 09.05.2012, at 10:51, Gleb Natapov wrote:


On Wed, May 09, 2012 at 10:42:26AM +0200, Alexander Graf wrote:


On 09.05.2012, at 10:14, Gleb Natapovg...@redhat.com  wrote:


On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote:

On 08.05.2012, at 22:14, Eduardo Habkost wrote:


On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote:

On 07.05.2012, at 20:21, Eduardo Habkost wrote:


Andre? Are you able to help to answer the question below?

I would like to clarify what's the expected behavior of -cpu host to
be able to continue working on it. I believe the code will need to be
fixed on either case, but first we need to figure out what are the
expectations/requirements, to know _which_ changes will be needed.


On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote:

(CCing Andre Przywara, in case he can help to clarify what's the
expected meaning of -cpu host)


[...]

I am not sure I understand what you are proposing. Let me explain the
use case I am thinking about:

- Feature FOO is of type (A) (e.g. just a new instruction set that
doesn't require additional userspace support)
- User has a Qemu version that doesn't know anything about feature FOO
- User gets a new CPU that supports feature FOO
- User gets a new kernel that supports feature FOO (i.e. has FOO in
GET_SUPPORTED_CPUID)
- User does _not_ upgrade Qemu.
- User expects to get feature FOO enabled if using -cpu host, without
upgrading Qemu.

The problem here is: to support the above use-case, userspace needs a
probing mechanism that can differentiate _new_ (previously unknown)
features that are in group (A) (safe to blindly enable) from features
that are in group (B) (that can't be enabled without a userspace
upgrade).

In short, it becomes a problem if we consider the following case:

- Feature BAR is of type (B) (it can't be enabled without extra
userspace support)
- User has a Qemu version that doesn't know anything about feature BAR
- User gets a new CPU that supports feature BAR
- User gets a new kernel that supports feature BAR (i.e. has BAR in
GET_SUPPORTED_CPUID)
- User does _not_ upgrade Qemu.
- User simply shouldn't get feature BAR enabled, even if using -cpu
host, otherwise Qemu would break.

If userspace always limited itself to features it knows about, it would
be really easy to implement the feature without any new probing
mechanism from the kernel. But that's not how I think users expect -cpu
host to work. Maybe I am wrong, I don't know. I am CCing Andre, who
introduced the -cpu host feature, in case he can explain what's the
expected semantics on the cases above.

Can you think of any feature that'd go into category B?

- TSC-deadline: can't be enabled unless userspace takes care to enable
the in-kernel irqchip.

The kernel can check if in-kernel irqchip has it enabled and otherwise mask it 
out, no?


How kernel should know that userspace does not emulate it?

You have to enable the in-kernel apic to use it, at which point the kernel 
knows it's in use, right?


- x2apic: ditto.

Same here. For user space irqchip the kernel side doesn't care. If in-kernel 
APIC is enabled, check for its capabilities.


Same here.

Well, technically both of those features can't be implemented in
userspace right now since MSRs are terminated in the kernel, but I

Doesn't sound like the greatest design - unless you deprecate the non-in-kernel 
apic case.


You mean terminating MSRs in kernel does not sound like the greatest
design? I do not disagree. That is why IMO kernel can't filter out
TSC-deadline and x2apic like you suggest.

I still don't see why it can't.

Imagine we would filter TSC-deadline and x2apic by default in the kernel - they 
are not known to exist yet.
Now, we implement TSC-deadline in the kernel. We still filter TSC-deadline out in 
GET_SUPPORTED_CPUID in the kernel. But we provide an interface to user space that says 
"call me to enable TSC-deadline CPUID, but only if you're using the in-kernel 
apic".
New user space calls that ioctl when it's using the in-kernel apic, it doesn't 
when it's using the user space apic.
Old user space doesn't call that ioctl.

First of all, we already have TSC-deadline in GET_SUPPORTED_CPUID without
any additional ioctls. And second, I do not see why we need additional
ioctls here.


Yeah, sometimes our ABI is already broken :(. What a shame...


Hmm, so maybe I misunderstood you. You propose to mask TSC-deadline and
x2apic out from GET_SUPPORTED_CPUID if the irq chip is not in kernel, not
from KVM_SET_CPUID? For those two features it may make sense indeed. Not
sure there won't be others that are not dependent on irq chip presence.
You propose to add additional ioctls to enable them if they appear?


That's the only backwards-compatible way I can see to design this without 
putting a plethora of knowledge into user space, yes.





So at the end all bits in 

Re: 1G huge pages in Linux guest VM

2012-05-09 Thread Avi Kivity
On 05/08/2012 09:38 PM, Sriram Murthy wrote:
 Hi,
I wanted to know if we can pass a hugetlbfs with pagesize=1G created on 
 the host to a Linux guest (the host is running RHEL 6.2 - x86_64 on Intel 
 Westmere)?

Yes, this is supported.

I am using RHEL 6.2 on the host and guest and the processor that has been 
 passed by KVM to the guest is Intel Westmere and has the pdpe1gb flag set. 
 Does this mean that the guest supports 1G huge pages?

Yes.  Note 1gb hugepages on the guest are supported even if not provided
by the host (of course the performance improvement will be smaller).
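
[Editor's note: the guest-side check mentioned above can be done directly:
pdpe1gb is reported in CPUID leaf 0x80000001, EDX bit 26. A quick probe
using GCC's cpuid.h helper:]

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Extended leaf 0x80000001: EDX bit 26 = pdpe1gb (1GB pages). */
    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
        return 1;

    printf("1G pages: %s\n", (edx & (1u << 26)) ? "supported" : "not supported");
    return 0;
}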

-- 
error compiling committee.c: too many arguments to function



[PATCHv2] Add async page fault test

2012-05-09 Thread Gleb Natapov
The test checks that when the hypervisor prefaults a swapped-in page it does
not generate spurious #PFs in case the page was unmapped while it was
swapped out.

diff --git a/config-x86-common.mak b/config-x86-common.mak
index c8fbda7..6976f78 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -34,7 +34,7 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
$(TEST_DIR)/realmode.flat $(TEST_DIR)/msr.flat \
$(TEST_DIR)/hypercall.flat $(TEST_DIR)/sieve.flat \
$(TEST_DIR)/kvmclock_test.flat  $(TEST_DIR)/eventinj.flat \
-   $(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat
+   $(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat 
$(TEST_DIR)/asyncpf.flat
 
 ifdef API
 tests-common += api/api-sample
@@ -90,6 +90,8 @@ $(TEST_DIR)/s3.elf: $(cstart.o) $(TEST_DIR)/s3.o
 
 $(TEST_DIR)/pmu.elf: $(cstart.o) $(TEST_DIR)/pmu.o
 
+$(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o
+
 arch_clean:
$(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
$(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
diff --git a/lib/x86/processor.h b/lib/x86/processor.h
index c7e1afb..e46d8d0 100644
--- a/lib/x86/processor.h
+++ b/lib/x86/processor.h
@@ -302,4 +302,9 @@ static inline void invlpg(void *va)
 {
asm volatile("invlpg (%0)" ::"r" (va) : "memory");
 }
+
+static inline void safe_halt(void)
+{
+   asm volatile("sti; hlt");
+}
 #endif
diff --git a/lib/x86/vm.c b/lib/x86/vm.c
index 550ec9b..71b70fd 100644
--- a/lib/x86/vm.c
+++ b/lib/x86/vm.c
@@ -219,6 +219,11 @@ void *vmalloc(unsigned long size)
 return mem;
 }
 
+uint64_t virt_to_phys_cr3(void *mem)
+{
+return (get_pte(phys_to_virt(read_cr3()), mem) & PTE_ADDR) + ((ulong)mem & (PAGE_SIZE - 1));
+}
+
 void vfree(void *mem)
 {
 unsigned long size = ((unsigned long *)mem)[-1];
diff --git a/lib/x86/vm.h b/lib/x86/vm.h
index 71ab4a8..3473f8d 100644
--- a/lib/x86/vm.h
+++ b/lib/x86/vm.h
@@ -22,6 +22,7 @@ void vfree(void *mem);
 void *vmap(unsigned long long phys, unsigned long size);
 void *alloc_vpage(void);
 void *alloc_vpages(ulong nr);
+uint64_t virt_to_phys_cr3(void *mem);
 
 void install_pte(unsigned long *cr3,
 int pte_level,
diff --git a/x86/asyncpf.c b/x86/asyncpf.c
new file mode 100644
index 000..62c0455
--- /dev/null
+++ b/x86/asyncpf.c
@@ -0,0 +1,113 @@
+/*
+ * Async PF test. For the test to actually do anywathing it ineeds to be 
started
+ * in memory cgroup with 512M of memory and with more then 1G memory provided 
+ * to the guest.
+ *
+ * To create cgroup do as root:
+ * mkdir /dev/cgroup
+ * mount -t cgroup none -omemory /dev/cgroup
+ * chmod a+rxw /dev/cgroup/
+ *
+ * From a shell you will start qemu from:
+ * mkdir /dev/cgroup/1
+ * echo $$ > /dev/cgroup/1/tasks
+ * echo 512M > /dev/cgroup/1/memory.limit_in_bytes
+ *
+ */
+#include "x86/msr.h"
+#include "x86/processor.h"
+#include "x86/apic-defs.h"
+#include "x86/apic.h"
+#include "x86/desc.h"
+#include "x86/isr.h"
+#include "x86/vm.h"
+
+#include "libcflat.h"
+#include <stdint.h>
+
+#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
+#define KVM_PV_REASON_PAGE_READY 2
+
+#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
+
+#define KVM_ASYNC_PF_ENABLED (1 << 0)
+#define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1)
+
+volatile uint32_t apf_reason __attribute__((aligned(64)));
+char *buf;
+volatile uint64_t  i;
+volatile uint64_t phys;
+bool fail;
+
+static inline uint32_t get_apf_reason(void)
+{
+   uint32_t r = apf_reason;
+   apf_reason = 0;
+   return r;
+}
+
+static void pf_isr(struct ex_regs *r)
+{
+   void* virt = (void*)((ulong)(buf+i) & ~(PAGE_SIZE-1));
+   uint32_t reason = get_apf_reason();
+
+   switch (reason) {
+   case 0:
+   printf("unexpected #PF at %p\n", read_cr2());
+   fail = true;
+   break;
+   case KVM_PV_REASON_PAGE_NOT_PRESENT:
+   phys = virt_to_phys_cr3(virt);
+   install_pte(phys_to_virt(read_cr3()), 1, virt, phys, 0);
+   write_cr3(read_cr3());
+   printf("Got not present #PF token %x virt addr %p phys addr %p\n", read_cr2(), virt, phys);
+   while(phys) {
+   safe_halt(); /* enables irq */
+   irq_disable();
+   }
+   break;
+   case KVM_PV_REASON_PAGE_READY:
+   printf("Got present #PF token %x\n", read_cr2());
+   if ((uint32_t)read_cr2() == ~0)
+   break;
+   install_pte(phys_to_virt(read_cr3()), 1, virt, phys | 
PTE_PRESENT | PTE_WRITE, 0);
+   write_cr3(read_cr3());
+   phys = 0;
+   break;
+   default:
+   printf("unexpected async pf reason %d\n", reason);
+   

[GIT PULL] KVM fixes for 3.4-rc6

2012-05-09 Thread Avi Kivity
Linus, please pull from:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git master

for you to fetch changes up to 331b646d60b0c3885208e1e02bd9f40319953efc:

  KVM: ia64: fix build due to typo (2012-05-09 13:53:28 +0300)

Two asynchronous page fault fixes (one guest, one host), a powerpc page
refcount fix, and an ia64 build fix.


Avi Kivity (1):
  KVM: ia64: fix build due to typo

David Gibson (1):
  KVM: PPC: Book3S HV: Fix refcounting of hugepages

Gleb Natapov (2):
  KVM: ensure async PF event wakes up vcpu from halt
  KVM: Do not take reference to mm during async #PF

 arch/ia64/kvm/kvm-ia64.c|2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |   22 +-
 arch/powerpc/kvm/book3s_hv.c|2 --
 arch/x86/kernel/kvm.c   |9 +
 arch/x86/kvm/x86.c  |1 +
 5 files changed, 16 insertions(+), 20 deletions(-)

-- 
error compiling committee.c: too many arguments to function



[PATCH 1/4] kvm tools: fix sigsegv in irq__exit

2012-05-09 Thread Sasha Levin
From: Sasha Levin levinsasha...@gmail.com

We free the structures but never remove them from the tree or list, so
we freed them again the next time we ran through that structure.

This patch also simplifies irq__exit a bit.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/x86/irq.c |   17 +
 1 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/tools/kvm/x86/irq.c b/tools/kvm/x86/irq.c
index dc07f28..e83df99 100644
--- a/tools/kvm/x86/irq.c
+++ b/tools/kvm/x86/irq.c
@@ -179,25 +179,18 @@ int irq__exit(struct kvm *kvm)
 
free(irq_routing);
 
-   ent = rb_first(&pci_tree);
-   for (;;) {
+   while ((ent = rb_first(&pci_tree))) {
struct pci_dev *dev;
-   struct rb_node *next;
struct irq_line *line;
-   struct list_head *node, *tmp;
-
-   if (!ent)
-   break;
-
-   next = rb_next(ent);
 
dev = rb_entry(ent, struct pci_dev, node);
-   list_for_each_safe(node, tmp, &dev->lines) {
-   line = list_entry(node, struct irq_line, node);
+   while (!list_empty(&dev->lines)) {
+   line = list_first_entry(&dev->lines, struct irq_line, node);
+   list_del(&line->node);
free(line);
}
+   rb_erase(&dev->node, &pci_tree);
free(dev);
-   ent = next;
}
 
return 0;
-- 
1.7.8.5



[PATCH 3/4] kvm tools: use accessor function for virtio-9p FIDs

2012-05-09 Thread Sasha Levin
From: Sasha Levin levinsasha...@gmail.com

Since the 9p functions don't know the size of the fid array, they might
request an FID outside of the allowed range. Use an accessor to prevent
that and to hide the internal implementation from them.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/virtio/9p.c |   66 +++-
 1 files changed, 37 insertions(+), 29 deletions(-)

diff --git a/tools/kvm/virtio/9p.c b/tools/kvm/virtio/9p.c
index e054dff..35e089d 100644
--- a/tools/kvm/virtio/9p.c
+++ b/tools/kvm/virtio/9p.c
@@ -22,6 +22,14 @@
 static LIST_HEAD(devs);
 static int compat_id = -1;
 
+static struct p9_fid *get_fid(struct p9_dev *p9dev, int fid)
+{
+   if (fid >= VIRTIO_9P_MAX_FID)
+   die("virtio-9p max FID (%u) reached!", VIRTIO_9P_MAX_FID);
+
+   return &p9dev->fids[fid];
+}
+
 /* Warning: Immediately use value returned from this function */
 static const char *rel_to_abs(struct p9_dev *p9dev,
  const char *path, char *abs_path)
@@ -156,7 +164,7 @@ static void virtio_p9_open(struct p9_dev *p9dev,
 
 
virtio_p9_pdu_readf(pdu, "dd", &fid, &flags);
-   new_fid = &p9dev->fids[fid];
+   new_fid = get_fid(p9dev, fid);
 
if (lstat(new_fid->abs_path, &st) < 0)
goto err_out;
@@ -197,7 +205,7 @@ static void virtio_p9_create(struct p9_dev *p9dev,
 
virtio_p9_pdu_readf(pdu, "dsddd", &dfid_val,
&name, &flags, &mode, &gid);
-   dfid = &p9dev->fids[dfid_val];
+   dfid = get_fid(p9dev, dfid_val);
 
flags = virtio_p9_openflags(flags);
 
@@ -245,7 +253,7 @@ static void virtio_p9_mkdir(struct p9_dev *p9dev,
 
virtio_p9_pdu_readf(pdu, "dsdd", &dfid_val,
&name, &mode, &gid);
-   dfid = &p9dev->fids[dfid_val];
+   dfid = get_fid(p9dev, dfid_val);
 
sprintf(full_path, "%s/%s", dfid->abs_path, name);
ret = mkdir(full_path, mode);
@@ -287,11 +295,11 @@ static void virtio_p9_walk(struct p9_dev *p9dev,
 
 
virtio_p9_pdu_readf(pdu, "ddw", &fid_val, &newfid_val, &nwname);
-   new_fid = &p9dev->fids[newfid_val];
+   new_fid = get_fid(p9dev, newfid_val);
 
nwqid = 0;
if (nwname) {
-   struct p9_fid *fid = &p9dev->fids[fid_val];
+   struct p9_fid *fid = get_fid(p9dev, fid_val);
 
strcpy(new_fid->path, fid->path);
/* skip the space for count */
@@ -366,7 +374,7 @@ static void virtio_p9_attach(struct p9_dev *p9dev,
 
stat2qid(&st, &qid);
 
-   fid = &p9dev->fids[fid_val];
+   fid = get_fid(p9dev, fid_val);
fid->fid = fid_val;
fid->uid = uid;
fid->is_dir = 1;
@@ -418,7 +426,7 @@ static void virtio_p9_read(struct p9_dev *p9dev,
 
rcount = 0;
virtio_p9_pdu_readf(pdu, "dqd", &fid_val, &offset, &count);
-   fid = &p9dev->fids[fid_val];
+   fid = get_fid(p9dev, fid_val);
 
iov_base = pdu->in_iov[0].iov_base;
iov_len  = pdu->in_iov[0].iov_len;
@@ -469,7 +477,7 @@ static void virtio_p9_readdir(struct p9_dev *p9dev,
 
rcount = 0;
virtio_p9_pdu_readf(pdu, "dqd", &fid_val, &offset, &count);
-   fid = &p9dev->fids[fid_val];
+   fid = get_fid(p9dev, fid_val);
 
if (!fid->is_dir) {
errno = -EINVAL;
@@ -525,7 +533,7 @@ static void virtio_p9_getattr(struct p9_dev *p9dev,
struct p9_stat_dotl statl;
 
virtio_p9_pdu_readf(pdu, "dq", &fid_val, &request_mask);
-   fid = &p9dev->fids[fid_val];
+   fid = get_fid(p9dev, fid_val);
if (lstat(fid->abs_path, &st) < 0)
goto err_out;
 
@@ -573,7 +581,7 @@ static void virtio_p9_setattr(struct p9_dev *p9dev,
struct p9_iattr_dotl p9attr;
 
virtio_p9_pdu_readf(pdu, "dI", &fid_val, &p9attr);
-   fid = &p9dev->fids[fid_val];
+   fid = get_fid(p9dev, fid_val);
 
if (p9attr.valid & ATTR_MODE) {
ret = chmod(fid->abs_path, p9attr.mode);
@@ -652,7 +660,7 @@ static void virtio_p9_write(struct p9_dev *p9dev,
int twrite_size = sizeof(u32) + sizeof(u64) + sizeof(u32);
 
virtio_p9_pdu_readf(pdu, "dqd", &fid_val, &offset, &count);
-   fid = &p9dev->fids[fid_val];
+   fid = get_fid(p9dev, fid_val);
 
iov_base = pdu->out_iov[0].iov_base;
iov_len  = pdu->out_iov[0].iov_len;
@@ -691,7 +699,7 @@ static void virtio_p9_remove(struct p9_dev *p9dev,
struct p9_fid *fid;
 
virtio_p9_pdu_readf(pdu, "d", &fid_val);
-   fid = &p9dev->fids[fid_val];
+   fid = get_fid(p9dev, fid_val);
 
ret = remove(fid->abs_path);
if (ret < 0)
@@ -714,8 +722,8 @@ static void virtio_p9_rename(struct p9_dev *p9dev,
char full_path[PATH_MAX], *new_name;
 
virtio_p9_pdu_readf(pdu, "dds", &fid_val, &new_fid_val, &new_name);
-   fid = &p9dev->fids[fid_val];
-   new_fid = &p9dev->fids[new_fid_val];
+   fid = get_fid(p9dev, fid_val);
+   new_fid = get_fid(p9dev, new_fid_val);
 
sprintf(full_path, "%s/%s", 

[PATCH 4/4] kvm tools: increase amount of virtio-9p FIDs

2012-05-09 Thread Sasha Levin
From: Sasha Levin levinsasha...@gmail.com

Increase the number of FIDs since we easily reach the current amount with
simple stress tests.

This should be changed to use a rbtree or something faster in the
future.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/include/kvm/virtio-9p.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tools/kvm/include/kvm/virtio-9p.h 
b/tools/kvm/include/kvm/virtio-9p.h
index cd2869a..5902701 100644
--- a/tools/kvm/include/kvm/virtio-9p.h
+++ b/tools/kvm/include/kvm/virtio-9p.h
@@ -12,7 +12,7 @@
 #define VIRTQUEUE_NUM  128
 #define VIRTIO_9P_DEFAULT_TAG  "kvm_9p"
 #define VIRTIO_9P_HDR_LEN  (sizeof(u32)+sizeof(u8)+sizeof(u16))
-#define VIRTIO_9P_MAX_FID  256
+#define VIRTIO_9P_MAX_FID  16384
 #define VIRTIO_9P_VERSION_DOTL "9P2000.L"
 #define MAX_TAG_LEN32
 
-- 
1.7.8.5



[PATCH 2/4] kvm tools: use bitwise 'and' in kvm-ipc

2012-05-09 Thread Sasha Levin
From: Sasha Levin levinsasha...@gmail.com

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/kvm-ipc.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tools/kvm/kvm-ipc.c b/tools/kvm/kvm-ipc.c
index 257c806c..70831b8 100644
--- a/tools/kvm/kvm-ipc.c
+++ b/tools/kvm/kvm-ipc.c
@@ -153,7 +153,7 @@ static void *kvm_ipc__thread(void *param)
r = kvm_ipc__receive(client);
} while (r == 0);
 
-   } else if (event.events && (EPOLLERR | EPOLLRDHUP | EPOLLHUP)) {
+   } else if (event.events & (EPOLLERR | EPOLLRDHUP | EPOLLHUP)) {
kvm_ipc__close_conn(fd);
} else {
kvm_ipc__receive(fd);
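
[Editor's note on why this one-character patch matters: with the logical
"&&" the branch fires for *any* non-zero event mask, since a non-zero
value is simply "true"; the bitwise "&" tests only the error/hangup bits.
A tiny illustration, not from the patch:]

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/epoll.h>

int main(void)
{
    unsigned int events = EPOLLIN;   /* data ready, no error */

    /* Logical &&: both operands non-zero, so a healthy EPOLLIN
     * connection would be closed. Prints 1. */
    printf("&&: %d\n", events && (EPOLLERR | EPOLLRDHUP | EPOLLHUP));

    /* Bitwise &: no error/hangup bit set, prints 0. */
    printf("& : %d\n", (events & (EPOLLERR | EPOLLRDHUP | EPOLLHUP)) ? 1 : 0);
    return 0;
}
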
-- 
1.7.8.5



Re: [PATCHv2] Add async page fault test

2012-05-09 Thread Gleb Natapov
On Wed, May 09, 2012 at 02:44:42PM +0200, Igor Mammedov wrote:
 On 05/09/2012 12:33 PM, Gleb Natapov wrote:
 The test checks that when the hypervisor prefaults a swapped-in page it does
 not generate spurious #PFs in case the page was unmapped while it was
 swapped out.
 
 diff --git a/config-x86-common.mak b/config-x86-common.mak
 index c8fbda7..6976f78 100644
 --- a/config-x86-common.mak
 +++ b/config-x86-common.mak
 @@ -34,7 +34,7 @@ tests-common = $(TEST_DIR)/vmexit.flat 
 $(TEST_DIR)/tsc.flat \
  $(TEST_DIR)/realmode.flat $(TEST_DIR)/msr.flat \
  $(TEST_DIR)/hypercall.flat $(TEST_DIR)/sieve.flat \
  $(TEST_DIR)/kvmclock_test.flat  $(TEST_DIR)/eventinj.flat \
 -   $(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat
 +   $(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat 
 $(TEST_DIR)/asyncpf.flat
 
   ifdef API
   tests-common += api/api-sample
 @@ -90,6 +90,8 @@ $(TEST_DIR)/s3.elf: $(cstart.o) $(TEST_DIR)/s3.o
 
   $(TEST_DIR)/pmu.elf: $(cstart.o) $(TEST_DIR)/pmu.o
 
 +$(TEST_DIR)/asyncpf.elf: $(cstart.o) $(TEST_DIR)/asyncpf.o
 +
   arch_clean:
  $(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
  $(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
 diff --git a/lib/x86/processor.h b/lib/x86/processor.h
 index c7e1afb..e46d8d0 100644
 --- a/lib/x86/processor.h
 +++ b/lib/x86/processor.h
 @@ -302,4 +302,9 @@ static inline void invlpg(void *va)
   {
  asm volatile("invlpg (%0)" ::"r" (va) : "memory");
   }
 +
 +static inline void safe_halt(void)
 +{
 +asm volatile("sti; hlt");
 +}
   #endif
 diff --git a/lib/x86/vm.c b/lib/x86/vm.c
 index 550ec9b..71b70fd 100644
 --- a/lib/x86/vm.c
 +++ b/lib/x86/vm.c
 @@ -219,6 +219,11 @@ void *vmalloc(unsigned long size)
   return mem;
   }
 
 +uint64_t virt_to_phys_cr3(void *mem)
 +{
 +return (get_pte(phys_to_virt(read_cr3()), mem) & PTE_ADDR) + ((ulong)mem & (PAGE_SIZE - 1));
 +}
 +
   void vfree(void *mem)
   {
   unsigned long size = ((unsigned long *)mem)[-1];
 diff --git a/lib/x86/vm.h b/lib/x86/vm.h
 index 71ab4a8..3473f8d 100644
 --- a/lib/x86/vm.h
 +++ b/lib/x86/vm.h
 @@ -22,6 +22,7 @@ void vfree(void *mem);
   void *vmap(unsigned long long phys, unsigned long size);
   void *alloc_vpage(void);
   void *alloc_vpages(ulong nr);
 +uint64_t virt_to_phys_cr3(void *mem);
 
   void install_pte(unsigned long *cr3,
   int pte_level,
 diff --git a/x86/asyncpf.c b/x86/asyncpf.c
 new file mode 100644
 index 000..62c0455
 --- /dev/null
 +++ b/x86/asyncpf.c
 @@ -0,0 +1,113 @@
 +/*
 + * Async PF test. For the test to actually do anywathing it ineeds to be 
 started
   typo?
 
Either that, or a new word!

--
Gleb.


How to determine the backing host physical memory for a given guest ?

2012-05-09 Thread Chegu Vinod

Hello,

On an 8-socket Westmere host I am attempting to run a single guest and 
characterize the virtualization overhead for a system-intensive 
workload (AIM7-high_systime) as the size of the guest scales (10way/64G, 
20way/128G, ... 80way/512G). 

To do some comparisons between the native vs. guest runs, I have 
been using numactl to control the cpu node & memory node bindings for 
the qemu instance.  For larger guest sizes I end up binding across multiple 
localities, e.g. for a 40-way guest:

numactl --cpunodebind=0,1,2,3  --membind=0,1,2,3  \
qemu-system-x86_64 -smp 40 -m 262144 \


I understand that actual mappings from a guest virtual address to host physical 
address could change. 

Is there a way to determine [at a given instant] which of the host's NUMA nodes 
is providing the backing physical memory for the active guest's kernel and 
also for the apps actively running in the guest? 

I am guessing that there is a better way (some tool available?) than just
diff'ing the per-node memory usage from the before and after output of 
numactl --hardware on the host.

Thanks
Vinod





[PATCH 00/11] KVM fixes for 3.3.5

2012-05-09 Thread Avi Kivity
After a long hiatus, here are a bunch of very delayed fixes for 3.3.5.

Alex Williamson (1):
  KVM: lock slots_lock around device assignment

Avi Kivity (3):
  KVM: Ensure all vcpus are consistent with in-kernel irqchip settings
  KVM: VMX: Fix delayed load of shared MSRs
  KVM: VMX: Fix kvm_set_shared_msr() called in preemptible context

Christian Borntraeger (1):
  KVM: s390: Sanitize fpc registers for KVM_SET_FPU

Gleb Natapov (1):
  KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction
emulation

Jens Freimann (1):
  KVM: s390: do store status after handling STOP_ON_STOP bit

Marcelo Tosatti (1):
  KVM: VMX: vmx_set_cr0 expects kvm->srcu locked

Nadav Har'El (1):
  KVM: nVMX: Fix erroneous exception bitmap check

Takuya Yoshikawa (2):
  KVM: Fix write protection race during dirty logging
  KVM: mmu_notifier: Flush TLBs before releasing mmu_lock

 arch/ia64/kvm/kvm-ia64.c  |5 +
 arch/s390/kvm/intercept.c |   20 
 arch/s390/kvm/kvm-s390.c  |2 +-
 arch/x86/kvm/pmu.c|2 +-
 arch/x86/kvm/vmx.c|   10 +-
 arch/x86/kvm/x86.c|   19 +--
 include/linux/kvm_host.h  |7 +++
 virt/kvm/iommu.c  |   23 +++
 virt/kvm/kvm_main.c   |   23 ++-
 9 files changed, 77 insertions(+), 34 deletions(-)

-- 
1.7.10.1



[PATCH 01/11] KVM: s390: do store status after handling STOP_ON_STOP bit

2012-05-09 Thread Avi Kivity
From: Jens Freimann jf...@linux.vnet.ibm.com

In handle_stop() handle the stop bit before doing the store status as
described for Stop and Store Status in the Principles of Operation.
We have to give up the local_int.lock before calling kvm store status
since it calls gmap_fault() which might sleep. Since local_int.lock
only protects local_int.* and not guest memory we can give up the lock.

Signed-off-by: Jens Freimann jf...@linux.vnet.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 9e0d5473e2f0ba2d2fe9dab9408edef3060b710e)
---
 arch/s390/kvm/intercept.c |   20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/s390/kvm/intercept.c b/arch/s390/kvm/intercept.c
index 0243454..a5f6eff 100644
--- a/arch/s390/kvm/intercept.c
+++ b/arch/s390/kvm/intercept.c
@@ -133,13 +133,6 @@ static int handle_stop(struct kvm_vcpu *vcpu)
 
 	vcpu->stat.exit_stop_request++;
 	spin_lock_bh(&vcpu->arch.local_int.lock);
-	if (vcpu->arch.local_int.action_bits & ACTION_STORE_ON_STOP) {
-		vcpu->arch.local_int.action_bits &= ~ACTION_STORE_ON_STOP;
-		rc = kvm_s390_vcpu_store_status(vcpu,
-						KVM_S390_STORE_STATUS_NOADDR);
-		if (rc >= 0)
-			rc = -EOPNOTSUPP;
-	}
 
 	if (vcpu->arch.local_int.action_bits & ACTION_RELOADVCPU_ON_STOP) {
 		vcpu->arch.local_int.action_bits &= ~ACTION_RELOADVCPU_ON_STOP;
@@ -155,7 +148,18 @@ static int handle_stop(struct kvm_vcpu *vcpu)
 		rc = -EOPNOTSUPP;
 	}
 
-	spin_unlock_bh(&vcpu->arch.local_int.lock);
+	if (vcpu->arch.local_int.action_bits & ACTION_STORE_ON_STOP) {
+		vcpu->arch.local_int.action_bits &= ~ACTION_STORE_ON_STOP;
+		/* store status must be called unlocked. Since local_int.lock
+		 * only protects local_int.* and not guest memory we can give
+		 * up the lock here */
+		spin_unlock_bh(&vcpu->arch.local_int.lock);
+		rc = kvm_s390_vcpu_store_status(vcpu,
+						KVM_S390_STORE_STATUS_NOADDR);
+		if (rc >= 0)
+			rc = -EOPNOTSUPP;
+	} else
+		spin_unlock_bh(&vcpu->arch.local_int.lock);
 	return rc;
 }
 
-- 
1.7.10.1



[PATCH 02/11] KVM: s390: Sanitize fpc registers for KVM_SET_FPU

2012-05-09 Thread Avi Kivity
From: Christian Borntraeger borntrae...@de.ibm.com

commit 7eef87dc99e419b1cc051e4417c37e4744d7b661 (KVM: s390: fix
register setting) added a load of the floating point control register
to the KVM_SET_FPU path. Let's make sure that the fpc is valid.

Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 851755871c1f3184f4124c466e85881f17fa3226)
---
 arch/s390/kvm/kvm-s390.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index d1c44573..d3cb86c 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -418,7 +418,7 @@ int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
 	memcpy(&vcpu->arch.guest_fpregs.fprs, &fpu->fprs, sizeof(fpu->fprs));
-	vcpu->arch.guest_fpregs.fpc = fpu->fpc;
+	vcpu->arch.guest_fpregs.fpc = fpu->fpc & FPC_VALID_MASK;
 	restore_fp_regs(&vcpu->arch.guest_fpregs);
 	return 0;
 }
-- 
1.7.10.1



[PATCH 03/11] KVM: Fix write protection race during dirty logging

2012-05-09 Thread Avi Kivity
From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

This patch fixes a race introduced by:

  commit 95d4c16ce78cb6b7549a09159c409d52ddd18dae
  KVM: Optimize dirty logging by rmap_write_protect()

During protecting pages for dirty logging, other threads may also try
to protect a page in mmu_sync_children() or kvm_mmu_get_page().

In such a case, because get_dirty_log releases mmu_lock before flushing
TLB's, the following race condition can happen:

  A (get_dirty_log) B (another thread)

  lock(mmu_lock)
  clear pte.w
  unlock(mmu_lock)
lock(mmu_lock)
pte.w is already cleared
unlock(mmu_lock)
skip TLB flush
return
  ...
  TLB flush

Though thread B assumes the page has already been protected when it
returns, the remaining TLB entry will break that assumption.

This patch fixes this problem by making get_dirty_log hold the mmu_lock
until it flushes the TLB's.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 6dbf79e7164e9a86c1e466062c48498142ae6128)
---
 arch/x86/kvm/x86.c |   11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9cbfc06..410b6b0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2997,6 +2997,8 @@ static void write_protect_slot(struct kvm *kvm,
 			       unsigned long *dirty_bitmap,
 			       unsigned long nr_dirty_pages)
 {
+	spin_lock(&kvm->mmu_lock);
+
 	/* Not many dirty pages compared to # of shadow pages. */
 	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
 		unsigned long gfn_offset;
@@ -3004,16 +3006,13 @@ static void write_protect_slot(struct kvm *kvm,
 		for_each_set_bit(gfn_offset, dirty_bitmap, memslot->npages) {
 			unsigned long gfn = memslot->base_gfn + gfn_offset;
 
-			spin_lock(&kvm->mmu_lock);
 			kvm_mmu_rmap_write_protect(kvm, gfn, memslot);
-			spin_unlock(&kvm->mmu_lock);
 		}
 		kvm_flush_remote_tlbs(kvm);
-	} else {
-		spin_lock(&kvm->mmu_lock);
+	} else
 		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
-		spin_unlock(&kvm->mmu_lock);
-	}
+
+	spin_unlock(&kvm->mmu_lock);
 }
 
 /*
-- 
1.7.10.1



[PATCH 04/11] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock

2012-05-09 Thread Avi Kivity
From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

Other threads may process the same page in that small window and skip
TLB flush and then return before these functions do flush.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 565f3be2174611f364405bbea2d86e153c2e7e78)
---
 virt/kvm/kvm_main.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c4ac57e..4c68c1e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -289,15 +289,15 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 	 */
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
+
 	kvm->mmu_notifier_seq++;
 	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
-	spin_unlock(&kvm->mmu_lock);
-	srcu_read_unlock(&kvm->srcu, idx);
-
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
 		kvm_flush_remote_tlbs(kvm);
 
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
 }
 
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
@@ -335,12 +335,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	for (; start < end; start += PAGE_SIZE)
 		need_tlb_flush |= kvm_unmap_hva(kvm, start);
 	need_tlb_flush |= kvm->tlbs_dirty;
-	spin_unlock(&kvm->mmu_lock);
-	srcu_read_unlock(&kvm->srcu, idx);
-
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
 		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -378,13 +378,14 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
-	young = kvm_age_hva(kvm, address);
-	spin_unlock(&kvm->mmu_lock);
-	srcu_read_unlock(&kvm->srcu, idx);
 
+	young = kvm_age_hva(kvm, address);
 	if (young)
 		kvm_flush_remote_tlbs(kvm);
 
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+
 	return young;
 }
 
-- 
1.7.10.1



[PATCH 05/11] KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation

2012-05-09 Thread Avi Kivity
From: Gleb Natapov g...@redhat.com

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 270c6c79f4e15e599f47174ecedad932463af7a2)
---
 arch/x86/kvm/pmu.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 7aad544..3e48c1d 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -413,7 +413,7 @@ int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data)
 	struct kvm_pmc *counters;
 	u64 ctr;
 
-	pmc &= (3u << 30) - 1;
+	pmc &= ~(3u << 30);
 	if (!fixed && pmc >= pmu->nr_arch_gp_counters)
 		return 1;
 	if (fixed && pmc >= pmu->nr_arch_fixed_counters)
-- 
1.7.10.1



[PATCH 07/11] KVM: VMX: Fix delayed load of shared MSRs

2012-05-09 Thread Avi Kivity
Shared MSRs (MSR_*STAR and related) are stored in both vmx->guest_msrs
and in the CPU registers, but vmx_set_msr() only updated memory. Prior
to 46199f33c2953, this didn't matter, since we called vmx_load_host_state(),
which scheduled a vmx_save_host_state(), which re-synchronized the CPU
state, but now we don't, so the CPU state will not be synchronized until
the next exit to host userspace.  This mostly affects nested vmx workloads,
which play with these MSRs a lot.

Fix by loading the MSR eagerly.

Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 9ee73970c03edb68146ceb1ba2a7033c99a5e017)
---
 arch/x86/kvm/vmx.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3b4c8d8..fafb325 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2219,6 +2219,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 	msr = find_msr_entry(vmx, msr_index);
 	if (msr) {
 		msr->data = data;
+		if (msr - vmx->guest_msrs < vmx->save_nmsrs)
+			kvm_set_shared_msr(msr->index, msr->data,
+					   msr->mask);
 		break;
 	}
 	ret = kvm_set_msr_common(vcpu, msr_index, data);
-- 
1.7.10.1



[PATCH 09/11] KVM: VMX: vmx_set_cr0 expects kvm->srcu locked

2012-05-09 Thread Avi Kivity
From: Marcelo Tosatti mtosa...@redhat.com

vmx_set_cr0 is called from vcpu run context, therefore it expects
kvm->srcu to be held (for setting up the real-mode TSS).

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 7a4f5ad051e02139a9f1c0f7f4b1acb88915852b)
---
 arch/x86/kvm/vmx.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5d1b0c7..7f33e33 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3918,7 +3918,9 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 	vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
 
 	vmx->vcpu.arch.cr0 = X86_CR0_NW | X86_CR0_CD | X86_CR0_ET;
+	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
 	vmx_set_cr0(&vmx->vcpu, kvm_read_cr0(vcpu)); /* enter rmode */
+	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
 	vmx_set_cr4(&vmx->vcpu, 0);
 	vmx_set_efer(&vmx->vcpu, 0);
 	vmx_fpu_activate(&vmx->vcpu);
-- 
1.7.10.1



[PATCH 10/11] KVM: VMX: Fix kvm_set_shared_msr() called in preemptible context

2012-05-09 Thread Avi Kivity
kvm_set_shared_msr() may not be called in preemptible context,
but vmx_set_msr() does so:

  BUG: using smp_processor_id() in preemptible [00000000] code: qemu-kvm/22713
  caller is kvm_set_shared_msr+0x32/0xa0 [kvm]
  Pid: 22713, comm: qemu-kvm Not tainted 3.4.0-rc3+ #39
  Call Trace:
   [<ffffffff8131fa82>] debug_smp_processor_id+0xe2/0x100
   [<ffffffffa0328ae2>] kvm_set_shared_msr+0x32/0xa0 [kvm]
   [<ffffffffa03a103b>] vmx_set_msr+0x28b/0x2d0 [kvm_intel]
   ...

Making kvm_set_shared_msr() work in preemptible is cleaner, but
it's used in the fast path.  Making two variants is overkill, so
this patch just disables preemption around the call.

Reported-by: Dave Jones da...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
(cherry picked from commit 2225fd56049643c1a7d645c0ce9d499d43c7974e)
---
 arch/x86/kvm/vmx.c |5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7f33e33..a7a6f60 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2219,9 +2219,12 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 	msr = find_msr_entry(vmx, msr_index);
 	if (msr) {
 		msr->data = data;
-		if (msr - vmx->guest_msrs < vmx->save_nmsrs)
+		if (msr - vmx->guest_msrs < vmx->save_nmsrs) {
+			preempt_disable();
 			kvm_set_shared_msr(msr->index, msr->data,
 					   msr->mask);
+			preempt_enable();
+		}
 		break;
 	}
 	ret = kvm_set_msr_common(vcpu, msr_index, data);
-- 
1.7.10.1



[PATCH 11/11] KVM: lock slots_lock around device assignment

2012-05-09 Thread Avi Kivity
From: Alex Williamson alex.william...@redhat.com

As pointed out by Jason Baron, when assigning a device to a guest
we first set the iommu domain pointer, which enables mapping
and unmapping of memory slots to the iommu.  This leaves a window
where this path is enabled, but we haven't synchronized the iommu
mappings to the existing memory slots.  Thus a slot being removed
at that point could send us down unexpected code paths removing
non-existent pinnings and iommu mappings.  Take the slots_lock
around creating the iommu domain and initial mappings as well as
around iommu teardown to avoid this race.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
(cherry picked from commit 21a1416a1c945c5aeaeaf791b63c64926018eb77)
---
 virt/kvm/iommu.c  |   23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index fec1723..e9fff98 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -240,9 +240,13 @@ int kvm_iommu_map_guest(struct kvm *kvm)
 		return -ENODEV;
 	}
 
+	mutex_lock(&kvm->slots_lock);
+
 	kvm->arch.iommu_domain = iommu_domain_alloc(&pci_bus_type);
-	if (!kvm->arch.iommu_domain)
-		return -ENOMEM;
+	if (!kvm->arch.iommu_domain) {
+		r = -ENOMEM;
+		goto out_unlock;
+	}
 
 	if (!allow_unsafe_assigned_interrupts &&
 	    !iommu_domain_has_cap(kvm->arch.iommu_domain,
@@ -253,17 +257,16 @@ int kvm_iommu_map_guest(struct kvm *kvm)
 		       " module option.\n", __func__);
 		iommu_domain_free(kvm->arch.iommu_domain);
 		kvm->arch.iommu_domain = NULL;
-		return -EPERM;
+		r = -EPERM;
+		goto out_unlock;
 	}
 
 	r = kvm_iommu_map_memslots(kvm);
 	if (r)
-		goto out_unmap;
-
-	return 0;
+		kvm_iommu_unmap_memslots(kvm);
 
-out_unmap:
-	kvm_iommu_unmap_memslots(kvm);
+out_unlock:
+	mutex_unlock(&kvm->slots_lock);
 	return r;
 }
 
@@ -340,7 +343,11 @@ int kvm_iommu_unmap_guest(struct kvm *kvm)
 	if (!domain)
 		return 0;
 
+	mutex_lock(&kvm->slots_lock);
 	kvm_iommu_unmap_memslots(kvm);
+	kvm->arch.iommu_domain = NULL;
+	mutex_unlock(&kvm->slots_lock);
+
 	iommu_domain_free(domain);
 	return 0;
 }
-- 
1.7.10.1



Re: [PATCH unit-tests] Add async page fault test

2012-05-09 Thread Takuya Yoshikawa
On Wed, 9 May 2012 11:59:17 +0300
Gleb Natapov g...@redhat.com wrote:

 Hmm, yes if it is file backed it may work. Setting up qemu to use file
 backed memory is one more complication while running the test though.
 I haven't checked but I am not sure that MADV_DONTNEED will drop page
 immediately though. It probably puts it on some list to be freed later.
 Hmm actually looking at the comments it seems like this is what happens:
 
 /*
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
  * data it wants to keep.  Be sure to free swap resources too.  The
  * zap_page_range call sets things up for shrink_active_list to actually
  * free
  * these pages later if no one else has touched them in the meantime,
  * although we could add these pages to a global reuse list for
  * shrink_active_list to pick up before reclaiming other pages.
  */

zap_page_range() actually frees these pages, no?

Virtio balloon seems to rely on this.

Thanks,
Takuya
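
[Editor's note: for context, the page-dropping step being debated is a
single madvise() call. A minimal sketch, assuming the test's guest RAM
is an anonymous mapping; drop_page() is a hypothetical helper name:]

	#include <sys/mman.h>

	/* Drop one page so the next access must be faulted back in.
	 * For anonymous memory zap_page_range() frees it immediately;
	 * a later touch yields a fresh zero-filled page. */
	static int drop_page(void *page, size_t page_size)
	{
		return madvise(page, page_size, MADV_DONTNEED);
	}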


Re: [PATCH unit-tests] Add async page fault test

2012-05-09 Thread Avi Kivity
On 05/09/2012 04:18 PM, Takuya Yoshikawa wrote:
 On Wed, 9 May 2012 11:59:17 +0300
 Gleb Natapov g...@redhat.com wrote:

  Hmm, yes if it is file backed it may work. Setting up qemu to use file
  backed memory is one more complication while running the test though.
  I haven't checked but I am not sure that MADV_DONTNEED will drop page
  immediately though. It probably puts it on some list to be freed later.
  Hmm actually looking at the comments it seems like this is what happens:
  
  /*
   * Application no longer needs these pages.  If the pages are dirty,
   * it's OK to just throw them away.  The app will be more careful about
   * data it wants to keep.  Be sure to free swap resources too.  The
   * zap_page_range call sets things up for shrink_active_list to actually
   * free
   * these pages later if no one else has touched them in the meantime,
   * although we could add these pages to a global reuse list for
   * shrink_active_list to pick up before reclaiming other pages.
   */

 zap_page_range() actually frees these pages, no?

 Virtio balloon seems to rely on this.


The pages are removed from the user address space.  But if they're not
anonymous, the pages still live in the page cache.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH unit-tests] Add async page fault test

2012-05-09 Thread Takuya Yoshikawa
On Wed, 09 May 2012 16:20:23 +0300
Avi Kivity a...@redhat.com wrote:

  zap_page_range() actually frees these pages, no?
 
  Virtio balloon seems to rely on this.
 
 
 The pages are removed from the user address space.  But if they're not
 anonymous, the pages still live in the page cache.

Ah, about non-anonymous/file-backed case, I see.

Thanks,
Takuya


[PATCH] bitops: add _local bitops

2012-05-09 Thread Michael S. Tsirkin
kvm needs to update some hypervisor variables atomically
in a sense that the operation can't be interrupted
in the middle. However the hypervisor always runs
on the same CPU so it does not need any memory
barrier or lock prefix.

At Peter Anvin's suggestion, add _local bitops for this purpose:
define them as non-atomics for x86 and (for now) atomics
for everyone else.

Uses are not restricted to virtualization: they
might be useful to communicate with an interrupt
handler if we know that it's running on the same CPU.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Link to previous discussion:
http://www.spinics.net/lists/kvm/msg72241.html


 Documentation/atomic_ops.txt  |   19 ++
 arch/x86/include/asm/bitops.h |1 +
 include/asm-generic/bitops.h  |1 +
 include/asm-generic/bitops/local-atomic.h |   92 +
 include/asm-generic/bitops/local.h|   85 ++
 include/linux/bitops.h|8 +++
 6 files changed, 206 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/bitops/local-atomic.h
 create mode 100644 include/asm-generic/bitops/local.h

diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
index 27f2b21..b7e3b67 100644
--- a/Documentation/atomic_ops.txt
+++ b/Documentation/atomic_ops.txt
@@ -520,6 +520,25 @@ The __clear_bit_unlock version is non-atomic, however it still implements
 unlock barrier semantics. This can be useful if the lock itself is protecting
 the other bits in the word.
 
+Local versions of the bitmask operations are also provided.  They are used in
+contexts where the operations need to be performed atomically with respect to
+the local CPU, but no other CPU accesses the bitmask.  This assumption makes it
+possible to avoid the need for SMP protection and use less expensive atomic
+operations in the implementation.
+They have names similar to the above bitmask operation interfaces,
+except that _local is sufficed to the interface name.
+
+   void set_bit_local(unsigned long nr, volatile unsigned long *addr);
+   void clear_bit_local(unsigned long nr, volatile unsigned long *addr);
+   void change_bit_local(unsigned long nr, volatile unsigned long *addr);
+   int test_and_set_bit_local(unsigned long nr, volatile unsigned long *addr);
+   int test_and_clear_bit_local(unsigned long nr, volatile unsigned long *addr);
+   int test_and_change_bit_local(unsigned long nr, volatile unsigned long *addr);
+
+These local variants are useful for example if the bitmask may be accessed from
+a local interrupt, or from a hypervisor on the same CPU if running in a VM.
+These local variants also do not have any special memory barrier semantics.
+
 Finally, there are non-atomic versions of the bitmask operations
 provided.  They are used in contexts where some other higher-level SMP
 locking scheme is being used to protect the bitmask, and thus less
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index b97596e..8784cd7 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -509,6 +509,7 @@ static __always_inline int fls64(__u64 x)
 #include <asm-generic/bitops/le.h>
 
 #include <asm-generic/bitops/ext2-atomic-setbit.h>
+#include <asm-generic/bitops/local.h>
 
 #endif /* __KERNEL__ */
 #endif /* _ASM_X86_BITOPS_H */
diff --git a/include/asm-generic/bitops.h b/include/asm-generic/bitops.h
index 280ca7a..d720c9e 100644
--- a/include/asm-generic/bitops.h
+++ b/include/asm-generic/bitops.h
@@ -40,5 +40,6 @@
 #include <asm-generic/bitops/non-atomic.h>
 #include <asm-generic/bitops/le.h>
 #include <asm-generic/bitops/ext2-atomic.h>
+#include <asm-generic/bitops/local-atomic.h>
 
 #endif /* __ASM_GENERIC_BITOPS_H */
diff --git a/include/asm-generic/bitops/local-atomic.h b/include/asm-generic/bitops/local-atomic.h
new file mode 100644
index 000..94ad261
--- /dev/null
+++ b/include/asm-generic/bitops/local-atomic.h
@@ -0,0 +1,92 @@
+#ifndef ASM_GENERIC_BITOPS_LOCAL_ATOMIC_H
+#define ASM_GENERIC_BITOPS_LOCAL_ATOMIC_H
+/**
+ * Local atomic operations
+ *
+ * These operations give no atomicity or ordering guarantees if result
+ * observed from another CPU.  Atomicity is guaranteed if result is observed
+ * from the same CPU, e.g. from a local interrupt, or a hypervisor if running
+ * in a VM.
+ * Atomicity is not guaranteed across CPUs: if two examples of these operations
+ * race on different CPUs, one can appear to succeed but actually fail.  Use
+ * non-local atomics instead or protect such SMP accesses with a lock.
+ * These operations can be reordered. No memory barrier is implied.
+ */
+
+/**
+ * Implement local operations in terms of atomics.
+ * For use from a local interrupt, this is always safe but suboptimal: many
+ * architectures can use bitops/local.h instead.
+ * For use from a hypervisor, make sure your architecture doesn't
+ * rely on locks for atomics: if it does - 

Re: How to determine the backing host physical memory for a given guest ?

2012-05-09 Thread Avi Kivity
On 05/09/2012 04:05 PM, Chegu Vinod wrote:
 Hello,

 On an 8-socket Westmere host I am attempting to run a single guest and 
 characterize the virtualization overhead for a system-intensive 
 workload (AIM7-high_systime) as the size of the guest scales (10way/64G, 
 20way/128G, ... 80way/512G). 

 To do some comparisons between the native vs. guest runs, I have 
 been using numactl to control the cpu node & memory node bindings for 
 the qemu instance.  For larger guest sizes I end up binding across multiple 
 localities, e.g. for a 40-way guest:

 numactl --cpunodebind=0,1,2,3  --membind=0,1,2,3  \
 qemu-system-x86_64 -smp 40 -m 262144 \
 

 I understand that actual mappings from a guest virtual address to host 
 physical address could change. 

 Is there a way to determine [at a given instant] which of the host's NUMA 
 nodes is providing the backing physical memory for the active guest's kernel 
 and also for the apps actively running in the guest ? 

 I am guessing that there is a better way (some tool available?) than just
 diff'ing the per-node memory usage from the before and after output of 
 numactl --hardware on the host.


Not sure if that's what you want, but there's Documentation/vm/pagemap.txt.

-- 
error compiling committee.c: too many arguments to function
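
[Editor's note: besides pagemap, move_pages(2) in query mode (nodes ==
NULL) reports the node currently backing each page of another process,
which may answer the question more directly. A sketch under that
assumption; node_backing() is a hypothetical helper, link with -lnuma:]

	#include <numaif.h>

	/* Ask the kernel which NUMA node backs one page of the qemu
	 * process; needs appropriate privileges over that process. */
	static int node_backing(int qemu_pid, void *guest_va)
	{
		void *pages[1] = { guest_va };
		int status[1];

		if (move_pages(qemu_pid, 1, pages, NULL, status, 0) < 0)
			return -1;
		return status[0];	/* node number, or negative errno */
	}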



Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread H. Peter Anvin
On 05/09/2012 06:45 AM, Michael S. Tsirkin wrote:
 kvm needs to update some hypervisor variables atomically
 in a sense that the operation can't be interrupted
 in the middle. However the hypervisor always runs
 on the same CPU so it does not need any memory
 barrier or lock prefix.
 
 At Peter Anvin's suggestion, add _local bitops for this purpose:
 define them as non-atomics for x86 and (for now) atomics
 for everyone else.
 
 Uses are not restricted to virtualization: they
 might be useful to communicate with an interrupt
 handler if we know that it's running on the same CPU.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com

I don't think you can use the x86 nonatomics as-is, because they don't
contain optimization barriers.

-hpa



Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread Avi Kivity
On 05/09/2012 04:45 PM, Michael S. Tsirkin wrote:
  
 +Local versions of the bitmask operations are also provided.  They are used in
 +contexts where the operations need to be performed atomically with respect to
 +the local CPU, but no other CPU accesses the bitmask.  This assumption makes it
 +possible to avoid the need for SMP protection and use less expensive atomic
 +operations in the implementation.
 +They have names similar to the above bitmask operation interfaces,
 +except that _local is sufficed to the interface name.

suffixed (better: appended)

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread Michael S. Tsirkin
On Wed, May 09, 2012 at 07:03:37AM -0700, H. Peter Anvin wrote:
 On 05/09/2012 06:45 AM, Michael S. Tsirkin wrote:
  kvm needs to update some hypervisor variables atomically
  in a sense that the operation can't be interrupted
  in the middle. However the hypervisor always runs
  on the same CPU so it does not need any memory
  barrier or lock prefix.
  
  At Peter Anvin's suggestion, add _local bitops for this purpose:
  define them as non-atomics for x86 and (for now) atomics
  for everyone else.
  
  Uses are not restricted to virtualization: they
  might be useful to communicate with an interrupt
  handler if we know that it's running on the same CPU.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 I don't think you can use the x86 nonatomics as-is, because they don't
 contain optimization barriers.
 
   -hpa

You are right of course. So I'll remove bitops/local.h
move the code to x86/ and open-code.

-- 
MST


Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread Michael S. Tsirkin
On Wed, May 09, 2012 at 07:03:37AM -0700, H. Peter Anvin wrote:
 On 05/09/2012 06:45 AM, Michael S. Tsirkin wrote:
  kvm needs to update some hypervisor variables atomically
  in a sense that the operation can't be interrupted
  in the middle. However the hypervisor always runs
  on the same CPU so it does not need any memory
  barrier or lock prefix.
  
  At Peter Anvin's suggestion, add _local bitops for this purpose:
  define them as non-atomics for x86 and (for now) atomics
  for everyone else.
  
  Uses are not restricted to virtualization: they
  might be useful to communicate with an interrupt
  handler if we know that it's running on the same CPU.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 I don't think you can use the x86 nonatomics as-is, because they don't
 contain optimization barriers.
 
   -hpa

Just adding a memory clobber to asm will be enough, right?

-- 
MST


Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread Michael S. Tsirkin
On Wed, May 09, 2012 at 07:03:37AM -0700, H. Peter Anvin wrote:
 On 05/09/2012 06:45 AM, Michael S. Tsirkin wrote:
  kvm needs to update some hypervisor variables atomically
  in a sense that the operation can't be interrupted
  in the middle. However the hypervisor always runs
  on the same CPU so it does not need any memory
  barrier or lock prefix.
  
  At Peter Anvin's suggestion, add _local bitops for this purpose:
  define them as non-atomics for x86 and (for now) atomics
  for everyone else.
  
  Uses are not restricted to virtualization: they
  might be useful to communicate with an interrupt
  handler if we know that it's running on the same CPU.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 I don't think you can use the x86 nonatomics as-is, because they don't
 contain optimization barriers.
 
   -hpa

By the way, clear_bit on x86 does not seem to contain
an optimization barrier - is my reading correct?
Lock prefix does not affect the compiler, right?


Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread H. Peter Anvin
On 05/09/2012 08:47 AM, Michael S. Tsirkin wrote:
 
 By the way, clear_bit on x86 does not seem to contain
 an optimization barrier - is my reading correct?
 Lock prefix does not affect the compiler, right?

Yes, as it clearly states in the comment:

 * clear_bit() is atomic and may not be reordered.  However, it does
 * not contain a memory barrier, so if it is used for locking purposes,
 * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
 * in order to ensure changes are visible on other processors.

There is clear_bit_unlock() which has the barrier semantics.

-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
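
[Editor's note: the distinction hpa draws looks roughly like this in
the x86 bitops of that era. A simplified sketch, not the verbatim
source (the real clear_bit() also has an immediate-operand fast path):]

	static inline void clear_bit(int nr, volatile unsigned long *addr)
	{
		/* atomic RMW, but no "memory" clobber: the compiler may
		 * still reorder unrelated accesses around it */
		asm volatile(LOCK_PREFIX "btr %1,%0"
			     : "+m" (*addr)
			     : "Ir" (nr));
	}

	static inline void clear_bit_unlock(unsigned nr, volatile unsigned long *addr)
	{
		barrier();	/* optimization barrier supplies the unlock ordering */
		clear_bit(nr, addr);
	}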



[PATCHv2] bitops: add _local bitops

2012-05-09 Thread Michael S. Tsirkin
kvm needs to update some hypervisor variables atomically
in a sense that the operation can't be interrupted
in the middle. However the hypervisor always runs
on the same CPU so it does not need any memory
barrier or lock prefix.

Add _local bitops for this purpose: define them
as non-atomics for x86 and (for now) atomics for
everyone else.

Uses are not restricted to virtualization: they
might be useful to communicate with an interrupt
handler if we know that it's running on the same CPU.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Arnd Bergmann a...@arndb.de
---

Changes from v1:
- Fix x86 version so it includes optimization barriers
  as suggested by Peter Anvin
- Fix documentation as suggested by Avi Kivity
- Remove duplicate files as suggested by Arnd Bergmann

Link to discussion preceding v1:
http://www.spinics.net/lists/kvm/msg72241.html



 Documentation/atomic_ops.txt  |   19 
 arch/x86/include/asm/bitops.h |  132 +
 include/asm-generic/bitops.h  |1 +
 include/asm-generic/bitops/local-atomic.h |  114 +
 include/linux/bitops.h|8 ++
 5 files changed, 274 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/bitops/local-atomic.h

diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
index 27f2b21..b45eb12 100644
--- a/Documentation/atomic_ops.txt
+++ b/Documentation/atomic_ops.txt
@@ -520,6 +520,25 @@ The __clear_bit_unlock version is non-atomic, however it still implements
 unlock barrier semantics. This can be useful if the lock itself is protecting
 the other bits in the word.
 
+Local versions of the bitmask operations are also provided.  They are used in
+contexts where the operations need to be performed atomically with respect to
+the local CPU, but no other CPU accesses the bitmask.  This assumption makes it
+possible to avoid the need for SMP protection and use less expensive atomic
+operations in the implementation.
+They have names similar to the above bitmask operation interfaces,
+except that _local is appended to the interface name.
+
+   void set_bit_local(unsigned long nr, volatile unsigned long *addr);
+   void clear_bit_local(unsigned long nr, volatile unsigned long *addr);
+   void change_bit_local(unsigned long nr, volatile unsigned long *addr);
+   int test_and_set_bit_local(unsigned long nr, volatile unsigned long *addr);
+   int test_and_clear_bit_local(unsigned long nr, volatile unsigned long *addr);
+   int test_and_change_bit_local(unsigned long nr, volatile unsigned long *addr);
+
+These local variants are useful for example if the bitmask may be accessed from
+a local interrupt, or from a hypervisor on the same CPU if running in a VM.
+These local variants also do not have any special memory barrier semantics.
+
 Finally, there are non-atomic versions of the bitmask operations
 provided.  They are used in contexts where some other higher-level SMP
 locking scheme is being used to protect the bitmask, and thus less
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index b97596e..4e6d900 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -41,6 +41,20 @@
 #define CONST_MASK_ADDR(nr, addr)	BITOP_ADDR((void *)(addr) + ((nr)>>3))
 #define CONST_MASK(nr)			(1 << ((nr) & 7))
 
+/*
+ * x86 has its own version of local atomic operations.  These operations
+ * change memory using a single instruction, but give no atomicity or ordering
+ * guarantees if result observed from another CPU.  Atomicity is guaranteed if
+ * result is observed from the same CPU, e.g. from a local interrupt, or a
+ * hypervisor if running in a VM.
+ * Atomicity is not guaranteed across CPUs: if two examples of these operations
+ * race on different CPUs, one can appear to succeed but actually fail.
+ * To get atomicity guarantees across CPUs use non-local atomics instead.
+ * These operations can be reordered. No memory barrier is implied.
+ */
+
+#define ARCH_HAS_BITOPS_LOCAL_ATOMIC 1
+
 /**
  * set_bit - Atomically set a bit in memory
  * @nr: the bit to set
@@ -85,6 +99,19 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
 }
 
 /**
+ * set_bit_local - Atomically set a bit in memory
+ * @nr: the bit to set
+ * @addr: the address to start counting from
+ *
+ * This operation is atomic with respect to local CPU only. No memory barrier
+ * is implied.
+ */
+static inline void set_bit_local(int nr, volatile unsigned long *addr)
+{
+   asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
+}
+
+/**
  * clear_bit - Clears a bit in memory
  * @nr: Bit to clear
  * @addr: Address to start counting from
@@ -149,6 +176,30 @@ static inline void __clear_bit_unlock(unsigned nr, volatile unsigned long *addr)
 #define smp_mb__after_clear_bit()  barrier()
 
 /**
+ * clear_bit_local - Clears a bit in memory
+ * @nr: Bit to clear
+ * 

Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread Michael S. Tsirkin
On Wed, May 09, 2012 at 09:24:41AM -0700, H. Peter Anvin wrote:
 On 05/09/2012 08:47 AM, Michael S. Tsirkin wrote:
  
  By the way, clear_bit on x86 does not seem to contain
  an optimization barrier - is my reading correct?
  Lock prefix does not affect the compiler, right?
 
 Yes, as it clearly states in the comment:
 
  * clear_bit() is atomic and may not be reordered.  However, it does
  * not contain a memory barrier, so if it is used for locking purposes,
  * you should call smp_mb__before_clear_bit() and/or
 smp_mb__after_clear_bit()
  * in order to ensure changes are visible on other processors.
 
 There is clear_bit_unlock() which has the barrier semantics.
 
   -hpa

Well it talks about a memory barrier, not an
optimization barrier.

If compiler reorders code, changes will appear in
the wrong order on the current processor,
not just on other processors, no?

Sorry if I'm confused about this point, this is
what Documentation/atomic_ops.txt made me believe:
quote
For example consider the following code:

	while (a > 0)
		do_something();

If the compiler can prove that do_something() does not store to the
variable a, then the compiler is within its rights transforming this to
the following:

	tmp = a;
	if (a > 0)
		for (;;)
			do_something();

/quote
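
[Editor's note: the counterpart fix, sketched here for illustration: an
optimization barrier in the loop body forbids the transformation quoted
above, because the compiler must assume memory changed and reload 'a':]

	while (a > 0) {
		do_something();
		barrier();	/* or use ACCESS_ONCE(a) in the condition */
	}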
 
 -- 
 H. Peter Anvin, Intel Open Source Technology Center
 I work for Intel.  I don't speak on their behalf.


Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread H. Peter Anvin
On 05/09/2012 09:36 AM, Michael S. Tsirkin wrote:
 
 Well it talks about a memory barrier, not an
 optimization barrier.
 

Same thing.

 If compiler reorders code, changes will appear in
 the wrong order on the current processor,
 not just on other processors, no?

Yes.

For your _local I would just copy the atomic bitops but remove the locks
in most cases.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread Michael S. Tsirkin
On Wed, May 09, 2012 at 09:45:57AM -0700, H. Peter Anvin wrote:
 On 05/09/2012 09:36 AM, Michael S. Tsirkin wrote:
  
  Well it talks about a memory barrier, not an
  optimization barrier.
  
 
 Same thing.

I see. So it really should say 'any barrier', right?
Documentation/atomic_ops.txt goes to great length
to distinguish between the two and we probably
should not confuse things.

  If compiler reorders code, changes will appear in
  the wrong order on the current processor,
  not just on other processors, no?
 
 Yes.

So this seems to contradict what the comment says:

clear_bit() is atomic and may not be reordered.
and you say compiler *can* reorder it, and below

 you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
 in order to ensure changes are visible on other processors.

and in fact this is not enough, you also need to call
barrier() to ensure changes are visible on the same
processor in the correct order.

 For your _local I would just copy the atomic bitops but remove the locks
 in most cases.
 
   -hpa

Right, I sent v2 that does exactly this.

My question about documentation for change_bit
is an unrelated one: to me, it looks like the documentation for
change_bit does not match the implementation, or at least is somewhat
confusing.

 -- 
 H. Peter Anvin, Intel Open Source Technology Center
 I work for Intel.  I don't speak on their behalf.


Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread Andrew Morton
On Wed, 9 May 2012 16:45:29 +0300
Michael S. Tsirkin m...@redhat.com wrote:

 kvm needs to update some hypervisor variables atomically
 in a sense that the operation can't be interrupted
 in the middle. However the hypervisor always runs
 on the same CPU so it does not need any memory
 barrier or lock prefix.

Well.  It adds more complexity, makes the kernel harder to understand
and maintain and introduces more opportunities for developers to add
bugs.  So from that point of view, the best way of handling this patch
is to delete it.

Presumably the patch offers some benefit to offset all those costs. 
But you didn't tell us what that benefit is, so we cannot make
a decision.

IOW: numbers, please.  Convincing ones, for realistic test cases.


Secondly: can KVM just use __set_bit() and friends?  I suspect those
interfaces happen to meet your requirements.  At least on architectures
you care about.



Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread H. Peter Anvin
On 05/09/2012 12:19 PM, Andrew Morton wrote:
 
 Secondly: can KVM just use __set_bit() and friends?  I suspect those
 interfaces happen to meet your requirements.  At least on architectures
 you care about.
 

__set_bit() and friends right now don't have optimization barriers,
meaning the compiler is free to move them around.

-hpa
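
[Editor's note: an illustration of the point, with hypothetical
structure and names, using __test_and_clear_bit(), whose x86 asm at
the time had no "memory" clobber: the compiler knows the asm touches
only the flags word, so it may reorder it against the neighbouring
plain store.]

	struct shared_area {
		unsigned long cmd;	/* inspected by same-CPU hypervisor/IRQ */
		unsigned long flags;
	};

	static int consume(struct shared_area *s)
	{
		int pending;

		pending = __test_and_clear_bit(0, &s->flags);
		s->cmd = 0;	/* plain store: may be reordered with the btr */
		return pending;
	}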


Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)

2012-05-09 Thread Eduardo Habkost
On Wed, May 09, 2012 at 12:38:37PM +0300, Gleb Natapov wrote:
 On Wed, May 09, 2012 at 11:05:58AM +0200, Alexander Graf wrote:
  
  On 09.05.2012, at 10:51, Gleb Natapov wrote:
  
   On Wed, May 09, 2012 at 10:42:26AM +0200, Alexander Graf wrote:
   
   
   On 09.05.2012, at 10:14, Gleb Natapov g...@redhat.com wrote:
   
   On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote:
   
   On 08.05.2012, at 22:14, Eduardo Habkost wrote:
   
   On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote:
   On 07.05.2012, at 20:21, Eduardo Habkost wrote:
   
   
   Andre? Are you able to help to answer the question below?
   
   I would like to clarify what's the expected behavior of -cpu host 
   to
   be able to continue working on it. I believe the code will need to 
   be
   fixed on either case, but first we need to figure out what are the
   expectations/requirements, to know _which_ changes will be needed.
   
   
   On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote:
   (CCing Andre Przywara, in case he can help to clarify what's the
   expected meaning of -cpu host)
   
   [...]
   I am not sure I understand what you are proposing. Let me explain 
   the
   use case I am thinking about:
   
   - Feature FOO is of type (A) (e.g. just a new instruction set that
   doesn't require additional userspace support)
   - User has a Qemu version that doesn't know anything about feature 
   FOO
   - User gets a new CPU that supports feature FOO
   - User gets a new kernel that supports feature FOO (i.e. has FOO in
   GET_SUPPORTED_CPUID)
   - User does _not_ upgrade Qemu.
   - User expects to get feature FOO enabled if using -cpu host, 
   without
   upgrading Qemu.
   
   The problem here is: to support the above use-case, userspace need 
   a
   probing mechanism that can differentiate _new_ (previously unknown)
   features that are in group (A) (safe to blindly enable) from 
   features
   that are in group (B) (that can't be enabled without an userspace
   upgrade).
   
   In short, it becomes a problem if we consider the following case:
   
   - Feature BAR is of type (B) (it can't be enabled without extra
   userspace support)
   - User has a Qemu version that doesn't know anything about feature 
   BAR
   - User gets a new CPU that supports feature BAR
   - User gets a new kernel that supports feature BAR (i.e. has BAR in
   GET_SUPPORTED_CPUID)
   - User does _not_ upgrade Qemu.
   - User simply shouldn't get feature BAR enabled, even if using 
   -cpu
   host, otherwise Qemu would break.
   
   If userspace always limited itself to features it knows about, it 
   would
   be really easy to implement the feature without any new probing
   mechanism from the kernel. But that's not how I think users expect 
   -cpu
   host to work. Maybe I am wrong, I don't know. I am CCing Andre, 
   who
   introduced the -cpu host feature, in case he can explain what's 
   the
   expected semantics on the cases above.
   
   Can you think of any feature that'd go into category B?
   
   - TSC-deadline: can't be enabled unless userspace takes care to enable
   the in-kernel irqchip.
   
   The kernel can check if in-kernel irqchip has it enabled and otherwise 
   mask it out, no?
   
   How kernel should know that userspace does not emulate it?
   
   You have to enable the in-kernel apic to use it, at which point the 
   kernel knows it's in use, right?
   
   
   - x2apic: ditto.
   
   Same here. For user space irqchip the kernel side doesn't care. If 
   in-kernel APIC is enabled, check for its capabilities.
   
   Same here.
   
   Well, technically both of those features can't be implemented in
   userspace right now since MSRs are terminated in the kernel, but I
   
   Doesn't sound like the greatest design - unless you deprecate the 
   non-in-kernel apic case.
   
   You mean terminating MSRs in kernel does not sound like the greatest
   design? I do not disagree. That is why IMO kernel can't filter out
   TSC-deadline and x2apic like you suggest.
  
  I still don't see why it can't.
  
  Imagine we would filter TSC-deadline and x2apic by default in the kernel - 
  they are not known to exist yet.
  Now, we implement TSC-deadline in the kernel. We still filter
   TSC-deadline out in GET_SUPPORTED_CPUID in the kernel. But we provide
  an interface to user space that says call me to enable TSC-deadline
  CPUID, but only if you're using the in-kernel apic

We have that interface already, it is called KVM_SET_CPUID.  :-)

  New user space calls that ioctl when it's using the in-kernel apic, it 
  doesn't when it's using the user space apic.
  Old user space doesn't call that ioctl.
 First of all we already have TSC-deadline in GET_SUPPORTED_CPUID without
 any additional ioctls. And second I do not see why we need additional
 ioctls here.

We don't have TSC-deadline set today (and that's what started this
thread), but we have x2apic. So what you say is true for x2apic, anyway.

 Hmm, so may be I 
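
[Editor's note: a sketch of the userspace-side filtering being debated:
build the guest CPUID from KVM_GET_SUPPORTED_CPUID and mask the
irqchip-dependent bits when the in-kernel APIC is off. The KVM structs
and CPUID bit positions are real; the policy function and its name are
illustrative only.]

	#include <stdbool.h>
	#include <linux/kvm.h>

	#define CPUID_ECX_X2APIC	(1u << 21)	/* CPUID.1:ECX */
	#define CPUID_ECX_TSC_DEADLINE	(1u << 24)	/* CPUID.1:ECX */

	static void filter_cpuid(struct kvm_cpuid2 *cpuid, bool irqchip_in_kernel)
	{
		unsigned int i;

		for (i = 0; i < cpuid->nent; i++) {
			struct kvm_cpuid_entry2 *e = &cpuid->entries[i];

			if (e->function == 1 && !irqchip_in_kernel)
				e->ecx &= ~(CPUID_ECX_X2APIC |
					    CPUID_ECX_TSC_DEADLINE);
		}
	}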

Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread Michael S. Tsirkin
On Wed, May 09, 2012 at 12:19:40PM -0700, Andrew Morton wrote:
 On Wed, 9 May 2012 16:45:29 +0300
 Michael S. Tsirkin m...@redhat.com wrote:
 
  kvm needs to update some hypervisor variables atomically
  in a sense that the operation can't be interrupted
  in the middle. However the hypervisor always runs
  on the same CPU so it does not need any memory
  barrier or lock prefix.
 
 Well.  It adds more complexity, makes the kernel harder to understand
 and maintain and introduces more opportunities for developers to add
 bugs.  So from that point of view, the best way of handling this patch
 is to delete it.
 
 Presumably the patch offers some benefit to offset all those costs. 
 But you didn't tell us what that benefit is, so we cannot make
 a decision.
 
 IOW: numbers, please.  Convincing ones, for realistic test cases.

I can try but in practice it's not an optimization.
What kvm needs is a guarantee that a memory update will be done in a
single instruction.
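
[Editor's note: to illustrate the requirement, a sketch with
hypothetical names. The flag word is read and written by the hypervisor
at arbitrary instruction boundaries on this CPU, so the
read-modify-write must be one instruction (a single btr on x86), yet no
lock prefix is needed because no other CPU ever touches the word.]

	static unsigned long pv_event_flags;	/* shared with hypervisor */

	static bool consume_pv_event(int nr)
	{
		return __test_and_clear_bit(nr, &pv_event_flags);
	}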

 Secondly: can KVM just use __set_bit() and friends?  I suspect those
 interfaces happen to meet your requirements.  At least on architectures
 you care about.

Sigh. I started with this, but then Avi Kivity said that he's worried
about someone changing __test_and_clear_bit on x86
such that the change won't be atomic (single instruction) anymore.
So I put inline asm into kvm.c. This drew comment from Peter
that maintaining separate asm code in kvm.c adds too much
maintainance overhead so I should implement _local, add
asm-generic fallback and put it all in generic code.

In practice ATM any of the above will work. We probably don't even need
to add barrier() calls since what we do afterwards is apic access which
has an optimization barrier anyway.  But I'm fine with adding them in
there just in case if that's what people want.

However, since we've come full circle, I'd like to have a discussion
on what, exactly, is acceptable to all maintainers.
Avi, Andrew, Peter, could you please comment?

-- 
MST


Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread H. Peter Anvin
On 05/09/2012 01:07 PM, Michael S. Tsirkin wrote:
 
 In practice ATM any of the above will work. We probably don't even need
 to add barrier() calls since what we do afterwards is apic access which
 has an optimization barrier anyway.  But I'm fine with adding them in
 there just in case if that's what people want.
 

If you have the optimization barrier anyway, then I'd be fine with you
just using __test_and_clear_bit() and add to a comment in
arch/x86/include/asm/bitops.h that KVM needs it to be locally atomic.

-hpa


Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread Michael S. Tsirkin
On Wed, May 09, 2012 at 01:10:04PM -0700, H. Peter Anvin wrote:
 On 05/09/2012 01:07 PM, Michael S. Tsirkin wrote:
  
  In practice ATM any of the above will work. We probably don't even need
  to add barrier() calls since what we do afterwards is apic access which
  has an optimization barrier anyway.  But I'm fine with adding them in
  there just in case if that's what people want.
  
 
 If you have the optimization barrier anyway, then I'd be fine with you
 just using __test_and_clear_bit() and add to a comment in
 arch/x86/include/asm/bitops.h that KVM needs it to be locally atomic.
 
   -hpa

Sounds good. Avi?


Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)

2012-05-09 Thread Alexander Graf

On 09.05.2012, at 21:38, Eduardo Habkost wrote:

 On Wed, May 09, 2012 at 12:38:37PM +0300, Gleb Natapov wrote:
 On Wed, May 09, 2012 at 11:05:58AM +0200, Alexander Graf wrote:
 
 On 09.05.2012, at 10:51, Gleb Natapov wrote:
 
 On Wed, May 09, 2012 at 10:42:26AM +0200, Alexander Graf wrote:
 
 
 On 09.05.2012, at 10:14, Gleb Natapov g...@redhat.com wrote:
 
 On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote:
 
 On 08.05.2012, at 22:14, Eduardo Habkost wrote:
 
 On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote:
 On 07.05.2012, at 20:21, Eduardo Habkost wrote:
 
 
 Andre? Are you able to help to answer the question below?
 
 I would like to clarify what's the expected behavior of -cpu host 
 to
 be able to continue working on it. I believe the code will need to be
 fixed on either case, but first we need to figure out what are the
 expectations/requirements, to know _which_ changes will be needed.
 
 
 On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote:
 (CCing Andre Przywara, in case he can help to clarify what's the
 expected meaning of -cpu host)
 
 [...]
 I am not sure I understand what you are proposing. Let me explain 
 the
 use case I am thinking about:
 
 - Feature FOO is of type (A) (e.g. just a new instruction set that
 doesn't require additional userspace support)
 - User has a Qemu version that doesn't know anything about feature 
 FOO
 - User gets a new CPU that supports feature FOO
 - User gets a new kernel that supports feature FOO (i.e. has FOO in
 GET_SUPPORTED_CPUID)
 - User does _not_ upgrade Qemu.
 - User expects to get feature FOO enabled if using -cpu host, 
 without
 upgrading Qemu.
 
 The problem here is: to support the above use-case, userspace need a
 probing mechanism that can differentiate _new_ (previously unknown)
 features that are in group (A) (safe to blindly enable) from 
 features
 that are in group (B) (that can't be enabled without an userspace
 upgrade).
 
 In short, it becomes a problem if we consider the following case:
 
 - Feature BAR is of type (B) (it can't be enabled without extra
 userspace support)
 - User has a Qemu version that doesn't know anything about feature 
 BAR
 - User gets a new CPU that supports feature BAR
 - User gets a new kernel that supports feature BAR (i.e. has BAR in
 GET_SUPPORTED_CPUID)
 - User does _not_ upgrade Qemu.
 - User simply shouldn't get feature BAR enabled, even if using -cpu
 host, otherwise Qemu would break.
 
 If userspace always limited itself to features it knows about, it 
 would
 be really easy to implement the feature without any new probing
 mechanism from the kernel. But that's not how I think users expect 
 -cpu
 host to work. Maybe I am wrong, I don't know. I am CCing Andre, who
 introduced the -cpu host feature, in case he can explain what's 
 the
 expected semantics on the cases above.
 
 Can you think of any feature that'd go into category B?
 
 - TSC-deadline: can't be enabled unless userspace takes care to enable
 the in-kernel irqchip.
 
 The kernel can check if in-kernel irqchip has it enabled and otherwise 
 mask it out, no?
 
 How should the kernel know that userspace does not emulate it?
 
 You have to enable the in-kernel apic to use it, at which point the 
 kernel knows it's in use, right?
 
 
 - x2apic: ditto.
 
 Same here. For user space irqchip the kernel side doesn't care. If 
 in-kernel APIC is enabled, check for its capabilities.
 
 Same here.
 
 Well, technically both of those features can't be implemented in
 userspace right now since MSRs are terminated in the kernel, but I
 
 Doesn't sound like the greatest design - unless you deprecate the 
 non-in-kernel apic case.
 
 You mean terminating MSRs in kernel does not sound like the greatest
 design? I do not disagree. That is why IMO kernel can't filter out
 TSC-deadline and x2apic like you suggest.
 
 I still don't see why it can't.
 
 Imagine we filtered TSC-deadline and x2apic by default in the kernel -
 they are not known to exist yet.
 Now, we implement TSC-deadline in the kernel. We still filter
 TSC-deadline out in GET_SUPPORTED_CPUID in the kernel. But we provide
 an interface to user space that says "call me to enable TSC-deadline
 CPUID, but only if you're using the in-kernel apic".
 
 We have that interface already, it is called KVM_SET_CPUID.  :-)
 
 New user space calls that ioctl when it's using the in-kernel apic; it
 doesn't when it's using the user space apic.
 Old user space doesn't call that ioctl.
 First of all, we already have TSC-deadline in GET_SUPPORTED_CPUID without
 any additional ioctls. And second, I do not see why we need additional
 ioctls here.
 
 We don't have TSC-deadline set today (and that's what started this
 thread), but we have x2apic. So what you say is true for x2apic, anyway.
 
 Hmm, so maybe I misunderstood you. You propose to mask TSC-deadline and
 x2apic out from GET_SUPPORTED_CPUID if the irq chip is not in kernel, not
 from KVM_SET_CPUID? For those two features it may make 
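
For readers joining the thread: the probing interface referred to
throughout is the KVM_GET_SUPPORTED_CPUID ioctl on /dev/kvm. A minimal
sketch of dumping it from userspace (error handling trimmed; the entry
count of 100 is an arbitrary assumption):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
    int i, nent = 100;                  /* arbitrary; E2BIG means grow it */
    int kvm = open("/dev/kvm", O_RDWR);
    struct kvm_cpuid2 *cpuid = calloc(1, sizeof(*cpuid) +
                                      nent * sizeof(struct kvm_cpuid_entry2));

    cpuid->nent = nent;
    if (kvm < 0 || ioctl(kvm, KVM_GET_SUPPORTED_CPUID, cpuid) < 0) {
        perror("KVM_GET_SUPPORTED_CPUID");
        return 1;
    }
    for (i = 0; i < cpuid->nent; i++)
        printf("leaf 0x%08x index %u: eax=%08x ebx=%08x ecx=%08x edx=%08x\n",
               cpuid->entries[i].function, cpuid->entries[i].index,
               cpuid->entries[i].eax, cpuid->entries[i].ebx,
               cpuid->entries[i].ecx, cpuid->entries[i].edx);
    return 0;
}

Userspace then intersects this with the bits it wants to expose (via
KVM_SET_CPUID2); the open question above is how such a mask could
distinguish group (A) from group (B) bits it has never heard of.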

Re: [PATCH 1/4] KVM: Add APIs for unlocked TLB flush

2012-05-09 Thread Marcelo Tosatti
On Tue, May 08, 2012 at 03:39:43PM +0300, Avi Kivity wrote:
 On 05/08/2012 05:25 AM, Marcelo Tosatti wrote:
   
We need to call kvm_cond_flush_remote_tlbs in rmap_write_protect
unconditionally?
   
   Yes, that's the point.  We change
   
  spin_lock(mmu_lock)
  conditionally flush the tlb
  spin_unlock(mmu_lock)
   
   to
   
 spin_lock(mmu_lock)
 conditionally mark the tlb as dirty
 spin_unlock(mmu_lock)
 kvm_cond_flush_remote_tlbs()
   
   but that means the entire codebase has to be converted.
 
  Is there any other site which expects sptes and TLBs to be in sync, 
  other than rmap_write_protect?  
 
 I wouldn't say rmap_write_protect() expects sptes and TLBs to be in
 sync.  Rather its callers.
 
  Please convert the flush_remote_tlbs at the end of set_spte (RW->RO) to
  mark_dirty.
 
 I'd like to take an incremental approach, since there are many paths.  I
 don't have a concrete plan though.
 
  Looks good in general (patchset is incomplete though). One thing that
  is annoying is that there is no guarantee of progress for flushed_count
  increment (it can, in theory, always race with a mark_dirty). But since
  that would be only a performance, and not correctness, aspect, it is
  fine.
 
 We don't have a while () loop in cond_flush(), so it won't be slowed
 down by the race; whoever caused the race will have to flush on their
 own, but they do that anyway now.
 
 We can regress if vcpu 0 marks the tlb as dirty, and then vcpu 0 and 1
 simultaneously flush, even though vcpu 1 did nothing to deserve it.  I
 don't see a way around it except to hope it's a rare event.
 
  It has the advantage that it requires code to explicitly document where
  the TLB must be flushed and the sites which expect sptes to be in sync
  with TLBs.
 
 I'm looking for an idea of how to make the flush in those cases not hold
 mmu_lock.

You can't easily do that for rmap_write_protect; you must check that the
state is unchanged (the write protect operation is still relevant).

Current patchset (with corrections to comments by Xiao) is good enough
already IMO.
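
To make the scheme concrete, the mark-dirty/conditional-flush pair could
be sketched as below. The field names are assumptions rather than the
posted patchset, and the unsynchronized update at the end is exactly the
performance-only race discussed above:

void kvm_mark_tlb_dirty(struct kvm *kvm)
{
	/* called under mmu_lock, after an spte was made more restrictive */
	kvm->tlbs_dirty_count++;
}

void kvm_cond_flush_remote_tlbs(struct kvm *kvm)
{
	long dirty = ACCESS_ONCE(kvm->tlbs_dirty_count);

	/* may be called outside mmu_lock */
	if (dirty != kvm->tlbs_flushed_count) {
		kvm_flush_remote_tlbs(kvm);
		/*
		 * Unsynchronized on purpose: losing this store to a
		 * concurrent caller costs at most an extra flush later,
		 * never a missed one.
		 */
		kvm->tlbs_flushed_count = dirty;
	}
}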


[PATCH next v2 03/74] target-i386: Pass X86CPU to do_cpu_{init,sipi}()

2012-05-09 Thread Andreas Färber
Allows us to use cpu_reset() in place of cpu_state_reset().

Signed-off-by: Andreas Färber afaer...@suse.de
---
 cpu-exec.c   |4 ++--
 target-i386/cpu.h|4 ++--
 target-i386/helper.c |   13 -
 target-i386/kvm.c|6 --
 4 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 0344cd5..fbb39cb 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -287,11 +287,11 @@ int cpu_exec(CPUArchState *env)
 #if defined(TARGET_I386)
             if (interrupt_request & CPU_INTERRUPT_INIT) {
                 svm_check_intercept(env, SVM_EXIT_INIT);
-                do_cpu_init(env);
+                do_cpu_init(x86_env_get_cpu(env));
                 env->exception_index = EXCP_HALTED;
                 cpu_loop_exit(env);
             } else if (interrupt_request & CPU_INTERRUPT_SIPI) {
-                do_cpu_sipi(env);
+                do_cpu_sipi(x86_env_get_cpu(env));
             } else if (env->hflags2 & HF2_GIF_MASK) {
                 if ((interrupt_request & CPU_INTERRUPT_SMI) &&
                     !(env->hflags & HF_SMM_MASK)) {
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index b5b9a50..4bff61d 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -1051,8 +1051,8 @@ static inline void cpu_get_tb_cpu_state(CPUX86State *env, target_ulong *pc,
         (env->eflags & (IOPL_MASK | TF_MASK | RF_MASK | VM_MASK));
 }
 
-void do_cpu_init(CPUX86State *env);
-void do_cpu_sipi(CPUX86State *env);
+void do_cpu_init(X86CPU *cpu);
+void do_cpu_sipi(X86CPU *cpu);
 
 #define MCE_INJECT_BROADCAST    1
 #define MCE_INJECT_UNCOND_AO    2
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 0b22582..fe181be 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1187,27 +1187,30 @@ CPUX86State *cpu_x86_init(const char *cpu_model)
 }
 
 #if !defined(CONFIG_USER_ONLY)
-void do_cpu_init(CPUX86State *env)
+void do_cpu_init(X86CPU *cpu)
 {
+    CPUX86State *env = &cpu->env;
     int sipi = env->interrupt_request & CPU_INTERRUPT_SIPI;
     uint64_t pat = env->pat;
 
-    cpu_state_reset(env);
+    cpu_reset(CPU(cpu));
     env->interrupt_request = sipi;
     env->pat = pat;
     apic_init_reset(env->apic_state);
     env->halted = !cpu_is_bsp(env);
 }
 
-void do_cpu_sipi(CPUX86State *env)
+void do_cpu_sipi(X86CPU *cpu)
 {
+    CPUX86State *env = &cpu->env;
+
     apic_sipi(env->apic_state);
 }
 #else
-void do_cpu_init(CPUX86State *env)
+void do_cpu_init(X86CPU *cpu)
 {
 }
-void do_cpu_sipi(CPUX86State *env)
+void do_cpu_sipi(X86CPU *cpu)
 {
 }
 #endif
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index e74a9e4..0d0d8f6 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1698,6 +1698,8 @@ void kvm_arch_post_run(CPUX86State *env, struct kvm_run *run)
 
 int kvm_arch_process_async_events(CPUX86State *env)
 {
+    X86CPU *cpu = x86_env_get_cpu(env);
+
     if (env->interrupt_request & CPU_INTERRUPT_MCE) {
         /* We must not raise CPU_INTERRUPT_MCE if it's not supported. */
         assert(env->mcg_cap);
@@ -1732,11 +1734,11 @@ int kvm_arch_process_async_events(CPUX86State *env)
     }
     if (env->interrupt_request & CPU_INTERRUPT_INIT) {
         kvm_cpu_synchronize_state(env);
-        do_cpu_init(env);
+        do_cpu_init(cpu);
     }
     if (env->interrupt_request & CPU_INTERRUPT_SIPI) {
         kvm_cpu_synchronize_state(env);
-        do_cpu_sipi(env);
+        do_cpu_sipi(cpu);
     }
     if (env->interrupt_request & CPU_INTERRUPT_TPR) {
         env->interrupt_request &= ~CPU_INTERRUPT_TPR;
-- 
1.7.7
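
As background for reviewers: x86_env_get_cpu() recovers the QOM object
from the embedded CPUX86State, along these lines (a sketch; the exact
QEMU definition may differ):

static inline X86CPU *x86_env_get_cpu(CPUX86State *env)
{
    /* X86CPU embeds its CPUX86State as the 'env' field, so the
     * containing object can be recovered with container_of() */
    return container_of(env, X86CPU, env);
}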





[PATCH] KVM: x86: Implement PCID/INVPCID for guests with EPT

2012-05-09 Thread Mao, Junjie
This patch handles PCID/INVPCID for guests.

Process-context identifiers (PCIDs) are a facility by which a logical processor
may cache information for multiple linear-address spaces, so that the processor
may retain cached information when software switches to a different
linear-address space. Refer to Section 4.10.1 in the Intel Software
Developer's Manual, Volume 3A, for details.

For guests with EPT, the PCID feature is enabled and INVPCID behaves as it
does when running natively.
For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD.

Signed-off-by: Mao, Junjie junjie@intel.com
---
 arch/x86/include/asm/cpufeature.h  |1 +
 arch/x86/include/asm/kvm_host.h|3 +-
 arch/x86/include/asm/processor-flags.h |2 +
 arch/x86/include/asm/vmx.h |2 +
 arch/x86/kvm/cpuid.c   |6 ++-
 arch/x86/kvm/cpuid.h   |8 
 arch/x86/kvm/svm.c |6 +++
 arch/x86/kvm/vmx.c |   63 ++--
 8 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 8d67d42..1aedbc0 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -203,6 +203,7 @@
 #define X86_FEATURE_SMEP	(9*32+ 7) /* Supervisor Mode Execution Protection */
 #define X86_FEATURE_BMI2	(9*32+ 8) /* 2nd group bit manipulation extensions */
 #define X86_FEATURE_ERMS	(9*32+ 9) /* Enhanced REP MOVSB/STOSB */
+#define X86_FEATURE_INVPCID	(9*32+10) /* INVPCID instruction */
 
 #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 74c9edf..bb9a707 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -52,7 +52,7 @@
 #define CR4_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\
 			  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE     \
-			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR  \
+			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_RDWRGSFS \
 			  | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE))
 
@@ -660,6 +660,7 @@ struct kvm_x86_ops {
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 	int (*get_lpage_level)(void);
 	bool (*rdtscp_supported)(void);
+	bool (*pcid_supported)(void);
 	void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment, bool host);
 
void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index f8ab3ea..aea1d1d 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -44,6 +44,7 @@
  */
 #define X86_CR3_PWT	0x00000008 /* Page Write Through */
 #define X86_CR3_PCD	0x00000010 /* Page Cache Disable */
+#define X86_CR3_PCID_MASK 0x00000fff /* PCID Mask */
 
 /*
  * Intel CPU features in CR4
@@ -61,6 +62,7 @@
 #define X86_CR4_OSXMMEXCPT 0x00000400 /* enable unmasked SSE exceptions */
 #define X86_CR4_VMXE	0x00002000 /* enable VMX virtualization */
 #define X86_CR4_RDWRGSFS 0x00010000 /* enable RDWRGSFS support */
+#define X86_CR4_PCIDE	0x00020000 /* enable PCID support */
 #define X86_CR4_OSXSAVE 0x00040000 /* enable xsave and xrestore */
 #define X86_CR4_SMEP	0x00100000 /* enable SMEP support */
 
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 31f180c..b81525c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -60,6 +60,7 @@
 #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
+#define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
 
 
 #define PIN_BASED_EXT_INTR_MASK                 0x00000001
@@ -281,6 +282,7 @@ enum vmcs_field {
 #define EXIT_REASON_EPT_MISCONFIG   49
 #define EXIT_REASON_WBINVD 54
 #define EXIT_REASON_XSETBV 55
+#define EXIT_REASON_INVPCID58
 
 /*
  * Interruption-information format
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 9fed5be..8d4a361 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -201,6 +201,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 	unsigned f_lm = 0;
 #endif
 	unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
+	unsigned f_pcid = kvm_x86_ops->pcid_supported() ? F(PCID) : 0;
+	unsigned f_invpcid = kvm_x86_ops->pcid_supported() ? F(INVPCID) : 0;
 
/* cpuid 1.edx */
const u32 kvm_supported_word0_x86_features =
@@ -228,7 +230,7 
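
For context, the guest-visible bits exposed here are CPUID.01H:ECX bit 17
(PCID) and CPUID.(EAX=07H,ECX=0):EBX bit 10 (INVPCID). A guest can probe
them along these lines (a sketch assuming GCC's cpuid.h helpers):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    __cpuid(1, eax, ebx, ecx, edx);            /* leaf 1: PCID in ECX[17] */
    printf("PCID:    %s\n", (ecx & (1u << 17)) ? "yes" : "no");

    __cpuid_count(7, 0, eax, ebx, ecx, edx);   /* leaf 7: INVPCID in EBX[10] */
    printf("INVPCID: %s\n", (ebx & (1u << 10)) ? "yes" : "no");
    return 0;
}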

Re: How to determine the backing host physical memory for a given guest ?

2012-05-09 Thread Chegu Vinod

On 5/9/2012 6:46 AM, Avi Kivity wrote:

On 05/09/2012 04:05 PM, Chegu Vinod wrote:

Hello,

On an 8-socket Westmere host I am attempting to run a single guest and
characterize the virtualization overhead for a system-intensive
workload (AIM7-high_systime) as the size of the guest scales (10way/64G,
20way/128G, ... 80way/512G).

To do some comparisons between the native vs. guest runs, I have
been using numactl to control the cpu node & memory node bindings for
the qemu instance.  For larger guest sizes I end up binding across multiple
localities, e.g. for a 40-way guest:

numactl --cpunodebind=0,1,2,3  --membind=0,1,2,3  \
qemu-system-x86_64 -smp 40 -m 262144 \


I understand that actual mappings from a guest virtual address to host physical
address could change.

Is there a way to determine [at a given instant] which host NUMA node is
providing the backing physical memory for the active guest's kernel and
also for the apps actively running in the guest?

Guessing that there is a better way (some tool available?) than just
diff'ing the per-node memory usage from the before and after output of
numactl --hardware on the host.


Not sure if that's what you want, but there's Documentation/vm/pagemap.txt.



Thanks for the pointer Avi ! Will give it a try...

FYI... I tried using the recent version of the crash utility
(http://people.redhat.com/anderson/) with the upstream kvm.git kernel
(3.4.0-rc4+) and it seems to provide VA->PA mappings for a given app
on a live system.


Also, it looks like there is an extension to this crash utility called
qemu-vtop, which is supposed to give the GPA->HVA->HPA mappings. Need to
give this a try... and see if it works.


Thx!
Vinod
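
For anyone following the pagemap.txt pointer above, a minimal sketch of
resolving one virtual address of the current process to its backing page
frame number (reading the PFN bits may require root; mapping a PFN to a
NUMA node is then a comparison against the host's per-node memory ranges):

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long psize = sysconf(_SC_PAGESIZE);
    uintptr_t vaddr = (uintptr_t)&psize;   /* any address of interest */
    int fd = open("/proc/self/pagemap", O_RDONLY);
    uint64_t entry;

    /* one 64-bit entry per page: bit 63 = present, bits 0-54 = PFN */
    pread(fd, &entry, sizeof(entry), (vaddr / psize) * sizeof(entry));
    if (entry & (1ULL << 63))
        printf("vaddr 0x%" PRIxPTR " -> pfn 0x%" PRIx64 "\n",
               vaddr, entry & ((1ULL << 55) - 1));
    return 0;
}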




[PATCH 1/2] KVM: PPC: Book3S: PR: Handle EMUL_ASSIST

2012-05-09 Thread Alexander Graf
In addition to normal privileged instruction traps, we can also receive
emulation assist traps on newer hardware that has the HV bit set.

Handle that one the same way as a privileged instruction, including the
instruction fetching. That way we don't execute stale instructions that
happen to still be left in that field when an emulation assist trap comes.

This fixes -M mac99 / -M g3beige on p7 bare metal for me.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_segment.S |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_segment.S 
b/arch/powerpc/kvm/book3s_segment.S
index 8b2fc66..e446658 100644
--- a/arch/powerpc/kvm/book3s_segment.S
+++ b/arch/powerpc/kvm/book3s_segment.S
@@ -252,6 +252,12 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
 	beq	ld_last_prev_inst
 	cmpwi	r12, BOOK3S_INTERRUPT_ALIGNMENT
 	beq-	ld_last_inst
+#ifdef CONFIG_PPC64
+BEGIN_FTR_SECTION
+	cmpwi	r12, BOOK3S_INTERRUPT_H_EMUL_ASSIST
+	beq-	ld_last_inst
+END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
+#endif
 
b   no_ld_last_inst
 
-- 
1.6.0.2



[PATCH 2/2] KVM: PPC: Book3S: PR: Fix hsrr code

2012-05-09 Thread Alexander Graf
When jumping back into the kernel, to code that knows it will be using
HSRR registers instead of SRR registers, we need to make sure we pass it
all the information on where to jump to in the HSRR registers.
Unfortunately, we used r10 to store the information to distinguish between
the HSRR and SRR case. That register got clobbered in between though,
rendering the later comparison invalid.

Instead, let's use cr1 to store this information. That way we don't
need yet another register and everyone's happy.

This fixes PR KVM on POWER7 bare metal for me.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_segment.S |7 +++
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_segment.S 
b/arch/powerpc/kvm/book3s_segment.S
index e446658..798491a 100644
--- a/arch/powerpc/kvm/book3s_segment.S
+++ b/arch/powerpc/kvm/book3s_segment.S
@@ -198,8 +198,8 @@ kvmppc_interrupt:
/* Save guest PC and MSR */
 #ifdef CONFIG_PPC64
 BEGIN_FTR_SECTION
-   mr  r10, r12
-   andi.   r0,r12,0x2
+   andi.   r0, r12, 0x2
+   cmpwi   cr1, r0, 0
beq 1f
mfspr   r3,SPRN_HSRR0
mfspr   r4,SPRN_HSRR1
@@ -346,8 +346,7 @@ no_dcbz32_off:
 
 #ifdef CONFIG_PPC64
 BEGIN_FTR_SECTION
-   andi.   r0,r10,0x2
-   beq 1f
+   beq cr1, 1f
mtspr   SPRN_HSRR1, r6
mtspr   SPRN_HSRR0, r8
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
-- 
1.6.0.2



qemu kvm: Accept PCID feature

2012-05-09 Thread Mao, Junjie
This patch makes Qemu accept the PCID feature when it is specified in the
configuration or on the command line.

Signed-off-by: Mao, Junjie junjie@intel.com
---
target-i386/cpuid.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
index b9bfeaf..bc061b0 100644
--- a/target-i386/cpuid.c
+++ b/target-i386/cpuid.c
@@ -48,7 +48,7 @@ static const char *ext_feature_name[] = {
     "ds_cpl", "vmx", "smx", "est",
     "tm2", "ssse3", "cid", NULL,
     "fma", "cx16", "xtpr", "pdcm",
-    NULL, NULL, "dca", "sse4.1|sse4_1",
+    NULL, "pcid", "dca", "sse4.1|sse4_1",
     "sse4.2|sse4_2", "x2apic", "movbe", "popcnt",
     NULL, "aes", "xsave", "osxsave",
     "avx", NULL, NULL, "hypervisor",
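
With this applied, the bit can also be requested explicitly on the
command line, along the lines of the following illustrative invocation
(not taken from the patch):

    qemu-system-x86_64 -cpu qemu64,+pcid ...

assuming the host kernel actually reports PCID in GET_SUPPORTED_CPUID.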


