Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-25 Thread Avi Kivity


On 06/24/2014 07:45 PM, Marcelo Tosatti wrote:

On Sun, Jun 22, 2014 at 09:02:25PM +0200, Andi Kleen wrote:

First, it's not sufficient to pin the debug store area, you also
have to pin the guest page tables that are used to map the debug
store.  But even if you do that, as soon as the guest fork()s, it
will create a new pgd which the host will be free to swap out.  The
processor can then attempt a PEBS store to an unmapped address which
will fail, even though the guest is configured correctly.

That's a good point. You're right of course.

The only way I can think around it would be to intercept CR3 writes
while PEBS is active and always pin all the table pages leading
to the PEBS buffer. That's slow, but should be only needed
while PEBS is running.

-Andi

Suppose that can be done separately from the pinned spte patchset.
And it requires accounting into mlock limits as well, as noted.

One set of pagetables per pinned virtual address leading down to the
last translations is sufficient per-vcpu.


Or 4, and use the CR3 exit filter to prevent vmexits among the last 4 
LRU CR3 values.
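
For concreteness, a minimal sketch of that idea, using the architectural
CR3-target VMCS fields (CR3_TARGET_VALUE0..3 and CR3_TARGET_COUNT from
arch/x86/include/asm/vmx.h); the vmx->cr3_lru[] bookkeeping is an
assumption, not existing KVM code:

/* Keep the guest's 4 most recently loaded CR3 values in the VMCS
 * CR3-target list so that reloading any of them does not vmexit.
 * Would be called from the CR3-write intercept while PEBS is active. */
static void vmx_refresh_cr3_targets(struct vcpu_vmx *vmx, unsigned long cr3)
{
	int i;

	/* move-to-front LRU over the 4 architectural target slots */
	for (i = 0; i < 3; i++)
		if (vmx->cr3_lru[i] == cr3)
			break;
	for (; i > 0; i--)
		vmx->cr3_lru[i] = vmx->cr3_lru[i - 1];
	vmx->cr3_lru[0] = cr3;

	vmcs_writel(CR3_TARGET_VALUE0, vmx->cr3_lru[0]);
	vmcs_writel(CR3_TARGET_VALUE1, vmx->cr3_lru[1]);
	vmcs_writel(CR3_TARGET_VALUE2, vmx->cr3_lru[2]);
	vmcs_writel(CR3_TARGET_VALUE3, vmx->cr3_lru[3]);
	vmcs_write32(CR3_TARGET_COUNT, 4);
}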



Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-24 Thread Marcelo Tosatti
On Sun, Jun 22, 2014 at 09:02:25PM +0200, Andi Kleen wrote:
  First, it's not sufficient to pin the debug store area, you also
  have to pin the guest page tables that are used to map the debug
  store.  But even if you do that, as soon as the guest fork()s, it
  will create a new pgd which the host will be free to swap out.  The
  processor can then attempt a PEBS store to an unmapped address which
  will fail, even though the guest is configured correctly.
 
 That's a good point. You're right of course.
 
 The only way I can think around it would be to intercept CR3 writes
 while PEBS is active and always pin all the table pages leading 
 to the PEBS buffer. That's slow, but should be only needed
 while PEBS is running.
 
 -Andi

Suppose that can be done separately from the pinned spte patchset.
And it requires accounting into mlock limits as well, as noted.

One set of pagetables per pinned virtual address leading down to the
last translations is sufficient per-vcpu.




Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-22 Thread Avi Kivity


On 05/30/2014 04:12 AM, Andi Kleen wrote:

From: Andi Kleen a...@linux.intel.com

PEBS (Precise Event Based Sampling) profiling is very powerful,
allowing improved sampling precision and much additional information,
like address or TSX abort profiling. cycles:p and :pp use PEBS.

This patch enables PEBS profiling in KVM guests.

PEBS writes profiling records to a virtual address in memory. Since
the guest controls the virtual address space, the PEBS record is
delivered directly to the guest buffer. We set up the PEBS state so
that it works correctly. The CPU cannot handle any kind of fault during
these guest writes.

To avoid any problems with guest pages being swapped by the host, we
pin the pages when the PEBS buffer is set up, by intercepting
that MSR.

Typically profilers only set up a single page, so pinning that is not
a big problem. The pinning is currently limited to 17 pages (64K+1).

In theory the guest can change its own page tables after the PEBS
setup. The host has no way to track that with EPT. But if a guest
would do that it could only crash itself. It's not expected
that normal profilers do that.




Talking a bit with Gleb about this, I think this is impossible.

First, it's not sufficient to pin the debug store area, you also have to 
pin the guest page tables that are used to map the debug store.  But 
even if you do that, as soon as the guest fork()s, it will create a new 
pgd which the host will be free to swap out.  The processor can then 
attempt a PEBS store to an unmapped address which will fail, even though 
the guest is configured correctly.




Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-22 Thread Andi Kleen
 First, it's not sufficient to pin the debug store area, you also
 have to pin the guest page tables that are used to map the debug
 store.  But even if you do that, as soon as the guest fork()s, it
 will create a new pgd which the host will be free to swap out.  The
 processor can then attempt a PEBS store to an unmapped address which
 will fail, even though the guest is configured correctly.

That's a good point. You're right of course.

The only way I can think around it would be to intercept CR3 writes
while PEBS is active and always pin all the table pages leading 
to the PEBS buffer. That's slow, but should be only needed
while PEBS is running.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-19 Thread Paolo Bonzini

On 06/10/2014 23:06, Marcelo Tosatti wrote:

 BTW how about general PMU migration? As far as I can tell there
 is no code to save/restore the state for that currently, right?

Paolo wrote support for it, recently. Paolo?


Yes, on the KVM side all that is needed is to special-case MSR reads and
writes that have side effects, for example:


case MSR_CORE_PERF_GLOBAL_STATUS:
	if (msr_info->host_initiated) {
		pmu->global_status = data;
		return 0;
	}
	break; /* RO MSR */
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
	if (!(data & (pmu->global_ctrl_mask & ~(3ull << 62)))) {
		if (!msr_info->host_initiated)
			pmu->global_status &= ~data;
		pmu->global_ovf_ctrl = data;
		return 0;
	}
	break;

Right now this is only needed for writes.

Userspace then can read/write these MSRs, and add them to the migration 
stream.  QEMU has code for that.
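
For reference, a sketch of how userspace reaches such an MSR through the
existing KVM_GET_MSRS ioctl (standard <linux/kvm.h> structures; error
handling elided, and read_guest_msr() is just an illustrative name):

#include <linux/kvm.h>
#include <sys/ioctl.h>

static __u64 read_guest_msr(int vcpu_fd, __u32 index)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entry;
	} m = { .hdr.nmsrs = 1, .entry.index = index };

	/* a later KVM_SET_MSRS of the same value arrives in the kernel
	 * with host_initiated == true, taking the branches above */
	ioctl(vcpu_fd, KVM_GET_MSRS, &m);
	return m.entry.data;
}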


Paolo


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-19 Thread Paolo Bonzini

On 06/02/2014 21:57, Andi Kleen wrote:

 It would be a bigger concern if we expected virtual PMU migration to
 work, but I think it would be nice to update kvm_pmu_cpuid_update() to
 notice the presence/absence of the new CPUID bits, and then store that
 into per-VM kvm_pmu-pebs_allowed rather than relying only on the
 per-host perf_pebs_virtualization().

I hope at some point it can work. There shouldn't be any problems
with migrating to the same CPU model, in many cases (same event
and same PEBS format) it'll likely even work between models or
gracefully degrade.


The code is there in both kernel and QEMU, it's just very little tested.

Paolo


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-19 Thread Andi Kleen
 Userspace then can read/write these MSRs, and add them to the migration
 stream.  QEMU has code for that.

Thanks. The PEBS setup always redoes its state; it can be redone arbitrarily often.

So the only change needed would be to add the MSRs to some list in qemu?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-19 Thread Paolo Bonzini
  Userspace then can read/write these MSRs, and add them to the migration
  stream.  QEMU has code for that.
 
 Thanks. The PEBS setup always redoes its state; it can be redone
 arbitrarily often.
 
 So the only change needed would be to add the MSRs to some list in qemu?

Yes, and also adding them to the migration stream if the MSRs do not
have the default (all-zero? need to look at the SDM) values.
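
For illustration, a sketch of the QEMU side, modeled on the existing MSR
subsections in target-i386/machine.c; the env.msr_ds_area and
env.msr_pebs_enable fields and the subsection name are assumptions:

/* Only migrate the PEBS MSRs when they differ from their all-zero
 * reset values, following the usual subsection convention. */
static bool pebs_msrs_needed(void *opaque)
{
	X86CPU *cpu = opaque;
	CPUX86State *env = &cpu->env;

	return env->msr_ds_area || env->msr_pebs_enable; /* assumed fields */
}

static const VMStateDescription vmstate_msr_pebs = {
	.name = "cpu/pebs_msrs",
	.version_id = 1,
	.minimum_version_id = 1,
	.fields = (VMStateField[]) {
		VMSTATE_UINT64(env.msr_ds_area, X86CPU),
		VMSTATE_UINT64(env.msr_pebs_enable, X86CPU),
		VMSTATE_END_OF_LIST()
	}
};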

Paolo


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-10 Thread Marcelo Tosatti
On Thu, May 29, 2014 at 06:12:07PM -0700, Andi Kleen wrote:
  {
   struct kvm_pmu *pmu = vcpu->arch.pmu;
 @@ -407,6 +551,20 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
   return 0;
   }
   break;
 + case MSR_IA32_DS_AREA:
 + pmu->ds_area = data;
 + return 0;
 + case MSR_IA32_PEBS_ENABLE:
 + if (data & ~0xf000fULL)
 + break;

Bit 63 == PS_ENABLE ?

  void kvm_handle_pmu_event(struct kvm_vcpu *vcpu)
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 33e8c02..4f39917 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -7288,6 +7288,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
   atomic_switch_perf_msrs(vmx);
   debugctlmsr = get_debugctlmsr();
  
 + /* Move this somewhere else? */

Unless you hook into vcpu->arch.pmu.ds_area and perf_get_ds_area()
writers, it has to be at every vcpu entry.

Could compare values in MSR save area to avoid switch.

 + if (vcpu->arch.pmu.ds_area)
 + add_atomic_switch_msr(vmx, MSR_IA32_DS_AREA,
 +   vcpu->arch.pmu.ds_area,
 +   perf_get_ds_area());

Should clear_atomic_switch_msr before add_atomic_switch_msr.
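
In other words, a sketch of the suggested change to the hunk above
(clear_atomic_switch_msr() and add_atomic_switch_msr() are the existing
vmx.c helpers):

	/* Drop any stale autoload entry first, then install the
	 * guest/host DS_AREA pair only when the guest programmed one. */
	clear_atomic_switch_msr(vmx, MSR_IA32_DS_AREA);
	if (vcpu->arch.pmu.ds_area)
		add_atomic_switch_msr(vmx, MSR_IA32_DS_AREA,
				      vcpu->arch.pmu.ds_area,
				      perf_get_ds_area());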



Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-10 Thread Andi Kleen
On Tue, Jun 10, 2014 at 03:04:48PM -0300, Marcelo Tosatti wrote:
 On Thu, May 29, 2014 at 06:12:07PM -0700, Andi Kleen wrote:
   {
  struct kvm_pmu *pmu = vcpu->arch.pmu;
  @@ -407,6 +551,20 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
  return 0;
  }
  break;
  +   case MSR_IA32_DS_AREA:
  +   pmu->ds_area = data;
  +   return 0;
  +   case MSR_IA32_PEBS_ENABLE:
  +   if (data & ~0xf000fULL)
  +   break;
 
 Bit 63 == PS_ENABLE ?

PEBS_EN is [3:0] for each counter, but only one bit on Silvermont.
LL_EN is [36:32], but currently unused.
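
Spelled out as masks (the names are illustrative only, not taken from
the patch):

#define PEBS_EN_MASK	0x000000000fULL	/* PEBS_EN, bits 3:0, one per counter */
#define LL_EN_MASK	0x1f00000000ULL	/* LL_EN, bits 36:32, currently unused */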

 
   void kvm_handle_pmu_event(struct kvm_vcpu *vcpu)
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 33e8c02..4f39917 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -7288,6 +7288,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
  atomic_switch_perf_msrs(vmx);
  debugctlmsr = get_debugctlmsr();
   
  +   /* Move this somewhere else? */
 
 Unless you hook into vcpu->arch.pmu.ds_area and perf_get_ds_area()
 writers, it has to be at every vcpu entry.
 
 Could compare values in MSR save area to avoid switch.

Ok.

 
  +   if (vcpu->arch.pmu.ds_area)
  +   add_atomic_switch_msr(vmx, MSR_IA32_DS_AREA,
  + vcpu->arch.pmu.ds_area,
  + perf_get_ds_area());
 
 Should clear_atomic_switch_msr before add_atomic_switch_msr.

Ok.

BTW how about general PMU migration? As far as I can tell there 
is no code to save/restore the state for that currently, right?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-10 Thread Marcelo Tosatti
On Tue, Jun 10, 2014 at 12:22:07PM -0700, Andi Kleen wrote:
 On Tue, Jun 10, 2014 at 03:04:48PM -0300, Marcelo Tosatti wrote:
  On Thu, May 29, 2014 at 06:12:07PM -0700, Andi Kleen wrote:
{
 struct kvm_pmu *pmu = vcpu->arch.pmu;
   @@ -407,6 +551,20 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 return 0;
 }
 break;
   + case MSR_IA32_DS_AREA:
   + pmu->ds_area = data;
   + return 0;
   + case MSR_IA32_PEBS_ENABLE:
   + if (data & ~0xf000fULL)
   + break;
  
  Bit 63 == PS_ENABLE ?
 
 PEBS_EN is [3:0] for each counter, but only one bit on Silvermont.
 LL_EN is [36:32], but currently unused.
 
  
void kvm_handle_pmu_event(struct kvm_vcpu *vcpu)
   diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
   index 33e8c02..4f39917 100644
   --- a/arch/x86/kvm/vmx.c
   +++ b/arch/x86/kvm/vmx.c
   @@ -7288,6 +7288,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 atomic_switch_perf_msrs(vmx);
 debugctlmsr = get_debugctlmsr();

   + /* Move this somewhere else? */
  
  Unless you hook into vcpu->arch.pmu.ds_area and perf_get_ds_area()
  writers, it has to be at every vcpu entry.
  
  Could compare values in MSR save area to avoid switch.
 
 Ok.
 
  
   + if (vcpu->arch.pmu.ds_area)
   + add_atomic_switch_msr(vmx, MSR_IA32_DS_AREA,
   +   vcpu->arch.pmu.ds_area,
   +   perf_get_ds_area());
  
  Should clear_atomic_switch_msr before add_atomic_switch_msr.
 
 Ok.
 
 BTW how about general PMU migration? As far as I can tell there 
 is no code to save/restore the state for that currently, right?
 
 -Andi

Paolo wrote support for it, recently. Paolo?



Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-02 Thread Gleb Natapov
On Fri, May 30, 2014 at 09:24:24AM -0700, Andi Kleen wrote:
   To avoid any problems with guest pages being swapped by the host, we
   pin the pages when the PEBS buffer is set up, by intercepting
   that MSR.
  It will prevent the guest pages from being swapped, but the shadow
  paging code may still drop shadow PT pages that build the mapping from
  the DS virtual address to the guest page.
 
 You're saying the EPT code could tear down the EPT mappings?

Under memory pressure, yes. mmu_shrink_scan() calls
prepare_zap_oldest_mmu_page(), which destroys the oldest mmu pages, as
its name says. As far as I can tell, running a nested guest can also
result in EPT mappings being dropped, since it will create a lot of
shadow pages, causing make_mmu_pages_available() to destroy some shadow
pages, and it may choose EPT pages to destroy.

CCing Marcelo to confirm/correct.

 
 OK that would need to be prevented too. Any suggestions how?
Only at a high level: mark the shadow pages involved in translations we
want to keep, and skip them in prepare_zap_oldest_mmu_page().
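
Roughly like this sketch; sp->pinned is a hypothetical flag (the real
prepare_zap_oldest_mmu_page() simply takes the tail of the active list):

/* Walk the active list oldest-first and skip shadow pages that are
 * marked as part of a pinned translation. */
static bool prepare_zap_oldest_mmu_page(struct kvm *kvm,
					struct list_head *invalid_list)
{
	struct kvm_mmu_page *sp;

	list_for_each_entry_reverse(sp, &kvm->arch.active_mmu_pages, link) {
		if (sp->pinned)	/* hypothetical: keep pinned translations */
			continue;
		kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
		return true;
	}
	return false;
}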

 
  With EPT it is less likely to happen (but still possible IIRC,
  depending on memory pressure and how much memory the shadow paging
  code is allowed to use); without EPT it will happen for sure.
 
 Don't care about the non-EPT case, this is whitelisted only for
 EPT-supporting CPUs.
A user may still disable EPT during module load, so PEBS should be
dropped from the guest's CPUID in this case.

 
  There is nothing, as far as I can see, that says what will happen if
  the condition is not met. I always interpreted it as undefined
  behaviour, so anything can happen, including the CPU dying completely.
  You are saying above on one hand that the CPU cannot handle any kind
  of fault during writes to the DS area, but on the other hand that a
  guest could only crash itself. Is this architecturally guaranteed?
 
 You essentially would get random page faults, and the PEBS event will
 be cancelled. No hangs.
Is it the guest that will get those random page faults, or the host?

 
 It's not architecturally guaranteed, but we whitelist anyway, so
 we only care about the whitelisted CPUs at this point. For them
 I have confirmation that it works.
 
 -Andi

--
Gleb.


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-02 Thread Andi Kleen

BTW I found some more problems in the v1 version.

   With EPT it is less likely to happen (but still possible IIRC,
   depending on memory pressure and how much memory the shadow paging
   code is allowed to use); without EPT it will happen for sure.

  Don't care about the non-EPT case, this is whitelisted only for
  EPT-supporting CPUs.
 A user may still disable EPT during module load, so PEBS should be
 dropped from the guest's CPUID in this case.

Ok.

 
  
   There is nothing, as far as I can see, that says what will happen if
   the condition is not met. I always interpreted it as undefined
   behaviour, so anything can happen, including the CPU dying completely.
   You are saying above on one hand that the CPU cannot handle any kind
   of fault during writes to the DS area, but on the other hand that a
   guest could only crash itself. Is this architecturally guaranteed?
  
  You essentially would get random page faults, and the PEBS event will
  be cancelled. No hangs.
 Is it the guest that will get those random page faults, or the host?

The guest (on the whitelisted CPU models).

-Andi


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-02 Thread Eric Northup
On Thu, May 29, 2014 at 6:12 PM, Andi Kleen a...@firstfloor.org wrote:
 From: Andi Kleen a...@linux.intel.com

 PEBS (Precise Event Based Sampling) profiling is very powerful,
 allowing improved sampling precision and much additional information,
 like address or TSX abort profiling. cycles:p and :pp use PEBS.

 This patch enables PEBS profiling in KVM guests.

 PEBS writes profiling records to a virtual address in memory. Since
 the guest controls the virtual address space, the PEBS record is
 delivered directly to the guest buffer. We set up the PEBS state so
 that it works correctly. The CPU cannot handle any kind of fault during
 these guest writes.

 To avoid any problems with guest pages being swapped by the host, we
 pin the pages when the PEBS buffer is set up, by intercepting
 that MSR.

 Typically profilers only set up a single page, so pinning that is not
 a big problem. The pinning is currently limited to 17 pages (64K+1).

 In theory the guest can change its own page tables after the PEBS
 setup. The host has no way to track that with EPT. But if a guest
 would do that it could only crash itself. It's not expected
 that normal profilers do that.

 The patch also adds the basic glue to enable the PEBS CPUIDs
 and other PEBS MSRs, and ask perf to enable PEBS as needed.

 Due to various limitations it currently only works on Silvermont
 based systems.

 This patch doesn't implement the extended MSRs some CPUs support.
 For example latency profiling on SLM will not work at this point.

 Timing:

 The emulation is somewhat more expensive than a real PMU. This
 may trigger the expensive PMI detection in the guest.
 Usually this can be disabled with
 echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent

 Migration:

 In theory it should be possible (as long as we migrate to
 a host with the same PEBS event and the same PEBS format), but I'm not
 sure the basic KVM PMU code supports it correctly: no code to
 save/restore state, unless I'm missing something. Once the PMU
 code grows proper migration support, it should be straightforward
 to handle the PEBS state too.

 Signed-off-by: Andi Kleen a...@linux.intel.com
 ---
  arch/x86/include/asm/kvm_host.h   |   6 ++
  arch/x86/include/uapi/asm/msr-index.h |   4 +
  arch/x86/kvm/cpuid.c  |  10 +-
  arch/x86/kvm/pmu.c| 184 --
  arch/x86/kvm/vmx.c|   6 ++
  5 files changed, 196 insertions(+), 14 deletions(-)

 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 7de069af..d87cb66 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -319,6 +319,8 @@ struct kvm_pmc {
 struct kvm_vcpu *vcpu;
  };

 +#define MAX_PINNED_PAGES 17 /* 64k buffer + ds */
 +
  struct kvm_pmu {
 unsigned nr_arch_gp_counters;
 unsigned nr_arch_fixed_counters;
 @@ -335,6 +337,10 @@ struct kvm_pmu {
 struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
 struct irq_work irq_work;
 u64 reprogram_pmi;
 +   u64 pebs_enable;
 +   u64 ds_area;
 +   struct page *pinned_pages[MAX_PINNED_PAGES];
 +   unsigned num_pinned_pages;
  };

  enum {
 diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
 index fcf2b3a..409a582 100644
 --- a/arch/x86/include/uapi/asm/msr-index.h
 +++ b/arch/x86/include/uapi/asm/msr-index.h
 @@ -72,6 +72,10 @@
  #define MSR_IA32_PEBS_ENABLE   0x03f1
  #define MSR_IA32_DS_AREA   0x0600
  #define MSR_IA32_PERF_CAPABILITIES 0x0345
 +#define PERF_CAP_PEBS_TRAP (1U << 6)
 +#define PERF_CAP_ARCH_REG  (1U << 7)
 +#define PERF_CAP_PEBS_FORMAT   (0xf << 8)
 +
  #define MSR_PEBS_LD_LAT_THRESHOLD  0x03f6

  #define MSR_MTRRfix64K_00000   0x00000250
 diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
 index f47a104..c8cc76b 100644
 --- a/arch/x86/kvm/cpuid.c
 +++ b/arch/x86/kvm/cpuid.c
 @@ -260,6 +260,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
 unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0;
 unsigned f_mpx = kvm_x86_ops->mpx_supported() ? F(MPX) : 0;
 +   bool pebs = perf_pebs_virtualization();
 +   unsigned f_ds = pebs ? F(DS) : 0;
 +   unsigned f_pdcm = pebs ? F(PDCM) : 0;
 +   unsigned f_dtes64 = pebs ? F(DTES64) : 0;

 /* cpuid 1.edx */
 const u32 kvm_supported_word0_x86_features =
 @@ -268,7 +272,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) |
 F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
 F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLUSH) |
 -   0 /* Reserved, DS, ACPI */ | F(MMX) |
 +   f_ds /* Reserved, ACPI */ | F(MMX) |

Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-02 Thread Marcelo Tosatti
On Mon, Jun 02, 2014 at 07:45:35PM +0300, Gleb Natapov wrote:
 On Fri, May 30, 2014 at 09:24:24AM -0700, Andi Kleen wrote:
To avoid any problems with guest pages being swapped by the host, we
pin the pages when the PEBS buffer is set up, by intercepting
that MSR.
   It will prevent the guest pages from being swapped, but the shadow
   paging code may still drop shadow PT pages that build the mapping
   from the DS virtual address to the guest page.
  
  You're saying the EPT code could tear down the EPT mappings?
 
 Under memory pressure, yes. mmu_shrink_scan() calls
 prepare_zap_oldest_mmu_page(), which destroys the oldest mmu pages, as
 its name says. As far as I can tell, running a nested guest can also
 result in EPT mappings being dropped, since it will create a lot of
 shadow pages, causing make_mmu_pages_available() to destroy some shadow
 pages, and it may choose EPT pages to destroy.
 
 CCing Marcelo to confirm/correct.

Yes. Given SLAB pressure, any shadow page can be deleted, except those
pinned via root_count=1.

  OK that would need to be prevented too. Any suggestions how?
 Only at a high level: mark the shadow pages involved in translations we
 want to keep, and skip them in prepare_zap_oldest_mmu_page().

We should special-case such translations so that they are not zapped
(either via page deletion or single-entry EPT deletion), along with
any of their parents. Bummer.

Maybe it's cleaner to check that the DS area is EPT-mapped before VM-entry.

Is there no way the processor can generate VM-exits?

Is it not an option to fake a DS save area in the host (and trap
any accesses to the DS_AREA MSR from the guest)?
Then, before notifying the guest of the PEBS event, copy from that host
area to the guest's address. Slow, probably.
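
A sketch of that bounce-buffer idea; pmu->guest_ds_gva, pmu->host_ds_buf
and pmu->ds_buf_len are hypothetical fields (and a real version would
have to copy page by page, since the guest buffer need not be physically
contiguous):

/* Hardware writes PEBS records into a host-owned buffer; on PMI, copy
 * them out to the guest through the ordinary, faultable guest-memory
 * path instead of letting the CPU store into guest memory directly. */
static void pebs_copy_to_guest(struct kvm_vcpu *vcpu)
{
	struct kvm_pmu *pmu = &vcpu->arch.pmu;
	gpa_t gpa;

	gpa = kvm_mmu_gva_to_gpa_system(vcpu, pmu->guest_ds_gva, NULL);
	if (gpa == UNMAPPED_GVA)
		return;	/* guest unmapped its buffer; drop the records */
	kvm_write_guest(vcpu->kvm, gpa, pmu->host_ds_buf, pmu->ds_buf_len);
}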



Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-02 Thread Andi Kleen
 It seems to me that with this patch, there is no way to expose a
 PMU-without-PEBS to the guest if the host has PEBS.

If you clear the CPUID bits then likely no one would access it.

But fair enough, I'll add extra checks for CPUID.

 It would be a bigger concern if we expected virtual PMU migration to
 work, but I think it would be nice to update kvm_pmu_cpuid_update() to
 notice the presence/absence of the new CPUID bits, and then store that
 into per-VM kvm_pmu-pebs_allowed rather than relying only on the
 per-host perf_pebs_virtualization().

I hope at some point it can work. There shouldn't be any problems
with migrating to the same CPU model, in many cases (same event 
and same PEBS format) it'll likely even work between models or
gracefully degrade.

BTW in practice it'll likely work anyway, because many profilers
regularly re-set the PMU.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-05-30 Thread Gleb Natapov
On Thu, May 29, 2014 at 06:12:07PM -0700, Andi Kleen wrote:
 From: Andi Kleen a...@linux.intel.com
 
 PEBS (Precise Event Based Sampling) profiling is very powerful,
 allowing improved sampling precision and much additional information,
 like address or TSX abort profiling. cycles:p and :pp use PEBS.
 
 This patch enables PEBS profiling in KVM guests.
That sounds really cool!

 
 PEBS writes profiling records to a virtual address in memory. Since
 the guest controls the virtual address space, the PEBS record is
 delivered directly to the guest buffer. We set up the PEBS state so
 that it works correctly. The CPU cannot handle any kind of fault during
 these guest writes.
 
 To avoid any problems with guest pages being swapped by the host, we
 pin the pages when the PEBS buffer is set up, by intercepting
 that MSR.
It will prevent the guest pages from being swapped, but the shadow
paging code may still drop shadow PT pages that build the mapping from
the DS virtual address to the guest page. With EPT it is less likely to
happen (but still possible IIRC, depending on memory pressure and how
much memory the shadow paging code is allowed to use); without EPT it
will happen for sure.

 
 Typically profilers only set up a single page, so pinning that is not
 a big problem. The pinning is currently limited to 17 pages (64K+1).
 
 In theory the guest can change its own page tables after the PEBS
 setup. The host has no way to track that with EPT. But if a guest
 would do that it could only crash itself. It's not expected
 that normal profilers do that.
Spec says:

 The following restrictions should be applied to the DS save area.
   • The three DS save area sections should be allocated from a
   non-paged pool, and marked accessed and dirty. It is the responsibility
   of the operating system to keep the pages that contain the buffer
   present and to mark them accessed and dirty. The implication is that
   the operating system cannot do “lazy” page-table entry propagation
   for these pages.

There is nothing, as far as I can see, that says what will happen if the
condition is not met. I always interpreted it as undefined behaviour, so
anything can happen, including the CPU dying completely. You are saying
above on one hand that the CPU cannot handle any kind of fault during
writes to the DS area, but on the other hand that a guest could only
crash itself. Is this architecturally guaranteed?


 
 The patch also adds the basic glue to enable the PEBS CPUIDs
 and other PEBS MSRs, and ask perf to enable PEBS as needed.
 
 Due to various limitations it currently only works on Silvermont
 based systems.
 
 This patch doesn't implement the extended MSRs some CPUs support.
 For example latency profiling on SLM will not work at this point.
 
 Timing:
 
 The emulation is somewhat more expensive than a real PMU. This
 may trigger the expensive PMI detection in the guest.
 Usually this can be disabled with
 echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
 
 Migration:
 
 In theory it should be possible (as long as we migrate to
 a host with the same PEBS event and the same PEBS format), but I'm not
 sure the basic KVM PMU code supports it correctly: no code to
 save/restore state, unless I'm missing something. Once the PMU
 code grows proper migration support, it should be straightforward
 to handle the PEBS state too.
 
 Signed-off-by: Andi Kleen a...@linux.intel.com
 ---
  arch/x86/include/asm/kvm_host.h   |   6 ++
  arch/x86/include/uapi/asm/msr-index.h |   4 +
  arch/x86/kvm/cpuid.c  |  10 +-
  arch/x86/kvm/pmu.c| 184 --
  arch/x86/kvm/vmx.c|   6 ++
  5 files changed, 196 insertions(+), 14 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 7de069af..d87cb66 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -319,6 +319,8 @@ struct kvm_pmc {
   struct kvm_vcpu *vcpu;
  };
  
 +#define MAX_PINNED_PAGES 17 /* 64k buffer + ds */
 +
  struct kvm_pmu {
   unsigned nr_arch_gp_counters;
   unsigned nr_arch_fixed_counters;
 @@ -335,6 +337,10 @@ struct kvm_pmu {
   struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
   struct irq_work irq_work;
   u64 reprogram_pmi;
 + u64 pebs_enable;
 + u64 ds_area;
 + struct page *pinned_pages[MAX_PINNED_PAGES];
 + unsigned num_pinned_pages;
  };
  
  enum {
 diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
 index fcf2b3a..409a582 100644
 --- a/arch/x86/include/uapi/asm/msr-index.h
 +++ b/arch/x86/include/uapi/asm/msr-index.h
 @@ -72,6 +72,10 @@
  #define MSR_IA32_PEBS_ENABLE 0x03f1
  #define MSR_IA32_DS_AREA 0x0600
  #define MSR_IA32_PERF_CAPABILITIES   0x0345
 +#define PERF_CAP_PEBS_TRAP   (1U << 6)
 +#define PERF_CAP_ARCH_REG    (1U << 7)
 +#define PERF_CAP_PEBS_FORMAT (0xf << 8)
 +
 #define MSR_PEBS_LD_LAT_THRESHOLD  0x03f6

[PATCH 4/4] kvm: Implement PEBS virtualization

2014-05-29 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

PEBS (Precise Event Based Sampling) profiling is very powerful,
allowing improved sampling precision and much additional information,
like address or TSX abort profiling. cycles:p and :pp use PEBS.

This patch enables PEBS profiling in KVM guests.

PEBS writes profiling records to a virtual address in memory. Since
the guest controls the virtual address space, the PEBS record is
delivered directly to the guest buffer. We set up the PEBS state so
that it works correctly. The CPU cannot handle any kind of fault during
these guest writes.

To avoid any problems with guest pages being swapped by the host, we
pin the pages when the PEBS buffer is set up, by intercepting
that MSR.

Typically profilers only set up a single page, so pinning that is not
a big problem. The pinning is currently limited to 17 pages (64K+1).

In theory the guest can change its own page tables after the PEBS
setup. The host has no way to track that with EPT. But if a guest
would do that it could only crash itself. It's not expected
that normal profilers do that.
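
The pinning step itself (whose pmu.c hunk is not included below) amounts
to something like the following sketch, assuming the existing gva-to-gpa
and page-reference helpers; pin_guest_page() is an illustrative name:

/* Resolve one guest-virtual page of the DS area through the guest page
 * tables and take a reference so the host cannot swap it out.  Release
 * later with kvm_release_page_clean()/_dirty(). */
static struct page *pin_guest_page(struct kvm_vcpu *vcpu, u64 gva)
{
	gpa_t gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);

	if (gpa == UNMAPPED_GVA)
		return NULL;
	return gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT);
}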

The patch also adds the basic glue to enable the PEBS CPUIDs
and other PEBS MSRs, and ask perf to enable PEBS as needed.

Due to various limitations it currently only works on Silvermont
based systems.

This patch doesn't implement the extended MSRs some CPUs support.
For example latency profiling on SLM will not work at this point.

Timing:

The emulation is somewhat more expensive than a real PMU. This
may trigger the expensive PMI detection in the guest.
Usually this can be disabled with
echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent

Migration:

In theory it should be possible (as long as we migrate to
a host with the same PEBS event and the same PEBS format), but I'm not
sure the basic KVM PMU code supports it correctly: no code to
save/restore state, unless I'm missing something. Once the PMU
code grows proper migration support, it should be straightforward
to handle the PEBS state too.

Signed-off-by: Andi Kleen a...@linux.intel.com
---
 arch/x86/include/asm/kvm_host.h   |   6 ++
 arch/x86/include/uapi/asm/msr-index.h |   4 +
 arch/x86/kvm/cpuid.c  |  10 +-
 arch/x86/kvm/pmu.c| 184 --
 arch/x86/kvm/vmx.c|   6 ++
 5 files changed, 196 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7de069af..d87cb66 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -319,6 +319,8 @@ struct kvm_pmc {
struct kvm_vcpu *vcpu;
 };
 
+#define MAX_PINNED_PAGES 17 /* 64k buffer + ds */
+
 struct kvm_pmu {
unsigned nr_arch_gp_counters;
unsigned nr_arch_fixed_counters;
@@ -335,6 +337,10 @@ struct kvm_pmu {
struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
struct irq_work irq_work;
u64 reprogram_pmi;
+   u64 pebs_enable;
+   u64 ds_area;
+   struct page *pinned_pages[MAX_PINNED_PAGES];
+   unsigned num_pinned_pages;
 };
 
 enum {
diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index fcf2b3a..409a582 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -72,6 +72,10 @@
 #define MSR_IA32_PEBS_ENABLE   0x03f1
 #define MSR_IA32_DS_AREA   0x0600
 #define MSR_IA32_PERF_CAPABILITIES 0x0345
+#define PERF_CAP_PEBS_TRAP (1U << 6)
+#define PERF_CAP_ARCH_REG  (1U << 7)
+#define PERF_CAP_PEBS_FORMAT   (0xf << 8)
+
 #define MSR_PEBS_LD_LAT_THRESHOLD  0x03f6
 
 #define MSR_MTRRfix64K_00000   0x00000250
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f47a104..c8cc76b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -260,6 +260,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
	unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
	unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0;
	unsigned f_mpx = kvm_x86_ops->mpx_supported() ? F(MPX) : 0;
+   bool pebs = perf_pebs_virtualization();
+   unsigned f_ds = pebs ? F(DS) : 0;
+   unsigned f_pdcm = pebs ? F(PDCM) : 0;
+   unsigned f_dtes64 = pebs ? F(DTES64) : 0;
 
/* cpuid 1.edx */
const u32 kvm_supported_word0_x86_features =
@@ -268,7 +272,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) |
F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLUSH) |
-   0 /* Reserved, DS, ACPI */ | F(MMX) |
+   f_ds /* Reserved, ACPI */ | F(MMX) |
F(FXSR) | F(XMM) | F(XMM2) | F(SELFSNOOP) |
0 /* HTT, TM, Reserved, PBE */;
/* cpuid 0x8001.edx */
@@ -283,10