Re: [RFC PATCH 00/13] KVM: x86: SMM support

2015-05-20 Thread Andi Kleen
Paolo Bonzini pbonz...@redhat.com writes:

 On 19/05/2015 16:25, Zhang, Yang Z wrote:
 Paolo Bonzini wrote on 2015-04-30:
 This patch series introduces system management mode support.
 
 Just curious, what's the motivation for adding vSMM support? Is there
 any use case inside the guest that requires SMM? Thanks.

 SMM provides the trusted base for UEFI Secure Boot.

I was wondering the same thing. This definitely should be somewhere in
the changelogs!

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 2/3] perf protect LBR when Intel PT is enabled.

2014-07-03 Thread Andi Kleen
 If there are active LBR users out there, we should refuse to enable PT and
 vice versa. 

This doesn't work, e.g. hardware debuggers can take over at any time.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH V2 1/3] perf ignore LBR and offcore_rsp.

2014-07-02 Thread Andi Kleen
 Signed-off-by: Andi Kleen a...@linux.intel.com

I did not contribute to this patch, so please remove that SOB.

 Signed-off-by: Kan Liang kan.li...@intel.com

   struct extra_reg *extra_regs;
   unsigned int er_flags;
 +	bool			extra_msr_access;  /* EXTRA REG MSR can be accessed */
  

This doesn't look right, needs a flag for each extra register.

They are completely unrelated to each other.

BTW this will also cause KVM messages at each boot now.

 	wrmsrl(hwc->extra_reg.reg, hwc->extra_reg.config);
 	wrmsrl(hwc->config_base, (hwc->config | enable_mask) & ~disable_mask);
  }
 diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
 b/arch/x86/kernel/cpu/perf_event_intel.c
 index adb02aa..8011d42 100644
 --- a/arch/x86/kernel/cpu/perf_event_intel.c
 +++ b/arch/x86/kernel/cpu/perf_event_intel.c
 @@ -2565,6 +2565,13 @@ __init int intel_pmu_init(void)
   }
   }
  
 +	/* Access LBR MSR may cause #GP under certain circumstances. E.g. KVM doesn't support LBR MSR */
 +	if (x86_pmu.lbr_nr)
 +		x86_pmu.lbr_msr_access = test_msr_access(x86_pmu.lbr_tos) &&
 +					 test_msr_access(x86_pmu.lbr_from);

s///

Also, this doesn't cover the case when someone takes over the LBRs and they
start #GPing later.
So for LBR the test still has to be at each access.

-Andi


Re: [PATCH V2 2/3] perf protect LBR when Intel PT is enabled.

2014-07-02 Thread Andi Kleen
On Wed, Jul 02, 2014 at 11:14:14AM -0700, kan.li...@intel.com wrote:
 From: Kan Liang kan.li...@intel.com
 
 If RTIT_CTL.TraceEn=1, any attempt to read or write the LBR or LER MSRs, 
 including LBR_TOS, will result in a #GP.
 Since Intel PT can be enabled/disabled at runtime, LBR MSRs have to be 
 protected by _safe() at runtime.
 
 Signed-off-by: Kan Liang kan.li...@intel.com

Patch looks good to me.

-Andi


Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)

2014-06-30 Thread Andi Kleen
  So if we take from it that the translation should be present, the same goes
  for accessed and dirty. If Andi can clarify this within Intel it would be great.
 
 Andi?

There were some problems with non-dirty/accessed pages on really old CPUs
(P4 generation) with PEBS. But PEBS virtualization is whitelisted and will
never run on those.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 1/2] perf: ignore LBR and offcore_rsp.

2014-06-23 Thread Andi Kleen
Peter Zijlstra pet...@infradead.org writes:


 So I really hate this patch, it makes the code hideous. Also, it's a
 death by a thousand cuts, adding endless branches in this code.

FWIW compared to the cost of a RDMSR (which is a very complex operation
for the CPU) the cost of a predicted branch is nearly in the noise.


 Also, the offcore part is retarded, just make sure extra_reg isn't set.

 As to the LBR, just make it so that we avoid calling the LBR code to
 begin with; ideally without adding extra code to fast paths.

You mean check once at initialization time.

I'm not sure that would handle all cases unfortunately.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-22 Thread Andi Kleen
 First, it's not sufficient to pin the debug store area, you also
 have to pin the guest page tables that are used to map the debug
 store.  But even if you do that, as soon as the guest fork()s, it
 will create a new pgd which the host will be free to swap out.  The
 processor can then attempt a PEBS store to an unmapped address which
 will fail, even though the guest is configured correctly.

That's a good point. You're right of course.

The only way I can think around it would be to intercept CR3 writes
while PEBS is active and always pin all the table pages leading 
to the PEBS buffer. That's slow, but should be only needed
while PEBS is running.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)

2014-06-19 Thread Andi Kleen
 + * Failure to instantiate pages will abort guest entry.
 + *
 + * Page frames should be pinned with get_page in advance.
 + *
 + * Pinning is not guaranteed while executing as L2 guest.
 
 Does this undermine security?

It should not. In the worst case it'll randomly lose PEBS records.

-Andi


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-19 Thread Andi Kleen
 Userspace then can read/write these MSRs, and add them to the migration
 stream.  QEMU has code for that.

Thanks. The PEBS setup always redoes its state; it can be redone arbitrarily often.

So the only change needed would be to add the MSRs to some list in qemu?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 1/2] perf: ignore LBR and offcore_rsp.

2014-06-19 Thread Andi Kleen
Andi Kleen a...@firstfloor.org writes:

 Signed-off-by: Kan Liang kan.li...@intel.com

 And here I thought that Andi was of the opinion that if you set CPUID to
 indicate a particular CPU you had better also handle all its MSRs.

 Yes, philosophically that would be the right way,
 but we needed a short term fix to stop things from crashing, and that
 was the simplest.

I should add there is another reason for this patch now,
and doing it in perf instead of somewhere else
(this should probably go into the description).

With PT on, enabling LBR can #GP. So perf needs to handle
this case without crashing. This can happen independently
of any hypervisor.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 1/2] perf: ignore LBR and offcore_rsp.

2014-06-18 Thread Andi Kleen
Peter Zijlstra pet...@infradead.org writes:

 This order indicates Andi is the author; but there's no corresponding
 From.

I wrote an early version of the patch, but Kan took it over and extended
it. So both are authors.

BTW Kan you may want to use git send-email to get standard format.


 Signed-off-by: Kan Liang kan.li...@intel.com

 And here I thought that Andi was of the opinion that if you set CPUID to
 indicate a particular CPU you had better also handle all its MSRs.

Yes, philosophically that would be the right way,
but we needed a short term fix to stop things from crashing, and that
was the simplest.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: [patch 0/5] KVM: support for pinning sptes

2014-06-18 Thread Andi Kleen
On Wed, Jun 18, 2014 at 08:12:03PM -0300, mtosa...@redhat.com wrote:
 Required by PEBS support as discussed at 
 
 Subject: [patch 0/5] Implement PEBS virtualization for Silvermont
 Message-Id: 1401412327-14810-1-git-send-email-a...@firstfloor.org

Thanks Marcelo. I'll give it a stress test here.

-Andi


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-10 Thread Andi Kleen
On Tue, Jun 10, 2014 at 03:04:48PM -0300, Marcelo Tosatti wrote:
 On Thu, May 29, 2014 at 06:12:07PM -0700, Andi Kleen wrote:
   {
 	struct kvm_pmu *pmu = &vcpu->arch.pmu;
  @@ -407,6 +551,20 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 0;
 		}
 		break;
 +	case MSR_IA32_DS_AREA:
 +		pmu->ds_area = data;
 +		return 0;
 +	case MSR_IA32_PEBS_ENABLE:
 +		if (data & ~0xf000fULL)
 +			break;
 
 Bit 63 == PS_ENABLE ?

PEBS_EN is [3:0] for each counter, but only one bit on Silvermont.
LL_EN is [36:32], but currently unused.

 
   void kvm_handle_pmu_event(struct kvm_vcpu *vcpu)
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 33e8c02..4f39917 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -7288,6 +7288,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
  atomic_switch_perf_msrs(vmx);
  debugctlmsr = get_debugctlmsr();
   
  +   /* Move this somewhere else? */
 
  Unless you hook into vcpu->arch.pmu.ds_area and perf_get_ds_area()
  writers, it has to be at every vcpu entry.
 
 Could compare values in MSR save area to avoid switch.

Ok.

 
 +	if (vcpu->arch.pmu.ds_area)
 +		add_atomic_switch_msr(vmx, MSR_IA32_DS_AREA,
 +				      vcpu->arch.pmu.ds_area,
 +				      perf_get_ds_area());
 
 Should clear_atomic_switch_msr before 
 add_atomic_switch_msr.

Ok.

BTW how about general PMU migration? As far as I can tell there 
is no code to save/restore the state for that currently, right?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-02 Thread Andi Kleen

BTW I found some more problems in the v1 version.

   With EPT it is less likely to happen (but still possible IIRC depending 
   on memory
   pressure and how much memory shadow paging code is allowed to use), 
   without EPT
   it will happen for sure.
  
  Don't care about the non EPT case, this is white listed only for EPT 
  supporting 
  CPUs.
 User may still disable EPT during module load, so pebs should be dropped
 from a guest's cpuid in this case.

Ok.

 
  
   There is nothing, as far as I can see, that says what will happen if the
   condition is not met. I always interpreted it as undefined behaviour so
   anything can happen including CPU dies completely.  You are saying above
   on one hand that CPU cannot handle any kinds of faults during write to
   DS area, but on the other hand a guest could only crash itself. Is this
   architecturally guarantied?
  
  You essentially would get random page faults, and the PEBS event will
  be cancelled. No hangs.
 Is this a guest who will get those random page faults or a host?

The guest (on the whitelisted CPU models).

-Andi


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-02 Thread Andi Kleen
 It seems to me that with this patch, there is no way to expose a
 PMU-without-PEBS to the guest if the host has PEBS.

If you clear the CPUIDs then no one would likely access it.

But fair enough, I'll add extra checks for CPUID.

 It would be a bigger concern if we expected virtual PMU migration to
 work, but I think it would be nice to update kvm_pmu_cpuid_update() to
 notice the presence/absence of the new CPUID bits, and then store that
 into per-VM kvm_pmu-pebs_allowed rather than relying only on the
 per-host perf_pebs_virtualization().

I hope at some point it can work. There shouldn't be any problems
with migrating to the same CPU model; in many cases (same event
and same PEBS format) it'll likely even work between models or
gracefully degrade.

BTW in practice it'll likely work anyways because many profilers
regularly re-set the PMU. 

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters

2014-05-30 Thread Andi Kleen
On Fri, May 30, 2014 at 09:31:53AM +0200, Peter Zijlstra wrote:
 On Thu, May 29, 2014 at 06:12:05PM -0700, Andi Kleen wrote:
  From: Andi Kleen a...@linux.intel.com
  
  Currently perf unconditionally disables PEBS for guest.
  
  Now that we have the infrastructure in place to handle
  it we can allow it for KVM owned guest events. For
  that, perf needs to know that an event is owned by
  a guest. Add a new state bit in the perf_event for that.
  
 
 This doesn't make sense; why does it need to be owned?

Please read the complete patch kit

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


[PATCH 1/4] perf: Add PEBS virtualization enable for Silvermont

2014-05-29 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

To avoid various problems (like leaking counters) the PEBS
virtualization needs whitelisting per CPU model. Add state to the
x86_pmu for this and enable it for Silvermont.

Silvermont is currently the only CPU where it is safe
to virtualize PEBS, as it doesn't leak PEBS events
through exits (as long as the exit MSR list disables
the counter with PEBS_ENABLE).

Also, Silvermont is relatively simple to handle,
as it only has one PEBS counter.

Also export the information to (modular) KVM.

Used in follow-on patches.

Signed-off-by: Andi Kleen a...@linux.intel.com
---
 arch/x86/include/asm/perf_event.h |  6 ++
 arch/x86/kernel/cpu/perf_event.h  |  1 +
 arch/x86/kernel/cpu/perf_event_intel.c|  1 +
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 20 
 4 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 8249df4..c49c7d3 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -250,6 +250,9 @@ struct perf_guest_switch_msr {
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 extern void perf_check_microcode(void);
+extern unsigned long long perf_get_ds_area(void);
+extern unsigned long long perf_get_pebs_enable(void);
+extern bool perf_pebs_virtualization(void);
 #else
 static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 {
@@ -264,6 +267,9 @@ static inline void perf_get_x86_pmu_capability(struct 
x86_pmu_capability *cap)
 
 static inline void perf_events_lapic_init(void){ }
 static inline void perf_check_microcode(void) { }
+static inline unsigned long long perf_get_ds_area(void) { return 0; }
+static inline unsigned long long perf_get_pebs_enable(void) { return 0; }
+static inline bool perf_pebs_virtualization(void) { return false; }
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3b2f9bd..6ab8fdd 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -449,6 +449,7 @@ struct x86_pmu {
struct event_constraint *pebs_constraints;
void(*pebs_aliases)(struct perf_event *event);
int max_pebs_events;
+	bool		pebs_virtualization;
 
/*
 * Intel LBR
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index aa333d9..86ccb81 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2399,6 +2399,7 @@ __init int intel_pmu_init(void)
x86_pmu.pebs_constraints = intel_slm_pebs_event_constraints;
x86_pmu.extra_regs = intel_slm_extra_regs;
x86_pmu.er_flags |= ERF_HAS_RSP_1;
+   x86_pmu.pebs_virtualization = true;
 		pr_cont("Silvermont events, ");
break;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c 
b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index ae96cfa..29622a7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -429,6 +429,26 @@ void reserve_ds_buffers(void)
put_online_cpus();
 }
 
+unsigned long long perf_get_ds_area(void)
+{
+   return (u64)__get_cpu_var(cpu_hw_events).ds;
+}
+EXPORT_SYMBOL_GPL(perf_get_ds_area);
+
+unsigned long long perf_get_pebs_enable(void)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	return cpuc->pebs_enabled;
+}
+EXPORT_SYMBOL_GPL(perf_get_pebs_enable);
+
+bool perf_pebs_virtualization(void)
+{
+   return x86_pmu.pebs_virtualization;
+}
+EXPORT_SYMBOL_GPL(perf_pebs_virtualization);
+
 /*
  * BTS
  */
-- 
1.9.0



Implement PEBS virtualization for Silvermont

2014-05-29 Thread Andi Kleen
PEBS is very useful (e.g. enabling the more precise cycles:pp event or
memory profiling). Unfortunately it didn't work in virtualization,
which is becoming more and more common.

This patch kit implements simple PEBS virtualization for KVM on Silvermont
CPUs. Silvermont does not have the leak problems that prevented successful
PEBS virtualization earlier.

It needs some (simple) modifications in the host perf code, in addition to
a PEBS device model in KVM. The guest does not need any modifications.

It also requires running with -cpu host. This may in turn cause
some other problems with the guest perf (due to writes to various missing
MSRs), but these can be addressed separately.

For more details please see the description of the individual patches.

Available in
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git 
perf/kvm-pebs-slm

-Andi


[PATCH 4/4] kvm: Implement PEBS virtualization

2014-05-29 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

PEBS (Precise Event Based Sampling) profiling is very powerful,
allowing improved sampling precision and much additional information,
like address or TSX abort profiling. cycles:p and :pp use PEBS.

This patch enables PEBS profiling in KVM guests.

PEBS writes profiling records to a virtual address in memory. Since
the guest controls the virtual address space, the PEBS record
is directly delivered to the guest buffer. We set up the PEBS state
so that it works correctly. The CPU cannot handle any kind of fault during
these guest writes.

To avoid any problems with guest pages being swapped by the host, we
pin the pages when the PEBS buffer is set up, by intercepting
that MSR.

Typically profilers only set up a single page, so pinning that is not
a big problem. The pinning is currently limited to 17 pages (64K buffer + DS page).

In theory the guest can change its own page tables after the PEBS
setup. The host has no way to track that with EPT. But if a guest
would do that it could only crash itself. It's not expected
that normal profilers do that.

The patch also adds the basic glue to enable the PEBS CPUIDs
and other PEBS MSRs, and ask perf to enable PEBS as needed.

Due to various limitations it currently only works on Silvermont
based systems.

This patch doesn't implement the extended MSRs some CPUs support.
For example latency profiling on SLM will not work at this point.

Timing:

The emulation is somewhat more expensive than a real PMU. This
may trigger the expensive PMI detection in the guest.
Usually this can be disabled with
echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent

Migration:

In theory it should be possible (as long as we migrate to
a host with the same PEBS event and the same PEBS format), but I'm not
sure the basic KVM PMU code supports it correctly: no code to
save/restore state, unless I'm missing something. Once the PMU
code grows proper migration support it should be straightforward
to handle the PEBS state too.

Signed-off-by: Andi Kleen a...@linux.intel.com
---
 arch/x86/include/asm/kvm_host.h   |   6 ++
 arch/x86/include/uapi/asm/msr-index.h |   4 +
 arch/x86/kvm/cpuid.c  |  10 +-
 arch/x86/kvm/pmu.c| 184 --
 arch/x86/kvm/vmx.c|   6 ++
 5 files changed, 196 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7de069af..d87cb66 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -319,6 +319,8 @@ struct kvm_pmc {
struct kvm_vcpu *vcpu;
 };
 
+#define MAX_PINNED_PAGES 17 /* 64k buffer + ds */
+
 struct kvm_pmu {
unsigned nr_arch_gp_counters;
unsigned nr_arch_fixed_counters;
@@ -335,6 +337,10 @@ struct kvm_pmu {
struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
struct irq_work irq_work;
u64 reprogram_pmi;
+   u64 pebs_enable;
+   u64 ds_area;
+   struct page *pinned_pages[MAX_PINNED_PAGES];
+   unsigned num_pinned_pages;
 };
 
 enum {
diff --git a/arch/x86/include/uapi/asm/msr-index.h 
b/arch/x86/include/uapi/asm/msr-index.h
index fcf2b3a..409a582 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -72,6 +72,10 @@
 #define MSR_IA32_PEBS_ENABLE   0x03f1
 #define MSR_IA32_DS_AREA   0x0600
 #define MSR_IA32_PERF_CAPABILITIES 0x0345
+#define PERF_CAP_PEBS_TRAP	(1U << 6)
+#define PERF_CAP_ARCH_REG	(1U << 7)
+#define PERF_CAP_PEBS_FORMAT	(0xf << 8)
+
 #define MSR_PEBS_LD_LAT_THRESHOLD  0x03f6
 
 #define MSR_MTRRfix64K_0   0x0250
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f47a104..c8cc76b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -260,6 +260,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
	unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
	unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0;
	unsigned f_mpx = kvm_x86_ops->mpx_supported() ? F(MPX) : 0;
+   bool pebs = perf_pebs_virtualization();
+   unsigned f_ds = pebs ? F(DS) : 0;
+   unsigned f_pdcm = pebs ? F(PDCM) : 0;
+   unsigned f_dtes64 = pebs ? F(DTES64) : 0;
 
/* cpuid 1.edx */
const u32 kvm_supported_word0_x86_features =
@@ -268,7 +272,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) |
F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLUSH) |
-   0 /* Reserved, DS, ACPI */ | F(MMX) |
+   f_ds /* Reserved, ACPI */ | F(MMX) |
F(FXSR) | F(XMM) | F(XMM2) | F(SELFSNOOP) |
0 /* HTT, TM, Reserved, PBE */;
/* cpuid 0x8001.edx */
@@ -283,10

[PATCH 2/4] perf: Allow guest PEBS for KVM owned counters

2014-05-29 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

Currently perf unconditionally disables PEBS for guest.

Now that we have the infrastructure in place to handle
it we can allow it for KVM owned guest events. For
that, perf needs to know that an event is owned by
a guest. Add a new state bit in the perf_event for that.

The bit is only set by KVM and cannot be selected
by anyone else.

Then change the MSR entry/exit list to allow
PEBS for these counters.

Signed-off-by: Andi Kleen a...@linux.intel.com
---
 arch/x86/kernel/cpu/perf_event.h   |  1 +
 arch/x86/kernel/cpu/perf_event_intel.c | 14 +++---
 arch/x86/kvm/pmu.c |  1 +
 include/linux/perf_event.h | 15 ++-
 kernel/events/core.c   |  7 ---
 5 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 6ab8fdd..422bca5 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -163,6 +163,7 @@ struct cpu_hw_events {
 */
u64 intel_ctrl_guest_mask;
u64 intel_ctrl_host_mask;
+	u64			intel_ctrl_guest_owned;
	struct perf_guest_switch_msr	guest_switch_msrs[X86_PMC_IDX_MAX];
 
/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index 86ccb81..3bcfda0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1202,6 +1202,7 @@ static void intel_pmu_disable_event(struct perf_event 
*event)
 
	cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
+	cpuc->intel_ctrl_guest_owned &= ~(1ull << hwc->idx);
	cpuc->intel_cp_status &= ~(1ull << hwc->idx);
 
/*
@@ -1274,6 +1275,8 @@ static void intel_pmu_enable_event(struct perf_event 
*event)
 
	if (event->attr.exclude_host)
		cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
+	if (event->guest_owned)
+		cpuc->intel_ctrl_guest_owned |= (1ull << hwc->idx);
	if (event->attr.exclude_guest)
		cpuc->intel_ctrl_host_mask |= (1ull << hwc->idx);
 
@@ -1775,18 +1778,23 @@ static struct perf_guest_switch_msr 
*intel_guest_get_msrs(int *nr)
 {
	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
	struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
+   u64 mask;
 
	arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
	arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
	arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
+
+	arr[1].msr = MSR_IA32_PEBS_ENABLE;
+	arr[1].host = cpuc->pebs_enabled;
/*
+* For PEBS virtualization only allow guest owned counters.
+*
 * If PMU counter has PEBS enabled it is not enough to disable counter
 * on a guest entry since PEBS memory write can overshoot guest entry
 * and corrupt guest memory. Disabling PEBS solves the problem.
 */
-	arr[1].msr = MSR_IA32_PEBS_ENABLE;
-	arr[1].host = cpuc->pebs_enabled;
-	arr[1].guest = 0;
+	mask = cpuc->intel_ctrl_guest_owned;
+	arr[1].guest = cpuc->pebs_enabled & (mask | (mask << 32));
 
*nr = 2;
return arr;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 5c4f631..4c6f417 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -188,6 +188,7 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
PTR_ERR(event));
return;
}
+	event->guest_owned = true;

	pmc->perf_event = event;
	clear_bit(pmc->idx, (unsigned long *)&pmc->vcpu->arch.pmu.reprogram_pmi);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3356abc..ad2b3f6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -437,6 +437,8 @@ struct perf_event {
int cgrp_defer_enabled;
 #endif
 
+	bool			guest_owned;	/* Owned by a guest */
+
 #endif /* CONFIG_PERF_EVENTS */
 };
 
@@ -550,11 +552,22 @@ extern int perf_event_refresh(struct perf_event *event, 
int refresh);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
+__perf_event_create_kernel_counter(struct perf_event_attr *attr,
+   int cpu,
+   struct task_struct *task,
+   perf_overflow_handler_t callback,
+   void *context, bool guest_owned);
+static inline struct perf_event *
 perf_event_create_kernel_counter(struct perf_event_attr *attr,
int cpu,
struct task_struct *task

[PATCH 3/4] perf: Handle guest PEBS events with a fake event

2014-05-29 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

With PEBS virtualization the PEBS record gets delivered to the guest,
but the host sees the PMI. This would normally result in a spurious
PEBS PMI that is ignored. But we need to inject the PMI into the guest,
so that the guest PMI handler can handle the PEBS record.

Check for this case in the perf PEBS handler.  When any guest PEBS
counters are active, always check the counters explicitly for
overflow. If a guest PEBS counter overflowed, trigger a fake event. The
fake event results in calling the KVM PMI callback, which injects
the PMI into the guest. The guest handler then retrieves the correct
information from its own PEBS record and the guest state.

Note: in very rare cases with exotic events this may lead to spurious PMIs
in the guest.

Signed-off-by: Andi Kleen a...@linux.intel.com
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 49 +++
 1 file changed, 49 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c 
b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 29622a7..0267174 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -998,6 +998,53 @@ static void intel_pmu_drain_pebs_core(struct pt_regs 
*iregs)
__intel_pmu_pebs_event(event, iregs, at);
 }
 
+/*
+ * We may be running with virtualized PEBS, so the PEBS record
+ * was logged into the guest's DS and is invisible to us.
+ *
+ * For guest-owned counters we always have to check the counter
+ * and see if they are overflowed, because PEBS thresholds
+ * are not reported in the GLOBAL_STATUS.
+ *
+ * In this case just trigger a fake event for KVM to forward
+ * to the guest as PMI.  The guest will then see the real PEBS
+ * record and read the counter values.
+ *
+ * The contents of the event do not matter.
+ */
+static void intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
+   struct pt_regs *iregs)
+{
+   int bit;
+   struct perf_event *event;
+
+	if (!cpuc->intel_ctrl_guest_owned)
+		return;
+
+	for_each_set_bit(bit, (unsigned long *)&cpuc->intel_ctrl_guest_owned,
+			 x86_pmu.max_pebs_events) {
+		struct perf_sample_data data;
+		s64 count;
+		int shift;
+
+		event = cpuc->events[bit];
+		if (!event->attr.precise_ip)
+			continue;
+		rdpmcl(event->hw.event_base_rdpmc, count);
+
+		/* sign extend */
+		shift = 64 - x86_pmu.cntval_bits;
+		count = ((s64)((u64)count << shift)) >> shift;
+
+		if (count < 0)
+			continue;
+
+		perf_sample_data_init(&data, 0, event->hw.last_period);
+		if (perf_event_overflow(event, &data, iregs))
+			x86_pmu_stop(event, 0);
+	}
+   }
+}
+
 static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
 {
	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -1010,6 +1057,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs 
*iregs)
if (!x86_pmu.pebs_active)
return;
 
+   intel_pmu_handle_guest_pebs(cpuc, iregs);
+
	at  = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
 
-- 
1.9.0

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/14] asmlinkage, kvm: Make kvm_rebooting visible

2014-02-07 Thread Andi Kleen
kvm_rebooting is referenced from assembler code, thus
needs to be visible.

Cc: g...@redhat.com
Cc: pbonz...@redhat.com
Cc: kvm@vger.kernel.org
Signed-off-by: Andi Kleen a...@linux.intel.com
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 03a0381..b5ec7fb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -102,7 +102,7 @@ static void kvm_release_pfn_dirty(pfn_t pfn);
 static void mark_page_dirty_in_slot(struct kvm *kvm,
struct kvm_memory_slot *memslot, gfn_t gfn);
 
-bool kvm_rebooting;
+__visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
 
 static bool largepages_enabled = true;
-- 
1.8.5.2



[PATCH 01/11] asmlinkage, kvm: Make kvm_rebooting visible

2013-10-22 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

kvm_rebooting is referenced from assembler code, thus
needs to be visible.

Cc: g...@redhat.com
Cc: pbonz...@redhat.com
Cc: kvm@vger.kernel.org
Signed-off-by: Andi Kleen a...@linux.intel.com
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a9dd682..6ca3564 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -95,7 +95,7 @@ static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 
-bool kvm_rebooting;
+__visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
 
 static bool largepages_enabled = true;
-- 
1.8.3.1



Re: [PATCH] kvm: Make kvm_rebooting visible

2013-08-07 Thread Andi Kleen
 
 Should this go into 3.11 too?  Or it never worked?

It's ok to keep it for .12. It was broken since it was merged,
but normal builds don't trigger the problem.

-andi


[PATCH] kvm: Make kvm_rebooting visible

2013-08-05 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

kvm_rebooting is referenced from assembler code, thus
needs to be visible.

Signed-off-by: Andi Kleen a...@linux.intel.com
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1580dd4..eff6abd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -95,7 +95,7 @@ static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 
-bool kvm_rebooting;
+__visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
 
 static bool largepages_enabled = true;
-- 
1.8.3.1



Re: [PATCH] kvm: Make kvm_rebooting visible

2013-08-05 Thread Andi Kleen
   static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 
 -bool kvm_rebooting;
 +__visible bool kvm_rebooting;
   EXPORT_SYMBOL_GPL(kvm_rebooting);
 
 How many of these are there kernel wide?

Not very many (at least on x86 allyesconfig), ~10.
Also most users are not exported.

Probably not worth its own macro.

 
 Could you do something like this instead:
 
 DEFINE_AND_EXPORT_GPL(bool, kvm_rebooting);
 
 The definition of DEFINE_AND_EXPORT_GPL(_type, _name) is left as an
 exercise for the reader.

I actually had EXPORT_SYMBOL make things always visible for a long time,
but it prevents optimizing away unused code in very small
non modular configurations. So I switched to separate annotations.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


[PATCH] perf, kvm: Support the in_tx/in_tx_cp modifiers in KVM arch perfmon emulation v5

2013-07-18 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

[KVM maintainers:
The underlying support for this is in perf/core now. So please merge
this patch into the KVM tree.]

This is not arch perfmon, but older CPUs will just ignore it. This makes
it possible to do at least some TSX measurements from a KVM guest.

v2: Various fixes to address review feedback
v3: Ignore the bits when no CPUID. No #GP. Force raw events with TSX bits.
v4: Use reserved bits for #GP
v5: Remove obsolete argument
Acked-by: Gleb Natapov g...@redhat.com
Signed-off-by: Andi Kleen a...@linux.intel.com
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/pmu.c  | 25 -
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index af9c552..01493a1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -315,6 +315,7 @@ struct kvm_pmu {
u64 global_ovf_ctrl;
u64 counter_bitmask[2];
u64 global_ctrl_mask;
+   u64 reserved_bits;
u8 version;
struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index c53e797..5c4f631 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -160,7 +160,7 @@ static void stop_counter(struct kvm_pmc *pmc)
 
 static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
unsigned config, bool exclude_user, bool exclude_kernel,
-   bool intr)
+   bool intr, bool in_tx, bool in_tx_cp)
 {
struct perf_event *event;
struct perf_event_attr attr = {
@@ -173,6 +173,10 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
.exclude_kernel = exclude_kernel,
.config = config,
};
+   if (in_tx)
+   attr.config |= HSW_IN_TX;
+   if (in_tx_cp)
+   attr.config |= HSW_IN_TX_CHECKPOINTED;
 
	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
 
@@ -226,7 +230,9 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
 
 	if (!(eventsel & (ARCH_PERFMON_EVENTSEL_EDGE |
 			ARCH_PERFMON_EVENTSEL_INV |
-			ARCH_PERFMON_EVENTSEL_CMASK))) {
+			ARCH_PERFMON_EVENTSEL_CMASK |
+			HSW_IN_TX |
+			HSW_IN_TX_CHECKPOINTED))) {
 		config = find_arch_event(&pmc->vcpu->arch.pmu, event_select,
 				unit_mask);
if (config != PERF_COUNT_HW_MAX)
@@ -239,7 +245,9 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
 	reprogram_counter(pmc, type, config,
 			!(eventsel & ARCH_PERFMON_EVENTSEL_USR),
 			!(eventsel & ARCH_PERFMON_EVENTSEL_OS),
-			eventsel & ARCH_PERFMON_EVENTSEL_INT);
+			eventsel & ARCH_PERFMON_EVENTSEL_INT,
+			(eventsel & HSW_IN_TX),
+			(eventsel & HSW_IN_TX_CHECKPOINTED));
 }
 
 static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
@@ -256,7 +264,7 @@ static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
 		arch_events[fixed_pmc_events[idx]].event_type,
 		!(en & 0x2), /* exclude user */
 		!(en & 0x1), /* exclude kernel */
-		pmi);
+		pmi, false, false);
 }
 
 static inline u8 fixed_en_pmi(u64 ctrl, int idx)
@@ -408,7 +416,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	} else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
 		if (data == pmc->eventsel)
 			return 0;
-		if (!(data & 0xffffffff00200000ull)) {
+		if (!(data & pmu->reserved_bits)) {
reprogram_gp_counter(pmc, data);
return 0;
}
@@ -450,6 +458,7 @@ void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
 	pmu->counter_bitmask[KVM_PMC_GP] = 0;
 	pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
 	pmu->version = 0;
+	pmu->reserved_bits = 0xffffffff00200000ull;
 
entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -478,6 +487,12 @@ void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
 	pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
 		(((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED);
 	pmu->global_ctrl_mask = ~pmu->global_ctrl;
+
+	entry = kvm_find_cpuid_entry(vcpu, 7, 0);
+	if (entry &&
+	    (boot_cpu_has(X86_FEATURE_HLE) || boot_cpu_has(X86_FEATURE_RTM)) &&
+	    (entry->ebx & (X86_FEATURE_HLE|X86_FEATURE_RTM)))
+		pmu->reserved_bits ^= HSW_IN_TX|HSW_IN_TX_CHECKPOINTED;

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-01 Thread Andi Kleen

FWIW I use the paravirt spinlock ops for adding lock elision
to the spinlocks.

This needs to be done at the top level (so the level you're removing)

However I don't like the pv mechanism very much and would 
be fine with using a static key hook in the main path
like I do for all the other lock types.

It also uses interrupt ops patching; for that it would 
still be needed though.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-01 Thread Andi Kleen
On Sat, Jun 01, 2013 at 01:28:00PM -0700, Jeremy Fitzhardinge wrote:
 On 06/01/2013 01:14 PM, Andi Kleen wrote:
  FWIW I use the paravirt spinlock ops for adding lock elision
  to the spinlocks.
 
 Does lock elision still use the ticketlock algorithm/structure, or are
 they different?  If they're still basically ticketlocks, then it seems
 to me that they're complimentary - hle handles the fastpath, and pv the
 slowpath.

It uses the ticketlock algorithm/structure, but:

- it needs to know that the lock is free with an own operation
- it has an additional field for storing adaptation state
(but that field is independent of the low level lock implementation,
so can be used with any kind of lock)

So currently it inlines the ticket lock code into its own.

Doing pv on the slow path would be possible, but would need
some additional (minor) hooks I think.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Andi Kleen
Rik van Riel r...@redhat.com writes:

 If we always incremented the ticket number by 2 (instead of 1), then
 we could use the lower bit of the ticket number as the spinlock.

Spinning on a single bit is very inefficient, as you need to do
try lock in a loop which is very unfriendly to the MESI state protocol.
It's much better to have at least three states and allow
spinning-while-reading-only.

This is typically very visible on systems with 2S.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 0/3] Prepare kvm for lto

2012-09-13 Thread Andi Kleen
On Thu, Sep 13, 2012 at 11:27:43AM +0300, Avi Kivity wrote:
 On 09/12/2012 10:17 PM, Andi Kleen wrote:
  On Wed, Sep 12, 2012 at 05:50:41PM +0300, Avi Kivity wrote:
  vmx.c has an lto-unfriendly bit, fix it up.
  
  While there, clean up our asm code.
  
  Avi Kivity (3):
KVM: VMX: Make lto-friendly
KVM: VMX: Make use of asm.h
KVM: SVM: Make use of asm.h
  
  Works for me in my LTO build, thanks Avi.
  I cannot guarantee I always hit the unit splitting case, but it looks
  good so far.
 
 Actually I think patch 1 is missing a .global vmx_return.

Ok can you add it please? It always depends how the LTO partitioner
decides to split the subunits.

I can run it with randomconfig in a loop overnight. That's the best way I know
to try to cover these cases.

-Andi


Re: [PATCH 0/3] Prepare kvm for lto

2012-09-12 Thread Andi Kleen
On Wed, Sep 12, 2012 at 05:50:41PM +0300, Avi Kivity wrote:
 vmx.c has an lto-unfriendly bit, fix it up.
 
 While there, clean up our asm code.
 
 Avi Kivity (3):
   KVM: VMX: Make lto-friendly
   KVM: VMX: Make use of asm.h
   KVM: SVM: Make use of asm.h

Works for me in my LTO build, thanks Avi.
I cannot guarantee I always hit the unit splitting case, but it looks
good so far.

I replaced my patches with yours.

Acked-by: Andi Kleen a...@linux.intel.com

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 37/74] lto, KVM: Don't assume asm statements end up in the same assembler file

2012-08-19 Thread Andi Kleen
 The reason we use a local label is so that the function isn't split
 into two from the profiler's point of view.  See cd2276a795b013d1.

Hmm that commit message is not very enlightening.

The goal was to force a compiler error?

With LTO there is no way to force two functions be in the same assembler
file. The partitioner is always allowed to split.

 
 One way to fix this is to have a .data variable initialized to point to
 .Lkvm_vmx_return (this can be done from the same asm statement in
 vmx_vcpu_run), and reference that variable in
 vmx_set_constant_host_state().  If no one comes up with a better idea,
 I'll write a patch doing this.

I'm not clear how that is better than my patch.

-andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 37/74] lto, KVM: Don't assume asm statements end up in the same assembler file

2012-08-19 Thread Andi Kleen
On Sun, Aug 19, 2012 at 06:12:57PM +0300, Avi Kivity wrote:
 On 08/19/2012 06:09 PM, Andi Kleen wrote:
  The reason we use a local label is so that the function isn't split
  into two from the profiler's point of view.  See cd2276a795b013d1.
  
  Hmm that commit message is not very enlightening.
  
  The goal was to force a compiler error?
 
 No, the goal was to avoid a global label in the middle of a function.
 The profiler interprets it as a new function.  After your patch,

Ah got it now. I always used to have the same problem with sys_call_return.

I wonder if there shouldn't be a way to tell perf to ignore a symbol.

  
  One way to fix this is to have a .data variable initialized to point to
  .Lkvm_vmx_return (this can be done from the same asm statement in
  vmx_vcpu_run), and reference that variable in
  vmx_set_constant_host_state().  If no one comes up with a better idea,
  I'll write a patch doing this.
  
  I'm not clear how that is better than my patch.
 
 My patch will not generate the artifact with kvm_vmx_return.

Ok fine for me. I'll keep this patch for now, until you have
something better.

-Andi


-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-03-30 Thread Andi Kleen
 So if a guest exits due to an external event it's easy to inspect the
 state of that guest and avoid to schedule away when it was interrupted
 in a spinlock held section. That guest/host shared state needs to be

On a large system under high contention sleeping can perform surprisingly
well. pthread mutexes have a tendency to beat kernel spinlocks there.
So avoiding sleeping locks completely (that is what pv locks are
essentially) is not necessarily that good.

Your proposal is probably only a good idea on low contention
and relatively small systems.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-14 Thread Andi Kleen
On Wed, Sep 14, 2011 at 10:00:07AM +0300, Avi Kivity wrote:
 On 09/13/2011 10:21 PM, Don Zickus wrote:
 Or are you saying an NMI in an idle system will have the same %rip thus
 falsely detecting a back-to-back NMI?
 
 
 
 That's easy to avoid - insert an instruction zeroing the last nmi_rip 
 somewhere before or after hlt.  It's always okay to execute such an 
 instruction (outside the nmi handler itself), since nmi_rip is meant to 
 detect a no instructions executed condition.

At least for classic hlt there is no simple "after hlt", because it's all
interrupt handlers and exceptions and everything else that can interrupt,
combined.

It may work with newer MWAIT.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-14 Thread Andi Kleen
 If an NMI hits in an interrupt handler, or in the after hlt section 
 before the write-to-last-nmi-rip, then we'll see that %rip has changed.  
 If it hits after the write-to-last-nmi-rip instruction (or in the hlt 
 itself), then we'll also see that %rip has changed, due to the effect of 
 that instruction.

It won't handle multiple NMIs in halt. I assume that's reasonably common.

-Andi


Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-14 Thread Andi Kleen
On Wed, Sep 14, 2011 at 10:26:21PM +0300, Avi Kivity wrote:
 On 09/14/2011 08:28 PM, Andi Kleen wrote:
   If an NMI hits in an interrupt handler, or in the after hlt section
   before the write-to-last-nmi-rip, then we'll see that %rip has changed.
   If it hits after the write-to-last-nmi-rip instruction (or in the hlt
   itself), then we'll also see that %rip has changed, due to the effect of
   that instruction.
 
 It won't handle multiple NMIs in halt. I assume that's reasonable common.
 
 
 Why not?

They all have the same original RIPs and there is no way to distinguish
them.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-13 Thread Andi Kleen
 So I got around to implementing this and it seems to work great.  The back
 to back NMIs are detected properly using the %rip and that info is passed to
 the NMI notifier.  That info is used to determine if only the first
 handler to report 'handled' is executed or _all_ the handlers are
 executed.
 
 I think all the 'unknown' NMIs I generated with various perf runs have
 disappeared.  I'll post a new version of my nmi notifier rewrite soon.

This will fail when the system is idle.

-Andi



Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-13 Thread Andi Kleen
 Or are you saying an NMI in an idle system will have the same %rip thus
 falsely detecting a back-to-back NMI?

Yup.

Another problem is very long running instructions, like WBINVD and some others.
If there's a high frequency NMI it may well hit multiple times in a single
 instance.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-13 Thread Andi Kleen
On Tue, Sep 13, 2011 at 04:53:18PM -0400, Don Zickus wrote:
 On Tue, Sep 13, 2011 at 09:58:38PM +0200, Andi Kleen wrote:
   Or are you saying an NMI in an idle system will have the same %rip thus
   falsely detecting a back-to-back NMI?
  
  Yup.
 
 Hmm.  That sucks.  Is there another register that can be used in
 conjunction to seperate it, like sp or something?  Or we can we assume

Not that I know of.

 than an idle cpu isn't doing much for local NMI IPIs and that the only
 NMIs that would interrupt it would be external ones?

There can be still activity on the idle system, e.g. SMM or some 
Hypervisor in the background. If you profile events those might trigger 
samples, but they will be all accounted to the MWAIT.

  Another problem is very long running instructions, like WBINVD and some 
  others.
  If there's a high frequency NMI it may well hit multiple times in a single
   instance.
 
 I thought NMIs happen on instruction boundaries, maybe not.

Yes, but there may be multiple queued up when the instruction has finished
executing. So you get multiple at the boundary.

 Honestly, our current implementation would probably be tripped up with
 those examples too, so I don't think I am making things worse (assuming
 the only high frequency NMI is coming from perf).

I wish perf/oprofile would just stop using NMIs.  The interrupt off regions
are getting smaller and smaller and they can be profiled in a limited way
using PEBS anyways. Or maybe make it a knob with default to off.

-Andi



Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-02 Thread Andi Kleen

 But, erm, does that even make sense?  I'm assuming the NMI reason port
 tells the CPU why it got an NMI.  If multiple CPUs can get NMIs and
 there's only a single reason port, then doesn't that mean that either 1)
 they all got the NMI for the same reason, or 2) having a single port is
 inherently racy?  How does the locking actually work there?

All the code to determine NMI reasons is inherently racy,
and each new NMI user makes it worse.

-Andi



Re: [RFC PATCH 3/3] KVM: MMU: Optimize guest page table walk

2011-05-01 Thread Andi Kleen

 Ok it looks like the 32bit kernel only handles 1/2/4. Maybe that
 was the problem if you ran on 32bit.

 I'm happy with a slower copy_from_user() for that particular case.

It wouldn't be hard to fix.

-Andi




Re: [RFC PATCH 3/3] KVM: MMU: Optimize guest page table walk

2011-04-29 Thread Andi Kleen
 Only I can guess for that reason is the reduction of some function calls
 by inlining some functions.

Yes, at one time cfu was inline too and just checked for the right
sizes and then used g*u, but it got lost in the "icache over everything
else" mania which has unfortunately been en vogue for quite some time in
kernel land (aka measuring patches only by their impact on .text size,
not the actual performance).

But you might get better gains by fixing this general
c*u() regression.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [RFC PATCH 3/3] KVM: MMU: Optimize guest page table walk

2011-04-29 Thread Andi Kleen
 Do you think the following case would not differ so much
 from (1' 2') ?
 
 walk_addr_generic()  ---1''
   copy_from_user()   ---2''

Yes it should be the same and is cleaner.

If you do a make .../foo.i and look at the code coming out of the
preprocessor you'll see it expands to a 

if (!__builtin_constant_p(size))
	return copy_user_generic(dst, (__force void *)src, size);
switch (size) {
case 1:	__get_user_asm(*(u8 *)dst, (u8 __user *)src,
		       ret, "b", "b", "=q", 1);
	return ret;
case 2: ..
case 4: ..
case 8: ..
case 10: ..
case 16: ..
}

Ok it looks like the 32bit kernel only handles 1/2/4. Maybe that
was the problem if you ran on 32bit.

-Andi


Re: [RFC PATCH 3/3] KVM: MMU: Optimize guest page table walk

2011-04-28 Thread Andi Kleen
Avi Kivity a...@redhat.com writes:

 Good optimization.  copy_from_user() really isn't optimized for short
 buffers, I expect much of the improvement comes from that.

Actually it is equivalent to get_user for the lengths supported by
get_user, assuming you pass in a constant length. You probably do not.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: EuroSec'11 Presentation

2011-04-13 Thread Andi Kleen
Avi Kivity a...@redhat.com writes:

 With EPT or NPT you cannot detect if a page is read only.

Why not? You can always walk the page tables manually again.

 Furthermore, at least Linux (without highmem) maps all of memory with
 a read/write mapping in addition to the per-process mapping, so no
 page is read-only.

Even with 32bit highmem most memory will be eventually mapped writable by
kmap when. There's currently no concept of a ro-kmap. However I suspect
it wouldn't be too hard to add one.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] kvm: add the __noclone attribute

2011-02-11 Thread Andi Kleen
On Fri, Feb 11, 2011 at 02:29:40PM +0800, Lai Jiangshan wrote:
 The changelog of 104f226 said "adds the __noclone attribute",
 but it was missing in its patch. I think it is still needed.

Looks good. Thanks.

Acked-by: Andi Kleen a...@linux.intel.com

-Andi


Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-27 Thread Andi Kleen
I personally would consider it cleaner to have clearly
defined wrappers instead of complicted flags in the caller.

 The number of args to these functions is getting nutty - you'll
 probably find that it is beneficial to inline these wrapper functions, if
 the number of callsites is small.

Really the functions are so heavyweight it should not matter.
The problem with inlining is that you end up with the code in
the header file and I personally find that much harder to browse
instead of having everything in one file.

Also longer term we'll get compilers that can do cross-file inlining
for optimized builds.

So please better avoid these kinds of micro optimizations unless
it's a really really extremly speed critical path.

Thanks,

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-27 Thread Andi Kleen
  Also longer term we'll get compilers that can do cross-file inlining
  for optimized builds.
 
 Which we'll probably need to turn off all over the place :(

Why? 

 
  So please better avoid these kinds of micro optimizations unless
  it's a really really extremly speed critical path.
 
 It's not just speed and it's not just .text size, either. Calling a
 ten-arg function consumes stack space.

gcc is pretty good at handling tail calls. Take a look at the code
it generates.


-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: QEMU, MCE, unpoison memory address across reboot

2010-12-23 Thread Andi Kleen

 Can't you free and reallocate all guest memory instead, on reboot, if
 there's a hwpoisoned page? Then you don't need this interface.

I think that would be more efficient. You can potentially save a lot
of memory if the new guest doesn't need as much as the old one.


-Andi



Re: [PATCH] KVM: VMX: Fold __vmx_vcpu_run() into vmx_vcpu_run()

2010-11-18 Thread Andi Kleen

On 11/18/2010 1:17 PM, Avi Kivity wrote:

cea15c2 (KVM: Move KVM context switch into own function) split vmx_vcpu_run()
to prevent multiple copies of the context switch from being generated (causing
problems due to a label).  This patch folds them back together again and adds
the __noclone attribute to prevent the label from being duplicated.


That won't work on gcc versions that didn't have __noclone yet.
Noclone is fairly recent (4.5 or 4.4).

-Andi



Re: [PATCH] KVM: VMX: Fold __vmx_vcpu_run() into vmx_vcpu_run()

2010-11-18 Thread Andi Kleen

On 11/18/2010 3:32 PM, Avi Kivity wrote:

On 11/18/2010 03:48 PM, Andi Kleen wrote:

On 11/18/2010 1:17 PM, Avi Kivity wrote:
cea15c2 (KVM: Move KVM context switch into own function) split 
vmx_vcpu_run()
to prevent multiple copies of the context switch from being 
generated (causing
problems due to a label).  This patch folds them back together again 
and adds

the __noclone attribute to prevent the label from being duplicated.


That won't work on gcc versions that didn't have __noclone yet.
Noclone is fairly recent (4.5 or 4.4).


Are the gcc versions that don't have noclone susceptible to cloning?


I believe the problem can happen due to inlining already.

-Andi




Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-15 Thread Andi Kleen
 The issue of d) is that there are multiple ways to inject MCE: now one
 software based, one APEI based, and maybe some others in the future.
 They all use different interfaces. And as debug interfaces, they are not
 considered kernel ABI either (some are in debugfs). So I think it is better
 to use these ABIs only in some test suite.

In some cases the injection may be also through external hardware
debugging mechanisms. So yes requiring that in qemu isn't a good
idea.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-10 Thread Andi Kleen
 We need host kernel to break down the 2M huge page into 4k pages. Then
 send SIGBUS to QEMU with the poisoned 4k page. Because host kernel will
 poison the whole 2M virtual address space otherwise, and other 4k pages
 inside the 2M page can not used accessed in guest (will trigger SIGBUS
 and SRAR MCE).

The easiest way would be to port the respective code from Andrea's 
transparent hugetlb patchkit. It already does break down huge pages
as needed. You just have to be careful to not touch (/copy) the corrupted
subpage.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-10 Thread Andi Kleen
 Doing it in userspace in easier, since we can replace the vma for
 that section (and avoid mixed 4k/2M pages in hugetlbfs).

You can't do that today, there's no way currently to access the non corrupted
portion of the 2MB page. Once it's poisoned it's all gone.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-10 Thread Andi Kleen
On Wed, Nov 10, 2010 at 07:47:11PM +0200, Avi Kivity wrote:
 On 11/10/2010 07:44 PM, Andi Kleen wrote:
   Doing it in userspace in easier, since we can replace the vma for
   that section (and avoid mixed 4k/2M pages in hugetlbfs).
 
 You can't do that today, there's no way currently to access the non corrupted
 portion of the 2MB page. Once it's poisoned it's all gone.
 
 I see.  Thanks.

BTW my understanding is that KVM will only use hugepages with
transparent huge pages anyways, correct?

So it may be reasonable to simply fix this as part of the transparent
hugepages work, but ignore it before that.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: fyi: gcc33-hammer crashes when compiling kvm emulate.c

2010-10-26 Thread Andi Kleen
 We have said 3.4 minimum for x86 for a long time now, and have an RFC

Ok makes sense. I thought it was still at 3.3. I should retire
this 3.3 fossil anyways, it's really only for old compat testing.

I don't remember seeing a warning -- aren't there supposed to be warnings
for unsupported compilers?

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: fyi: gcc33-hammer crashes when compiling kvm emulate.c

2010-10-26 Thread Andi Kleen
 Not unless they are actively known to break.  People get huffy about it

Well they do -- i just found out.

 because even if it is known to have problems it doesn't break *their*
 particular configuration.  I'm getting to be of the opinion that people
 who compile modern kernels with ancient compilers and expect it to work
 are suffering from some particular kind of insanity -- it's nothing the
 distros do.  The only exception are embedded people who compile with the
 latest 3.4 gcc; they have explained they do so because newer gccs have
 too many dependencies (the actual compiler, not the generated code) and
 for speed.

At least in the old days the main reason for gcc 3 was build speed.
AKPM and some others used to be fond of that.

3.x is apparently much faster than 4.x

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: fyi: gcc33-hammer crashes when compiling kvm emulate.c

2010-10-26 Thread Andi Kleen
 That is an issue too, as 3.x does a lot fewer optimizations than 4.x.

Well to be fair the default -Os build disables most of the fancy stuff
(and the resulting code is often terrible) 

I guess it doesn't matter too much, at least not with the 
CONFIG_CC_OPTIMIZE_FOR_SIZE default.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


[PATCH] KVM: Move KVM context switch into own function

2010-10-20 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

gcc 4.5 with some special options is able to duplicate the VMX
context switch asm in vmx_vcpu_run(). This results in a compile error
because the inline asm sequence uses a non-local label. The non-local
label is needed because other code wants to set up the return address.

This patch moves the asm code into its own function and marks
it explicitly noinline to avoid this problem.

Better would be probably to just move it into an .S file.

The diff looks worse than the change really is, it's all just
code movement and no logic change.

Signed-off-by: Andi Kleen a...@linux.intel.com
---
 arch/x86/kvm/vmx.c |   63 +++
 1 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 49b25ee..935e33f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3922,33 +3922,17 @@ static void fixup_rmode_irq(struct vcpu_vmx *vmx)
 #define Q l
 #endif
 
-static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
+/* 
+ * We put this into a separate noinline function to prevent the compiler
+ * from duplicating the code. This is needed because this code
+ * uses non local labels that cannot be duplicated.
+ * Do not put any flow control into this function.
+ * Better would be to put this whole monstrosity into a .S file.
+ */
+static void noinline do_vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
-
-   /* Record the guest's net vcpu time for enforced NMI injections. */
-   if (unlikely(!cpu_has_virtual_nmis()  vmx-soft_vnmi_blocked))
-   vmx-entry_time = ktime_get();
-
-   /* Don't enter VMX if guest state is invalid, let the exit handler
-  start emulation until we arrive back to a valid state */
-   if (vmx-emulation_required  emulate_invalid_guest_state)
-   return;
-
-   if (test_bit(VCPU_REGS_RSP, (unsigned long *)vcpu-arch.regs_dirty))
-   vmcs_writel(GUEST_RSP, vcpu-arch.regs[VCPU_REGS_RSP]);
-   if (test_bit(VCPU_REGS_RIP, (unsigned long *)vcpu-arch.regs_dirty))
-   vmcs_writel(GUEST_RIP, vcpu-arch.regs[VCPU_REGS_RIP]);
-
-   /* When single-stepping over STI and MOV SS, we must clear the
-* corresponding interruptibility bits in the guest state. Otherwise
-* vmentry fails as it then expects bit 14 (BS) in pending debug
-* exceptions being set, but that's not correct for the guest debugging
-* case. */
-   if (vcpu-guest_debug  KVM_GUESTDBG_SINGLESTEP)
-   vmx_set_interrupt_shadow(vcpu, 0);
-
-   asm(
+   asm volatile(
/* Store host registers */
push %%Rdx; push %%Rbp;
push %%Rcx \n\t
@@ -4043,6 +4027,35 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
, r8, r9, r10, r11, r12, r13, r14, r15
 #endif
  );
+}
+
+static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   /* Record the guest's net vcpu time for enforced NMI injections. */
+   if (unlikely(!cpu_has_virtual_nmis()  vmx-soft_vnmi_blocked))
+   vmx-entry_time = ktime_get();
+
+   /* Don't enter VMX if guest state is invalid, let the exit handler
+  start emulation until we arrive back to a valid state */
+   if (vmx-emulation_required  emulate_invalid_guest_state)
+   return;
+
+   if (test_bit(VCPU_REGS_RSP, (unsigned long *)vcpu-arch.regs_dirty))
+   vmcs_writel(GUEST_RSP, vcpu-arch.regs[VCPU_REGS_RSP]);
+   if (test_bit(VCPU_REGS_RIP, (unsigned long *)vcpu-arch.regs_dirty))
+   vmcs_writel(GUEST_RIP, vcpu-arch.regs[VCPU_REGS_RIP]);
+
+   /* When single-stepping over STI and MOV SS, we must clear the
+* corresponding interruptibility bits in the guest state. Otherwise
+* vmentry fails as it then expects bit 14 (BS) in pending debug
+* exceptions being set, but that's not correct for the guest debugging
+* case. */
+   if (vcpu-guest_debug  KVM_GUESTDBG_SINGLESTEP)
+   vmx_set_interrupt_shadow(vcpu, 0);
+
+   do_vmx_vcpu_run(vcpu);
 
vcpu-arch.regs_avail = ~((1  VCPU_REGS_RIP) | (1  VCPU_REGS_RSP)
  | (1  VCPU_EXREG_PDPTR));
-- 
1.7.1



Re: [PATCH] KVM: Move KVM context switch into own function

2010-10-20 Thread Andi Kleen
On Wed, Oct 20, 2010 at 06:12:11PM +0200, Avi Kivity wrote:
  On 10/20/2010 05:56 PM, Andi Kleen wrote:
 From: Andi Kleen a...@linux.intel.com
 
 gcc 4.5 with some special options is able to duplicate the VMX
 context switch asm in vmx_vcpu_run(). This results in a compile error
 because the inline asm sequence uses a non-local label. The non-local
 label is needed because other code wants to set up the return address.
 
 
 Curious, any idea why gcc does that?  I guess it's the unlikely()
 near the top.

Don't know why, I wasn't motivated to go through its debugging dumps
for this. But strictly the problem could always happen,
not even a volatile stops that.

-Andi


Re: [PATCH 0/3][RFC] NUMA: add host side pinning

2010-06-24 Thread Andi Kleen
Anthony Liguori anth...@codemonkey.ws writes:

 If we extended integrated -mem-path with -numa such that a different
 path could be used with each numa node (and we let an explicit file be
 specified instead of just a directory), then if I understand
 correctly, we could use numactl without any specific integration in
 qemu.  Does this sound correct?

It's a bit tricky to coordinate because numactl policy only helps
before first fault (unless you want to migrate, but that has more
overhead), and if you run the numactl in parallel with qemu
you never know who faults first. So you would need another step to
precreate the files before starting qemu.

Another issue with using tmpfs this way is that you first need to resize
it to be larger than 0.5*RAM.  So more configuration hassle.

Overall it would be rather a lot of steps this way. I guess
most people would put it into a wrapper, but why not have
that wrapper in qemu directly? Supporting interleave too
would be rather straight forward.

Also a lot of things you could do with numactl on shm
can also be done after the fact with cpusets.

-Andi 

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 17/17] Add timekeeping documentation

2010-06-18 Thread Andi Kleen
 The point is about hotplug CPUs.  Any hotplugged CPU will not have a 
 perfectly synchronized TSC, ever, even on a single socket, single crystal 
 board.

hotplug was in the next section, not in this one.

Besides most systems do not support hotplug CPUs.

-Andi


Re: [PATCH 17/17] Add timekeeping documentation

2010-06-17 Thread Andi Kleen
Zachary Amsden zams...@redhat.com writes:

I think listing all the obscure bits in the PIT was an attempt to 
weed out the weak and weary readers early, right?

 +this as well.  Several hardware limitations make the problem worse - if it is
 +not possible to write the full 32-bits of the TSC, it may be impossible to
 +match the TSC in newly arriving CPUs to that of the rest of the system,
 +resulting in unsynchronized TSCs.  This may be done by BIOS or system 
 software,
 +but in practice, getting a perfectly synchronized TSC will not be possible
 +unless all values are read from the same clock, which generally only is
 +possible on single socket systems or those with special hardware
 +support.

That's not true, single crystal for all sockets is very common
as long as you only have a single motherboard.

Of course there might be other reasons why the TSC is unsynchronized
(e.g. stop count in C-states), but the single clock is not the problem.

 +3.4) TSC and C-states
 +
 +C-states, or idling states of the processor, especially C1E and deeper sleep
 +states may be problematic for TSC as well.  The TSC may stop advancing in 
 such
 +a state, resulting in a TSC which is behind that of other CPUs when execution
 +is resumed.  Such CPUs must be detected and flagged by the operating system
 +based on CPU and chipset identifications.
 +
 +The TSC in such a case may be corrected by catching it up to a known external
 +clocksource.

... This is fixed in recent CPUs ...

 +
 +3.5) TSC frequency change / P-states
 +
 +To make things slightly more interesting, some CPUs may change frequency.  They
 +may or may not run the TSC at the same rate, and because the frequency change
 +may be staggered or slewed, at some points in time, the TSC rate may not be
 +known other than falling within a range of values.  In this case, the TSC 
 will
 +not be a stable time source, and must be calibrated against a known, stable,
 +external clock to be a usable source of time.
 +
 +Whether the TSC runs at a constant rate or scales with the P-state is model
 +dependent and must be determined by inspecting CPUID, chipset or various MSR
 +fields.

... In general newer CPUs should not have problems with this anymore

 +
 +4) Virtualization Problems
 +
 +Timekeeping is especially problematic for virtualization because a number of
 +challenges arise.  The most obvious problem is that time is now shared 
 between
 +the host and, potentially, a number of virtual machines.  This happens
 +naturally on X86 systems when SMM mode is used by the BIOS, but not to such a
 +degree nor with such frequency.  However, the fact that SMM mode may cause

The SMM reference here seems at best odd. 

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()

2010-06-16 Thread Andi Kleen
On Tue, Jun 15, 2010 at 02:22:06PM +0300, Avi Kivity wrote:
 Too much duplication.  How about putting the tail end of the function in a 
 common helper (with an inatomic flag)?

 btw, is_hwpoison_address() is racy.  While it looks up the address, some 
 other task can unmap the page tables under us.

Where is is_hwpoison_address() coming from? I can't find it anywhere.

Anyways hwpoison will not remove the page as long as there
is a reference to it, so as long as you get the reference
race free against another task you're ok. Of course there might
always be an error in between and the hardware may poison
the data, but we don't try to handle all kernel errors.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()

2010-06-16 Thread Andi Kleen
 The page is fine, the page tables are not.  Another task can munmap() the 
 thing while is_hwpoison_address() is running.

Ok, that boils down to me not seeing that source.

If it accesses the page tables yes then it's racy. But whoever
looked up the page tables in the first place should have
returned an -EFAULT. There's no useful address attached
to poison.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


[PATCH] [19/23] KVM: Fix unused but set warnings

2010-06-10 Thread Andi Kleen

No real bugs in this one; the real bug I found is in a separate
patch.

Cc: a...@redhat.com
Cc: kvm@vger.kernel.org

Signed-off-by: Andi Kleen a...@linux.intel.com

---
 arch/x86/kvm/paging_tmpl.h |1 +
 arch/x86/kvm/vmx.c |3 +--
 virt/kvm/assigned-dev.c|2 --
 3 files changed, 2 insertions(+), 4 deletions(-)

Index: linux-2.6.35-rc2-gcc/arch/x86/kvm/paging_tmpl.h
===
--- linux-2.6.35-rc2-gcc.orig/arch/x86/kvm/paging_tmpl.h
+++ linux-2.6.35-rc2-gcc/arch/x86/kvm/paging_tmpl.h
@@ -442,6 +442,7 @@ static int FNAME(page_fault)(struct kvm_
kvm_mmu_free_some_pages(vcpu);
sptep = FNAME(fetch)(vcpu, addr, walker, user_fault, write_fault,
 level, write_pt, pfn);
+   (void)sptep;
pgprintk(%s: shadow pte %p %llx ptwrite %d\n, __func__,
 sptep, *sptep, write_pt);
 
Index: linux-2.6.35-rc2-gcc/arch/x86/kvm/vmx.c
===
--- linux-2.6.35-rc2-gcc.orig/arch/x86/kvm/vmx.c
+++ linux-2.6.35-rc2-gcc/arch/x86/kvm/vmx.c
@@ -1624,10 +1624,9 @@ static void enter_pmode(struct kvm_vcpu
 static gva_t rmode_tss_base(struct kvm *kvm)
 {
if (!kvm-arch.tss_addr) {
-   struct kvm_memslots *slots;
gfn_t base_gfn;
 
-   slots = kvm_memslots(kvm);
+   kvm_memslots(kvm);
base_gfn = kvm-memslots-memslots[0].base_gfn +
 kvm-memslots-memslots[0].npages - 3;
return base_gfn  PAGE_SHIFT;
Index: linux-2.6.35-rc2-gcc/virt/kvm/assigned-dev.c
===
--- linux-2.6.35-rc2-gcc.orig/virt/kvm/assigned-dev.c
+++ linux-2.6.35-rc2-gcc/virt/kvm/assigned-dev.c
@@ -58,12 +58,10 @@ static int find_index_from_host_irq(stru
 static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
 {
struct kvm_assigned_dev_kernel *assigned_dev;
-   struct kvm *kvm;
int i;
 
assigned_dev = container_of(work, struct kvm_assigned_dev_kernel,
interrupt_work);
-   kvm = assigned_dev-kvm;
 
spin_lock_irq(assigned_dev-assigned_dev_lock);
if (assigned_dev-irq_requested_type  KVM_DEV_IRQ_HOST_MSIX) {


[PATCH] [11/23] KVM: Fix KVM_SET_SIGNAL_MASK

2010-06-10 Thread Andi Kleen

Real bug fix.

When the user passes in a NULL mask, pass this on from the ioctl
handler.

Found by gcc 4.6's new warnings.

Cc: a...@redhat.com
Cc: kvm@vger.kernel.org

Signed-off-by: Andi Kleen a...@linux.intel.com

---
 virt/kvm/kvm_main.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.35-rc2-gcc/virt/kvm/kvm_main.c
===
--- linux-2.6.35-rc2-gcc.orig/virt/kvm/kvm_main.c
+++ linux-2.6.35-rc2-gcc/virt/kvm/kvm_main.c
@@ -1520,7 +1520,7 @@ out_free2:
goto out;
p = sigset;
}
-   r = kvm_vcpu_ioctl_set_sigmask(vcpu, sigset);
+   r = kvm_vcpu_ioctl_set_sigmask(vcpu, p);
break;
}
case KVM_GET_FPU: {


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-07 Thread Andi Kleen
Stephen Hemminger shemmin...@vyatta.com writes:

 Still not sure this is a good idea for a couple of reasons:

 1. We already have lots of special cases with skb's (frags and fraglist),
and skb's travel through a lot of different parts of the kernel.  So any
new change like this creates lots of exposed points for new bugs. Look
at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
and ppp and ...

 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
a fixed size pool in an external device, they can easily all get tied up
if you have a slow listener. What happens then?

3. If they come from an internal pool, what happens when the kernel runs 
low on memory? How is that pool balanced against other kernel
memory users?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-03 Thread Andi Kleen
On Thu, Jun 03, 2010 at 09:50:51AM +0530, Srivatsa Vaddagiri wrote:
 On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
  
  There are two separate problems: the more general problem is that
  the hypervisor can put a vcpu to sleep while holding a lock, causing
  other vcpus to spin until the end of their time slice.  This can
  only be addressed with hypervisor help.
 
 Fyi - I have a early patch ready to address this issue. Basically I am using
 host-kernel memory (mmap'ed into guest as io-memory via ivshmem driver) to 
 hint 
 host whenever guest is in spin-lock'ed section, which is read by host 
 scheduler 
 to defer preemption.

Looks like a nice, simple way to handle this for the kernel.

However I suspect user space will hit the same issue sooner
or later. I assume your way is not easily extensible to futexes?

 One pathological case where this may actually hurt is routines in guest like 
 flush_tlb_others_ipi() which take a spinlock and then enter a while() loop
 waiting for other cpus to ack something. In this case, deferring preemption 
 just
 because guest is in critical section actually hurts! Hopefully the upper bound
 for deferring preemtion and the fact that such routines may not be frequently
 hit should help alleviate such situations.

So do you defer during the whole spinlock region or just during the spin?

I assume the first?

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-03 Thread Andi Kleen
On Thu, Jun 03, 2010 at 12:06:39PM +0100, David Woodhouse wrote:
 On Tue, 2010-06-01 at 21:36 +0200, Andi Kleen wrote:
   Collecting the contention/usage statistics on a per spinlock
   basis seems complex.  I believe a practical approximation
   to this are adaptive mutexes where upon hitting a spin
   time threshold, punt and let the scheduler reconcile fairness.
  
  That would probably work, except: how do you get the
  adaptive spinlock into a paravirt op without slowing
  down a standard kernel?
 
 It only ever comes into play in the case where the spinlock is contended
 anyway -- surely it shouldn't be _that_ much of a performance issue?

The problem is fitting the state into the u32.

Also lightly contended is not that uncommon.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-03 Thread Andi Kleen
On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
 And they aren't even using ticket spinlocks!!

I suppose they simply don't have unfair memory. Makes things easier.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-03 Thread Andi Kleen
 That would certainly be a part of it, I'm sure they have stronger
 fairness and guarantees at the expense of some performance. We saw the
 spinlock starvation first on 8-16 core Opterons I think, whereas Altix
 had been over 1024 cores and POWER7 1024 threads now apparently without
 reported problems.

I suppose P7 handles that in the HV through the pvcall.

Altix AFAIK has special hardware for this in the interconnect, 
but as individual nodes get larger and have more cores you'll start
seeing it there too.

In general we now have the problem that with increasing core counts
per socket each NUMA node can be a fairly large SMP by itself
and several of the old SMP scalability problems that were fixed
by having per node datastructures are back now.

For example this is a serious problem with the zone locks in some
workloads now on 8core+HT systems.

 So I think actively enforcing fairness at the lock level would be
 required. Something like if it is detected that a core is not making

I suppose how that exactly works is IBM's secret sauce. Anyways
as long as there are no reports I wouldn't worry about it.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-02 Thread Andi Kleen
On Wed, Jun 02, 2010 at 05:51:14AM +0300, Avi Kivity wrote:
 On 06/01/2010 08:27 PM, Andi Kleen wrote:
 On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:

 We are running everything on NUMA (since all modern machines are now NUMA).
   At what scale do the issues become observable?
  
 On Intel platforms it's visible starting with 4 sockets.


 Can you recommend a benchmark that shows bad behaviour?  I'll run it with 

Pretty much anything with high lock contention.

 ticket spinlocks and Gleb's patch.  I have a 4-way Nehalem-EX, presumably 
 the huge number of threads will magnify the problem even more there.

Yes more threads cause more lock contention too.

 Do  you have any idea how we can tackle both problems?

Apparently Xen has something, perhaps that can be leveraged
(but I haven't looked at their solution in detail)

Otherwise I would probably try to start with a adaptive
spinlock that at some point calls into the HV (or updates
shared memory?), like john cooper suggested. The tricky part here would
be to find the thresholds and fit that state into
paravirt ops and the standard spinlock_t.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Andi Kleen
Gleb Natapov g...@redhat.com writes:

 The patch below allows to patch ticket spinlock code to behave similar to
 old unfair spinlock when hypervisor is detected. After patching unlocked

The question is what happens when you have a system with unfair
memory and you run the hypervisor on that. There it could be much worse.

Your new code would starve again, right?

There's a reason the ticket spinlocks were added in the first place.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Andi Kleen
On Tue, Jun 01, 2010 at 07:24:14PM +0300, Gleb Natapov wrote:
 On Tue, Jun 01, 2010 at 05:53:09PM +0200, Andi Kleen wrote:
  Gleb Natapov g...@redhat.com writes:
  
   The patch below allows to patch ticket spinlock code to behave similar to
   old unfair spinlock when hypervisor is detected. After patching unlocked
  
  The question is what happens when you have a system with unfair
  memory and you run the hypervisor on that. There it could be much worse.
  
 How much worse performance hit could be?

It depends on the workload. Overall it means that a contended
lock can have much higher latencies.

If you want to study some examples see the locking problems the
RT people have with their heavy weight mutex-spinlocks.

But the main problem is that in the worst case you 
can see extremly long stalls (upto a second has been observed),
which then turns in a correctness issue.
 
  Your new code would starve again, right?
  
 Yes, of course it may starve with unfair spinlock. Since vcpus are not
 always running there is much smaller chance then vcpu on remote memory
 node will starve forever. Old kernels with unfair spinlocks are running
 fine in VMs on NUMA machines with various loads.

Try it on a NUMA system with unfair memory.

  There's a reason the ticket spinlocks were added in the first place.
  
 I understand that reason and do not propose to get back to old spinlock
 on physical HW! But with virtualization performance hit is unbearable.

Extreme unfairness can be unbearable too.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Andi Kleen
On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:
 We are running everything on NUMA (since all modern machines are now NUMA). 
  At what scale do the issues become observable?

On Intel platforms it's visible starting with 4 sockets.


 I understand that reason and do not propose to get back to old spinlock
 on physical HW! But with virtualization performance hit is unbearable.
  
 Extreme unfairness can be unbearable too.


 Well, the question is what happens first.  In our experience, vcpu 
 overcommit is a lot more painful.  People will never see the NUMA 
 unfairness issue if they can't use kvm due to the vcpu overcommit problem.

You really have to address both, if you don't fix them both
users will eventually run into one of them and be unhappy.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Andi Kleen
 Collecting the contention/usage statistics on a per spinlock
 basis seems complex.  I believe a practical approximation
 to this are adaptive mutexes where upon hitting a spin
 time threshold, punt and let the scheduler reconcile fairness.

That would probably work, except: how do you get the
adaptive spinlock into a paravirt op without slowing
down a standard kernel?

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [Qemu-devel] Suggested Parameters for SLES 10 64-bit

2010-05-18 Thread Andi Kleen
Peter Lieven p...@dlh.net writes:

 Starting mail service (Postfix)
 NMI Watchdog detected LOCKUP on CPU 0

You could simply turn off the NMI watchdog (nmi_watchdog=0 at the kernel
command line) 
 
Perhaps the PMU emulation is not complete and nmi watchdog
needs PMU. It's not really needed normally.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: KVM warning about uncertified CPU for SMP for AMD model 2, stepping 3

2010-03-31 Thread Andi Kleen
 I guess these warnings could be just disabled. With nearly everyone
 using multi-core these days they are kind of obsolete anyways.
 Well, the warning refers to an old single-core only CPU model. Most of 
 those were able to run in SMP boards, but only a subset of them was 
 officially certified to do so (Athlon-MP instead of Athlon-XP).

Thanks for pointing out the obvious.

 To avoid complaints about instability of such systems, the warning was 
 introduced. If you consider these systems still supported, I wouldn't 
 disable this warning, at least the check should be disabled only if the 
 hypervisor CPUID bit is set.

The point was that everyone can have a multi-core system for cheap
these days, no real motivation for anyone anymore to use old unsupported
configurations to get one.

-Andi


Re: [PATCH][RFC] x86: remove SMP check/taint for AMD K7 (was Re: KVM warning about uncertified CPU for SMP for AMD model 2, stepping 3)

2010-03-31 Thread Andi Kleen
On Wed, Mar 31, 2010 at 10:15:28AM +0200, Jiri Kosina wrote:
 On Wed, 31 Mar 2010, Andi Kleen wrote:
 
   booting 32bit guest on 32bit host on AMD system gives me the following 
   warning when KVM is instructed to boot as SMP:
  
  I guess these warnings could be just disabled. With nearly everyone
  using multi-core these days they are kind of obsolete anyways.
 
 Why is it there for K7 only anyway? Was that the only product line which 
 had instances which were explicitly marked as unsuitable for SMP by AMD?

Later ones were fused, so even if you wanted to you couldn't cheat this
way. Earlier CPUs didn't support SMP.

-Andi


Re: KVM warning about uncertified CPU for SMP for AMD model 2, stepping 3

2010-03-30 Thread Andi Kleen
On Wed, Mar 31, 2010 at 01:03:02AM +0200, Jiri Kosina wrote:
 Hi,
 
 booting 32bit guest on 32bit host on AMD system gives me the following 
 warning when KVM is instructed to boot as SMP:

I guess these warnings could be just disabled. With nearly everyone
using multi-core these days they are kind of obsolete anyways.

-Andi




Re: [RFC] Unify KVM kernel-space and user-space code into a single project

2010-03-24 Thread Andi Kleen
 If you're profiling a single guest it makes more sense to do this from 
 inside the guest - you can profile userspace as well as the kernel.

I'm interested in debugging the guest without guest cooperation.

In many cases qemu's new gdb stub works for that, but in some cases
I would prefer instruction/branch traces over standard gdb style
debugging.

I used to use that very successfully with simulators in the past
for some hard bugs.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [RFC] Unify KVM kernel-space and user-space code into a single project

2010-03-24 Thread Andi Kleen
Avi Kivity a...@redhat.com writes:

 On 03/24/2010 09:38 AM, Andi Kleen wrote:
 If you're profiling a single guest it makes more sense to do this from
 inside the guest - you can profile userspace as well as the kernel.
  
 I'm interested in debugging the guest without guest cooperation.

 In many cases qemu's new gdb stub works for that, but in some cases
 I would prefer instruction/branch traces over standard gdb style
 debugging.


 Isn't gdb supposed to be able to use branch traces? 

AFAIK not. The ptrace interface is only used by idb I believe.
I might be wrong on that.

Not sure if there is even a remote protocol command for 
branch traces either.

There's a concept of tracepoints in the protocol, but it
doesn't quite match that.

 It makes sense to
 expose them via the gdb stub then.  Not to say an external tool
 doesn't make sense.

Ok that would work for me too. As long as I can set start/stop
triggers and pipe the log somewhere it's fine for me.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side

2010-03-23 Thread Andi Kleen
Soeren Sandmann sandm...@daimi.au.dk writes:

 To fix that problem, it seems like we need some way to have python
 export what is going on. Maybe the same mechanism could be used to
 both access what is going on in qemu and python.

oprofile already has an interface to let JITs export
information about the JITed code. C Python is not a JIT,
but presumably one of the python JITs could do it.

http://oprofile.sourceforge.net/doc/devel/index.html

I know it's not en vogue anymore and you won't be an approved
cool kid if you do, but you could just use oprofile?

Ok presumably one would need to do a python interface for this
first. I believe it's currently only implemented for Java and
Mono. I presume it might work today with IronPython on Mono.

IMHO it doesn't make sense to invent another interface for this,
although I'm sure someone will propose just that.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side

2010-03-23 Thread Andi Kleen
Soeren Sandmann sandm...@daimi.au.dk writes:

 Examples:

 - What is going on inside QEMU? 

That's something the JIT interface could answer.

 - Which client is the X server servicing?

 - What parts of a python/shell/scheme/javascript program is
   taking the most CPU time?

I suspect for those you rather need event based tracers of some sort,
similar to kernel trace points. Otherwise you would need own
separate stacks and other complications.

systemtap has some effort to use the dtrace instrumentation
that crops up in more and more user programs for this.  It wouldn't
surprise me if that was already in python and other programs
you're interested in.

I presume right now it only works if you apply the utrace monstrosity
though, but perhaps the new uprobes patches floating around 
will come to rescue.

There also was some effort to have a pure user space
daemon based approach for LTT, but I believe that currently
needs own trace points.

Again I fully expect someone to reinvent the wheel here
and afterwards complain about community inefficiences :-)

 I don't think the oprofile JIT interface solves any of these
 problems. (In fact, I don't see why the JIT problem is even hard. The
 JIT compiler can just generate a little ELF file with symbols in it,
 and the profiler can pick it up through the mmap events that you get
 through the perf interface).

That would require keeping those temporary ELF files for
potentially unlimited time around (profilers today look at the ELF
files at the final analysis phase, which might be weeks away)

Also that would be a lot of overhead for the JIT and most likely
be a larger scale rewrite for a given JIT code base.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH v3 04/12] Add handle page fault PV helper.

2010-01-18 Thread Andi Kleen
Peter Zijlstra pet...@infradead.org writes:

 Whatever are we doing to end up in do_page_fault() as it stands? Surely
 we can tell the CPU to go elsewhere to handle faults?

 Isn't that as simple as calling set_intr_gate(14, my_page_fault)
 somewhere on the cpuinit instead of the regular page_fault handler?

That typically requires ugly ifdefs in entry*S and could be described
as code obfuscation (come from)

As long as he avoids a global reference (patch) the if should be practically
free anyways.
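The suggestion above can be illustrated with a userspace model: rather than branching inside a common page-fault handler, install a different handler in the vector table at CPU-init time, which is what `set_intr_gate(14, ...)` does against the real IDT. Everything below (types, stub names, the counter) is invented for illustration.

```c
/* Userspace sketch of swapping the page-fault gate at init time. */
#include <stddef.h>

#define PF_VECTOR 14

typedef void (*gate_handler_t)(unsigned long error_code);

static gate_handler_t idt_stub[256];   /* stand-in for the IDT */
static unsigned long faults_seen;

static void native_page_fault(unsigned long error_code)
{
    (void)error_code;
    faults_seen += 1;                  /* plain handling */
}

static void pv_page_fault(unsigned long error_code)
{
    (void)error_code;
    faults_seen += 2;                  /* paravirt path, then fallback */
}

static void set_intr_gate_stub(int vector, gate_handler_t handler)
{
    idt_stub[vector] = handler;
}

static void cpu_init_stub(int paravirt)
{
    /* pick the handler once, at init, instead of testing per fault */
    set_intr_gate_stub(PF_VECTOR,
                       paravirt ? pv_page_fault : native_page_fault);
}
```

The design point is that the selection cost is paid once at boot, so a non-paravirt kernel never executes an extra branch in the fault path.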

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [BUGFIX] MCE: Fix bug of IA32_MCG_STATUS after system reset

2010-01-05 Thread Andi Kleen
 --- a/qemu-kvm-x86.c
 +++ b/qemu-kvm-x86.c
 @@ -1015,6 +1015,7 @@ void kvm_arch_load_regs(CPUState *env)
  #endif
 set_msr_entry(msrs[n++], MSR_KVM_SYSTEM_TIME,  env->system_time_msr);
 set_msr_entry(msrs[n++], MSR_KVM_WALL_CLOCK,  env->wall_clock_msr);
 +set_msr_entry(msrs[n++], MSR_MCG_STATUS, 0);

Still need to keep EIPV and possibly RIPV, don't we?
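One reading of the objection: writing a flat 0 to MCG_STATUS also wipes the IP-validity bits, which should survive a reset of the in-progress state. The bit positions below are the architectural ones (RIPV bit 0, EIPV bit 1, MCIP bit 2); the helper itself is a made-up sketch of that masking, not the actual qemu-kvm fix.

```c
/* Sketch: clear MCIP on reset but preserve the IP-validity bits. */
#include <stdint.h>

#define MCG_STATUS_RIPV (1ULL << 0)  /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1)  /* error IP valid */
#define MCG_STATUS_MCIP (1ULL << 2)  /* machine check in progress */

static uint64_t mcg_status_after_reset(uint64_t old)
{
    /* drop MCIP (and anything else), keep RIPV/EIPV */
    return old & (MCG_STATUS_RIPV | MCG_STATUS_EIPV);
}
```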

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Andi Kleen
 i.e. it has all the makings of a stupid, avoidable, permanent fork. The thing 

Nearly. There was no equivalent of a kernel based virtual driver host
before.

 - Are a pure software concept and any compatibility mismatch is   

   self-inflicted. The patches are in fact breaking the ABI to KVM 


In practice, especially considering older kernel releases, VMs 
behave like hardware, with all its quirks, compatibility requirements,
sometimes not fully understood, etc.

 is, if AlacricityVM is better, and KVM developers are not willing to fix 
 their 
 stuff, replace KVM with it.

In the end the driver model is only a very small part of KVM though.

 
 It's a bit as if someone found a performance problem with sys_open() and came 
 up with sys_open_v2() and claimed that he wants to work with the VFS 
 developers while not really doing so but advances sys_open_v2() all the time. 

AFAIK Gregory tried for several months to work with the KVM maintainers,
but failed at their NIH filter.

 
 Do we allow sys_open_v2() upstream, in the name of openness and diversity, 
 letting some apps use that syscall while other apps still use sys_open()? Or 
 do we say enough is enough of this stupidity, come up with some strong 
 reasons to replace sys_open, and if so, replace the thing and be done with 
 the 
 pain!.

I thought the published benchmark numbers were strong reasons.
I certainly haven't seen similarly convincing numbers for vhost-net.

 The main difference is that Gregory claims that improved performance is not 
 possible within the existing KVM framework, while the KVM developers 
 disagree. 
 The good news is that this is a hard, testable fact.

Yes, clearly the onus at this point is on the vhost-net developers /
"PCI is all that is ever needed for PV" proponents to show similar numbers
with their current code.

If they can show the same performance there's really no need for
the alacrityvm model (or at least I haven't seen a convincing reason
other than performance so far to have a separate model)

I heard claims earlier from one side to the other that some benchmarks
were not fair. Apart from such accusations not being very constructive
(I don't think anyone is trying to intentionally mislead others here), 
but even if that's the case surely the other side can do similar 
benchmarks and demonstrate they are as fast.

-Andi


Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Andi Kleen
 http://www.redhat.com/f/pdf/summit/cwright_11_open_source_virt.pdf

 See slide 32.  This is without vhost-net.

Thanks. Do you also have latency numbers?

It seems like there's definitely still potential for improvement
with messages < 4K. But for the large messages they indeed
look rather good.

It's unclear what message size the Alacrity numbers used, but I presume
it was rather large.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Andi Kleen
 And its moot, anyway, as I have already retracted my one outstanding
 pull request based on Linus' observation.  So at this time, I am not
 advocating _anything_ for upstream inclusion.  And I am contemplating
 _never_ doing so again.  It's not worth _this_.

That certainly sounds like the wrong reaction. Out of tree drivers
are typically a pain to use. 

And upstream submission is not always like this!

-Andi


Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Andi Kleen
  It seems like there's definitely still potential for improvement
  with messages < 4K. But for the large messages they indeed
  look rather good.
 
 You are misreading the graph.  At <4K it is tracking bare metal (the
 green and yellow lines are bare metal, the red and blue bars are virtio).
 At >4k we start to drop off (esp. on RX).

I see. True.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Andi Kleen
Ingo Molnar mi...@elte.hu writes:

 Yes, there's (obviously) compatibility requirements and artifacts and past 
 mistakes (as with any software interface), but you need to admit it to 

Yes that's exactly what I meant. 

 yourself that your "virtualization is sloppy just like hardware" claim is 
 just 

In my experience hardware is a lot less sloppy than software.
Imagine your latest CPU had as many regressions as 2.6.32 :-)

I wish software and even VMs were as good.

 a cheap excuse to not do a proper job of interface engineering.

Past mistakes cannot be easily fixed. And undoubtedly even the new
shiny interfaces will have bugs and problems. Also the behaviour is
often not completely understood. Maybe it can be easier debugged with
fully available source, but even then it's hard to fix the old
software (or rather even if you can fix it deploy the fixes). In
that regard it's a lot like hardware.

I agree with you that this makes it important to design good
interfaces, but again realistically mistakes will be made
and they cannot be all fixed retroactively.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Andi Kleen
Ira W. Snyder i...@ovro.caltech.edu writes:

 (You'll quickly find that you must use DMA to transfer data across PCI.
 AFAIK, CPU's cannot do burst accesses to the PCI bus. I get a 10+ times

AFAIK that's what write-combining on x86 does. DMA has other
advantages of course.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 1/7] Nested VMX patch 1 implements vmon and vmoff

2009-12-20 Thread Andi Kleen
Gleb Natapov g...@redhat.com writes:
  
 +int nested = 1;
 +EXPORT_SYMBOL_GPL(nested);

Unless this is a lot better tested and audited wouldn't it make more sense
to default it to off?

I don't think it's a big burden to let users set a special knob for this,
but it would be a big problem if there was some kind of jail break 
hidden in there that could be exploited by malicious guests.

Since VMX was not originally designed to be nested that wouldn't surprise me.
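The change being asked for is small: gate the nested paths behind a knob that defaults to off, so only users who opt in expose the less-audited code to guests. `module_param()` is the usual mechanism in KVM; the handler stub and return convention below are invented for illustration.

```c
/* Sketch: nested VMX disabled unless explicitly enabled. */
static int nested = 0;   /* default off until audited */

#ifdef MODULE
module_param(nested, int, S_IRUGO);   /* kvm-intel.nested=1 to opt in */
#endif

static int handle_vmxon(void)
{
    if (!nested)
        return -1;   /* e.g. inject #UD: nesting not enabled */
    return 0;        /* proceed with nested VMXON emulation */
}
```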

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH RFC: kvm tsc virtualization 15/20] Fix longstanding races

2009-12-15 Thread Andi Kleen
Zachary Amsden zams...@redhat.com writes:

 Damn, this is complicated crap.  The analagous task in real life would
 be keeping a band of howler monkeys, each in their own tree, singing in
 unison while the lead vocalist jumps from tree to tree, and meanwhile,
 an unseen conductor keeps changing the tempo the piece is played at.
 Thankfully, there are no key changes, however, occasionally new trees
 sprout up at random and live ones fall over.

On CPUs where the TSC frequency is not constant typically you can't tell
exactly when the frequency changes. So you would always have a race window
where the frequency is unknown and wrong results occur. This can be worked
around, but it's quite complicated.

The safe way is to not use the TSC on such CPUs.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: Virtualization Performance: Intel vs. AMD

2009-11-16 Thread Andi Kleen
Thomas Fjellstrom tfjellst...@shaw.ca writes:

 Hardware context switches aren't free either. 

FWIW, SMT has no hardware context switches, the 'S' stands for
simultaneous: the operations from the different threads are travelling
simultaneously through the CPU's pipeline.

You seem to confuse it with 'CMT' (Coarse-grained Multi Threading),
which has context switches.

-Andi


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Andi Kleen
 Fine?

I cannot say -- are there paths that could drop the device beforehand?
(as in do you hold a reference to it?)

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Andi Kleen
On Wed, Nov 04, 2009 at 03:08:28PM +0200, Michael S. Tsirkin wrote:
 On Wed, Nov 04, 2009 at 01:59:57PM +0100, Andi Kleen wrote:
   Fine?
  
  I cannot say -- are there paths that could drop the device beforehand?
 
 Do you mean drop the mm reference?

No the reference to the device, which owns the mm for you.

 
  (as in do you hold a reference to it?)
 
 By design I think I always have a reference to mm before I use it.
 
 This works like this:
 ioctl SET_OWNER - calls get_task_mm, I think this gets a reference to mm
 ioctl SET_BACKEND - checks that SET_OWNER was run, starts virtqueue
 ioctl RESET_OWNER - stops virtqueues, drops the reference to mm
 file close - stops virtqueues, if we still have it then drops mm
 
 This is why I think I can call use_mm/unuse_mm while virtqueue is running,
 safely.
 Makes sense?

Do you protect against another thread doing RESET_OWNER in parallel while
RESET_OWNER runs?
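The ioctl lifecycle under discussion can be modeled in userspace C: SET_OWNER takes a reference on the mm, SET_BACKEND requires an owner before starting virtqueues, and RESET_OWNER/close drop the reference after stopping them. All the names and the refcount struct are invented; the real code uses `get_task_mm()`/`mmput()` and must additionally serialize the ioctls against each other, which is exactly the race being asked about.

```c
/* Model of the vhost owner/mm reference lifecycle (no locking shown). */
#include <stdbool.h>
#include <stddef.h>

struct mm_ref { int users; };

static struct mm_ref task_mm = { 1 };   /* the task's own reference */

struct vhost_dev {
    struct mm_ref *mm;    /* NULL until SET_OWNER */
    bool running;         /* virtqueues started?  */
};

static int dev_set_owner(struct vhost_dev *d)
{
    if (d->mm)
        return -1;            /* already owned */
    task_mm.users++;          /* get_task_mm() */
    d->mm = &task_mm;
    return 0;
}

static int dev_set_backend(struct vhost_dev *d)
{
    if (!d->mm)
        return -1;            /* SET_OWNER must come first */
    d->running = true;
    return 0;
}

static void dev_reset_owner(struct vhost_dev *d)
{
    d->running = false;       /* stop virtqueues before dropping mm */
    if (d->mm) {
        d->mm->users--;       /* mmput() */
        d->mm = NULL;
    }
}

static void dev_release(struct vhost_dev *d)
{
    dev_reset_owner(d);       /* close after RESET_OWNER is a no-op */
}
```

Without a mutex around these transitions, two concurrent RESET_OWNER calls could both see `d->mm` non-NULL and drop the reference twice.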

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


  1   2   >