Re: rdmsr_safe in Linux PV (under Xen) gets an #GP (was: Re: [Fedora-xen] Running fedora xen on top of KVM?)

2015-09-22 Thread Konrad Rzeszutek Wilk
On Sun, Sep 20, 2015 at 09:49:04PM -0700, Andy Lutomirski wrote:
> On Fri, Sep 18, 2015 at 12:04 PM, Borislav Petkov  wrote:
> > On Fri, Sep 18, 2015 at 08:20:46AM -0700, Andy Lutomirski wrote:
> >> In any event, Borislav, you must have typed rdmsr_safe for a reason :)
> >
> > Wasn't me:
> >
> > 6c62aa4a3c12 ("x86: make amd.c have 64bit support code")
> >
> > I think the error handling of rdmsrl_safe() was needed to do the pfn
> > games which are done in the if-clause.
> 
> I just tried it.  rdmsrl_safe and friends definitely work fine in that
> code.  I think that Linux's Xen startup code is buggy and fails to set
> up early exception handling.
> 
> Try this (horribly whitespace damaged):
> 
>  static void __init early_identify_cpu(struct cpuinfo_x86 *c)
>  {
> +   u64 tmp;
>  #ifdef CONFIG_X86_64
> c->x86_clflush_size = 64;
> c->x86_phys_bits = 36;
> @@ -752,6 +753,9 @@ static void __init early_identify_cpu(struct cpuinfo_x86 
> *c)
> c->cpu_index = 0;
> filter_cpuid_features(c, false);
> 
> +   pr_err("trying to crash\n");
> +   rdmsrl_safe(0x12345678, &tmp);
> +
> 
> It works fine.  I bet it crashes on a Xen guest, though.  I assume
> that Xen just works in most cases by luck.
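For readers following along: rdmsrl_safe() is supposed to catch the #GP from
an unknown MSR and turn it into an error return rather than an oops, which is
exactly what requires early exception handling to be wired up.  A minimal
sketch of that usage, reusing the bogus MSR number from the test hunk above;
this is an illustration only, not part of the patch:

    #include <linux/init.h>
    #include <linux/printk.h>
    #include <asm/msr.h>

    static void __init rdmsr_safe_probe(void)
    {
            u64 tmp;

            /* 0x12345678 is the deliberately bogus MSR from the test above. */
            if (rdmsrl_safe(0x12345678, &tmp))
                    pr_info("rdmsrl_safe trapped the #GP as expected\n");
            else
                    pr_info("MSR 0x12345678 unexpectedly readable: 0x%llx\n", tmp);
    }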

(d31) mapping kernel into physical memory
(d31) about to get started...
(XEN) traps.c:3151: GPF (): 82d0801a31ed -> 82d08023c77b
(XEN) traps.c:459:d31v0 Unhandled general protection fault fault/trap [#13] on 
VCPU 0 [ec=]
(XEN) domain_crash_sync called from entry.S: fault at 82d080238213 
create_bounce_frame+0x12b/0x13a
(XEN) Domain 31 (vcpu#0) crashed on cpu#35:
(XEN) [ Xen-4.5.0  x86_64  debug=n  Not tainted ]
(XEN) CPU:35
(XEN) RIP:e033:[]
(XEN) RFLAGS: 0246   EM: 1   CONTEXT: pv guest
(XEN) rax:    rbx: 81c03e64   rcx: 12345678
(XEN) rdx: 81c03de8   rsi: 81c03dec   rdi: 12345278
(XEN) rbp: 81c03e48   rsp: 81c03dd0   r8:  7420676e69797274
(XEN) r9:  6873617263206f74   r10:    r11: 
(XEN) r12: 12345678   r13: 81c03f00   r14: 
(XEN) r15:    cr0: 8005003b   cr4: 001526f0
(XEN) cr3: 0014e8c97000   cr2: 
(XEN) ds:    es:    fs:    gs:    ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=81c03dd0:
(XEN)12345678   81041b64
(XEN)0001e030 00010046 81c03e18 e02b
(XEN)81041b5d 81c03e48 00811809 
(XEN)01a0 0100 82009000 81c03e68
(XEN)81d211ea   81c03ed8
(XEN)81d1be59 81c03ed8 811892ab 0010
(XEN)81c03ee8 81c03ea8 697a696c61697469 81f15442
(XEN) 81db3900  
(XEN) 81c03f28 81d10f0a 
(XEN)   
(XEN)   81c03f38
(XEN)81d10603 81c03ff8 81d15f5c 
(XEN)   
(XEN)   
(XEN)   
(XEN)   
(XEN) ffd83a031f898b75 22400800 0001
(XEN)  00010102464c457f 
(XEN)0001003e0003 0940 0040 12a0
(XEN)00380040 001100120044 00050001 
[root@ovs107 ~]# 

(gdb) x/20i 0x81041b64
   0x81041b64:  rdmsr  

> 
> --Andy


rdmsr_safe in Linux PV (under Xen) gets an #GP (was: Re: [Fedora-xen] Running fedora xen on top of KVM?)

2015-09-17 Thread Konrad Rzeszutek Wilk
On Wed, Sep 16, 2015 at 06:39:03PM -0400, Cole Robinson wrote:
> On 09/16/2015 05:08 PM, Konrad Rzeszutek Wilk wrote:
> > On Wed, Sep 16, 2015 at 05:04:31PM -0400, Cole Robinson wrote:
> >> On 09/16/2015 04:07 PM, M A Young wrote:
> >>> On Wed, 16 Sep 2015, Cole Robinson wrote:
> >>>
> >>>> Unfortunately I couldn't get anything else extra out of xen using any of 
> >>>> these
> >>>> options or the ones Major recommended... in fact I couldn't get anything 
> >>>> to
> >>>> the serial console at all. console=con1 would seem to redirect messages 
> >>>> since
> >>>> they wouldn't show up on the graphical display, but nothing went to the 
> >>>> serial
> >>>> log. Maybe I'm missing something...
> >>>
> >>> That should be console=com1 so you have a typo either in this message or 
> >>> in your tests.
> >>>
> >>
> >> Yeah that was it :/ So here's the crash output using -cpu host:
> >>
> >> - Cole
> >>
> 
> 
> 
> >> about to get started...
> >> (XEN) traps.c:459:d0v0 Unhandled general protection fault fault/trap [#13] 
> >> on
> >> VCPU 0 [ec=]
> >> (XEN) domain_crash_sync called from entry.S: fault at 82d08023a5d3
> >> create_bounce_frame+0x12b/0x13a
> >> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> >> (XEN) [ Xen-4.5.1  x86_64  debug=n  Not tainted ]
> >> (XEN) CPU:0
> >> (XEN) RIP:e033:[]
> > 
> > That is the Linux kernel EIP. Can you figure out what is at 
> > 810032b0 ?
> > 
> > gdb vmlinux and then
> > x/20i 0x810032b0
> > 
> > can help with that.
> > 
> 
> Updated to the latest kernel 4.1.6-201.fc22.x86_64. Trace is now:
> 
> about to get started...
> (XEN) traps.c:459:d0v0 Unhandled general protection fault fault/trap [#13] on
> VCPU 0 [ec=]
> (XEN) domain_crash_sync called from entry.S: fault at 82d08023a5d3
> create_bounce_frame+0x12b/0x13a
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> (XEN) [ Xen-4.5.1  x86_64  debug=n  Not tainted ]
> (XEN) CPU:0
> (XEN) RIP:e033:[]
> (XEN) RFLAGS: 0282   EM: 1   CONTEXT: pv guest
> (XEN) rax: 0015   rbx: 81c03e1c   rcx: c0010112
> (XEN) rdx: 0001   rsi: 81c03e1c   rdi: c0010112
> (XEN) rbp: 81c03df8   rsp: 81c03da0   r8:  81c03e28
> (XEN) r9:  81c03e2c   r10:    r11: 
> (XEN) r12: 81d25a60   r13: 0400   r14: 
> (XEN) r15:    cr0: 80050033   cr4: 000406f0
> (XEN) cr3: 75c0b000   cr2: 
> (XEN) ds:    es:    fs:    gs:    ss: e02b   cs: e033
> (XEN) Guest stack trace from rsp=81c03da0:
> (XEN)c0010112   810031f0
> (XEN)0001e030 00010082 81c03de0 e02b
> (XEN) 000c 81c03e1c 81c03e48
> (XEN)8102a7a4 81c03e48 8102aa3b 81c03e48
> (XEN)cf1fa5f5e026f464 0100 81c03ef8 0400
> (XEN) 81c03e58 81d5d142 81c03ee8
> (XEN)81d58b56   81c03e88
> (XEN)810f8a39 81c03ee8 81798b13 0010
> (XEN)81c03ef8 81c03eb8 cf1fa5f5e026f464 81f1de9c
> (XEN)  81df7920 
> (XEN) 81c03f28 81d51c74 cf1fa5f5e026f464
> (XEN) 81c03f60 81c03f5c 
> (XEN) 81c03f38 81d51339 81c03ff8
> (XEN)81d548b1  00600f12 00010800
> (XEN)03010032 0005  
> (XEN)   
> (XEN)   
> (XEN)   
> (XEN)   
> (XEN)0f0060c0c748 c305  
> (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
> 
> 
> gdb output:
> 
> (gdb) x/20i 0x810031f0
>0x810031f0 <xen_read_msr

Re: [PATCH 0/9] qspinlock stuff -v15

2015-03-27 Thread Konrad Rzeszutek Wilk
On Thu, Mar 26, 2015 at 09:21:53PM +0100, Peter Zijlstra wrote:
 On Wed, Mar 25, 2015 at 03:47:39PM -0400, Konrad Rzeszutek Wilk wrote:
  Ah nice. That could be spun out as a separate patch to optimize the existing
  ticket locks I presume.
 
 Yes I suppose we can do something similar for the ticket and patch in
 the right increment. We'd need to restructure the code a bit, but
 its not fundamentally impossible.
 
 We could equally apply the head hashing to the current ticket
 implementation and avoid the current bitmap iteration.
 
  Now with the old pv ticketlock code a vCPU would only go to sleep once and
  be woken up when it was its turn. With this new code it is woken up twice
  (and twice it goes to sleep). With an overcommit scenario this would imply
  that we will have at least twice as many VMEXITs as with the previous code.
 
 An astute observation, I had not considered that.

Thank you.
 
  I presume when you did benchmarking this did not even register? Though
  I wonder if it would if you ran the benchmark for a week or so.
 
 You presume I benchmarked :-) I managed to boot something virt and run
 hackbench in it. I wouldn't know a representative virt setup if I ran
 into it.
 
 The thing is, we want this qspinlock for real hardware because its
 faster and I really want to avoid having to carry two spinlock
 implementations -- although I suppose that if we really really have to
 we could.

In some way you already have that - for virtualized environments where you
don't have a PV mechanism you just use the byte spinlock - which is good.

And switching to a PV ticketlock implementation after boot... ugh. I feel your
pain.

What if you used a PV bytelock implementation? The code you posted already
'sprays' all the vCPUs to wake up. And that is exactly what you need for PV
bytelocks - well, you only need to wake up the vCPUs that have gone to sleep
waiting on a specific 'struct spinlock' and just stash those in a per-CPU
area. The old Xen spinlock code (before 3.11?) had this.

Just an idea, though.
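A rough sketch of that per-CPU stash idea, loosely modeled on the pre-3.11 Xen
PV spinlock scheme; pv_halt() and pv_kick() stand in for hypervisor-specific
halt/wake hypercall wrappers and are not names from this series:

    #include <linux/percpu.h>
    #include <linux/cpumask.h>

    struct qspinlock;                   /* opaque here; defined by the series */
    extern void pv_halt(void);          /* hypothetical halt hypercall wrapper */
    extern void pv_kick(int cpu);       /* hypothetical wake hypercall wrapper */

    static DEFINE_PER_CPU(struct qspinlock *, lock_waiting);

    static void pv_wait_for(struct qspinlock *lock)
    {
            /* Remember which lock we sleep on, then halt until kicked. */
            this_cpu_write(lock_waiting, lock);
            pv_halt();
            this_cpu_write(lock_waiting, NULL);
    }

    static void pv_kick_waiters(struct qspinlock *lock)
    {
            int cpu;

            /* Wake only the vCPUs that went to sleep on this specific lock. */
            for_each_online_cpu(cpu)
                    if (per_cpu(lock_waiting, cpu) == lock)
                            pv_kick(cpu);
    }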


Re: [PATCH 0/9] qspinlock stuff -v15

2015-03-25 Thread Konrad Rzeszutek Wilk
On Mon, Mar 16, 2015 at 02:16:13PM +0100, Peter Zijlstra wrote:
 Hi Waiman,
 
 As promised; here is the paravirt stuff I did during the trip to BOS last 
 week.
 
 All the !paravirt patches are more or less the same as before (the only real
 change is the copyright lines in the first patch).
 
 The paravirt stuff is 'simple' and KVM only -- the Xen code was a little more
 convoluted and I've no real way to test that, but it should be straightforward to
 make work.
 
 I ran this using the virtme tool (thanks Andy) on my laptop with a 4x
 overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually has) and
 it both booted and survived a hackbench run (perf bench sched messaging -g 20
 -l 5000).
 
 So while the paravirt code isn't the most optimal code ever conceived it does 
 work.
 
 Also, the paravirt patching includes replacing the call with movb $0, %arg1
 for the native case, which should greatly reduce the cost of having
 CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware.

Ah nice. That could be spun out as a separate patch to optimize the existing
ticket locks I presume.

Now with the old pv ticketlock code a vCPU would only go to sleep once and
be woken up when it was its turn. With this new code it is woken up twice
(and twice it goes to sleep). With an overcommit scenario this would imply
that we will have at least twice as many VMEXITs as with the previous code.

I presume when you did benchmarking this did not even register? Though
I wonder if it would if you ran the benchmark for a week or so.

 
 I feel that if someone were to do a Xen patch we can go ahead and merge this
 stuff (finally!).
 
 These patches do not implement the paravirt spinlock debug stats currently
 implemented (separately) by KVM and Xen, but that should not be too hard to do
 on top and in the 'generic' code -- no reason to duplicate all that.
 
 Of course; once this lands people can look at improving the paravirt nonsense.
 


Re: [Xen-devel] [PATCH v3 0/2] x86/arm64: add xenconfig

2015-02-25 Thread Konrad Rzeszutek Wilk
On Wed, Feb 25, 2015 at 01:11:04PM -0800, David Rientjes wrote:
 On Wed, 25 Feb 2015, Luis R. Rodriguez wrote:
 
  I am reworking Xen's kconfig stuff right now, so perhaps what is best
  is for this series to be folded under those changes and I'd submit
  them as the last series in the changes. That would avoid collateral
  changes as I revamp tons of Xen kconfig things. This would then go
  under David Vrabel's tree, but since it involves x86 stuff its unclear
  if its OK for that -- I think so? Let me know.
  
 
 Ok, sounds good, and I agree it would be better to hold off on doing this 
 if there are going to be substantial changes to the config options later.  
 I think once the x86 bits get an ack from one of the x86 guys that you 
 should be good to go!

Ingo (one of the x86 guys) mentioned in one of his emails that he glosses over
emails if they say 'xen'.

Perhaps you need to change the title to be more catchy?


 


Re: [Xen-devel] [PATCH v3 0/2] x86/arm64: add xenconfig

2015-02-25 Thread Konrad Rzeszutek Wilk
On Wed, Feb 25, 2015 at 01:25:59PM -0800, Luis R. Rodriguez wrote:
 On Wed, Feb 25, 2015 at 1:19 PM, Konrad Rzeszutek Wilk
 konrad.w...@oracle.com wrote:
  On Wed, Feb 25, 2015 at 01:11:04PM -0800, David Rientjes wrote:
  On Wed, 25 Feb 2015, Luis R. Rodriguez wrote:
 
   I am reworking Xen's kconfig stuff right now, so perhaps what is best
   is for this series to be folded under those changes and I'd submit
   them as the last series in the changes. That would avoid collateral
   changes as I revamp tons of Xen kconfig things. This would then go
   under David Vrabel's tree, but since it involves x86 stuff its unclear
   if its OK for that -- I think so? Let me know.
  
 
  Ok, sounds good, and I agree it would be better to hold off on doing this
  if there are going to be substantial changes to the config options later.
  I think once the x86 bits get an ack from one of the x86 guys that you
  should be good to go!
 
  Ingo (one of the x86 guys) mentioned in one of his emails that he glosses over
  emails if they say 'xen'.
 
 Then this should be able to go through David Vrabel, no? As he does care.

:-)
 
  Perhaps you need to change the title to be more catchy?
 
 If folks don't want to deal with Xen patches it should be fine, but we
 do need a route upstream, this seems to make sense to go through David
 Vrabel in the future no?

Yes, any of the Xen maintainers (me, David, or Boris).

 
  Luis


Re: [Xen-devel] [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-06 Thread Konrad Rzeszutek Wilk
On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
 On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
  The pvclock vdso code was too abstracted to understand easily and
  excessively paranoid.  Simplify it for a huge speedup.
 
  This opens the door for additional simplifications, as the vdso no
  longer accesses the pvti for any vcpu other than vcpu 0.
 
  Before, vclock_gettime using kvm-clock took about 64ns on my machine.
  With this change, it takes 19ns, which is almost as fast as the pure TSC
  implementation.
 
  Signed-off-by: Andy Lutomirski l...@amacapital.net
  ---
   arch/x86/vdso/vclock_gettime.c | 82 
  --
   1 file changed, 47 insertions(+), 35 deletions(-)
 
  diff --git a/arch/x86/vdso/vclock_gettime.c 
  b/arch/x86/vdso/vclock_gettime.c
  index 9793322751e0..f2e0396d5629 100644
  --- a/arch/x86/vdso/vclock_gettime.c
  +++ b/arch/x86/vdso/vclock_gettime.c
  @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info 
  *get_pvti(int cpu)
 
   static notrace cycle_t vread_pvclock(int *mode)
   {
  - const struct pvclock_vsyscall_time_info *pvti;
  + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
cycle_t ret;
  - u64 last;
  - u32 version;
  - u8 flags;
  - unsigned cpu, cpu1;
  -
  + u64 tsc, pvti_tsc;
  + u64 last, delta, pvti_system_time;
  + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
 
/*
  -  * Note: hypervisor must guarantee that:
  -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
  -  * 2. that per-CPU pvclock time info is updated if the
  -  *underlying CPU changes.
  -  * 3. that version is increased whenever underlying CPU
  -  *changes.
  +  * Note: The kernel and hypervisor must guarantee that cpu ID
  +  * number maps 1:1 to per-CPU pvclock time info.
  +  *
  +  * Because the hypervisor is entirely unaware of guest userspace
  +  * preemption, it cannot guarantee that per-CPU pvclock time
  +  * info is updated if the underlying CPU changes or that that
  +  * version is increased whenever underlying CPU changes.
  +  *
  +  * On KVM, we are guaranteed that pvti updates for any vCPU are
  +  * atomic as seen by *all* vCPUs.  This is an even stronger
  +  * guarantee than we get with a normal seqlock.
 *
  +  * On Xen, we don't appear to have that guarantee, but Xen still
  +  * supplies a valid seqlock using the version field.
  +
  +  * We only do pvclock vdso timing at all if
  +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
  +  * mean that all vCPUs have matching pvti and that the TSC is
  +  * synced, so we can just look at vCPU 0's pvti.
 */
 
  Can Xen guarantee that ?
 
 I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
 at all.  I have no idea going forward, though.
 
 Xen people?

The person who would know off the top of his head is Dan Magenheimer, who
is now enjoying retirement :-(

I will have to dig in the code to answer this - that will take a bit of time
sadly (I am sick this week).
 
 
  - do {
  - cpu = __getcpu() & VGETCPU_CPU_MASK;
  - /* TODO: We can put vcpu id into higher bits of pvti.version.
  -  * This will save a couple of cycles by getting rid of
  -  * __getcpu() calls (Gleb).
  -  */
  -
  - pvti = get_pvti(cpu);
  -
  - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
  -
  - /*
  -  * Test we're still on the cpu as well as the version.
  -  * We could have been migrated just after the first
  -  * vgetcpu but before fetching the version, so we
  -  * wouldn't notice a version change.
  -  */
  - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
  - } while (unlikely(cpu != cpu1 ||
  -   (pvti->pvti.version & 1) ||
  -   pvti->pvti.version != version));
  -
  - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
  +
  + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
*mode = VCLOCK_NONE;
  + return 0;
  + }
 
  This check must be performed after reading a stable pvti.
 
 
 We can even read it in the middle, guarded by the version checks.
 I'll do that for v2.
 
  +
  + do {
  + version = pvti->version;
  +
  + /* This is also a read barrier, so we'll read version first. 
  */
  + rdtsc_barrier();
  + tsc = __native_read_tsc();
  +
  + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
  + pvti_tsc_shift = pvti->tsc_shift;
  + pvti_system_time = pvti->system_time;
  + pvti_tsc = pvti->tsc_timestamp;
  +
  +   
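The quoted hunk is cut off here.  For reference, once the loop above has a
consistent snapshot of those fields, the remaining arithmetic is the standard
pvclock conversion, roughly as below; field names follow struct
pvclock_vcpu_time_info, and this is a sketch rather than the rest of the patch:

    #include <linux/math64.h>

    static u64 pvclock_snapshot_to_ns(u64 tsc, u64 pvti_tsc, u32 mul, s8 shift,
                                      u64 system_time)
    {
            u64 delta = tsc - pvti_tsc;

            /* Pre-scale by the power-of-two shift the hypervisor advertises... */
            if (shift < 0)
                    delta >>= -shift;
            else
                    delta <<= shift;

            /* ...then apply the 32.32 fixed-point multiplier and add the base. */
            return system_time + mul_u64_u32_shr(delta, mul, 32);
    }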

Re: [PATCH] x86, kvm: Clear paravirt_enabled on KVM guests for espfix32's benefit

2014-12-08 Thread Konrad Rzeszutek Wilk
On Fri, Dec 05, 2014 at 07:03:28PM -0800, Andy Lutomirski wrote:
 paravirt_enabled has the following effects:
 
  - Disables the F00F bug workaround warning.  There is no F00F bug
workaround any more because Linux's standard IDT handling already
works around the F00F bug, but the warning still exists.  This
is only cosmetic, and, in any event, there is no such thing as
KVM on a CPU with the F00F bug.
 
  - Disables 32-bit APM BIOS detection.  On a KVM paravirt system,
there should be no APM BIOS anyway.
 
  - Disables tboot.  I think that the tboot code should check the
CPUID hypervisor bit directly if it matters.
 
  - paravirt_enabled disables espfix32.  espfix32 should *not* be
disabled under KVM paravirt.
 
 The last point is the purpose of this patch.  It fixes a leak of the
 high 16 bits of the kernel stack address on 32-bit KVM paravirt
 guests.
 
 While I'm at it, this removes pv_info setup from kvmclock.  That
 code seems to serve no purpose.
 
 Cc: sta...@vger.kernel.org
 Signed-off-by: Andy Lutomirski l...@amacapital.net

Suggested-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 ---
  arch/x86/kernel/kvm.c  | 9 -
  arch/x86/kernel/kvmclock.c | 2 --
  2 files changed, 8 insertions(+), 3 deletions(-)
 
 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
 index f6945bef2cd1..94f643484300 100644
 --- a/arch/x86/kernel/kvm.c
 +++ b/arch/x86/kernel/kvm.c
 @@ -283,7 +283,14 @@ NOKPROBE_SYMBOL(do_async_page_fault);
  static void __init paravirt_ops_setup(void)
  {
   pv_info.name = "KVM";
 - pv_info.paravirt_enabled = 1;
 +
 + /*
 +  * KVM isn't paravirt in the sense of paravirt_enabled.  A KVM
 +  * guest kernel works like a bare metal kernel with additional
 +  * features, and paravirt_enabled is about features that are
 +  * missing.
 +  */
 + pv_info.paravirt_enabled = 0;
  
   if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
   pv_cpu_ops.io_delay = kvm_io_delay;
 diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
 index d9156ceecdff..d4d9a8ad7893 100644
 --- a/arch/x86/kernel/kvmclock.c
 +++ b/arch/x86/kernel/kvmclock.c
 @@ -263,8 +263,6 @@ void __init kvmclock_init(void)
  #endif
   kvm_get_preset_lpj();
   clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
 - pv_info.paravirt_enabled = 1;
 - pv_info.name = "KVM";
  
   if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
   pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 -- 
 1.9.3
 


Re: Stupid Xen vs KVM question

2014-12-05 Thread Konrad Rzeszutek Wilk
On Fri, Dec 05, 2014 at 08:29:54AM +0100, Paolo Bonzini wrote:
 
 
 On 05/12/2014 03:24, Konrad Rzeszutek Wilk wrote:
  We could do a simple thing - which is that the paravirt_enabled
  could have the value 1 for Xen and 2 for KVM. The assembler logic
  would be inverted and just check for 1. I am not going to attempt
  to write the assembler code :-)
 
 Wouldn't Xen HVM also want to be 2?

Oddly enough it was never set!

Looking at where the paravirt_enabled() macro is used, on KVM it could
be just set to zero.
 
 Paolo


Re: Stupid Xen vs KVM question

2014-12-04 Thread Konrad Rzeszutek Wilk
On Thu, Dec 04, 2014 at 02:59:48PM -0800, Andy Lutomirski wrote:
 This code in arch/x86/kernel/entry_32.S is wrong:
 
 #ifdef CONFIG_PARAVIRT
 /*
  * The kernel can't run on a non-flat stack if paravirt mode
  * is active.  Rather than try to fixup the high bits of
  * ESP, bypass this code entirely.  This may break DOSemu
  * and/or Wine support in a paravirt VM, although the option
  * is still available to implement the setting of the high
  * 16-bits in the INTERRUPT_RETURN paravirt-op.
  */
 cmpl $0, pv_info+PARAVIRT_enabled
 jne restore_nocheck
 #endif
 
 On KVM guests, it notices that paravirt is enabled and bails.  It
 should work fine on KVM -- the condition it should be checking is
 whether we have native segmentation.
 
 Do you know the right way to ask that?

We could do a simple thing - which is that the paravirt_enabled
could have the value 1 for Xen and 2 for KVM. The assembler logic
would be inverted and just check for 1. I am not going to attempt
to write the assembler code :-)
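In C terms the idea is roughly the check below; the enum values are purely
illustrative, nothing in the tree defines paravirt_enabled this way:

    /* Hypothetical encoding: 0 = bare metal, 1 = Xen PV, 2 = KVM or other
     * paravirt guest that still has native segmentation. */
    enum { PV_NONE = 0, PV_XEN_PV = 1, PV_OTHER = 2 };

    static inline bool need_espfix_bypass(void)
    {
            /* Only Xen PV lacks native segmentation, so only value 1 should
             * skip the 16-bit-SS fixup in entry_32.S. */
            return pv_info.paravirt_enabled == PV_XEN_PV;
    }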

 
 Thanks,
 Andy
 
 -- 
 Andy Lutomirski
 AMA Capital Management, LLC


Re: [PATCH v13 10/11] pvqspinlock, x86: Enable PV qspinlock for KVM

2014-12-02 Thread Konrad Rzeszutek Wilk
On Wed, Oct 29, 2014 at 04:19:10PM -0400, Waiman Long wrote:
 This patch adds the necessary KVM specific code to allow KVM to
 support the CPU halting and kicking operations needed by the queue
 spinlock PV code.
 
 Two KVM guests of 20 CPU cores (2 nodes) were created for performance
 testing in one of the following three configurations:
  1) Only 1 VM is active
  2) Both VMs are active and they share the same 20 physical CPUs
 (200% overcommit)
 
 The tests run included the disk workload of the AIM7 benchmark on
 both ext4 and xfs RAM disks at 3000 users on a 3.17 based kernel. The
 ebizzy -m test and futextest was also run and its performance
 data were recorded.  With two VMs running, the idle=poll kernel
 option was added to simulate a busy guest. If PV qspinlock is not
 enabled, the unfairlock will be used automatically in a guest.

What is the unfairlock? Isn't it just using a bytelock at this point?
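For context, a 'bytelock' here is just a simple unfair test-and-set lock on a
single byte, roughly as sketched below; this is illustrative only, not code
from the patch set:

    #include <linux/types.h>

    struct byte_lock {
            u8 locked;                  /* 0 = free, 1 = held */
    };

    static inline void byte_lock(struct byte_lock *l)
    {
            /* Unfair: whichever CPU wins the xchg takes the lock, no queueing. */
            while (xchg(&l->locked, 1))
                    cpu_relax();
    }

    static inline void byte_unlock(struct byte_lock *l)
    {
            smp_store_release(&l->locked, 0);
    }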


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-12-01 Thread Konrad Rzeszutek Wilk
On Tue, Nov 25, 2014 at 07:33:58PM -0500, Waiman Long wrote:
 On 10/27/2014 02:02 PM, Konrad Rzeszutek Wilk wrote:
 On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote:
 
 My concern is that spin_unlock() can be called in many places, including
 loadable kernel modules. Can the paravirt_patch_ident_32() function patch
 all of them in reasonable time? How about a kernel module loaded later
 at run time?
It has to. When the modules are loaded the .paravirt symbols are exposed
 and the module loader patches that.
 
 And during bootup time (before modules are loaded) it also patches everything
 - when it only runs on one CPU.
 
 
 I have been changing the patching code to patch the unlock call sites and it
 seems to be working now. However, when I manually inserted a kernel module
 using insmod and ran the code in the newly inserted module, I got a memory
 access violation as follows:
 
 BUG: unable to handle kernel NULL pointer dereference at   (null)
 IP: [  (null)]   (null)
 PGD 18d62f3067 PUD 18d476f067 PMD 0
 Oops: 0010 [#1] SMP
 Modules linked in: locktest(OE) ebtable_nat ebtables xt_CHECKSUM
 iptable_mangle bridge autofs4 8021q garp stp llc ipt_REJECT
 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT
 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter
 ip6_tables ipv6 vhost_net macvtap macvlan vhost tun uinput ppdev parport_pc
 parport sg microcode pcspkr virtio_balloon snd_hda_codec_generic
 virtio_console snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep
 snd_seq snd_seq_device snd_pcm snd_timer snd soundcore virtio_net i2c_piix4
 i2c_core ext4(E) jbd2(E) mbcache(E) floppy(E) virtio_blk(E) sr_mod(E)
 cdrom(E) virtio_pci(E) virtio_ring(E) virtio(E) pata_acpi(E) ata_generic(E)
 ata_piix(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last
 unloaded: speedstep_lib]
 CPU: 1 PID: 3907 Comm: run-locktest Tainted: GW  OE  3.17.0-pvqlock
 #3
 Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
 task: 8818cc5baf90 ti: 8818b7094000 task.ti: 8818b7094000
 RIP: 0010:[]  [  (null)]   (null)
 RSP: 0018:8818b7097db0  EFLAGS: 00010246
 RAX:  RBX: 004c4b40 RCX: 
 RDX: 0001 RSI:  RDI: 8818d3f052c0
 RBP: 8818b7097dd8 R08: 80522014 R09: 
 R10: 1000 R11: 0001 R12: 0001
 R13:  R14: 0001 R15: 8818b7097ea0
 FS:  7fb828ece700() GS:88193ec2() knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2:  CR3: 0018cc7e9000 CR4: 06e0
 Stack:
  a06ff395 8818d465e000 8164bec0 0001
  0050 8818b7097e18 a06ff785 8818b7097e38
  0246 54755e3a 39f8ba72 8818c174f000
 Call Trace:
  [a06ff395] ? test_spinlock+0x65/0x90 [locktest]
  [a06ff785] etime_show+0xd5/0x120 [locktest]
  [812a2dc6] kobj_attr_show+0x16/0x20
  [8121a7fa] sysfs_kf_seq_show+0xca/0x1b0
  [81218a13] kernfs_seq_show+0x23/0x30
  [811c82db] seq_read+0xbb/0x400
  [812197e5] kernfs_fop_read+0x35/0x40
  [811a4223] vfs_read+0xa3/0x110
  [811a47e6] SyS_read+0x56/0xd0
  [810f3e16] ? __audit_syscall_exit+0x216/0x2c0
  [815b3ca9] system_call_fastpath+0x16/0x1b
 Code:  Bad RIP value.
  RSP 8818b7097db0
 CR2: 
 ---[ end trace 69d0e259c9ec632f ]---
 
 It seems like call site patching isn't properly done or the kernel module
 that I built was missing some critical information necessary for the proper

Did the readelf give you the paravirt note section?
 linking. Anyway, I will include the unlock call patching code as a separate
 patch as it seems there may be a problem under certain circumstances.

one way to troubleshoot those is to enable the paravirt patching code to
actually print where it is patching the code. That way when you load the
module you can confirm it has done its job.

Then you can verify that the address where the code is called:

a06ff395

is indeed patched. You might as well also do a hexdump in the module loading
to confirm that the patching had been done correctly.
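A minimal sketch of that kind of check; print_hex_dump() is the stock kernel
helper, while dump_patched_site() and the 16-byte length are only illustrative:

    #include <linux/printk.h>

    static void dump_patched_site(const char *what, const void *addr, size_t len)
    {
            pr_info("%s at %p:\n", what, addr);
            print_hex_dump(KERN_INFO, "  ", DUMP_PREFIX_ADDRESS, 16, 1,
                           addr, len, true);
    }

    /* e.g. call dump_patched_site("unlock call site", site, 16) before and
     * after apply_paravirt() has run over the module's .parainstructions
     * section, and compare the bytes. */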
 
 BTW, the kernel panic problem that your team reported had been fixed. The
 fix will be in the next version of the patch.
 
 -Longman


Re: [PATCH v13 09/11] pvqspinlock, x86: Add para-virtualization support

2014-12-01 Thread Konrad Rzeszutek Wilk
On Wed, Oct 29, 2014 at 04:19:09PM -0400, Waiman Long wrote:
 This patch adds para-virtualization support to the queue spinlock
 code base with minimal impact to the native case. There are some
 minor code changes in the generic qspinlock.c file which should be
 usable in other architectures. The other code changes are specific
 to x86 processors and so are all put under the arch/x86 directory.
 
 On the lock side, the slowpath code is split into 2 separate functions
 generated from the same code - one for bare metal and one for PV guest.
 The switching is done in the _raw_spin_lock* functions. This makes
 sure that the performance impact to the bare metal case is minimal,
 just a few NOPs in the _raw_spin_lock* functions. In the PV slowpath
 code, there are 2 paravirt callee saved calls that minimize register
 pressure.
 
 On the unlock side, however, the disabling of unlock function inlining
 does have some slight impact on bare metal performance.
 
 The actual paravirt code comes in 5 parts;
 
  - init_node; this initializes the extra data members required for PV
state. PV state data is kept 1 cacheline ahead of the regular data.
 
  - link_and_wait_node; this replaces the regular MCS queuing code. CPU
halting can happen if the wait is too long.
 
  - wait_head; this waits until the lock is available and the CPU will
be halted if the wait is too long.
 
  - wait_check; this is called after acquiring the lock to see if the
next queue head CPU is halted. If this is the case, the lock bit is
changed to indicate the queue head will have to be kicked on unlock.
 
  - queue_unlock;  this routine has a jump label to check if paravirt
is enabled. If yes, it has to do an atomic cmpxchg to clear the lock
bit or call the slowpath function to kick the queue head cpu.
 
 Tracking the head is done in two parts, firstly the pv_wait_head will
 store its cpu number in whichever node is pointed to by the tail part
 of the lock word. Secondly, pv_link_and_wait_node() will propagate the
 existing head from the old to the new tail node.
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 ---
  arch/x86/include/asm/paravirt.h   |   19 ++
  arch/x86/include/asm/paravirt_types.h |   20 ++
  arch/x86/include/asm/pvqspinlock.h|  411 
 +
  arch/x86/include/asm/qspinlock.h  |   71 ++-
  arch/x86/kernel/paravirt-spinlocks.c  |6 +
  include/asm-generic/qspinlock.h   |2 +
  kernel/locking/qspinlock.c|   69 +-
  7 files changed, 591 insertions(+), 7 deletions(-)
  create mode 100644 arch/x86/include/asm/pvqspinlock.h
 
 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
 index cd6e161..7e296e6 100644
 --- a/arch/x86/include/asm/paravirt.h
 +++ b/arch/x86/include/asm/paravirt.h
 @@ -712,6 +712,24 @@ static inline void __set_fixmap(unsigned /* enum 
 fixed_addresses */ idx,
  
  #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
  
 +#ifdef CONFIG_QUEUE_SPINLOCK
 +
 +static __always_inline void pv_kick_cpu(int cpu)
 +{
 + PVOP_VCALLEE1(pv_lock_ops.kick_cpu, cpu);
 +}
 +
 +static __always_inline void pv_lockwait(u8 *lockbyte)
 +{
 + PVOP_VCALLEE1(pv_lock_ops.lockwait, lockbyte);
 +}
 +
 +static __always_inline void pv_lockstat(enum pv_lock_stats type)
 +{
 + PVOP_VCALLEE1(pv_lock_ops.lockstat, type);
 +}
 +
 +#else
  static __always_inline void __ticket_lock_spinning(struct arch_spinlock 
 *lock,
   __ticket_t ticket)
  {
 @@ -723,6 +741,7 @@ static __always_inline void __ticket_unlock_kick(struct 
 arch_spinlock *lock,
  {
   PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
  }
 +#endif
  
  #endif
  
 diff --git a/arch/x86/include/asm/paravirt_types.h 
 b/arch/x86/include/asm/paravirt_types.h
 index 7549b8b..49e4b76 100644
 --- a/arch/x86/include/asm/paravirt_types.h
 +++ b/arch/x86/include/asm/paravirt_types.h
 @@ -326,6 +326,9 @@ struct pv_mmu_ops {
  phys_addr_t phys, pgprot_t flags);
  };
  
 +struct mcs_spinlock;
 +struct qspinlock;
 +
  struct arch_spinlock;
  #ifdef CONFIG_SMP
  #include <asm/spinlock_types.h>
 @@ -333,9 +336,26 @@ struct arch_spinlock;
  typedef u16 __ticket_t;
  #endif
  
 +#ifdef CONFIG_QUEUE_SPINLOCK
 +enum pv_lock_stats {
 + PV_HALT_QHEAD,  /* Queue head halting   */
 + PV_HALT_QNODE,  /* Other queue node halting */
 + PV_HALT_ABORT,  /* Halting aborted  */
 + PV_WAKE_KICKED, /* Wakeup by kicking*/
 + PV_WAKE_SPURIOUS,   /* Spurious wakeup  */
 + PV_KICK_NOHALT  /* Kick but CPU not halted  */
 +};
 +#endif
 +
  struct pv_lock_ops {
 +#ifdef CONFIG_QUEUE_SPINLOCK
 + struct paravirt_callee_save kick_cpu;
 + struct paravirt_callee_save lockstat;
 + struct paravirt_callee_save lockwait;
 +#else
   struct paravirt_callee_save lock_spinning;
   void (*unlock_kick)(struct 

Re: [PATCH] x86, microcode: Don't initialize microcode code on paravirt

2014-12-01 Thread Konrad Rzeszutek Wilk
On Mon, Dec 01, 2014 at 04:27:44PM -0500, Boris Ostrovsky wrote:
 Paravirtual guests are not expected to load microcode into processors
 and therefore it is not necessary to initialize microcode loading
 logic.

CC-ing the KVM folks since they use the paravirt interface too.
 
 In fact, under certain circumstances initializing this logic may cause
 the guest to crash. Specifically, 32-bit kernels use the __pa_nodebug()
 macro, which does not work in Xen (the code path that leads to this macro
 happens during resume when we call mc_bp_resume()->load_ucode_ap()
 ->check_loader_disabled_ap()).
 
 Signed-off-by: Boris Ostrovsky boris.ostrov...@oracle.com
 ---
  arch/x86/kernel/cpu/microcode/core.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/arch/x86/kernel/cpu/microcode/core.c 
 b/arch/x86/kernel/cpu/microcode/core.c
 index 2ce9051..ebd232d 100644
 --- a/arch/x86/kernel/cpu/microcode/core.c
 +++ b/arch/x86/kernel/cpu/microcode/core.c
 @@ -557,7 +557,7 @@ static int __init microcode_init(void)
   struct cpuinfo_x86 *c = cpu_data(0);
   int error;
  
 - if (dis_ucode_ldr)
 + if (paravirt_enabled() || dis_ucode_ldr)
   return 0;
  
   if (c->x86_vendor == X86_VENDOR_INTEL)
 -- 
 1.7.1
 


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Konrad Rzeszutek Wilk
On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote:
 On 10/24/2014 04:54 AM, Peter Zijlstra wrote:
 On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:
 
 Since enabling paravirt spinlock will disable unlock function inlining,
 a jump label can be added to the unlock function without adding patch
 sites all over the kernel.
 But you don't have to. My patches allowed for the inline to remain,
 again reducing the overhead of enabling PV spinlocks while running on a
 real machine.
 
 Look at:
 
http://lkml.kernel.org/r/20140615130154.213923...@chello.nl
 
 In particular this hunk:
 
 Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
 ===
 --- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
 +++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
 @@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, swapgs)
   DEF_NATIVE(, mov32, "mov %edi, %eax");
   DEF_NATIVE(, mov64, "mov %rdi, %rax");
 
 +#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
 +DEF_NATIVE(pv_lock_ops, queue_unlock, "movb $0, (%rdi)");
 +#endif
 +
   unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
   {
  return paravirt_patch_insns(insnbuf, len,
 @@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
  PATCH_SITE(pv_cpu_ops, clts);
  PATCH_SITE(pv_mmu_ops, flush_tlb_single);
  PATCH_SITE(pv_cpu_ops, wbinvd);
 +#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
 +   PATCH_SITE(pv_lock_ops, queue_unlock);
 +#endif
 
  patch_site:
  ret = paravirt_patch_insns(ibuf, len, start, end);
 
 
 That makes sure to overwrite the callee-saved call to the
 pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi).
 
 
 Therefore you can retain the inlined unlock with hardly (there might be
 some NOP padding) any overhead at all. On PV it reverts to a callee
 saved function call.
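Put differently, the patched-in instruction is just the native unlock store;
in C it amounts to something like the sketch below, assuming the locked byte
is the low byte of the lock word as on little-endian x86:

    static __always_inline void native_queue_unlock(struct qspinlock *lock)
    {
            /* Clear the locked byte with release semantics - the same store
             * the immediate "movb $0, (%rdi)" performs. */
            smp_store_release((u8 *)lock, 0);
    }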
 
 My concern is that spin_unlock() can be called in many places, including
 loadable kernel modules. Can the paravirt_patch_ident_32() function patch
 all of them in reasonable time? How about a kernel module loaded later
 at run time?

It has to. When the modules are loaded the .paravirt symbols are exposed
and the module loader patches that.

And during bootup time (before modules are loaded) it also patches everything
- when it only runs on one CPU.
 
 So I think we may still need to disable unlock function inlining even if we
 used your way of kernel call-site patching.

No need. Inlining should be (and is) working just fine.
 
 Regards,
 Longman
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] xen-pciback: use pci device flag operation helper function

2014-07-22 Thread Konrad Rzeszutek Wilk
On Wed, Jul 23, 2014 at 12:19:03AM +0800, Ethan Zhao wrote:
 Use the PCI device flag operation helper functions when setting the device
 to the assigned or deassigned state.
 
 Signed-off-by: Ethan Zhao ethan.z...@oracle.com

Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 ---
  drivers/xen/xen-pciback/pci_stub.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/xen/xen-pciback/pci_stub.c 
 b/drivers/xen/xen-pciback/pci_stub.c
 index 62fcd48..71f69f1 100644
 --- a/drivers/xen/xen-pciback/pci_stub.c
 +++ b/drivers/xen/xen-pciback/pci_stub.c
 @@ -133,7 +133,7 @@ static void pcistub_device_release(struct kref *kref)
   xen_pcibk_config_free_dyn_fields(dev);
   xen_pcibk_config_free_dev(dev);
  
 - dev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
 + pci_set_dev_deassigned(dev);
   pci_dev_put(dev);
  
   kfree(psdev);
 @@ -404,7 +404,7 @@ static int pcistub_init_device(struct pci_dev *dev)
   dev_dbg(&dev->dev, "reset device\n");
   xen_pcibk_reset_device(dev);
  
 - dev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
 + pci_set_dev_assigned(dev);
   return 0;
  
  config_release:
 -- 
 1.7.1
 


Re: [PATCH 10/11] qspinlock: Paravirt support

2014-07-15 Thread Konrad Rzeszutek Wilk
On Mon, Jul 07, 2014 at 05:27:34PM +0200, Peter Zijlstra wrote:
 On Fri, Jun 20, 2014 at 09:46:08AM -0400, Konrad Rzeszutek Wilk wrote:
  I dug in the code and I have some comments about it, but before
  I post them I was wondering if you have any plans to run any performance
  tests against the PV ticketlock with normal and over-committed scenarios?
 
 I can barely boot a guest.. I'm not sure I can make them do anything
 much at all yet. All this virt crap is totally painful.

HA!

The reason I asked about that is from a pen-and-paper view it looks
suboptimal in the worst case scenario compared to PV ticketlock.

The 'worst case scenario' is when we over-commit (more CPUs than there
are physical CPUs) or have to delay guests (the sum of all virtual
CPUs > physical CPUs and all of the guests are compiling kernels).

In those cases the PV ticketlock goes to sleep and gets woken up
once the ticket holder has finished. In the PV qspinlock we do
wake up the first in queue, but we also wake the next one in queue
so it can progress further. And so on.

Perhaps a better mechanism is just ditch the queue part and utilize
the byte part and under KVM and Xen just do bytelocking (since we
have 8 bits). For the PV halt/waking we can stash in the 'struct mcs'
the current lock that each CPU is waiting for. And the unlocker
can iterate over all of those and wake them all up. Perhaps make
the iteration random. Anyhow, that is how the old PV bytelock under
Xen worked (before 3.11) and it had worked pretty well (it didn't
do it randomly though - it always started with 'for_each_online_cpu').

Squashing in the ticketlock concept in qspinlock for PV looks
scary.

And as I said - this is all pen-and-paper - so it might be that this
'wake-up-go-sleep-on-the-queue' kick is actually not that bad?

Lastly - thank you for taking a stab at this.
 




Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-27 Thread Konrad Rzeszutek Wilk
On Mon, Jun 23, 2014 at 06:12:00PM +0200, Peter Zijlstra wrote:
 On Tue, Jun 17, 2014 at 04:03:29PM -0400, Konrad Rzeszutek Wilk wrote:
+   new = tail | (val & _Q_LOCKED_MASK);
+
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+
+   /*
+* we won the trylock; forget about queueing.
+*/
+   if (new == _Q_LOCKED_VAL)
+   goto release;
+
+   /*
+* if there was a previous node; link it and wait.
+*/
+   if (old & ~_Q_LOCKED_MASK) {
+   prev = decode_tail(old);
+   ACCESS_ONCE(prev->next) = node;
+
+   arch_mcs_spin_lock_contended(&node->locked);
  
  Could you add a comment here:
  
  /* We are spinning forever until the previous node updates locked - which
  it does once it has updated lock->val with our tail number. */
 
 That's incorrect -- or at least, I understand that to be incorrect. The
 previous node will not have changed the tail to point to us. You always
 change to tail to point to yourself, seeing how you add yourself to the
 tail.
 
 Is the existing comment any better if I s/wait./wait for it to release
 us./ ?

Yes!
 
+   /*
+* claim the lock:
+*
+* n,0 -> 0,1 : lock, uncontended
+* *,0 -> *,1 : lock, contended
+*/
+   for (;;) {
+   new = _Q_LOCKED_VAL;
+   if (val != tail)
+   new |= val;
   
  ..snip..
   
   Could you help a bit in explaining it in English please?
  
  After looking at the assembler code I finally figured out how
  we can get here. And the 'contended' part threw me off. Somehow
  I imagined there are two or more CPUs stampeding here and
  trying to update the lock->val. But in reality the other CPUs
  are stuck in the arch_mcs_spin_lock_contended spinning on their
  local value.
 
 Well, the lock as a whole is contended (there's >1 waiters), and the
 point of MCS style locks it to make sure they're not actually pounding
 on the same cacheline. So the whole thing is consistent.
 
  Perhaps you could add this comment.
  
  /* Once queue_spin_unlock is called (which _subtracts_ _Q_LOCKED_VAL from
  the lock->val and still preserving the tail data), the winner gets to
  claim the ticket. 
 
 There's no tickets :/

s/ticket/be first in line/ ?

 
  Since we still need the other CPUs to continue and
  preserve the strict ordering in which they setup node-next, we:
   1) update lock->val to the tail value (so tail CPU and its index) with
  _Q_LOCKED_VAL.
 
 We don't, we preserve the tail value, unless we're the tail, in which
 case we clear the tail.
 
   2). Once we are done, we poke the other CPU (the one that linked to
  us) by writing to node->locked (below) so they can make progress and
  loop on lock->val changing from _Q_LOCKED_MASK to zero).
 
 _If_ there was another cpu, ie. the tail didn't point to us.

nods
 
 ---
 
 I don't do well with natural language comments like that; they tend to
 confuse me more than anything.
 


Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-27 Thread Konrad Rzeszutek Wilk
On Mon, Jun 23, 2014 at 05:56:50PM +0200, Peter Zijlstra wrote:
 On Mon, Jun 16, 2014 at 04:49:18PM -0400, Konrad Rzeszutek Wilk wrote:
   Index: linux-2.6/kernel/locking/mcs_spinlock.h
   ===
   --- linux-2.6.orig/kernel/locking/mcs_spinlock.h
   +++ linux-2.6/kernel/locking/mcs_spinlock.h
   @@ -17,6 +17,7 @@
struct mcs_spinlock {
 struct mcs_spinlock *next;
 int locked; /* 1 if lock acquired */
   + int count;
  
  This could use a comment.
 
 like so?
 
   int count; /* nesting count, see qspinlock.c */

/* nesting level - in user, softirq, hardirq or nmi context. */ ?

 
 
   +static inline u32 encode_tail(int cpu, int idx)
   +{
   + u32 tail;
   +
   + tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
   + tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */
  
  Should there be an
  
  ASSERT(idx < 4)
  
  just in case we screw up somehow (I can't figure out how, but
  that is partially why ASSERTS are added).
 
 #ifdef CONFIG_DEBUG_SPINLOCK
   BUG_ON(idx > 3);
 #endif
 
 might do, I suppose.

nods
 
   +/**
   + * queue_spin_lock_slowpath - acquire the queue spinlock
   + * @lock: Pointer to queue spinlock structure
   + * @val: Current value of the queue spinlock 32-bit word
   + *
   + * (queue tail, lock bit)
  
  Except it is not a lock bit. It is a lock uint8_t.
 
 It is indeed, although that's an accident of implementation. I could do
 s/bit// and not mention the entire storage angle at all?

I think giving as much details as possible is good.

What you said about 'accident of implementation' could be woven
in there?
 
  Is the queue tail at this point the composite of 'cpu|idx'?
 
 Yes, as per {en,de}code_tail() above.
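For reference, the pair of helpers being discussed looks roughly like this:
encode_tail() packs the CPU number and nesting index into the upper bits of
the lock word, and decode_tail() is its inverse, returning the matching
per-CPU MCS node (mcs_nodes[] is the per-CPU node array defined elsewhere in
qspinlock.c):

    static inline u32 encode_tail(int cpu, int idx)
    {
            u32 tail;

            tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
            tail |= idx << _Q_TAIL_IDX_OFFSET;      /* assume idx < 4 */

            return tail;
    }

    static inline struct mcs_spinlock *decode_tail(u32 tail)
    {
            int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
            int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;

            return per_cpu_ptr(&mcs_nodes[idx], cpu);
    }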
 
   + *
   + *  fast      :    slow                                       :    unlock
   + *            :                                               :
   + * uncontended  (0,0)   --:--> (0,1) --------------------------:--> (*,0)
   + *                  :       | ^--------.                     /  :
   + *                  :       v           \                    |  :
   + * uncontended      :    (n,x) --+--> (n,0)                  |  :
  
  So many CPUn come in right? Is 'n' for the number of CPUs?
 
 Nope, 'n' for any one specific tail, in particular the first one to
 arrive. This is the 'uncontended queue' case as per the label, so we
 need a named value for the first, in order to distinguish between the
 state to the right (same tail, but unlocked) and the state below
 (different tail).
 
   + *   queue          :       | ^--'                           |  :
   + *                  :       v                                |  :
   + * contended        :    (*,x) --+--> (*,0) -----> (*,1) ----'  :
   + *   queue          :         ^--'                              :
  
  And here um, what are the '*' for? Are they the four different
  types of handlers that can be nested? So task, softirq, hardirq, and
  nmi?
 
 '*' as in wildcard, any tail, specifically not 'n'.

Ah, thank you for the explanation! Would it be possible to include
that in the comment please?

 
   +void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
   +{
   + struct mcs_spinlock *prev, *next, *node;
   + u32 new, old, tail;
   + int idx;
   +
   + BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
   +
   + node = this_cpu_ptr(&mcs_nodes[0]);
   + idx = node->count++;
  
  If this is the first time we enter this, wouldn't idx end up
  being 1?
 
 Nope, postfix ++ returns first and increments later.

blushes Yes it does.
 
   + tail = encode_tail(smp_processor_id(), idx);
   +
   + node += idx;
  
  Meaning we end up skipping the 'mcs_nodes[0]' one altogether - even
  on the first 'level' (task, softirq, hardirq, nmi)? Won't that
  cause us to blow past the array when we are nested at the nmi
  handler?
 
 Seeing how its all static storage, which is automagically initialized to
 0, combined with the postfix ++ (as opposed to the prefix ++) we should
 be getting 0 here.

I've no idea what I was thinking, but thank you for setting me straight.

 
   + node->locked = 0;
   + node->next = NULL;
   +
   + /*
   +  * trylock || xchg(lock, node)
   +  *
   +  * 0,0 -> 0,1 ; trylock
   +  * p,x -> n,x ; prev = xchg(lock, node)
  
  I looked at that for 10 seconds and I was not sure what you meant.
  Is this related to the MCS document you had pointed to? It would help
  if you mention that the comments follow the document. (But they
  don't seem to)
  
  I presume what you mean is that if we are the next after the
  lock-holder we need only to update the 'next' (or the
  composite value of smp_processor_idx | idx) to point to us.
  
  As in, swap the 'L' with 'I' (looking at the doc)
 
 They are the 'tail','lock' tuples, so this composite atomic operation
 completes either:
 
    0,0 -> 0,1  -- we had no tail, not locked; into: no tail, locked.
  
  OR
  
    p,x -> n,x  -- tail was p

Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-27 Thread Konrad Rzeszutek Wilk
On Mon, Jun 23, 2014 at 06:26:22PM +0200, Peter Zijlstra wrote:
 On Tue, Jun 17, 2014 at 04:05:31PM -0400, Konrad Rzeszutek Wilk wrote:
   + * The basic principle of a queue-based spinlock can best be understood
   + * by studying a classic queue-based spinlock implementation called the
   + * MCS lock. The paper below provides a good description for this kind
   + * of lock.
   + *
   + * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
   + *
   + * This queue spinlock implementation is based on the MCS lock, however 
   to make
   + * it fit the 4 bytes we assume spinlock_t to be, and preserve its 
   existing
   + * API, we must modify it some.
   + *
   + * In particular; where the traditional MCS lock consists of a tail 
   pointer
   + * (8 bytes) and needs the next pointer (another 8 bytes) of its own 
   node to
   + * unlock the next pending (next->locked), we compress both these: {tail,
   + * next->locked} into a single u32 value.
   + *
   + * Since a spinlock disables recursion of its own context and there is a 
   limit
   + * to the contexts that can nest; namely: task, softirq, hardirq, nmi, 
   we can
   + * encode the tail as an index indicating this context and a cpu number.
   + *
   + * We can further change the first spinner to spin on a bit in the lock 
   word
   + * instead of its node; whereby avoiding the need to carry a node from 
   lock to
   + * unlock, and preserving API.
  
  You also made changes (compared to the MCS) in that the unlock path is not
  spinning waiting for the successor and that the job of passing the lock
  is not done in the unlock path either.
  
  Instead all of that is now done in the path of the lock acquirer logic. 
  
  Could you update the comment to say that please?
 
 I _think_ I know what you mean.. So that is actually implied by the last

You do :-)

 paragraph, but I suppose I can make it explicit; something like:
 
   *
   * Another way to look at it is:
   *
   *  lock(tail,locked)
   *struct mcs_spinlock node;
   *mcs_spin_lock(tail, node);
   *test-and-set locked;
   *mcs_spin_unlock(tail, node);
   *
   *  unlock(tail,locked)
   *clear locked
   *
   * Where we have compressed (tail,locked) into a single u32 word.
 
 


Re: [PATCH 10/11] qspinlock: Paravirt support

2014-06-20 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:07PM +0200, Peter Zijlstra wrote:
 Add minimal paravirt support.
 
 The code aims for minimal impact on the native case.

Woot!
 
 On the lock side we add one jump label (asm_goto) and 4 paravirt
 callee saved calls that default to NOPs. The only effects are the
 extra NOPs and some pointless MOVs to accommodate the calling
 convention. No register spills happen because of this (x86_64).
 
 On the unlock side we have one paravirt callee saved call, which
 defaults to the actual unlock sequence: movb $0, (%rdi) and a NOP.
 
 The actual paravirt code comes in 3 parts;
 
  - init_node; this initializes the extra data members required for PV
state. PV state data is kept 1 cacheline ahead of the regular data.
 
  - link_and_wait_node/kick_node; these are paired with the regular MCS
queueing and are placed resp. before/after the paired MCS ops.
 
  - wait_head/queue_unlock; the interesting part here is finding the
head node to kick.
 
 Tracking the head is done in two parts, firstly the pv_wait_head will
 store its cpu number in whichever node is pointed to by the tail part
 of the lock word. Secondly, pv_link_and_wait_node() will propagate the
 existing head from the old to the new tail node.

I dug in the code and I have some comments about it, but before
I post them I was wondering if you have any plans to run any performance
tests against the PV ticketlock with normal and over-committed scenarios?

Looking at this with a pen and paper I see that compared to
PV ticketlock for the CPUs that are contending on the queue (so they
go to pv_wait_head_and_link, then progress to pv_wait_head), they
go to sleep twice and get woken up twice. In PV ticketlock the
contending CPUs would only go to sleep once and woken up once it
was their turn.

That of course is the worst case scenario - where the CPU
that has the lock is taking forever to do its job and the
host is quite overcommitted.

Thanks!


Re: [PATCH 03/11] qspinlock: Add pending bit

2014-06-18 Thread Konrad Rzeszutek Wilk
On Wed, Jun 18, 2014 at 01:29:48PM +0200, Paolo Bonzini wrote:
 On 17/06/2014 22:36, Konrad Rzeszutek Wilk wrote:
 +/* One more attempt - but if we fail mark it as pending. */
 +if (val == _Q_LOCKED_VAL) {
 +new = _Q_LOCKED_VAL | _Q_PENDING_VAL;
 +
 +old = atomic_cmpxchg(&lock->val, val, new);
 +if (old == _Q_LOCKED_VAL) /* YEEY! */
 +return;
 +val = old;
 +}
 
 Note that Peter's code is in a for(;;) loop:
 
 
 + for (;;) {
 + /*
 +  * If we observe any contention; queue.
 +  */
 + if (val & ~_Q_LOCKED_MASK)
 + goto queue;
 +
 + new = _Q_LOCKED_VAL;
 + if (val == new)
 + new |= _Q_PENDING_VAL;
 +
 + old = atomic_cmpxchg(&lock->val, val, new);
 + if (old == val)
 + break;
 +
 + val = old;
 + }
 +
 + /*
 +  * we won the trylock
 +  */
 + if (new == _Q_LOCKED_VAL)
 + return;
 
 So what you'd have is basically:
 
   /*
* One more attempt if no one is already in queue.  Perhaps
* they have unlocked the spinlock already.
*/
   if (val == _Q_LOCKED_VAL && atomic_read(&lock->val) == 0) {
   old = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
   if (old == 0) /* YEEY! */
   return;
   val = old;
   }
 
 But I agree with Waiman that this is unlikely to trigger often enough. It
 does have to be handled in the slowpath for correctness, but the most likely
 path is (0,0,1) -> (0,1,1).

nods
 
 Paolo


Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

2014-06-18 Thread Konrad Rzeszutek Wilk
On Wed, Jun 18, 2014 at 01:37:45PM +0200, Paolo Bonzini wrote:
 On 17/06/2014 22:55, Konrad Rzeszutek Wilk wrote:
 On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote:
 From: Waiman Long waiman.l...@hp.com
 
 This patch extracts the logic for the exchange of new and previous tail
 code words into a new xchg_tail() function which can be optimized in a
 later patch.
 
 And also adds a third try on acquiring the lock. That I think should
 be a separate patch.
 
 It doesn't really add a new try, the old code is:
 
 
 - for (;;) {
 - new = _Q_LOCKED_VAL;
 - if (val)
  - new = tail | (val & _Q_LOCKED_PENDING_MASK);
 -
  - old = atomic_cmpxchg(&lock->val, val, new);
 - if (old == val)
 - break;
 -
 - val = old;
 - }
 
   /*
 -  * we won the trylock; forget about queueing.
*/
 - if (new == _Q_LOCKED_VAL)
 - goto release;
 
 The trylock happens if the if (val) hits the else branch.
 
 What the patch does is change it from attempting two transitions with a
 single cmpxchg:
 
 -  * 0,0,0 -> 0,0,1 ; trylock
 -  * p,y,x -> n,y,x ; prev = xchg(lock, node)
 
 to first doing the trylock, then the xchg.  If the trylock passes and the
 xchg returns prev=0,0,0, the next step of the algorithm goes to the
 locked/uncontended state
 
 + /*
 +  * claim the lock:
 +  *
 +  * n,0 -> 0,1 : lock, uncontended
 
 Similar to your suggestion of patch 3, it's expected that the xchg will
 *not* return prev=0,0,0 after a failed trylock.

I do like your explanation. I hope that Peter will put it in the
description as it explains the change quite well.

 
 However, I *do* agree with you that it's simpler to just squash this patch
 into 01/11.

Uh, did I say that? Oh I said why don't make it right the first time!

I meant in terms of separating the slowpath (aka the bytelock on the pending
bit) from the queue (MCS code). Or renaming the function to be called
'complex' instead of 'slowpath' as it is getting quite hairy.

The #1 patch is nice by itself - as it lays out the foundation of the
MCS-similar code - and if Ingo decides he does not want this pending
byte-lock bit business - it can be easily reverted or dropped.

In terms of squashing this in #1 - I would advocate against that.

Thanks!


Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

2014-06-18 Thread Konrad Rzeszutek Wilk
 However, I *do* agree with you that it's simpler to just squash this patch
 into 01/11.
 Uh, did I say that? Oh I said why don't make it right the first time!
 
 I meant in terms of separating the slowpath (aka the bytelock on the pending
 bit) from the queue (MCS code). Or renaming the function to be called
 'complex' instead of 'slowpath' as it is getting quite hairy.
 
 The #1 patch is nice by itself - as it lays out the foundation of the
 MCS-similar code - and if Ingo decides he does not want this pending
 byte-lock bit business - it can be easily reverted or dropped.
 
 The pending bit code is needed for performance parity with ticket spinlock
 for light load. My own measurement indicates that the queuing overhead will
 cause the queue spinlock to be slower than ticket spinlock with 2-4
 contending tasks. The pending bit solves the performance problem with 2

Aha!

 contending tasks, leaving only the 3-4 task cases a bit slower than the
 ticket spinlock which should be more than compensated by its superior
 performance with heavy contention and slightly better performance with no
 contention.

That should be mentioned in the commit description as the rationale for
the patch "qspinlock: Add pending bit" and also in the code.

Thank you!


Re: [PATCH 09/11] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled

2014-06-18 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:06PM +0200, Peter Zijlstra wrote:
 From: Waiman Long waiman.l...@hp.com
 
 This patch renames the paravirt_ticketlocks_enabled static key to a
 more generic paravirt_spinlocks_enabled name.
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 Signed-off-by: Peter Zijlstra pet...@infradead.org

Acked-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com

 ---
  arch/x86/include/asm/spinlock.h  |4 ++--
  arch/x86/kernel/kvm.c|2 +-
  arch/x86/kernel/paravirt-spinlocks.c |4 ++--
  arch/x86/xen/spinlock.c  |2 +-
  4 files changed, 6 insertions(+), 6 deletions(-)
 
 --- a/arch/x86/include/asm/spinlock.h
 +++ b/arch/x86/include/asm/spinlock.h
 @@ -39,7 +39,7 @@
  /* How long a lock should spin before we consider blocking */
  #define SPIN_THRESHOLD   (1 << 15)
  
 -extern struct static_key paravirt_ticketlocks_enabled;
 +extern struct static_key paravirt_spinlocks_enabled;
  static __always_inline bool static_key_false(struct static_key *key);
  
  #ifdef CONFIG_QUEUE_SPINLOCK
 @@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowp
  static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
  {
   if (TICKET_SLOWPATH_FLAG &&
 - static_key_false(&paravirt_ticketlocks_enabled)) {
 + static_key_false(&paravirt_spinlocks_enabled)) {
   arch_spinlock_t prev;
  
   prev = *lock;
 --- a/arch/x86/kernel/kvm.c
 +++ b/arch/x86/kernel/kvm.c
 @@ -819,7 +819,7 @@ static __init int kvm_spinlock_init_jump
   if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
   return 0;
  
 - static_key_slow_inc(&paravirt_ticketlocks_enabled);
 + static_key_slow_inc(&paravirt_spinlocks_enabled);
   printk(KERN_INFO "KVM setup paravirtual spinlock\n");
  
   return 0;
 --- a/arch/x86/kernel/paravirt-spinlocks.c
 +++ b/arch/x86/kernel/paravirt-spinlocks.c
 @@ -16,5 +16,5 @@ struct pv_lock_ops pv_lock_ops = {
  };
  EXPORT_SYMBOL(pv_lock_ops);
  
 -struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
 -EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
 +struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE;
 +EXPORT_SYMBOL(paravirt_spinlocks_enabled);
 --- a/arch/x86/xen/spinlock.c
 +++ b/arch/x86/xen/spinlock.c
 @@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jum
   if (!xen_domain())
   return 0;
  
 - static_key_slow_inc(&paravirt_ticketlocks_enabled);
 + static_key_slow_inc(&paravirt_spinlocks_enabled);
   return 0;
  }
  early_initcall(xen_init_spinlocks_jump);
 
 


Re: [PATCH 08/11] qspinlock: Revert to test-and-set on hypervisors

2014-06-18 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:05PM +0200, Peter Zijlstra wrote:
 When we detect a hypervisor (!paravirt, see later patches), revert to

Please spell out the name of the patches.

 a simple test-and-set lock to avoid the horrors of queue preemption.

Heheh.
 
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  arch/x86/include/asm/qspinlock.h |   14 ++
  include/asm-generic/qspinlock.h  |7 +++
  kernel/locking/qspinlock.c   |3 +++
  3 files changed, 24 insertions(+)
 
 --- a/arch/x86/include/asm/qspinlock.h
 +++ b/arch/x86/include/asm/qspinlock.h
 @@ -1,6 +1,7 @@
  #ifndef _ASM_X86_QSPINLOCK_H
  #define _ASM_X86_QSPINLOCK_H
  
 +#include <asm/cpufeature.h>
  #include <asm-generic/qspinlock_types.h>
  
  #if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
 @@ -20,6 +21,19 @@ static inline void queue_spin_unlock(str
  
  #endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
  
 +#define virt_queue_spin_lock virt_queue_spin_lock
 +
 +static inline bool virt_queue_spin_lock(struct qspinlock *lock)
 +{
 + if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
 + return false;
 +
 + while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
 + cpu_relax();
 +
 + return true;
 +}
 +
  #include <asm-generic/qspinlock.h>
  
  #endif /* _ASM_X86_QSPINLOCK_H */
 --- a/include/asm-generic/qspinlock.h
 +++ b/include/asm-generic/qspinlock.h
 @@ -98,6 +98,13 @@ static __always_inline void queue_spin_u
  }
  #endif
  
 +#ifndef virt_queue_spin_lock
 +static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock)
 +{
 + return false;
 +}
 +#endif
 +
  /*
   * Initializier
   */
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -247,6 +247,9 @@ void queue_spin_lock_slowpath(struct qsp
  
   BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
  
 + if (virt_queue_spin_lock(lock))
 + return;
 +
   /*
* wait for in-progress pending-locked hand-overs
*
 
 


Re: [PATCH 07/11] qspinlock: Use a simple write to grab the lock, if applicable

2014-06-18 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:04PM +0200, Peter Zijlstra wrote:
 From: Waiman Long waiman.l...@hp.com
 
 Currently, atomic_cmpxchg() is used to get the lock. However, this is
 not really necessary if there is more than one task in the queue and
 the queue head don't need to reset the queue code word. For that case,

s/queue code word/tail {number,value}/ ?


 a simple write to set the lock bit is enough as the queue head will
 be the only one eligible to get the lock as long as it checks that
 both the lock and pending bits are not set. The current pending bit
 waiting code will ensure that the bit will not be set as soon as the
 queue code word (tail) in the lock is set.

Just use the same word as above.
 
 With that change, the are some slight improvement in the performance
 of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket
 Westmere-EX machine as shown in the tables below.
 
   [Standalone/Embedded - same node]
   # of tasks   Before patch   After patch   %Change
   ----------   ------------   -----------   -------
        3        2324/2321      2248/2265    -3%/-2%
        4        2890/2896      2819/2831    -2%/-2%
        5        3611/3595      3522/3512    -2%/-2%
        6        4281/4276      4173/4160    -3%/-3%
        7        5018/5001      4875/4861    -3%/-3%
        8        5759/5750      5563/5568    -3%/-3%
 
   [Standalone/Embedded - different nodes]
   # of tasks   Before patch   After patch   %Change
   ----------   ------------   -----------   -------
        3       12242/12237    12087/12093   -1%/-1%
        4       10688/10696    10507/10521   -2%/-2%
 
 It was also found that this change produced a much bigger performance
 improvement in the newer IvyBridge-EX chip and was essentially to close
 the performance gap between the ticket spinlock and queue spinlock.
 
 The disk workload of the AIM7 benchmark was run on a 4-socket
 Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
 on a 3.14 based kernel. The results of the test runs were:
 
 AIM7 XFS Disk Test
    kernel          JPM      Real Time   Sys Time   Usr Time
    ----------    -------    ---------   --------   --------
    ticketlock    5678233      3.17        96.61      5.81
    qspinlock     5750799      3.13        94.83      5.97
 
  AIM7 EXT4 Disk Test
    kernel          JPM      Real Time   Sys Time   Usr Time
    ----------    -------    ---------   --------   --------
    ticketlock    1114551     16.15       509.72      7.11
    qspinlock     2184466      8.24       232.99      6.01
 
 The ext4 filesystem run had a much higher spinlock contention than
 the xfs filesystem run.
 
 The ebizzy -m test was also run with the following results:
 
   kernel       records/s  Real Time   Sys Time   Usr Time
   ------       ---------  ---------   --------   --------
   ticketlock 2075   10.00  216.35   3.49
   qspinlock  3023   10.00  198.20   4.80
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  kernel/locking/qspinlock.c |   59 
 -
  1 file changed, 43 insertions(+), 16 deletions(-)
 
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -93,24 +93,33 @@ static inline struct mcs_spinlock *decod
   * By using the whole 2nd least significant byte for the pending bit, we
   * can allow better optimization of the lock acquisition for the pending
   * bit holder.
 + *
 + * This internal structure is also used by the set_locked function which
 + * is not restricted to _Q_PENDING_BITS == 8.
   */
 -#if _Q_PENDING_BITS == 8
 -
  struct __qspinlock {
   union {
   atomic_t val;
 - struct {
  #ifdef __LITTLE_ENDIAN
 + u8   locked;
 + struct {
   u16 locked_pending;
   u16 tail;
 + };
  #else
 + struct {
   u16 tail;
   u16 locked_pending;
 -#endif
   };
 + struct {
 + u8  reserved[3];
 + u8  locked;
 + };
 +#endif
   };
  };
  
 +#if _Q_PENDING_BITS == 8
  /**
   * clear_pending_set_locked - take ownership and clear the pending bit.
   * @lock: Pointer to queue spinlock structure
 @@ -197,6 +206,19 @@ static __always_inline u32 xchg_tail(str
  #endif /* _Q_PENDING_BITS == 8 */
  
  /**
 + * set_locked - Set the lock bit and own the lock

Full stop missing.

 + * @lock: Pointer to queue spinlock structure

Ditto.
 + *
 + * *,*,0 - *,0,1
 + */
 +static __always_inline void set_locked(struct qspinlock *lock)
 +{
 + struct __qspinlock *l = (void *)lock;
 +
 + ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;
 +}
 +
 +/**
   * 

Re: [PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS

2014-06-18 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:02PM +0200, Peter Zijlstra wrote:
 From: Peter Zijlstra pet...@infradead.org
 
 When we allow for a max NR_CPUS < 2^14 we can optimize the pending
 wait-acquire and the xchg_tail() operations.
 
 By growing the pending bit to a byte, we reduce the tail to 16bit.
 This means we can use xchg16 for the tail part and do away with all
 the repeated compxchg() operations.
 
 This in turn allows us to unconditionally acquire; the locked state
 as observed by the wait loops cannot change. And because both locked
 and pending are now a full byte we can use simple stores for the
 state transition, obviating one atomic operation entirely.

I have to ask - how much more performance do you get from this?

Is this extra atomic operation hurting that much?
 
 All this is horribly broken on Alpha pre EV56 (and any other arch that
 cannot do single-copy atomic byte stores).
 
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  include/asm-generic/qspinlock_types.h |   13 
  kernel/locking/qspinlock.c|  103 
 ++
  2 files changed, 106 insertions(+), 10 deletions(-)
 
 --- a/include/asm-generic/qspinlock_types.h
 +++ b/include/asm-generic/qspinlock_types.h
 @@ -38,6 +38,14 @@ typedef struct qspinlock {
  /*
   * Bitfields in the atomic value:
   *
 + * When NR_CPUS < 16K
 + *  0- 7: locked byte
 + * 8: pending
 + *  9-15: not used
 + * 16-17: tail index
 + * 18-31: tail cpu (+1)
 + *
 + * When NR_CPUS >= 16K
   *  0- 7: locked byte
   * 8: pending
   *  9-10: tail index
 @@ -50,7 +58,11 @@ typedef struct qspinlock {
  #define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
  
  #define _Q_PENDING_OFFSET(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
 +#if CONFIG_NR_CPUS < (1U << 14)
 +#define _Q_PENDING_BITS  8
 +#else
  #define _Q_PENDING_BITS  1
 +#endif
  #define _Q_PENDING_MASK  _Q_SET_MASK(PENDING)
  
  #define _Q_TAIL_IDX_OFFSET   (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 @@ -61,6 +73,7 @@ typedef struct qspinlock {
  #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
  #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
  
 +#define _Q_TAIL_OFFSET   _Q_TAIL_IDX_OFFSET
  #define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
  
   #define _Q_LOCKED_VAL    (1U << _Q_LOCKED_OFFSET)
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -22,6 +22,7 @@
   #include <linux/percpu.h>
   #include <linux/hardirq.h>
   #include <linux/mutex.h>
  +#include <asm/byteorder.h>
   #include <asm/qspinlock.h>
  
  /*
 @@ -48,6 +49,9 @@
   * We can further change the first spinner to spin on a bit in the lock word
   * instead of its node; whereby avoiding the need to carry a node from lock 
 to
   * unlock, and preserving API.
 + *
 + * N.B. The current implementation only supports architectures that allow
 + *  atomic operations on smaller 8-bit and 16-bit data types.
   */
  
   #include "mcs_spinlock.h"
 @@ -85,6 +89,87 @@ static inline struct mcs_spinlock *decod
  
  #define _Q_LOCKED_PENDING_MASK   (_Q_LOCKED_MASK | _Q_PENDING_MASK)
  
 +/*
 + * By using the whole 2nd least significant byte for the pending bit, we
 + * can allow better optimization of the lock acquisition for the pending
 + * bit holder.
 + */
 +#if _Q_PENDING_BITS == 8
 +
 +struct __qspinlock {
 + union {
 + atomic_t val;
 + struct {
 +#ifdef __LITTLE_ENDIAN
 + u16 locked_pending;
 + u16 tail;
 +#else
 + u16 tail;
 + u16 locked_pending;
 +#endif
 + };
 + };
 +};
 +
 +/**
 + * clear_pending_set_locked - take ownership and clear the pending bit.
 + * @lock: Pointer to queue spinlock structure
 + * @val : Current value of the queue spinlock 32-bit word
 + *
 + * *,1,0 -> *,0,1
 + *
 + * Lock stealing is not allowed if this function is used.
 + */
 +static __always_inline void
 +clear_pending_set_locked(struct qspinlock *lock, u32 val)
 +{
 + struct __qspinlock *l = (void *)lock;
 +
 + ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL;
 +}
 +
 +/*
 + * xchg_tail - Put in the new queue tail code word & retrieve previous one

Missing full stop.
 + * @lock : Pointer to queue spinlock structure
 + * @tail : The new queue tail code word
 + * Return: The previous queue tail code word
 + *
 + * xchg(lock, tail)
 + *
 + * p,*,* -> n,*,* ; prev = xchg(lock, node)
 + */
 +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 +{
 + struct __qspinlock *l = (void *)lock;
 +
 + return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 +}
 +
 +#else /* _Q_PENDING_BITS == 8 */
 +
 +/**
 + * clear_pending_set_locked - take ownership and clear the pending bit.
 + * @lock: Pointer to queue spinlock structure
 + * @val : Current value of the queue spinlock 32-bit word
 + *
 + * *,1,0 -> *,0,1
 + */
 +static __always_inline void
 +clear_pending_set_locked(struct 

Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-17 Thread Konrad Rzeszutek Wilk
  +   new = tail | (val & _Q_LOCKED_MASK);
  +
  +   old = atomic_cmpxchg(&lock->val, val, new);
  +   if (old == val)
  +   break;
  +
  +   val = old;
  +   }
  +
  +   /*
  +* we won the trylock; forget about queueing.
  +*/
  +   if (new == _Q_LOCKED_VAL)
  +   goto release;
  +
  +   /*
  +* if there was a previous node; link it and wait.
  +*/
  +   if (old & ~_Q_LOCKED_MASK) {
  +   prev = decode_tail(old);
  +   ACCESS_ONCE(prev->next) = node;
  +
  +   arch_mcs_spin_lock_contended(&node->locked);

Could you add a comment here:

/* We are spinning forever until the previous node updates locked - which
it does once it has updated lock->val with our tail number. */

  +   }
  +
  +   /*
  +* we're at the head of the waitqueue, wait for the owner to go away.
  +*
  +* *,x -> *,0
  +*/
  +   while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
  +   cpu_relax();
  +
  +   /*
  +* claim the lock:
  +*
  +* n,0 -> 0,1 : lock, uncontended
  +* *,0 -> *,1 : lock, contended
  +*/
  +   for (;;) {
  +   new = _Q_LOCKED_VAL;
  +   if (val != tail)
  +   new |= val;
 
..snip..
 
 Could you help a bit in explaining it in English please?

After looking at the assembler code I finally figured out how
we can get here. And the 'contended' part threw me off. Somehow
I imagined there were two or more CPUs stampeding here and
trying to update lock->val. But in reality the other CPUs
are stuck in the arch_mcs_spin_lock_contended spinning on their
local value.

Perhaps you could add this comment.

/* Once queue_spin_unlock is called (which _subtracts_ _Q_LOCKED_VAL from
lock->val while still preserving the tail data), the winner gets to
claim the ticket. Since we still need the other CPUs to continue and
preserve the strict ordering in which they set up node->next, we:
 1) update lock->val to the tail value (so tail CPU and its index) with
_Q_LOCKED_VAL.
 2) Once we are done, we poke the other CPU (the one that linked to
us) by writing to node->locked (below) so they can make progress and
loop on lock->val changing from _Q_LOCKED_MASK to zero).

*/


Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-17 Thread Konrad Rzeszutek Wilk
 + * The basic principle of a queue-based spinlock can best be understood
 + * by studying a classic queue-based spinlock implementation called the
 + * MCS lock. The paper below provides a good description for this kind
 + * of lock.
 + *
 + * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
 + *
 + * This queue spinlock implementation is based on the MCS lock, however to 
 make
 + * it fit the 4 bytes we assume spinlock_t to be, and preserve its existing
 + * API, we must modify it some.
 + *
 + * In particular; where the traditional MCS lock consists of a tail pointer
 + * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to
 + * unlock the next pending (next->locked), we compress both these: {tail,
 + * next->locked} into a single u32 value.
 + *
 + * Since a spinlock disables recursion of its own context and there is a 
 limit
 + * to the contexts that can nest; namely: task, softirq, hardirq, nmi, we can
 + * encode the tail as an index indicating this context and a cpu number.
 + *
 + * We can further change the first spinner to spin on a bit in the lock word
 + * instead of its node; whereby avoiding the need to carry a node from lock 
 to
 + * unlock, and preserving API.

You also made changes (compared to the MCS) in that the unlock path is not
spinning waiting for the successor and that the job of passing the lock
is not done in the unlock path either.

Instead all of that is now done in the path of the lock acquirer logic. 

Could you update the comment to say that please?

Thanks.


Re: [PATCH 03/11] qspinlock: Add pending bit

2014-06-17 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:00PM +0200, Peter Zijlstra wrote:
 Because the qspinlock needs to touch a second cacheline; add a pending
 bit and allow a single in-word spinner before we punt to the second
 cacheline.

Could you add this in the description please:

And by second cacheline we mean the local 'node'. That is the:
mcs_nodes[0] and mcs_nodes[idx]

Perhaps it might be better then to split this in the header file
as this is trying to not be a slowpath code - but rather - a
pre-slow-path-lets-try-if-we can do another cmpxchg in case
the unlocker has just unlocked itself.

So something like:

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index e8a7ae8..29cc9c7 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -75,11 +75,21 @@ extern void queue_spin_lock_slowpath(struct qspinlock 
*lock, u32 val);
  */
 static __always_inline void queue_spin_lock(struct qspinlock *lock)
 {
-   u32 val;
+   u32 val, new;
 
    val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
if (likely(val == 0))
return;
+
+   /* One more attempt - but if we fail mark it as pending. */
+   if (val == _Q_LOCKED_VAL) {
+   new = _Q_LOCKED_VAL | _Q_PENDING_VAL;
+
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == _Q_LOCKED_VAL) /* YEEY! */
+   return;
+   val = old;
+   }
queue_spin_lock_slowpath(lock, val);
 }

and then the slowpath preserves most of the old logic path
(with the pending bit stuff)?

 
 
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  include/asm-generic/qspinlock_types.h |   12 ++-
  kernel/locking/qspinlock.c|  109 
 +++---
  2 files changed, 97 insertions(+), 24 deletions(-)
 
 --- a/include/asm-generic/qspinlock_types.h
 +++ b/include/asm-generic/qspinlock_types.h
 @@ -39,8 +39,9 @@ typedef struct qspinlock {
   * Bitfields in the atomic value:
   *
   *  0- 7: locked byte
 - *  8- 9: tail index
 - * 10-31: tail cpu (+1)
 + * 8: pending
 + *  9-10: tail index
 + * 11-31: tail cpu (+1)
   */
  #define  _Q_SET_MASK(type)   (((1U << _Q_ ## type ## _BITS) - 1)\
  << _Q_ ## type ## _OFFSET)
 @@ -48,7 +49,11 @@ typedef struct qspinlock {
  #define _Q_LOCKED_BITS   8
  #define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
  
 -#define _Q_TAIL_IDX_OFFSET   (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
 +#define _Q_PENDING_OFFSET(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
 +#define _Q_PENDING_BITS  1
 +#define _Q_PENDING_MASK  _Q_SET_MASK(PENDING)
 +
 +#define _Q_TAIL_IDX_OFFSET   (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
  #define _Q_TAIL_IDX_BITS 2
  #define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX)
  
 @@ -57,5 +62,6 @@ typedef struct qspinlock {
  #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
  
   #define _Q_LOCKED_VAL    (1U << _Q_LOCKED_OFFSET)
  +#define _Q_PENDING_VAL   (1U << _Q_PENDING_OFFSET)
  
  #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -83,24 +83,28 @@ static inline struct mcs_spinlock *decod
   return per_cpu_ptr(mcs_nodes[idx], cpu);
  }
  
 +#define _Q_LOCKED_PENDING_MASK   (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 +
  /**
   * queue_spin_lock_slowpath - acquire the queue spinlock
   * @lock: Pointer to queue spinlock structure
   * @val: Current value of the queue spinlock 32-bit word
   *
 - * (queue tail, lock bit)
 - *
 - *  fast  :slow  :
 unlock
 - *:  :
 - * uncontended  (0,0)   --:-- (0,1) :-- 
 (*,0)
 - *:   | ^./  :
 - *:   v   \   |  :
 - * uncontended:(n,x) --+-- (n,0) |  :
 - *   queue:   | ^--'  |  :
 - *:   v   |  :
 - * contended  :(*,x) --+-- (*,0) - (*,1) ---'  :
 - *   queue: ^--' :
 + * (queue tail, pending bit, lock bit)
   *
 + *  fast :slow  :
 unlock
 + *   :  :
 + * uncontended  (0,0,0) -:-- (0,0,1) --:-- 
 (*,*,0)
 + *   :   | ^.--. /  :
 + *   :   v   \  \|  :
 + * pending   :(0,1,1) +-- (0,1,0)   \   |  :
 + *   :   | ^--'  |   |  :
 + *   :   v   |   |  :
 + * 

Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

2014-06-17 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote:
 From: Waiman Long waiman.l...@hp.com
 
 This patch extracts the logic for the exchange of new and previous tail
 code words into a new xchg_tail() function which can be optimized in a
 later patch.

And also adds a third try on acquiring the lock. That I think should
be a separate patch.

And instead of saying 'later patch' you should spell out the name
of the patch. Especially as this might not be obvious from somebody
doing git bisection.

 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  include/asm-generic/qspinlock_types.h |2 +
  kernel/locking/qspinlock.c|   58 
 +-
  2 files changed, 38 insertions(+), 22 deletions(-)
 
 --- a/include/asm-generic/qspinlock_types.h
 +++ b/include/asm-generic/qspinlock_types.h
 @@ -61,6 +61,8 @@ typedef struct qspinlock {
  #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
  #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
  
 +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
 +
   #define _Q_LOCKED_VAL    (1U << _Q_LOCKED_OFFSET)
   #define _Q_PENDING_VAL   (1U << _Q_PENDING_OFFSET)
  
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -86,6 +86,31 @@ static inline struct mcs_spinlock *decod
  #define _Q_LOCKED_PENDING_MASK   (_Q_LOCKED_MASK | _Q_PENDING_MASK)
  
  /**
 + * xchg_tail - Put in the new queue tail code word & retrieve previous one
 + * @lock : Pointer to queue spinlock structure
 + * @tail : The new queue tail code word
 + * Return: The previous queue tail code word
 + *
 + * xchg(lock, tail)
 + *
 + * p,*,* -> n,*,* ; prev = xchg(lock, node)
 + */
 +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 +{
 + u32 old, new, val = atomic_read(&lock->val);
 +
 + for (;;) {
 + new = (val & _Q_LOCKED_PENDING_MASK) | tail;
 + old = atomic_cmpxchg(&lock->val, val, new);
 + if (old == val)
 + break;
 +
 + val = old;
 + }
 + return old;
 +}
 +
 +/**
   * queue_spin_lock_slowpath - acquire the queue spinlock
   * @lock: Pointer to queue spinlock structure
   * @val: Current value of the queue spinlock 32-bit word
 @@ -182,36 +207,25 @@ void queue_spin_lock_slowpath(struct qsp
   node-next = NULL;
  
   /*
 -  * we already touched the queueing cacheline; don't bother with pending
 -  * stuff.
 -  *
 -  * trylock || xchg(lock, node)
 -  *
 -  * 0,0,0 -> 0,0,1 ; trylock
 -  * p,y,x -> n,y,x ; prev = xchg(lock, node)
 +  * We touched a (possibly) cold cacheline in the per-cpu queue node;
 +  * attempt the trylock once more in the hope someone let go while we
 +  * weren't watching.
*/
 - for (;;) {
 - new = _Q_LOCKED_VAL;
 - if (val)
 - new = tail | (val & _Q_LOCKED_PENDING_MASK);
 -
 - old = atomic_cmpxchg(&lock->val, val, new);
 - if (old == val)
 - break;
 -
 - val = old;
 - }
 + if (queue_spin_trylock(lock))
 + goto release;

So now there are three of them? One in queue_spin_lock, then at the start
of this function when checking for the pending bit, and then once more
here. And that is because the local cache line might be cold for the
'mcs_index' struct?

That all seems a bit experimental. But then we are already
in the slowpath so we could just as well do:

for (i = 0; i < 10; i++)
	if (queue_spin_trylock(lock))
		goto release;

And would have the same effect.


  
   /*
 -  * we won the trylock; forget about queueing.
 +  * we already touched the queueing cacheline; don't bother with pending
 +  * stuff.

I guess we could also just erase the pending bit if we wanted to. The
optimistic spinning will still go to the queue label as lock->val will
have the tail value.

 +  *
 +  * p,*,* -> n,*,*
*/
 - if (new == _Q_LOCKED_VAL)
 - goto release;
 + old = xchg_tail(lock, tail);
  
   /*
* if there was a previous node; link it and wait.
*/
 - if (old & ~_Q_LOCKED_PENDING_MASK) {
 + if (old & _Q_TAIL_MASK) {
   prev = decode_tail(old);
   ACCESS_ONCE(prev->next) = node;
  
 
 


Re: [PATCH 03/11] qspinlock: Add pending bit

2014-06-17 Thread Konrad Rzeszutek Wilk
On Tue, Jun 17, 2014 at 04:51:57PM -0400, Waiman Long wrote:
 On 06/17/2014 04:36 PM, Konrad Rzeszutek Wilk wrote:
 On Sun, Jun 15, 2014 at 02:47:00PM +0200, Peter Zijlstra wrote:
 Because the qspinlock needs to touch a second cacheline; add a pending
 bit and allow a single in-word spinner before we punt to the second
 cacheline.
 Could you add this in the description please:
 
 And by second cacheline we mean the local 'node'. That is the:
 mcs_nodes[0] and mcs_nodes[idx]
 
 Perhaps it might be better then to split this in the header file
 as this is trying to not be a slowpath code - but rather - a
 pre-slow-path-lets-try-if-we can do another cmpxchg in case
 the unlocker has just unlocked itself.
 
 So something like:
 
 diff --git a/include/asm-generic/qspinlock.h 
 b/include/asm-generic/qspinlock.h
 index e8a7ae8..29cc9c7 100644
 --- a/include/asm-generic/qspinlock.h
 +++ b/include/asm-generic/qspinlock.h
 @@ -75,11 +75,21 @@ extern void queue_spin_lock_slowpath(struct qspinlock 
 *lock, u32 val);
*/
   static __always_inline void queue_spin_lock(struct qspinlock *lock)
   {
 -u32 val;
 +u32 val, new;
 
   val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
  if (likely(val == 0))
  return;
 +
 +/* One more attempt - but if we fail mark it as pending. */
 +if (val == _Q_LOCKED_VAL) {
  +new = _Q_LOCKED_VAL | _Q_PENDING_VAL;
  +
  +old = atomic_cmpxchg(&lock->val, val, new);
 +if (old == _Q_LOCKED_VAL) /* YEEY! */
 +return;
 
  No, it can't leave like that. The unlock path will not clear the pending bit.

Err, you are right. It needs to go back in the slowpath.

 We are trying to make the fastpath as simple as possible as it may be
 inlined. The complexity of the queue spinlock is in the slowpath.

Sure, but then it shouldn't be called slowpath anymore as it is not
slow. It is a combination of fast path (the potential chance of
grabbing the lock and setting the pending lock) and the real slow
path (the queuing). Perhaps it should be called 'queue_spinlock_complex' ?

 
 Moreover, a cmpxchg immediately followed by another cmpxchg will
 just increase the level of memory contention when a lock is fairly
 contended. The chance of the second cmpxchg() succeeding will be pretty low.

Then why even do the pending bit - which is what the slowpath does
the first time. And if it grabs it (and sets the pending bit) it
immediately exits. Why not percolate that piece of code into this header.

And then leave all that slow code (queuing, mcs_lock access, etc.) in the slowpath.
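(For reference, the pending-bit piece that has to stay out of the inlined
fast path is roughly this wait-and-claim sequence from the slowpath - a
sketch of the patch's logic, not a verbatim copy:)

	/* we're pending; wait for the owner to go away:  *,1,1 -> *,1,0 */
	while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
		cpu_relax();

	/* take ownership and clear the pending bit:      *,1,0 -> *,0,1 */
	clear_pending_set_locked(lock, val);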

 
 -Longman
 
 


Re: [PATCH v11 14/16] pvqspinlock: Add qspinlock para-virtualization support

2014-06-17 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 03:16:54PM +0200, Peter Zijlstra wrote:
 On Thu, Jun 12, 2014 at 04:48:41PM -0400, Waiman Long wrote:
  I don't have a good understanding of the kernel alternatives mechanism.
 
 I didn't either; I do now, cost me a whole day reading up on
 alternative/paravirt code patching.
 
 See the patches I just send out; I got the 'native' case with paravirt
 enabled to be one NOP worse than the native case without paravirt -- for
 queue_spin_unlock.
 
 The lock slowpath is several nops and some pointless movs more expensive.

You could use the asm goto which would optimize the fast path to be the
'native' case. That way you wouldn't have the the nops and movs in the
path.

(And asm goto also uses the alternative_asm macros).
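A minimal sketch of the idea (illustrative only - the key and the two helpers
below are hypothetical stand-ins, not code from this series):

#include <linux/jump_label.h>

static struct static_key pv_spinlock_key = STATIC_KEY_INIT_FALSE;

static __always_inline void example_queue_spin_unlock(struct qspinlock *lock)
{
	/*
	 * static_key_false() is implemented with asm goto, so with the key
	 * disabled this compiles down to the native unlock plus a single NOP;
	 * the jump to the PV slow path is only patched in when enabled.
	 */
	if (static_key_false(&pv_spinlock_key))
		pv_queue_spin_unlock_slow(lock);	/* hypothetical PV helper */
	else
		native_queue_spin_unlock(lock);		/* hypothetical native store */
}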


Re: [PATCH 03/11] qspinlock: Add pending bit

2014-06-17 Thread Konrad Rzeszutek Wilk
On Tue, Jun 17, 2014 at 05:07:29PM -0400, Konrad Rzeszutek Wilk wrote:
 On Tue, Jun 17, 2014 at 04:51:57PM -0400, Waiman Long wrote:
  On 06/17/2014 04:36 PM, Konrad Rzeszutek Wilk wrote:
  On Sun, Jun 15, 2014 at 02:47:00PM +0200, Peter Zijlstra wrote:
  Because the qspinlock needs to touch a second cacheline; add a pending
  bit and allow a single in-word spinner before we punt to the second
  cacheline.
  Could you add this in the description please:
  
  And by second cacheline we mean the local 'node'. That is the:
  mcs_nodes[0] and mcs_nodes[idx]
  
  Perhaps it might be better then to split this in the header file
  as this is trying to not be a slowpath code - but rather - a
  pre-slow-path-lets-try-if-we can do another cmpxchg in case
  the unlocker has just unlocked itself.
  
  So something like:
  
  diff --git a/include/asm-generic/qspinlock.h 
  b/include/asm-generic/qspinlock.h
  index e8a7ae8..29cc9c7 100644
  --- a/include/asm-generic/qspinlock.h
  +++ b/include/asm-generic/qspinlock.h
  @@ -75,11 +75,21 @@ extern void queue_spin_lock_slowpath(struct qspinlock 
  *lock, u32 val);
 */
static __always_inline void queue_spin_lock(struct qspinlock *lock)
{
  -  u32 val;
  +  u32 val, new;
  
  val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
 if (likely(val == 0))
 return;
  +
  +  /* One more attempt - but if we fail mark it as pending. */
  +  if (val == _Q_LOCKED_VAL) {
   +  new = _Q_LOCKED_VAL | _Q_PENDING_VAL;
   +
   +  old = atomic_cmpxchg(&lock->val, val, new);
  +  if (old == _Q_LOCKED_VAL) /* YEEY! */
  +  return;
  
   No, it can't leave like that. The unlock path will not clear the pending bit.
 
 Err, you are right. It needs to go back in the slowpath.

What I should have wrote is:

if (old == 0) /* YEEY */
  return;

As that would be the same thing as this patch does with the pending bit - that
is, if on the second compare and exchange we can set the pending bit (and the
lock) and the lock has been released - we are good.

And it is a quick path.

 
  We are trying to make the fastpath as simple as possible as it may be
  inlined. The complexity of the queue spinlock is in the slowpath.
 
 Sure, but then it shouldn't be called slowpath anymore as it is not
 slow. It is a combination of fast path (the potential chance of
 grabbing the lock and setting the pending lock) and the real slow
 path (the queuing). Perhaps it should be called 'queue_spinlock_complex' ?
 

I forgot to mention - that was the crux of my comments - just change
the 'slowpath' name to 'complex' at that point to better reflect what
it does.

  
  Moreover, a cmpxchg immediately followed by another cmpxchg will
  just increase the level of memory contention when a lock is fairly
  contended. The chance of the second cmpxchg() succeeding will be pretty low.
 
  Then why even do the pending bit - which is what the slowpath does
  the first time. And if it grabs it (and sets the pending bit) it
  immediately exits. Why not percolate that piece of code into this header.
  
  And then leave all that slow code (queuing, mcs_lock access, etc.) in the
  slowpath.
 
  
  -Longman
  
  


Re: [PATCH 03/11] qspinlock: Add pending bit

2014-06-17 Thread Konrad Rzeszutek Wilk

On Jun 17, 2014 6:25 PM, Waiman Long waiman.l...@hp.com wrote:

 On 06/17/2014 05:10 PM, Konrad Rzeszutek Wilk wrote: 
  On Tue, Jun 17, 2014 at 05:07:29PM -0400, Konrad Rzeszutek Wilk wrote: 
  On Tue, Jun 17, 2014 at 04:51:57PM -0400, Waiman Long wrote: 
  On 06/17/2014 04:36 PM, Konrad Rzeszutek Wilk wrote: 
  On Sun, Jun 15, 2014 at 02:47:00PM +0200, Peter Zijlstra wrote: 
  Because the qspinlock needs to touch a second cacheline; add a pending 
  bit and allow a single in-word spinner before we punt to the second 
  cacheline. 
  Could you add this in the description please: 
  
  And by second cacheline we mean the local 'node'. That is the: 
  mcs_nodes[0] and mcs_nodes[idx] 
  
  Perhaps it might be better then to split this in the header file 
  as this is trying to not be a slowpath code - but rather - a 
  pre-slow-path-lets-try-if-we can do another cmpxchg in case 
  the unlocker has just unlocked itself. 
  
  So something like: 
  
  diff --git a/include/asm-generic/qspinlock.h 
  b/include/asm-generic/qspinlock.h 
  index e8a7ae8..29cc9c7 100644 
  --- a/include/asm-generic/qspinlock.h 
  +++ b/include/asm-generic/qspinlock.h 
  @@ -75,11 +75,21 @@ extern void queue_spin_lock_slowpath(struct 
  qspinlock *lock, u32 val); 
     */ 
    static __always_inline void queue_spin_lock(struct qspinlock *lock) 
    { 
  - u32 val; 
  + u32 val, new; 
  
     val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL); 
    if (likely(val == 0)) 
    return; 
  + 
  + /* One more attempt - but if we fail mark it as pending. */ 
  + if (val == _Q_LOCKED_VAL) { 
   + new = _Q_LOCKED_VAL | _Q_PENDING_VAL; 
   + 
   + old = atomic_cmpxchg(&lock->val, val, new); 
  + if (old == _Q_LOCKED_VAL) /* YEEY! */ 
  + return; 
   No, it can't leave like that. The unlock path will not clear the pending 
   bit. 
  Err, you are right. It needs to go back in the slowpath. 
  What I should have wrote is: 
  
  if (old == 0) /* YEEY */ 
     return; 

 Unfortunately, that still doesn't work. If old is 0, it just meant the 
  cmpxchg failed. It still hasn't got the lock. 
   As that would be the same thing as this patch does with the pending bit - that 
   is, if on the second compare and exchange we can set the pending bit (and the 
   lock) and the lock has been released - we are good. 

 That is not true. When the lock is freed, the pending bit holder will 
 still have to clear the pending bit and set the lock bit as is done in 
 the slowpath. We cannot skip the step here. The problem of moving the 
 pending code here is that it includes a wait loop which we don't want to 
 put in the fastpath. 
  
  And it is a quick path. 
  
  We are trying to make the fastpath as simple as possible as it may be 
  inlined. The complexity of the queue spinlock is in the slowpath. 
  Sure, but then it shouldn't be called slowpath anymore as it is not 
  slow. It is a combination of fast path (the potential chance of 
  grabbing the lock and setting the pending lock) and the real slow 
  path (the queuing). Perhaps it should be called 'queue_spinlock_complex' ? 
  
  I forgot to mention - that was the crux of my comments - just change 
  the slowpath to complex name at that point to better reflect what 
  it does. 

 Actually in my v11 patch, I subdivided the slowpath into a slowpath for 
 the pending code and slowerpath for actual queuing. Perhaps, we could 
 use quickpath and slowpath instead. Anyway, it is a minor detail that we 
  can discuss after the core code gets merged.

 -Longman

Why not do it the right way the first time around?

That aside - these optimizations seem to make the code harder to read. And 
they do remind me of the scheduler code in 2.6.x which was based on heuristics 
- and eventually got ripped out.

So are these optimizations based on turning off certain hardware features? Say 
hardware prefetching?

What I am getting at - can the hardware do this at some point (or perhaps 
already does on IvyBridge-EX?) - that is, prefetch the per-cpu areas so they are 
always hot, thus rendering this optimization unnecessary?
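(For comparison, the software-side version of that idea would be an explicit
prefetch of the per-cpu node before it is needed - a rough sketch, assuming
the mcs_nodes per-cpu array from patch #1:)

	/* hypothetical: warm up this CPU's MCS node ahead of queueing */
	prefetchw(this_cpu_ptr(&mcs_nodes[0]));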

Thanks!


Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-16 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:46:58PM +0200, Peter Zijlstra wrote:
 From: Waiman Long waiman.l...@hp.com
 
 This patch introduces a new generic queue spinlock implementation that
 can serve as an alternative to the default ticket spinlock. Compared
 with the ticket spinlock, this queue spinlock should be almost as fair
 as the ticket spinlock. It has about the same speed in single-thread
 and it can be much faster in high contention situations especially when
 the spinlock is embedded within the data structure to be protected.
 
 Only in light to moderate contention where the average queue depth
 is around 1-3 will this queue spinlock be potentially a bit slower
 due to the higher slowpath overhead.
 
 This queue spinlock is especially suit to NUMA machines with a large
 number of cores as the chance of spinlock contention is much higher
 in those machines. The cost of contention is also higher because of
 slower inter-node memory traffic.
 
 Due to the fact that spinlocks are acquired with preemption disabled,
 the process will not be migrated to another CPU while it is trying
 to get a spinlock. Ignoring interrupt handling, a CPU can only be
 contending in one spinlock at any one time. Counting soft IRQ, hard
 IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
 activities.  By allocating a set of per-cpu queue nodes and used them
 to form a waiting queue, we can encode the queue node address into a
 much smaller 24-bit size (including CPU number and queue node index)
 leaving one byte for the lock.
 
 Please note that the queue node is only needed when waiting for the
 lock. Once the lock is acquired, the queue node can be released to
 be used later.
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 Signed-off-by: Peter Zijlstra pet...@infradead.org

Thank you for the repost. I have some questions about the implementation
that hopefully will be easy to answer; I hope the answers can then
be added to the code to enlighten other folks.

See below.
.. snip..

 Index: linux-2.6/kernel/locking/mcs_spinlock.h
 ===
 --- linux-2.6.orig/kernel/locking/mcs_spinlock.h
 +++ linux-2.6/kernel/locking/mcs_spinlock.h
 @@ -17,6 +17,7 @@
  struct mcs_spinlock {
   struct mcs_spinlock *next;
   int locked; /* 1 if lock acquired */
 + int count;

This could use a comment.

  };
  
  #ifndef arch_mcs_spin_lock_contended
 Index: linux-2.6/kernel/locking/qspinlock.c
 ===
 --- /dev/null
 +++ linux-2.6/kernel/locking/qspinlock.c
 @@ -0,0 +1,197 @@
 +/*
 + * Queue spinlock
 + *
 + * This program is free software; you can redistribute it and/or modify
 + * it under the terms of the GNU General Public License as published by
 + * the Free Software Foundation; either version 2 of the License, or
 + * (at your option) any later version.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 + * GNU General Public License for more details.
 + *
 + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
 + *
 + * Authors: Waiman Long waiman.l...@hp.com
 + *  Peter Zijlstra pzijl...@redhat.com
 + */
 +#include <linux/smp.h>
 +#include <linux/bug.h>
 +#include <linux/cpumask.h>
 +#include <linux/percpu.h>
 +#include <linux/hardirq.h>
 +#include <linux/mutex.h>
 +#include <asm/qspinlock.h>
 +
 +/*
 + * The basic principle of a queue-based spinlock can best be understood
 + * by studying a classic queue-based spinlock implementation called the
 + * MCS lock. The paper below provides a good description for this kind
 + * of lock.
 + *
 + * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
 + *
 + * This queue spinlock implementation is based on the MCS lock, however to 
 make
 + * it fit the 4 bytes we assume spinlock_t to be, and preserve its existing
 + * API, we must modify it some.
 + *
 + * In particular; where the traditional MCS lock consists of a tail pointer
 + * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to
 + * unlock the next pending (next->locked), we compress both these: {tail,
 + * next->locked} into a single u32 value.
 + *
 + * Since a spinlock disables recursion of its own context and there is a 
 limit
 + * to the contexts that can nest; namely: task, softirq, hardirq, nmi, we can
 + * encode the tail as an index indicating this context and a cpu number.
 + *
 + * We can further change the first spinner to spin on a bit in the lock word
 + * instead of its node; whereby avoiding the need to carry a node from lock 
 to
 + * unlock, and preserving API.
 + */
 +
 +#include "mcs_spinlock.h"
 +
 +/*
 + * Per-CPU queue node structures; we can never have more than 4 nested
 + * contexts: task, softirq, hardirq, nmi.
 + *
 + * Exactly fits one cacheline.
 + */
 +static 

Re: [PATCH 00/11] qspinlock with paravirt support

2014-06-16 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:46:57PM +0200, Peter Zijlstra wrote:
 Since Waiman seems incapable of doing simple things; here's my take on the
 paravirt crap.
 
 The first few patches are taken from Waiman's latest series, but the virt
 support is completely new. Its primary aim is to not mess up the native code.

OK. I finally cleared some time to look over this and am reading the code
in detail to make sure I have it clear in mind. I will most likely ask
some questions that are naive - hopefully they will lead to the code being
self-explanatory for anybody else taking a stab at understanding them when
bugs appear.
 
 I've not stress tested it, but the virt and paravirt (kvm) cases boot on 
 simple
 smp guests. I've not done Xen, but the patch should be simple and similar.

Looking forward to seeing it. Glancing over the KVM one and comparing it
to the original version that Waiman posted, it should be fairly simple. Perhaps
even some of the code could be shared?

 
 I ripped out all the unfair nonsense as its not at all required for paravirt
 and optimizations that make paravirt better at the cost of code clarity and/or
 native performance are just not worth it.
 
 Also; if we were to ever add some of that unfair nonsense you do so _after_ 
 you
 got the simple things working.
 
 The thing I'm least sure about is the head tracking, I chose to do something
 different from what Waiman did, because his is O(nr_cpus) and had the
 assumption that guests have small nr_cpus. AFAIK this is not at all true. The
 biggest problem I have with what I did is that it contains wait loops itself.
 
 
 


Re: [PATCH v2] PCI: Introduce new device binding path using pci_dev.driver_override

2014-05-16 Thread Konrad Rzeszutek Wilk
On Fri, May 16, 2014 at 10:48:00AM -0400, Konrad Rzeszutek Wilk wrote:
 On Fri, May 9, 2014 at 12:50 PM, Alex Williamson alex.william...@redhat.com
  wrote:
 
  The driver_override field allows us to specify the driver for a device
  ...
 
 ...
 
  Signed-off-by: Alex Williamson alex.william...@redhat.com
  Cc: Greg Kroah-Hartman gre...@linuxfoundation.org
 
 
 Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com

And somehow my emailer (Google) decided to add yucky HTML crud
to the end.

So here is an nice plain email so that it can go the the right
mailing lists - and sorry for the extra emails to those on the
'To' and 'CC'!
 ___
 iommu mailing list
 io...@lists.linux-foundation.org
 https://lists.linuxfoundation.org/mailman/listinfo/iommu



Re: [PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support

2014-05-07 Thread Konrad Rzeszutek Wilk
On Wed, May 07, 2014 at 11:01:28AM -0400, Waiman Long wrote:
 v9->v10:
   - Make some minor changes to qspinlock.c to accommodate review feedback.
   - Change author to PeterZ for 2 of the patches.
   - Include Raghavendra KT's test results in patch 18.

Any chance you can post these on a git tree? Thanks.


Re: [PATCH v10 18/19] pvqspinlock, x86: Enable PV qspinlock PV for KVM

2014-05-07 Thread Konrad Rzeszutek Wilk
 Raghavendra KT had done some performance testing on this patch with
 the following results:
 
 Overall we are seeing good improvement for pv-unfair version.
 
 System: 32 cpu sandybridge with HT on (4 node with 32 GB each)
 Guest : 8GB with 16 vcpu/VM.
 Average was taken over 8-10 data points.
 
 Base = 3.15-rc2  with PRAVIRT_SPINLOCK = y
 
 A = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
 PRAVIRT_SPINLOCK = y PARAVIRT_UNFAIR_LOCKS = y (unfair lock)
 
 B =  3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
 PRAVIRT_SPINLOCK = n PARAVIRT_UNFAIR_LOCKS = n
 (queue spinlock without paravirt)
 
 C = 3.15-rc2 + qspinlock v9 patch with  QUEUE_SPINLOCK = y
 PRAVIRT_SPINLOCK = y  PARAVIRT_UNFAIR_LOCKS = n
 (queue spinlock with paravirt)

Could you do s/PRAVIRT/PARAVIRT/ please?

 
 Ebizzy %improvements
 
  overcommit        A          B          C
  0.5x           4.4265     2.0611     1.5824
  1.0x           0.9015    -7.7828     4.5443
  1.5x          46.1162    -2.9845    -3.5046
  2.0x          99.8150    -2.7116     4.7461

Considering B sucks
 
 Dbench %improvements
 
  overcommit        A          B          C
  0.5x           3.2617     3.5436     2.5676
  1.0x           0.6302     2.2342     5.2201
  1.5x           5.0027     4.8275     3.8375
  2.0x          23.8242     4.5782    12.6067
 
 Absolute values of base results: (overcommit, value, stdev)
 Ebizzy ( records / sec with 120 sec run)
 0.5x 20941.8750 (2%)
 1.0x 17623.8750 (5%)
 1.5x  5874.7778 (15%)
 2.0x  3581.8750 (7%)
 
 Dbench (throughput in MB/sec)
 0.5x 10009.6610 (5%)
 1.0x  6583.0538 (1%)
 1.5x  3991.9622 (4%)
 2.0x  2527.0613 (2.5%)
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 Tested-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
  arch/x86/kernel/kvm.c |  135 
 +
  kernel/Kconfig.locks  |2 +-
  2 files changed, 136 insertions(+), 1 deletions(-)
 
 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
 index 7ab8ab3..eef427b 100644
 --- a/arch/x86/kernel/kvm.c
 +++ b/arch/x86/kernel/kvm.c
 @@ -567,6 +567,7 @@ static void kvm_kick_cpu(int cpu)
   kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
  }
  
 +#ifndef CONFIG_QUEUE_SPINLOCK
  enum kvm_contention_stat {
   TAKEN_SLOW,
   TAKEN_SLOW_PICKUP,
 @@ -794,6 +795,134 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, 
 __ticket_t ticket)
   }
   }
  }
 +#else /* !CONFIG_QUEUE_SPINLOCK */
 +
 +#ifdef CONFIG_KVM_DEBUG_FS
 +static struct dentry *d_spin_debug;
 +static struct dentry *d_kvm_debug;
 +static u32 kick_nohlt_stats; /* Kick but not halt count  */
 +static u32 halt_qhead_stats; /* Queue head halting count */
 +static u32 halt_qnode_stats; /* Queue node halting count */
 +static u32 halt_abort_stats; /* Halting abort count  */
 +static u32 wake_kick_stats;  /* Wakeup by kicking count  */
 +static u32 wake_spur_stats;  /* Spurious wakeup count*/
 +static u64 time_blocked; /* Total blocking time  */
 +
 +static int __init kvm_spinlock_debugfs(void)
 +{
 + d_kvm_debug = debugfs_create_dir("kvm-guest", NULL);
 + if (!d_kvm_debug) {
 + printk(KERN_WARNING
 +"Could not create 'kvm' debugfs directory\n");
 + return -ENOMEM;
 + }
 + d_spin_debug = debugfs_create_dir("spinlocks", d_kvm_debug);
 +
 + debugfs_create_u32("kick_nohlt_stats",
 +0644, d_spin_debug, &kick_nohlt_stats);
 + debugfs_create_u32("halt_qhead_stats",
 +0644, d_spin_debug, &halt_qhead_stats);
 + debugfs_create_u32("halt_qnode_stats",
 +0644, d_spin_debug, &halt_qnode_stats);
 + debugfs_create_u32("halt_abort_stats",
 +0644, d_spin_debug, &halt_abort_stats);
 + debugfs_create_u32("wake_kick_stats",
 +0644, d_spin_debug, &wake_kick_stats);
 + debugfs_create_u32("wake_spur_stats",
 +0644, d_spin_debug, &wake_spur_stats);
 + debugfs_create_u64("time_blocked",
 +0644, d_spin_debug, &time_blocked);
 + return 0;
 +}
 +
 +static inline void kvm_halt_stats(enum pv_lock_stats type)
 +{
 + if (type == PV_HALT_QHEAD)
 + add_smp(&halt_qhead_stats, 1);
 + else if (type == PV_HALT_QNODE)
 + add_smp(&halt_qnode_stats, 1);
 + else /* type == PV_HALT_ABORT */
 + add_smp(&halt_abort_stats, 1);
 +}
 +
 +static inline void kvm_lock_stats(enum pv_lock_stats type)
 +{
 + if (type == PV_WAKE_KICKED)
 + add_smp(&wake_kick_stats, 1);
 + else if (type == PV_WAKE_SPURIOUS)
 + add_smp(&wake_spur_stats, 1);
 + else /* type == PV_KICK_NOHALT */
 + add_smp(&kick_nohlt_stats, 1);
 +}
 +
 +static inline u64 spin_time_start(void)
 +{
 + return sched_clock();
 +}
 +
 +static inline void spin_time_accum_blocked(u64 start)
 +{
 + u64 delta;
 +
 + 

Re: [PATCH v9 05/19] qspinlock: Optimize for smaller NR_CPUS

2014-04-23 Thread Konrad Rzeszutek Wilk
On Wed, Apr 23, 2014 at 10:23:43AM -0400, Waiman Long wrote:
 On 04/18/2014 05:40 PM, Waiman Long wrote:
 On 04/18/2014 03:05 PM, Peter Zijlstra wrote:
 On Fri, Apr 18, 2014 at 01:52:50PM -0400, Waiman Long wrote:
 I am confused by your notation.
 Nah, I think I was confused :-) Make the 1 _Q_LOCKED_VAL though, as
 that's the proper constant to use.
 
 Everyone gets confused once in a while:-) I have plenty of that myself.
 
 I will change 1 to _Q_LOCKED_VAL as suggested.
 
 -Longman
 
 
 The attached patch file contains the additional changes that I had
 made to qspinlock.c file so far. Please let me know if you or others
 have any additional feedbacks or changes that will need to go to the
 next version of the patch series.
 
 I am going to take vacation starting from tomorrow and will be back
 on 5/5 (Mon). So I will not be able to respond to emails within this
 period.
 
 BTW, is there any chance that this patch can be merged to 3.16?

Um, it needs to have Acks from KVM and Xen maintainers who have not
done so. Also Peter needs to chime in. (BTW, please CC
xen-de...@lists.xenproject.org next time you post so that David and Boris
can take a peek at it).

I would strongly recommend you put all your patches on github (free git
service) so that we can test it and poke it at during your vacation
(and even after).

 
 -Longman

 diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
 index be2adca..2e184b8 100644
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -25,10 +25,6 @@
   #include <asm/byteorder.h>
   #include <asm/qspinlock.h>
  
  -#if !defined(__LITTLE_ENDIAN) && !defined(__BIG_ENDIAN)
 -#error Missing either LITTLE_ENDIAN or BIG_ENDIAN definition.
 -#endif
 -
  /*
   * The basic principle of a queue-based spinlock can best be understood
   * by studying a classic queue-based spinlock implementation called the
 @@ -200,7 +196,7 @@ clear_pending_set_locked(struct qspinlock *lock, u32 val)
  {
   struct __qspinlock *l = (void *)lock;
  
  - ACCESS_ONCE(l->locked_pending) = 1;
  + ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL;
  }
  
  /*
 @@ -567,16 +563,16 @@ static __always_inline int get_qlock(struct qspinlock 
 *lock)
  /**
   * trylock_pending - try to acquire queue spinlock using the pending bit
   * @lock : Pointer to queue spinlock structure
 - * @pval : Pointer to value of the queue spinlock 32-bit word
 + * @val  : Current value of the queue spinlock 32-bit word
   * Return: 1 if lock acquired, 0 otherwise
   *
   * The pending bit won't be set as soon as one or more tasks queue up.
   * This function should only be called when lock stealing will not happen.
   * Otherwise, it has to be disabled.
   */
 -static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 +static inline int trylock_pending(struct qspinlock *lock, u32 val)
  {
 - u32 old, new, val = *pval;
 + u32 old, new;
   int retry = 1;
  
   /*
 @@ -593,8 +589,7 @@ static inline int trylock_pending(struct qspinlock *lock, 
 u32 *pval)
   if (val & _Q_TAIL_MASK)
   return 0;
  
  - if ((val & _Q_LOCKED_PENDING_MASK) ==
 - (_Q_LOCKED_VAL|_Q_PENDING_VAL)) {
 + if (val == (_Q_LOCKED_VAL|_Q_PENDING_VAL)) {
   /*
* If both the lock and pending bits are set, we wait
* a while to see if that either bit will be cleared.
 @@ -605,9 +600,9 @@ static inline int trylock_pending(struct qspinlock *lock, 
 u32 *pval)
   retry--;
   cpu_relax();
   cpu_relax();
  - *pval = val = atomic_read(&lock->val);
  + val = atomic_read(&lock->val);
   continue;
  - } else if ((val & _Q_LOCKED_PENDING_MASK) == _Q_PENDING_VAL) {
 + } else if (val == _Q_PENDING_VAL) {
   /*
* Pending bit is set, but not the lock bit.
* Assuming that the pending bit holder is going to
 @@ -615,7 +610,7 @@ static inline int trylock_pending(struct qspinlock *lock, 
 u32 *pval)
* it is better to wait than to exit at this point.
*/
   cpu_relax();
 - *pval = val = atomic_read(&lock->val);
 + val = atomic_read(&lock->val);
   continue;
   }
  
 @@ -627,7 +622,7 @@ static inline int trylock_pending(struct qspinlock *lock, 
 u32 *pval)
   if (old == val)
   break;
  
 - *pval = val = old;
 + val = old;
   }
  
   /*
 @@ -643,7 +638,7 @@ static inline int trylock_pending(struct qspinlock *lock, 
 u32 *pval)
*
* this wait loop must be a load-acquire such that we match the
* store-release that clears the locked bit and create lock
 -  * sequentiality; this because 
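
For readers skimming the archive, a compact userspace sketch of the pending-bit fast path that the hunks above rework may help. It is an illustration only, written against C11 atomics rather than the kernel's primitives; the bit layout and the toy_* names are assumptions, not the kernel code:

/*
 * Toy model of the qspinlock pending-bit fast path: bit 0 is the locked
 * bit, bit 8 the pending bit, and the high bits would hold the MCS tail
 * in the real lock.  Illustrative only -- not the kernel implementation.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_LOCKED	(1u << 0)
#define TOY_PENDING	(1u << 8)
#define TOY_TAIL_MASK	(~(TOY_LOCKED | TOY_PENDING))

struct toy_qspinlock {
	_Atomic uint32_t val;
};

/* Try to become the single in-word spinner; true means lock acquired. */
static bool toy_trylock_pending(struct toy_qspinlock *lock)
{
	uint32_t val = atomic_load(&lock->val);
	uint32_t old, new;

	for (;;) {
		if (val & TOY_TAIL_MASK)
			return false;	/* queue already in use: take the slowpath */

		if (val == (TOY_LOCKED | TOY_PENDING) || val == TOY_PENDING) {
			/* word is full, or a pending owner is about to take
			 * the lock: wait for the word to settle */
			val = atomic_load(&lock->val);
			continue;
		}

		/* val is now 0 or TOY_LOCKED: grab the lock or the pending bit */
		new = val ? (val | TOY_PENDING) : TOY_LOCKED;

		old = val;
		if (atomic_compare_exchange_strong(&lock->val, &old, new))
			break;
		val = old;		/* lost the race, re-evaluate */
	}

	if (new == TOY_LOCKED)
		return true;		/* lock was free, we now own it */

	/* We own the pending bit: wait for the holder to release... */
	while (atomic_load(&lock->val) & TOY_LOCKED)
		;			/* cpu_relax() in the kernel */

	/* ...then set locked and clear pending, leaving any tail bits alone
	 * (the kernel does this with a single halfword store in
	 * clear_pending_set_locked()). */
	old = atomic_load(&lock->val);
	while (!atomic_compare_exchange_strong(&lock->val, &old,
					       (old & ~TOY_PENDING) | TOY_LOCKED))
		;
	return true;
}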

Re: [PATCH v9 05/19] qspinlock: Optimize for smaller NR_CPUS

2014-04-23 Thread Konrad Rzeszutek Wilk
On Wed, Apr 23, 2014 at 01:43:58PM -0400, Waiman Long wrote:
 On 04/23/2014 10:56 AM, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 23, 2014 at 10:23:43AM -0400, Waiman Long wrote:
 On 04/18/2014 05:40 PM, Waiman Long wrote:
 On 04/18/2014 03:05 PM, Peter Zijlstra wrote:
 On Fri, Apr 18, 2014 at 01:52:50PM -0400, Waiman Long wrote:
 I am confused by your notation.
 Nah, I think I was confused :-) Make the 1 _Q_LOCKED_VAL though, as
 that's the proper constant to use.
 Everyone gets confused once in a while:-) I have plenty of that myself.
 
 I will change 1 to _Q_LOCKED_VAL as suggested.
 
 -Longman
 
 The attached patch file contains the additional changes that I had
 made to qspinlock.c file so far. Please let me know if you or others
 have any additional feedbacks or changes that will need to go to the
 next version of the patch series.
 
 I am going to take vacation starting from tomorrow and will be back
 on 5/5 (Mon). So I will not be able to respond to emails within this
 period.
 
 BTW, is there any chance that this patch can be merged to 3.16?
 Um, it needs to have Acks from KVM and Xen maintainers who have not
 done so. Also Peter needs to chime in. (BTW, please CC
 xen-de...@lists.xenproject.org next time you post so that David and Boris
 can take a peek at it).
 
 I will cc xen-de...@lists.xenproject.org when I sent out the next patch.
 
 I would strongly recommend you put all your patches on github (free git
 service) so that we can test it and poke it at during your vacation
 (and even after).
 
 
 I am not used to setting up a public repo in github. If I create a
 repo there, should I put a snapshot of the whole kernel source tree
 or just a portion of the relevant files as the base? With the later,
 it won't be buildable.

You just push your local branch. It should look like a normal
Linux tree with your commits on top.

 
 -Longman
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 00/19] qspinlock: a 4-byte queue spinlock with PV support

2014-04-18 Thread Konrad Rzeszutek Wilk
On Thu, Apr 17, 2014 at 09:48:36PM -0400, Waiman Long wrote:
 On 04/17/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
 On Thu, Apr 17, 2014 at 11:03:52AM -0400, Waiman Long wrote:
 v8-v9:
- Integrate PeterZ's version of the queue spinlock patch with some
  modification:
  http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
- Break the more complex patches into smaller ones to ease review effort.
- Fix a racing condition in the PV qspinlock code.
 I am not seeing anything mentioning that the overcommit scenario
 for KVM and Xen had been fixed. Or was the 'racing condition' said
 issue?
 
 Thanks.
 
 The hanging is caused by a racing condition which should be fixed in
 the v9 patch. Please let me know if you are still seeing it.

OK, is there a git tree with these patches to easily slurp them up?


Thanks!
 
 -Longman
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 03/19] qspinlock: Add pending bit

2014-04-18 Thread Konrad Rzeszutek Wilk
On Fri, Apr 18, 2014 at 12:23:29PM -0400, Waiman Long wrote:
 On 04/18/2014 03:42 AM, Ingo Molnar wrote:
 * Waiman Longwaiman.l...@hp.com  wrote:
 
 Because the qspinlock needs to touch a second cacheline; add a pending
 bit and allow a single in-word spinner before we punt to the second
 cacheline.
 
 Signed-off-by: Peter Zijlstrapet...@infradead.org
 Signed-off-by: Waiman Longwaiman.l...@hp.com
 This patch should have a From: Peter in it as well, right?
 
 Thanks,
 
  Ingo
 
 Do you mean a From: line in the mail header? It will be a bit hard
 to have different From: header in the same patch series. I can
 certainly do that if there is an easy way to do it.

It is pretty easy.

Just do 'git commit --amend --author "The Right Author"' and when
you send the patches (git send-email) it will include that.

 
 -Longman
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 00/19] qspinlock: a 4-byte queue spinlock with PV support

2014-04-17 Thread Konrad Rzeszutek Wilk
On Thu, Apr 17, 2014 at 11:03:52AM -0400, Waiman Long wrote:
 v8-v9:
   - Integrate PeterZ's version of the queue spinlock patch with some
 modification:
 http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
   - Break the more complex patches into smaller ones to ease review effort.
   - Fix a racing condition in the PV qspinlock code.

I am not seeing anything mentioning that the overcommit scenario
for KVM and Xen had been fixed. Or was the 'racing condition' said
issue?

Thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation

2014-04-07 Thread Konrad Rzeszutek Wilk
On Mon, Apr 07, 2014 at 04:12:58PM +0200, Peter Zijlstra wrote:
 On Fri, Apr 04, 2014 at 12:57:27PM -0400, Konrad Rzeszutek Wilk wrote:
  On Fri, Apr 04, 2014 at 03:00:12PM +0200, Peter Zijlstra wrote:
   
   So I'm just not ever going to pick up this patch; I spend a week trying
   to reverse engineer this; I posted a 7 patch series creating the
   equivalent, but in a gradual and readable fashion:
   
 http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
   
   You keep on ignoring that; I'll keep on ignoring your patches.
   
   I might at some point rewrite some of your pv stuff on top to get this
   moving again, but I'm not really motivated to work with you atm.
  
  Uh? Did you CC also xen-de...@lists.xenproject.org on your patches Peter?
  I hadn't had a chance to see or comment on them :-(
 
 No of course not :-)
 
 Also as noted elsewhere, I didn't actually do any PV muck yet. I spend
 the time trying to get my head around patch 1; all the while Waiman kept
 piling more and more on top.

Ah, I see. Looking forward to seeing your 'muck' code then :-)

Thanks!
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

2014-04-04 Thread Konrad Rzeszutek Wilk
On Wed, Apr 02, 2014 at 10:32:01AM -0400, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
  N.B. Sorry for the duplicate. This patch series were resent as the
   original one was rejected by the vger.kernel.org list server
   due to long header. There is no change in content.
  
  v7-v8:
- Remove one unneeded atomic operation from the slowpath, thus
  improving performance.
- Simplify some of the codes and add more comments.
- Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
  unfair lock.
- Reduce unfair lock slowpath lock stealing frequency depending
  on its distance from the queue head.
- Add performance data for IvyBridge-EX CPU.
 
 FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
 HVM guest under Xen after a while stops working. The workload
 is doing 'make -j32' on the Linux kernel.
 
 Completely unresponsive. Thoughts?

Each VCPU seems to be stuck with this stack trace:

rip: 810013a8 xen_hypercall_sched_op+0x8
flags: 0002 nz
rsp: 88029f13fb98
rax:    rcx: fffa   rdx: 
rbx:    rsi: 88029f13fba8   rdi: 0003
rbp: 88029f13fbd0r8: 8807ee65a1c0r9: 88080d800b10
r10: 48cb   r11:    r12: 0013
r13: 0004   r14: 0001   r15: ea00076a8cd0
 cs: 0010ss: ds: es: 
 fs:  @ 2b24c3e7e380
 gs:  @ 88080e20/
Code (instr addr 810013a8)
cc cc cc cc cc cc cc cc cc cc cc cc cc b8 1d 00 00 00 0f 01 c1 c3 cc cc cc cc 
cc cc cc cc cc cc


Stack:
 81352d9e 00299f13fbb0 88029f13fba4 88020001
  88029f13fbd0 0045 88029f13fbe0
 81354240 88029f13fc00 81012cb6 88080f4da200
 88080e214b00 88029f13fc48 815e4631 

Call Trace:
  [810013a8] xen_hypercall_sched_op+0x8  --
  [81352d9e] xen_poll_irq_timeout+0x3e
  [81354240] xen_poll_irq+0x10
  [81012cb6] xen_hibernate+0x46
  [815e4631] queue_spin_lock_slowerpath+0x84
  [810ab96e] queue_spin_lock_slowpath+0xee
  [815eff8f] _raw_spin_lock_irqsave+0x3f
  [81144e4d] pagevec_lru_move_fn+0x8d
  [81144780] __pagevec_lru_add_fn
  [81144ed7] __pagevec_lru_add+0x17
  [81145540] __lru_cache_add+0x60
  [8114590e] lru_cache_add+0xe
  [8116d4ba] page_add_new_anon_rmap+0xda
  [81162ab1] handle_mm_fault+0xaa1
  [81169d42] mmap_region+0x2c2
  [815f3c4d] __do_page_fault+0x18d
  [811544e1] vm_mmap_pgoff+0xb1
  [815f3fdb] do_page_fault+0x2b
  [815f06c8] page_fault+0x28
rip: 810013a8 xen_hypercall_sched_op+0x8


 
 (CC ing Marcos who had run the test)
  
  v6-v7:
- Remove an atomic operation from the 2-task contending code
- Shorten the names of some macros
- Make the queue waiter to attempt to steal lock when unfair lock is
  enabled.
- Remove lock holder kick from the PV code and fix a race condition
 - Run the unfair lock & PV code on overcommitted KVM guests to collect
  performance data.
  
  v5-v6:
   - Change the optimized 2-task contending code to make it fairer at the
 expense of a bit of performance.
   - Add a patch to support unfair queue spinlock for Xen.
   - Modify the PV qspinlock code to follow what was done in the PV
 ticketlock.
   - Add performance data for the unfair lock as well as the PV
 support code.
  
  v4-v5:
   - Move the optimized 2-task contending code to the generic file to
 enable more architectures to use it without code duplication.
   - Address some of the style-related comments by PeterZ.
   - Allow the use of unfair queue spinlock in a real para-virtualized
 execution environment.
   - Add para-virtualization support to the qspinlock code by ensuring
 that the lock holder and queue head stay alive as much as possible.
  
  v3-v4:
   - Remove debugging code and fix a configuration error
   - Simplify the qspinlock structure and streamline the code to make it
 perform a bit better
   - Add an x86 version of asm/qspinlock.h for holding x86 specific
 optimization.
   - Add an optimized x86 code path for 2 contending tasks to improve
 low contention performance.
  
  v2-v3:
   - Simplify the code by using numerous mode only without an unfair option.
   - Use the latest smp_load_acquire()/smp_store_release() barriers.
   - Move the queue spinlock code to kernel/locking.
   - Make the use of queue spinlock the default for x86-64 without user
 configuration.
   - Additional performance tuning.
  
  v1-v2:
   - Add some more comments to document what the code does.
   - Add a numerous CPU mode to support >= 16K CPUs
   - Add a configuration option to allow lock stealing which

Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

2014-04-04 Thread Konrad Rzeszutek Wilk
On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote:
 On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
 On 04/02/2014 04:35 PM, Waiman Long wrote:
 On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
 N.B. Sorry for the duplicate. This patch series were resent as the
   original one was rejected by the vger.kernel.org list server
   due to long header. There is no change in content.
 
 v7-v8:
- Remove one unneeded atomic operation from the slowpath, thus
  improving performance.
- Simplify some of the codes and add more comments.
- Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
  unfair lock.
- Reduce unfair lock slowpath lock stealing frequency depending
  on its distance from the queue head.
- Add performance data for IvyBridge-EX CPU.
 FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
 HVM guest under Xen after a while stops working. The workload
 is doing 'make -j32' on the Linux kernel.
 
 Completely unresponsive. Thoughts?
 
 Thanks for reporting that. I haven't done that much testing on Xen.
 My focus was on KVM. I will perform more tests on Xen to see if I
 can reproduce the problem.
 
 BTW, does the halting and sending IPI mechanism work in HVM? I saw
 Yes.
 that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
 However, this piece of code isn't in upstream code. So I wonder if
 there is problem with that.
 The PV ticketlock fixed it for HVM. It was disabled before because
 the PV guests were using bytelocks while the HVM guests were using ticketlocks
 and you couldn't swap in PV bytelocks for ticketlocks during startup.
 
 The RHEL7 code has used PV ticketlock already. RHEL7 uses a single
 kernel for all configurations. So PV ticketlock as well as Xen and
 KVM support was compiled in. I think booting the kernel on bare
 metal will cause the Xen code to work in HVM mode thus activating
 the PV spinlock code which has a negative impact on performance.

Huh? -EPARSE

 That may be why it was disabled so that the bare metal performance
 will not be impacted.

I am not following you.
 
 BTW, could you send me more information about the configuration of
 the machine, like the .config file that you used?

Marcos, could you please send that information to Peter. Thanks!
 
 -Longman
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation

2014-04-04 Thread Konrad Rzeszutek Wilk
On Fri, Apr 04, 2014 at 03:00:12PM +0200, Peter Zijlstra wrote:
 
 So I'm just not ever going to pick up this patch; I spend a week trying
 to reverse engineer this; I posted a 7 patch series creating the
 equivalent, but in a gradual and readable fashion:
 
   http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
 
 You keep on ignoring that; I'll keep on ignoring your patches.
 
 I might at some point rewrite some of your pv stuff on top to get this
 moving again, but I'm not really motivated to work with you atm.

Uh? Did you CC also xen-de...@lists.xenproject.org on your patches Peter?
I hadn't had a chance to see or comment on them :-(

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

2014-04-04 Thread Konrad Rzeszutek Wilk
On Fri, Apr 04, 2014 at 01:13:17PM -0400, Waiman Long wrote:
 On 04/04/2014 12:55 PM, Konrad Rzeszutek Wilk wrote:
 On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote:
 On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
 On 04/02/2014 04:35 PM, Waiman Long wrote:
 On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
 N.B. Sorry for the duplicate. This patch series were resent as the
   original one was rejected by the vger.kernel.org list server
   due to long header. There is no change in content.
 
 v7-v8:
- Remove one unneeded atomic operation from the slowpath, thus
  improving performance.
- Simplify some of the codes and add more comments.
- Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
  unfair lock.
- Reduce unfair lock slowpath lock stealing frequency depending
  on its distance from the queue head.
- Add performance data for IvyBridge-EX CPU.
 FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
 HVM guest under Xen after a while stops working. The workload
 is doing 'make -j32' on the Linux kernel.
 
 Completely unresponsive. Thoughts?
 
 Thanks for reporting that. I haven't done that much testing on Xen.
 My focus was on KVM. I will perform more tests on Xen to see if I
 can reproduce the problem.
 
 BTW, does the halting and sending IPI mechanism work in HVM? I saw
 Yes.
 that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
 However, this piece of code isn't in upstream code. So I wonder if
 there is problem with that.
 The PV ticketlock fixed it for HVM. It was disabled before because
 the PV guests were using bytelocks while the HVM guests were using ticketlocks
 and you couldn't swap in PV bytelocks for ticketlocks during startup.
 The RHEL7 code has used PV ticketlock already. RHEL7 uses a single
 kernel for all configurations. So PV ticketlock as well as Xen and
 KVM support was compiled in. I think booting the kernel on bare
 metal will cause the Xen code to work in HVM mode thus activating
 the PV spinlock code which has a negative impact on performance.
 Huh? -EPARSE
 
 That may be why it was disabled so that the bare metal performance
 will not be impacted.
 I am not following you.
 
 What I am saying is that when XEN and PV spinlock is compiled into
 the current upstream kernel, the PV spinlock jump label is turned on
 when booted on bare metal. In other words, the PV spinlock code is

How does it turn it on? I see that the jump labels are only turned
on when it detects that it is running under Xen or KVM. It won't
turn them on on bare metal.

 active even when they are not needed and actually slow things down in
 that situation. This is a problem and we need to find a way to make
 sure that the PV spinlock code won't be activated on bare metal.

Could you explain to me which piece of code enables the jump labels
on baremetal please?
 
 -Longman
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

2014-04-04 Thread Konrad Rzeszutek Wilk
On Fri, Apr 04, 2014 at 01:58:15PM -0400, Konrad Rzeszutek Wilk wrote:
 On Fri, Apr 04, 2014 at 01:13:17PM -0400, Waiman Long wrote:
  On 04/04/2014 12:55 PM, Konrad Rzeszutek Wilk wrote:
  On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote:
  On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
  On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
  On 04/02/2014 04:35 PM, Waiman Long wrote:
  On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
  On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
  N.B. Sorry for the duplicate. This patch series were resent as the
original one was rejected by the vger.kernel.org list server
due to long header. There is no change in content.
  
  v7-v8:
 - Remove one unneeded atomic operation from the slowpath, thus
   improving performance.
 - Simplify some of the codes and add more comments.
 - Test for X86_FEATURE_HYPERVISOR CPU feature bit to 
   enable/disable
   unfair lock.
 - Reduce unfair lock slowpath lock stealing frequency depending
   on its distance from the queue head.
 - Add performance data for IvyBridge-EX CPU.
  FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
  HVM guest under Xen after a while stops working. The workload
  is doing 'make -j32' on the Linux kernel.
  
  Completely unresponsive. Thoughts?
  
  Thanks for reporting that. I haven't done that much testing on Xen.
  My focus was on KVM. I will perform more tests on Xen to see if I
  can reproduce the problem.
  
  BTW, does the halting and sending IPI mechanism work in HVM? I saw
  Yes.
  that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
  However, this piece of code isn't in upstream code. So I wonder if
  there is problem with that.
  The PV ticketlock fixed it for HVM. It was disabled before because
  the PV guests were using bytelocks while the HVM guests were using ticketlocks
  and you couldn't swap in PV bytelocks for ticketlocks during startup.
  The RHEL7 code has used PV ticketlock already. RHEL7 uses a single
  kernel for all configurations. So PV ticketlock as well as Xen and
  KVM support was compiled in. I think booting the kernel on bare
  metal will cause the Xen code to work in HVM mode thus activating
  the PV spinlock code which has a negative impact on performance.
  Huh? -EPARSE
  
  That may be why it was disabled so that the bare metal performance
  will not be impacted.
  I am not following you.
  
  What I am saying is that when XEN and PV spinlock is compiled into
  the current upstream kernel, the PV spinlock jump label is turned on
  when booted on bare metal. In other words, the PV spinlock code is
 
 How does it turn it on? I see that the jump labels are only turned
 on when it detects that it is running under Xen or KVM. It won't
 turn them on on bare metal.

Well, it seems that it does turn it on on bare metal, which is a stupid mistake.

Sending a patch shortly.
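
To make the intended fix concrete, the guard in question is of the form "only flip the paravirt ticketlock static key after confirming we are really running as a guest". A sketch along these lines, using kernel APIs of that era (kvm_para_available(), kvm_para_has_feature(), static_key_slow_inc()), is shown below; it is illustrative only and not necessarily the exact patch that was sent:

/*
 * Sketch of the bare-metal guard being discussed: enable the paravirt
 * ticketlock jump label only when running as a KVM guest that offers
 * PV unhalt.  Illustration, not the actual submitted patch.
 */
static __init int kvm_spinlock_init_jump(void)
{
	if (!kvm_para_available())	/* bare metal: leave the key off */
		return 0;
	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
		return 0;

	static_key_slow_inc(&paravirt_ticketlocks_enabled);
	return 0;
}
early_initcall(kvm_spinlock_init_jump);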
 
  active even when they are not needed and actually slow thing down in
  that situation. This is a problem and we need to find way to make
  sure that the PV spinlock code won't be activated on bare metal.
 
 Could you explain to me which piece of code enables the jump labels
 on baremetal please?
  
  -Longman
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

2014-04-03 Thread Konrad Rzeszutek Wilk
On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
 On 04/02/2014 04:35 PM, Waiman Long wrote:
 On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
 N.B. Sorry for the duplicate. This patch series were resent as the
   original one was rejected by the vger.kernel.org list server
   due to long header. There is no change in content.
 
 v7-v8:
- Remove one unneeded atomic operation from the slowpath, thus
  improving performance.
- Simplify some of the codes and add more comments.
- Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
  unfair lock.
- Reduce unfair lock slowpath lock stealing frequency depending
  on its distance from the queue head.
- Add performance data for IvyBridge-EX CPU.
 FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
 HVM guest under Xen after a while stops working. The workload
 is doing 'make -j32' on the Linux kernel.
 
 Completely unresponsive. Thoughts?
 
 
 Thanks for reporting that. I haven't done that much testing on Xen.
 My focus was on KVM. I will perform more tests on Xen to see if I
 can reproduce the problem.
 
 
 BTW, does the halting and sending IPI mechanism work in HVM? I saw

Yes.
 that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
 However, this piece of code isn't in upstream code. So I wonder if
 there is problem with that.

The PV ticketlock fixed it for HVM. It was disabled before because
the PV guests were using bytelocks while the HVM guests were using ticketlocks
and you couldn't swap in PV bytelocks for ticketlocks during startup.

 
 -Longman
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

2014-04-02 Thread Konrad Rzeszutek Wilk
On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
 N.B. Sorry for the duplicate. This patch series were resent as the
  original one was rejected by the vger.kernel.org list server
  due to long header. There is no change in content.
 
 v7-v8:
   - Remove one unneeded atomic operation from the slowpath, thus
 improving performance.
   - Simplify some of the codes and add more comments.
   - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
 unfair lock.
   - Reduce unfair lock slowpath lock stealing frequency depending
 on its distance from the queue head.
   - Add performance data for IvyBridge-EX CPU.

FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
HVM guest under Xen after a while stops working. The workload
is doing 'make -j32' on the Linux kernel.

Completely unresponsive. Thoughts?

(CC ing Marcos who had run the test)
 
 v6-v7:
   - Remove an atomic operation from the 2-task contending code
   - Shorten the names of some macros
   - Make the queue waiter to attempt to steal lock when unfair lock is
 enabled.
   - Remove lock holder kick from the PV code and fix a race condition
   - Run the unfair lock & PV code on overcommitted KVM guests to collect
 performance data.
 
 v5-v6:
  - Change the optimized 2-task contending code to make it fairer at the
expense of a bit of performance.
  - Add a patch to support unfair queue spinlock for Xen.
  - Modify the PV qspinlock code to follow what was done in the PV
ticketlock.
  - Add performance data for the unfair lock as well as the PV
support code.
 
 v4-v5:
  - Move the optimized 2-task contending code to the generic file to
enable more architectures to use it without code duplication.
  - Address some of the style-related comments by PeterZ.
  - Allow the use of unfair queue spinlock in a real para-virtualized
execution environment.
  - Add para-virtualization support to the qspinlock code by ensuring
that the lock holder and queue head stay alive as much as possible.
 
 v3-v4:
  - Remove debugging code and fix a configuration error
  - Simplify the qspinlock structure and streamline the code to make it
perform a bit better
  - Add an x86 version of asm/qspinlock.h for holding x86 specific
optimization.
  - Add an optimized x86 code path for 2 contending tasks to improve
low contention performance.
 
 v2-v3:
  - Simplify the code by using numerous mode only without an unfair option.
  - Use the latest smp_load_acquire()/smp_store_release() barriers.
  - Move the queue spinlock code to kernel/locking.
  - Make the use of queue spinlock the default for x86-64 without user
configuration.
  - Additional performance tuning.
 
 v1-v2:
  - Add some more comments to document what the code does.
  - Add a numerous CPU mode to support >= 16K CPUs
  - Add a configuration option to allow lock stealing which can further
improve performance in many cases.
  - Enable wakeup of queue head CPU at unlock time for non-numerous
CPU mode.
 
 This patch set has 3 different sections:
  1) Patches 1-4: Introduces a queue-based spinlock implementation that
 can replace the default ticket spinlock without increasing the
 size of the spinlock data structure. As a result, critical kernel
 data structures that embed spinlock won't increase in size and
 break data alignments.
  2) Patches 5-6: Enables the use of unfair queue spinlock in a
 para-virtualized execution environment. This can resolve some
 of the locking related performance issues due to the fact that
 the next CPU to get the lock may have been scheduled out for a
 period of time.
  3) Patches 7-10: Enable qspinlock para-virtualization support
 by halting the waiting CPUs after spinning for a certain amount of
 time. The unlock code will detect a sleeping waiter and wake it
 up. This is essentially the same logic as the PV ticketlock code.
 
 The queue spinlock has slightly better performance than the ticket
 spinlock in uncontended case. Its performance can be much better
 with moderate to heavy contention.  This patch has the potential of
 improving the performance of all the workloads that have moderate to
 heavy spinlock contention.
 
 The queue spinlock is especially suitable for NUMA machines with at
 least 2 sockets, though noticeable performance benefit probably won't
 show up in machines with less than 4 sockets.
 
 The purpose of this patch set is not to solve any particular spinlock
 contention problems. Those need to be solved by refactoring the code
 to make more efficient use of the lock or finer granularity ones. The
 main purpose is to make the lock contention problems more tolerable
 until someone can spend the time and effort to fix them.
 
 To illustrate the performance benefit of the queue spinlock, the
 ebizzy benchmark was run with the -m option in two different computers:
 
   Test machineticket-lock 

Re: [PATCH v8 10/10] pvqspinlock, x86: Enable qspinlock PV support for XEN

2014-04-02 Thread Konrad Rzeszutek Wilk
 diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
 index a70fdeb..451e392 100644
 --- a/kernel/Kconfig.locks
 +++ b/kernel/Kconfig.locks
 @@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK
  
  config QUEUE_SPINLOCK
   def_bool y if ARCH_USE_QUEUE_SPINLOCK
 - depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
 + depends on SMP

If I read this correctly that means you cannot select any more the old
ticketlocks? As in, if you select CONFIG_PARAVIRT on X86 it will automatically
select ARCH_USE_QUEUE_SPINLOCK which will then enable this by default?

Should the 'def_bool' be selectable?

 -- 
 1.7.1
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest

2014-03-17 Thread Konrad Rzeszutek Wilk
On Thu, Mar 13, 2014 at 02:16:06PM +0100, Paolo Bonzini wrote:
 Il 13/03/2014 11:54, David Vrabel ha scritto:
 On 12/03/14 18:54, Waiman Long wrote:
 Locking is always an issue in a virtualized environment as the virtual
 CPU that is waiting on a lock may get scheduled out and hence block
 any progress in lock acquisition even when the lock has been freed.
 
 One solution to this problem is to allow unfair lock in a
 para-virtualized environment. In this case, a new lock acquirer can
 come and steal the lock if the next-in-line CPU to get the lock is
 scheduled out. Unfair lock in a native environment is generally not a
 good idea as there is a possibility of lock starvation for a heavily
 contended lock.
 
 I do not think this is a good idea -- the problems with unfair locks are
 worse in a virtualized guest.  If a waiting VCPU deschedules and has to
 be kicked to grab a lock then it is very likely to lose a race with
 another running VCPU trying to take a lock (since it takes time for the
 VCPU to be rescheduled).
 
 Actually, I think the unfair version should be automatically
 selected if running on a hypervisor.  Per-hypervisor pvops can
 choose to enable the fair one.
 
 Lock unfairness may be particularly evident on a virtualized guest
 when the host is overcommitted, but problems with fair locks are
 even worse.
 
 In fact, RHEL/CentOS 6 already uses unfair locks if
 X86_FEATURE_HYPERVISOR is set.  The patch was rejected upstream in
 favor of pv ticketlocks, but pv ticketlocks do not cover all
 hypervisors so perhaps we could revisit that choice.
 
 Measurements were done by Gleb for two guests running 2.6.32 with 16
 vcpus each, on a 16-core system.  One guest ran with unfair locks,
 one guest ran with fair locks.  Two kernel compilations (time make

And when you say fair locks are you saying PV ticketlocks or generic
ticketlocks? 
 -j 16 all) were started at the same time on both guests, and times
 were as follows:
 
 unfair: fair:
 real 13m34.674s real 19m35.827s
 user 96m2.638s  user 102m38.665s
 sys 56m14.991s  sys 158m22.470s
 
 real 13m3.768s  real 19m4.375s
 user 95m34.509s user 111m9.903s
 sys 53m40.550s  sys 141m59.370s
 
 Actually, interpreting the numbers shows an even worse slowdown.
 
 Compilation took ~6.5 minutes in a guest when the host was not
 overcommitted, and with unfair locks everything scaled just fine.

You should see the same values with the PV ticketlock. It is not clear
to me if this testing did include that variant of locks?

 
 Ticketlocks fell completely apart; during the first 13 minutes they
 were allotted 16*6.5=104 minutes of CPU time, and they spent almost
 all of it spinning in the kernel (102 minutes in the first run).

Right, the non-PV variant of them do fall apart. That is why
PV ticketlocks are so nice.

 They did perhaps 30 seconds worth of work because, as soon as the
 unfair-lock guest finished and the host was no longer overcommitted,
 compilation finished in 6 minutes.
 
 So that's approximately 12x slowdown from using non-pv fair locks
 (vs. unfair locks) on a 200%-overcommitted host.

Ah, so it was non-PV.

I am curious if the test was any different if you tested PV ticketlocks
vs Red Hat variant of unfair locks.

 
 Paolo
 
 With the unfair locking activated on bare metal 4-socket Westmere-EX
 box, the execution times (in ms) of a spinlock micro-benchmark were
 as follows:
 
    # of    Ticket      Fair          Unfair
    tasks   lock        queue lock    queue lock
    ------  --------    ----------    ----------
      1        135           135          137
      2       1045          1120          747
      3       1827          2345         1084
      4       2689          2934         1438
      5       3736          3658         1722
      6       4942          4434         2092
      7       6304          5176         2245
      8       7736          5955         2388
 
 Are these figures with or without the later PV support patches?
 
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code

2014-01-28 Thread Konrad Rzeszutek Wilk
On Mon, Jan 27, 2014 at 01:33:01PM +0100, Julian Stecklina wrote:
 On 01/24/2014 07:08 PM, Konrad Rzeszutek Wilk wrote:
  On Thu, Jan 16, 2014 at 03:13:44PM +0100, Julian Stecklina wrote:
  The paravirtualized clock used in KVM and Xen uses a version field to
  allow the guest to see when the shared data structure is inconsistent.
  The code reads the version field twice (before and after the data
  structure is copied) and checks whether they haven't
  changed and that the lowest bit is not set. As the second access is not
  synchronized, the compiler could generate code that accesses version
  more than two times and you end up with inconsistent data.
  
  Could you paste in the code that the 'bad' compiler generates
  vs the compiler that generate 'good' code please?
 
 At least 4.8 and probably older compilers compile this as intended. The
 point is that the standard does not guarantee the indented behavior,
 i.e. the code is wrong.

Perhaps I misunderstood Jan's response but it sounded to me like
that the compiler was not adhering to the standard?

 
 I can refer to this lwn article:
 https://lwn.net/Articles/508991/
 
 The whole point of ACCESS_ONCE is to avoid time bombs like that. There
 are lots of place where ACCESS_ONCE is used in the kernel:
 
 http://lxr.free-electrons.com/ident?i=ACCESS_ONCE
 
 See for example the check_zero function here:
 http://lxr.free-electrons.com/source/arch/x86/kernel/kvm.c#L559
 

In other words, you don't have a sample of 'bad' compiler code.


 Julian
 
  
 
  An example using pvclock_get_time_values:
 
  host starts updating data, sets src->version to 1
  guest reads src->version (1) and stores it into dst->version.
  guest copies inconsistent data
  guest reads src->version (1) and computes xor with dst->version.
  host finishes updating data and sets src->version to 2
  guest reads src->version (2) and checks whether lower bit is not set.
  while loop exits with inconsistent data!
 
  AFAICS the compiler is allowed to optimize the given code this way.
 
  Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de
  ---
   arch/x86/kernel/pvclock.c | 10 +++---
   1 file changed, 7 insertions(+), 3 deletions(-)
 
  diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
  index 42eb330..f62b41c 100644
  --- a/arch/x86/kernel/pvclock.c
  +++ b/arch/x86/kernel/pvclock.c
  @@ -55,6 +55,8 @@ static u64 pvclock_get_nsec_offset(struct 
  pvclock_shadow_time *shadow)
   static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
 struct pvclock_vcpu_time_info *src)
   {
  +  u32 nversion;
  +
 	do {
 		dst->version = src->version;
 		rmb();	/* fetch version before data */
  @@ -64,7 +66,8 @@ static unsigned pvclock_get_time_values(struct 
  pvclock_shadow_time *dst,
 		dst->tsc_shift = src->tsc_shift;
 		dst->flags = src->flags;
 		rmb();	/* test version after fetching data */
  -	} while ((src->version & 1) || (dst->version != src->version));
  +	nversion = ACCESS_ONCE(src->version);
  +	} while ((nversion & 1) || (dst->version != nversion));
   
 	return dst->version;
   }
  @@ -135,7 +138,7 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
  *wall_clock,
 struct pvclock_vcpu_time_info *vcpu_time,
 struct timespec *ts)
   {
  -  u32 version;
  +  u32 version, nversion;
 u64 delta;
 struct timespec now;
   
  @@ -146,7 +149,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
  *wall_clock,
 		now.tv_sec  = wall_clock->sec;
 		now.tv_nsec = wall_clock->nsec;
 		rmb();	/* fetch time before checking version */
  -	} while ((wall_clock->version & 1) || (version != wall_clock->version));
  +	nversion = ACCESS_ONCE(wall_clock->version);
  +	} while ((nversion & 1) || (version != nversion));
   
 delta = pvclock_clocksource_read(vcpu_time);/* time since system 
  boot */
 delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;
  -- 
  1.8.4.2
 
 
  ___
  Xen-devel mailing list
  xen-de...@lists.xen.org
  http://lists.xen.org/xen-devel
 
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code

2014-01-28 Thread Konrad Rzeszutek Wilk
On Mon, Jan 27, 2014 at 06:47:58PM +0100, Paolo Bonzini wrote:
 Il 17/01/2014 10:41, Jan Beulich ha scritto:
  One half of this doesn't apply here, due to the explicit barriers
  that are there. The half about converting local variable accesses
  back to memory reads (i.e. eliding the local variable), however,
  is only a theoretical issue afaict: If a compiler really did this, I
  think there'd be far more places where this would hurt.
 
 Perhaps.  But for example seqlocks get it right.
 
  I don't think so - this would only be an issue if the conditions used
  | instead of ||. || implies a sequence point between evaluating the
  left and right sides, and the standard says: The presence of a
  sequence point between the evaluation of expressions A and B
  implies that every value computation and side effect associated
  with A is sequenced before every value computation and side
  effect associated with B.
 
 I suspect this is widely ignored by compilers if A is not 
 side-effecting.  The above wording would imply that
 
 	x = a || b        =>        x = (a | b) != 0
 
 (where a and b are non-volatile globals) would be an invalid 
 change.  The compiler would have to do:
 
  temp = a;
  barrier();
  x = (temp | b) != 0
 
 and I'm pretty sure that no compiler does it this way unless C11/C++11
 atomics are involved (at which point accesses become side-effecting).
 
 The code has changed and pvclock_get_time_values moved to
 __pvclock_read_cycles, but I think the problem remains.  Another approach
 to fixing this (and one I prefer) is to do the same thing as seqlocks:
 turn off the low bit in the return value of __pvclock_read_cycles,
 and drop the || altogether.  Untested patch after my name.

Is there a good test-case to confirm that this patch does not introduce
any regressions?
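
To make the seqlock-style approach concrete, here is a small self-contained sketch of the read loop being described: mask off the low "update in progress" bit when snapshotting the version, then retry until a re-read of the version matches, which makes the separate low-bit test unnecessary. The structure and names (toy_time_info, toy_read_clock) are assumptions for illustration; this is not the kernel source or the quoted patch:

/*
 * Illustration of the seqlock-style pvclock read: the low bit of
 * ->version is "update in progress"; clear it in the snapshot and
 * retry until the version is stable.
 */
#include <stdatomic.h>
#include <stdint.h>

struct toy_time_info {
	_Atomic uint32_t version;	/* odd while the host is updating */
	uint64_t system_time;		/* payload copied out by the guest */
};

static uint64_t toy_read_clock(struct toy_time_info *src)
{
	uint32_t version;
	uint64_t ret;

	do {
		/* Snapshot the version with the in-progress bit cleared. */
		version = atomic_load_explicit(&src->version,
					       memory_order_acquire) & ~1u;
		ret = src->system_time;	/* real code uses proper barriers */
		atomic_thread_fence(memory_order_acquire);
		/*
		 * If the host was mid-update (odd version) or finished an
		 * update while we copied, the re-read cannot match the
		 * masked snapshot, so we retry.
		 */
	} while (version != atomic_load_explicit(&src->version,
						 memory_order_acquire));

	return ret;
}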


 
 Paolo
 
 diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
 index d6b078e9fa28..5aec80adaf54 100644
 --- a/arch/x86/include/asm/pvclock.h
 +++ b/arch/x86/include/asm/pvclock.h
 @@ -75,7 +75,7 @@ unsigned __pvclock_read_cycles(const struct 
 pvclock_vcpu_time_info *src,
   cycle_t ret, offset;
   u8 ret_flags;
  
  -	version = src->version;
  +	version = src->version & ~1;
   /* Note: emulated platforms which do not advertise SSE2 support
* result in kvmclock not using the necessary RDTSC barriers.
* Without barriers, it is possible that RDTSC instruction reads from
 diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
 index 2f355d229a58..a5052a87d55e 100644
 --- a/arch/x86/kernel/pvclock.c
 +++ b/arch/x86/kernel/pvclock.c
 @@ -66,7 +66,7 @@ u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
  
   do {
   	version = __pvclock_read_cycles(src, &ret, &flags);
  -	} while ((src->version & 1) || version != src->version);
  +	} while (version != src->version);
   
   	return flags & valid_flags;
  }
 @@ -80,7 +80,7 @@ cycle_t pvclock_clocksource_read(struct 
 pvclock_vcpu_time_info *src)
  
   do {
   	version = __pvclock_read_cycles(src, &ret, &flags);
  -	} while ((src->version & 1) || version != src->version);
  +	} while (version != src->version);
   
   	if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
   		src->flags &= ~PVCLOCK_GUEST_STOPPED;
 diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
 index eb5d7a56f8d4..f09b09bcb515 100644
 --- a/arch/x86/vdso/vclock_gettime.c
 +++ b/arch/x86/vdso/vclock_gettime.c
 @@ -117,7 +117,6 @@ static notrace cycle_t vread_pvclock(int *mode)
*/
   	cpu1 = __getcpu() & VGETCPU_CPU_MASK;
   	} while (unlikely(cpu != cpu1 ||
  -		  (pvti->pvti.version & 1) ||
  		  pvti->pvti.version != version));
   
   	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
 
 
 ___
 Xen-devel mailing list
 xen-de...@lists.xen.org
 http://lists.xen.org/xen-devel
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code

2014-01-24 Thread Konrad Rzeszutek Wilk
On Thu, Jan 16, 2014 at 03:13:44PM +0100, Julian Stecklina wrote:
 The paravirtualized clock used in KVM and Xen uses a version field to
 allow the guest to see when the shared data structure is inconsistent.
 The code reads the version field twice (before and after the data
 structure is copied) and checks whether they haven't
 changed and that the lowest bit is not set. As the second access is not
 synchronized, the compiler could generate code that accesses version
 more than two times and you end up with inconsistent data.

Could you paste in the code that the 'bad' compiler generates
vs the compiler that generate 'good' code please?

 
 An example using pvclock_get_time_values:
 
 host starts updating data, sets src->version to 1
 guest reads src->version (1) and stores it into dst->version.
 guest copies inconsistent data
 guest reads src->version (1) and computes xor with dst->version.
 host finishes updating data and sets src->version to 2
 guest reads src->version (2) and checks whether lower bit is not set.
 while loop exits with inconsistent data!
 
 AFAICS the compiler is allowed to optimize the given code this way.
 
 Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de
 ---
  arch/x86/kernel/pvclock.c | 10 +++---
  1 file changed, 7 insertions(+), 3 deletions(-)
 
 diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
 index 42eb330..f62b41c 100644
 --- a/arch/x86/kernel/pvclock.c
 +++ b/arch/x86/kernel/pvclock.c
 @@ -55,6 +55,8 @@ static u64 pvclock_get_nsec_offset(struct 
 pvclock_shadow_time *shadow)
  static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
   struct pvclock_vcpu_time_info *src)
  {
 + u32 nversion;
 +
 	do {
 		dst->version = src->version;
 		rmb();	/* fetch version before data */
  @@ -64,7 +66,8 @@ static unsigned pvclock_get_time_values(struct 
  pvclock_shadow_time *dst,
 		dst->tsc_shift = src->tsc_shift;
 		dst->flags = src->flags;
 		rmb();	/* test version after fetching data */
  -	} while ((src->version & 1) || (dst->version != src->version));
  +	nversion = ACCESS_ONCE(src->version);
  +	} while ((nversion & 1) || (dst->version != nversion));
   
 	return dst->version;
  }
 @@ -135,7 +138,7 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
 *wall_clock,
   struct pvclock_vcpu_time_info *vcpu_time,
   struct timespec *ts)
  {
 - u32 version;
 + u32 version, nversion;
   u64 delta;
   struct timespec now;
  
 @@ -146,7 +149,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
 *wall_clock,
 		now.tv_sec  = wall_clock->sec;
 		now.tv_nsec = wall_clock->nsec;
 		rmb();	/* fetch time before checking version */
  -	} while ((wall_clock->version & 1) || (version != wall_clock->version));
  +	nversion = ACCESS_ONCE(wall_clock->version);
  +	} while ((nversion & 1) || (version != nversion));
  
   delta = pvclock_clocksource_read(vcpu_time);/* time since system 
 boot */
   delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;
 -- 
 1.8.4.2
 
 
 ___
 Xen-devel mailing list
 xen-de...@lists.xen.org
 http://lists.xen.org/xen-devel
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86 kconfig: Add recommendation to enable paravirt spinlock

2013-10-21 Thread Konrad Rzeszutek Wilk
On Mon, Oct 21, 2013 at 09:35:08PM +0530, Raghavendra K T wrote:
 Since paravirt spinlock optimization are in 3.12 kernel, we have
 very good performance benefit for paravirtualized KVM / Xen kernel.
 Also we no longer suffer from 5% side effect on native kernel.

Yeey!
 
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
  Would like to thank Sander for spotting and suggesting this.
  pvspinlock benefit on KVM link: https://lkml.org/lkml/2013/8/6/178 
  
  Attilio's tests on native kernel impact:
  
 http://blog.xen.org/index.php/2012/05/11/benchmarking-the-new-pv-ticketlock-implementation/
 
  arch/x86/Kconfig | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)
 
 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index f67e839..4ba9d32 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -638,10 +638,10 @@ config PARAVIRT_SPINLOCKS
 spinlock implementation with something virtualization-friendly
 (for example, block the virtual CPU rather than spinning).
  
 -   Unfortunately the downside is an up to 5% performance hit on
 -   native kernels, with various workloads.
 +   It has minimal impact on native kernels and gives nice performance
 +   benefit for paravirtualized KVM / Xen kernels.
  
 -   If you are unsure how to answer this question, answer N.
 +   If you are unsure how to answer this question, answer Y.
  
  source arch/x86/xen/Kconfig
  
 -- 
 1.7.11.7
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V12 0/14] Paravirtualized ticket spinlocks

2013-08-09 Thread Konrad Rzeszutek Wilk
On Fri, Aug 09, 2013 at 06:20:02PM +0530, Raghavendra K T wrote:
 On 08/09/2013 04:34 AM, H. Peter Anvin wrote:
 
 Okay, I figured it out.
 
 One of several problems with the formatting of this patchset is that it
 has one- and two-digit patch numbers in the headers, which meant that my
 scripts tried to apply patch 10 first.
 
 
 My bad. I 'll send out in uniform digit form next time.
 

If you use 'git format-patch --subject-prefix "PATCH V14" v3.11-rc4..'
and 'git send-email --subject "[PATCH V14] bla blah" ..'
that should be automatically taken care of?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V12 0/14] Paravirtualized ticket spinlocks

2013-08-07 Thread Konrad Rzeszutek Wilk
On Wed, Aug 07, 2013 at 12:15:21PM +0530, Raghavendra K T wrote:
 On 08/07/2013 10:18 AM, H. Peter Anvin wrote:
 Please let me know, if I should rebase again.
 
 
 tip:master is not a stable branch; it is more like linux-next.  We need
 to figure out which topic branches are dependencies for this set.
 
 Okay. I 'll start looking at the branches that would get affected.
 (Xen, kvm are obvious ones).
 Please do let me know the branches I might have to check for.

From the Xen standpoint anything past v3.11-rc4 would work.

 
 
 
 
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 4/4] x86: correctly detect hypervisor

2013-08-05 Thread Konrad Rzeszutek Wilk
On Mon, Aug 05, 2013 at 11:38:14AM +0800, Jason Wang wrote:
 On 07/25/2013 04:54 PM, Jason Wang wrote:
  We try to handle the hypervisor compatibility mode by detecting hypervisor
  through a specific order. This is not robust, since hypervisors may 
  implement
  each other's features.
 
  This patch tries to handle this situation by always choosing the last one 
  in the
  CPUID leaves. This is done by letting .detect() returns a priority instead 
  of
  true/false and just re-using the CPUID leaf where the signature were found 
  as
  the priority (or 1 if it was found by DMI). Then we can just pick 
  hypervisor who
  has the highest priority. Other sophisticated detection method could also be
  implemented on top.
 
  Suggested by H. Peter Anvin and Paolo Bonzini.
 
  Cc: Thomas Gleixner t...@linutronix.de
  Cc: Ingo Molnar mi...@redhat.com
  Cc: H. Peter Anvin h...@zytor.com
  Cc: x...@kernel.org
  Cc: K. Y. Srinivasan k...@microsoft.com
  Cc: Haiyang Zhang haiya...@microsoft.com
  Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
  Cc: Jeremy Fitzhardinge jer...@goop.org
  Cc: Doug Covelli dcove...@vmware.com
  Cc: Borislav Petkov b...@suse.de
  Cc: Dan Hecht dhe...@vmware.com
  Cc: Paul Gortmaker paul.gortma...@windriver.com
  Cc: Marcelo Tosatti mtosa...@redhat.com
  Cc: Gleb Natapov g...@redhat.com
  Cc: Paolo Bonzini pbonz...@redhat.com
  Cc: Frederic Weisbecker fweis...@gmail.com
  Cc: linux-ker...@vger.kernel.org
  Cc: de...@linuxdriverproject.org
  Cc: kvm@vger.kernel.org
  Cc: xen-de...@lists.xensource.com
  Cc: virtualizat...@lists.linux-foundation.org
  Signed-off-by: Jason Wang jasow...@redhat.com
  ---
 
 Ping, any comments and acks for this series?

Could you provide me with a git branch so I can test it overnight please?

 
 Thanks
   arch/x86/include/asm/hypervisor.h |2 +-
   arch/x86/kernel/cpu/hypervisor.c  |   15 +++
   arch/x86/kernel/cpu/mshyperv.c|   13 -
   arch/x86/kernel/cpu/vmware.c  |8 
   arch/x86/kernel/kvm.c |6 ++
   arch/x86/xen/enlighten.c  |9 +++--
   6 files changed, 25 insertions(+), 28 deletions(-)
 
  diff --git a/arch/x86/include/asm/hypervisor.h 
  b/arch/x86/include/asm/hypervisor.h
  index 2d4b5e6..e42f758 100644
  --- a/arch/x86/include/asm/hypervisor.h
  +++ b/arch/x86/include/asm/hypervisor.h
  @@ -33,7 +33,7 @@ struct hypervisor_x86 {
  const char  *name;
   
  /* Detection routine */
  -   bool(*detect)(void);
  +   uint32_t(*detect)(void);
   
  /* Adjust CPU feature bits (run once per CPU) */
  void(*set_cpu_features)(struct cpuinfo_x86 *);
  diff --git a/arch/x86/kernel/cpu/hypervisor.c 
  b/arch/x86/kernel/cpu/hypervisor.c
  index 8727921..36ce402 100644
  --- a/arch/x86/kernel/cpu/hypervisor.c
  +++ b/arch/x86/kernel/cpu/hypervisor.c
  @@ -25,11 +25,6 @@
   #include asm/processor.h
   #include asm/hypervisor.h
   
  -/*
  - * Hypervisor detect order.  This is specified explicitly here because
  - * some hypervisors might implement compatibility modes for other
  - * hypervisors and therefore need to be detected in specific sequence.
  - */
   static const __initconst struct hypervisor_x86 * const hypervisors[] =
   {
   #ifdef CONFIG_XEN_PVHVM
  @@ -49,15 +44,19 @@ static inline void __init
   detect_hypervisor_vendor(void)
   {
  const struct hypervisor_x86 *h, * const *p;
  +   uint32_t pri, max_pri = 0;
   
   	for (p = hypervisors; p < hypervisors + ARRAY_SIZE(hypervisors); p++) {
   		h = *p;
  -		if (h->detect()) {
  +		pri = h->detect();
  +		if (pri != 0 && pri > max_pri) {
  +			max_pri = pri;
   			x86_hyper = h;
  -			printk(KERN_INFO "Hypervisor detected: %s\n", h->name);
  -			break;
   		}
   	}
  +
  +	if (max_pri)
  +		printk(KERN_INFO "Hypervisor detected: %s\n", x86_hyper->name);
   }
   
   void init_hypervisor(struct cpuinfo_x86 *c)
  diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
  index 8f4be53..71a39f3 100644
  --- a/arch/x86/kernel/cpu/mshyperv.c
  +++ b/arch/x86/kernel/cpu/mshyperv.c
  @@ -27,20 +27,23 @@
   struct ms_hyperv_info ms_hyperv;
   EXPORT_SYMBOL_GPL(ms_hyperv);
   
  -static bool __init ms_hyperv_platform(void)
  +static uint32_t  __init ms_hyperv_platform(void)
   {
  u32 eax;
  u32 hyp_signature[3];
   
  if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
  -   return false;
  +   return 0;
   
  cpuid(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS,
   	      &eax, &hyp_signature[0], &hyp_signature[1], &hyp_signature[2]);
    
  -	return eax >= HYPERV_CPUID_MIN &&
  -		eax <= HYPERV_CPUID_MAX &&
  -		!memcmp("Microsoft Hv", hyp_signature, 12);
  +	if (eax >= HYPERV_CPUID_MIN &&
  +	    eax <= HYPERV_CPUID_MAX &&
  +	    !memcmp("Microsoft Hv", hyp_signature, 12))
  +	return

Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-05 Thread Konrad Rzeszutek Wilk
On Mon, Aug 05, 2013 at 11:46:03AM +0200, Ingo Molnar wrote:
 
 * Gleb Natapov g...@redhat.com wrote:
 
  On Fri, Aug 02, 2013 at 11:25:39AM +0200, Ingo Molnar wrote:
Ingo,

Do you have any concerns reg this series? please let me know if this 
looks good now to you.
   
   I'm inclined to NAK it for excessive quotation - who knows how many 
   people left the discussion in disgust? Was it done to drive away as 
   many reviewers as possible?
   
   Anyway, see my other reply, the measurement results seem hard to 
   interpret and inconclusive at the moment.
 
  That result was only for patch 18 of the series, not pvspinlock in 
  general.
 
 Okay - I've re-read the performance numbers and they are impressive, so no 
 objections from me.
 
 The x86 impact seems to be a straightforward API change, with most of the 
 changes on the virtualization side. So:
 
 Acked-by: Ingo Molnar mi...@kernel.org
 
 I guess you'd want to carry this in the KVM tree or so - maybe in a 
 separate branch because it changes Xen as well?

May I suggest an alternate way - perhaps you can put them in a tip/spinlock
tree for v3.12 - since both KVM and Xen maintainers have acked and carefully
reviewed them?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 4/4] x86: correctly detect hypervisor

2013-08-05 Thread Konrad Rzeszutek Wilk
On Mon, Aug 05, 2013 at 08:20:53AM -0700, H. Peter Anvin wrote:
 On 08/05/2013 07:34 AM, Konrad Rzeszutek Wilk wrote:
  
  Could you provide me with a git branch so I can test it overnight please?
  
 
 Pull tip:x86/paravirt.

It works for me. Thanks.
 
   -hpa
 
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] xen: switch to use hypervisor_cpuid_base()

2013-07-23 Thread Konrad Rzeszutek Wilk
On Tue, Jul 23, 2013 at 05:41:03PM +0800, Jason Wang wrote:
 Switch to use hypervisor_cpuid_base() to detect Xen.
 
 Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com

Acked-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 Cc: Jeremy Fitzhardinge jer...@goop.org
 Cc: Thomas Gleixner t...@linutronix.de
 Cc: Ingo Molnar mi...@redhat.com
 Cc: H. Peter Anvin h...@zytor.com
 Cc: x...@kernel.org
 Cc: Paolo Bonzini pbonz...@redhat.com
 Cc: xen-de...@lists.xensource.com
 Cc: virtualizat...@lists.linux-foundation.org
 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  arch/x86/include/asm/xen/hypervisor.h |   16 +---
  1 files changed, 1 insertions(+), 15 deletions(-)
 
 diff --git a/arch/x86/include/asm/xen/hypervisor.h 
 b/arch/x86/include/asm/xen/hypervisor.h
 index 125f344..d866959 100644
 --- a/arch/x86/include/asm/xen/hypervisor.h
 +++ b/arch/x86/include/asm/xen/hypervisor.h
 @@ -40,21 +40,7 @@ extern struct start_info *xen_start_info;
  
  static inline uint32_t xen_cpuid_base(void)
  {
 - uint32_t base, eax, ebx, ecx, edx;
 - char signature[13];
 -
 - for (base = 0x40000000; base < 0x40010000; base += 0x100) {
 - cpuid(base, &eax, &ebx, &ecx, &edx);
 - *(uint32_t *)(signature + 0) = ebx;
 - *(uint32_t *)(signature + 4) = ecx;
 - *(uint32_t *)(signature + 8) = edx;
 - signature[12] = 0;
 -
 - if (!strcmp("XenVMMXenVMM", signature) && ((eax - base) >= 2))
 - return base;
 - }
 -
 - return 0;
 + return hypervisor_cpuid_base("XenVMMXenVMM", 2);
  }
  
  #ifdef CONFIG_XEN
 -- 
 1.7.1
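
For readers who have not seen the generic helper this patch switches to: hypervisor_cpuid_base()
does roughly the following scan of the hypervisor CPUID leaves. This is a sketch from memory, not
quoted from the tree the patch was written against:

    static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
    {
            uint32_t base, eax, signature[3];

            for (base = 0x40000000; base < 0x40010000; base += 0x100) {
                    cpuid(base, &eax, &signature[0], &signature[1], &signature[2]);

                    /* 12-byte vendor signature match, plus a minimum leaf count */
                    if (!memcmp(sig, signature, 12) &&
                        (leaves == 0 || ((eax - base) >= leaves)))
                            return base;
            }

            return 0;
    }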
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V11 0/18] Paravirtualized ticket spinlocks

2013-07-22 Thread Konrad Rzeszutek Wilk
 
 github link: https://github.com/ktraghavendra/linux/tree/pvspinlock_v11

Any chance you have a backup git tree? I get:

This repository is temporarily unavailable.

 
 Please note that we set SPIN_THRESHOLD = 32k with this series,
 that would eatup little bit of overcommit performance of PLE machines
 and overall performance of non-PLE machines.
 
 The older series[3] was tested by Attilio for Xen implementation.
 
 Note that Konrad needs to revert below two patches to enable xen on hvm 
   70dd4998, f10cd522c

We could add that to the series. But let me first test it out - which
brings us back to needing the repo :-)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Konrad Rzeszutek Wilk
On Wed, Jul 10, 2013 at 01:47:17PM +0300, Gleb Natapov wrote:
 On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
  On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
  
  Here's an idea, trim the damn email ;-) -- not only directed at gleb.
  
 Good idea.
 
Ingo, Gleb,

From the results perspective, Andrew Theurer, Vinod's test results are
pro-pvspinlock.
Could you please help me understand what will make it a mergeable
candidate?

   I need to spend more time reviewing it :) The problem with PV interfaces
   is that they are easy to add but hard to get rid of if better solution
   (HW or otherwise) appears.
  
  How so? Just make sure the registration for the PV interface is optional; 
  that
  is, allow it to fail. A guest that fails the PV setup will either have to 
  try
  another PV interface or fall back to 'native'.
  
 We have to carry PV around for live migration purposes. PV interface
 cannot disappear under a running guest.

Why can't it? This is the same as handling, say, XSAVE support. Some hosts
might have it - some might not. It is the job of the toolstack to make sure
not to migrate to the hosts which don't have it. Or bind the guest to the
lowest common interface (so don't enable the PV interface if the other hosts
in the cluster can't support this flag)?

 
I agree that Jiannan's Preemptable Lock idea is promising and we could
evaluate that  approach, and make the best one get into kernel and also
will carry on discussion with Jiannan to improve that patch.
   That would be great. The work is stalled from what I can tell.
  
  I absolutely hated that stuff because it wrecked the native code.
 Yes, the idea was to hide it from native code behind PV hooks.
 
 --
   Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Konrad Rzeszutek Wilk
Gleb Natapov g...@redhat.com wrote:
On Wed, Jul 10, 2013 at 11:03:15AM -0400, Konrad Rzeszutek Wilk wrote:
 On Wed, Jul 10, 2013 at 01:47:17PM +0300, Gleb Natapov wrote:
  On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
   On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
   
   Here's an idea, trim the damn email ;-) -- not only directed at
gleb.
   
  Good idea.
  
 Ingo, Gleb,
 
 From the results perspective, Andrew Theurer, Vinod's test
results are
 pro-pvspinlock.
 Could you please help me to know what will make it a
mergeable
 candidate?.
 
I need to spend more time reviewing it :) The problem with PV
interfaces
is that they are easy to add but hard to get rid of if better
solution
(HW or otherwise) appears.
   
   How so? Just make sure the registration for the PV interface is
optional; that
   is, allow it to fail. A guest that fails the PV setup will either
have to try
   another PV interface or fall back to 'native'.
   
  We have to carry PV around for live migration purposes. PV
interface
  cannot disappear under a running guest.
 
 Why can't it? This is the same as handling say XSAVE operations. Some
hosts
 might have it - some might not. It is the job of the toolstack to
make sure
 to not migrate to the hosts which don't have it. Or bound the guest
to the
 lowest interface (so don't enable the PV interface if the other hosts
in the
 cluster can't support this flag)?
XSAVE is a HW feature and it is not going to disappear under you after a
software
upgrade. Upgrading the kernel on part of your hosts and no longer being
able to migrate to them is not something people who use live migration
expect. In practice it means that updating all hosts in a datacenter to a
newer kernel is no longer possible without rebooting VMs.

--
   Gleb.

I see. Perhaps then, if the hardware becomes much better at this, another PV 
interface can be provided which will use the static_key to turn off the PV spin 
lock and use the bare-metal version (or perhaps some form of super elision 
locks). That does mean the host has to do something when this PV interface is 
invoked by the older guests. 

Anyhow, that said, I think the benefits are pretty neat right now, and worrying 
about whether the hardware vendors will provide something new is not benefiting 
users. What perhaps then needs to be addressed is how to retire this mechanism 
if the hardware becomes superb? 
-- 
Sent from my Android phone. Please excuse my brevity.
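
A minimal sketch of the static_key-gated fallback suggested above. The key and helper names here
are illustrative only (they are not from any posted patch), assuming the slowpath hooks of the
series are already in place:

    #include <linux/jump_label.h>

    /* Illustrative: flipped off if PV ticketlocks are ever retired, so the
     * bare-metal ticketlock path is used and the PV kick becomes a no-op. */
    static struct static_key pv_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;

    static __always_inline void ticket_unlock_kick(arch_spinlock_t *lock, __ticket_t next)
    {
            if (static_key_false(&pv_ticketlocks_enabled))
                    __ticket_unlock_kick(lock, next);       /* PV slowpath kick */
            /* else: nothing to do, the plain unlock already happened */
    }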
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Konrad Rzeszutek Wilk
On Wed, Jun 26, 2013 at 03:52:40PM +0300, Gleb Natapov wrote:
 On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
  On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
   On 06/25/2013 08:20 PM, Andrew Theurer wrote:
   On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
   This series replaces the existing paravirtualized spinlock mechanism
   with a paravirtualized ticketlock mechanism. The series provides
   implementation for both Xen and KVM.
   
   Changes in V9:
   - Changed spin_threshold to 32k to avoid excess halt exits that are
   causing undercommit degradation (after PLE handler improvement).
   - Added  kvm_irq_delivery_to_apic (suggested by Gleb)
   - Optimized halt exit path to use PLE handler
   
   V8 of PVspinlock was posted last year. After Avi's suggestions to look
   at PLE handler's improvements, various optimizations in PLE handling
   have been tried.
   
   Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
   patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
   tested these patches with and without PLE, as PLE is still not scalable
   with large VMs.
   
   
   Hi Andrew,
   
   Thanks for testing.
   
   System: x3850X5, 40 cores, 80 threads
   
   
    1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
    ------------------------------------------------------------
    Configuration            Total Throughput (MB/s)   Notes
    3.10-default-ple_on      22945                     5% CPU in host kernel, 2% spin_lock in guests
    3.10-default-ple_off     23184                     5% CPU in host kernel, 2% spin_lock in guests
    3.10-pvticket-ple_on     22895                     5% CPU in host kernel, 2% spin_lock in guests
    3.10-pvticket-ple_off    23051                     5% CPU in host kernel, 2% spin_lock in guests
    [all 1x results look good here]
   
   Yes. The 1x results look too close
   
   
   
    2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
    ------------------------------------------------------------
    Configuration            Total Throughput (MB/s)   Notes
    3.10-default-ple_on       6287                     55% CPU in host kernel, 17% spin_lock in guests
    3.10-default-ple_off      1849                      2% CPU in host kernel, 95% spin_lock in guests
    3.10-pvticket-ple_on      6691                     50% CPU in host kernel, 15% spin_lock in guests
    3.10-pvticket-ple_off    16464                      8% CPU in host kernel, 33% spin_lock in guests
   
   I see 6.426% improvement with ple_on
   and 161.87% improvement with ple_off. I think this is a very good sign
for the patches
   
   [PLE hinders pv-ticket improvements, but even with PLE off,
 we still off from ideal throughput (somewhere 2)]
   
   
   Okay, The ideal throughput you are referring is getting around atleast
   80% of 1x throughput for over-commit. Yes we are still far away from
   there.
   
   
    1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
    ------------------------------------------------------------
    Configuration            Total Throughput (MB/s)   Notes
    3.10-default-ple_on      22736                     6% CPU in host kernel, 3% spin_lock in guests
    3.10-default-ple_off     23377                     5% CPU in host kernel, 3% spin_lock in guests
    3.10-pvticket-ple_on     22471                     6% CPU in host kernel, 3% spin_lock in guests
    3.10-pvticket-ple_off    23445                     5% CPU in host kernel, 3% spin_lock in guests
    [1x looking fine here]
   
   
   I see ple_off is little better here.
   
   
    2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
    ------------------------------------------------------------
    Configuration            Total Throughput (MB/s)   Notes
    3.10-default-ple_on       1965                     70% CPU in host kernel, 34% spin_lock in guests
    3.10-default-ple_off       226                      2% CPU in host kernel, 94% spin_lock in guests
    3.10-pvticket-ple_on      1942                     70% CPU in host kernel, 35% spin_lock in guests
    3.10-pvticket-ple_off     8003                     11% CPU in host kernel, 70% spin_lock in guests
    [quite bad all around, but pv-tickets with PLE off the best so far.
     Still quite a bit off from ideal throughput]
   
   This is again a remarkable improvement (307%).
   This motivates me to 

Re: [PATCH RFC V9 12/19] xen: Enable PV ticketlocks on HVM Xen

2013-06-04 Thread Konrad Rzeszutek Wilk
On Tue, Jun 04, 2013 at 12:46:53PM +0530, Raghavendra K T wrote:
 On 06/03/2013 09:27 PM, Konrad Rzeszutek Wilk wrote:
 On Sun, Jun 02, 2013 at 12:55:03AM +0530, Raghavendra K T wrote:
 xen: Enable PV ticketlocks on HVM Xen
 
 There is more to it. You should also revert 
 70dd4998cb85f0ecd6ac892cc7232abefa432efb
 
 
 Yes, true. Do you expect the revert to be folded into this patch itself?
 

I can do them. I would drop this patch and just mention in
the cover letter that Konrad would have to revert two git commits
to re-enable it on PVHVM.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V9 2/19] x86/ticketlock: Don't inline _spin_unlock when using paravirt spinlocks

2013-06-03 Thread Konrad Rzeszutek Wilk
On Sun, Jun 02, 2013 at 12:52:09AM +0530, Raghavendra K T wrote:
 x86/ticketlock: Don't inline _spin_unlock when using paravirt spinlocks
 
 From: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 
 The code size expands somewhat, and it's better to just call
 a function rather than inline it.
 
 Thanks Jeremy for original version of ARCH_NOINLINE_SPIN_UNLOCK config patch,
 which is simplified.
 
 Suggested-by: Linus Torvalds torva...@linux-foundation.org
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com

Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 ---
  arch/x86/Kconfig |1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index 685692c..80fcc4b 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -621,6 +621,7 @@ config PARAVIRT_DEBUG
  config PARAVIRT_SPINLOCKS
   bool "Paravirtualization layer for spinlocks"
   depends on PARAVIRT && SMP
 + select UNINLINE_SPIN_UNLOCK
   ---help---
 Paravirtualized spinlocks allow a pvops backend to replace the
 spinlock implementation with something virtualization-friendly
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V9 3/19] x86/ticketlock: Collapse a layer of functions

2013-06-03 Thread Konrad Rzeszutek Wilk
On Sun, Jun 02, 2013 at 12:52:29AM +0530, Raghavendra K T wrote:
 x86/ticketlock: Collapse a layer of functions
 
 From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 
 Now that the paravirtualization layer doesn't exist at the spinlock
 level any more, we can collapse the __ticket_ functions into the arch_
 functions.
 
 Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 Tested-by: Attilio Rao attilio@citrix.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com

Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 ---
  arch/x86/include/asm/spinlock.h |   35 +--
  1 file changed, 5 insertions(+), 30 deletions(-)
 
 diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
 index 4d54244..7442410 100644
 --- a/arch/x86/include/asm/spinlock.h
 +++ b/arch/x86/include/asm/spinlock.h
 @@ -76,7 +76,7 @@ static __always_inline void __ticket_unlock_kick(struct 
 arch_spinlock *lock,
   * in the high part, because a wide xadd increment of the low part would 
 carry
   * up and contaminate the high part.
   */
 -static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock)
 +static __always_inline void arch_spin_lock(struct arch_spinlock *lock)
  {
   register struct __raw_tickets inc = { .tail = 1 };
  
 @@ -96,7 +96,7 @@ static __always_inline void __ticket_spin_lock(struct 
 arch_spinlock *lock)
  out: barrier();  /* make sure nothing creeps before the lock is taken */
  }
  
 -static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
 +static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
  {
   arch_spinlock_t old, new;
  
 @@ -110,7 +110,7 @@ static __always_inline int 
 __ticket_spin_trylock(arch_spinlock_t *lock)
  return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
 old.head_tail;
  }
  
 -static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
 +static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
  {
  __ticket_t next = lock->tickets.head + 1;
  
 @@ -118,46 +118,21 @@ static __always_inline void 
 __ticket_spin_unlock(arch_spinlock_t *lock)
   __ticket_unlock_kick(lock, next);
  }
  
 -static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
 +static inline int arch_spin_is_locked(arch_spinlock_t *lock)
  {
  struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
  
   return tmp.tail != tmp.head;
  }
  
 -static inline int __ticket_spin_is_contended(arch_spinlock_t *lock)
 +static inline int arch_spin_is_contended(arch_spinlock_t *lock)
  {
  struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
  
  return (__ticket_t)(tmp.tail - tmp.head) > 1;
  }
 -
 -static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 -{
 - return __ticket_spin_is_locked(lock);
 -}
 -
 -static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 -{
 - return __ticket_spin_is_contended(lock);
 -}
  #define arch_spin_is_contended   arch_spin_is_contended
  
 -static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
 -{
 - __ticket_spin_lock(lock);
 -}
 -
 -static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
 -{
 - return __ticket_spin_trylock(lock);
 -}
 -
 -static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 -{
 - __ticket_spin_unlock(lock);
 -}
 -
  static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 unsigned long flags)
  {
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V9 8/19] x86/pvticketlock: When paravirtualizing ticket locks, increment by 2

2013-06-03 Thread Konrad Rzeszutek Wilk
On Sun, Jun 02, 2013 at 12:54:02AM +0530, Raghavendra K T wrote:
 x86/pvticketlock: When paravirtualizing ticket locks, increment by 2
 
 From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 
 Increment ticket head/tails by 2 rather than 1 to leave the LSB free
 to store a is in slowpath state bit.  This halves the number
 of possible CPUs for a given ticket size, but this shouldn't matter
 in practice - kernels built for 32k+ CPU systems are probably
 specially built for the hardware rather than a generic distro
 kernel.
 
 Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 Tested-by: Attilio Rao attilio@citrix.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com

Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 ---
  arch/x86/include/asm/spinlock.h   |   10 +-
  arch/x86/include/asm/spinlock_types.h |   10 +-
  2 files changed, 14 insertions(+), 6 deletions(-)
 
 diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
 index 7442410..04a5cd5 100644
 --- a/arch/x86/include/asm/spinlock.h
 +++ b/arch/x86/include/asm/spinlock.h
 @@ -78,7 +78,7 @@ static __always_inline void __ticket_unlock_kick(struct 
 arch_spinlock *lock,
   */
  static __always_inline void arch_spin_lock(struct arch_spinlock *lock)
  {
 - register struct __raw_tickets inc = { .tail = 1 };
 + register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
  
  inc = xadd(&lock->tickets, inc);
  
 @@ -104,7 +104,7 @@ static __always_inline int 
 arch_spin_trylock(arch_spinlock_t *lock)
   if (old.tickets.head != old.tickets.tail)
   return 0;
  
 - new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
 + new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
  
   /* cmpxchg is a full barrier, so nothing can move before it */
  return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
 old.head_tail;
 @@ -112,9 +112,9 @@ static __always_inline int 
 arch_spin_trylock(arch_spinlock_t *lock)
  
  static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
  {
 - __ticket_t next = lock->tickets.head + 1;
 + __ticket_t next = lock->tickets.head + TICKET_LOCK_INC;
  
 - __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
 + __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
   __ticket_unlock_kick(lock, next);
  }
  
 @@ -129,7 +129,7 @@ static inline int arch_spin_is_contended(arch_spinlock_t 
 *lock)
  {
  struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
  
 - return (__ticket_t)(tmp.tail - tmp.head) > 1;
 + return (__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_INC;
  }
  #define arch_spin_is_contended   arch_spin_is_contended
  
 diff --git a/arch/x86/include/asm/spinlock_types.h 
 b/arch/x86/include/asm/spinlock_types.h
 index 83fd3c7..e96fcbd 100644
 --- a/arch/x86/include/asm/spinlock_types.h
 +++ b/arch/x86/include/asm/spinlock_types.h
 @@ -3,7 +3,13 @@
  
  #include linux/types.h
  
 -#if (CONFIG_NR_CPUS < 256)
 +#ifdef CONFIG_PARAVIRT_SPINLOCKS
 +#define __TICKET_LOCK_INC2
 +#else
 +#define __TICKET_LOCK_INC1
 +#endif
 +
 +#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
  typedef u8  __ticket_t;
  typedef u16 __ticketpair_t;
  #else
 @@ -11,6 +17,8 @@ typedef u16 __ticket_t;
  typedef u32 __ticketpair_t;
  #endif
  
 +#define TICKET_LOCK_INC  ((__ticket_t)__TICKET_LOCK_INC)
 +
  #define TICKET_SHIFT (sizeof(__ticket_t) * 8)
  
  typedef struct arch_spinlock {
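
A quick illustration of why the increment of 2 matters - this is my own sketch, with
TICKET_SLOWPATH_FLAG standing in for the bit that later patches in the series store in the
freed LSB:

    /* With TICKET_LOCK_INC == 2 the head/tail counters only ever take even
     * values, so bit 0 of the tail is free to mean "a waiter went to sleep,
     * take the slowpath on unlock". */
    #define TICKET_SLOWPATH_FLAG	((__ticket_t)1)

    static inline bool unlock_needs_kick(__ticket_t tail)
    {
            /* set explicitly by the slowpath; counting by 2 never touches it */
            return tail & TICKET_SLOWPATH_FLAG;
    }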
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V9 9/19] Split out rate limiting from jump_label.h

2013-06-03 Thread Konrad Rzeszutek Wilk
On Sun, Jun 02, 2013 at 12:54:22AM +0530, Raghavendra K T wrote:
 Split jumplabel ratelimit

I would change the title a bit, perhaps prefix it with: jump_label: 
 
 From: Andrew Jones drjo...@redhat.com
 
 Commit b202952075f62603bea9bfb6ebc6b0420db11949 introduced rate limiting

Also please add right after the git id this:

(perf, core: Rate limit perf_sched_events jump_label patching)

 for jump label disabling. The changes were made in the jump label code
 in order to be more widely available and to keep things tidier. This is
 all fine, except now jump_label.h includes linux/workqueue.h, which
 makes it impossible to include jump_label.h from anything that
 workqueue.h needs. For example, it's now impossible to include
 jump_label.h from asm/spinlock.h, which is done in proposed
 pv-ticketlock patches. This patch splits out the rate limiting related
 changes from jump_label.h into a new file, jump_label_ratelimit.h, to
 resolve the issue.
 
 Signed-off-by: Andrew Jones drjo...@redhat.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com

Otherwise looks fine to me:

Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 ---
  include/linux/jump_label.h   |   26 +-
  include/linux/jump_label_ratelimit.h |   34 
 ++
  include/linux/perf_event.h   |1 +
  kernel/jump_label.c  |1 +
  4 files changed, 37 insertions(+), 25 deletions(-)
  create mode 100644 include/linux/jump_label_ratelimit.h
 
 diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
 index 0976fc4..53cdf89 100644
 --- a/include/linux/jump_label.h
 +++ b/include/linux/jump_label.h
 @@ -48,7 +48,6 @@
  
  #include <linux/types.h>
  #include <linux/compiler.h>
 -#include <linux/workqueue.h>
  
  #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
  
 @@ -61,12 +60,6 @@ struct static_key {
  #endif
  };
  
 -struct static_key_deferred {
 - struct static_key key;
 - unsigned long timeout;
 - struct delayed_work work;
 -};
 -
  # include <asm/jump_label.h>
  # define HAVE_JUMP_LABEL
  #endif   /* CC_HAVE_ASM_GOTO && CONFIG_JUMP_LABEL */
 @@ -119,10 +112,7 @@ extern void arch_jump_label_transform_static(struct 
 jump_entry *entry,
  extern int jump_label_text_reserved(void *start, void *end);
  extern void static_key_slow_inc(struct static_key *key);
  extern void static_key_slow_dec(struct static_key *key);
 -extern void static_key_slow_dec_deferred(struct static_key_deferred *key);
  extern void jump_label_apply_nops(struct module *mod);
 -extern void
 -jump_label_rate_limit(struct static_key_deferred *key, unsigned long rl);
  
  #define STATIC_KEY_INIT_TRUE ((struct static_key) \
   { .enabled = ATOMIC_INIT(1), .entries = (void *)1 })
 @@ -141,10 +131,6 @@ static __always_inline void jump_label_init(void)
  {
  }
  
 -struct static_key_deferred {
 - struct static_key  key;
 -};
 -
  static __always_inline bool static_key_false(struct static_key *key)
  {
  if (unlikely(atomic_read(&key->enabled)) > 0)
 @@ -169,11 +155,6 @@ static inline void static_key_slow_dec(struct static_key 
 *key)
  atomic_dec(&key->enabled);
  }
  
 -static inline void static_key_slow_dec_deferred(struct static_key_deferred 
 *key)
 -{
 - static_key_slow_dec(&key->key);
 -}
 -
  static inline int jump_label_text_reserved(void *start, void *end)
  {
   return 0;
 @@ -187,12 +168,6 @@ static inline int jump_label_apply_nops(struct module 
 *mod)
   return 0;
  }
  
 -static inline void
 -jump_label_rate_limit(struct static_key_deferred *key,
 - unsigned long rl)
 -{
 -}
 -
  #define STATIC_KEY_INIT_TRUE ((struct static_key) \
   { .enabled = ATOMIC_INIT(1) })
  #define STATIC_KEY_INIT_FALSE ((struct static_key) \
 @@ -203,6 +178,7 @@ jump_label_rate_limit(struct static_key_deferred *key,
  #define STATIC_KEY_INIT STATIC_KEY_INIT_FALSE
  #define jump_label_enabled static_key_enabled
  
 +static inline int atomic_read(const atomic_t *v);
  static inline bool static_key_enabled(struct static_key *key)
  {
  return (atomic_read(&key->enabled) > 0);
 diff --git a/include/linux/jump_label_ratelimit.h 
 b/include/linux/jump_label_ratelimit.h
 new file mode 100644
 index 000..1137883
 --- /dev/null
 +++ b/include/linux/jump_label_ratelimit.h
 @@ -0,0 +1,34 @@
 +#ifndef _LINUX_JUMP_LABEL_RATELIMIT_H
 +#define _LINUX_JUMP_LABEL_RATELIMIT_H
 +
 +#include <linux/jump_label.h>
 +#include <linux/workqueue.h>
 +
 +#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
 +struct static_key_deferred {
 + struct static_key key;
 + unsigned long timeout;
 + struct delayed_work work;
 +};
 +#endif
 +
 +#ifdef HAVE_JUMP_LABEL
 +extern void static_key_slow_dec_deferred(struct static_key_deferred *key);
 +extern void
 +jump_label_rate_limit(struct static_key_deferred *key, unsigned long rl);
 +
 +#else/* !HAVE_JUMP_LABEL */
 +struct static_key_deferred {
 + struct static_key  key
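
For context, after this split a consumer of the deferred key only needs the new header. An
illustrative user (the name example_events is mine, not from the patch):

    #include <linux/jump_label_ratelimit.h>

    static struct static_key_deferred example_events;      /* illustrative name */

    static void example_disable(void)
    {
            /* the decrement is rate-limited; the actual jump/nop patching is
             * deferred by the timeout set via jump_label_rate_limit() */
            static_key_slow_dec_deferred(&example_events);
    }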

Re: [PATCH RFC V9 12/19] xen: Enable PV ticketlocks on HVM Xen

2013-06-03 Thread Konrad Rzeszutek Wilk
On Sun, Jun 02, 2013 at 12:55:03AM +0530, Raghavendra K T wrote:
 xen: Enable PV ticketlocks on HVM Xen

There is more to it. You should also revert 
70dd4998cb85f0ecd6ac892cc7232abefa432efb

 
 From: Stefano Stabellini stefano.stabell...@eu.citrix.com
 
 Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
  arch/x86/xen/smp.c |1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
 index dcdc91c..8d2abf7 100644
 --- a/arch/x86/xen/smp.c
 +++ b/arch/x86/xen/smp.c
 @@ -682,4 +682,5 @@ void __init xen_hvm_smp_init(void)
   smp_ops.cpu_die = xen_hvm_cpu_die;
   smp_ops.send_call_func_ipi = xen_smp_send_call_function_ipi;
   smp_ops.send_call_func_single_ipi = 
 xen_smp_send_call_function_single_ipi;
 + xen_init_spinlocks();
  }
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V9 16/19] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-06-03 Thread Konrad Rzeszutek Wilk
On Sun, Jun 02, 2013 at 12:55:57AM +0530, Raghavendra K T wrote:
 kvm : Paravirtual ticketlocks support for linux guests running on KVM 
 hypervisor
 
 From: Srivatsa Vaddagiri va...@linux.vnet.ibm.com
 
 During smp_boot_cpus  paravirtualied KVM guest detects if the hypervisor has
 required feature (KVM_FEATURE_PV_UNHALT) to support pv-ticketlocks. If so,
  support for pv-ticketlocks is registered via pv_lock_ops.
 
 Use KVM_HC_KICK_CPU hypercall to wakeup waiting/halted vcpu.
 
 Signed-off-by: Srivatsa Vaddagiri va...@linux.vnet.ibm.com
 Signed-off-by: Suzuki Poulose suz...@in.ibm.com
 [Raghu: check_zero race fix, enum for kvm_contention_stat
 jumplabel related changes ]
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
  arch/x86/include/asm/kvm_para.h |   14 ++
  arch/x86/kernel/kvm.c   |  256 
 +++
  2 files changed, 268 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
 index 695399f..427afcb 100644
 --- a/arch/x86/include/asm/kvm_para.h
 +++ b/arch/x86/include/asm/kvm_para.h
 @@ -118,10 +118,20 @@ void kvm_async_pf_task_wait(u32 token);
  void kvm_async_pf_task_wake(u32 token);
  u32 kvm_read_and_reset_pf_reason(void);
  extern void kvm_disable_steal_time(void);
 -#else
 -#define kvm_guest_init() do { } while (0)
 +
 +#ifdef CONFIG_PARAVIRT_SPINLOCKS
 +void __init kvm_spinlock_init(void);
 +#else /* !CONFIG_PARAVIRT_SPINLOCKS */
 +static inline void kvm_spinlock_init(void)
 +{
 +}
 +#endif /* CONFIG_PARAVIRT_SPINLOCKS */
 +
 +#else /* CONFIG_KVM_GUEST */
 +#define kvm_guest_init() do {} while (0)
  #define kvm_async_pf_task_wait(T) do {} while(0)
  #define kvm_async_pf_task_wake(T) do {} while(0)
 +
  static inline u32 kvm_read_and_reset_pf_reason(void)
  {
   return 0;
 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
 index cd6d9a5..2715b92 100644
 --- a/arch/x86/kernel/kvm.c
 +++ b/arch/x86/kernel/kvm.c
 @@ -34,6 +34,7 @@
  #include linux/sched.h
  #include linux/slab.h
  #include linux/kprobes.h
 +#include linux/debugfs.h
  #include asm/timer.h
  #include asm/cpu.h
  #include asm/traps.h
 @@ -419,6 +420,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
   WARN_ON(kvm_register_clock(primary cpu clock));
   kvm_guest_cpu_init();
   native_smp_prepare_boot_cpu();
 + kvm_spinlock_init();
  }
  
  static void __cpuinit kvm_guest_cpu_online(void *dummy)
 @@ -523,3 +525,257 @@ static __init int activate_jump_labels(void)
   return 0;
  }
  arch_initcall(activate_jump_labels);
 +
 +/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
 +void kvm_kick_cpu(int cpu)
 +{
 + int apicid;
 +
 + apicid = per_cpu(x86_cpu_to_apicid, cpu);
 + kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
 +}
 +
 +#ifdef CONFIG_PARAVIRT_SPINLOCKS
 +
 +enum kvm_contention_stat {
 + TAKEN_SLOW,
 + TAKEN_SLOW_PICKUP,
 + RELEASED_SLOW,
 + RELEASED_SLOW_KICKED,
 + NR_CONTENTION_STATS
 +};
 +
 +#ifdef CONFIG_KVM_DEBUG_FS
 +#define HISTO_BUCKETS30
 +
 +static struct kvm_spinlock_stats
 +{
 + u32 contention_stats[NR_CONTENTION_STATS];
 + u32 histo_spin_blocked[HISTO_BUCKETS+1];
 + u64 time_blocked;
 +} spinlock_stats;
 +
 +static u8 zero_stats;
 +
 +static inline void check_zero(void)
 +{
 + u8 ret;
 + u8 old;
 +
 + old = ACCESS_ONCE(zero_stats);
 + if (unlikely(old)) {
 + ret = cmpxchg(&zero_stats, old, 0);
 + /* This ensures only one fellow resets the stat */
 + if (ret == old)
 + memset(&spinlock_stats, 0, sizeof(spinlock_stats));
 + }
 +}
 +
 +static inline void add_stats(enum kvm_contention_stat var, u32 val)
 +{
 + check_zero();
 + spinlock_stats.contention_stats[var] += val;
 +}
 +
 +
 +static inline u64 spin_time_start(void)
 +{
 + return sched_clock();
 +}
 +
 +static void __spin_time_accum(u64 delta, u32 *array)
 +{
 + unsigned index;
 +
 + index = ilog2(delta);
 + check_zero();
 +
 + if (index < HISTO_BUCKETS)
 + array[index]++;
 + else
 + array[HISTO_BUCKETS]++;
 +}
 +
 +static inline void spin_time_accum_blocked(u64 start)
 +{
 + u32 delta;
 +
 + delta = sched_clock() - start;
 + __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
 + spinlock_stats.time_blocked += delta;
 +}
 +
 +static struct dentry *d_spin_debug;
 +static struct dentry *d_kvm_debug;
 +
 +struct dentry *kvm_init_debugfs(void)
 +{
 + d_kvm_debug = debugfs_create_dir("kvm", NULL);
 + if (!d_kvm_debug)
 + printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
 +
 + return d_kvm_debug;
 +}
 +
 +static int __init kvm_spinlock_debugfs(void)
 +{
 + struct dentry *d_kvm;
 +
 + d_kvm = kvm_init_debugfs();
 + if (d_kvm == NULL)
 + return -ENOMEM;
 +
 + d_spin_debug = debugfs_create_dir("spinlocks", d_kvm);
 +
 + 
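
The registration step the changelog describes (guest boot path) looks roughly like this. It is a
sketch; the hook names are approximated from the series, not quoted verbatim:

    void __init kvm_spinlock_init(void)
    {
            if (!kvm_para_available())
                    return;
            /* Only hook the slowpath if the host advertises PV unhalt. */
            if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
                    return;

            pv_lock_ops.lock_spinning = kvm_lock_spinning;  /* wait: HLT after SPIN_THRESHOLD */
            pv_lock_ops.unlock_kick   = kvm_unlock_kick;    /* kick: KVM_HC_KICK_CPU hypercall */
    }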

Re: [PATCH RFC V9 5/19] xen/pvticketlock: Xen implementation for PV ticket locks

2013-06-03 Thread Konrad Rzeszutek Wilk
On Sat, Jun 01, 2013 at 12:23:14PM -0700, Raghavendra K T wrote:
 xen/pvticketlock: Xen implementation for PV ticket locks
 
 From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 
 Replace the old Xen implementation of PV spinlocks with and implementation
 of xen_lock_spinning and xen_unlock_kick.
 
 xen_lock_spinning simply registers the cpu in its entry in lock_waiting,
 adds itself to the waiting_cpus set, and blocks on an event channel
 until the channel becomes pending.
 
 xen_unlock_kick searches the cpus in waiting_cpus looking for the one
 which next wants this lock with the next ticket, if any.  If found,
 it kicks it by making its event channel pending, which wakes it up.
 
 We need to make sure interrupts are disabled while we're relying on the
 contents of the per-cpu lock_waiting values, otherwise an interrupt
 handler could come in, try to take some other lock, block, and overwrite
 our values.
 
 Raghu: use function + enum instead of macro, cmpxchg for zero status reset
 
 Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
  arch/x86/xen/spinlock.c |  347 
 +++
  1 file changed, 78 insertions(+), 269 deletions(-)
 
 diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
 index d6481a9..860e190 100644
 --- a/arch/x86/xen/spinlock.c
 +++ b/arch/x86/xen/spinlock.c
 @@ -16,45 +16,44 @@
  #include xen-ops.h
  #include debugfs.h
  
 -#ifdef CONFIG_XEN_DEBUG_FS
 -static struct xen_spinlock_stats
 -{
 - u64 taken;
 - u32 taken_slow;
 - u32 taken_slow_nested;
 - u32 taken_slow_pickup;
 - u32 taken_slow_spurious;
 - u32 taken_slow_irqenable;
 +enum xen_contention_stat {
 + TAKEN_SLOW,
 + TAKEN_SLOW_PICKUP,
 + TAKEN_SLOW_SPURIOUS,
 + RELEASED_SLOW,
 + RELEASED_SLOW_KICKED,
 + NR_CONTENTION_STATS
 +};
  
 - u64 released;
 - u32 released_slow;
 - u32 released_slow_kicked;
  
 +#ifdef CONFIG_XEN_DEBUG_FS
  #define HISTO_BUCKETS30
 - u32 histo_spin_total[HISTO_BUCKETS+1];
 - u32 histo_spin_spinning[HISTO_BUCKETS+1];
 +static struct xen_spinlock_stats
 +{
 + u32 contention_stats[NR_CONTENTION_STATS];
   u32 histo_spin_blocked[HISTO_BUCKETS+1];
 -
 - u64 time_total;
 - u64 time_spinning;
   u64 time_blocked;
  } spinlock_stats;
  
  static u8 zero_stats;
  
 -static unsigned lock_timeout = 1  10;
 -#define TIMEOUT lock_timeout
 -
  static inline void check_zero(void)
  {
 - if (unlikely(zero_stats)) {
 - memset(spinlock_stats, 0, sizeof(spinlock_stats));
 - zero_stats = 0;
 + u8 ret;
 + u8 old = ACCESS_ONCE(zero_stats);
 + if (unlikely(old)) {
 + ret = cmpxchg(&zero_stats, old, 0);
 + /* This ensures only one fellow resets the stat */
 + if (ret == old)
 + memset(&spinlock_stats, 0, sizeof(spinlock_stats));
   }
  }
  
 -#define ADD_STATS(elem, val) \
 - do { check_zero(); spinlock_stats.elem += (val); } while(0)
 +static inline void add_stats(enum xen_contention_stat var, u32 val)
 +{
 + check_zero();
 + spinlock_stats.contention_stats[var] += val;
 +}
  
  static inline u64 spin_time_start(void)
  {
 @@ -73,22 +72,6 @@ static void __spin_time_accum(u64 delta, u32 *array)
   array[HISTO_BUCKETS]++;
  }
  
 -static inline void spin_time_accum_spinning(u64 start)
 -{
 - u32 delta = xen_clocksource_read() - start;
 -
 - __spin_time_accum(delta, spinlock_stats.histo_spin_spinning);
 - spinlock_stats.time_spinning += delta;
 -}
 -
 -static inline void spin_time_accum_total(u64 start)
 -{
 - u32 delta = xen_clocksource_read() - start;
 -
 - __spin_time_accum(delta, spinlock_stats.histo_spin_total);
 - spinlock_stats.time_total += delta;
 -}
 -
  static inline void spin_time_accum_blocked(u64 start)
  {
   u32 delta = xen_clocksource_read() - start;
 @@ -98,19 +81,15 @@ static inline void spin_time_accum_blocked(u64 start)
  }
  #else  /* !CONFIG_XEN_DEBUG_FS */
  #define TIMEOUT  (1  10)
 -#define ADD_STATS(elem, val) do { (void)(val); } while(0)
 +static inline void add_stats(enum xen_contention_stat var, u32 val)
 +{
 +}
  
  static inline u64 spin_time_start(void)
  {
   return 0;
  }
  
 -static inline void spin_time_accum_total(u64 start)
 -{
 -}
 -static inline void spin_time_accum_spinning(u64 start)
 -{
 -}
  static inline void spin_time_accum_blocked(u64 start)
  {
  }
 @@ -133,229 +112,82 @@ typedef u16 xen_spinners_t;
  asm(LOCK_PREFIX " decw %0" : "+m" ((xl)->spinners) : : "memory");
  #endif
  
 -struct xen_spinlock {
 - unsigned char lock; /* 0 - free; 1 - locked */
 - xen_spinners_t spinners;/* count of waiting cpus */
 +struct xen_lock_waiting {
 + struct arch_spinlock *lock
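
The kick side described in the changelog, sketched below - close to, but not quoted from, the
patch itself:

    static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
    {
            int cpu;

            for_each_cpu(cpu, &waiting_cpus) {
                    const struct xen_lock_waiting *w = &per_cpu(lock_waiting, cpu);

                    /* Wake only the waiter that wants this lock at this ticket. */
                    if (w->lock == lock && w->want == next) {
                            xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
                            break;
                    }
            }
    }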

Re: [PATCH RFC V9 19/19] kvm hypervisor: Add directed yield in vcpu block path

2013-06-03 Thread Konrad Rzeszutek Wilk
On Sun, Jun 02, 2013 at 12:56:45AM +0530, Raghavendra K T wrote:
 kvm hypervisor: Add directed yield in vcpu block path
 
 From: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 
 We use the improved PLE handler logic in vcpu block patch for
 scheduling rather than plain schedule, so that we can make
 intelligent decisions

You are missing '.' there, and

 
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
  arch/ia64/include/asm/kvm_host.h|5 +
  arch/powerpc/include/asm/kvm_host.h |5 +
  arch/s390/include/asm/kvm_host.h|5 +
  arch/x86/include/asm/kvm_host.h |2 +-
  arch/x86/kvm/x86.c  |8 
  include/linux/kvm_host.h|2 +-
  virt/kvm/kvm_main.c |6 --
  7 files changed, 29 insertions(+), 4 deletions(-)
 
 diff --git a/arch/ia64/include/asm/kvm_host.h 
 b/arch/ia64/include/asm/kvm_host.h
 index 989dd3f..999ab15 100644
 --- a/arch/ia64/include/asm/kvm_host.h
 +++ b/arch/ia64/include/asm/kvm_host.h
 @@ -595,6 +595,11 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu);
  int kvm_pal_emul(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run);
  void kvm_sal_emul(struct kvm_vcpu *vcpu);
  
 +static inline void kvm_do_schedule(struct kvm_vcpu *vcpu)
 +{
 + schedule();
 +}
 +
  #define __KVM_HAVE_ARCH_VM_ALLOC 1
  struct kvm *kvm_arch_alloc_vm(void);
  void kvm_arch_free_vm(struct kvm *kvm);
 diff --git a/arch/powerpc/include/asm/kvm_host.h 
 b/arch/powerpc/include/asm/kvm_host.h
 index af326cd..1aeecc0 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -628,4 +628,9 @@ struct kvm_vcpu_arch {
  #define __KVM_HAVE_ARCH_WQP
  #define __KVM_HAVE_CREATE_DEVICE
  
 +static inline void kvm_do_schedule(struct kvm_vcpu *vcpu)
 +{
 + schedule();
 +}
 +
  #endif /* __POWERPC_KVM_HOST_H__ */
 diff --git a/arch/s390/include/asm/kvm_host.h 
 b/arch/s390/include/asm/kvm_host.h
 index 16bd5d1..db09a56 100644
 --- a/arch/s390/include/asm/kvm_host.h
 +++ b/arch/s390/include/asm/kvm_host.h
 @@ -266,4 +266,9 @@ struct kvm_arch{
  };
  
  extern int sie64a(struct kvm_s390_sie_block *, u64 *);
 +static inline void kvm_do_schedule(struct kvm_vcpu *vcpu)
 +{
 + schedule();
 +}
 +
  #endif
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 95702de..72ff791 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -1042,5 +1042,5 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct 
 msr_data *msr_info);
  int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
  void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
  void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
 -
 +void kvm_do_schedule(struct kvm_vcpu *vcpu);
  #endif /* _ASM_X86_KVM_HOST_H */
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index b963c86..d26c4be 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -7281,6 +7281,14 @@ bool kvm_arch_can_inject_async_page_present(struct 
 kvm_vcpu *vcpu)
  kvm_x86_ops->interrupt_allowed(vcpu);
  }
  
 +void kvm_do_schedule(struct kvm_vcpu *vcpu)
 +{
 + /* We try to yield to a kikced vcpu else do a schedule */

s/kikced/kicked/

 + if (kvm_vcpu_on_spin(vcpu) <= 0)
 + schedule();
 +}
 +EXPORT_SYMBOL_GPL(kvm_do_schedule);
 +
  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index f0eea07..39efc18 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -565,7 +565,7 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct 
 kvm_memory_slot *memslot,
  void kvm_vcpu_block(struct kvm_vcpu *vcpu);
  void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
  bool kvm_vcpu_yield_to(struct kvm_vcpu *target);
 -void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
 +bool kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
  void kvm_resched(struct kvm_vcpu *vcpu);
  void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
  void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index 302681c..8387247 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -1685,7 +1685,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
   if (signal_pending(current))
   break;
  
 - schedule();
 + kvm_do_schedule(vcpu);
   }
  
  finish_wait(&vcpu->wq, &wait);
 @@ -1786,7 +1786,7 @@ bool kvm_vcpu_eligible_for_directed_yield(struct 
 kvm_vcpu *vcpu)
  }
  #endif
  
 -void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 +bool kvm_vcpu_on_spin(struct kvm_vcpu *me)
  {
  struct kvm *kvm = me->kvm;
   struct kvm_vcpu *vcpu;
 @@ -1835,6 +1835,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
  
   /* Ensure vcpu is not eligible during next spinloop */
   kvm_vcpu_set_dy_eligible(me, false);
 +
 + return yielded;
  }
  

Re: [PATCH RFC V9 18/19] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock

2013-06-03 Thread Konrad Rzeszutek Wilk
On Sun, Jun 02, 2013 at 12:56:24AM +0530, Raghavendra K T wrote:
 Documentation/kvm : Add documentation on Hypercalls and features used for PV 
 spinlock
 
 From: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 
 KVM_HC_KICK_CPU  hypercall added to wakeup halted vcpu in paravirtual spinlock
 enabled guest.
 
 KVM_FEATURE_PV_UNHALT enables guest to check whether pv spinlock can be 
 enabled
 in guest.
 
 Thanks Vatsa for rewriting KVM_HC_KICK_CPU
 
 Signed-off-by: Srivatsa Vaddagiri va...@linux.vnet.ibm.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
  Documentation/virtual/kvm/cpuid.txt  |4 
  Documentation/virtual/kvm/hypercalls.txt |   13 +
  2 files changed, 17 insertions(+)
 
 diff --git a/Documentation/virtual/kvm/cpuid.txt 
 b/Documentation/virtual/kvm/cpuid.txt
 index 83afe65..654f43c 100644
 --- a/Documentation/virtual/kvm/cpuid.txt
 +++ b/Documentation/virtual/kvm/cpuid.txt
 @@ -43,6 +43,10 @@ KVM_FEATURE_CLOCKSOURCE2   || 3 || kvmclock 
 available at msrs
  KVM_FEATURE_ASYNC_PF   || 4 || async pf can be enabled by
 ||   || writing to msr 0x4b564d02
  
 --
 +KVM_FEATURE_PV_UNHALT  || 6 || guest checks this feature bit
 +   ||   || before enabling 
 paravirtualized
 +   ||   || spinlock support.
 +--
  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no 
 guest-side
 ||   || per-cpu warps are expected in
 ||   || kvmclock.
 diff --git a/Documentation/virtual/kvm/hypercalls.txt 
 b/Documentation/virtual/kvm/hypercalls.txt
 index ea113b5..2a4da11 100644
 --- a/Documentation/virtual/kvm/hypercalls.txt
 +++ b/Documentation/virtual/kvm/hypercalls.txt
 @@ -64,3 +64,16 @@ Purpose: To enable communication between the hypervisor 
 and guest there is a
  shared page that contains parts of supervisor visible register state.
  The guest can map this shared page to access its supervisor register through
  memory using this hypercall.
 +
 +5. KVM_HC_KICK_CPU
 +
 +Architecture: x86
 +Status: active
 +Purpose: Hypercall used to wakeup a vcpu from HLT state
 +Usage example : A vcpu of a paravirtualized guest that is busywaiting in 
 guest
 +kernel mode for an event to occur (ex: a spinlock to become available) can
 +execute HLT instruction once it has busy-waited for more than a threshold
 +time-interval. Execution of HLT instruction would cause the hypervisor to put
 +the vcpu to sleep until occurence of an appropriate event. Another vcpu of 
 the
 +same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall,
 +specifying APIC ID of the vcpu to be wokenup.

woken up.
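
A compressed sketch of the wait/kick protocol this document describes (guest side only;
SPIN_THRESHOLD and the hypercall are as in the series, the helper names are mine):

    static void wait_for_ticket(arch_spinlock_t *lock, __ticket_t want)
    {
            u64 loops = 0;

            while (ACCESS_ONCE(lock->tickets.head) != want) {
                    if (++loops < SPIN_THRESHOLD) {
                            cpu_relax();
                            continue;
                    }
                    safe_halt();    /* HLT; resumed by KVM_HC_KICK_CPU from the unlocker */
            }
    }

    static void kick_vcpu(int cpu)
    {
            /* argument is the APIC ID of the halted vcpu, per the text above */
            kvm_hypercall1(KVM_HC_KICK_CPU, per_cpu(x86_cpu_to_apicid, cpu));
    }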
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/2] vfio: Provide module option to disable vfio_iommu_type1 hugepage support

2013-05-28 Thread Konrad Rzeszutek Wilk
On Tue, May 28, 2013 at 10:27:52AM -0600, Alex Williamson wrote:
 Add a module option to vfio_iommu_type1 to disable IOMMU hugepage
 support.  This causes iommu_map to only be called with single page
 mappings, disabling the IOMMU driver's ability to use hugepages.
 This option can be enabled by loading vfio_iommu_type1 with
 disable_hugepages=1 or dynamically through sysfs.  If enabled
 dynamically, only new mappings are restricted.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com

Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 ---

 
 As suggested by Konrad.  This is cleaner to add as a follow-on
 
  drivers/vfio/vfio_iommu_type1.c |   11 +++
  1 file changed, 11 insertions(+)
 
 diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
 index 6654a7e..8a2be4e 100644
 --- a/drivers/vfio/vfio_iommu_type1.c
 +++ b/drivers/vfio/vfio_iommu_type1.c
 @@ -48,6 +48,12 @@ module_param_named(allow_unsafe_interrupts,
  MODULE_PARM_DESC(allow_unsafe_interrupts,
Enable VFIO IOMMU support for on platforms without interrupt 
 remapping support.);
  
 +static bool disable_hugepages;
 +module_param_named(disable_hugepages,
 +disable_hugepages, bool, S_IRUGO | S_IWUSR);
 +MODULE_PARM_DESC(disable_hugepages,
 +  Disable VFIO IOMMU support for IOMMU hugepages.);
 +
  struct vfio_iommu {
   struct iommu_domain *domain;
   struct mutexlock;
 @@ -270,6 +276,11 @@ static long vfio_pin_pages(unsigned long vaddr, long 
 npage,
   return -ENOMEM;
   }
  
 + if (unlikely(disable_hugepages)) {
 + vfio_lock_acct(1);
 + return 1;
 + }
 +
   /* Lock all the consecutive pages from pfn_base */
   for (i = 1, vaddr += PAGE_SIZE; i  npage; i++, vaddr += PAGE_SIZE) {
   unsigned long pfn = 0;
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] vfio: hugepage support for vfio_iommu_type1

2013-05-25 Thread Konrad Rzeszutek Wilk
 + * Turns out AMD IOMMU has a page table bug where it won't map large pages
 + * to a region that previously mapped smaller pages.  This should be fixed
 + * soon, so this is just a temporary workaround to break mappings down into
 + * PAGE_SIZE.  Better to map smaller pages than nothing.
 + */
 +static int map_try_harder(struct vfio_iommu *iommu, dma_addr_t iova,
 +   unsigned long pfn, long npage, int prot)
 +{
 + long i;
 + int ret;
 +
 + for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
 + ret = iommu_map(iommu->domain, iova,
 + (phys_addr_t)pfn << PAGE_SHIFT,
 + PAGE_SIZE, prot);
 + if (ret)
 + break;
 + }
 +
 + for (; i < npage && i > 0; i--, iova -= PAGE_SIZE)
 + iommu_unmap(iommu->domain, iova, PAGE_SIZE);
 +
   return ret;
  }

This looks to belong to a vfio-quirk file (a something else) that deals with
various IOMMU's quirks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] vfio: type1 iommu hugepage support

2013-05-25 Thread Konrad Rzeszutek Wilk
On Fri, May 24, 2013 at 11:24:26AM -0600, Alex Williamson wrote:
 This series let's the vfio type1 iommu backend take advantage of iommu
 large page support.  See patch 2/2 for the details.  This has been
 tested on both amd_iommu and intel_iommu, but only my AMD system has
 large page support.  I'd appreciate any testing and feedback on other
 systems, particularly vt-d systems supporting large pages.  Mapping
 efficiency should be improved a bit without iommu hugepages, but I
 hope that it's much more noticeable with huge pages, especially for
 very large QEMU guests.

I took a very, very quick look - and I am wondering if there should also
be a flag to turn it on/off in the kernel in such a case? Especially in the 
field, if a user finds out that their particular IOMMU chipset might
be doing something funky with large pages?

 
 This change includes a clarification to the mapping expectations for
 users of the type1 iommu, but is compatible with known users and works
 with existing QEMU userspace supporting vfio.  Thanks,
 
 Alex
 
 ---
 
 Alex Williamson (2):
   vfio: Convert type1 iommu to use rbtree
   vfio: hugepage support for vfio_iommu_type1
 
 
  drivers/vfio/vfio_iommu_type1.c |  607 
 ---
  include/uapi/linux/vfio.h   |8 -
  2 files changed, 387 insertions(+), 228 deletions(-)
 ___
 iommu mailing list
 io...@lists.linux-foundation.org
 https://lists.linuxfoundation.org/mailman/listinfo/iommu
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] Expand the steal time msr to also contain the consigned time.

2012-11-27 Thread Konrad Rzeszutek Wilk
On Mon, Nov 26, 2012 at 02:36:45PM -0600, Michael Wolf wrote:
 Add a consigned field.  This field will hold the time lost due to capping or 
 overcommit.
 The rest of the time will still show up in the steal-time field.
 
 Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
 ---
  arch/x86/include/asm/paravirt.h   |4 ++--
  arch/x86/include/asm/paravirt_types.h |2 +-
  arch/x86/kernel/kvm.c |7 ++-
  kernel/sched/core.c   |   10 +-
  kernel/sched/cputime.c|2 +-
  5 files changed, 15 insertions(+), 10 deletions(-)
 
 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
 index a0facf3..a5f9f30 100644
 --- a/arch/x86/include/asm/paravirt.h
 +++ b/arch/x86/include/asm/paravirt.h
 @@ -196,9 +196,9 @@ struct static_key;
  extern struct static_key paravirt_steal_enabled;
  extern struct static_key paravirt_steal_rq_enabled;
  
 -static inline u64 paravirt_steal_clock(int cpu)
 +static inline u64 paravirt_steal_clock(int cpu, u64 *steal)

So it's u64 here.
  {
 - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
 + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
  }
  
  static inline unsigned long long paravirt_read_pmc(int counter)
 diff --git a/arch/x86/include/asm/paravirt_types.h 
 b/arch/x86/include/asm/paravirt_types.h
 index 142236e..5d4fc8b 100644
 --- a/arch/x86/include/asm/paravirt_types.h
 +++ b/arch/x86/include/asm/paravirt_types.h
 @@ -95,7 +95,7 @@ struct pv_lazy_ops {
  
  struct pv_time_ops {
   unsigned long long (*sched_clock)(void);
 - unsigned long long (*steal_clock)(int cpu);
 + void (*steal_clock)(int cpu, unsigned long long *steal);

But not u64 here? Any particular reason?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-28 Thread Konrad Rzeszutek Wilk
  PLE:
  - works for unmodified / non-Linux guests
  - works for all types of spins (e.g. smp_call_function*())
  - utilizes an existing hardware interface (PAUSE instruction) so likely
  more robust compared to a software interface
 
  PV:
  - has more information, so it can perform better
  
  Should we also consider that we always have an edge here for non-PLE
  machine?
 
 True.  The deployment share for these is decreasing rapidly though.  I
 hate optimizing for obsolete hardware.

Keep in mind that the patchset that Jeremy provided also cleans up (removes)
parts of the pv spinlock code. It removes the various spin_lock,
spin_unlock, etc. hooks that touch paravirt code. Instead the pv code is only
in the slowpath. And if you don't compile with CONFIG_PARAVIRT_SPINLOCK
the end code is the same as it is now.
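
Roughly what "only in the slowpath" means in code - paraphrased from the series, not quoted:

    #ifdef CONFIG_PARAVIRT_SPINLOCKS
    /* pvops hook, patched at boot when running virtualized */
    static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock,
                                                       __ticket_t ticket)
    {
            PVOP_VCALL2(pv_lock_ops.lock_spinning, lock, ticket);
    }
    #else
    /* bare metal / !CONFIG_PARAVIRT_SPINLOCKS: compiles away entirely */
    static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock,
                                                       __ticket_t ticket)
    {
    }
    #endif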

On a different subject - I am curious whether the new Haswell locking
instructions (the transactional ones?) can be put to use for the slow
case?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-26 Thread Konrad Rzeszutek Wilk
On Tue, Sep 25, 2012 at 05:00:30PM +0200, Dor Laor wrote:
 On 09/24/2012 02:02 PM, Raghavendra K T wrote:
 On 09/24/2012 02:12 PM, Dor Laor wrote:
 In order to help PLE and pvticketlock converge I thought that a small
 test code should be developed to test this in a predictable,
 deterministic way.
 
 The idea is to have a guest kernel module that spawn a new thread each
 time you write to a /sys/ entry.
 
 Each such a thread spins over a spin lock. The specific spin lock is
 also chosen by the /sys/ interface. Let's say we have an array of spin
 locks *10 times the amount of vcpus.
 
 All the threads are running a
 while (1) {
 
 spin_lock(&my_lock);
 sum += execute_dummy_cpu_computation(time);
 spin_unlock(&my_lock);
 
 if (sys_tells_thread_to_die()) break;
 }
 
 print_result(sum);
 
 Instead of calling the kernel's spin_lock functions, clone them and make
 the ticket lock order deterministic and known (like a linear walk of all
 the threads trying to catch that lock).
 
 By Cloning you mean hierarchy of the locks?
 
 No, I meant to clone the implementation of the current spin lock
 code in order to set any order you may like for the ticket
 selection.
 (even for a non pvticket lock version)

Wouldn't that defeat the purpose of trying to test the different
implementations that try to fix the lock-holder preemption problem?
You want something that you can shoe-in for all work-loads - also
for this test system.
 
 For instance, let's say you have N threads trying to grab the lock,
 you can always make the ticket go linearly from 1-2...-N.
 Not sure it's a good idea, just a recommendation.

So round-robin. Could you make NCPUS threads, pin them to CPUs, and set
them to be SCHED_RR? Or NCPUS*2 to overcommit.
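
A userspace sketch of that suggestion (plain pthreads; error handling omitted, names are mine):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void pin_and_make_rr(pthread_t thread, int cpu)
    {
            cpu_set_t set;
            struct sched_param sp = { .sched_priority = 1 };

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            pthread_setaffinity_np(thread, sizeof(set), &set);

            /* round-robin among the spinner threads sharing a CPU */
            pthread_setschedparam(thread, SCHED_RR, &sp);
    }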

 
 Also I believe time should be passed via sysfs / hardcoded for each
 type of lock we are mimicking
 
 Yap
 
 
 
 This way you can easy calculate:
 1. the score of a single vcpu running a single thread
 2. the score of sum of all thread scores when #thread==#vcpu all
 taking the same spin lock. The overall sum should be close as
 possible to #1.
 3. Like #2 but #threads  #vcpus and other versions of #total vcpus
 (belonging to all VMs)  #pcpus.
 4. Create #thread == #vcpus but let each thread have it's own spin
 lock
 5. Like 4 + 2
 
 Hopefully this way will allows you to judge and evaluate the exact
 overhead of scheduling VMs and threads since you have the ideal result
 in hand and you know what the threads are doing.
 
 My 2 cents, Dor
 
 
 Thank you,
 I think this is an excellent idea. ( Though I am trying to put all the
 pieces together you mentioned). So overall we should be able to measure
 the performance of pvspinlock/PLE improvements with a deterministic
 load in guest.
 
 Only thing I am missing is,
 How to generate different combinations of the lock.
 
 Okay, let me see if I can come with a solid model for this.
 
 
 Do you mean the various options for PLE/pvticket/other? I haven't
 thought of it and assumed its static but it can also be controlled
 through the temporary /sys interface.
 
 Thanks for following up!
 Dor
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-14 Thread Konrad Rzeszutek Wilk
 The concern I have is that even though we have gone through changes to
 help reduce the candidate vcpus we yield to, we still have a very poor
 idea of which vcpu really needs to run.  The result is high cpu usage in
 the get_pid_task and still some contention in the double runqueue lock.
 To make this scalable, we either need to significantly reduce the
 occurrence of the lock-holder preemption, or do a much better job of
 knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
 which do not need to run).

The patches that Raghavendra  has been posting do accomplish that.

 On reducing the occurrence:  The worst case for lock-holder preemption
 is having vcpus of same VM on the same runqueue.  This guarantees the
 situation of 1 vcpu running while another [of the same VM] is not.  To
 prove the point, I ran the same test, but with vcpus restricted to a
 range of host cpus, such that any single VM's vcpus can never be on the
 same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
 vcpu-1's are on host cpus 5-9, and so on.  Here is the result:

 kvm_cpu_spin, and all
 yield_to changes, plus
 restricted vcpu placement:  8823 +/- 3.20%   much, much better

 On picking a better vcpu to yield to:  I really hesitate to rely on
 paravirt hint [telling us which vcpu is holding a lock], but I am not
 sure how else to reduce the candidate vcpus to yield to.  I suspect we
 are yielding to way more vcpus than are preempted lock-holders, and that
 IMO is just work accomplishing nothing.  Trying to think of a way to
 further reduce the candidate vcpus...

... the patches are posted -  you could try them out?


Re: [RFC 07/10] KVM: add KVM TMEM host side interface

2012-06-28 Thread Konrad Rzeszutek Wilk
On Wed, Jun 06, 2012 at 01:00:15PM +0200, Sasha Levin wrote:
 This is the host side interface that the guests which support KVM TMEM
 talk to.
 
 Signed-off-by: Sasha Levin levinsasha...@gmail.com
 ---
  arch/x86/kvm/tmem/Kconfig|6 +++
  arch/x86/kvm/tmem/Makefile   |2 +
  arch/x86/kvm/tmem/host.c |   78 
 ++
  arch/x86/kvm/tmem/host.h |   20 +
  arch/x86/kvm/x86.c   |8 +---
  drivers/staging/zcache/zcache-main.c |   35 +++-
  6 files changed, 141 insertions(+), 8 deletions(-)
  create mode 100644 arch/x86/kvm/tmem/host.c
  create mode 100644 arch/x86/kvm/tmem/host.h
 
 diff --git a/arch/x86/kvm/tmem/Kconfig b/arch/x86/kvm/tmem/Kconfig
 index 15d8301..1a59e4f 100644
 --- a/arch/x86/kvm/tmem/Kconfig
 +++ b/arch/x86/kvm/tmem/Kconfig
 @@ -13,4 +13,10 @@ menuconfig KVM_TMEM
  
  if KVM_TMEM
  
 +config KVM_TMEM_HOST
 + bool "Host-side KVM TMEM"
 + ---help---
 + With this option on, the KVM host will be able to process KVM TMEM requests
 + coming from guests.
 +
  endif # KVM_TMEM
 diff --git a/arch/x86/kvm/tmem/Makefile b/arch/x86/kvm/tmem/Makefile
 index 6812d46..706cd36 100644
 --- a/arch/x86/kvm/tmem/Makefile
 +++ b/arch/x86/kvm/tmem/Makefile
 @@ -1 +1,3 @@
  ccflags-y += -Idrivers/staging/zcache/
 +
 +obj-$(CONFIG_KVM_TMEM_HOST)  += host.o
 diff --git a/arch/x86/kvm/tmem/host.c b/arch/x86/kvm/tmem/host.c
 new file mode 100644
 index 000..9e73395
 --- /dev/null
 +++ b/arch/x86/kvm/tmem/host.c
 @@ -0,0 +1,78 @@
 +/*
 + * KVM TMEM host side interface
 + *
 + * Copyright (c) 2012 Sasha Levin
 + *
 + */
 +
 +#include <linux/kvm_types.h>
 +#include <linux/kvm_host.h>
 +
 +#include "tmem.h"
 +#include "zcache.h"
 +
 +int use_kvm_tmem_host = 1;

__read_mostly and bool
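i.e. roughly this (a sketch of the suggested declaration, not a tested patch):

static bool use_kvm_tmem_host __read_mostly = true;

static int __init no_kvmtmemhost(char *s)
{
	use_kvm_tmem_host = false;
	return 1;
}
__setup("nokvmtmemhost", no_kvmtmemhost);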

 +
 +static int no_kvmtmemhost(char *s)
 +{
 + use_kvm_tmem_host = 0;
 + return 1;
 +}
 +
 +__setup("nokvmtmemhost", no_kvmtmemhost);
 +
 +int kvm_pv_tmem_op(struct kvm_vcpu *vcpu, gpa_t addr, unsigned long *ret)
 +{
 + struct tmem_kvm_op op;
 + struct page *page;
 + int r;
 + unsigned long flags;
 +
 + if (!use_kvm_tmem_host || !zcache_enabled) {
 + *ret = -ENXIO;
 + return 0;
 + }
 +
 + r = kvm_read_guest(vcpu->kvm, addr, &op, sizeof(op));
 + if (r < 0) {
 + *ret = r;
 + return 0;

Shouldn't this return r?
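i.e. something like this (sketch of the suggested error path, untested):

	r = kvm_read_guest(vcpu->kvm, addr, &op, sizeof(op));
	if (r < 0)
		return r;	/* propagate the failure instead of eating it */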

 + }
 +
 + switch (op.cmd) {
 + case TMEM_NEW_POOL:
 + *ret = zcache_new_pool(op.cli_id, op.u.new.flags);
 + break;
 + case TMEM_DESTROY_POOL:
 + *ret = zcache_destroy_pool(op.cli_id, op.pool_id);
 + break;
 + case TMEM_NEW_PAGE:
 + break;
 + case TMEM_PUT_PAGE:
 + page = gfn_to_page(vcpu->kvm, op.u.gen.gfn);
 + local_irq_save(flags);
 + *ret = zcache_put_page(op.cli_id, op.pool_id,
 + op.u.gen.oid, op.u.gen.index, page);
 + local_irq_restore(flags);
 + break;
 + case TMEM_GET_PAGE:
 + page = gfn_to_page(vcpu->kvm, op.u.gen.gfn);
 + local_irq_save(flags);
 + *ret = zcache_get_page(op.cli_id, op.pool_id,
 + op.u.gen.oid, op.u.gen.index, page);
 + local_irq_restore(flags);
 + break;
 + case TMEM_FLUSH_PAGE:
 + local_irq_save(flags);
 + *ret = zcache_flush_page(op.cli_id, op.pool_id,
 + op.u.gen.oid, op.u.gen.index);
 + local_irq_restore(flags);
 + break;
 + case TMEM_FLUSH_OBJECT:
 + local_irq_save(flags);
 + *ret = zcache_flush_object(op.cli_id, op.pool_id, 
 op.u.gen.oid);
 + local_irq_restore(flags);
 + break;
 + }
 + return 0;
 +}
 diff --git a/arch/x86/kvm/tmem/host.h b/arch/x86/kvm/tmem/host.h
 new file mode 100644
 index 000..17ba0c4
 --- /dev/null
 +++ b/arch/x86/kvm/tmem/host.h
 @@ -0,0 +1,20 @@
 +#ifndef _KVM_TMEM_HOST_H_
 +#define _KVM_TMEM_HOST_H_
 +
 +#ifdef CONFIG_KVM_TMEM_HOST
 +
 +extern int use_kvm_tmem_host;
 +
 +extern int kvm_pv_tmem_op(struct kvm_vcpu *vcpu, gpa_t addr, unsigned long 
 *ret);
 +
 +#else
 +
 +static inline int kvm_pv_tmem_op(struct kvm_vcpu *vcpu, gpa_t addr, unsigned 
 long *ret)
 +{
 + *ret = -ENOSUPP;
 + return 0;
 +}
 +
 +#endif
 +
 +#endif
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 4c5b6ab..c92d4c8 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -27,6 +27,7 @@
  #include kvm_cache_regs.h
  #include x86.h
  #include cpuid.h
 +#include tmem/host.h
  
  #include linux/clocksource.h
  #include linux/interrupt.h
 @@ -4993,13 +4994,6 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
   return 1;
  }
  
 -static int kvm_pv_tmem_op(struct kvm_vcpu *vcpu, gpa_t addr, unsigned long 
 *ret)
 -{
 - *ret = -ENOTSUPP;
 -
 - return 0;
 -}
 -
  

Re: [PATCH 11/13] pci: Create common pcibios_err_to_errno

2012-05-21 Thread Konrad Rzeszutek Wilk
On Fri, May 11, 2012 at 04:56:44PM -0600, Alex Williamson wrote:
 For returning errors out to non-PCI code.  Re-name xen's version.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com

Acked-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 ---
 
  drivers/xen/xen-pciback/conf_space.c |6 +++---
  include/linux/pci.h  |   26 ++
  2 files changed, 29 insertions(+), 3 deletions(-)
 
 diff --git a/drivers/xen/xen-pciback/conf_space.c 
 b/drivers/xen/xen-pciback/conf_space.c
 index 30d7be0..46ae0f9 100644
 --- a/drivers/xen/xen-pciback/conf_space.c
 +++ b/drivers/xen/xen-pciback/conf_space.c
 @@ -124,7 +124,7 @@ static inline u32 merge_value(u32 val, u32 new_val, u32 
 new_val_mask,
   return val;
  }
  
 -static int pcibios_err_to_errno(int err)
 +static int xen_pcibios_err_to_errno(int err)
  {
   switch (err) {
   case PCIBIOS_SUCCESSFUL:
 @@ -202,7 +202,7 @@ out:
  pci_name(dev), size, offset, value);
  
   *ret_val = value;
 - return pcibios_err_to_errno(err);
 + return xen_pcibios_err_to_errno(err);
  }
  
  int xen_pcibk_config_write(struct pci_dev *dev, int offset, int size, u32 
 value)
 @@ -290,7 +290,7 @@ int xen_pcibk_config_write(struct pci_dev *dev, int 
 offset, int size, u32 value)
   }
   }
  
 - return pcibios_err_to_errno(err);
 + return xen_pcibios_err_to_errno(err);
  }
  
  void xen_pcibk_config_free_dyn_fields(struct pci_dev *dev)
 diff --git a/include/linux/pci.h b/include/linux/pci.h
 index b437225..20a8f2e 100644
 --- a/include/linux/pci.h
 +++ b/include/linux/pci.h
 @@ -467,6 +467,32 @@ static inline bool pci_dev_msi_enabled(struct pci_dev 
 *pci_dev) { return false;
  #define PCIBIOS_SET_FAILED   0x88
  #define PCIBIOS_BUFFER_TOO_SMALL 0x89
  
 +/*
 + * Translate above to generic errno for passing back through non-pci.
 + */
 +static inline int pcibios_err_to_errno(int err)
 +{
 + if (err <= PCIBIOS_SUCCESSFUL)
 + return err; /* Assume already errno */
 +
 + switch (err) {
 + case PCIBIOS_FUNC_NOT_SUPPORTED:
 + return -ENOENT;
 + case PCIBIOS_BAD_VENDOR_ID:
 + return -EINVAL;
 + case PCIBIOS_DEVICE_NOT_FOUND:
 + return -ENODEV;
 + case PCIBIOS_BAD_REGISTER_NUMBER:
 + return -EFAULT;
 + case PCIBIOS_SET_FAILED:
 + return -EIO;
 + case PCIBIOS_BUFFER_TOO_SMALL:
 + return -ENOSPC;
 + }
 +
 + return -ENOTTY;
 +}
 +
  /* Low-level architecture-dependent routines */
  
  struct pci_ops {
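For context, this is roughly how a non-PCI caller would use the new generic helper (a hedged sketch, not part of the patch; read_vendor is an invented example):

/* Translate a config-space access result into a normal errno
 * before handing it back to non-PCI code. */
static int read_vendor(struct pci_dev *dev, u16 *vendor)
{
	int err = pci_read_config_word(dev, PCI_VENDOR_ID, vendor);

	return pcibios_err_to_errno(err);
}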


Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-04-16 Thread Konrad Rzeszutek Wilk
On Sat, Mar 31, 2012 at 09:37:45AM +0530, Srivatsa Vaddagiri wrote:
 * Thomas Gleixner t...@linutronix.de [2012-03-31 00:07:58]:
 
  I know that Peter is going to go berserk on me, but if we are running
  a paravirt guest then it's simple to provide a mechanism which allows
  the host (aka hypervisor) to check that in the guest just by looking
  at some global state.
  
  So if a guest exits due to an external event it's easy to inspect the
  state of that guest and avoid to schedule away when it was interrupted
  in a spinlock held section. That guest/host shared state needs to be
  modified to indicate the guest to invoke an exit when the last nested
  lock has been released.
 
 I had attempted something like that long back:
 
 http://lkml.org/lkml/2010/6/3/4
 
 The issue is with ticketlocks though. VCPUs could go into a spin w/o
 a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
 that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
 and releases the lock. VCPU2 is next eligible to take the lock. If 
 that is not scheduled early enough by host, then remaining vcpus would keep 
 spinning (even though lock is technically not held by anybody) w/o making 
 forward progress.
 
 In that situation, what we really need is for the guest to hint to host
 scheduler to schedule VCPU1 early (via yield_to or something similar). 
 
 The current pv-spinlock patches however does not track which vcpu is
 spinning at what head of the ticketlock. I suppose we can consider 
 that optimization in the future and see how much benefit it provides (over
 plain yield/sleep the way it's done now).

Right. I think Jeremy played around with this some time?
 
 Do you see any issues if we take in what we have today and address the
 finer-grained optimization as next step?

I think that is the proper course - these patches show
that on baremetal we don't incur performance regressions and in
virtualization case we benefit greatly. Since these are the basic
building blocks of a kernel - taking it slow and just adding
this set of patches for v3.5 is a good idea - and then building on top
of that for further refinement.

 
 - vatsa 


Re: [Xen-devel] [PATCH RFC V5 1/6] debugfs: Add support to print u32 array in debugfs

2012-03-30 Thread Konrad Rzeszutek Wilk
On Fri, Mar 23, 2012 at 01:36:28PM +0530, Raghavendra K T wrote:
 From: Srivatsa Vaddagiri va...@linux.vnet.ibm.com
 
 Move the code from Xen to debugfs to make the code common
 for other users as well.
 
 Signed-off-by: Srivatsa Vaddagiri va...@linux.vnet.ibm.com
 Signed-off-by: Suzuki Poulose suz...@in.ibm.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 Signed-off-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com

Greg,

I was thinking to stick this patch in my queue, but I need your
OK since it touches fs/debugfs/file.c.

 ---
 diff --git a/arch/x86/xen/debugfs.c b/arch/x86/xen/debugfs.c
 index ef1db19..c8377fb 100644
 --- a/arch/x86/xen/debugfs.c
 +++ b/arch/x86/xen/debugfs.c
 @@ -19,107 +19,3 @@ struct dentry * __init xen_init_debugfs(void)
   return d_xen_debug;
  }
  
 -struct array_data
 -{
 - void *array;
 - unsigned elements;
 -};
 -
 -static int u32_array_open(struct inode *inode, struct file *file)
 -{
 - file->private_data = NULL;
 - return nonseekable_open(inode, file);
 -}
 -
 -static size_t format_array(char *buf, size_t bufsize, const char *fmt,
 -u32 *array, unsigned array_size)
 -{
 - size_t ret = 0;
 - unsigned i;
 -
 - for (i = 0; i < array_size; i++) {
 - size_t len;
 -
 - len = snprintf(buf, bufsize, fmt, array[i]);
 - len++;  /* ' ' or '\n' */
 - ret += len;
 -
 - if (buf) {
 - buf += len;
 - bufsize -= len;
 - buf[-1] = (i == array_size-1) ? '\n' : ' ';
 - }
 - }
 -
 - ret++;  /* \0 */
 - if (buf)
 - *buf = '\0';
 -
 - return ret;
 -}
 -
 -static char *format_array_alloc(const char *fmt, u32 *array, unsigned 
 array_size)
 -{
 - size_t len = format_array(NULL, 0, fmt, array, array_size);
 - char *ret;
 -
 - ret = kmalloc(len, GFP_KERNEL);
 - if (ret == NULL)
 - return NULL;
 -
 - format_array(ret, len, fmt, array, array_size);
 - return ret;
 -}
 -
 -static ssize_t u32_array_read(struct file *file, char __user *buf, size_t 
 len,
 -   loff_t *ppos)
 -{
 - struct inode *inode = file->f_path.dentry->d_inode;
 - struct array_data *data = inode->i_private;
 - size_t size;
 -
 - if (*ppos == 0) {
 - if (file->private_data) {
 - kfree(file->private_data);
 - file->private_data = NULL;
 - }
 -
 - file->private_data = format_array_alloc("%u", data->array, 
 data->elements);
 - }
 -
 - size = 0;
 - if (file->private_data)
 - size = strlen(file->private_data);
 -
 - return simple_read_from_buffer(buf, len, ppos, file->private_data, 
 size);
 -}
 -
 -static int xen_array_release(struct inode *inode, struct file *file)
 -{
 - kfree(file->private_data);
 -
 - return 0;
 -}
 -
 -static const struct file_operations u32_array_fops = {
 - .owner  = THIS_MODULE,
 - .open   = u32_array_open,
 - .release= xen_array_release,
 - .read   = u32_array_read,
 - .llseek = no_llseek,
 -};
 -
 -struct dentry *xen_debugfs_create_u32_array(const char *name, umode_t mode,
 - struct dentry *parent,
 - u32 *array, unsigned elements)
 -{
 - struct array_data *data = kmalloc(sizeof(*data), GFP_KERNEL);
 -
 - if (data == NULL)
 - return NULL;
 -
 - data->array = array;
 - data->elements = elements;
 -
 - return debugfs_create_file(name, mode, parent, data, u32_array_fops);
 -}
 diff --git a/arch/x86/xen/debugfs.h b/arch/x86/xen/debugfs.h
 index 78d2549..12ebf33 100644
 --- a/arch/x86/xen/debugfs.h
 +++ b/arch/x86/xen/debugfs.h
 @@ -3,8 +3,4 @@
  
  struct dentry * __init xen_init_debugfs(void);
  
 -struct dentry *xen_debugfs_create_u32_array(const char *name, umode_t mode,
 - struct dentry *parent,
 - u32 *array, unsigned elements);
 -
  #endif /* _XEN_DEBUGFS_H */
 diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
 index 4926974..b74cebb 100644
 --- a/arch/x86/xen/spinlock.c
 +++ b/arch/x86/xen/spinlock.c
 @@ -314,7 +314,7 @@ static int __init xen_spinlock_debugfs(void)
 debugfs_create_u64("time_blocked", 0444, d_spin_debug,
   &spinlock_stats.time_blocked);
  
 - xen_debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
 + debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
spinlock_stats.histo_spin_blocked, 
 HISTO_BUCKETS + 1);
  
   return 0;
 diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
 index ef023ee..cb6cff3 100644
 --- a/fs/debugfs/file.c
 +++ b/fs/debugfs/file.c
 @@ -20,6 +20,7 @@
 #include <linux/namei.h>
 #include <linux/debugfs.h>
 #include <linux/io.h>
 +#include <linux/slab.h>

Re: [RFC 2/2] kvm: guest-side changes for tmem on KVM

2012-03-19 Thread Konrad Rzeszutek Wilk
On Fri, Mar 16, 2012 at 10:30:35AM +0530, Akshay Karle wrote:
  +/* kvm tmem foundation ops/hypercalls */
  +
  +static inline int kvm_tmem_op(u32 tmem_cmd, u32 tmem_pool, struct 
  tmem_oid oid,
  +  u32 index, u32 tmem_offset, u32 pfn_offset, unsigned long pfn, u32 len, 
  uint16_t cli_id)
 
  That is rather long list of arguments. Could you pass in a structure 
  instead?
 
  Are you actually using all of the arguments in every call?
 
 Different functions use different parameters. If we want to reduce
 the number of arguments,
 the tmem_ops structure can be created in the functions calling kvm_tmem_op
 instead of creating it here
 and passed in; I will make these changes in the next patch.
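Something along these lines would do, for illustration only (the struct name and layout here are invented, not from the patch):

struct kvm_tmem_request {		/* hypothetical */
	u32 cmd;
	u32 pool_id;
	struct tmem_oid oid;
	u32 index;
	unsigned long pfn;
	u32 len;
	uint16_t cli_id;
};

static int kvm_tmem_op(const struct kvm_tmem_request *req)
{
	struct tmem_ops op = {
		.cmd = req->cmd,
		.pool_id = req->pool_id,
		/* ... copy the remaining fields as the current code does ... */
	};

	return kvm_hypercall1(KVM_HC_TMEM, virt_to_phys(&op));
}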
 
  +{
  +  struct tmem_ops op;
  +  int rc = 0;
  +  op.cmd = tmem_cmd;
  +  op.pool_id = tmem_pool;
  +  op.u.gen.oid[0] = oid.oid[0];
  +  op.u.gen.oid[1] = oid.oid[1];
  +  op.u.gen.oid[2] = oid.oid[2];
  +  op.u.gen.index = index;
  +  op.u.gen.tmem_offset = tmem_offset;
  +  op.u.gen.pfn_offset = pfn_offset;
  +  op.u.gen.pfn = pfn;
  +  op.u.gen.len = len;
  +  op.u.gen.cli_id = cli_id;
  +  rc = kvm_hypercall1(KVM_HC_TMEM, virt_to_phys(op));
  +  rc = rc + 1000;
 
  Why the addition?
 
 If you notice the host patch I had subtracted 1000 while passing the return 
 value
 in the kvm_emulate_hypercall function. This was to avoid the guest kernel 
 panic due to
 the return of a non-negative value by the kvm_hypercall. In order to get the 
 original value
 back I added 1000.

Avi, is there a right way of doing this?


Re: [RFC 0/2] kvm: Transcendent Memory (tmem) on KVM

2012-03-15 Thread Konrad Rzeszutek Wilk
On Thu, Mar 08, 2012 at 09:59:41PM +0530, Akshay Karle wrote:
 Hi,
 
 We are undergraduate engineering students of Maharashtra Academy of
 Engineering, Pune, India and we are working on a project entitled
 'Transcendent Memory on KVM' as a part of our academics.
 The project members are:
 1. Ashutosh Tripathi
 2. Shreyas Mahure
 3. Nishant Gulhane
 4. Akshay Karle
 
 ---
 Project Description:
 What is Transcendent Memory(tmem in short)?
 Transcendent Memory is a memory optimization technique for the
 virtualized environment. It collects the underutilized memory of the
 guests and the unassigned(fallow) memory of the host and places it into
 a central tmem pool. Indirect access to this pool is then provided to the 
 guests.
 For further information on tmem, please refer the article on lwn by Dr.
 Dan Magenheimer:
 http://lwn.net/Articles/454795/
 
 Since kvm is one of the most popular hypervisors available,
 we decided to implement this technique for kvm.
 
 ---
 kvm-tmem Patch details:
 This patch adds appropriate shims at the guest that invokes the kvm
 hypercalls, and the host uses zcache pools to implement the required
 functions.

Great!

 
 To enable tmem on the 'kvm host' add the boot parameter:
 kvmtmem
 And to enable tmem in the 'kvm guests' add the boot parameter:
 tmem
 
 The diffstat details for this patch are given below:
  arch/x86/include/asm/kvm_host.h  |1 
  arch/x86/kvm/x86.c   |4 
  drivers/staging/zcache/Makefile  |2 
  drivers/staging/zcache/kvm-tmem.c|  356 
 +++
  drivers/staging/zcache/kvm-tmem.h|   55 +
  drivers/staging/zcache/zcache-main.c |   98 -
  include/linux/kvm_para.h |1 
  7 files changed, 508 insertions(+), 9 deletions(-)
   
 We have already uploaded our work along with the 'Frontswap' submitted by Dan,
 on the following link:
 https://github.com/akshaykarle/kvm-tmem
 
 Any comments/feedback would be appreciated and will help us a lot with our 
 work.

Great. Will do.
 
 Regards,
 Akshay


Re: [RFC 0/2] kvm: Transcendent Memory (tmem) on KVM

2012-03-15 Thread Konrad Rzeszutek Wilk
 ---
 kvm-tmem Patch details:
 This patch adds appropriate shims at the guest that invokes the kvm
 hypercalls, and the host uses zcache pools to implement the required
 functions.
 
 To enable tmem on the 'kvm host' add the boot parameter:
 kvmtmem
 And to enable tmem in the 'kvm guests' add the boot parameter:
 tmem
 
 The diffstat details for this patch are given below:
  arch/x86/include/asm/kvm_host.h  |1 
  arch/x86/kvm/x86.c   |4 
  drivers/staging/zcache/Makefile  |2 
  drivers/staging/zcache/kvm-tmem.c|  356 
 +++
  drivers/staging/zcache/kvm-tmem.h|   55 +
  drivers/staging/zcache/zcache-main.c |   98 -
  include/linux/kvm_para.h |1 
  7 files changed, 508 insertions(+), 9 deletions(-)
   
 We have already uploaded our work along with the 'Frontswap' submitted by Dan,
 on the following link:
 https://github.com/akshaykarle/kvm-tmem

Is there a way for these patches to be posted on LKML? It is rather difficult
to copy-n-paste patches out of emails and send them. Or if you want to, you can
email them directly to me. To do that use 'git send-email' and
'git format-patch' to prep the git commits into patches.


Also, the title says 'RFC 0/2' but I am not seeing 1 or 2?


Re: [RFC 1/2] kvm: host-side changes for tmem on KVM

2012-03-15 Thread Konrad Rzeszutek Wilk
On Thu, Mar 08, 2012 at 10:24:08PM +0530, Akshay Karle wrote:
 From: Akshay Karle akshay.a.ka...@gmail.com
 Subject: [RFC 1/2] kvm: host-side changes for tmem on KVM
 
 Working at host:
 Once the guest exits to the kvm host, the host determines that the guest
 exited
 to perform some tmem operation (done at kvm_emulate_hypercall) and then
 we use zcache to implement the required operations (performed by
 kvm_pv_tmem_op).

Do you need any modifications to the Kconfig file to reflect the KVM dependency?

 
 ---
 Diffstat for host patch:
  arch/x86/include/asm/kvm_host.h  |1 
  arch/x86/kvm/x86.c   |4 +
  drivers/staging/zcache/zcache-main.c |   98 
 ---
  3 files changed, 95 insertions(+), 8 deletions(-)
 
 diff -Napur vanilla/linux-3.1.5/arch/x86/include/asm/kvm_host.h 
 linux-3.1.5//arch/x86/include/asm/kvm_host.h
 --- vanilla/linux-3.1.5/arch/x86/include/asm/kvm_host.h   2011-12-09 
 22:27:05.0 +0530
 +++ linux-3.1.5//arch/x86/include/asm/kvm_host.h  2012-03-05 
 14:09:41.648006153 +0530
 @@ -668,6 +668,7 @@ int emulator_write_phys(struct kvm_vcpu
 const void *val, int bytes);
  int kvm_pv_mmu_op(struct kvm_vcpu *vcpu, unsigned long bytes,
 gpa_t addr, unsigned long *ret);
 +int kvm_pv_tmem_op(struct kvm_vcpu *vcpu, gpa_t addr, unsigned long *ret);
  u8 kvm_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);
  
  extern bool tdp_enabled;
 diff -Napur vanilla/linux-3.1.5/arch/x86/kvm/x86.c 
 linux-3.1.5//arch/x86/kvm/x86.c
 --- vanilla/linux-3.1.5/arch/x86/kvm/x86.c2011-12-09 22:27:05.0 
 +0530
 +++ linux-3.1.5//arch/x86/kvm/x86.c   2012-03-05 14:09:41.652006083 +0530
 @@ -5267,6 +5267,10 @@ int kvm_emulate_hypercall(struct kvm_vcp
   case KVM_HC_MMU_OP:
   r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), ret);
   break;
 + case KVM_HC_TMEM:
 + r = kvm_pv_tmem_op(vcpu, a0, ret);
 + ret = ret - 1000;

That is rather odd. Why the subtraction of 1000?

 + break;
   default:
   ret = -KVM_ENOSYS;
   break;
 diff -Napur vanilla/linux-3.1.5/drivers/staging/zcache/zcache-main.c 
 linux-3.1.5//drivers/staging/zcache/zcache-main.c
 --- vanilla/linux-3.1.5/drivers/staging/zcache/zcache-main.c  2011-12-09 
 22:27:05.0 +0530
 +++ linux-3.1.5//drivers/staging/zcache/zcache-main.c 2012-03-05 
 14:10:31.264006031 +0530
 @@ -30,6 +30,7 @@
 #include <linux/atomic.h>
 #include <linux/math64.h>
 #include "tmem.h"
 +#include "kvm-tmem.h"
 
 #include "../zram/xvmalloc.h" /* if built in drivers/staging */
  
 @@ -669,7 +670,6 @@ static struct zv_hdr *zv_create(struct x
 int chunks = (alloc_size + (CHUNK_SIZE - 1)) >> CHUNK_SHIFT;
   int ret;
  
 - BUG_ON(!irqs_disabled());

Can you explain why?

 BUG_ON(chunks >= NCHUNKS);
 ret = xv_malloc(xvpool, alloc_size,
 &page, &offset, ZCACHE_GFP_MASK);
 @@ -1313,7 +1313,6 @@ static int zcache_compress(struct page *
   unsigned char *wmem = __get_cpu_var(zcache_workmem);
   char *from_va;
  
 - BUG_ON(!irqs_disabled());
   if (unlikely(dmem == NULL || wmem == NULL))
   goto out;  /* no buffer, so can't compress */
   from_va = kmap_atomic(from, KM_USER0);
 @@ -1533,7 +1532,6 @@ static int zcache_put_page(int cli_id, i
   struct tmem_pool *pool;
   int ret = -1;
  
 - BUG_ON(!irqs_disabled());
   pool = zcache_get_pool_by_id(cli_id, pool_id);
   if (unlikely(pool == NULL))
   goto out;
 @@ -1898,6 +1896,67 @@ struct frontswap_ops zcache_frontswap_re
  #endif
  
  /*
 + * tmem op to support tmem in kvm guests
 + */
 +
 +int kvm_pv_tmem_op(struct kvm_vcpu *vcpu, gpa_t addr, unsigned long *ret)
 +{
 + struct tmem_ops op;
 + struct tmem_oid oid;
 + uint64_t pfn;
 + struct page *page;
 + int r;
 +
 + r = kvm_read_guest(vcpu->kvm, addr, &op, sizeof(op));
 + if (r < 0)
 + return r;
 +
 + switch (op.cmd) {
 + case TMEM_NEW_POOL:
 + *ret = zcache_new_pool(op.u.new.cli_id, op.u.new.flags);
 + break;
 + case TMEM_DESTROY_POOL:
 + *ret = zcache_destroy_pool(op.u.gen.cli_id, op.pool_id);
 + break;
 + case TMEM_NEW_PAGE:
 + break;
 + case TMEM_PUT_PAGE:
 + pfn = gfn_to_pfn(vcpu->kvm, op.u.gen.pfn);
 + page = pfn_to_page(pfn);
 + oid.oid[0] = op.u.gen.oid[0];
 + oid.oid[1] = op.u.gen.oid[1];
 + oid.oid[2] = op.u.gen.oid[2];
 + VM_BUG_ON(!PageLocked(page));
 + *ret = zcache_put_page(op.u.gen.cli_id, op.pool_id,
 + oid, op.u.gen.index, page);
 + break;
 + case TMEM_GET_PAGE:
 + pfn = gfn_to_pfn(vcpu->kvm, op.u.gen.pfn);
 + page = pfn_to_page(pfn);
 + oid.oid[0] = op.u.gen.oid[0];
 + 

Re: [RFC 2/2] kvm: guest-side changes for tmem on KVM

2012-03-15 Thread Konrad Rzeszutek Wilk
On Thu, Mar 08, 2012 at 10:32:37PM +0530, Akshay Karle wrote:
 From: Akshay Karle akshay.a.ka...@gmail.com
 Subject: [RFC 2/2] kvm: guest-side changes for tmem on KVM
 
 Working in the guest:
 At the kvm guest, we add the appropriate tmem shims to intercept the
 tmem operations and then invoke the kvm hypercalls to exit to the host
 and perform these operations.
 
 Signed-off-by: Akshay Karle akshay.a.ka...@gmail.com
 
 ---
 Diffstat for guest side changes:
  drivers/staging/zcache/Makefile   |2 
  drivers/staging/zcache/kvm-tmem.c |  356 
 ++
  drivers/staging/zcache/kvm-tmem.h |   55 +
  include/linux/kvm_para.h  |1 
  4 files changed, 413 insertions(+), 1 deletion(-)
 
 diff -Napur vanilla/linux-3.1.5/drivers/staging/zcache/kvm-tmem.c 
 linux-3.1.5//drivers/staging/zcache/kvm-tmem.c
 --- vanilla/linux-3.1.5/drivers/staging/zcache/kvm-tmem.c 1970-01-01 
 05:30:00.0 +0530
 +++ linux-3.1.5//drivers/staging/zcache/kvm-tmem.c2012-03-05 
 14:16:00.892007167 +0530
 @@ -0,0 +1,356 @@
 +/*
 + * kvm implementation for transcendent memory (tmem)
 + *
 + * Copyright (C) 2009-2011 Oracle Corp.  All rights reserved.
 + * Author: Dan Magenheimer
 + *  Akshay Karle
 + *  Ashutosh Tripathi
 + *  Nishant Gulhane
 + *  Shreyas Mahure
 + */
 +
 +#include <linux/kernel.h>
 +#include <linux/types.h>
 +#include <linux/init.h>
 +#include <linux/pagemap.h>
 +#include <linux/module.h>
 +#include <linux/cleancache.h>
 +
 +/* temporary ifdef until include/linux/frontswap.h is upstream */
 +#ifdef CONFIG_FRONTSWAP
 +#include <linux/frontswap.h>
 +#endif
 +
 +#include "kvm-tmem.h"
 +
 +/* kvm tmem foundation ops/hypercalls */
 +
 +static inline int kvm_tmem_op(u32 tmem_cmd, u32 tmem_pool, struct tmem_oid 
 oid,
 + u32 index, u32 tmem_offset, u32 pfn_offset, unsigned long pfn, u32 len, 
 uint16_t cli_id)

That is rather long list of arguments. Could you pass in a structure instead?

Are you actually using all of the arguments in every call?
 +{
 + struct tmem_ops op;
 + int rc = 0;
 + op.cmd = tmem_cmd;
 + op.pool_id = tmem_pool;
 + op.u.gen.oid[0] = oid.oid[0];
 + op.u.gen.oid[1] = oid.oid[1];
 + op.u.gen.oid[2] = oid.oid[2];
 + op.u.gen.index = index;
 + op.u.gen.tmem_offset = tmem_offset;
 + op.u.gen.pfn_offset = pfn_offset;
 + op.u.gen.pfn = pfn;
 + op.u.gen.len = len;
 + op.u.gen.cli_id = cli_id;
 + rc = kvm_hypercall1(KVM_HC_TMEM, virt_to_phys(&op));
 + rc = rc + 1000;

Why the addition?

 + return rc;
 +}
 +
 +static int kvm_tmem_new_pool(uint16_t cli_id,
 + u32 flags, unsigned long pagesize)
 +{
 + struct tmem_ops op;
 + int rc, pageshift;
 + for (pageshift = 0; pagesize != 1; pageshift++)
 + pagesize >>= 1;
 + flags |= (pageshift - 12) << TMEM_POOL_PAGESIZE_SHIFT;

Instead of 12, just use PAGE_SHIFT

 + flags |= TMEM_SPEC_VERSION << TMEM_VERSION_SHIFT;
 + op.cmd = TMEM_NEW_POOL;
 + op.u.new.cli_id = cli_id;
 + op.u.new.flags = flags;
 + rc = kvm_hypercall1(KVM_HC_TMEM, virt_to_phys(&op));
 + rc = rc + 1000;
 + return rc;
 +}
 +
 +/* kvm generic tmem ops */
 +
 +static int kvm_tmem_put_page(u32 pool_id, struct tmem_oid oid,
 +  u32 index, unsigned long pfn)
 +{
 +
 + return kvm_tmem_op(TMEM_PUT_PAGE, pool_id, oid, index,
 + 0, 0, pfn, 0, TMEM_CLI);
 +}
 +
 +static int kvm_tmem_get_page(u32 pool_id, struct tmem_oid oid,
 +  u32 index, unsigned long pfn)
 +{
 +
 + return kvm_tmem_op(TMEM_GET_PAGE, pool_id, oid, index,
 + 0, 0, pfn, 0, TMEM_CLI);
 +}
 +
 +static int kvm_tmem_flush_page(u32 pool_id, struct tmem_oid oid, u32 index)
 +{
 + return kvm_tmem_op(TMEM_FLUSH_PAGE, pool_id, oid, index,
 + 0, 0, 0, 0, TMEM_CLI);
 +}
 +
 +static int kvm_tmem_flush_object(u32 pool_id, struct tmem_oid oid)
 +{
 + return kvm_tmem_op(TMEM_FLUSH_OBJECT, pool_id, oid, 0, 0, 0, 0, 0, 
 TMEM_CLI);
 +}
 +
 +static int kvm_tmem_destroy_pool(u32 pool_id)
 +{
 + struct tmem_oid oid = { { 0 } };
 +
 + return kvm_tmem_op(TMEM_DESTROY_POOL, pool_id, oid, 0, 0, 0, 0, 0, 
 TMEM_CLI);
 +}
 +
 +static int kvm_tmem_enabled;
 +
 +static int __init enable_tmem_kvm(char *s)
 +{
 + kvm_tmem_enabled = 1;
 + return 1;
 +}
 +__setup("tmem", enable_tmem_kvm);

I would say do it the other way around. Provide an argument
to disable it.
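i.e. something like this (sketch only; the "notmem" parameter name is made up here):

/* tmem on by default in the guest, with a boot parameter to opt out */
static bool kvm_tmem_enabled __read_mostly = true;

static int __init disable_tmem_kvm(char *s)
{
	kvm_tmem_enabled = false;
	return 1;
}
__setup("notmem", disable_tmem_kvm);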

 +
 +/* cleancache ops */
 +
 +#ifdef CONFIG_CLEANCACHE
 +static void tmem_cleancache_put_page(int pool, struct cleancache_filekey key,
 +  pgoff_t index, struct page *page)
 +{
 + u32 ind = (u32) index;
 + struct tmem_oid oid = *(struct tmem_oid *)key;
 + unsigned long pfn = page_to_pfn(page);
 +
 + if (pool < 0)
 + return;
 + if (ind != index)
 + return;
 + mb(); /* ensure page is quiescent; tmem may address it with an alias */

Can 

Re: [RFC 0/2] kvm: Transcendent Memory (tmem) on KVM

2012-03-15 Thread Konrad Rzeszutek Wilk
On Thu, Mar 15, 2012 at 08:01:52PM +0200, Avi Kivity wrote:
 On 03/15/2012 07:49 PM, Dan Magenheimer wrote:
   One of the potential problems with tmem is reduction in performance when
   the cache hit rate is low, for example when streaming.
   
   Can you test this by creating a large file, for example with
   
  dd < /dev/urandom > file bs=1M count=10
   
   and then measuring the time to stream it, using
   
  time dd < file > /dev/null
   
   with and without the patch?
   
   Should be done on a cleancache enabled guest filesystem backed by a
   virtio disk with cache=none.
   
   It would be interesting to compare kvm_stat during the streaming, with
   and without the patch.
 
  Hi Avi --
 
  The WasActive patch (https://lkml.org/lkml/2012/1/25/300) 
  is intended to avoid the streaming situation you are creating here.
  It increases the quality of cached pages placed into zcache
  and should probably also be used on the guest-side stubs (and/or maybe
  the host-side zcache... I don't know KVM well enough to determine
  if that would work).
 
  As Dave Hansen pointed out, the WasActive patch is not yet correct
  and, as akpm points out, pageflag bits are scarce on 32-bit systems,
  so it remains to be seen if the WasActive patch can be upstreamed.
  Or maybe there is a different way to achieve the same goal.
  But I wanted to let you know that the streaming issue is understood
  and needs to be resolved for some cleancache backends just as it was
  resolved in the core mm code.
 
 Nice.  This takes care of the tail-end of the streaming (the more
 important one - since it always involves a cold copy).  What about the
 other side?  Won't the read code invoke cleancache_get_page() for every
 page? (this one is just a null hypercall, so it's cheaper, but still
 expensive).

That is something we should fix - I think it was mentioned in the frontswap
email thread the need for batching and it certainly seems required as those
hypercalls aren't that cheap.


Re: [RFC 0/2] kvm: Transcendent Memory (tmem) on KVM

2012-03-15 Thread Konrad Rzeszutek Wilk
On Thu, Mar 15, 2012 at 12:36:48PM -0700, Dan Magenheimer wrote:
  From: Avi Kivity [mailto:a...@redhat.com]
  Sent: Thursday, March 15, 2012 12:11 PM
  To: Konrad Rzeszutek Wilk
  Cc: Dan Magenheimer; Akshay Karle; linux-ker...@vger.kernel.org; 
  kvm@vger.kernel.org; ashu tripathi;
  nishant gulhane; amarmore2006; Shreyas Mahure; mahesh mohan
  Subject: Re: [RFC 0/2] kvm: Transcendent Memory (tmem) on KVM
  
  On 03/15/2012 08:02 PM, Konrad Rzeszutek Wilk wrote:
   
Nice.  This takes care of the tail-end of the streaming (the more
important one - since it always involves a cold copy).  What about the
other side?  Won't the read code invoke cleancache_get_page() for every
page? (this one is just a null hypercall, so it's cheaper, but still
expensive).
  
   That is something we should fix - I think it was mentioned in the 
   frontswap
   email thread the need for batching and it certainly seems required as 
   those
   hypercalls aren't that cheap.
  
  In fact when tmem was first proposed I asked for two changes - make it
  batchable, and make it asynchronous (so we can offload copies to a dma
  engine, etc).  Of course that would have made tmem significantly more
  complicated.
 
 (Sorry, I'm not typing fast enough to keep up with the thread...)
 
 Hi Avi --
 
 In case it wasn't clear from my last reply, RAMster shows
 that tmem CAN be used asynchronously... by making it more
 complicated, but without making the core kernel changes more
 complicated.
 
 In RAMster, pages are locally cached (compressed using zcache)
 and then, depending on policy, a separate thread sends the pages
 to a remote machine.  So the first part (compress and store locally)
 still must be synchronous, but the second part (transmit to
 another -- remote or possibly host? -- system) can be done
 asynchronously.  The RAMster code has to handle all the race
 conditions, which is a pain but seems to work.
 
 This is all working today in RAMster (which is in linux-next).
 Batching is still not implemented by any tmem backend, but RAMster
 demonstrates how the backend implementation COULD do batching without
 any additional core kernel changes.  I.e. no changes necessary
 to frontswap or cleancache.
 
 So, you see, I *was* listening. I just wasn't willing to fight
 the uphill battle of much more complexity in the core kernel
 for a capability that could be implemented differently.

Dan, please stop this.

The frontswap work is going through me and my goal is to provide
the batching and asynchronous option. It might take longer than
anticipated b/c it might require redoing some of the code - that
is OK. We can do this in steps too - first do the synchronous
(as is right now in implementation) and then add on the batching
and asynchrnous work. This means breaking the ABI/API, and I believe
Avi would like the ABI be as much baked as possible so that he does
not have to provide a v2 (or v3) of the tmem support in KVM.

I appreciate you having done that in RAMster but the transmit
option is what we need to batch. Think of Scatter Gather DMA.

 
 That said, I still think it remains to be proven that
 reducing the number of hypercalls by 2x or 3x (or whatever
 the batching factor you choose) will make a noticeable

I was thinking 32 - about the same number that we do in
Xen with PV MMU upcalls. We also batch it there with multicalls.
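Roughly what I have in mind, as a sketch only (the batch structure and names are invented here, and locking/preemption handling is omitted):

#define TMEM_BATCH 32	/* comparable to the Xen multicall batch size */

struct tmem_batch {
	u32 nr_ops;			/* valid entries in ops[] */
	struct tmem_ops ops[TMEM_BATCH];
};

static DEFINE_PER_CPU(struct tmem_batch, tmem_batch);

static void kvm_tmem_flush_batch(void)
{
	struct tmem_batch *b = this_cpu_ptr(&tmem_batch);

	if (!b->nr_ops)
		return;
	/* one guest exit for up to TMEM_BATCH ops instead of one per op */
	kvm_hypercall2(KVM_HC_TMEM, virt_to_phys(b), b->nr_ops);
	b->nr_ops = 0;
}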

 performance difference.  But if it does, batching can
 be done... and completely hidden in the backend.
 
 (I hope Andrea is listening ;-)
 
 Dan


Re: [PATCH] BUG in pv_clock when overflow condition is detected

2012-02-20 Thread Konrad Rzeszutek Wilk
On Fri, Feb 17, 2012 at 04:25:04PM +0100, Igor Mammedov wrote:
 On 02/16/2012 03:03 PM, Avi Kivity wrote:
 On 02/15/2012 07:18 PM, Igor Mammedov wrote:
 On 02/15/2012 01:23 PM, Igor Mammedov wrote:
static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time
 *shadow)
{
 -u64 delta = native_read_tsc() - shadow->tsc_timestamp;
 +u64 delta;
 +u64 tsc = native_read_tsc();
 +BUG_ON(tsc < shadow->tsc_timestamp);
 +delta = tsc - shadow->tsc_timestamp;
 return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
   shadow->tsc_shift);
 
 Maybe a WARN_ON_ONCE()?  Otherwise a relatively minor hypervisor
 bug can
 kill the guest.
 
 
 An attempt to print from this place is not perfect since it often
 leads
 to recursive calls into this very function and it hangs there
 anyway.
 But if you insist I'll re-post it with WARN_ON_ONCE;
 it won't make much difference because the guest will hang/stall due
 to overflow
 anyway.
 
 Won't a BUG_ON() also result in a printk?
 Yes, it will. But stack will still keep failure point and poking
 with crash/gdb at core will always show where it's BUGged.
 
 In case it manages to print dump somehow (saw it couple times from ~
 30 test cycles), logs from console or from kernel message buffer
 (again poking with gdb) will show where it was called from.
 
 If WARN* is used, it will still totally screw up the clock and the
 last value, and the system will become unusable, requiring looking with
 gdb/crash at the core anyway.
 
 So I've just used more stable failure point that will leave trace
 everywhere it manages (maybe in console log, but for sure in stack)
 in case of WARN it might leave trace on console or not and probably
 won't reflect failure point in stack either leaving only kernel
 message buffer for clue.
 
 
 Makes sense.  But do get an ack from the Xen people to ensure this
 doesn't break for them.
 
 Konrad, Ian
 
 Could you please review patch form point of view of xen?
 Whole thread could be found here https://lkml.org/lkml/2012/2/13/286

What are the conditions under which this happens? You should probably
include that in the git description as well? Is this something that happens
often? If there is an overflow can you synthesize a value instead of
crashing the guest?
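For example, something along these lines instead of the BUG_ON (a sketch only; whether clamping like this is acceptable is exactly what needs an ack from the Xen side):

static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
{
	u64 tsc = native_read_tsc();
	u64 delta;

	if (unlikely(tsc < shadow->tsc_timestamp)) {
		/* TSC appears to have gone backwards vs. the hypervisor's
		 * timestamp: warn once and synthesize a zero offset
		 * instead of BUG() (printing here may itself be fragile,
		 * as noted above). */
		WARN_ON_ONCE(1);
		delta = 0;
	} else {
		delta = tsc - shadow->tsc_timestamp;
	}

	return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
				   shadow->tsc_shift);
}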

Hm, so are you asking for review for this patch or for
http://www.spinics.net/lists/kvm/msg68440.html ?

(which would also entail a early_percpu_clock_init implementation
in the Xen code naturally).



Re: [PATCH 0/5] VFIO core framework

2012-01-12 Thread Konrad Rzeszutek Wilk
On Tue, Jan 10, 2012 at 11:35:54AM -0700, Alex Williamson wrote:
 On Tue, 2012-01-10 at 11:26 -0500, Konrad Rzeszutek Wilk wrote:
  On Wed, Dec 21, 2011 at 02:42:02PM -0700, Alex Williamson wrote:
   This series includes the core framework for the VFIO driver.
   VFIO is a userspace driver interface meant to replace both the
   KVM device assignment code as well as interfaces like UIO.  Please
   see patch 1/5 for a complete description of VFIO, what it can do,
   and how it's designed.
   
   This version and the VFIO PCI bus driver, for exposing PCI devices
   through VFIO, can be found here:
   
   git://github.com/awilliam/linux-vfio.git vfio-next-20111221
   
   A development version of qemu which includes a full working
   vfio-pci driver, indepdendent of KVM support, can be found here:
   
   git://github.com/awilliam/qemu-vfio.git vfio-ng
   
   Thanks,
  
  Alex,
  
  So I took a look at the patchset with two different things in mind this 
  time:
   - What if you do not need to do any IRQ ack/de-ack etc. in the host because
 all of that is done in the guest (say you have an actual IOAPIC in the guest
 that is _not_ managed by QEMU).
   - What would be required to make this work with a different hypervisor - 
  say Xen.
  
  And the conclusion I came to is that it would require some surgery -
  especially
  as some of the IRQ, irqfd, etc. code support is not required per se.
  
  To me it seems to get this working with Xen (or perhaps with the Power 
  machines
  as well, as their hypervisor is similar to Xen in architecture?) we would 
  need at
  least two extra pieces of Linux kernel code: 
   - Xen IOMMU, which really is just doing a whole bunch of
  xc_domain_memory_mapping calls
for the user-space iovas. For the normal PCI device operations it would
  just
offload them to the existing DMA API.
  - Xen VFIO PCI. Or at least make the VFIO PCI (in your vfio-next-20111221 
  branch)
 driver allow some abstraction. There are certain things we might do via 
  alternate
operations. Such as the interrupt handling - where we bind the IRQ to 
  an event
channel or make a hypercall to program the guest' MSI vectors. Perhaps 
  there can
be an platform-specific part of it.
 
 Sure, I've envisioned that we'll have multiple iommu interfaces.  We'll
 need build-time and run-time selection.  I haven't implemented that yet
 since the iommu requirements are still developing.  Likewise, a
 vfio-xen-pci module is possible or we can look at whether we make the
 vfio-pci code too ugly by incorporating a dual-mode into that.

Yuck. Well, I am all up for making it pretty.

 
  In the userland:
   - In QEMU VFIO, make the interrupt part optional for certain parts (like 
  we don't
 expect an IRQ to happen in the host).
 
 Or can it be handled by vfio-xen-pci, which enables event channels
 through to xen?  It's possible the GET_IRQ_INFO ioctls could report a

Sure.
 flag indicating the type of notification available (eventfds being the
 initial option) and SET_IRQ_EVENTFDS could be generalized to take an
 array of structs other than eventfds.  For the non-Xen case, eventfds
 seem to provide us with the most flexibility since we can either connect
 them to userspace or just have userspace be the agent that connects the
 eventfd to an irqfd in another module.  See the (outdated) version of
 qemu-kvm vfio in this tree for an example (look for QEMU_KVM_BUILD):
 https://github.com/awilliam/qemu-kvm-vfio/blob/vfio/hw/vfio.c

Ah I see.
 
  I am curious to see how the Power folks have to deal with this. Perhaps the
  requirement
  to write a PV IOMMU is not something they need?
  
  In terms of this patchset, the big thing for me is that it moves the
  usual mechanism
  of unbind/bind from using sysfs to being done via ioctls. I get the
  reasoning for it
  - cannot guarantee any locking, but doing it all in ioctls instead of 
  configfs or sysfs
  seems odd. But perhaps that is just me having gotten use to doing it in 
  sysfs/configfs.
  Certainly it makes it easier to program in QEMU/libvirt. And ultimately 
  that is going
  to be the user for 99% of this.
 
 Can you be more specific about which ioctl part you're referring to?  We
 bind/unbind each device to vfio-pci via the normal sysfs driver

Let me look again at the QEMU changes. I was thinking you did a bunch
of ioctls to assign a device, but I am probably getting it confused
with the vfio-group ioctls.

 interfaces.  Userspace binds itself to a group via ioctls, but that's
 because neither configfs or sysfs allow ioctl and I don't think it's
 possible to implement an ioctl-free vfio.  Trying to implement vfio
 across both configfs and chardev presents issues with ownership.

Right, one of them works. No need to do it across different subsystem.
 
  The requirement of the VFIO PCI driver to deal with all of the nasty 
  work-arounds for
 devices is nice. I do like the separation - where this driver (VFIO core) deal
  deal
