Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
On 26/09/2016 09:28, Alex Bennée wrote:
>> cpu->running is only read under the mutex, but can be written _by the
>> owner thread only_ outside the mutex.  So writes outside the mutex must
>> be atomic, but writes under the mutex don't because:
>>
>> - no other thread ever writes to cpu->running
>>
>> - no other thread can be reading cpu->running
>
> Should we add some comments to cpu.h's definitions to make the rules
> clear?

I don't know... It's awfully easy for such documentation to get out of
date.  Currently it says:

 * @running: #true if CPU is currently running (lockless).
 * @has_waiter: #true if a CPU is currently waiting for the cpu_exec_end;
 *              valid under cpu_list_lock.

I think it's a good middle ground; it's kind of obvious that only the
CPU itself writes cpu->running.

I'm changing anyway the cpu->running assignment to atomic_set as
suggested by Richard, and that makes it valid to read cpu->running
outside the lock.

Paolo
Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
Paolo Bonzini writes:

> On 24/09/2016 22:43, Richard Henderson wrote:
>>>> I don't see that the cpu_list_lock protects the last two lines in
>>>> any way.
>>>
>>> It does:
>>>
>>>     qemu_mutex_lock(&qemu_cpu_list_lock);
>>
>> What I meant is that I don't see that the mutex avoids the need for
>> atomic_set.
>
> Oh, I see.
>
> cpu->running is only read under the mutex, but can be written _by the
> owner thread only_ outside the mutex.  So writes outside the mutex must
> be atomic, but writes under the mutex don't because:
>
> - no other thread ever writes to cpu->running
>
> - no other thread can be reading cpu->running

Should we add some comments to cpu.h's definitions to make the rules
clear?

--
Alex Bennée
Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
On 24/09/2016 22:43, Richard Henderson wrote:
>>> I don't see that the cpu_list_lock protects the
>>> last two lines in any way.
>>
>> It does:
>>
>>     qemu_mutex_lock(&qemu_cpu_list_lock);
>
> What I meant is that I don't see that the mutex avoids the need for
> atomic_set.

Oh, I see.

cpu->running is only read under the mutex, but can be written _by the
owner thread only_ outside the mutex.  So writes outside the mutex must
be atomic, but writes under the mutex don't because:

- no other thread ever writes to cpu->running

- no other thread can be reading cpu->running

Paolo
Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
On 09/24/2016 04:52 AM, Paolo Bonzini wrote:
> ----- Original Message -----
>> From: "Richard Henderson"
>> To: "Paolo Bonzini" , qemu-devel@nongnu.org
>> Cc: "serge fdrv" , c...@braap.org, "alex bennee" ,
>>     "sergey fedorov"
>> Sent: Friday, September 23, 2016 8:23:46 PM
>> Subject: Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast
>>     path for cpu_exec_start/end
>>
>> On 09/23/2016 12:31 AM, Paolo Bonzini wrote:
>>> +    if (atomic_read(&other_cpu->running)) {
>> ...
>>> +    atomic_set(&cpu->running, true);
>> ...
>>> +    cpu->running = false;
>> ...
>>> +    cpu->running = true;
>>
>> Inconsistent use of atomics.  I don't see that the cpu_list_lock
>> protects the last two lines in any way.
>
> It does:
>
>     qemu_mutex_lock(&qemu_cpu_list_lock);

What I meant is that I don't see that the mutex avoids the need for
atomic_set.

> but I can change it anyway to atomic_set.

Thanks,

r~
Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
----- Original Message -----
> From: "Richard Henderson"
> To: "Paolo Bonzini" , qemu-devel@nongnu.org
> Cc: "serge fdrv" , c...@braap.org, "alex bennee" ,
>     "sergey fedorov"
> Sent: Friday, September 23, 2016 8:23:46 PM
> Subject: Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast
>     path for cpu_exec_start/end
>
> On 09/23/2016 12:31 AM, Paolo Bonzini wrote:
> > +    if (atomic_read(&other_cpu->running)) {
> ...
> > +    atomic_set(&cpu->running, true);
> ...
> > +    cpu->running = false;
> ...
> > +    cpu->running = true;
>
> Inconsistent use of atomics.  I don't see that the cpu_list_lock protects
> the last two lines in any way.

It does:

    qemu_mutex_lock(&qemu_cpu_list_lock);
    if (!cpu->has_waiter) {
        /* Not counted in pending_cpus, let the exclusive item
         * run.  Since we have the lock, just set cpu->running to true
         * while holding it; no need to check pending_cpus again.
         */
        cpu->running = false;
        exclusive_idle();
        /* Now pending_cpus is zero.  */
        cpu->running = true;
    } else {
        /* Counted in pending_cpus, go ahead and release the
         * waiter at cpu_exec_end.
         */
    }
    qemu_mutex_unlock(&qemu_cpu_list_lock);

but I can change it anyway to atomic_set.

Paolo
Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
On 09/23/2016 12:31 AM, Paolo Bonzini wrote:
> +    if (atomic_read(&other_cpu->running)) {
...
> +    atomic_set(&cpu->running, true);
...
> +    cpu->running = false;
...
> +    cpu->running = true;

Inconsistent use of atomics.  I don't see that the cpu_list_lock
protects the last two lines in any way.

r~
Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
On 22/09/2016 00:27, Emilio G. Cota wrote:
> Another suggestion is to consistently access pending_cpus atomically,
> since now we're accessing it with and without the CPU list mutex held.

Yeah, that's a bit of a pain in the ass, but it's a good idea.

Paolo
Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
On Mon, Sep 19, 2016 at 14:50:59 +0200, Paolo Bonzini wrote:
(snip)
> @@ -212,24 +216,75 @@ void end_exclusive(void)
>  /* Wait for exclusive ops to finish, and begin cpu execution.  */
>  void cpu_exec_start(CPUState *cpu)
>  {
> -    qemu_mutex_lock(&qemu_cpu_list_mutex);
> -    exclusive_idle();
>      cpu->running = true;
> -    qemu_mutex_unlock(&qemu_cpu_list_mutex);
> +
> +    /* Write cpu->running before reading pending_cpus.  */
> +    smp_mb();
> +
> +    /* 1. start_exclusive saw cpu->running == true and pending_cpus >= 1.
> +     * After taking the lock we'll see cpu->has_waiter == true and run---not
> +     * for long because start_exclusive kicked us.  cpu_exec_end will
> +     * decrement pending_cpus and signal the waiter.
> +     *
> +     * 2. start_exclusive saw cpu->running == false but pending_cpus >= 1.
> +     * This includes the case when an exclusive item is running now.
> +     * Then we'll see cpu->has_waiter == false and wait for the item to
> +     * complete.
> +     *
> +     * 3. pending_cpus == 0.  Then start_exclusive is definitely going to
> +     * see cpu->running == true, and it will kick the CPU.
> +     */
> +    if (pending_cpus) {

I'd consider doing

    if (unlikely(pending_cpus)) {

since the exclusive is a slow path and will be more so in the near
future.

> +        qemu_mutex_lock(&qemu_cpu_list_mutex);
> +        if (!cpu->has_waiter) {
> +            /* Not counted in pending_cpus, let the exclusive item
> +             * run.  Since we have the lock, set cpu->running to true
> +             * while holding it instead of retrying.
> +             */
> +            cpu->running = false;
> +            exclusive_idle();
> +            /* Now pending_cpus is zero.  */
> +            cpu->running = true;
> +        } else {
> +            /* Counted in pending_cpus, go ahead.  */
> +        }
> +        qemu_mutex_unlock(&qemu_cpu_list_mutex);
> +    }
>  }
>
>  /* Mark cpu as not executing, and release pending exclusive ops.  */
>  void cpu_exec_end(CPUState *cpu)
>  {
> -    qemu_mutex_lock(&qemu_cpu_list_mutex);
>      cpu->running = false;
> -    if (pending_cpus > 1) {
> -        pending_cpus--;
> -        if (pending_cpus == 1) {
> -            qemu_cond_signal(&exclusive_cond);
> +
> +    /* Write cpu->running before reading pending_cpus.  */
> +    smp_mb();
> +
> +    /* 1. start_exclusive saw cpu->running == true.  Then it will increment
> +     * pending_cpus and wait for exclusive_cond.  After taking the lock
> +     * we'll see cpu->has_waiter == true.
> +     *
> +     * 2. start_exclusive saw cpu->running == false but here pending_cpus >= 1.
> +     * This includes the case when an exclusive item started after setting
> +     * cpu->running to false and before we read pending_cpus.  Then we'll see
> +     * cpu->has_waiter == false and not touch pending_cpus.  The next call to
> +     * cpu_exec_start will run exclusive_idle if still necessary, thus waiting
> +     * for the item to complete.
> +     *
> +     * 3. pending_cpus == 0.  Then start_exclusive is definitely going to
> +     * see cpu->running == false, and it can ignore this CPU until the
> +     * next cpu_exec_start.
> +     */
> +    if (pending_cpus) {

ditto

> +        qemu_mutex_lock(&qemu_cpu_list_mutex);
> +        if (cpu->has_waiter) {
> +            cpu->has_waiter = false;
> +            if (--pending_cpus == 1) {
> +                qemu_cond_signal(&exclusive_cond);
> +            }
(snip)

Another suggestion is to consistently access pending_cpus atomically,
since now we're accessing it with and without the CPU list mutex held.

Thanks,

		Emilio
Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
On Wed, Sep 21, 2016 at 20:19:18 +0200, Paolo Bonzini wrote:
(snip)
> No, this is not true.  Barriers order stores and loads within a thread
> _and_ establish synchronizes-with edges.
>
> In the example above you are violating causality:
>
> - cpu0 stores cpu->running before loading pending_cpus
>
> - because pending_cpus == 0, cpu1 stores pending_cpus = 1 after cpu0
>   loads it
>
> - cpu1 loads cpu->running after it stores pending_cpus

OK. So I simplified the example to understand this better:

    cpu0        cpu1
    { x = y = 0, r0 and r1 are private variables }

    x = 1       y = 1
    smp_mb()    smp_mb()
    r0 = y      r1 = x

Turns out this is scenario 10 here:

    https://lwn.net/Articles/573436/

The source of my confusion was not paying due attention to smp_mb,
which is necessary for maintaining transitivity.

> > Is there a performance (scalability) reason behind this patch?
>
> Yes: it speeds up all cpu_exec_start/end, _not_ start/end_exclusive.
>
> With this patch, as long as there are no start/end_exclusive (which are
> supposed to be rare) there is no contention on multiple CPUs doing
> cpu_exec_start/end.
>
> Without it, as CPUs increase, the global cpu_list_mutex is going to
> become a bottleneck.

I see. Scalability-wise I wouldn't expect much improvement with MTTCG
full-system, given that the iothread lock is still acquired on every
CPU loop exit (just like in KVM). However, for user-mode this should
yield measurable improvements =D

Thanks,

		E.
Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
On 21/09/2016 19:24, Emilio G. Cota wrote:
> On Mon, Sep 19, 2016 at 14:50:59 +0200, Paolo Bonzini wrote:
>> Set cpu->running without taking the cpu_list lock, only look at it if
>> there is a concurrent exclusive section.  This requires adding a new
>> field to CPUState, which records whether a running CPU is being counted
>> in pending_cpus.  When an exclusive section is started concurrently with
>> cpu_exec_start, cpu_exec_start can use the new field to wait for the end
>> of the exclusive section.
>>
>> This is a separate patch for easier bisection of issues.
>>
>> Signed-off-by: Paolo Bonzini
>> ---
>>  cpus-common.c              | 73 --
>>  docs/tcg-exclusive.promela | 53 +++--
>>  include/qom/cpu.h          |  5 ++--
>>  3 files changed, 117 insertions(+), 14 deletions(-)
>>
>> diff --git a/cpus-common.c b/cpus-common.c
>> index f7ad534..46cf8ef 100644
>> --- a/cpus-common.c
>> +++ b/cpus-common.c
>> @@ -184,8 +184,12 @@ void start_exclusive(void)
>>
>>      /* Make all other cpus stop executing.  */
>>      pending_cpus = 1;
>> +
>> +    /* Write pending_cpus before reading other_cpu->running.  */
>> +    smp_mb();
>>      CPU_FOREACH(other_cpu) {
>>          if (other_cpu->running) {
>> +            other_cpu->has_waiter = true;
>>              pending_cpus++;
>>              qemu_cpu_kick(other_cpu);
>>          }
>> @@ -212,24 +216,75 @@ void end_exclusive(void)
>>  /* Wait for exclusive ops to finish, and begin cpu execution.  */
>>  void cpu_exec_start(CPUState *cpu)
>>  {
>> -    qemu_mutex_lock(&qemu_cpu_list_mutex);
>> -    exclusive_idle();
>>      cpu->running = true;
>> -    qemu_mutex_unlock(&qemu_cpu_list_mutex);
>> +
>> +    /* Write cpu->running before reading pending_cpus.  */
>> +    smp_mb();
>> +
>> +    /* 1. start_exclusive saw cpu->running == true and pending_cpus >= 1.
>> +     * After taking the lock we'll see cpu->has_waiter == true and run---not
>> +     * for long because start_exclusive kicked us.  cpu_exec_end will
>> +     * decrement pending_cpus and signal the waiter.
>> +     *
>> +     * 2. start_exclusive saw cpu->running == false but pending_cpus >= 1.
>> +     * This includes the case when an exclusive item is running now.
>> +     * Then we'll see cpu->has_waiter == false and wait for the item to
>> +     * complete.
>> +     *
>> +     * 3. pending_cpus == 0.  Then start_exclusive is definitely going to
>> +     * see cpu->running == true, and it will kick the CPU.
>> +     */
>> +    if (pending_cpus) {
>> +        qemu_mutex_lock(&qemu_cpu_list_mutex);
>> +        if (!cpu->has_waiter) {
>> +            /* Not counted in pending_cpus, let the exclusive item
>> +             * run.  Since we have the lock, set cpu->running to true
>> +             * while holding it instead of retrying.
>> +             */
>> +            cpu->running = false;
>> +            exclusive_idle();
>> +            /* Now pending_cpus is zero.  */
>> +            cpu->running = true;
>> +        } else {
>> +            /* Counted in pending_cpus, go ahead.  */
>> +        }
>> +        qemu_mutex_unlock(&qemu_cpu_list_mutex);
>> +    }
>
> wrt scenario (3): I don't think other threads will always see
> cpu->running == true.  Consider the following:
>
>     cpu0                                 cpu1
>
>     cpu->running = true;                 pending_cpus = 1;
>     smp_mb();                            smp_mb();
>     if (pending_cpus) { /* false */ }    CPU_FOREACH(other_cpu) {
>                                              if (other_cpu->running) { /* false */ }
>                                          }
>
> The barriers here don't guarantee that changes are immediately visible
> to others (for that we need strong ops, i.e. atomics).

No, this is not true.  Barriers order stores and loads within a thread
_and_ establish synchronizes-with edges.

In the example above you are violating causality:

- cpu0 stores cpu->running before loading pending_cpus

- because pending_cpus == 0, cpu1 stores pending_cpus = 1 after cpu0
  loads it

- cpu1 loads cpu->running after it stores pending_cpus

hence the only valid ordering is

    cpu->running = true            (cpu0)
    if (pending_cpus)              (cpu0)
    pending_cpus = 1               (cpu1)
    if (other_cpu->running)        (cpu1)

> Is there a performance (scalability) reason behind this patch?

Yes: it speeds up all cpu_exec_start/end, _not_ start/end_exclusive.

With this patch, as long as there are no start/end_exclusive (which are
supposed to be rare) there is no contention on multiple CPUs doing
cpu_exec_start/end.

Without it, as CPUs increase, the global cpu_list_mutex is going to
become a bottleneck.

Paolo
Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
On Mon, Sep 19, 2016 at 14:50:59 +0200, Paolo Bonzini wrote:
> Set cpu->running without taking the cpu_list lock, only look at it if
> there is a concurrent exclusive section.  This requires adding a new
> field to CPUState, which records whether a running CPU is being counted
> in pending_cpus.  When an exclusive section is started concurrently with
> cpu_exec_start, cpu_exec_start can use the new field to wait for the end
> of the exclusive section.
>
> This is a separate patch for easier bisection of issues.
>
> Signed-off-by: Paolo Bonzini
> ---
>  cpus-common.c              | 73 --
>  docs/tcg-exclusive.promela | 53 +++--
>  include/qom/cpu.h          |  5 ++--
>  3 files changed, 117 insertions(+), 14 deletions(-)
>
> diff --git a/cpus-common.c b/cpus-common.c
> index f7ad534..46cf8ef 100644
> --- a/cpus-common.c
> +++ b/cpus-common.c
> @@ -184,8 +184,12 @@ void start_exclusive(void)
>
>      /* Make all other cpus stop executing.  */
>      pending_cpus = 1;
> +
> +    /* Write pending_cpus before reading other_cpu->running.  */
> +    smp_mb();
>      CPU_FOREACH(other_cpu) {
>          if (other_cpu->running) {
> +            other_cpu->has_waiter = true;
>              pending_cpus++;
>              qemu_cpu_kick(other_cpu);
>          }
> @@ -212,24 +216,75 @@ void end_exclusive(void)
>  /* Wait for exclusive ops to finish, and begin cpu execution.  */
>  void cpu_exec_start(CPUState *cpu)
>  {
> -    qemu_mutex_lock(&qemu_cpu_list_mutex);
> -    exclusive_idle();
>      cpu->running = true;
> -    qemu_mutex_unlock(&qemu_cpu_list_mutex);
> +
> +    /* Write cpu->running before reading pending_cpus.  */
> +    smp_mb();
> +
> +    /* 1. start_exclusive saw cpu->running == true and pending_cpus >= 1.
> +     * After taking the lock we'll see cpu->has_waiter == true and run---not
> +     * for long because start_exclusive kicked us.  cpu_exec_end will
> +     * decrement pending_cpus and signal the waiter.
> +     *
> +     * 2. start_exclusive saw cpu->running == false but pending_cpus >= 1.
> +     * This includes the case when an exclusive item is running now.
> +     * Then we'll see cpu->has_waiter == false and wait for the item to
> +     * complete.
> +     *
> +     * 3. pending_cpus == 0.  Then start_exclusive is definitely going to
> +     * see cpu->running == true, and it will kick the CPU.
> +     */
> +    if (pending_cpus) {
> +        qemu_mutex_lock(&qemu_cpu_list_mutex);
> +        if (!cpu->has_waiter) {
> +            /* Not counted in pending_cpus, let the exclusive item
> +             * run.  Since we have the lock, set cpu->running to true
> +             * while holding it instead of retrying.
> +             */
> +            cpu->running = false;
> +            exclusive_idle();
> +            /* Now pending_cpus is zero.  */
> +            cpu->running = true;
> +        } else {
> +            /* Counted in pending_cpus, go ahead.  */
> +        }
> +        qemu_mutex_unlock(&qemu_cpu_list_mutex);
> +    }

wrt scenario (3): I don't think other threads will always see
cpu->running == true.  Consider the following:

    cpu0                                 cpu1

    cpu->running = true;                 pending_cpus = 1;
    smp_mb();                            smp_mb();
    if (pending_cpus) { /* false */ }    CPU_FOREACH(other_cpu) {
                                             if (other_cpu->running) { /* false */ }
                                         }

The barriers here don't guarantee that changes are immediately visible
to others (for that we need strong ops, i.e. atomics).  So in the
example above, pending_cpus has been set to 1, but it might not yet be
visible by cpu0.  The same thing applies to cpu0->running; despite the
barrier, cpu1 might not yet perceive it, and could therefore miss
kicking cpu0 (and proceed while cpu0 executes).

Is there a performance (scalability) reason behind this patch?  I can
only think of a guest with many frequent atomics, which would be very
slow.  However, once the cmpxchg patchset goes in, those atomics will
be emulated without leaving the CPU loop.

If we want this to scale better without complicating things too much,
I'd focus on converting the exclusive_resume broadcast into a signal,
so that we avoid the thundering herd problem.  Not clear to me what
workloads would contend on start/end_exclusive though.

Thanks,

		E.