Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end

2016-09-26 Thread Paolo Bonzini


On 26/09/2016 09:28, Alex Bennée wrote:
> > cpu->running is only read under the mutex, but can be written _by the
> > owner thread only_ outside the mutex.  So writes outside the mutex must
> > be atomic, but writes under the mutex don't need to be, because:
> >
> > - no other thread ever writes to cpu->running
> >
> > - no other thread can be reading cpu->running
>
> Should we add some comments to cpu.h's definitions to make the rules clear?

I don't know... It's awfully easy for such documentation to get out of 
date.  Currently it says:

 * @running: #true if CPU is currently running (lockless).
 * @has_waiter: #true if a CPU is currently waiting for the cpu_exec_end;
 * valid under cpu_list_lock.

I think it's a good middle ground; it's kind of obvious that only the
CPU itself writes cpu->running.  I'm changing the cpu->running assignment
to atomic_set anyway, as suggested by Richard, and that makes it valid to
read cpu->running outside the lock.

Paolo



Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end

2016-09-26 Thread Alex Bennée

Paolo Bonzini  writes:

> On 24/09/2016 22:43, Richard Henderson wrote:
 I don't see that the cpu_list_lock protects the
 last two lines in any way.
>>>
>>> It does:
>>>
>>> qemu_mutex_lock(&qemu_cpu_list_lock);
>>
>> What I meant is that I don't see that the mutex avoids the need for
>> atomic_set.
>
> Oh, I see.
>
> cpu->running is only read under the mutex, but can be written _by the
> owner thread only_ outside the mutex.  So writes outside the mutex must
> be atomic, but writes under the mutex don't need to be, because:
>
> - no other thread ever writes to cpu->running
>
> - no other thread can be reading cpu->running

Should we add some comments to cpu.h's definitions to make the rules clear?

--
Alex Bennée



Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end

2016-09-26 Thread Paolo Bonzini


On 24/09/2016 22:43, Richard Henderson wrote:
>>> I don't see that the cpu_list_lock protects the
>>> last two lines in any way.
>>
>> It does:
>>
>> qemu_mutex_lock(&qemu_cpu_list_lock);
> 
> What I meant is that I don't see that the mutex avoids the need for
> atomic_set.

Oh, I see.

cpu->running is only read under the mutex, but can be written _by the
owner thread only_ outside the mutex.  So writes outside the mutex must
be atomic, but writes under the mutex don't need to be, because:

- no other thread ever writes to cpu->running

- no other thread can be reading cpu->running

Paolo



Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end

2016-09-24 Thread Richard Henderson

On 09/24/2016 04:52 AM, Paolo Bonzini wrote:



> - Original Message -
>> From: "Richard Henderson" 
>> To: "Paolo Bonzini" , qemu-devel@nongnu.org
>> Cc: "serge fdrv" , c...@braap.org, "alex bennee" , "sergey fedorov"
>> Sent: Friday, September 23, 2016 8:23:46 PM
>> Subject: Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
>>
>> On 09/23/2016 12:31 AM, Paolo Bonzini wrote:
>>> +if (atomic_read(&other_cpu->running)) {
>> ...
>>> +atomic_set(&cpu->running, true);
>> ...
>>> +cpu->running = false;
>> ...
>>> +cpu->running = true;
>>
>> Inconsistent use of atomics.  I don't see that the cpu_list_lock protects the
>> last two lines in any way.
>
> It does:
>
> qemu_mutex_lock(&qemu_cpu_list_lock);

What I meant is that I don't see that the mutex avoids the need for atomic_set.

> but I can change it anyway to atomic_set.


Thanks,


r~



Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end

2016-09-24 Thread Paolo Bonzini


- Original Message -
> From: "Richard Henderson" 
> To: "Paolo Bonzini" , qemu-devel@nongnu.org
> Cc: "serge fdrv" , c...@braap.org, "alex bennee" 
> , "sergey fedorov"
> 
> Sent: Friday, September 23, 2016 8:23:46 PM
> Subject: Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for 
> cpu_exec_start/end
> 
> On 09/23/2016 12:31 AM, Paolo Bonzini wrote:
> > +if (atomic_read(&other_cpu->running)) {
> ...
> > +atomic_set(&cpu->running, true);
> ...
> > +cpu->running = false;
> ...
> > +cpu->running = true;
> 
> Inconsistent use of atomics.  I don't see that the cpu_list_lock protects the
> last two lines in any way.

It does:

qemu_mutex_lock(&qemu_cpu_list_lock);
if (!cpu->has_waiter) {
    /* Not counted in pending_cpus, let the exclusive item
     * run.  Since we have the lock, just set cpu->running to true
     * while holding it; no need to check pending_cpus again.
     */
    cpu->running = false;
    exclusive_idle();
    /* Now pending_cpus is zero.  */
    cpu->running = true;
} else {
    /* Counted in pending_cpus, go ahead and release the
     * waiter at cpu_exec_end.
     */
}
qemu_mutex_unlock(&qemu_cpu_list_lock);

but I can change it anyway to atomic_set.
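
Concretely, that change would presumably just annotate the writes in the
locked slow path as well, along these lines (a sketch, not the final patch):

qemu_mutex_lock(&qemu_cpu_list_lock);
if (!cpu->has_waiter) {
    /* Not counted in pending_cpus, let the exclusive item run.
     * atomic_set() is not needed for correctness here (only the owner
     * thread ever writes cpu->running), but it keeps every write to the
     * field annotated the same way as on the lock-free fast path.
     */
    atomic_set(&cpu->running, false);
    exclusive_idle();
    /* Now pending_cpus is zero.  */
    atomic_set(&cpu->running, true);
} else {
    /* Counted in pending_cpus, go ahead and release the
     * waiter at cpu_exec_end.
     */
}
qemu_mutex_unlock(&qemu_cpu_list_lock);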

Paolo



Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end

2016-09-23 Thread Richard Henderson

On 09/23/2016 12:31 AM, Paolo Bonzini wrote:

> +if (atomic_read(&other_cpu->running)) {
...
> +atomic_set(&cpu->running, true);
...
> +cpu->running = false;
...
> +cpu->running = true;


Inconsistent use of atomics.  I don't see that the cpu_list_lock protects the 
last two lines in any way.



r~



Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end

2016-09-22 Thread Paolo Bonzini


On 22/09/2016 00:27, Emilio G. Cota wrote:
> Another suggestion is to consistently access pending_cpus atomically,
> since now we're accessing it with and without the CPU list mutex held.

Yeah, that's a bit of a pain in the ass, but it's a good idea.
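
In practice it would look something like the sketch below (hypothetical
helper names, not the actual cpus-common.c code):

/* pending_cpus is now read on the lock-free fast path, so annotate every
 * access to it; sketch only.
 */
static int pending_cpus;

/* Fast path, CPU list mutex not held: only peek at the value.  */
static bool exclusive_pending(void)
{
    return atomic_read(&pending_cpus) != 0;
}

/* Slow path, caller holds the CPU list mutex.  The atomic_set() is not
 * needed for mutual exclusion; it simply pairs the locked writers with
 * the lock-free readers above.
 */
static void pending_cpus_set_locked(int value)
{
    atomic_set(&pending_cpus, value);
}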

Paolo



Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end

2016-09-21 Thread Emilio G. Cota
On Mon, Sep 19, 2016 at 14:50:59 +0200, Paolo Bonzini wrote:
(snip)
> @@ -212,24 +216,75 @@ void end_exclusive(void)
>  /* Wait for exclusive ops to finish, and begin cpu execution.  */
>  void cpu_exec_start(CPUState *cpu)
>  {
> -qemu_mutex_lock(&qemu_cpu_list_mutex);
> -exclusive_idle();
>  cpu->running = true;
> -qemu_mutex_unlock(&qemu_cpu_list_mutex);
> +
> +/* Write cpu->running before reading pending_cpus.  */
> +smp_mb();
> +
> +/* 1. start_exclusive saw cpu->running == true and pending_cpus >= 1.
> + * After taking the lock we'll see cpu->has_waiter == true and run---not
> + * for long because start_exclusive kicked us.  cpu_exec_end will
> + * decrement pending_cpus and signal the waiter.
> + *
> + * 2. start_exclusive saw cpu->running == false but pending_cpus >= 1.
> + * This includes the case when an exclusive item is running now.
> + * Then we'll see cpu->has_waiter == false and wait for the item to
> + * complete.
> + *
> + * 3. pending_cpus == 0.  Then start_exclusive is definitely going to
> + * see cpu->running == true, and it will kick the CPU.
> + */
> +if (pending_cpus) {

I'd consider doing
if (unlikely(pending_cpus)) {
since the exclusive is a slow path and will be more so in the near future.
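
For illustration, the fast path would then read like this sketch (unlikely()
is QEMU's __builtin_expect wrapper; the locked slow path is elided and stays
exactly as in the hunk below):

if (unlikely(pending_cpus)) {      /* exclusive sections are expected to be rare */
    qemu_mutex_lock(&qemu_cpu_list_mutex);
    /* ... slow path as in the patch ... */
    qemu_mutex_unlock(&qemu_cpu_list_mutex);
}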

> +qemu_mutex_lock(&qemu_cpu_list_mutex);
> +if (!cpu->has_waiter) {
> +/* Not counted in pending_cpus, let the exclusive item
> + * run.  Since we have the lock, set cpu->running to true
> + * while holding it instead of retrying.
> + */
> +cpu->running = false;
> +exclusive_idle();
> +/* Now pending_cpus is zero.  */
> +cpu->running = true;
> +} else {
> +/* Counted in pending_cpus, go ahead.  */
> +}
> +qemu_mutex_unlock(&qemu_cpu_list_mutex);
> +}
>  }
>  
>  /* Mark cpu as not executing, and release pending exclusive ops.  */
>  void cpu_exec_end(CPUState *cpu)
>  {
> -qemu_mutex_lock(&qemu_cpu_list_mutex);
>  cpu->running = false;
> -if (pending_cpus > 1) {
> -pending_cpus--;
> -if (pending_cpus == 1) {
> -qemu_cond_signal(&exclusive_cond);
> +
> +/* Write cpu->running before reading pending_cpus.  */
> +smp_mb();
> +
> +/* 1. start_exclusive saw cpu->running == true.  Then it will increment
> + * pending_cpus and wait for exclusive_cond.  After taking the lock
> + * we'll see cpu->has_waiter == true.
> + *
> + * 2. start_exclusive saw cpu->running == false but here pending_cpus >= 1.
> + * This includes the case when an exclusive item started after setting
> + * cpu->running to false and before we read pending_cpus.  Then we'll see
> + * cpu->has_waiter == false and not touch pending_cpus.  The next call to
> + * cpu_exec_start will run exclusive_idle if still necessary, thus waiting
> + * for the item to complete.
> + *
> + * 3. pending_cpus == 0.  Then start_exclusive is definitely going to
> + * see cpu->running == false, and it can ignore this CPU until the
> + * next cpu_exec_start.
> + */
> +if (pending_cpus) {

ditto

> +qemu_mutex_lock(&qemu_cpu_list_mutex);
> +if (cpu->has_waiter) {
> +cpu->has_waiter = false;
> +if (--pending_cpus == 1) {
> +qemu_cond_signal(&exclusive_cond);
> +}
(snip)

Another suggestion is to consistently access pending_cpus atomically,
since now we're accessing it with and without the CPU list mutex held.

Thanks,

Emilio



Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end

2016-09-21 Thread Emilio G. Cota
On Wed, Sep 21, 2016 at 20:19:18 +0200, Paolo Bonzini wrote:
(snip)
> No, this is not true.  Barriers order stores and loads within a thread
> _and_ establish synchronizes-with edges.
> 
> In the example above you are violating causality:
> 
> - cpu0 stores cpu->running before loading pending_cpus
> 
> - because pending_cpus == 0, cpu1 stores pending_cpus = 1 after cpu0
> loads it
> 
> - cpu1 loads cpu->running after it stores pending_cpus

OK. So I simplified the example to understand this better:

cpu0                     cpu1

      { x = y = 0; r0 and r1 are private variables }

x = 1                    y = 1
smp_mb()                 smp_mb()
r0 = y                   r1 = x

Turns out this is scenario 10 here: https://lwn.net/Articles/573436/

The source of my confusion was not paying due attention to smp_mb,
which is necessary for maintaining transitivity.
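
For completeness, the same store-buffering test can be written as a
self-contained C11 program, using atomic_thread_fence(memory_order_seq_cst)
as a stand-in for QEMU's smp_mb(); with both fences in place the outcome
r0 == 0 && r1 == 0 is forbidden, which is exactly the ordering property at
issue (illustration only, not QEMU code):

/* Store-buffering (SB) litmus test, sketch in C11 + pthreads.
 * With the two seq_cst fences the assertion below cannot fire;
 * drop either fence and r0 == r1 == 0 becomes a legal outcome.
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x, y;      /* play the roles of cpu->running / pending_cpus */
static int r0, r1;

static void *cpu0_fn(void *arg)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);    /* cpu->running = true */
    atomic_thread_fence(memory_order_seq_cst);              /* smp_mb() */
    r0 = atomic_load_explicit(&y, memory_order_relaxed);   /* read pending_cpus */
    return NULL;
}

static void *cpu1_fn(void *arg)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);    /* pending_cpus = 1 */
    atomic_thread_fence(memory_order_seq_cst);              /* smp_mb() */
    r1 = atomic_load_explicit(&x, memory_order_relaxed);   /* read other_cpu->running */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;

    pthread_create(&t0, NULL, cpu0_fn, NULL);
    pthread_create(&t1, NULL, cpu1_fn, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);

    /* At least one thread must see the other's store.  */
    assert(r0 == 1 || r1 == 1);
    return 0;
}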

> > Is there a performance (scalability) reason behind this patch?
> 
> Yes: it speeds up all cpu_exec_start/end, _not_ start/end_exclusive.
> 
> With this patch, as long as there are no start/end_exclusive (which are
> supposed to be rare) there is no contention on multiple CPUs doing
> cpu_exec_start/end.
> 
> Without it, as CPUs increase, the global cpu_list_mutex is going to
> become a bottleneck.

I see. Scalability-wise I wouldn't expect much improvement with MTTCG
full-system, given that the iothread lock is still acquired on every
CPU loop exit (just like in KVM). However, for user-mode this should
yield measurable improvements =D

Thanks,

E.



Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end

2016-09-21 Thread Paolo Bonzini


On 21/09/2016 19:24, Emilio G. Cota wrote:
> On Mon, Sep 19, 2016 at 14:50:59 +0200, Paolo Bonzini wrote:
>> Set cpu->running without taking the cpu_list lock, only look at it if
>> there is a concurrent exclusive section.  This requires adding a new
>> field to CPUState, which records whether a running CPU is being counted
>> in pending_cpus.  When an exclusive section is started concurrently with
>> cpu_exec_start, cpu_exec_start can use the new field to wait for the end
>> of the exclusive section.
>>
>> This is a separate patch for easier bisection of issues.
>>
>> Signed-off-by: Paolo Bonzini 
>> ---
>>  cpus-common.c              | 73 --
>>  docs/tcg-exclusive.promela | 53 +++--
>>  include/qom/cpu.h          |  5 ++--
>>  3 files changed, 117 insertions(+), 14 deletions(-)
>>
>> diff --git a/cpus-common.c b/cpus-common.c
>> index f7ad534..46cf8ef 100644
>> --- a/cpus-common.c
>> +++ b/cpus-common.c
>> @@ -184,8 +184,12 @@ void start_exclusive(void)
>>  
>>  /* Make all other cpus stop executing.  */
>>  pending_cpus = 1;
>> +
>> +/* Write pending_cpus before reading other_cpu->running.  */
>> +smp_mb();
>>  CPU_FOREACH(other_cpu) {
>>  if (other_cpu->running) {
>> +other_cpu->has_waiter = true;
>>  pending_cpus++;
>>  qemu_cpu_kick(other_cpu);
>>  }
>> @@ -212,24 +216,75 @@ void end_exclusive(void)
>>  /* Wait for exclusive ops to finish, and begin cpu execution.  */
>>  void cpu_exec_start(CPUState *cpu)
>>  {
>> -qemu_mutex_lock(&qemu_cpu_list_mutex);
>> -exclusive_idle();
>>  cpu->running = true;
>> -qemu_mutex_unlock(&qemu_cpu_list_mutex);
>> +
>> +/* Write cpu->running before reading pending_cpus.  */
>> +smp_mb();
>> +
>> +/* 1. start_exclusive saw cpu->running == true and pending_cpus >= 1.
>> + * After taking the lock we'll see cpu->has_waiter == true and run---not
>> + * for long because start_exclusive kicked us.  cpu_exec_end will
>> + * decrement pending_cpus and signal the waiter.
>> + *
>> + * 2. start_exclusive saw cpu->running == false but pending_cpus >= 1.
>> + * This includes the case when an exclusive item is running now.
>> + * Then we'll see cpu->has_waiter == false and wait for the item to
>> + * complete.
>> + *
>> + * 3. pending_cpus == 0.  Then start_exclusive is definitely going to
>> + * see cpu->running == true, and it will kick the CPU.
>> + */
>> +if (pending_cpus) {
>> +qemu_mutex_lock(&qemu_cpu_list_mutex);
>> +if (!cpu->has_waiter) {
>> +/* Not counted in pending_cpus, let the exclusive item
>> + * run.  Since we have the lock, set cpu->running to true
>> + * while holding it instead of retrying.
>> + */
>> +cpu->running = false;
>> +exclusive_idle();
>> +/* Now pending_cpus is zero.  */
>> +cpu->running = true;
>> +} else {
>> +/* Counted in pending_cpus, go ahead.  */
>> +}
>> +qemu_mutex_unlock(&qemu_cpu_list_mutex);
>> +}
> 
> wrt scenario (3): I don't think other threads will always see cpu->running == true.
> Consider the following:
> 
> cpu0                                 cpu1
> 
> cpu->running = true;                 pending_cpus = 1;
> smp_mb();                            smp_mb();
> if (pending_cpus) { /* false */ }    CPU_FOREACH(other_cpu) { if (other_cpu->running) { /* false */ } }
> 
> The barriers here don't guarantee that changes are immediately visible to others
> (for that we need strong ops, i.e. atomics).

No, this is not true.  Barriers order stores and loads within a thread
_and_ establish synchronizes-with edges.

In the example above you are violating causality:

- cpu0 stores cpu->running before loading pending_cpus

- because pending_cpus == 0, cpu1 stores pending_cpus = 1 after cpu0
loads it

- cpu1 loads cpu->running after it stores pending_cpus

hence the only valid ordering is

  cpu->running = true             (cpu0: store)
  if (pending_cpus)               (cpu0: load)
  pending_cpus = 1                (cpu1: store)
  if (other_cpu->running)         (cpu1: load)

> Is there a performance (scalability) reason behind this patch?

Yes: it speeds up all cpu_exec_start/end, _not_ start/end_exclusive.

With this patch, as long as there are no start/end_exclusive (which are
supposed to be rare) there is no contention on multiple CPUs doing
cpu_exec_start/end.

Without it, as CPUs increase, the global cpu_list_mutex is going to
become a bottleneck.

Paolo



Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end

2016-09-21 Thread Emilio G. Cota
On Mon, Sep 19, 2016 at 14:50:59 +0200, Paolo Bonzini wrote:
> Set cpu->running without taking the cpu_list lock, only look at it if
> there is a concurrent exclusive section.  This requires adding a new
> field to CPUState, which records whether a running CPU is being counted
> in pending_cpus.  When an exclusive section is started concurrently with
> cpu_exec_start, cpu_exec_start can use the new field to wait for the end
> of the exclusive section.
> 
> This is a separate patch for easier bisection of issues.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  cpus-common.c              | 73 --
>  docs/tcg-exclusive.promela | 53 +++--
>  include/qom/cpu.h          |  5 ++--
>  3 files changed, 117 insertions(+), 14 deletions(-)
> 
> diff --git a/cpus-common.c b/cpus-common.c
> index f7ad534..46cf8ef 100644
> --- a/cpus-common.c
> +++ b/cpus-common.c
> @@ -184,8 +184,12 @@ void start_exclusive(void)
>  
>  /* Make all other cpus stop executing.  */
>  pending_cpus = 1;
> +
> +/* Write pending_cpus before reading other_cpu->running.  */
> +smp_mb();
>  CPU_FOREACH(other_cpu) {
>  if (other_cpu->running) {
> +other_cpu->has_waiter = true;
>  pending_cpus++;
>  qemu_cpu_kick(other_cpu);
>  }
> @@ -212,24 +216,75 @@ void end_exclusive(void)
>  /* Wait for exclusive ops to finish, and begin cpu execution.  */
>  void cpu_exec_start(CPUState *cpu)
>  {
> -qemu_mutex_lock(&qemu_cpu_list_mutex);
> -exclusive_idle();
>  cpu->running = true;
> -qemu_mutex_unlock(&qemu_cpu_list_mutex);
> +
> +/* Write cpu->running before reading pending_cpus.  */
> +smp_mb();
> +
> +/* 1. start_exclusive saw cpu->running == true and pending_cpus >= 1.
> + * After taking the lock we'll see cpu->has_waiter == true and run---not
> + * for long because start_exclusive kicked us.  cpu_exec_end will
> + * decrement pending_cpus and signal the waiter.
> + *
> + * 2. start_exclusive saw cpu->running == false but pending_cpus >= 1.
> + * This includes the case when an exclusive item is running now.
> + * Then we'll see cpu->has_waiter == false and wait for the item to
> + * complete.
> + *
> + * 3. pending_cpus == 0.  Then start_exclusive is definitely going to
> + * see cpu->running == true, and it will kick the CPU.
> + */
> +if (pending_cpus) {
> +qemu_mutex_lock(&qemu_cpu_list_mutex);
> +if (!cpu->has_waiter) {
> +/* Not counted in pending_cpus, let the exclusive item
> + * run.  Since we have the lock, set cpu->running to true
> + * while holding it instead of retrying.
> + */
> +cpu->running = false;
> +exclusive_idle();
> +/* Now pending_cpus is zero.  */
> +cpu->running = true;
> +} else {
> +/* Counted in pending_cpus, go ahead.  */
> +}
> +qemu_mutex_unlock(&qemu_cpu_list_mutex);
> +}

wrt scenario (3): I don't think other threads will always see cpu->running == true.
Consider the following:

cpu0                                 cpu1

cpu->running = true;                 pending_cpus = 1;
smp_mb();                            smp_mb();
if (pending_cpus) { /* false */ }    CPU_FOREACH(other_cpu) { if (other_cpu->running) { /* false */ } }

The barriers here don't guarantee that changes are immediately visible to others
(for that we need strong ops, i.e. atomics).
So in the example above, pending_cpus has been set to 1, but it might not
yet be visible to cpu0. The same thing applies to cpu0->running; despite
the barrier, cpu1 might not yet perceive it, and could therefore miss kicking
cpu0 (and proceed while cpu0 executes).

Is there a performance (scalability) reason behind this patch? I can only
think of a guest with many frequent atomics, which would be very slow. However,
once the cmpxchg patchset goes in, those atomics will be emulated without
leaving the CPU loop.

If we want this to scale better without complicating things too much,
I'd focus on converting the exclusive_resume broadcast into a signal,
so that we avoid the thundering herd problem. Not clear to me what workloads
would contend on start/end_exclusive though.

Thanks,

E.