Re: smp_mb__after_spinlock requirement too strong?

2018-03-12 Thread 焦晓冬
On Mon, Mar 12, 2018 at 9:24 PM, Andrea Parri  wrote:
> Hi Trol,
>
> [...]
>
>
>> But this is just one special case that acquire-release chains promise us.
>>
>> A=B=0 as initial
>>
>>   CPU0          CPU1          CPU2          CPU3
>>   write A=1
>>                 read A=1
>>                 write B=1
>>                 release X
>>                               acquire X
>>                               read A=?
>>                               release Y
>>                                             acquire Y
>>                                             read B=?
>>
>> assurance 1: CPU3 will surely see B=1 writing by CPU1, and
>> assurance 2: CPU2 will also see A=1 writing by CPU0 as a special case
>>
>> The second assurance is both in theory and implemented by real hardware.
>>
>> As for theory, the C++11 memory model, which is a potential formal model
>> for kernel memory model as
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0124r4.html
>> describes, states that:
>>
>> If a value computation A of an atomic object M happens before a value
>> computation B of M, and A takes its value from a side effect X on M, then
>> the value computed by B shall either be the value stored by X or the value
>> stored by a side effect Y on M, where Y follows X in the modification
>> order of M.
>
> A formal memory consistency model for the Linux kernel is now available at:
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git lkmm
>
> Commit
>
>   1c27b644c0fdbc61e113b8faee14baeb8df32486
>   ("Automate memory-barriers.txt; provide Linux-kernel memory model")
>
> provides some information (and references) on the development of this work.
>
> ---
>
> You can check the above observation against this model: unless I mis-typed
> your snippet,
>
> andrea@andrea:~/linux-rcu/tools/memory-model$ cat trol0.litmus
> C trol0
>
> {}
>
> P0(int *a)
> {
> WRITE_ONCE(*a, 1);
> }
>
> P1(int *a, int *b, int *x)
> {
> int r0;
>
> r0 = READ_ONCE(*a);
> WRITE_ONCE(*b, 1);
> smp_store_release(x, 1);
> }
>
> P2(int *a, int *x, int *y)
> {
> int r0;
> int r1;
>
> r0 = smp_load_acquire(x);
> r1 = READ_ONCE(*a);
> smp_store_release(y, 1);
> }
>
> P3(int *b, int *y)
> {
> int r0;
> int r1;
>
> r0 = smp_load_acquire(y);
> r1 = READ_ONCE(*b);
> }
>
> exists (1:r0=1 /\ 2:r0=1 /\ 3:r0=1 /\ (2:r1=0 \/ 3:r1=0))
>
> andrea@andrea:~/linux-rcu/tools/memory-model$ herd7 -conf linux-kernel.cfg 
> trol0.litmus
> Test trol0 Allowed
> States 25
> 1:r0=0; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=0;
> 1:r0=0; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=1;
> 1:r0=0; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=0;
> 1:r0=0; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=1;
> 1:r0=0; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=0;
> 1:r0=0; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=1;
> 1:r0=0; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=0;
> 1:r0=0; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=1;
> 1:r0=0; 2:r0=1; 2:r1=0; 3:r0=0; 3:r1=0;
> 1:r0=0; 2:r0=1; 2:r1=0; 3:r0=0; 3:r1=1;
> 1:r0=0; 2:r0=1; 2:r1=0; 3:r0=1; 3:r1=1;
> 1:r0=0; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=0;
> 1:r0=0; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=1;
> 1:r0=0; 2:r0=1; 2:r1=1; 3:r0=1; 3:r1=1;
> 1:r0=1; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=0;
> 1:r0=1; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=1;
> 1:r0=1; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=0;
> 1:r0=1; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=1;
> 1:r0=1; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=0;
> 1:r0=1; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=1;
> 1:r0=1; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=0;
> 1:r0=1; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=1;
> 1:r0=1; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=0;
> 1:r0=1; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=1;
> 1:r0=1; 2:r0=1; 2:r1=1; 3:r0=1; 3:r1=1;
> No
> Witnesses
> Positive: 0 Negative: 25
> Condition exists (1:r0=1 /\ 2:r0=1 /\ 3:r0=1 /\ (2:r1=0 \/ 3:r1=0))
> Observation trol0 Never 0 25
> Time trol0 0.03
> Hash=21369772c98e442dd382bd84b43067ee
>
> Please see "tools/memory-model/README" or "tools/memory-model/Documentation/"
> for further information about these tools/model.
>
> Best,
>   Andrea
>

This work is amazingly great, Andrea.
I'd like to study it.

>
>>
>> at
>> $1.10 rule 18, on page 14
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf
>>
>> As for real hardware, Luc provided detailed test and explanation on
>> ARM and POWER in 5.1 Cumulative Barriers for WRC  on page 19
>> in this paper:
>>
>> A Tutorial Introduction to the ARM and POWER Relaxed Memory Models
>> https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>>
>> So, I think we may remove RCsc from smp_mb__after_spinlock which is
>> really confusing.
>>
>> Best Regards,
>> Trol
>>
>> >
>> >> And for stopped tasks,
>> >>
>> >>  CPU0 CPU1CPU2
>> >>
>> >> 
>> >>
>> >> lock(rq0)
>> >> schedule out A
>> >> remove A from rq0
>> >> store-release(A->on_cpu)
>> >> unock(rq0)
>> >>
>> >>   load_acquire(A->on_cpu)
>> >>   set_task_cpu(A, 2)
>> >>
>> >>   lock(rq2)
>> >>   add A into rq2
>> >> 

Re: smp_mb__after_spinlock requirement too strong?

2018-03-12 Thread Peter Zijlstra
On Mon, Mar 12, 2018 at 05:13:03PM +0800, 焦晓冬 wrote:

> If the fixed comment could point out where this RCsc is used, it will be 
> great.

The one that comes to mind first is RCU-sched, that relies on a context
switch implying a full smp_mb(). But I think there's more..


Re: smp_mb__after_spinlock requirement too strong?

2018-03-12 Thread Andrea Parri
Hi Trol,

[...]


> But this is just one special case that acquire-release chains promise us.
> 
> A=B=0 as initial
> 
>   CPU0          CPU1          CPU2          CPU3
>   write A=1
>                 read A=1
>                 write B=1
>                 release X
>                               acquire X
>                               read A=?
>                               release Y
>                                             acquire Y
>                                             read B=?
> 
> assurance 1: CPU3 will surely see B=1 writing by CPU1, and
> assurance 2: CPU2 will also see A=1 writing by CPU0 as a special case
> 
> The second assurance is both in theory and implemented by real hardware.
> 
> As for theory, the C++11 memory model, which is a potential formal model
> for kernel memory model as
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0124r4.html
> describes, states that:
> 
> If a value computation A of an atomic object M happens before a value
> computation B of M, and A takes its value from a side effect X on M, then
> the value computed by B shall either be the value stored by X or the value
> stored by a side effect Y on M, where Y follows X in the modification
> order of M.

A formal memory consistency model for the Linux kernel is now available at:

 git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git lkmm

Commit

  1c27b644c0fdbc61e113b8faee14baeb8df32486
  ("Automate memory-barriers.txt; provide Linux-kernel memory model")

provides some information (and references) on the development of this work.

---

You can check the above observation against this model: unless I mis-typed
your snippet,

andrea@andrea:~/linux-rcu/tools/memory-model$ cat trol0.litmus
C trol0

{}

P0(int *a)
{
WRITE_ONCE(*a, 1);
}

P1(int *a, int *b, int *x)
{
int r0;

r0 = READ_ONCE(*a);
WRITE_ONCE(*b, 1);
smp_store_release(x, 1);
}

P2(int *a, int *x, int *y)
{
int r0;
int r1;

r0 = smp_load_acquire(x);
r1 = READ_ONCE(*a);
smp_store_release(y, 1);
}

P3(int *b, int *y)
{
int r0;
int r1;

r0 = smp_load_acquire(y);
r1 = READ_ONCE(*b);
}

exists (1:r0=1 /\ 2:r0=1 /\ 3:r0=1 /\ (2:r1=0 \/ 3:r1=0))

andrea@andrea:~/linux-rcu/tools/memory-model$ herd7 -conf linux-kernel.cfg 
trol0.litmus
Test trol0 Allowed
States 25
1:r0=0; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=0;
1:r0=0; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=1;
1:r0=0; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=0;
1:r0=0; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=1;
1:r0=0; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=0;
1:r0=0; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=1;
1:r0=0; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=0;
1:r0=0; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=1;
1:r0=0; 2:r0=1; 2:r1=0; 3:r0=0; 3:r1=0;
1:r0=0; 2:r0=1; 2:r1=0; 3:r0=0; 3:r1=1;
1:r0=0; 2:r0=1; 2:r1=0; 3:r0=1; 3:r1=1;
1:r0=0; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=0;
1:r0=0; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=1;
1:r0=0; 2:r0=1; 2:r1=1; 3:r0=1; 3:r1=1;
1:r0=1; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=0;
1:r0=1; 2:r0=0; 2:r1=0; 3:r0=0; 3:r1=1;
1:r0=1; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=0;
1:r0=1; 2:r0=0; 2:r1=0; 3:r0=1; 3:r1=1;
1:r0=1; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=0;
1:r0=1; 2:r0=0; 2:r1=1; 3:r0=0; 3:r1=1;
1:r0=1; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=0;
1:r0=1; 2:r0=0; 2:r1=1; 3:r0=1; 3:r1=1;
1:r0=1; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=0;
1:r0=1; 2:r0=1; 2:r1=1; 3:r0=0; 3:r1=1;
1:r0=1; 2:r0=1; 2:r1=1; 3:r0=1; 3:r1=1;
No
Witnesses
Positive: 0 Negative: 25
Condition exists (1:r0=1 /\ 2:r0=1 /\ 3:r0=1 /\ (2:r1=0 \/ 3:r1=0))
Observation trol0 Never 0 25
Time trol0 0.03
Hash=21369772c98e442dd382bd84b43067ee

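(A note on reading the herd7 output above: "Positive: 0 Negative: 25" and "Observation trol0 Never 0 25" mean that none of the 25 reachable final states satisfies the exists clause, i.e. the model forbids the outcome in which the whole release/acquire chain is observed but 2:r1 or 3:r1 is still 0.)
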
Please see "tools/memory-model/README" or "tools/memory-model/Documentation/"
for further information about these tools/model.

Best,
  Andrea


> 
> at
> $1.10 rule 18, on page 14
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf
> 
> As for real hardware, Luc provided detailed test and explanation on
> ARM and POWER in 5.1 Cumulative Barriers for WRC  on page 19
> in this paper:
> 
> A Tutorial Introduction to the ARM and POWER Relaxed Memory Models
> https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
> 
> So, I think we may remove RCsc from smp_mb__after_spinlock which is
> really confusing.
> 
> Best Regards,
> Trol
> 
> >
> >> And for stopped tasks,
> >>
> >>  CPU0 CPU1CPU2
> >>
> >> 
> >>
> >> lock(rq0)
> >> schedule out A
> >> remove A from rq0
> >> store-release(A->on_cpu)
> >> unock(rq0)
> >>
> >>   load_acquire(A->on_cpu)
> >>   set_task_cpu(A, 2)
> >>
> >>   lock(rq2)
> >>   add A into rq2
> >>   unlock(rq2)
> >>
> >> lock(rq2)
> >> schedule in A
> >> unlock(rq2)
> >>
> >> 
> >>
> >>  happens-before
> >> store-release(A->on_cpu)  happens-before
> >> load_acquire(A->on_cpu)  

Re: smp_mb__after_spinlock requirement too strong?

2018-03-12 Thread 焦晓冬
On Mon, Mar 12, 2018 at 4:56 PM, Peter Zijlstra  wrote:
> On Mon, Mar 12, 2018 at 04:56:00PM +0800, Boqun Feng wrote:
>> So I think the purpose of smp_mb__after_spinlock() is to provide RCsc
>> locks, it's just the comments before that may be misleading. We want
>> RCsc locks in schedule code because we want writes in different critical
>> section are ordered even outside the critical sections, for case like:
>>
>>   CPU 0               CPU 1               CPU 2
>>
>>   {A = 0, B = 0}
>>
>>   lock(rq0);
>>   write A=1;
>>   unlock(rq0);
>>
>>                       lock(rq0);
>>                       read A=1;
>>                       write B=2;
>>                       unlock(rq0);
>>
>>                                           read B=2;
>>                                           smp_rmb();
>>                                           read A=1;
>>
>> I think we need to fix the comments rather than lose the requirement.
>> Peter?
>
> Yes, ISTR people relying on schedule() being RCsc, and I just picked a
> bad example.

Hi Peter,
If the fixed comment could point out where this RCsc is used, that would be great.


Re: smp_mb__after_spinlock requirement too strong?

2018-03-12 Thread Peter Zijlstra
On Mon, Mar 12, 2018 at 04:56:00PM +0800, Boqun Feng wrote:
> So I think the purpose of smp_mb__after_spinlock() is to provide RCsc
> locks, it's just the comments before that may be misleading. We want
> RCsc locks in schedule code because we want writes in different critical
> section are ordered even outside the critical sections, for case like:
> 
>   CPU 0               CPU 1               CPU 2
>
>   {A = 0, B = 0}
>
>   lock(rq0);
>   write A=1;
>   unlock(rq0);
>
>                       lock(rq0);
>                       read A=1;
>                       write B=2;
>                       unlock(rq0);
>
>                                           read B=2;
>                                           smp_rmb();
>                                           read A=1;
> 
> I think we need to fix the comments rather than lose the requirement.
> Peter?

Yes, ISTR people relying on schedule() being RCsc, and I just picked a
bad example.

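Boqun's example above can also be written as an LKMM litmus test in the same style as Andrea's trol0.litmus shown earlier in this listing. This is only a sketch: the test name, the variable names and the exists clause are illustrative additions, not anything posted in the thread, and it assumes the spin_lock()/spin_unlock()/smp_mb__after_spinlock() primitives provided by tools/memory-model's linux-kernel.def:

C rq-rcsc-sketch

(*
 * Sketch of Boqun's example; names and exists clause are illustrative,
 * not taken from the thread.  P0 and P1 are the two rq0 critical
 * sections, P2 is the lockless reader.
 *)

{}

P0(spinlock_t *rq0, int *a)
{
        spin_lock(rq0);
        smp_mb__after_spinlock();
        WRITE_ONCE(*a, 1);              /* write A=1 */
        spin_unlock(rq0);
}

P1(spinlock_t *rq0, int *a, int *b)
{
        int r0;

        spin_lock(rq0);
        smp_mb__after_spinlock();
        r0 = READ_ONCE(*a);             /* read A */
        WRITE_ONCE(*b, 2);              /* write B=2 */
        spin_unlock(rq0);
}

P2(int *a, int *b)
{
        int r0;
        int r1;

        r0 = READ_ONCE(*b);             /* read B */
        smp_rmb();
        r1 = READ_ONCE(*a);             /* read A */
}

exists (1:r0=1 /\ 2:r0=2 /\ 2:r1=0)

The exists clause asks whether CPU 2 can observe B==2 yet still observe A==0 even though CPU 1's critical section read A==1. With the smp_mb__after_spinlock() calls in place I would expect herd7 to report this as unreachable, which is the cross-critical-section store ordering Boqun describes; whether it stays forbidden once those calls are removed is exactly the RCsc question being debated.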

Re: smp_mb__after_spinlock requirement too strong?

2018-03-12 Thread Boqun Feng
On Mon, Mar 12, 2018 at 04:18:00PM +0800, 焦晓冬 wrote:
> >> Peter pointed out in this patch https://patchwork.kernel.org/patch/9771921/
> >> that the spinning-lock used at __schedule() should be RCsc to ensure
> >> visibility of writes prior to __schedule when the task is to be migrated to
> >> another CPU.
> >>
> >> And this is emphasized at the comment of the newly introduced
> >> smp_mb__after_spinlock(),
> >>
> >>  * This barrier must provide two things:
> >>  *
> >>  *   - it must guarantee a STORE before the spin_lock() is ordered against 
> >> a
> >>  * LOAD after it, see the comments at its two usage sites.
> >>  *
> >>  *   - it must ensure the critical section is RCsc.
> >>  *
> >>  * The latter is important for cases where we observe values written by 
> >> other
> >>  * CPUs in spin-loops, without barriers, while being subject to scheduling.
> >>  *
> >>  * CPU0                 CPU1                    CPU2
> >>  *
> >>  *                      for (;;) {
> >>  *                        if (READ_ONCE(X))
> >>  *                          break;
> >>  *                      }
> >>  * X=1
> >>  *                      <sched-out>
> >>  *                                              <sched-in>
> >>  *                                              r = X;
> >>  *
> >>  * without transitivity it could be that CPU1 observes X!=0 breaks the 
> >> loop,
> >>  * we get migrated and CPU2 sees X==0.
> >>
> >> which is used at,
> >>
> >> __schedule(bool preempt) {
> >> ...
> >> rq_lock(rq, &rf);
> >> smp_mb__after_spinlock();
> >> ...
> >> }
> >> .
> >>
> >> If I didn't miss something, I found this kind of visibility is __not__
> >> necessarily
> >> depends on the spinning-lock at __schedule being RCsc.
> >>
> >> In fact, as for runnable task A, the migration would be,
> >>
> >>  CPU0                          CPU1                            CPU2
> >>
> >>  <ACCESS before schedule out A>
> >>
> >>  lock(rq0)
> >>  schedule out A
> >>  unlock(rq0)
> >>
> >>                                lock(rq0)
> >>                                remove A from rq0
> >>                                unlock(rq0)
> >>
> >>                                lock(rq2)
> >>                                add A into rq2
> >>                                unlock(rq2)
> >>                                                                lock(rq2)
> >>                                                                schedule in A
> >>                                                                unlock(rq2)
> >>
> >>                                                                <ACCESS after schedule in A>
> >>
> >>  <ACCESS before schedule out A> happens-before
> >>  unlock(rq0) happens-before
> >>  lock(rq0) happens-before
> >>  unlock(rq2) happens-before
> >>  lock(rq2) happens-before
> >>  <ACCESS after schedule in A>
> >>
> >
> > But without an RCsc lock, you cannot guarantee that a write propagates to
> > CPU 0 and CPU 2 at the same time, so the same write may propagate to
> > CPU0 before <ACCESS before schedule out A> but propagate to CPU 2 after
> > <ACCESS after schedule in A>. So..
> >
> > Regards,
> > Boqun
> 
> Thank you for pointing out this case, Boqun.
> But this is just one special case that acquire-release chains promise us.
> 

Ah.. right, because of A-Cumulative.

> A=B=0 as initial
> 
>   CPU0          CPU1          CPU2          CPU3
>   write A=1
>                 read A=1
>                 write B=1
>                 release X
>                               acquire X
>                               read A=?
>                               release Y
>                                             acquire Y
>                                             read B=?
> 
> assurance 1: CPU3 will surely see B=1 writing by CPU1, and
> assurance 2: CPU2 will also see A=1 writing by CPU0 as a special case
> 
> The second assurance is both in theory and implemented by real hardware.
> 
> As for theory, the C++11 memory model, which is a potential formal model
> for kernel memory model as
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0124r4.html
> describes, states that:
> 
> If a value computation A of an atomic object M happens before a value
> computation B of M, and A takes its value from a side effect X on M, then
> the value computed by B shall either be the value stored by X or the value
> stored by a side effect Y on M, where Y follows X in the modification
> order of M.
> 
> at
> $1.10 rule 18, on page 14
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf
> 
> As for real hardware, Luc provided detailed test and explanation on
> ARM and POWER in 5.1 Cumulative Barriers for WRC  on page 19
> in this paper:
> 
> A Tutorial Introduction to the ARM and POWER Relaxed Memory Models
> https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
> 
> So, I think we may remove RCsc from smp_mb__after_spinlock which is
> really confusing.
> 

So I think the purpose of smp_mb__after_spinlock() is to provide RCsc
locks; it's just that the comments before it may be misleading. We want
RCsc locks in the schedule code because we want writes in different
critical sections to be ordered even outside the critical sections, for
a case like:

    CPU 0               CPU 1               CPU 2

    {A = 0, B = 0}

    lock(rq0);
    write A=1;
    unlock(rq0);

                        lock(rq0);
                        read A=1;
                        write B=2;
                        unlock(rq0);


Re: smp_mb__after_spinlock requirement too strong?

2018-03-12 Thread 焦晓冬
>> Peter pointed out in this patch https://patchwork.kernel.org/patch/9771921/
>> that the spinning-lock used at __schedule() should be RCsc to ensure
>> visibility of writes prior to __schedule when the task is to be migrated to
>> another CPU.
>>
>> And this is emphasized at the comment of the newly introduced
>> smp_mb__after_spinlock(),
>>
>>  * This barrier must provide two things:
>>  *
>>  *   - it must guarantee a STORE before the spin_lock() is ordered against a
>>  * LOAD after it, see the comments at its two usage sites.
>>  *
>>  *   - it must ensure the critical section is RCsc.
>>  *
>>  * The latter is important for cases where we observe values written by other
>>  * CPUs in spin-loops, without barriers, while being subject to scheduling.
>>  *
>>  * CPU0                 CPU1                    CPU2
>>  *
>>  *                      for (;;) {
>>  *                        if (READ_ONCE(X))
>>  *                          break;
>>  *                      }
>>  * X=1
>>  *                      <sched-out>
>>  *                                              <sched-in>
>>  *                                              r = X;
>>  *
>>  * without transitivity it could be that CPU1 observes X!=0 breaks the loop,
>>  * we get migrated and CPU2 sees X==0.
>>
>> which is used at,
>>
>> __schedule(bool preempt) {
>> ...
>> rq_lock(rq, &rf);
>> smp_mb__after_spinlock();
>> ...
>> }
>> .
>>
>> If I didn't miss something, I found this kind of visibility is __not__
>> necessarily
>> depends on the spinning-lock at __schedule being RCsc.
>>
>> In fact, as for runnable task A, the migration would be,
>>
>>  CPU0                          CPU1                            CPU2
>>
>>  <ACCESS before schedule out A>
>>
>>  lock(rq0)
>>  schedule out A
>>  unlock(rq0)
>>
>>                                lock(rq0)
>>                                remove A from rq0
>>                                unlock(rq0)
>>
>>                                lock(rq2)
>>                                add A into rq2
>>                                unlock(rq2)
>>                                                                lock(rq2)
>>                                                                schedule in A
>>                                                                unlock(rq2)
>>
>>                                                                <ACCESS after schedule in A>
>>
>>  <ACCESS before schedule out A> happens-before
>>  unlock(rq0) happens-before
>>  lock(rq0) happens-before
>>  unlock(rq2) happens-before
>>  lock(rq2) happens-before
>>  <ACCESS after schedule in A>
>>
>
> But without an RCsc lock, you cannot guarantee that a write propagates to
> CPU 0 and CPU 2 at the same time, so the same write may propagate to
> CPU0 before <ACCESS before schedule out A> but propagate to CPU 2 after
> <ACCESS after schedule in A>. So..
>
> Regards,
> Boqun

Thank you for pointing out this case, Boqun.
But this is just one special case of what acquire-release chains promise us.

A=B=0 as initial

  CPU0          CPU1          CPU2          CPU3
  write A=1
                read A=1
                write B=1
                release X
                              acquire X
                              read A=?
                              release Y
                                            acquire Y
                                            read B=?

assurance 1: CPU3 will surely see the B=1 written by CPU1, and
assurance 2: CPU2 will also see the A=1 written by CPU0 as a special case

The second assurance holds both in theory and on real hardware.

As for theory, the C++11 memory model, which is a potential formal model
for the kernel memory model as
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0124r4.html
describes, states that:

If a value computation A of an atomic object M happens before a value
computation B of M, and A takes its value from a side effect X on M, then
the value computed by B shall either be the value stored by X or the value
stored by a side effect Y on M, where Y follows X in the modification
order of M.

at
§1.10 rule 18, on page 14 of
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf

As for real hardware, Luc provided detailed tests and explanations for
ARM and POWER in section 5.1, Cumulative Barriers for WRC, on page 19
of this paper:

A Tutorial Introduction to the ARM and POWER Relaxed Memory Models
https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf

So, I think we may remove the RCsc requirement from smp_mb__after_spinlock(),
which is really confusing.

Best Regards,
Trol

>
>> And for stopped tasks,
>>
>>  CPU0 CPU1CPU2
>>
>> 
>>
>> lock(rq0)
>> schedule out A
>> remove A from rq0
>> store-release(A->on_cpu)
>> unock(rq0)
>>
>>   load_acquire(A->on_cpu)
>>   set_task_cpu(A, 2)
>>
>>   lock(rq2)
>>   add A into rq2
>>   unlock(rq2)
>>
>> lock(rq2)
>> schedule in A
>> unlock(rq2)
>>
>> 
>>
>>  happens-before
>> store-release(A->on_cpu)  happens-before
>> load_acquire(A->on_cpu)  happens-before
>> unlock(rq2) happens-before
>> lock(rq2) happens-before
>> 
>>
>> So, I think the only requirement to smp_mb__after_spinlock is
>> to guarantee a STORE before the spin_lock() is ordered
>> against a LOAD after it. So we could remove the RCsc requirement
>> to 

Re: smp_mb__after_spinlock requirement too strong?

2018-03-11 Thread Boqun Feng
On Sun, Mar 11, 2018 at 03:55:41PM +0800, 焦晓冬 wrote:
> Peter pointed out in this patch https://patchwork.kernel.org/patch/9771921/
> that the spinning-lock used at __schedule() should be RCsc to ensure
> visibility of writes prior to __schedule when the task is to be migrated to
> another CPU.
> 
> And this is emphasized at the comment of the newly introduced
> smp_mb__after_spinlock(),
> 
>  * This barrier must provide two things:
>  *
>  *   - it must guarantee a STORE before the spin_lock() is ordered against a
>  * LOAD after it, see the comments at its two usage sites.
>  *
>  *   - it must ensure the critical section is RCsc.
>  *
>  * The latter is important for cases where we observe values written by other
>  * CPUs in spin-loops, without barriers, while being subject to scheduling.
>  *
>  * CPU0                 CPU1                    CPU2
>  *
>  *                      for (;;) {
>  *                        if (READ_ONCE(X))
>  *                          break;
>  *                      }
>  * X=1
>  *                      <sched-out>
>  *                                              <sched-in>
>  *                                              r = X;
>  *
>  * without transitivity it could be that CPU1 observes X!=0 breaks the loop,
>  * we get migrated and CPU2 sees X==0.
> 
> which is used at,
> 
> __schedule(bool preempt) {
> ...
> rq_lock(rq, &rf);
> smp_mb__after_spinlock();
> ...
> }
> .
> 
> If I didn't miss something, I found this kind of visibility is __not__
> necessarily
> depends on the spinning-lock at __schedule being RCsc.
> 
> In fact, as for runnable task A, the migration would be,
> 
>  CPU0                          CPU1                            CPU2
>
>  <ACCESS before schedule out A>
>
>  lock(rq0)
>  schedule out A
>  unlock(rq0)
>
>                                lock(rq0)
>                                remove A from rq0
>                                unlock(rq0)
>
>                                lock(rq2)
>                                add A into rq2
>                                unlock(rq2)
>                                                                lock(rq2)
>                                                                schedule in A
>                                                                unlock(rq2)
>
>                                                                <ACCESS after schedule in A>
>
>  <ACCESS before schedule out A> happens-before
>  unlock(rq0) happens-before
>  lock(rq0) happens-before
>  unlock(rq2) happens-before
>  lock(rq2) happens-before
>  <ACCESS after schedule in A>
> 

But without an RCsc lock, you cannot guarantee that a write propagates to
CPU 0 and CPU 2 at the same time, so the same write may propagate to
CPU0 before <ACCESS before schedule out A> but propagate to CPU 2 after
<ACCESS after schedule in A>. So..

Regards,
Boqun

> And for stopped tasks,
> 
>  CPU0                          CPU1                            CPU2
>
>  <ACCESS before schedule out A>
>
>  lock(rq0)
>  schedule out A
>  remove A from rq0
>  store-release(A->on_cpu)
>  unlock(rq0)
>
>                                load_acquire(A->on_cpu)
>                                set_task_cpu(A, 2)
>
>                                lock(rq2)
>                                add A into rq2
>                                unlock(rq2)
>
>                                                                lock(rq2)
>                                                                schedule in A
>                                                                unlock(rq2)
>
>                                                                <ACCESS after schedule in A>
>
>  <ACCESS before schedule out A> happens-before
>  store-release(A->on_cpu) happens-before
>  load_acquire(A->on_cpu) happens-before
>  unlock(rq2) happens-before
>  lock(rq2) happens-before
>  <ACCESS after schedule in A>
> 
> So, I think the only requirement to smp_mb__after_spinlock is
> to guarantee a STORE before the spin_lock() is ordered
> against a LOAD after it. So we could remove the RCsc requirement
> to allow more efficient implementation.
> 
> Did I miss something or this RCsc requirement does not really matter?


signature.asc
Description: PGP signature

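Boqun's concern is about the runnable-task case, where the only links in the chain are unlock(rq0) -> lock(rq0) -> unlock(rq2) -> lock(rq2). That question can be posed as a litmus sketch in the same LKMM style as the trol0.litmus test earlier in this listing; the test name, the out/enq flags that only serve to witness the lock ordering, and the exists clause are illustrative additions rather than anything posted in the thread, and smp_mb__after_spinlock() is assumed to be the primitive from linux-kernel.def:

C runnable-migration-sketch

(*
 * Sketch of the runnable-task migration question; names and exists
 * clause are illustrative, not taken from the thread.
 * P0: third party writing X (the spin-loop variable in the comment).
 * P1: CPU where A observes X and is then scheduled out under rq0.
 * P2: CPU migrating A: rq0 critical section, then rq2 critical section.
 * P3: CPU where A is scheduled back in under rq2 and rereads X.
 *)

{}

P0(int *x)
{
        WRITE_ONCE(*x, 1);
}

P1(int *x, int *out, spinlock_t *rq0)
{
        int r0;

        r0 = READ_ONCE(*x);             /* A sees X==1 before __schedule() */
        spin_lock(rq0);
        smp_mb__after_spinlock();
        WRITE_ONCE(*out, 1);            /* "schedule out A" */
        spin_unlock(rq0);
}

P2(int *out, int *enq, spinlock_t *rq0, spinlock_t *rq2)
{
        int r0;

        spin_lock(rq0);
        smp_mb__after_spinlock();
        r0 = READ_ONCE(*out);           /* "remove A from rq0" */
        spin_unlock(rq0);

        spin_lock(rq2);
        smp_mb__after_spinlock();
        WRITE_ONCE(*enq, 1);            /* "add A into rq2" */
        spin_unlock(rq2);
}

P3(int *x, int *enq, spinlock_t *rq2)
{
        int r0;
        int r1;

        spin_lock(rq2);
        smp_mb__after_spinlock();
        r0 = READ_ONCE(*enq);           /* "schedule in A" */
        r1 = READ_ONCE(*x);             /* A rereads X after migration */
        spin_unlock(rq2);
}

exists (1:r0=1 /\ 2:r0=1 /\ 3:r0=1 /\ 3:r1=0)

The exists clause encodes "A saw X==1 before being scheduled out on P1 but sees X==0 after being scheduled in on P3". With the smp_mb__after_spinlock() lines I would expect this to be forbidden, which is the transitivity the kernel comment asks for; with them removed, whether it remains forbidden is precisely what is being questioned here.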

smp_mb__after_spinlock requirement too strong?

2018-03-10 Thread 焦晓冬
Peter pointed out in this patch https://patchwork.kernel.org/patch/9771921/
that the spinning-lock used at __schedule() should be RCsc to ensure
visibility of writes prior to __schedule when the task is to be migrated to
another CPU.

And this is emphasized at the comment of the newly introduced
smp_mb__after_spinlock(),

 * This barrier must provide two things:
 *
 *   - it must guarantee a STORE before the spin_lock() is ordered against a
 * LOAD after it, see the comments at its two usage sites.
 *
 *   - it must ensure the critical section is RCsc.
 *
 * The latter is important for cases where we observe values written by other
 * CPUs in spin-loops, without barriers, while being subject to scheduling.
 *
 * CPU0                 CPU1                    CPU2
 *
 *                      for (;;) {
 *                        if (READ_ONCE(X))
 *                          break;
 *                      }
 * X=1
 *                      <sched-out>
 *                                              <sched-in>
 *                                              r = X;
 *
 * without transitivity it could be that CPU1 observes X!=0 breaks the loop,
 * we get migrated and CPU2 sees X==0.

which is used at,

__schedule(bool preempt) {
...
rq_lock(rq, &rf);
smp_mb__after_spinlock();
...
}
.

If I didn't miss something, I found that this kind of visibility does __not__
necessarily depend on the spinning-lock at __schedule() being RCsc.

In fact, as for runnable task A, the migration would be,

 CPU0                          CPU1                            CPU2

 <ACCESS before schedule out A>

 lock(rq0)
 schedule out A
 unlock(rq0)

                               lock(rq0)
                               remove A from rq0
                               unlock(rq0)

                               lock(rq2)
                               add A into rq2
                               unlock(rq2)
                                                               lock(rq2)
                                                               schedule in A
                                                               unlock(rq2)

                                                               <ACCESS after schedule in A>

 <ACCESS before schedule out A> happens-before
 unlock(rq0) happens-before
 lock(rq0) happens-before
 unlock(rq2) happens-before
 lock(rq2) happens-before
 <ACCESS after schedule in A>


And for stopped tasks,

 CPU0                          CPU1                            CPU2

 <ACCESS before schedule out A>

 lock(rq0)
 schedule out A
 remove A from rq0
 store-release(A->on_cpu)
 unlock(rq0)

                               load_acquire(A->on_cpu)
                               set_task_cpu(A, 2)

                               lock(rq2)
                               add A into rq2
                               unlock(rq2)

                                                               lock(rq2)
                                                               schedule in A
                                                               unlock(rq2)

                                                               <ACCESS after schedule in A>

 <ACCESS before schedule out A> happens-before
 store-release(A->on_cpu) happens-before
 load_acquire(A->on_cpu) happens-before
 unlock(rq2) happens-before
 lock(rq2) happens-before
 <ACCESS after schedule in A>


So I think the only requirement on smp_mb__after_spinlock() is
to guarantee that a STORE before the spin_lock() is ordered
against a LOAD after it, and we could remove the RCsc requirement
to allow a more efficient implementation.

Did I miss something, or does this RCsc requirement not really matter?


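The stopped-task chain above can also be checked mechanically. Below is a litmus sketch in the same LKMM style used elsewhere in this listing; the test name, the data/enq variables and the exists clause are illustrative additions, and on_cpu starts at 1 to mimic task A still running on CPU0:

C on-cpu-chain-sketch

(*
 * Sketch of the stopped-task chain; names and exists clause are
 * illustrative, not taken from the thread.
 * P0: CPU0 schedules A out and store-releases A->on_cpu.
 * P1: the waking CPU: load_acquire(A->on_cpu), then enqueue on rq2.
 * P2: CPU2 schedules A in under rq2.
 *)

{
        on_cpu=1;
}

P0(int *data, int *on_cpu)
{
        WRITE_ONCE(*data, 1);           /* <ACCESS before schedule out A> */
        smp_store_release(on_cpu, 0);   /* store-release(A->on_cpu) */
}

P1(int *on_cpu, int *enq, spinlock_t *rq2)
{
        int r0;

        r0 = smp_load_acquire(on_cpu);  /* load_acquire(A->on_cpu) */
        spin_lock(rq2);
        WRITE_ONCE(*enq, 1);            /* add A into rq2 */
        spin_unlock(rq2);
}

P2(int *data, int *enq, spinlock_t *rq2)
{
        int r0;
        int r1;

        spin_lock(rq2);
        r0 = READ_ONCE(*enq);           /* schedule in A */
        r1 = READ_ONCE(*data);          /* <ACCESS after schedule in A> */
        spin_unlock(rq2);
}

exists (1:r0=0 /\ 2:r0=1 /\ 2:r1=0)

Here 1:r0=0 means P1 observed the store-release of on_cpu and 2:r0=1 means P2's rq2 critical section followed P1's, so the exists clause asks whether the write made before schedule-out can still be invisible after schedule-in. I would expect the plain release/acquire plus unlock/lock chain to forbid this, which is the point being argued above; note that it only exercises this single chain, not the write-propagation concern raised in Boqun's reply for the runnable-task case.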