Re: [RFC] arm64: Enforce observed order for spinlock and data

2016-10-13 Thread bdegraaf

On 2016-10-13 07:02, Will Deacon wrote:

Brent,

On Wed, Oct 12, 2016 at 04:01:06PM -0400, bdegr...@codeaurora.org 
wrote:


Everything from this point down needs clarification.


All arm64 lockref accesses that occur without taking the spinlock must
behave like true atomics, ensuring successive operations are all done
sequentially.


What is a "true atomic"? What do you mean by "successive"? What do you
mean by "done sequentially"?



Despite the use case in dentry, lockref itself is a generic locking API, and
any problems I describe here are with the generic API itself, not necessarily
the dentry use case.  I'm not patching dentry--I'm fixing lockref.

By necessity, the API must do its update atomically, as keeping things correct
involves potential spinlock access by other agents which may opt to use the
spinlock API or the lockref API at their discretion.  With the current arm64
spinlock implementation, it is possible for the lockref to observe the changed
contents of the protected count without observing the spinlock being locked,
which could lead to missed changes to the lock_count itself, because any
calculations made on it could be overwritten or completed in a different
sequence.

(Spinlock locked access is obtained with a simple store under certain
scenarios. My attempt to fix this in the spinlock code was met with resistance
saying it should be addressed in lockref, since that is the API that would
encounter the issue.  These changes were initiated in response to that request.
Additional ordering problems were uncovered when I looked into lockref itself.)


The example below involves only a single agent exactly as you explain the
problem in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67.  Even for a single
execution agent, this means that the code below could access out of order.
As lockref is a generic API, it doesn't matter whether dentry does this or not.

By "done sequentially," I mean "accessed in program order."

As far as "true atomic" goes, I am referring to an atomic in the same sense you
did in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67.

The guarantee provided by lockref is that, if you hold the spinlock, then
you don't need to use atomics to inspect the reference count, as it is
guaranteed to be stable. You can't just go around replacing spin_lock
calls with lockref_get -- that's not what this is about.



I am not sure where you got the idea that I was referring to replacing
spinlocks with lockref calls.  That is not the foundation for this fix.


Currently the lockref accesses, when decompiled, look like the following
sequence:

    <accesses> (A)
// Lockref "unlocked" (B)
1:  ldxr   x0, [B] // Exclusive load
    <change lock_count>
    stxr   w1, x0, [B]
    cbnz   w1, 1b
    <accesses> (C)

Even though access to the lock_count is protected by exclusives, this is not
enough to guarantee order: The lock_count must change atomically, in order, so
the only permitted ordering would be:
  A -> B -> C


Says who? Please point me at a piece of code that relies on this. I'm
willing to believe that there are bugs in this area, but waving your hands
around and saying certain properties "must" hold is not helpful unless you
can say *why* they must hold and *where* that is required.



The lockref code must access in order, because other agents can observe it via
spinlock OR lockref APIs.  Again, this is a generic API, not an explicit part
of dentry. Other code will use it, and the manner in which it is used in dentry
is not relevant.  What lock_count is changed to is not prescribed by the
lockref API.  There is no guarantee whether it be an add, subtract, multiply,
divide, set to some explicit value, etc.  But the changes must be done in
program order and observable in that same order by other agents:  Therefore,
the spinlock and lock_count must be accessed atomically, and observed to change
atomically at the system level.
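
To make it concrete why the spinlock and the lock_count form a single unit
here, this is roughly the lockref layout in question (a paraphrase of
include/linux/lockref.h from memory; type and field spellings are approximate,
not a verbatim copy):

/*
 * Sketch: the spinlock and the reference count share one naturally aligned
 * 64-bit word, so a single cmpxchg64 can update "lock + count" together.
 */
struct lockref_sketch {
	union {
		u64 lock_count;                 /* the word cmpxchg64() targets */
		struct {
			spinlock_t lock;        /* checked via arch_spin_value_unlocked() */
			int count;              /* the reference count proper */
		};
	};
} __attribute__((aligned(8)));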


I am not off base saying lockref is an atomic access.  Here are some
references:

Under Documentation/filesystems/path-lookup.md, the dentry->d_lockref mechanism
is described as an atomic access.

At the time lockref was introduced, The Linux Foundation gave a presentation at
LinuxCon 2014 that can be found at the following link:

https://events.linuxfoundation.org/sites/events/files/slides/linuxcon-2014-locking-final.pdf

On page 46, it outlines the lockref API. The first lines of the slide give the
relevant details.

Lockref
• *Generic* mechanism to *atomically* update a reference count that is
  protected by a spinlock without actually acquiring the spinlock itself.

While dentry's use is mentioned, this API is not restricted to the use case of
dentry.


Unfortunately, this is not the case by the letter of the architecture and, in
fact, the accesses to A and C are not protected by any sort of barrier, and

Re: [RFC] arm64: Enforce observed order for spinlock and data

2016-10-12 Thread bdegraaf

On 2016-10-05 11:30, bdegr...@codeaurora.org wrote:

On 2016-10-05 11:10, Peter Zijlstra wrote:
On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegr...@codeaurora.org 
wrote:

On 2016-10-04 15:12, Mark Rutland wrote:
>Hi Brent,
>
>Could you *please* clarify if you are trying to solve:
>
>(a) a correctness issue (e.g. data corruption) seen in practice.
>(b) a correctness issue (e.g. data corruption) found by inspection.
>(c) A performance issue, seen in practice.
>(d) A performance issue, found by inspection.
>
>Any one of these is fine; we just need to know in order to be able to
>help effectively, and so far it hasn't been clear.


Brent, you forgot to state which: 'a-d' is the case here.


I found the problem.

Back in September of 2013, arm64 atomics were broken due to missing barriers
in certain situations, but the problem at that time was undiscovered.

Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in at that
time and changed the correct cmpxchg64 in lockref.c to cmpxchg64_relaxed.

d2212b4 appeared to be OK at that time because the additional barrier
requirements of this specific code sequence were not yet discovered, and
this change was consistent with the arm64 atomic code of that time.

Around February of 2014, some discovery led Will to correct the problem with
the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, which
has an excellent explanation of potential ordering problems with the same
code sequence used by lockref.c.

With this updated understanding, the earlier commit
(d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.

Because acquire/release semantics are insufficient for the full ordering,
the single barrier after the store exclusive is the best approach, similar
to Will's atomic barrier fix.


This again does not in fact describe the problem.

What is the problem with lockref, and how (refer the earlier a-d
multiple choice answer) was this found.

Now, I have been looking, and we have some idea what you _might_ be
alluding to, but please explain which accesses get reordered how and
cause problems.


Sorry for the confusion, this was a "b" item (correctness fix based on code
inspection). I had sent an answer to this yesterday, but didn't realize that
it was in a separate, private email thread.

I'll work out the before/after problem scenarios and send them along once
I've hashed them out (it may take a while for me to paint a clear picture).
In the meantime, however, consider that even without the spinlock code in
the picture, lockref needs to treat the cmpxchg as a full system-level atomic,
because multiple agents could access the value in a variety of timings.  Since
atomics similar to this are barriered on arm64 since 8e86f0b, the access to
lockref should be similar.

Brent


I am still working through some additional analyses for mixed accesses, but I
thought I'd send along some sample commit text for the fix as it currently
stands.  Please feel free to comment if you see something that needs
clarification.


Brent



Text:

All arm64 lockref accesses that occur without taking the spinlock must behave
like true atomics, ensuring successive operations are all done sequentially.
Currently the lockref accesses, when decompiled, look like the following
sequence:

    <accesses> (A)
// Lockref "unlocked" (B)
1:  ldxr   x0, [B] // Exclusive load
    <change lock_count>
    stxr   w1, x0, [B]
    cbnz   w1, 1b
    <accesses> (C)

Even though access to the lock_count is protected by exclusives, this is not
enough to guarantee order: The lock_count must change atomically, in order, so
the only permitted ordering would be:
  A -> B -> C

Unfortunately, this is not the case by the letter of the architecture and, in
fact, the accesses to A and C are not protected by any sort of barrier, and
hence are permitted to reorder freely, resulting in orderings such as

   Bl -> A -> C -> Bs

In this specific scenario, since "change lock_count" could be an increment, a
decrement or even a set to a specific value, there could be trouble.  With more
agents accessing the lockref without taking the lock, even scenarios where the
cmpxchg passes falsely can be encountered, as there is no guarantee that the
"old" value will not match exactly a newer value due to out-of-order access by
a combination of agents that increment and decrement the lock_count by the
same amount.
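
To illustrate the "false pass" just described, here is a standalone sketch of
the interleaving using C11 atomics in place of the kernel primitives (the
variable names and the three-agent timing are purely illustrative, not taken
from any kernel code path):

/*
 * ABA-style false pass: agent 0's cmpxchg succeeds against a stale snapshot
 * because agents 1 and 2 incremented and then decremented the word back to
 * the snapshotted value in between. Single-threaded simulation of the timing.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	_Atomic uint64_t lock_count = 5;        /* stands in for lockref->lock_count */

	/* Agent 0 snapshots the value, intending to do "count++". */
	uint64_t old = atomic_load(&lock_count);        /* old == 5 */

	/* Meanwhile, agent 1 increments and agent 2 decrements. */
	atomic_fetch_add(&lock_count, 1);               /* 5 -> 6 */
	atomic_fetch_sub(&lock_count, 1);               /* 6 -> 5 */

	/* Agent 0's cmpxchg still succeeds: the word matches the stale
	   snapshot even though two updates happened in between. */
	uint64_t new = old + 1;
	int ok = atomic_compare_exchange_strong(&lock_count, &old, new);

	printf("cmpxchg %s, lock_count is now %llu\n",
	       ok ? "succeeded" : "failed",
	       (unsigned long long)atomic_load(&lock_count));
	return 0;
}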

Since multiple agents are accessing this without locking the spinlock, this
access must have the same protections in place as atomics do in the arch's
atomic.h.  Fortunately, the fix is not complicated: merely removing the errant
_relaxed option on the cmpxchg64 is enough to introduce exactly the same code
sequence justified in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67 to fix
arm64 atomics.

   1:  
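
At the C level, the change amounts to dropping the _relaxed suffix in
lib/lockref.c's cmpxchg loop; the fragment below is a paraphrase from memory
of that code, not a verbatim diff:

/* current: relaxed 64-bit cmpxchg, atomic but with no ordering guarantees */
old.lock_count = cmpxchg64_relaxed(&lockref->lock_count,
				   old.lock_count, new.lock_count);

/* proposed: full cmpxchg64, which per the commit text above brings in the
   same barriered sequence justified by 8e86f0b on arm64 */
old.lock_count = cmpxchg64(&lockref->lock_count,
			   old.lock_count, new.lock_count);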

Re: [RFC] arm64: Enforce observed order for spinlock and data

2016-10-05 Thread bdegraaf

On 2016-10-05 11:10, Peter Zijlstra wrote:
On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegr...@codeaurora.org 
wrote:

On 2016-10-04 15:12, Mark Rutland wrote:
>Hi Brent,
>
>Could you *please* clarify if you are trying to solve:
>
>(a) a correctness issue (e.g. data corruption) seen in practice.
>(b) a correctness issue (e.g. data corruption) found by inspection.
>(c) A performance issue, seen in practice.
>(d) A performance issue, found by inspection.
>
>Any one of these is fine; we just need to know in order to be able to
>help effectively, and so far it hasn't been clear.


Brent, you forgot to state which: 'a-d' is the case here.


I found the problem.

Back in September of 2013, arm64 atomics were broken due to missing barriers
in certain situations, but the problem at that time was undiscovered.

Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in at that
time and changed the correct cmpxchg64 in lockref.c to cmpxchg64_relaxed.

d2212b4 appeared to be OK at that time because the additional barrier
requirements of this specific code sequence were not yet discovered, and
this change was consistent with the arm64 atomic code of that time.

Around February of 2014, some discovery led Will to correct the problem with
the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, which
has an excellent explanation of potential ordering problems with the same
code sequence used by lockref.c.

With this updated understanding, the earlier commit
(d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.

Because acquire/release semantics are insufficient for the full ordering,
the single barrier after the store exclusive is the best approach, similar
to Will's atomic barrier fix.


This again does not in fact describe the problem.

What is the problem with lockref, and how (refer the earlier a-d
multiple choice answer) was this found.

Now, I have been looking, and we have some idea what you _might_ be
alluding to, but please explain which accesses get reordered how and
cause problems.


Sorry for the confusion, this was a "b" item (correctness fix based on code
inspection). I had sent an answer to this yesterday, but didn't realize that
it was in a separate, private email thread.

I'll work out the before/after problem scenarios and send them along once
I've hashed them out (it may take a while for me to paint a clear picture).
In the meantime, however, consider that even without the spinlock code in
the picture, lockref needs to treat the cmpxchg as a full system-level atomic,
because multiple agents could access the value in a variety of timings.  Since
atomics similar to this are barriered on arm64 since 8e86f0b, the access to
lockref should be similar.
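
For reference, the shape 8e86f0b gave the LL/SC atomics is roughly the
following (a simplified sketch from memory, not the verbatim kernel source):
no acquire on the load-exclusive, a store-release, and the full ordering
supplied by a single dmb ish after the loop.

/* Sketch of a post-8e86f0b arm64 LL/SC atomic (simplified, from memory). */
static inline long atomic64_add_return_sketch(long i, long *counter)
{
	unsigned long tmp;
	long result;

	asm volatile(
	"1:	ldxr	%0, %2\n"               /* load-exclusive, no acquire */
	"	add	%0, %0, %3\n"
	"	stlxr	%w1, %0, %2\n"          /* store-release exclusive */
	"	cbnz	%w1, 1b\n"              /* retry if exclusivity was lost */
	"	dmb	ish\n"                  /* the single full barrier */
	: "=&r" (result), "=&r" (tmp), "+Q" (*counter)
	: "Ir" (i)
	: "memory");

	return result;
}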

Brent


Re: [RFC] arm64: Enforce observed order for spinlock and data

2016-10-05 Thread bdegraaf

On 2016-10-05 10:55, bdegr...@codeaurora.org wrote:

On 2016-10-04 15:12, Mark Rutland wrote:

Hi Brent,

Could you *please* clarify if you are trying to solve:

(a) a correctness issue (e.g. data corruption) seen in practice.
(b) a correctness issue (e.g. data corruption) found by inspection.
(c) A performance issue, seen in practice.
(d) A performance issue, found by inspection.

Any one of these is fine; we just need to know in order to be able to
help effectively, and so far it hasn't been clear.

On Tue, Oct 04, 2016 at 01:53:35PM -0400, bdegr...@codeaurora.org 
wrote:

After looking at this, the problem is not with the lockref code per
se: it is a problem with arch_spin_value_unlocked(). In the
out-of-order case, arch_spin_value_unlocked() can return TRUE for a
spinlock that is in fact locked but the lock is not observable yet via
an ordinary load.


Given arch_spin_value_unlocked() doesn't perform any load itself, I
assume the ordinary load that you are referring to is the READ_ONCE()
early in CMPXCHG_LOOP().

It's worth noting that even if we ignore ordering and assume a
sequentially-consistent machine, READ_ONCE() can give us a stale value.
We could perform the read, then another agent can acquire the lock, then
we can move onto the cmpxchg(), i.e.

CPU0  CPU1
old = READ_ONCE(x.lock_val)
  spin_lock(x.lock)
cmpxchg(x.lock_val, old, new)
  spin_unlock(x.lock)

If the 'old' value is stale, the cmpxchg *must* fail, and the cmpxchg
should return an up-to-date value which we will then retry with.


Other than ensuring order on the locking side (as the prior patch
did), there is a way to make arch_spin_value_unlocked's TRUE return
value deterministic,


In general, this cannot be made deterministic. As above, there is a race
that cannot be avoided.


but it requires that it does a write-back to the lock to ensure we
didn't observe the unlocked value while another agent was in process
of writing back a locked value.


The cmpxchg gives us this guarantee. If it successfully stores, then the
value it observed was the same as READ_ONCE() saw, and the update was
atomic.

There *could* have been an intervening sequence between the READ_ONCE
and cmpxchg (e.g. put(); get()) but that's not problematic for lockref.
Until you've taken your reference it was possible that things changed
underneath you.

Thanks,
Mark.


Mark,

I found the problem.

Back in September of 2013, arm64 atomics were broken due to missing barriers
in certain situations, but the problem at that time was undiscovered.

Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in at that
time and changed the correct cmpxchg64 in lockref.c to cmpxchg64_relaxed.

d2212b4 appeared to be OK at that time because the additional barrier
requirements of this specific code sequence were not yet discovered, and
this change was consistent with the arm64 atomic code of that time.

Around February of 2014, some discovery led Will to correct the problem with
the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, which
has an excellent explanation of potential ordering problems with the same
code sequence used by lockref.c.

With this updated understanding, the earlier commit
(d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.

Because acquire/release semantics are insufficient for the full ordering,
the single barrier after the store exclusive is the best approach, similar
to Will's atomic barrier fix.

Best regards,
Brent


FYI, this is a "b" type fix (correctness fix based on code inspection).



Re: [RFC] arm64: Enforce observed order for spinlock and data

2016-10-05 Thread bdegraaf

On 2016-10-04 15:12, Mark Rutland wrote:

Hi Brent,

Could you *please* clarify if you are trying to solve:

(a) a correctness issue (e.g. data corruption) seen in practice.
(b) a correctness issue (e.g. data corruption) found by inspection.
(c) A performance issue, seen in practice.
(d) A performance issue, found by inspection.

Any one of these is fine; we just need to know in order to be able to
help effectively, and so far it hasn't been clear.

On Tue, Oct 04, 2016 at 01:53:35PM -0400, bdegr...@codeaurora.org 
wrote:

After looking at this, the problem is not with the lockref code per
se: it is a problem with arch_spin_value_unlocked(). In the
out-of-order case, arch_spin_value_unlocked() can return TRUE for a
spinlock that is in fact locked but the lock is not observable yet via
an ordinary load.


Given arch_spin_value_unlocked() doesn't perform any load itself, I
assume the ordinary load that you are referring to is the READ_ONCE()
early in CMPXCHG_LOOP().

It's worth noting that even if we ignore ordering and assume a
sequentially-consistent machine, READ_ONCE() can give us a stale value.
We could perform the read, then another agent can acquire the lock, then
we can move onto the cmpxchg(), i.e.

CPU0  CPU1
old = READ_ONCE(x.lock_val)
  spin_lock(x.lock)
cmpxchg(x.lock_val, old, new)
  spin_unlock(x.lock)

If the 'old' value is stale, the cmpxchg *must* fail, and the cmpxchg
should return an up-to-date value which we will then retry with.


Other than ensuring order on the locking side (as the prior patch
did), there is a way to make arch_spin_value_unlocked's TRUE return
value deterministic,


In general, this cannot be made deterministic. As above, there is a race
that cannot be avoided.


but it requires that it does a write-back to the lock to ensure we
didn't observe the unlocked value while another agent was in process
of writing back a locked value.


The cmpxchg gives us this guarantee. If it successfully stores, then the
value it observed was the same as READ_ONCE() saw, and the update was
atomic.

There *could* have been an intervening sequence between the READ_ONCE
and cmpxchg (e.g. put(); get()) but that's not problematic for lockref.
Until you've taken your reference it was possible that things changed
underneath you.

Thanks,
Mark.


Mark,

I found the problem.

Back in September of 2013, arm64 atomics were broken due to missing barriers
in certain situations, but the problem at that time was undiscovered.

Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in at that
time and changed the correct cmpxchg64 in lockref.c to cmpxchg64_relaxed.

d2212b4 appeared to be OK at that time because the additional barrier
requirements of this specific code sequence were not yet discovered, and
this change was consistent with the arm64 atomic code of that time.

Around February of 2014, some discovery led Will to correct the problem with
the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, which
has an excellent explanation of potential ordering problems with the same
code sequence used by lockref.c.

With this updated understanding, the earlier commit
(d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.

Because acquire/release semantics are insufficient for the full ordering,
the single barrier after the store exclusive is the best approach, similar
to Will's atomic barrier fix.

Best regards,
Brent


Re: [RFC] arm64: Enforce observed order for spinlock and data

2016-10-04 Thread bdegraaf

On 2016-10-04 13:53, bdegr...@codeaurora.org wrote:

On 2016-10-04 06:12, Mark Rutland wrote:
On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegr...@codeaurora.org 
wrote:

On 2016-10-01 14:11, Mark Rutland wrote:
>Hi Brent,
>
>Evidently my questions weren't sufficiently clear; even with your
>answers it's not clear to me what precise issue you're attempting to
>solve.  I've tried to be more specific this time.
>
>At a high-level, can you clarify whether what you're attempting to solve is:
>
>(a) a functional correctness issue (e.g. data corruption)
>(b) a performance issue
>
>And whether this was seen in practice, or found through code
>inspection?



Thinking about this, as the reader/writer code has no known "abuse"
case, I'll remove it from the patchset, then provide a v2 patchset
with a detailed explanation for the lockref problem using the commits
you provided as an example, as well as performance consideration.


If there's a functional problem, let's consider that in isolation first.

Once we understand that, then we can consider doing what is optimal.

As should be obvious from the above, I'm confused because this patch
conflates functional details with performance optimisations which (to
me) sound architecturally dubious.

I completely agree with Peter that if the problem lies with lockref, it
should be solved in the lockref code.

Thanks,
Mark.


After looking at this, the problem is not with the lockref code per se: it is
a problem with arch_spin_value_unlocked().  In the out-of-order case,
arch_spin_value_unlocked() can return TRUE for a spinlock that is in fact
locked but the lock is not observable yet via an ordinary load.  Other than
ensuring order on the locking side (as the prior patch did), there is a way
to make arch_spin_value_unlocked's TRUE return value deterministic, but it
requires that it does a write-back to the lock to ensure we didn't observe
the unlocked value while another agent was in process of writing back a
locked value.

Brent


Scratch that--things get complicated as the lock itself gets "cloned," which
could happen during the out-of-order window.  I'll post back later after I've
analyzed it fully.


Re: [RFC] arm64: Enforce observed order for spinlock and data

2016-10-04 Thread bdegraaf

On 2016-10-04 06:12, Mark Rutland wrote:
On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegr...@codeaurora.org 
wrote:

On 2016-10-01 14:11, Mark Rutland wrote:
>Hi Brent,
>
>Evidently my questions weren't sufficiently clear; even with your
>answers it's not clear to me what precise issue you're attempting to
>solve.  I've tried to be more specific this time.
>
>At a high-level, can you clarify whether what you're attempting to solve is:
>
>(a) a functional correctness issue (e.g. data corruption)
>(b) a performance issue
>
>And whether this was seen in practice, or found through code
>inspection?



Thinking about this, as the reader/writer code has no known "abuse"
case, I'll remove it from the patchset, then provide a v2 patchset
with a detailed explanation for the lockref problem using the commits
you provided as an example, as well as performance consideration.


If there's a functional problem, let's consider that in isolation first.

Once we understand that, then we can consider doing what is optimal.

As should be obvious from the above, I'm confused because this patch
conflates functional details with performance optimisations which (to
me) sound architecturally dubious.

I completely agree with Peter that if the problem lies with lockref, it
should be solved in the lockref code.

Thanks,
Mark.


After looking at this, the problem is not with the lockref code per se: it is
a problem with arch_spin_value_unlocked().  In the out-of-order case,
arch_spin_value_unlocked() can return TRUE for a spinlock that is in fact
locked but the lock is not observable yet via an ordinary load.  Other than
ensuring order on the locking side (as the prior patch did), there is a way
to make arch_spin_value_unlocked's TRUE return value deterministic, but it
requires that it does a write-back to the lock to ensure we didn't observe
the unlocked value while another agent was in process of writing back a
locked value.
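
For reference, the helper I am referring to looked roughly like this on arm64
at the time (a sketch from memory of the ticket-lock version, not the verbatim
source); note that it operates on a lock value passed by value, and performs
no load or barrier of its own:

/* Sketch of arm64's ticket-lock arch_spin_value_unlocked() of that era. */
typedef struct {
	u16 owner;      /* ticket currently being served */
	u16 next;       /* next ticket to be handed out */
} arch_spinlock_sketch_t;

static inline int arch_spin_value_unlocked_sketch(arch_spinlock_sketch_t lock)
{
	/* "unlocked" when nobody holds the lock and nobody is waiting */
	return lock.owner == lock.next;
}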

Brent


Re: [RFC] arm64: Enforce observed order for spinlock and data

2016-10-03 Thread bdegraaf

On 2016-10-01 14:11, Mark Rutland wrote:

Hi Brent,

Evidently my questions weren't sufficiently clear; even with your answers it's
not clear to me what precise issue you're attempting to solve. I've tried to
be more specific this time.

At a high-level, can you clarify whether what you're attempting to solve is:

(a) a functional correctness issue (e.g. data corruption)
(b) a performance issue

And whether this was seen in practice, or found through code inspection?


On Sat, Oct 01, 2016 at 12:11:36PM -0400, bdegr...@codeaurora.org 
wrote:

On 2016-09-30 15:32, Mark Rutland wrote:
>On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
>>+* so LSE needs an explicit barrier here as well.  Without this, the
>>+* changed contents of the area protected by the spinlock could be
>>+* observed prior to the lock.
>>+*/
>
>By whom? We generally expect that if data is protected by a lock, you take
>the lock before accessing it. If you expect concurrent lockless readers,
>then there's a requirement on the writer side to explicitly provide the
>ordering it requires -- spinlocks are not expected to provide that.

More details are in my response to Robin, but there is an API arm64 supports
in spinlock.h which is used by lockref to determine whether a lock is free or
not.  For that code to work properly without adding these barriers, that API
needs to take the lock.


Can you please provide a concrete example of a case where things go wrong,
citing the code (or access pattern) in question? e.g. as in the commit
messages for:

  8e86f0b409a44193 ("arm64: atomics: fix use of acquire + release for
  full barrier semantics")
  859b7a0e89120505 ("mm/slub: fix lockups on PREEMPT && !SMP kernels")

(for the latter, I messed up the register -> var mapping in one paragraph, but
the style and reasoning is otherwise sound).

In the absence of a concrete example as above, it's very difficult to reason
about what the underlying issue is, and what a valid fix would be for said
issue.


>What pattern of accesses are made by readers and writers such that there is
>a problem?


Please note here I was asking specifically w.r.t. the lockref code, e.g. which
loads could see stale data, and what does the code do based on this value such
that there is a problem.

I added the barriers to the readers/writers because I do not know these are
not similarly abused.  There is a lot of driver code out there, and ensuring
order is the safest way to be sure we don't get burned by something similar
to the lockref access.


Making the architecture-provided primitives overly strong harms performance
and efficiency (in general), makes the code harder to maintain and optimise in
future, and only masks the issue (which could crop up on other architectures,
for instance).

Thanks,
Mark.



Thinking about this, as the reader/writer code has no known "abuse" case, I'll
remove it from the patchset, then provide a v2 patchset with a detailed
explanation for the lockref problem using the commits you provided as an
example, as well as performance consideration.

Brent


Re: [RFC] arm64: Enforce observed order for spinlock and data

2016-10-01 Thread bdegraaf

On 2016-09-30 15:32, Mark Rutland wrote:

On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:

Prior spinlock code solely used load-acquire and store-release
semantics to ensure ordering of the spinlock lock and the area it
protects. However, store-release semantics and ordinary stores do
not protect against accesses to the protected area being observed
prior to the access that locks the lock itself.

While the load-acquire and store-release ordering is sufficient
when the spinlock routines themselves are strictly used, other
kernel code that references the lock values directly (e.g. lockrefs)
could observe changes to the area protected by the spinlock prior
to observance of the lock itself being in a locked state, despite
the fact that the spinlock logic itself is correct.


If the spinlock logic is correct, why are we changing that, and not the
lockref code that you say has a problem?

What exactly goes wrong in the lockref code? Can you give a concrete example?


Why does the lockref code access lock-protected fields without taking the
lock first? Wouldn't concurrent modification be a problem regardless?


+	/*
+	 * Yes: The store done on this cpu was the one that locked the lock.
+	 * Store-release one-way barrier on LL/SC means that accesses coming
+	 * after this could be reordered into the critical section of the


I assume you meant s/store-release/load-acquire/ here. This does not make
sense to me otherwise.

+	 * load-acquire/store-release, where we did not own the lock. On LSE,
+	 * even the one-way barrier of the store-release semantics is missing,


Likewise (for the LSE case description).


+	 * so LSE needs an explicit barrier here as well.  Without this, the
+	 * changed contents of the area protected by the spinlock could be
+	 * observed prior to the lock.
+	 */


By whom? We generally expect that if data is protected by a lock, you take the
lock before accessing it. If you expect concurrent lockless readers, then
there's a requirement on the writer side to explicitly provide the ordering it
requires -- spinlocks are not expected to provide that.


More details are in my response to Robin, but there is an API arm64 supports
in spinlock.h which is used by lockref to determine whether a lock is free or
not.  For that code to work properly without adding these barriers, that API
needs to take the lock.  I tested that configuration, and it cost us heavily in
terms of lockref performance in the form of a 30 to 50 percent performance
loss.  On the other hand, I have not seen any performance degradation due to
the introduction of these barriers.



So, why aren't those observers taking the lock?


lockref doesn't take the lock specifically because it is slower.
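
For reference, the lockless fast path it uses instead looks roughly like
the following (a simplified C sketch of the CMPXCHG_LOOP macro in
lib/lockref.c, not the literal kernel source; the function name here is
mine):

#include <linux/lockref.h>

/*
 * Simplified sketch: bump the reference count without taking the
 * spinlock, as long as the lock word *appears* free.  Returns 1 on
 * success, 0 if the caller must fall back to taking the spinlock.
 */
static int lockref_get_lockless(struct lockref *lockref)
{
	struct lockref old;

	old.lock_count = READ_ONCE(lockref->lock_count);
	while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
		struct lockref new = old, prev = old;

		new.count++;	/* the update we want to make */
		/* try to commit lock word and count as one 64-bit value */
		old.lock_count = cmpxchg64_relaxed(&lockref->lock_count,
						   old.lock_count,
						   new.lock_count);
		if (old.lock_count == prev.lock_count)
			return 1;	/* committed without ever locking */
		cpu_relax();
	}
	return 0;	/* lock looked held: take the spinlock instead */
}

The contentious part is the arch_spin_value_unlocked() check combined
with the relaxed cmpxchg: nothing on that path, by itself, orders the
observation of the lock word against the count update.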



What pattern of accesses are made by readers and writers such that there
is a problem?


I added the barriers to the readers/writers because I do not know these
are not similarly abused.  There is a lot of driver code out there, and
ensuring order is the safest way to be sure we don't get burned by
something similar to the lockref access.



What does this result in?

No measurable negative performance impact.  However, the lockref
performance actually improved slightly (between 1 and 2 percent on my
24-core test system) due to the change.



+" dmb ish\n"
+" b   3f\n"
+"4:\n"
/*
 * No: spin on the owner. Send a local event to avoid missing an
 * unlock before the exclusive load.
@@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)

 "	ldaxrh	%w2, %4\n"
 "	eor	%w1, %w2, %w0, lsr #16\n"
 "	cbnz	%w1, 2b\n"
-	/* We got the lock. Critical section starts here. */
+	/*
+	 * We got the lock and have observed the prior owner's store-release.
+	 * In this case, the one-way barrier of the prior owner that we
+	 * observed combined with the one-way barrier of our load-acquire is
+	 * enough to ensure accesses to the protected area coming after this
+	 * are not accessed until we own the lock.  In this case, other
+	 * observers will not see our changes prior to observing the lock
+	 * itself.  Critical locked section starts here.
+	 */


Each of these comments ends up covering the same ground, and their
repeated presence makes the code harder to read. If there's a common
problem, note it once at the top of the file.


I added these comments to make it crystal clear that the absence of a
barrier at this point was deliberate, and that I did consider each code
path.



Thanks,
Mark.


Re: [RFC] arm64: Enforce observed order for spinlock and data

2016-10-01 Thread bdegraaf

On 2016-09-30 15:05, Peter Zijlstra wrote:

On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:

Prior spinlock code solely used load-acquire and store-release
semantics to ensure ordering of the spinlock lock and the area it
protects. However, store-release semantics and ordinary stores do
not protect against accesses to the protected area being observed
prior to the access that locks the lock itself.

While the load-acquire and store-release ordering is sufficient
when the spinlock routines themselves are strictly used, other
kernel code that references the lock values directly (e.g. lockrefs)


Isn't the problem with lockref the fact that arch_spin_value_unlocked()
isn't a load-acquire, and therefore the CPU in question doesn't need to
observe the contents of the critical section etc..?

That is, wouldn't fixing arch_spin_value_unlocked() by making that an
smp_load_acquire() fix things much better?


could observe changes to the area protected by the spinlock prior
to observance of the lock itself being in a locked state, despite
the fact that the spinlock logic itself is correct.


Thanks for your comments.

The load-acquire would not be enough for lockref, as it can still observe
the changed data out of order. To ensure order, lockref has to take the
lock, which comes at a high performance cost.  Turning off the config
option CONFIG_ARCH_USE_CMPXCHG_LOCKREF, which forces arch_spin_lock calls,
reduced my multicore performance between 30 and 50 percent using Linus'
"stat" test that was part of the grounds for introducing lockref.

On the other hand, I did not see any negative impact to performance by
the new barriers, in large part probably because they only tend to come
into play when locks are not heavily contended in the case of ticket
locks.

I have not yet found any other spinlock "abuses" in the kernel besides
lockref, but locks are referenced in a large number of places that
includes drivers, which are dynamic.  It is arguable that I could remove
the barriers to the read/write locks, as lockref doesn't use those, but
it seemed to me to be safer and more "normal" to ensure that the locked
write to the lock itself is visible prior to the changed contents of the
protected area.

Brent


Re: [RFC] arm64: Enforce observed order for spinlock and data

2016-10-01 Thread bdegraaf

On 2016-09-30 14:43, Robin Murphy wrote:

+	 * so LSE needs an explicit barrier here as well.  Without this, the
+	 * changed contents of the area protected by the spinlock could be
+	 * observed prior to the lock.


What is that last sentence supposed to mean? If the lock is free, then
surely any previous writes to the data it's protecting would have
already been observed by the release semantics of the previous unlock?
If the lock is currently held, what do we care about the state of the
data while we're still spinning on the lock itself? And if someone's
touching the data without having acquired *or* released the lock, why is
there a lock in the first place?

This seems like a very expensive way of papering over broken callers :/

Robin.



Thanks for your question.

First off, I saw no negative impact to performance as a result of
introducing these barriers running a wide variety of use cases, both for
mobile and server-class devices ranging from 4 to 24 cpus.

Yes, it does protect lockref, which observes the spinlocks in a
non-conventional way.  In fact, with this code in place, the performance
of Linus' test which runs stat like crazy actually improved in the range
of 1-2% (I suspect this is due to fewer failures due to contention on the
protected count lockref uses).

The lockref code can be made compliant, but not by a single
load-acquire--it has to take the lock.  Turning off
CONFIG_ARCH_USE_CMPXCHG_LOCKREF is the most obvious solution as it forces
lockref.c to take the lock.  That, however, comes at a very high
performance cost: 30-50% on Linus' stat test on a 24-core system.
For larger systems, this performance gap will get even worse.

With the above in mind, it seems that supporting lockref's unorthodox
method of dealing with the lock is the better alternative, as it helps,
rather than hurts, arm64 performance.

Brent


Re: [RFC] arm64: Ensure proper addressing for ldnp/stnp

2016-09-20 Thread bdegraaf

On 2016-09-20 07:00, Robin Murphy wrote:

On 19/09/16 19:25, bdegr...@codeaurora.org wrote:

On 2016-09-19 14:01, Robin Murphy wrote:

On 19/09/16 18:36, Brent DeGraaf wrote:
According to section 6.3.8 of the ARM Programmer's Guide, non-temporal
loads and stores do not verify that address dependency is met between a
load of an address to a register and a subsequent non-temporal load or
store using that address on the executing PE. Therefore, context switch
code and subroutine calls that use non-temporally accessed addresses as
parameters that might depend on a load of an address into an argument
register must ensure that ordering requirements are met by introducing
a barrier prior to the successive non-temporal access.  Add appropriate
barriers wherever this specific situation comes into play.

Signed-off-by: Brent DeGraaf 
---
 arch/arm64/kernel/entry.S  | 1 +
 arch/arm64/lib/copy_page.S | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 441420c..982c4d3 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -679,6 +679,7 @@ ENTRY(cpu_switch_to)
 ldp	x27, x28, [x8], #16
 ldp	x29, x9, [x8], #16
 ldr	lr, [x8]
+	dmb	nshld	// Existence of instructions with loose load-use dependencies (e.g. ldnp/stnp) make this barrier necessary
 mov	sp, x9
 and	x9, x9, #~(THREAD_SIZE - 1)
 msr	sp_el0, x9
diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 4c1e700..21c6892 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -47,6 +47,8 @@ alternative_endif
 ldp	x14, x15, [x1, #96]
 ldp	x16, x17, [x1, #112]

+	dmb	nshld	// In case x0 (for stnp) is dependent on a load


The ARMv8 ARM (B2.7.2 in issue j) says that when an address dependency
exists between a load and a subsequent LDNP, *other* observers may
observe those accesses in any order. How's that related to an STNP on
the same CPU?

Robin.


+
 mov	x18, #(PAGE_SIZE - 128)
 add	x1, x1, #128
 1:



Yes, I have seen the section in the ARM ARM about this. But the
Programmer's Guide goes further, even providing a concrete example:

"Non-temporal loads and stores relax the memory ordering
requirements...the LDNP instruction might be observed before the
preceding LDR instruction, which can result in reading from an
unpredictable address in X0.

For example:
LDR X0, [X3]
LDNP X2, X1, [X0]
To correct the above, you need an explicit load barrier:
LDR X0, [X3]
DMB NSHLD
LDNP X2, X1, [X0]"

Did the ARM ARM leave this out?  Or is the Programmer's Guide section
incorrect?


If the ARM ARM and the Programmer's Guide don't agree, then the
Programmer's Guide is wrong (I'll raise a bug against it).

All the ARM ARM says is that in this situation:

 P1                    P2
 STP x0, x1, [x2]      1: LDR x0, 
 DMB ISH               CBZ x0, 1b
 STR x2,               LDNP x1, x2, [x0]

P2's address dependency still very much exists from the point of view of
P2's execution, it just may not guarantee order against the DMB on P1,
so P2's LDNP isn't guaranteed to see the data from P1's STP (as opposed
to how a regular LDP *is*), and may still load older stale data instead.


Robin.



Thanks for your comments,
Brent



Thank you Robin.  This was concerning to me because the ARM ARM
description does not explicitly disagree with the Programmer's Guide, it
just doesn't touch on the PEe ordering.  Meanwhile, as you can see from
the quote above, the Programmer's Guide doesn't talk about PEy, and even
includes sample code that would only affect PEe ordering (the "nsh"
option), leaving PEy ordering impacts completely out of the picture,
meaning some degree of thought was applied toward it.

I went back through older versions of both the ARM ARM last week and the
Guide and the ARM ARM has never mentioned an issue with PEe ordering of
non-temporal accesses, so I am unsure where the Guide could have gotten
its information.

Also, please be aware that this description is in the Programmer's Guide
*twice*, both in section 6.3.8 which I mentioned in the commit text and
is duplicated in section 13.2.4.

Please keep me posted as to when the Guide will be corrected (or if it is
discovered to be correct).

Thanks again,
Brent


Re: [RFC] arm64: Ensure proper addressing for ldnp/stnp

2016-09-19 Thread bdegraaf

On 2016-09-19 14:28, Laura Abbott wrote:

On 09/19/2016 10:36 AM, Brent DeGraaf wrote:

According to section 6.3.8 of the ARM Programmer's Guide, non-temporal
loads and stores do not verify that address dependency is met between a
load of an address to a register and a subsequent non-temporal load or
store using that address on the executing PE. Therefore, context switch
code and subroutine calls that use non-temporally accessed addresses as
parameters that might depend on a load of an address into an argument
register must ensure that ordering requirements are met by introducing
a barrier prior to the successive non-temporal access.  Add appropriate
barriers wherever this specific situation comes into play.



Was this found by code inspection or is there a (public) existing test
case to observe this behavior?

Thanks,
Laura



Code inspection only.

Brent


Re: [RFC] arm64: Ensure proper addressing for ldnp/stnp

2016-09-19 Thread bdegraaf

On 2016-09-19 14:01, Robin Murphy wrote:

On 19/09/16 18:36, Brent DeGraaf wrote:

According to section 6.3.8 of the ARM Programmer's Guide, non-temporal
loads and stores do not verify that address dependency is met between a
load of an address to a register and a subsequent non-temporal load or
store using that address on the executing PE. Therefore, context switch
code and subroutine calls that use non-temporally accessed addresses as
parameters that might depend on a load of an address into an argument
register must ensure that ordering requirements are met by introducing
a barrier prior to the successive non-temporal access.  Add appropriate
barriers wherever this specific situation comes into play.

Signed-off-by: Brent DeGraaf 
---
 arch/arm64/kernel/entry.S  | 1 +
 arch/arm64/lib/copy_page.S | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 441420c..982c4d3 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -679,6 +679,7 @@ ENTRY(cpu_switch_to)
ldp x27, x28, [x8], #16
ldp x29, x9, [x8], #16
ldr lr, [x8]
+	dmb	nshld	// Existence of instructions with loose load-use dependencies (e.g. ldnp/stnp) make this barrier necessary

mov sp, x9
and x9, x9, #~(THREAD_SIZE - 1)
msr sp_el0, x9
diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 4c1e700..21c6892 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -47,6 +47,8 @@ alternative_endif
ldp x14, x15, [x1, #96]
ldp x16, x17, [x1, #112]

+   dmb nshld // In case x0 (for stnp) is dependent on a load


The ARMv8 ARM (B2.7.2 in issue j) says that when an address dependency
exists between a load and a subsequent LDNP, *other* observers may
observe those accesses in any order. How's that related to an STNP on
the same CPU?

Robin.


+
mov x18, #(PAGE_SIZE - 128)
add x1, x1, #128
 1:



Yes, I have seen the section in the ARM ARM about this. But the
Programmer's Guide goes further, even providing a concrete example:

"Non-temporal loads and stores relax the memory ordering
requirements...the LDNP instruction might be observed before the
preceding LDR instruction, which can result in reading from an
unpredictable address in X0.

For example:
LDR X0, [X3]
LDNP X2, X1, [X0]
To correct the above, you need an explicit load barrier:
LDR X0, [X3]
DMB NSHLD
LDNP X2, X1, [X0]"

Did the ARM ARM leave this out?  Or is the Programmer's Guide section 
incorrect?


Thanks for your comments,
Brent


Re: [RFC] arm64: Enforce gettimeofday vdso structure read ordering

2016-08-22 Thread bdegraaf

On 2016-08-22 14:56, Mark Rutland wrote:

Hi Brent,

Thanks for the thorough reply. Comments inline below.

On Mon, Aug 22, 2016 at 01:32:47PM -0400, bdegr...@codeaurora.org 
wrote:

On 2016-08-22 07:37, Mark Rutland wrote:
>* What problem does this patch address?

Initially, I set out to fix a control-flow problem, as I originally
wrote this against the code prior to the refactoring of commit
b33f491f5a9aaf171b7de0f905362eb0314af478, back when there was still a
do_get_tspec subroutine.  That one used a dmb to order the
data accesses prior to the isb/mrs sequence that read the virtual
counter.  Our Senior Director, who has done extensive work with the ARM
cpu and is intimately familiar with the instruction set, indicated that
the dmb which was used in that code was not enough to ensure ordering
between the loads from the structure and the read of the virtual
counter.
Since that code had no control-flow (e.g., some conditional governing
the code's progress) prior to the isb, he suggested that some form of
dsb would be required to ensure proper order of access between the
loads from the vdso structure and the mrs read of the the virtual
counter.


Ok. So if I've parsed the above correctly, the fear was that an ISB was
insufficient to guarantee the ordering of prior loads w.r.t. the
subsequent MRS, and a control dependency between the two was necessary,
in addition to the ISB.


Exactly.

I went to the latest armv8 ARM at that time and found a concrete example
of how the code should be structured to ensure an ordered read of the
virtual counter.  In the most recent copy to which I have access
(ARM DDI 0487A.j), that code is given on page D6-1871, under section
D6.2.2.  I moved the sequence count check immediately prior to the
isb to satisfy the particular ordering requirements in that code.


My reading of that example is that the control dependency alone was
insufficient (given speculation), and the ISB provided the necessary
ordering between the signal variable being updated and the MRS. To me,
the example does not imply that both are required in all cases, only
that a control dependency alone is insufficient.

Per the description of ISB on page B2-87 of ARM DDI 0487A.j, my
understanding (which may be flawed), is that the instructions prior to
the ISB must be completed before the subsequent instructions are
fetched+issued, and hence the MRS should be (locally) ordered w.r.t. 
the

loads.


Saying the instructions are completed reportedly isn't exactly the same
as saying the loads have been accessed (I brought up this same point to
him when we discussed it). If a control dependency is not present prior
to the ISB, he said we would have to add a DSB to ensure ordering.
Not wanting to potentially slow things down across multiple cores with
a DSB, the control dependency is the lighter-weight solution.  I would
not have known that the control dependency was available in this
situation had the ARM not mentioned it.
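
In C terms, the shape we ended up with is roughly the following (a sketch
with AArch64 inline asm of the pattern from D6.2.2, not the literal
gettimeofday.S assembly; the names here are mine):

#include <stdbool.h>
#include <stdint.h>

#define READ_SEQ(p)	(*(const volatile uint32_t *)(p))

/*
 * Only read CNTVCT once the sequence count still matches the value seen
 * before the data loads.  The conditional branch provides the control
 * dependency; the ISB keeps the MRS from being speculated past it, so
 * the counter read is ordered after the data/seqcount loads.
 */
static bool counter_read_ordered(const uint32_t *seq, uint32_t snapshot,
				 uint64_t *cval)
{
	if (READ_SEQ(seq) != snapshot)
		return false;			/* caller retries the whole read */

	asm volatile("isb" ::: "memory");
	asm volatile("mrs %0, cntvct_el0" : "=r" (*cval));
	return true;
}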


When the refactored code went in recently, however, the seqcnt_read
prior to the isb changed to a seqcnt_check, which addressed that
ordering requirement.


Have you seen an issue in practice prior to this? If so, we may need to
go digging further into this, and consider stable kernels.


I had not seen a problem. We were looking at the speed of the code to
see if anything could be helped when the Sr. Director noticed the
correctness problem. After I applied the fix, results given by
gettimeofday got tighter: averages that varied by up to 2 nsec before
now vary by only one or two hundredths of a nanosecond.


>* Is this a bug fix? If so, what problem can be seen currently?

The most obvious problem with the existing code is where the timezone
data gets loaded: after sequence counts have been checked and
rechecked, completely outside the area of the code protected by the
sequence counter.  While I realize that timezone code does not change
frequently, this is still a problem as the routine explicitly reads
data that could be in the process of being updated.


I take that you specifically mean the following line in
__kernel_gettimeofday, which occurs after the usual seqcnt_acquire +
seqcnt_check sequence:

ldp w4, w5, [vdso_data, #VDSO_TZ_MINWEST]

I see that the writer side for the timezone data is also not protected,
since commit bdba0051ebcb3c63 ("arm64: vdso: remove broken, redundant
sequence counting for timezones"). Following comments in commits I found
x86 commit 6c260d586343f7f7 ("x86: vdso: Remove bogus locking in
update_vsyscall_tz()").

Per the x86 commit, this is not atomic in the usual syscall path, and
thus it is a cross-architecture property that timezone updates are not
atomic, and that reordering of accesses may occur around a change of
timezone. If we want to tighten up the VDSO, we'll also need to tighten
up the syscall.

[...]



Ugh. I had not looked at the writer. That would explain why the comments
refer to it as "whacky tz stuff."


The second problem is 

Re: [RFC] arm64: Enforce gettimeofday vdso structure read ordering

2016-08-22 Thread bdegraaf

On 2016-08-22 07:37, Mark Rutland wrote:

Hi,

On Fri, Aug 19, 2016 at 04:02:08PM -0400, Brent DeGraaf wrote:

Introduce explicit control-flow logic immediately prior to virtual
counter register read in all cases so that the mrs read will
always be accessed after all vdso data elements are read and
sequence count is verified. Ensure data elements under the
protection provided by the sequence counter are read only within
the protection logic loop.  Read the virtual counter as soon as
possible after the data elements are confirmed correctly read,
rather than after several other operations which can affect time.
Reduce full barriers required in register read code in favor of
the lighter-weight one-way barrier supplied by a load-acquire
wherever possible.


As I commented previously, can you please explain *why* these changes
should be made?

* What problem does this patch address?


Initially, I set out to fix a control-flow problem, as I originally
wrote this against the code prior to the refactoring of commit
b33f491f5a9aaf171b7de0f905362eb0314af478, back when there was still a
do_get_tspec subroutine.  That one used a dmb to order the
data accesses prior to the isb/mrs sequence that read the virtual
counter.  Our Senior Director, who has done extensive work with the ARM
cpu and is intimately familiar with the instruction set, indicated that
the dmb which was used in that code was not enough to ensure ordering
between the loads from the structure and the read of the virtual counter.
Since that code had no control-flow (e.g., some conditional governing the
code's progress) prior to the isb, he suggested that some form of dsb
would be required to ensure proper order of access between the loads
from the vdso structure and the mrs read of the the virtual counter.

I went to the latest armv8 ARM at that time and found a concrete example
of how the code should be structured to ensure an ordered read of the
virtual counter.  In the most recent copy to which I have access
(ARM DDI 0487A.j), that code is given on page D6-1871, under section
D6.2.2.  I moved the sequence count check immediately prior to the
isb to satisfy the particular ordering requirements in that code.

When the refactored code went in recently, however, the seqcnt_read
prior to the isb changed to a seqcnt_check, which addressed that ordering
requirement.

In the process of merging the prior code, however, I had noticed
several other things that were wrong or not ideal and I describe the
details of those problems below.



* Is this a bug fix? If so, what problem can be seen currently?


The most obvious problem with the existing code is where the timezone data
gets loaded: after sequence counts have been checked and rechecked,
completely outside the area of the code protected by the sequence counter.
While I realize that timezone code does not change frequently, this is
still a problem as the routine explicitly reads data that could be in the
process of being updated.
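
For illustration, keeping the timezone fields inside the protected region
would give the reader the usual seqcount shape, something like the C
sketch below (field names as in vdso_datapage.h; the helper and struct
are mine, and it assumes, as the existing writer does, that an odd count
means an update is in flight):

struct tz_snapshot {
	u32 minuteswest;
	u32 dsttime;
};

/* C equivalent of what the protected assembly reader would do;
 * not the actual gettimeofday.S code. */
static struct tz_snapshot read_tz(const struct vdso_data *vdso)
{
	struct tz_snapshot tz;
	u32 seq;

	for (;;) {
		/* acquire: the data loads below cannot be hoisted above this */
		seq = smp_load_acquire(&vdso->tb_seq_count);
		if (seq & 1)
			continue;		/* writer in progress; retry */

		tz.minuteswest = vdso->tz_minuteswest;
		tz.dsttime     = vdso->tz_dsttime;

		smp_rmb();			/* data loads before the re-check */
		if (READ_ONCE(vdso->tb_seq_count) == seq)
			break;			/* consistent snapshot */
	}
	return tz;
}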

To make the demarcation clear with the refactored version, and keep code
that reads the vdso_data cleanly contained within the protection of the
sequence count, I elected to combine the two macros that check and recheck
the sequence counter.  This allowed the recheck to avoid looping on the
barrier requirements of two reads of the sequence count, as branching
from one macro to a label created inside another macro (e.g. label
in this case) is ugly at best.  This, I figured, would also help prevent
future mistakes that involved accessing vdso_data in an unprotected
fashion: Only the block of code passed to the routine would be allowed to
directly access the vdso data.

The second problem is that the timing between the reading of the vdso data
and the virtual counter is treated as secondary in the code, as a few
register manipulations using the structure data are performed prior
to reading the virtual counter.  While the cpu itself is free to reorder
these shifts and loads somewhat, depending on the implementation, the
read of the virtual counter should be performed as close as possible to
the time the vdso data itself is read to minimize variability.  As these
manipulations and loads are not dependent on the result of the mrs read,
putting them after the virtual counter isb/mrs sequence allows these
independent register manipulations to issue while the mrs is still in
process.

Finally, the nomenclature ("acquire_seqcnt") used by the existing code
pointed toward use of the load-acquire semantic, and using those allowed
me to reduce the number of explicit barriers needed in the reader code.



* Is this an optimisation? If so, how much of an improvement can be
  seen?


Optimization was not the main goal of this patch, yet performance did
improve on my target, with the average time improving marginally
(350 nsec after vs 360 nsec before the change), compared to the
refactored code.  In fact, performance is now slightly better (around
a miniscule 2 nsec) than 

Re: [RFC] arm64: Enforce gettimeofday vdso structure read ordering

2016-08-11 Thread bdegraaf
Sorry, this is my first go at proper email etiquette. I think I've got it
now.


--
"Can you explain the problem that you're fixing here, please?"
--
Since your question included only the vdso_data structure itself and the
vdso.c file changes, I'm assuming that's what you meant by "here."
I moved tb_seq_count to the start of the vdso_data structure because of
the nature of the ldar/stlr instructions, which do not support offsets.
This avoided the extra math to locate the structure offset, while at the
same time using the same base pointer for the structure.  ia64 does
something similar and also puts this as the first field.  I moved the
use_syscall field up next to it just to keep most elements aligned
according to their width. That "use_syscall" is the first element used in
the logic was an additional reason I decided to move that one in
particular.

--- a/arch/arm64/include/asm/vdso_datapage.h
+++ b/arch/arm64/include/asm/vdso_datapage.h
@@ -21,6 +21,8 @@
 #ifndef __ASSEMBLY__

 struct vdso_data {
+__u32 tb_seq_count;/* Timebase sequence counter */
+__u32 use_syscall;
 __u64 cs_cycle_last;/* Timebase at clocksource init */
 __u64 raw_time_sec;/* Raw time */
 __u64 raw_time_nsec;
@@ -30,14 +32,12 @@ struct vdso_data {
 __u64 xtime_coarse_nsec;
 __u64 wtm_clock_sec;/* Wall to monotonic time */
 __u64 wtm_clock_nsec;
-__u32 tb_seq_count;/* Timebase sequence counter */
 /* cs_* members must be adjacent and in this order (ldp accesses) */

 __u32 cs_mono_mult;/* NTP-adjusted clocksource multiplier */
 __u32 cs_shift;/* Clocksource shift (mono = raw) */
 __u32 cs_raw_mult;/* Raw clocksource multiplier */
 __u32 tz_minuteswest;/* Whacky timezone stuff */
 __u32 tz_dsttime;
-__u32 use_syscall;
 };


As to vdso.c, I changed the instructions used to access tb_seq_count (both
in vdso.c and the gettimeofday.S file) to the load-acquire/store-release
instructions because those instructions seem to be designed for this
particular scenario: their "one-way" barriers enforce the order we need,
on top of eliminating the explicit barriers.


--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -201,10 +201,12 @@ up_fail:
  */
 void update_vsyscall(struct timekeeper *tk)
 {
+	register u32 tmp;
 	u32 use_syscall = strcmp(tk->tkr_mono.clock->name, "arch_sys_counter");

-	++vdso_data->tb_seq_count;
-	smp_wmb();
+	tmp = smp_load_acquire(&vdso_data->tb_seq_count);
+	++tmp;
+	smp_store_release(&vdso_data->tb_seq_count, tmp);

"This looks busted -- the writes to vdso_data can be reordered before 
the

update of tb_seq_count.

/confused

Will"

--
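
For comparison, the pre-patch write side has (roughly) the classic
seqcount shape, with the vdso_data stores bracketed by the two counter
increments (a sketch only, most field updates elided):

/* Roughly the existing update_vsyscall() write side; not the complete
 * function. */
void update_vsyscall_sketch(struct vdso_data *vdso, struct timekeeper *tk)
{
	++vdso->tb_seq_count;			/* now odd: readers retry */
	smp_wmb();				/* counter bump before data stores */

	vdso->cs_cycle_last = tk->tkr_mono.cycle_last;
	/* ... remaining vdso_data fields updated here ... */

	smp_wmb();				/* data stores before final bump */
	++vdso->tb_seq_count;			/* even again: data consistent */
}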

Hopefully, this more detailed explanation helps make things clearer.

Thank you,
Brent


Re: [RFC] arm64: Enforce gettimeofday vdso structure read ordering

2016-08-11 Thread bdegraaf

On 2016-08-11 11:54, Will Deacon wrote:

On Thu, Aug 11, 2016 at 11:37:44AM -0400, Christopher Covington wrote:

From: Brent DeGraaf 

Prior gettimeofday register read code is not architecturally
correct, as there is no control-flow gating logic enforced
immediately prior to the required isb.  Introduce explicit
control-flow logic prior to the register read in all cases so that
the mrs read will always be done after all vdso data elements are
read, and read all data elements within the protection logic
provided by the sequence counter.


-ENOPARSE

Can you explain the problem that you're fixing here, please?


Signed-off-by: Brent DeGraaf 
---
 arch/arm64/include/asm/vdso_datapage.h |   4 +-
 arch/arm64/kernel/vdso.c   |  11 ++--
 arch/arm64/kernel/vdso/gettimeofday.S  | 106 +++--

 3 files changed, 56 insertions(+), 65 deletions(-)

diff --git a/arch/arm64/include/asm/vdso_datapage.h b/arch/arm64/include/asm/vdso_datapage.h

index 2b9a637..49a0a51 100644
--- a/arch/arm64/include/asm/vdso_datapage.h
+++ b/arch/arm64/include/asm/vdso_datapage.h
@@ -21,6 +21,8 @@
 #ifndef __ASSEMBLY__

 struct vdso_data {
+   __u32 tb_seq_count; /* Timebase sequence counter */
+   __u32 use_syscall;
__u64 cs_cycle_last;/* Timebase at clocksource init */
__u64 raw_time_sec; /* Raw time */
__u64 raw_time_nsec;
@@ -30,14 +32,12 @@ struct vdso_data {
__u64 xtime_coarse_nsec;
__u64 wtm_clock_sec;/* Wall to monotonic time */
__u64 wtm_clock_nsec;
-   __u32 tb_seq_count; /* Timebase sequence counter */
/* cs_* members must be adjacent and in this order (ldp accesses) */
__u32 cs_mono_mult; /* NTP-adjusted clocksource multiplier */
__u32 cs_shift; /* Clocksource shift (mono = raw) */
__u32 cs_raw_mult;  /* Raw clocksource multiplier */
__u32 tz_minuteswest;   /* Whacky timezone stuff */
__u32 tz_dsttime;
-   __u32 use_syscall;
 };

 #endif /* !__ASSEMBLY__ */
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 076312b..7751667 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -201,10 +201,12 @@ up_fail:
  */
 void update_vsyscall(struct timekeeper *tk)
 {
+	register u32 tmp;
 	u32 use_syscall = strcmp(tk->tkr_mono.clock->name, "arch_sys_counter");
 
-	++vdso_data->tb_seq_count;
-	smp_wmb();
+	tmp = smp_load_acquire(&vdso_data->tb_seq_count);
+	++tmp;
+	smp_store_release(&vdso_data->tb_seq_count, tmp);


This looks busted -- the writes to vdso_data can be reordered before the
update of tb_seq_count.

/confused

Will


The ldar/stlr instructions contain implicit barriers that will prevent the
reorder, similar to the way they allowed removal of the explicit barriers
in the spinlock code.


Thank you,
Brent

