Re: rw locks vs memory barriers and adaptive spinning

2017-10-11 Thread Mateusz Guzik
On Wed, Oct 11, 2017 at 03:15:42PM +0200, Martin Pieuchot wrote:
> On 10/10/17(Tue) 20:19, Mateusz Guzik wrote:
> > On Tue, Oct 10, 2017 at 10:15:48AM +0200, Martin Pieuchot wrote:
> > > Hello Mateusz,
> > >
> > > On 09/10/17(Mon) 21:43, Mateusz Guzik wrote:
> > > > I was looking at the rw lock code out of curiosity and noticed that you
> > > > always do membar_enter, which on an MP-enabled amd64 kernel translates
> > > > to an mfence. This makes the entire business a little bit slower.
> > > >
> > > > Interestingly you already have relevant macros for amd64:
> > > > #define membar_enter_after_atomic() __membar("")
> > > > #define membar_exit_before_atomic() __membar("")
> > >
> > > If you look at the history, you'll see that these macros didn't exist
> > > when membar_* were added to rw locks.
> > >
> > 
> > I know. The addition of such a macro sounds like an excellent opportunity
> > to examine existing users. I figured rw locks were omitted by accident,
> > as the original posting explicitly mentions mutexes.
> > 
> > https://marc.info/?l=openbsd-tech&m=149555411827749&w=2
> > 
> > > > And there is even a default variant for archs which don't provide one.
> > > > I guess the switch should be easy.
> > >
> > > Then please do it :)  A lot of stuff is easy on OpenBSD but there aren't
> > > enough of us.
> > >
> > > Do it, test it, explain it, get oks and commit it 8)
> > >
> > 
> > As noted earlier, the patch is trivial. I have no easy means to even
> > compile-test it, though, as I'm not using OpenBSD. I only had a look out
> > of curiosity.
> 
> Why don't you start using it?  Testing is what makes the difference.
> Many of us have ideas of how to improve the kernel but we're lacking
> the time to test & debug them all.
> 

I'm playing with concurrency as a hobby, and SMPifying single-threaded
code is above the pain threshold I'm willing to accept in my spare time.

> Sometimes a "correct" change exposes a bug and we try not to break
> -current, so we cannot afford to commit an untested diff.
> 

I definitely don't advocate for just committing stuff. I do note, though,
that there should be no binary change for archs other than i386 and
amd64 (== nothing to test there). Given the nature of the change for
those two, it really seemed it had just fallen through the cracks earlier,
and I wanted to point it out in case someone finds the cycles to pick it
up and test it on the relevant platforms.

> > Nonetheless, here is the diff, which can be split in two. One part deals
> > with barriers, the other cleans up rw_status.
> >
> > [...]
> >
> > Read from the lock only once in rw_status. The lock word is marked
> > volatile, which forces a re-read on each access. There is no correctness
> > issue here, but the assembly is worse than it needs to be, and under
> > contention this adds avoidable cache traffic.
> 
> Here's the diff that cleans up rw_status.  The difference on amd64 with
> clang 4.0 is the following:
> 
>   48 8b 0f                mov    (%rdi),%rcx
>   f6 c1 04                test   $0x4,%cl
>   75 0c                   jne    738
>   31 c0                   xor    %eax,%eax
> - 48 83 3f 00             cmpq   $0x0,(%rdi)
> + 48 85 c9                test   %rcx,%rcx
> 
> ok?


There should be another change past rrw_status+0x18:
812557e8:   65 48 8b 04 25 08 00    mov    %gs:0x8,%rax
812557ef:   00 00
812557f1:   48 8b 80 90 02 00 00    mov    0x290(%rax),%rax
812557f8:   48 33 07                xor    (%rdi),%rax

> 
> Index: kern/kern_rwlock.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_rwlock.c,v
> retrieving revision 1.30
> diff -u -p -r1.30 kern_rwlock.c
> --- kern/kern_rwlock.c	12 Aug 2017 23:27:44 -0000	1.30
> +++ kern/kern_rwlock.c	11 Oct 2017 12:59:19 -0000
> @@ -312,13 +312,15 @@ _rw_exit(struct rwlock *rwl LOCK_FL_VARS
>  int
>  rw_status(struct rwlock *rwl)
>  {
> -	if (rwl->rwl_owner & RWLOCK_WRLOCK) {
> -		if (RW_PROC(curproc) == RW_PROC(rwl->rwl_owner))
> +	unsigned long owner = rwl->rwl_owner;
> +
> +	if (owner & RWLOCK_WRLOCK) {
> +		if (RW_PROC(curproc) == RW_PROC(owner))
>  			return RW_WRITE;
>  		else
>  			return RW_WRITE_OTHER;
>  	}
> -	if (rwl->rwl_owner)
> +	if (owner)
>  		return RW_READ;
>  	return (0);
>  }

-- 
Mateusz Guzik



Re: rw locks vs memory barriers and adaptive spinning

2017-10-11 Thread Martin Pieuchot
On 10/10/17(Tue) 20:19, Mateusz Guzik wrote:
> On Tue, Oct 10, 2017 at 10:15:48AM +0200, Martin Pieuchot wrote:
> > Hello Mateusz,
> >
> > On 09/10/17(Mon) 21:43, Mateusz Guzik wrote:
> > > I was looking at the rw lock code out of curiosity and noticed that you
> > > always do membar_enter, which on an MP-enabled amd64 kernel translates
> > > to an mfence. This makes the entire business a little bit slower.
> > >
> > > Interestingly you already have relevant macros for amd64:
> > > #define membar_enter_after_atomic() __membar("")
> > > #define membar_exit_before_atomic() __membar("")
> >
> > If you look at the history, you'll see that these macros didn't exist
> > when membar_* were added to rw locks.
> >
> 
> I know. The addition of such a macro sounds like an excellent opportunity
> to examine existing users. I figured rw locks were omitted by accident,
> as the original posting explicitly mentions mutexes.
> 
> https://marc.info/?l=openbsd-tech&m=149555411827749&w=2
> 
> > > And there is even a default variant for archs which don't provide one.
> > > I guess the switch should be easy.
> >
> > Then please do it :)  A lot of stuff is easy on OpenBSD but there aren't
> > enough of us.
> >
> > Do it, test it, explain it, get oks and commit it 8)
> >
> 
> As noted earlier, the patch is trivial. I have no easy means to even
> compile-test it, though, as I'm not using OpenBSD. I only had a look out
> of curiosity.

Why don't you start using it?  Testing is what makes the difference.
Many of us have ideas of how to improve the kernel but we're lacking
the time to test & debug them all.

Sometimes a "correct" change exposes a bug and we try not to break
-current, so we cannot afford to commit an untested diff.

> Nonetheless, here is the diff, which can be split in two. One part deals
> with barriers, the other cleans up rw_status.
>
> [...]
>
> Read from the lock only once in rw_status. The lock word is marked
> volatile, which forces a re-read on each access. There is no correctness
> issue here, but the assembly is worse than it needs to be, and under
> contention this adds avoidable cache traffic.

Here's the diff that cleans up rw_status.  The difference on amd64 with
clang 4.0 is the following:

48 8b 0f                mov    (%rdi),%rcx
f6 c1 04                test   $0x4,%cl
75 0c                   jne    738
31 c0                   xor    %eax,%eax
-   48 83 3f 00             cmpq   $0x0,(%rdi)
+   48 85 c9                test   %rcx,%rcx

ok?

Index: kern/kern_rwlock.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_rwlock.c,v
retrieving revision 1.30
diff -u -p -r1.30 kern_rwlock.c
--- kern/kern_rwlock.c	12 Aug 2017 23:27:44 -0000	1.30
+++ kern/kern_rwlock.c	11 Oct 2017 12:59:19 -0000
@@ -312,13 +312,15 @@ _rw_exit(struct rwlock *rwl LOCK_FL_VARS
 int
 rw_status(struct rwlock *rwl)
 {
-	if (rwl->rwl_owner & RWLOCK_WRLOCK) {
-		if (RW_PROC(curproc) == RW_PROC(rwl->rwl_owner))
+	unsigned long owner = rwl->rwl_owner;
+
+	if (owner & RWLOCK_WRLOCK) {
+		if (RW_PROC(curproc) == RW_PROC(owner))
 			return RW_WRITE;
 		else
 			return RW_WRITE_OTHER;
 	}
-	if (rwl->rwl_owner)
+	if (owner)
 		return RW_READ;
 	return (0);
 }



Re: rw locks vs memory barriers and adaptive spinning

2017-10-10 Thread Mateusz Guzik
On Tue, Oct 10, 2017 at 10:15:48AM +0200, Martin Pieuchot wrote:
> Hello Mateusz,
>
> On 09/10/17(Mon) 21:43, Mateusz Guzik wrote:
> > I was looking at the rw lock code out of curiosity and noticed that you
> > always do membar_enter, which on an MP-enabled amd64 kernel translates
> > to an mfence. This makes the entire business a little bit slower.
> >
> > Interestingly you already have relevant macros for amd64:
> > #define membar_enter_after_atomic() __membar("")
> > #define membar_exit_before_atomic() __membar("")
>
> If you look at the history, you'll see that these macros didn't exist
> when membar_* were added to rw locks.
>

I know. The addition of such a macro sounds like an excellent opportunity
to examine existing users. I figured rw locks were omitted by accident,
as the original posting explicitly mentions mutexes.

https://marc.info/?l=openbsd-tech&m=149555411827749&w=2

> > And there is even a default variant for archs which don't provide one.
> > I guess the switch should be easy.
>
> Then please do it :)  A lot of stuff is easy on OpenBSD but there aren't
> enough of us.
>
> Do it, test it, explain it, get oks and commit it 8)
>

As noted earlier, the patch is trivial. I have no easy means to even
compile-test it, though, as I'm not using OpenBSD. I only had a look out
of curiosity.

Nonetheless, here is the diff, which can be split in two. One part deals
with barriers, the other cleans up rw_status.

This should not result in a binary change on any arch apart from i386 and
amd64. On those two, if there is breakage, it's a braino of some sort;
the change is definitely fine in principle.

So:

Utilize membar_*{before,after}_atomic in rw locks.

On architectures whose atomic operations already provide the relevant
barriers there is no need for an additional membar. In particular this is
the case on amd64 and i386, where locked read-modify-write instructions
act as full barriers.

Read from the lock only once in rw_status. The lock word is marked
volatile, which forces a re-read on each access. There is no correctness
issue here, but the assembly is worse than it needs to be, and under
contention this adds avoidable cache traffic.

diff --git a/sys/kern/kern_rwlock.c b/sys/kern/kern_rwlock.c
index e2c3cae..5551835 100644
--- a/sys/kern/kern_rwlock.c
+++ b/sys/kern/kern_rwlock.c
@@ -96,7 +96,7 @@ _rw_enter_read(struct rwlock *rwl LOCK_FL_VARS)
 	    rw_cas(&rwl->rwl_owner, owner, owner + RWLOCK_READ_INCR)))
 		_rw_enter(rwl, RW_READ LOCK_FL_ARGS);
 	else {
-		membar_enter();
+		membar_enter_after_atomic();
 		WITNESS_CHECKORDER(&rwl->rwl_lock_obj, LOP_NEWORDER, file, line,
 		    NULL);
 		WITNESS_LOCK(&rwl->rwl_lock_obj, 0, file, line);
@@ -112,7 +112,7 @@ _rw_enter_write(struct rwlock *rwl LOCK_FL_VARS)
 	    RW_PROC(p) | RWLOCK_WRLOCK)))
 		_rw_enter(rwl, RW_WRITE LOCK_FL_ARGS);
 	else {
-		membar_enter();
+		membar_enter_after_atomic();
 		WITNESS_CHECKORDER(&rwl->rwl_lock_obj,
 		    LOP_EXCLUSIVE | LOP_NEWORDER, file, line, NULL);
 		WITNESS_LOCK(&rwl->rwl_lock_obj, LOP_EXCLUSIVE, file, line);
@@ -126,7 +126,7 @@ _rw_exit_read(struct rwlock *rwl LOCK_FL_VARS)
 
 	rw_assert_rdlock(rwl);
 
-	membar_exit();
+	membar_exit_before_atomic();
 	if (__predict_false((owner & RWLOCK_WAIT) ||
 	    rw_cas(&rwl->rwl_owner, owner, owner - RWLOCK_READ_INCR)))
 		_rw_exit(rwl LOCK_FL_ARGS);
@@ -141,7 +141,7 @@ _rw_exit_write(struct rwlock *rwl LOCK_FL_VARS)
 
 	rw_assert_wrlock(rwl);
 
-	membar_exit();
+	membar_exit_before_atomic();
 	if (__predict_false((owner & RWLOCK_WAIT) ||
 	    rw_cas(&rwl->rwl_owner, owner, 0)))
 		_rw_exit(rwl LOCK_FL_ARGS);
@@ -261,7 +261,7 @@ retry:
 
 	if (__predict_false(rw_cas(&rwl->rwl_owner, o, o + inc)))
 		goto retry;
-	membar_enter();
+	membar_enter_after_atomic();
 
 	/*
 	 * If old lock had RWLOCK_WAIT and RWLOCK_WRLOCK set, it means we
@@ -295,7 +295,7 @@ _rw_exit(struct rwlock *rwl LOCK_FL_VARS)
 	WITNESS_UNLOCK(&rwl->rwl_lock_obj, wrlock ? LOP_EXCLUSIVE : 0,
 	    file, line);
 
-	membar_exit();
+	membar_exit_before_atomic();
 	do {
 		owner = rwl->rwl_owner;
 		if (wrlock)
@@ -312,13 +312,15 @@ _rw_exit(struct rwlock *rwl LOCK_FL_VARS
 int
 rw_status(struct rwlock *rwl)
 {
-	if (rwl->rwl_owner & RWLOCK_WRLOCK) {
-		if (RW_PROC(curproc) == RW_PROC(rwl->rwl_owner))
+	unsigned long owner = rwl->rwl_owner;
+
+	if (owner & RWLOCK_WRLOCK) {
+		if (RW_PROC(curproc) == RW_PROC(owner))
 			return RW_WRITE;
 		else
 			return RW_WRITE_OTHER;
 	}
-	if (rwl->rwl_owner)
+	if (owner)
 		return RW_READ;
 	return (0);
 }

-- 
Mateusz Guzik 


Re: rw locks vs memory barriers and adaptive spinning

2017-10-10 Thread Martin Pieuchot
Hello Mateusz,

On 09/10/17(Mon) 21:43, Mateusz Guzik wrote:
> I was looking at the rw lock code out of curiosity and noticed that you
> always do membar_enter, which on an MP-enabled amd64 kernel translates
> to an mfence. This makes the entire business a little bit slower.
> 
> Interestingly you already have relevant macros for amd64:
> #define membar_enter_after_atomic() __membar("")
> #define membar_exit_before_atomic() __membar("")

If you look at the history, you'll see that these macros didn't exist
when membar_* were added to rw locks.

> And there is even a default variant for archs which don't provide one.
> I guess the switch should be easy.

Then please do it :)  A lot of stuff is easy on OpenBSD but there aren't
enough of us.

Do it, test it, explain it, get oks and commit it 8)



rw locks vs memory barriers and adaptive spinning

2017-10-09 Thread Mateusz Guzik
Hello,

I was looking at the rw lock code out of curiosity and noticed that you
always do membar_enter, which on an MP-enabled amd64 kernel translates
to an mfence. This makes the entire business a little bit slower.

Interestingly you already have relevant macros for amd64:
#define membar_enter_after_atomic() __membar("")
#define membar_exit_before_atomic() __membar("")

And there is even a default variant for archs which don't provide one.
I guess the switch should be easy.
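
For reference, the generic fallbacks look roughly like this (a sketch from
memory rather than a verbatim copy of sys/sys/atomic.h), so archs that
don't define their own variants keep getting the full barrier:

/*
 * Sketch: archs that do not provide the *_atomic variants fall back to
 * the plain barriers.
 */
#ifndef membar_enter_after_atomic
#define membar_enter_after_atomic()	membar_enter()
#endif
#ifndef membar_exit_before_atomic
#define membar_exit_before_atomic()	membar_exit()
#endif

On amd64 and i386 the empty __membar("") is sufficient since locked
read-modify-write instructions already act as full barriers; only a
compiler barrier around the atomic op is needed.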

Grabbing the lock for reading is an rw_cas to a higher value. On failure
you explicitly re-read from the lock. This is slower than necessary in the
presence of concurrent read lock/unlock, since cas returns the value it
found and you can use that instead.

Also, the read lock fast path does not have to descend to the slow path
after a single cas failure, but this probably does not matter right now.
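
A minimal sketch of both ideas combined (assuming atomic_cas_ulong(),
which rw_cas() is built on, returns the previous value; the retry bound
and the function name are made up for illustration, and the WITNESS
hooks are omitted):

/*
 * Illustration only: reuse the value returned by the CAS instead of
 * re-reading rwl_owner, and retry the fast path a few times before
 * falling back to the existing slow path.
 */
void
rw_enter_read_sketch(struct rwlock *rwl)
{
	unsigned long owner, prev;
	int tries;

	owner = rwl->rwl_owner;
	for (tries = 0; tries < 3; tries++) {
		if (owner & RWLOCK_WRLOCK)
			break;
		prev = atomic_cas_ulong(&rwl->rwl_owner, owner,
		    owner + RWLOCK_READ_INCR);
		if (prev == owner) {
			membar_enter_after_atomic();
			return;
		}
		/* The CAS already returned the current value; no re-read. */
		owner = prev;
	}
	rw_enter(rwl, RW_READ);		/* slow path as today */
}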

The actual question I have here is whether you have played with adaptive
spinning instead of instantly putting threads to sleep, at least for cases
where the kernel lock is not held. This can be as simple as spinning for
as long as the lock is owned by a currently running thread.
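
Something along these lines, purely as an illustration (this assumes
RW_PROC() recovers the owning proc from the owner word, that p_stat ==
SONPROC means "currently running on a CPU", and that CPU_BUSY_CYCLE() is
the pause hook; a real version would cap the spin count and avoid
spinning while holding the kernel lock):

/*
 * Sketch only: spin while the write owner is on a CPU, otherwise tell
 * the caller to go to sleep as it does today.  Returns non-zero when
 * the lock looked free and the CAS should simply be retried.
 */
static int
rw_spin_sketch(struct rwlock *rwl)
{
	unsigned long owner;
	struct proc *p;

	for (;;) {
		owner = rwl->rwl_owner;
		if (owner == 0)
			return (1);	/* free: retry the CAS */
		if ((owner & RWLOCK_WRLOCK) == 0)
			return (0);	/* read-owned, owners unknown: sleep */
		p = RW_PROC(owner);
		if (p->p_stat != SONPROC)
			return (0);	/* writer is off-CPU: sleep */
		CPU_BUSY_CYCLE();
	}
}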

For cases where curproc holds the kernel lock you could perhaps drop it
for spinning purposes and reacquire it later, although I have no idea
whether that would help anything. Definitely worth testing imo.

A side note is that the global locks I found are not annotated in any
manner with respect to exclusive cacheline placement. In particular,
netlock in a 6.2 kernel shares its cacheline with if_input_task_locked:


81aca608 D if_input_task_locked
81aca630 D netlock


The standard boilerplate to deal with this is to annotate the variable
with aligned() and place it in a dedicated section, as sketched below.
FreeBSD and NetBSD contain the necessary crappery to copy-paste,
including linker script support.
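
Something like the following, taking netlock as the example (illustration
only: the 64-byte line size, the attribute wrapper and the section name
are assumptions rather than existing OpenBSD conventions, and the linker
script still has to align and pad that section for the exclusivity to
actually hold):

/* Illustration only. */
#define CACHELINE_SIZE		64
#define __cacheline_aligned	__attribute__((__aligned__(CACHELINE_SIZE), \
				    __section__(".data.cacheline_aligned")))

struct rwlock netlock __cacheline_aligned = RWLOCK_INITIALIZER("netlock");

The aligned() attribute alone only guarantees the start address; without
the dedicated, padded section something else can still land in the second
half of the line.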

Cheers,
-- 
Mateusz Guzik