Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-26 Thread Mikulas Patocka


On Fri, 26 Oct 2012, Oleg Nesterov wrote:

> > The code is different, but it can be changed to use percpu rw semaphores
> > (if we add percpu_down_write_trylock).
> 
> I don't really understand how you can make percpu_down_write_trylock()
> atomic so that it can be called under br_write_lock(vfsmount_lock) in
> sb_prepare_remount_readonly(). So I guess you also need to replace
> vfsmount_lock at least. Or _trylock needs the barriers in _down_read.
> Or I missed something.
> 
> Oleg.

That's true - that code runs under a spinlock, and you can't implement a
non-blocking percpu_down_write_trylock: without adding barriers to the
read path, the writer needs a grace period (e.g. synchronize_sched()) to
observe all the per-cpu reader counters, and that sleeps.

Mikulas


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-26 Thread Oleg Nesterov
On 10/26, Mikulas Patocka wrote:
>
> On Fri, 26 Oct 2012, Oleg Nesterov wrote:
>
> > I didn't know about it. The code is not reusable, and it doesn't really do
> > locking.

That was my main point.

As for changing fs/namespace.c to use percpu_rwsem, I am not sure it is
that simple or even worthwhile, but I won't argue; I do not pretend to
understand this code.

> > I don't understand why you both think that __mnt_want_write()
> > and mnt_make_readonly() provide the same functionality. I looked
> > at this code before I started this patch, and unless I completely
> > misread it, they do very different things. It is not a "lock" at all.
> >
> > Oleg.
>
> mnt_want_write uses a percpu array of counters, just like percpu semaphores.

and that is about all they have in common, imo ;)

> The code is different, but it can be changed to use percpu rw semaphores
> (if we add percpu_down_write_trylock).

I don't really understand how you can make percpu_down_write_trylock()
atomic so that it can be called under br_write_lock(vfsmount_lock) in
sb_prepare_remount_readonly(). So I guess you also need to replace
vfsmount_lock at least. Or _trylock needs the barriers in _down_read.
Or I missed something.

Oleg.



Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-26 Thread Mikulas Patocka


On Fri, 26 Oct 2012, Oleg Nesterov wrote:

> On 10/26, Dave Chinner wrote:
> >
> > On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote:
> > >
> > > Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw
> > > semaphores. I think you can convert mnt_want_write()/mnt_make_readonly()
> > > to use percpu rw semaphores and remove the duplicated code.
> >
> > I think you misunderstood my point - that rather than re-inventing
> > the wheel, why didn't you just copy something that is known to
> > work?

I didn't know about it. The code is not reusable, and it doesn't really do
locking. It also has two barriers on the read path, while percpu rw
semaphores have none.

> I don't understand why you both think that __mnt_want_write()
> and mnt_make_readonly() provide the same functionality. I looked
> at this code before I started this patch, and unless I completely
> misread it, they do very different things. It is not a "lock" at all.
> 
> Oleg.

mnt_want_write uses a percpu array of counters, just like percpu semaphores.

The code is different, but it can be changed to use percpu rw semaphores 
(if we add percpu_down_write_trylock).

__mnt_want_write could call percpu_down_read and check if the mount is
read-only (if it is, drop the lock and return -EROFS);
__mnt_drop_write could call percpu_up_read;
mnt_make_readonly and sb_prepare_remount_readonly could call
percpu_down_write_trylock instead of mnt_get_writers (if they get the
write lock, set the read-only flag and drop the write lock).

... and that's it: you can then remove MNT_WRITE_HOLD, the barriers, the
spinning and the other complexity from fs/namespace.c. A sketch of what
this could look like is below.
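
For illustration, a minimal sketch of that conversion (hypothetical code:
it assumes a percpu rwsem with percpu_down_read()/percpu_up_read() plus
the proposed percpu_down_write_trylock(), and a made-up mnt_write_sem
field; this is not an actual fs/namespace.c patch):

static int __mnt_want_write(struct vfsmount *m)
{
        struct mount *mnt = real_mount(m);

        percpu_down_read(&mnt->mnt_write_sem);  /* hypothetical field */
        if (__mnt_is_readonly(m)) {
                percpu_up_read(&mnt->mnt_write_sem);
                return -EROFS;
        }
        return 0;
}

static void __mnt_drop_write(struct vfsmount *m)
{
        percpu_up_read(&real_mount(m)->mnt_write_sem);
}

static int mnt_make_readonly(struct mount *mnt)
{
        /* proposed primitive; fails while any writer holds the read side */
        if (!percpu_down_write_trylock(&mnt->mnt_write_sem))
                return -EBUSY;
        mnt->mnt.mnt_flags |= MNT_READONLY;
        percpu_up_write(&mnt->mnt_write_sem);
        return 0;
}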

Mikulas


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-26 Thread Oleg Nesterov
On 10/26, Dave Chinner wrote:
>
> On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote:
> >
> > Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw
> > semaphores. I think you can convert mnt_want_write()/mnt_make_readonly()
> > to use percpu rw semaphores and remove the duplicated code.
>
> I think you misunderstood my point - that rather than re-inventing
> the wheel, why didn't you just copy something that is known to
> work?

I don't understand why you both think that __mnt_want_write()
and mnt_make_readonly() provide the same functionality. I looked
at this code before I started this patch, and unless I completely
misread it, they do very different things. It is not a "lock" at all.

Oleg.



Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-25 Thread Dave Chinner
On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote:
> 
> 
> On Wed, 24 Oct 2012, Dave Chinner wrote:
> 
> > On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Fri, 19 Oct 2012, Peter Zijlstra wrote:
> > > 
> > > > > Yes, I tried this approach - it involves doing a LOCK instruction on
> > > > > read lock, remembering the cpu and doing another LOCK instruction on
> > > > > read unlock (which will hopefully be on the same CPU, so no cacheline
> > > > > bouncing happens in the common case). It was slower than the approach
> > > > > without any LOCK instructions (43.3 seconds for the implementation
> > > > > with per-cpu LOCKed access, 42.7 seconds for this implementation
> > > > > without atomic instruction; the benchmark involved doing 512-byte
> > > > > direct-io reads and writes on a ramdisk with 8 processes on an
> > > > > 8-core machine).
> > > > 
> > > > So why is that a problem? Surely that's already tons better than what
> > > > you've currently got.
> > > 
> > > Percpu rw-semaphores do not improve performance at all. I put them there 
> > > to avoid performance regression, not to improve performance.
> > > 
> > > All Linux kernels have a race condition - when you change the block
> > > size of a block device and read or write the device at the same time, a
> > > crash may happen. This bug has been there forever. Recently, it started
> > > to cause major trouble - multiple high-profile business sites report
> > > crashes because of this race condition.
> > >
> > > You can fix this race by using a read lock around the I/O paths and a
> > > write lock around block size changing, but a normal rw semaphore causes
> > > cache line bouncing when taken for read by multiple processors, and the
> > > resulting I/O performance degradation is measurable.
> > 
> > This doesn't sound like a new problem.  Hasn't this global access,
> > single modifier exclusion problem been solved before in the VFS?
> > e.g. mnt_want_write()/mnt_make_readonly()
> > 
> > Cheers,
> > 
> > Dave.
> 
> Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw 
> semaphores. I think you can convert mnt_want_write()/mnt_make_readonly() 
> to use percpu rw semaphores and remove the duplicated code.

I think you misunderstood my point - rather than re-inventing the
wheel, why didn't you just copy something that is known to work?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-25 Thread Mikulas Patocka


On Wed, 24 Oct 2012, Dave Chinner wrote:

> On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 19 Oct 2012, Peter Zijlstra wrote:
> > 
> > > > Yes, I tried this approach - it involves doing a LOCK instruction on
> > > > read lock, remembering the cpu and doing another LOCK instruction on
> > > > read unlock (which will hopefully be on the same CPU, so no cacheline
> > > > bouncing happens in the common case). It was slower than the approach
> > > > without any LOCK instructions (43.3 seconds for the implementation with
> > > > per-cpu LOCKed access, 42.7 seconds for this implementation without
> > > > atomic instruction; the benchmark involved doing 512-byte direct-io
> > > > reads and writes on a ramdisk with 8 processes on an 8-core machine).
> > > 
> > > So why is that a problem? Surely that's already tons better than what
> > > you've currently got.
> > 
> > Percpu rw-semaphores do not improve performance at all. I put them there 
> > to avoid performance regression, not to improve performance.
> > 
> > All Linux kernels have a race condition - when you change the block size
> > of a block device and read or write the device at the same time, a crash
> > may happen. This bug has been there forever. Recently, it started to
> > cause major trouble - multiple high-profile business sites report crashes
> > because of this race condition.
> >
> > You can fix this race by using a read lock around the I/O paths and a
> > write lock around block size changing, but a normal rw semaphore causes
> > cache line bouncing when taken for read by multiple processors, and the
> > resulting I/O performance degradation is measurable.
> 
> This doesn't sound like a new problem.  Hasn't this global access,
> single modifier exclusion problem been solved before in the VFS?
> e.g. mnt_want_write()/mnt_make_readonly()
> 
> Cheers,
> 
> Dave.

Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw 
semaphores. I think you can convert mnt_want_write()/mnt_make_readonly() 
to use percpu rw semaphores and remove the duplicated code.

Mikulas


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-23 Thread Dave Chinner
On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
> 
> 
> On Fri, 19 Oct 2012, Peter Zijlstra wrote:
> 
> > > Yes, I tried this approach - it involves doing a LOCK instruction on
> > > read lock, remembering the cpu and doing another LOCK instruction on
> > > read unlock (which will hopefully be on the same CPU, so no cacheline
> > > bouncing happens in the common case). It was slower than the approach
> > > without any LOCK instructions (43.3 seconds for the implementation with
> > > per-cpu LOCKed access, 42.7 seconds for this implementation without
> > > atomic instruction; the benchmark involved doing 512-byte direct-io
> > > reads and writes on a ramdisk with 8 processes on an 8-core machine).
> > 
> > So why is that a problem? Surely that's already tons better than what
> > you've currently got.
> 
> Percpu rw-semaphores do not improve performance at all. I put them there 
> to avoid performance regression, not to improve performance.
> 
> All Linux kernels have a race condition - when you change the block size
> of a block device and read or write the device at the same time, a crash
> may happen. This bug has been there forever. Recently, it started to
> cause major trouble - multiple high-profile business sites report crashes
> because of this race condition.
>
> You can fix this race by using a read lock around the I/O paths and a
> write lock around block size changing, but a normal rw semaphore causes
> cache line bouncing when taken for read by multiple processors, and the
> resulting I/O performance degradation is measurable.

This doesn't sound like a new problem.  Hasn't this global access,
single modifier exclusion problem been solved before in the VFS?
e.g. mnt_want_write()/mnt_make_readonly()

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-23 Thread Oleg Nesterov
Hi Mikulas,

On 10/22, Mikulas Patocka wrote:
>
> On Fri, 19 Oct 2012, Oleg Nesterov wrote:
>
> > On 10/19, Mikulas Patocka wrote:
> > >
> > > synchronize_rcu() is way slower than msleep(1) -
> >
> > This depends, I guess, but this doesn't matter,
> >
> > > so I don't see a reason
> > > why it should be complicated to avoid msleep(1).
> >
> > I don't think this really needs complications. Please look at this
> > patch for example. Or initial (single writer) version below. It is
> > not finished and lacks the barriers too, but I do not think it is
^^^

please note the comment above ;)

> > more complex.
>
> Hi
>
> My implementation has a smaller structure (it doesn't have
> wait_queue_head_t).

Oh, I don't think sizeof() really matters in this case.

> Your implementation is prone to starvation - if the writer has a high
> priority and if it is doing back-to-back write unlocks/locks, it may
> happen that the readers have no chance to run.

Yes, it is write-biased; this was the intent. Writers should be rare.

> The use of mutex instead of a wait queue in my implementation is unusual,
> but I don't see anything wrong with it

Neither me.

Mikulas, apart from the _rcu/_sched change, my only point was that
msleep() can (and imho should) be avoided.

> > static inline long brw_read_ctr(struct brw_sem *brw)
> > {
> >         long sum = 0;
> >         int cpu;
> >
> >         for_each_possible_cpu(cpu)
> >                 sum += per_cpu(*brw->read_ctr, cpu);
>
> Integer overflow on signed types is undefined - you should use unsigned
> long - you can use the -fwrapv option to gcc to make signed overflow
> defined, but Linux doesn't use it.

I don't think -fwrapv can make any difference in this case, but I agree
that "unsigned long" makes more sense.

> > void brw_up_write(struct brw_sem *brw)
> > {
> >         brw->writer = NULL;
> >         synchronize_sched();
>
> That synchronize_sched() should be put before brw->writer = NULL.

Yes, I know. I mentioned this at the start: this lacks the necessary
barrier between this writer and the next reader.

> I had this bug in my implementation too.

Yes, exactly. And this is why I cc'ed you initially ;)

Oleg.



Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-22 Thread Mikulas Patocka


On Fri, 19 Oct 2012, Oleg Nesterov wrote:

> On 10/19, Mikulas Patocka wrote:
> >
> > synchronize_rcu() is way slower than msleep(1) -
> 
> This depends, I guess, but this doesn't matter,
> 
> > so I don't see a reason
> > why it should be complicated to avoid msleep(1).
> 
> I don't think this really needs complications. Please look at this
> patch for example. Or initial (single writer) version below. It is
> not finished and lacks the barriers too, but I do not think it is
> more complex.

Hi

My implementation has a smaller structure (it doesn't have 
wait_queue_head_t).

Using preempt_disable()/synchronize_sched() instead of RCU seems like a 
good idea. Here, the locked region is so small that it doesn't make sense 
to play tricks with preemptible RCU.

Your implementation is prone to starvation - if the writer has a high 
priority and if it is doing back-to-back write unlocks/locks, it may 
happen that the readers have no chance to run.

The use of a mutex instead of a wait queue in my implementation is unusual,
but I don't see anything wrong with it - it makes the structure smaller
and it solves the starvation problem (which would otherwise be complicated
to solve).

Mikulas

> Oleg.
> 
> struct brw_sem {
>   long __percpu       *read_ctr;
>   wait_queue_head_t   read_waitq;
>   struct mutex        writer_mutex;
>   struct task_struct  *writer;
> };
> 
> int brw_init(struct brw_sem *brw)
> {
>   brw->writer = NULL;
>   mutex_init(&brw->writer_mutex);
>   init_waitqueue_head(&brw->read_waitq);
>   brw->read_ctr = alloc_percpu(long);
>   return brw->read_ctr ? 0 : -ENOMEM;
> }
> 
> void brw_down_read(struct brw_sem *brw)
> {
>   for (;;) {
>   bool done = false;
> 
>   preempt_disable();
>   if (likely(!brw->writer)) {
>   __this_cpu_inc(*brw->read_ctr);
>   done = true;
>   }
>   preempt_enable();
> 
>   if (likely(done))
>   break;
> 
>   __wait_event(brw->read_waitq, !brw->writer);
>   }
> }
> 
> void brw_up_read(struct brw_sem *brw)
> {
>   struct task_struct *writer;
> 
>   preempt_disable();
>   __this_cpu_dec(*brw->read_ctr);
>   writer = ACCESS_ONCE(brw->writer);
>   if (unlikely(writer))
>   wake_up_process(writer);
>   preempt_enable();
> }
> 
> static inline long brw_read_ctr(struct brw_sem *brw)
> {
>   long sum = 0;
>   int cpu;
> 
>   for_each_possible_cpu(cpu)
>   sum += per_cpu(*brw->read_ctr, cpu);

Integer overflow on signed types is undefined - you should use unsigned 
long - you can use -fwrapv option to gcc to make signed overflow defined, 
but Linux doesn't use it.

> 
>   return sum;
> }
> 
> void brw_down_write(struct brw_sem *brw)
> {
>   mutex_lock(&brw->writer_mutex);
>   brw->writer = current;
>   synchronize_sched();
>   /*
>* Thereafter brw_*_read() must see ->writer != NULL,
>* and we should see the result of __this_cpu_inc().
>*/
>   for (;;) {
>   set_current_state(TASK_UNINTERRUPTIBLE);
>   if (brw_read_ctr(brw) == 0)
>   break;
>   schedule();
>   }
>   __set_current_state(TASK_RUNNING);
>   /*
>* We can add another synchronize_sched() to avoid the
>* spurious wakeups from brw_up_read() after return.
>*/
> }
> 
> void brw_up_write(struct brw_sem *brw)
> {
>   brw->writer = NULL;
>   synchronize_sched();

That synchronize_sched() should be put before brw->writer = NULL. As
written this is incorrect, because the store brw->writer = NULL may be
reordered with previous writes done by this process, and another CPU may
see brw->writer == NULL (and think that the lock is unlocked) while it
doesn't yet see the writes done by the writer. A corrected version is
sketched after the quoted code below.

I had this bug in my implementation too.

>   wake_up_all(&brw->read_waitq);
>   mutex_unlock(&brw->writer_mutex);
> }
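
For clarity, brw_up_write with the ordering fixed as described above would
be (a sketch of the fix, not a tested patch):

void brw_up_write(struct brw_sem *brw)
{
        /*
         * Wait for a grace period first, so that a reader that sees
         * ->writer == NULL is guaranteed to also see all the writes
         * done under the write lock.
         */
        synchronize_sched();
        brw->writer = NULL;
        wake_up_all(&brw->read_waitq);
        mutex_unlock(&brw->writer_mutex);
}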

Mikulas


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Mikulas Patocka


On Fri, 19 Oct 2012, Peter Zijlstra wrote:

> > Yes, I tried this approach - it involves doing a LOCK instruction on read
> > lock, remembering the cpu and doing another LOCK instruction on read
> > unlock (which will hopefully be on the same CPU, so no cacheline bouncing
> > happens in the common case). It was slower than the approach without any
> > LOCK instructions (43.3 seconds for the implementation with per-cpu LOCKed
> > access, 42.7 seconds for this implementation without atomic instruction;
> > the benchmark involved doing 512-byte direct-io reads and writes on a
> > ramdisk with 8 processes on an 8-core machine).
> 
> So why is that a problem? Surely that's already tons better than what
> you've currently got.

Percpu rw-semaphores do not improve performance at all. I put them there 
to avoid performance regression, not to improve performance.

All Linux kernels have a race condition - when you change the block size
of a block device and read or write the device at the same time, a crash
may happen. This bug has been there forever. Recently, it started to cause
major trouble - multiple high-profile business sites report crashes because
of this race condition.

You can fix this race by using a read lock around the I/O paths and a write
lock around block size changing, but a normal rw semaphore causes cache
line bouncing when taken for read by multiple processors, and the resulting
I/O performance degradation is measurable.

So I put this percpu-rw-semaphore there to fix the crashes and minimize 
performance impact - on x86 it doesn't take any interlocked instructions 
in the read path.

I don't quite understand why people are opposing this and what they want
to do instead. If you pull percpu-rw-semaphores out of the kernel, you
introduce a performance regression (raw device i/o will be slower on 3.7
than on 3.6, because on 3.6 it doesn't take any lock at all and on 3.7 it
takes a read lock).

So you have these options:
1) don't lock i/o, just like 3.6 and previous versions - you get a fast
kernel that randomly crashes
2) lock i/o with a normal rw semaphore - you get a kernel that doesn't
crash, but is slower than previous versions
3) lock i/o with a percpu rw semaphore - you get a kernel that is almost as
fast as previous kernels and doesn't crash

For the users, the option 3) is the best. The users don't care whether it 
looks ugly or not, they care about correctness and performance, that's 
all.

Obviously, you can improve the rw semaphores by adding lockdep annotations
or by other things (turning rcu_read_lock()/synchronize_rcu() into
preempt_disable()/synchronize_sched(), using barrier()/synchronize_sched()
instead of smp_mb()...), but I don't see a reason why you would want to
hurt users' experience by pulling it out, reverting to state 1) or 2), and
then, two kernel cycles later, coming up with percpu-rw-semaphores again.

Mikulas


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Paul E. McKenney
On Thu, Oct 18, 2012 at 07:57:47PM +0200, Oleg Nesterov wrote:
> On 10/18, Paul E. McKenney wrote:
> >
> > On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> > >
> > > I thought that you meant that without mb() brw_start_write() can
> > > race with brw_end_read() and hang forever.
> > >
> > > But probably you meant that we need the barriers to ensure that,
> > > say, if the reader does
> > >
> > >   brw_start_read();
> > >   CONDITION = 1;
> > >   brw_end_read();
> > >
> > > then the writer must see CONDITION != 0 after brw_start_write() ?
> > > (or vice-versa)
> >
> > Yes, this is exactly my concern.
> 
> Oh, thanks at lot Paul (as always).

Glad it helped.  ;-)

> > > In this case we need the barrier, yes. Obviously brw_start_write()
> > > can return right after this_cpu_dec() and before wake_up_all().
> > >
> > > 2/2 doesn't need this guarantee but I agree, this doesn't look
> > > sane in general...
> >
> > Or name it something not containing "lock".  And clearly document
> > the behavior and how it is to be used.  ;-)
> 
> this would be insane, I guess ;)

Well, I suppose you could call it a "phase": brw_start_phase_1() and
so on.

> So. Ignoring the possible optimization you mentioned before,
> brw_end_read() should do:
> 
>   smp_mb();
>   this_cpu_dec();
> 
>   wake_up_all();
> 
> And yes, we need the full mb(). wmb() is enough to ensure that the
> writer will see the memory modifications done by the reader. But we
> also need to ensure that any LOAD inside start_read/end_read cannot
> be moved outside of the critical section.
> 
> But we should also ensure that "read" will see all modifications
> which were done under start_write/end_write. This means that
> brw_end_write() needs another synchronize_sched() before
> atomic_dec_and_test(), or brw_start_read() needs mb() in the
> fast-path.
> 
> Correct?

Good point, I missed the need for synchronize_sched() to avoid
readers sleeping through the next write cycle due to racing with
an exiting writer.  But yes, this sounds correct.
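
To make the agreed ordering concrete, brw_end_read() from the plan above
would look something like this (an illustrative sketch, not anyone's final
code; the waitqueue name is made up):

void brw_end_read(struct brw_sem *brw)
{
        /*
         * Full barrier: orders the critical-section loads and stores
         * before the counter decrement that the writer polls, and
         * before the wakeup.
         */
        smp_mb();
        this_cpu_dec(*brw->read_ctr);
        wake_up_all(&brw->write_waitq);         /* made-up name */
}

and brw_end_write() would do a synchronize_sched() before its final
counter update (or brw_start_read() would need an mb() in the fast path),
so that the next readers see all modifications done under the write lock.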

> Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> something similar. Certainly it was not in my tree when I started
> this patch... percpu_down_write() doesn't allow multiple writers,
> but the main problem is that it uses msleep(1). It should not, I think.
> 
> But. It seems that percpu_up_write() is equally wrong? Doesn't
> it need synchronize_rcu() before "p->locked = false" ?
> 
> (add Mikulas)

Mikulas said something about doing an updated patch, so I figured I
would look at his next version.

Thanx, Paul



Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Oleg Nesterov
On 10/19, Peter Zijlstra wrote:
>
> But using preempt_{disable,enable} and using synchronize_sched() would
> be better (for PREEMPT_RCU) although it wouldn't fix anything
> fundamental.

BTW, I agree. I didn't even notice percpu-rwsem.h uses _rcu, not _sched.

> Fine goal, although somewhat arch specific. Also note that there's a
> relation between atomics and memory barriers, one isn't necessarily
> worse than the other, they all require synchronization of sorts.

As Paul pointed out, the fast path can avoid mb(). It is only needed
when "up_read" detects the writer.

Oleg.



Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Oleg Nesterov
On 10/19, Mikulas Patocka wrote:
>
> synchronize_rcu() is way slower than msleep(1) -

This depends, I guess, but this doesn't matter,

> so I don't see a reason
> why it should be complicated to avoid msleep(1).

I don't think this really needs complications. Please look at this
patch for example. Or initial (single writer) version below. It is
not finished and lacks the barriers too, but I do not think it is
more complex.

Oleg.

struct brw_sem {
        long __percpu           *read_ctr;
        wait_queue_head_t       read_waitq;
        struct mutex            writer_mutex;
        struct task_struct      *writer;
};

int brw_init(struct brw_sem *brw)
{
        brw->writer = NULL;
        mutex_init(&brw->writer_mutex);
        init_waitqueue_head(&brw->read_waitq);
        brw->read_ctr = alloc_percpu(long);
        return brw->read_ctr ? 0 : -ENOMEM;
}

void brw_down_read(struct brw_sem *brw)
{
        for (;;) {
                bool done = false;

                preempt_disable();
                if (likely(!brw->writer)) {
                        __this_cpu_inc(*brw->read_ctr);
                        done = true;
                }
                preempt_enable();

                if (likely(done))
                        break;

                __wait_event(brw->read_waitq, !brw->writer);
        }
}

void brw_up_read(struct brw_sem *brw)
{
        struct task_struct *writer;

        preempt_disable();
        __this_cpu_dec(*brw->read_ctr);
        writer = ACCESS_ONCE(brw->writer);
        if (unlikely(writer))
                wake_up_process(writer);
        preempt_enable();
}

static inline long brw_read_ctr(struct brw_sem *brw)
{
        long sum = 0;
        int cpu;

        for_each_possible_cpu(cpu)
                sum += per_cpu(*brw->read_ctr, cpu);

        return sum;
}

void brw_down_write(struct brw_sem *brw)
{
        mutex_lock(&brw->writer_mutex);
        brw->writer = current;
        synchronize_sched();
        /*
         * Thereafter brw_*_read() must see ->writer != NULL,
         * and we should see the result of __this_cpu_inc().
         */
        for (;;) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (brw_read_ctr(brw) == 0)
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);
        /*
         * We can add another synchronize_sched() to avoid the
         * spurious wakeups from brw_up_read() after return.
         */
}

void brw_up_write(struct brw_sem *brw)
{
        brw->writer = NULL;
        synchronize_sched();
        wake_up_all(&brw->read_waitq);
        mutex_unlock(&brw->writer_mutex);
}
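
For reference, usage would mirror an ordinary rwsem. A hypothetical caller
(names made up, not part of the patch; brw_init(&bdev_size_sem) must have
run first):

static struct brw_sem bdev_size_sem;

/* I/O path: the common case takes no LOCKed instructions on x86 */
void do_io(void)
{
        brw_down_read(&bdev_size_sem);
        /* ... do I/O that relies on a stable block size ... */
        brw_up_read(&bdev_size_sem);
}

/* block size change: rare, pays the synchronize_sched() cost */
void set_block_size(void)
{
        brw_down_write(&bdev_size_sem);
        /* ... change the block size ... */
        brw_up_write(&bdev_size_sem);
}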



Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Peter Zijlstra
On Fri, 2012-10-19 at 11:32 -0400, Mikulas Patocka wrote:

> So if you can do an alternative implementation without RCU, show it.

Uhm... no, that's not how it works. You just don't push through crap like
this and then demand someone else do it better.

But using preempt_{disable,enable} and using synchronize_sched() would
be better (for PREEMPT_RCU) although it wouldn't fix anything
fundamental.

> The goal is - there should be no LOCK instructions on the read path and
> as few barriers as possible.

Fine goal, although somewhat arch specific. Also note that there's a
relation between atomics and memory barriers, one isn't necessarily
worse than the other, they all require synchronization of sorts.  

> > So did you consider keeping the inc/dec on the same per-cpu variable?
> > Yes this adds a potential remote access to dec and requires you to use
> > atomics, but I would not be surprised if the inc/dec were mostly on the
> > same cpu most of the times -- which might be plenty fast for what you
> > want.
> 
> Yes, I tried this approach - it involves doing a LOCK instruction on read
> lock, remembering the cpu and doing another LOCK instruction on read
> unlock (which will hopefully be on the same CPU, so no cacheline bouncing
> happens in the common case). It was slower than the approach without any
> LOCK instructions (43.3 seconds for the implementation with per-cpu LOCKed
> access, 42.7 seconds for this implementation without atomic instruction;
> the benchmark involved doing 512-byte direct-io reads and writes on a
> ramdisk with 8 processes on an 8-core machine).

So why is that a problem? Surely that's already tons better than what
you've currently got. Also, uncontended LOCK is something all x86 vendors
keep optimizing; they'll have to if they want to keep adding CPUs.

> > If you've got coherent per-cpu counts, you can better do the
> > waitqueue/wake condition for write_down.
> 
> synchronize_rcu() is way slower than msleep(1) - so I don't see a reason
> why it should be complicated to avoid msleep(1).

It's not about speed; a polling write side is just fscking ugly. Also, if
you're already polling, that *_sync() is bloody pointless.

> > It might also make sense to do away with the mutex, there's no point in
> > serializing the wakeups in the p->locked case of down_read.
> 
> The advantage of a mutex is that it is already protected against 
> starvation. If I replace the mutex with a wait queue and retry, there is 
> no starvation protection.

Which starvation? Writer-writer order? What stops you from adding a list
there yourself? Also, writers had better be rare for this thing, so who
gives a crap?

> > Furthermore,
> > p->locked seems a complete duplicate of the mutex state, so removing the
> > mutex also removes that duplication.
> 
> We could replace if (p->locked) with if (mutex_is_locked(p->mtx))

Quite so.. 

You're also still lacking lockdep annotations...


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Mikulas Patocka


On Fri, 19 Oct 2012, Peter Zijlstra wrote:

> On Thu, 2012-10-18 at 15:28 -0400, Mikulas Patocka wrote:
> > 
> > On Thu, 18 Oct 2012, Oleg Nesterov wrote:
> > 
> > > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> > > something similar. Certainly it was not in my tree when I started
> > > this patch... percpu_down_write() doesn't allow multiple writers,
> > > but the main problem is that it uses msleep(1). It should not, I think.
> > 
> > synchronize_rcu() can sleep for hundreds of milliseconds, so msleep(1)
> > is not a big problem.
> 
> That code is beyond ugly though.. it should really not have been merged.
> 
> There's absolutely no reason for it to use RCU except to make it more

So if you can do an alternative implementation without RCU, show it. The 
goal is - there should be no LOCK instructions on the read path and as 
few barriers as possible.

> complicated. And as Oleg pointed out, that msleep() is very ill
> considered.
> 
> The very worst part of it seems to be that nobody who's usually involved
> with locking primitives was ever consulted (Linus, PaulMck, Oleg, Ingo,
> tglx, dhowells and me). It doesn't even have lockdep annotations :/
> 
> So the only reason you appear to use RCU is because you don't actually
> have a sane way to wait for count==0. And I'm contesting rcu_sync() is
> sane here -- for the very simple reason that you still need a while
> (count) loop right after it.
> 
> So it appears you want a totally reader-biased, sleepable rw-lock-like
> thing?

Yes.

> So did you consider keeping the inc/dec on the same per-cpu variable?
> Yes this adds a potential remote access to dec and requires you to use
> atomics, but I would not be surprised if the inc/dec were mostly on the
> same cpu most of the times -- which might be plenty fast for what you
> want.

Yes, I tried this approach - it involves doing a LOCK instruction on read
lock, remembering the cpu and doing another LOCK instruction on read unlock
(which will hopefully be on the same CPU, so no cacheline bouncing happens
in the common case). It was slower than the approach without any LOCK
instructions (43.3 seconds for the implementation with per-cpu LOCKed
access, 42.7 seconds for this implementation without atomic instruction;
the benchmark involved doing 512-byte direct-io reads and writes on a
ramdisk with 8 processes on an 8-core machine). A reconstructed sketch of
such a LOCKed variant is below.
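
Roughly, the per-cpu LOCKed variant looked like this (a sketch
reconstructed from the description above, not the actual benchmark code):

struct locked_brw {
        atomic_t __percpu *read_ctr;
};

static int locked_down_read(struct locked_brw *brw)
{
        int cpu = get_cpu();    /* remember the cpu for the unlock */

        atomic_inc(per_cpu_ptr(brw->read_ctr, cpu));    /* LOCK inc */
        put_cpu();
        return cpu;
}

static void locked_up_read(struct locked_brw *brw, int cpu)
{
        /* usually still the same cpu, so the cacheline stays local */
        atomic_dec(per_cpu_ptr(brw->read_ctr, cpu));    /* LOCK dec */
}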

> If you've got coherent per-cpu counts, you can better do the
> waitqueue/wake condition for write_down.

synchronize_rcu() is way slower than msleep(1) - so I don't see a reason
why it should be complicated to avoid msleep(1).

> It might also make sense to do away with the mutex, there's no point in
> serializing the wakeups in the p->locked case of down_read.

The advantage of a mutex is that it is already protected against 
starvation. If I replace the mutex with a wait queue and retry, there is 
no starvation protection.

> Furthermore,
> p->locked seems a complete duplicate of the mutex state, so removing the
> mutex also removes that duplication.

We could replace if (p->locked) with if (mutex_is_locked(p->mtx))

> Also, that CONFIG_x86 thing.. *shudder*...

Mikulas


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Peter Zijlstra
On Thu, 2012-10-18 at 15:28 -0400, Mikulas Patocka wrote:
> 
> On Thu, 18 Oct 2012, Oleg Nesterov wrote:
> 
> > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> > something similar. Certainly it was not in my tree when I started
> > this patch... percpu_down_write() doesn't allow multiple writers,
> > but the main problem is that it uses msleep(1). It should not, I think.
> 
> synchronize_rcu() can sleep for hundreds of milliseconds, so msleep(1) is
> not a big problem.

That code is beyond ugly though... it should really not have been merged.

There's absolutely no reason for it to use RCU except to make it more
complicated. And as Oleg pointed out, that msleep() is very ill
considered.

The very worst part of it seems to be that nobody who's usually involved
with locking primitives was ever consulted (Linus, PaulMck, Oleg, Ingo,
tglx, dhowells and me). It doesn't even have lockdep annotations :/

So the only reason you appear to use RCU is because you don't actually
have a sane way to wait for count==0. And I'm contesting rcu_sync() is
sane here -- for the very simple reason that you still need a while
(count) loop right after it.

So it appears you want a totally reader-biased, sleepable rw-lock-like
thing?

So did you consider keeping the inc/dec on the same per-cpu variable?
Yes this adds a potential remote access to dec and requires you to use
atomics, but I would not be surprised if the inc/dec were mostly on the
same cpu most of the times -- which might be plenty fast for what you
want.

If you've got coherent per-cpu counts, you can better do the
waitqueue/wake condition for write_down.

It might also make sense to do away with the mutex, there's no point in
serializing the wakeups in the p->locked case of down_read. Furthermore,
p->locked seems a complete duplicate of the mutex state, so removing the
mutex also removes that duplication.

Also, that CONFIG_x86 thing.. *shudder*...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Peter Zijlstra
On Fri, 2012-10-19 at 11:32 -0400, Mikulas Patocka wrote:

> So if you can do an alternative implementation without RCU, show it.

Uhm... no, that's not how it works. You just don't push through crap like
this and then demand someone else does it better.

But using preempt_{disable,enable} and using synchronize_sched() would
be better (for PREEMPT_RCU) although it wouldn't fix anything
fundamental.

> The 
> goal is - there should be no LOCK instructions on the read path and as 
> few barriers as possible.

Fine goal, although somewhat arch specific. Also note that there's a
relation between atomics and memory barriers, one isn't necessarily
worse than the other, they all require synchronization of sorts.

> > So did you consider keeping the inc/dec on the same per-cpu variable?
> > Yes this adds a potential remote access to dec and requires you to use
> > atomics, but I would not be surprised if the inc/dec were mostly on the
> > same cpu most of the times -- which might be plenty fast for what you
> > want.
> 
> Yes, I tried this approach - it involves doing LOCK instruction on read 
> lock, remembering the cpu and doing another LOCK instruction on read 
> unlock (which will hopefully be on the same CPU, so no cacheline bouncing 
> happens in the common case). It was slower than the approach without any 
> LOCK instructions (43.3 seconds for the implementation with 
> per-cpu LOCKed access, 42.7 seconds for this implementation without atomic 
> instruction; the benchmark involved doing 512-byte direct-io reads and 
> writes on a ramdisk with 8 processes on 8-core machine).

So why is that a problem? Surely that's already tons better than what
you've currently got. Also uncontended LOCK is something all x86 vendors
keep optimizing, they'll have to if they want to keep adding CPUs.

> > If you've got coherent per-cpu counts, you can better do the
> > waitqueue/wake condition for write_down.
> 
> synchronize_rcu() is way slower than msleep(1) - so I don't see a reason 
> why should it be complicated to avoid msleep(1).

It's not about slow, a polling write side is just fscking ugly. Also, if
you're already polling that *_sync() is bloody pointless.

> > It might also make sense to do away with the mutex, there's no point in
> > serializing the wakeups in the p->locked case of down_read.
> 
> The advantage of a mutex is that it is already protected against 
> starvation. If I replace the mutex with a wait queue and retry, there is 
> no starvation protection.

Which starvation? writer-writer order? What stops you from adding a list
there yourself? Also, writers had better be rare for this thing, so who
gives a crap?

> > Furthermore,
> > p->locked seems a complete duplicate of the mutex state, so removing the
> > mutex also removes that duplication.
> 
> We could replace if (p->locked) with if (mutex_is_locked(&p->mtx))

Quite so.. 

You're also still lacking lockdep annotations...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Oleg Nesterov
On 10/19, Mikulas Patocka wrote:

> synchronize_rcu() is way slower than msleep(1) -

This depends, I guess. But this doesn't matter,

> so I don't see a reason
> why should it be complicated to avoid msleep(1).

I don't think this really needs complications. Please look at this
patch for example. Or the initial (single-writer) version below. It is
not finished and lacks the barriers too, but I do not think it is
more complex.

Oleg.

struct brw_sem {
        long __percpu           *read_ctr;
        wait_queue_head_t       read_waitq;
        struct mutex            writer_mutex;
        struct task_struct      *writer;
};

int brw_init(struct brw_sem *brw)
{
        brw->writer = NULL;
        mutex_init(&brw->writer_mutex);
        init_waitqueue_head(&brw->read_waitq);
        brw->read_ctr = alloc_percpu(long);
        return brw->read_ctr ? 0 : -ENOMEM;
}

void brw_down_read(struct brw_sem *brw)
{
        for (;;) {
                bool done = false;

                preempt_disable();
                if (likely(!brw->writer)) {
                        __this_cpu_inc(*brw->read_ctr);
                        done = true;
                }
                preempt_enable();

                if (likely(done))
                        break;

                __wait_event(brw->read_waitq, !brw->writer);
        }
}

void brw_up_read(struct brw_sem *brw)
{
        struct task_struct *writer;

        preempt_disable();
        __this_cpu_dec(*brw->read_ctr);
        writer = ACCESS_ONCE(brw->writer);
        if (unlikely(writer))
                wake_up_process(writer);
        preempt_enable();
}

static inline long brw_read_ctr(struct brw_sem *brw)
{
        long sum = 0;
        int cpu;

        for_each_possible_cpu(cpu)
                sum += per_cpu(*brw->read_ctr, cpu);

        return sum;
}

void brw_down_write(struct brw_sem *brw)
{
        mutex_lock(&brw->writer_mutex);
        brw->writer = current;
        synchronize_sched();
        /*
         * Thereafter brw_*_read() must see ->writer != NULL,
         * and we should see the result of __this_cpu_inc().
         */
        for (;;) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (brw_read_ctr(brw) == 0)
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);
        /*
         * We can add another synchronize_sched() to avoid the
         * spurious wakeups from brw_up_read() after return.
         */
}

void brw_up_write(struct brw_sem *brw)
{
        brw->writer = NULL;
        synchronize_sched();
        wake_up_all(&brw->read_waitq);
        mutex_unlock(&brw->writer_mutex);
}
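
(Illustrative usage, not from the mail -- assumes a brw_sem set up with
brw_init():)

        static struct brw_sem brw;

        static void reader(void)
        {
                brw_down_read(&brw);    /* fast path: no LOCK insn, no mb() */
                /* read-side critical section */
                brw_up_read(&brw);
        }

        static void writer(void)
        {
                brw_down_write(&brw);   /* slow: mutex + synchronize_sched() */
                /* write-side critical section */
                brw_up_write(&brw);
        }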

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Oleg Nesterov
On 10/19, Peter Zijlstra wrote:

> But using preempt_{disable,enable} and using synchronize_sched() would
> be better (for PREEMPT_RCU) although it wouldn't fix anything
> fundamental.

BTW, I agree. I didn't even notice percpu-rwsem.h uses _rcu, not _sched.

> Fine goal, although somewhat arch specific. Also note that there's a
> relation between atomics and memory barriers, one isn't necessarily
> worse than the other, they all require synchronization of sorts.

As Paul pointed out, the fast path can avoid mb(). It is only needed
when up_read detects the writer.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Paul E. McKenney
On Thu, Oct 18, 2012 at 07:57:47PM +0200, Oleg Nesterov wrote:
> On 10/18, Paul E. McKenney wrote:
> >
> > On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> > >
> > > I thought that you meant that without mb() brw_start_write() can
> > > race with brw_end_read() and hang forever.
> > >
> > > But probably you meant that we need the barriers to ensure that,
> > > say, if the reader does
> > >
> > >         brw_start_read();
> > >         CONDITION = 1;
> > >         brw_end_read();
> > >
> > > then the writer must see CONDITION != 0 after brw_start_write() ?
> > > (or vice-versa)
> >
> > Yes, this is exactly my concern.
> 
> Oh, thanks a lot Paul (as always).

Glad it helped.  ;-)

> > > In this case we need the barrier, yes. Obviously brw_start_write()
> > > can return right after this_cpu_dec() and before wake_up_all().
> > >
> > > 2/2 doesn't need this guarantee but I agree, this doesn't look
> > > sane in general...
> >
> > Or name it something not containing "lock".  And clearly document
> > the behavior and how it is to be used.  ;-)
> 
> this would be insane, I guess ;)

Well, I suppose you could call it a "phase": brw_start_phase_1() and
so on.

> So. Ignoring the possible optimization you mentioned before,
> brw_end_read() should do:
> 
>         smp_mb();
>         this_cpu_dec();
> 
>         wake_up_all();
> 
> And yes, we need the full mb(). wmb() is enough to ensure that the
> writer will see the memory modifications done by the reader. But we
> also need to ensure that any LOAD inside start_read/end_read can not
> be moved outside of the critical section.
> 
> But we should also ensure that "read" will see all modifications
> which were done under start_write/end_write. This means that
> brw_end_write() needs another synchronize_sched() before
> atomic_dec_and_test(), or brw_start_read() needs mb() in the
> fast-path.
> 
> Correct?

Good point, I missed the need for synchronize_sched() to avoid
readers sleeping through the next write cycle due to racing with
an exiting writer.  But yes, this sounds correct.

> Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> something similar. Certainly it was not in my tree when I started
> this patch... percpu_down_write() doesn't allow multiple writers,
> but the main problem it uses msleep(1). It should not, I think.
> 
> But. It seems that percpu_up_write() is equally wrong? Doesn't
> it need synchronize_rcu() before "p->locked = false" ?
> 
> (add Mikulas)

Mikulas said something about doing an updated patch, so I figured I
would look at his next version.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Mikulas Patocka


On Fri, 19 Oct 2012, Peter Zijlstra wrote:

> > Yes, I tried this approach - it involves doing LOCK instruction on read 
> > lock, remembering the cpu and doing another LOCK instruction on read 
> > unlock (which will hopefully be on the same CPU, so no cacheline bouncing 
> > happens in the common case). It was slower than the approach without any 
> > LOCK instructions (43.3 seconds for the implementation with 
> > per-cpu LOCKed access, 42.7 seconds for this implementation without atomic 
> > instruction; the benchmark involved doing 512-byte direct-io reads and 
> > writes on a ramdisk with 8 processes on 8-core machine).
> 
> So why is that a problem? Surely that's already tons better than what
> you've currently got.

Percpu rw-semaphores do not improve performance at all. I put them there 
to avoid performance regression, not to improve performance.

All Linux kernels have a race condition - when you change the block size of 
a block device while reading or writing it at the same time, a crash may 
happen. This bug has been there forever. Recently it started to cause major 
trouble - multiple high-profile business sites reported crashes because of 
this race condition.

You can fix this race by using a read lock around the I/O paths and a write 
lock around block size changing, but a normal rw semaphore causes cache line 
bouncing when taken for read by multiple processors, and the I/O performance 
degradation because of it is measurable.

So I put this percpu-rw-semaphore there to fix the crashes and minimize 
performance impact - on x86 it doesn't take any interlocked instructions 
in the read path.
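
(For illustration -- a rough sketch of that usage; the field names follow
the patch that went toward 3.7, cited here from memory:)

        /* I/O path: read side, taken around every request */
        percpu_down_read(&bdev->bd_block_size_semaphore);
        /* build and submit bios that rely on a stable block size */
        percpu_up_read(&bdev->bd_block_size_semaphore);

        /* set_blocksize(): the rare write side */
        percpu_down_write(&bdev->bd_block_size_semaphore);
        bdev->bd_block_size = size;
        bdev->bd_inode->i_blkbits = blksize_bits(size);
        percpu_up_write(&bdev->bd_block_size_semaphore);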

I don't quite understand why people are opposing this and what they want 
to do instead? If you pull percpu-rw-semaphores out of the kernel, 
you introduce a performance regression (raw device i/o will be slower on 
3.7 than on 3.6, because on 3.6 it doesn't take any lock at all and on 3.7 
it takes a read lock).

So you have options:
1) don't lock i/o just like on 3.6 and previous versions - you get a fast 
kernel that randomly crashes
2) lock i/o with normal rw semaphore - you get a kernel that doesn't 
crash, but that is slower than previous versions
3) lock i/o with percpu rw semaphore - you get a kernel that is almost as 
fast as previous kernels and that doesn't crash

For the users, the option 3) is the best. The users don't care whether it 
looks ugly or not, they care about correctness and performance, that's 
all.

Obviously, you can improve rw semaphores by adding lockdep annotations, or 
by other things (turning rcu_read_lock/synchronize_rcu into 
preempt_disable/synchronize_sched, by using barrier()-synchronize_sched() 
instead of smp_mb()...), but I don't see a reason why do you want to hurt 
users' experience by pulling it out and reverting to state 1) or 2) and 
then, two kernel cycles later, come up with percpu-rw-semaphores again.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-18 Thread Mikulas Patocka


On Thu, 18 Oct 2012, Oleg Nesterov wrote:

> Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> something similar. Certainly it was not in my tree when I started
> this patch... percpu_down_write() doesn't allow multiple writers,
> but the main problem it uses msleep(1). It should not, I think.

synchronize_rcu() can sleep for a hundred milliseconds, so msleep(1) is not 
a big problem.

> But. It seems that percpu_up_write() is equally wrong? Doesn't
> it need synchronize_rcu() before "p->locked = false" ?

Yes, it does ... and I sent a patch for that to Linus.

> (add Mikulas)
> 
> Oleg.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-18 Thread Oleg Nesterov
On 10/18, Paul E. McKenney wrote:
>
> On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> >
> > I thought that you meant that without mb() brw_start_write() can
> > race with brw_end_read() and hang forever.
> >
> > But probably you meant that we need the barriers to ensure that,
> > say, if the reader does
> >
> > brw_start_read();
> > CONDITION = 1;
> > brw_end_read();
> >
> > then the writer must see CONDITION != 0 after brw_start_write() ?
> > (or vice-versa)
>
> Yes, this is exactly my concern.

Oh, thanks a lot Paul (as always).

> > In this case we need the barrier, yes. Obviously brw_start_write()
> > can return right after this_cpu_dec() and before wake_up_all().
> >
> > 2/2 doesn't need this guarantee but I agree, this doesn't look
> > sane in general...
>
> Or name it something not containing "lock".  And clearly document
> the behavior and how it is to be used.  ;-)

this would be insane, I guess ;)

So. Ignoring the possible optimization you mentioned before,
brw_end_read() should do:

smp_mb();
this_cpu_dec();

wake_up_all();

And yes, we need the full mb(). wmb() is enough to ensure that the
writer will see the memory modifications done by the reader. But we
also need to ensure that any LOAD inside start_read/end_read can not
be moved outside of the critical section.

But we should also ensure that "read" will see all modifications
which were done under start_write/end_write. This means that
brw_end_write() needs another synchronize_sched() before
atomic_dec_and_test(), or brw_start_read() needs mb() in the
fast-path.

Correct?
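
(To make the second variant concrete -- a sketch only, not the patch; the
read_waitq field is an assumption:)

        void brw_end_write(struct brw_mutex *brw)
        {
                /* let the lockless readers see the writer's stores */
                synchronize_sched();
                if (atomic_dec_and_test(&brw->write_ctr))
                        wake_up_all(&brw->read_waitq);
        }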



Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
something similar. Certainly it was not in my tree when I started
this patch... percpu_down_write() doesn't allow multiple writers,
but the main problem it uses msleep(1). It should not, I think.

But. It seems that percpu_up_write() is equally wrong? Doesn't
it need synchronize_rcu() before "p->locked = false" ?

(add Mikulas)

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-18 Thread Paul E. McKenney
On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> On 10/17, Paul E. McKenney wrote:
> >
> > On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote:
> > > On 10/16, Paul E. McKenney wrote:
> > > >
> > > > Suppose that the writer arrives and sees that the value of the counter
> > > > is zero,
> > >
> > > after synchronize_sched(). So there are no readers (but perhaps there
> > > are brw_end_read's in flight which already decremented read_ctr)
> >
> > But the preempt_disable() region only covers read acquisition.  So
> > synchronize_sched() waits only for all the brw_start_read() calls to
> > reach the preempt_enable()
> 
> Yes.
> 
> > -- it cannot wait for all the resulting
> > readers to reach the corresponding brw_end_read().
> 
> Indeed.
> 
> > > > and thus never sleeps, and so is also not awakened?
> > >
> > > and why do we need wakeup in this case?
> >
> > To get the memory barriers required to keep the critical sections
> > ordered -- to ensure that everyone sees the reader's critical section
> > as ending before the writer's critical section starts.
> 
> And now I am starting to think I misunderstood your concern from
> the very beginning.
> 
> I thought that you meant that without mb() brw_start_write() can
> race with brw_end_read() and hang forever.
> 
> But probably you meant that we need the barriers to ensure that,
> say, if the reader does
> 
>   brw_start_read();
>   CONDITION = 1;
>   brw_end_read();
> 
> then the writer must see CONDITION != 0 after brw_start_write() ?
> (or vice-versa)

Yes, this is exactly my concern.

> In this case we need the barrier, yes. Obviously brw_start_write()
> can return right after this_cpu_dec() and before wake_up_all().
> 
> 2/2 doesn't need this guarantee but I agree, this doesn't look
> sane in general...

Or name it something not containing "lock".  And clearly document
the behavior and how it is to be used.  ;-)

Otherwise, someone will get confused and introduce bugs.

> Or I misunderstood you again?

No, this was indeed my concern.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-18 Thread Oleg Nesterov
On 10/17, Paul E. McKenney wrote:
>
> On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote:
> > On 10/16, Paul E. McKenney wrote:
> > >
> > > Suppose that the writer arrives and sees that the value of the counter
> > > is zero,
> >
> > after synchronize_sched(). So there are no readers (but perhaps there
> > are brw_end_read's in flight which already decremented read_ctr)
>
> But the preempt_disable() region only covers read acquisition.  So
> synchronize_sched() waits only for all the brw_start_read() calls to
> reach the preempt_enable()

Yes.

> -- it cannot wait for all the resulting
> readers to reach the corresponding brw_end_read().

Indeed.

> > > and thus never sleeps, and so is also not awakened?
> >
> > and why do we need wakeup in this case?
>
> To get the memory barriers required to keep the critical sections
> ordered -- to ensure that everyone sees the reader's critical section
> as ending before the writer's critical section starts.

And now I am starting to think I misunderstood your concern from
the very beginning.

I thought that you meant that without mb() brw_start_write() can
race with brw_end_read() and hang forever.

But probably you meant that we need the barriers to ensure that,
say, if the reader does

brw_start_read();
CONDITION = 1;
brw_end_read();

then the writer must see CONDITION != 0 after brw_start_write() ?
(or vice-versa)


In this case we need the barrier, yes. Obviously brw_start_write()
can return right after this_cpu_dec() and before wake_up_all().

2/2 doesn't need this guarantee but I agree, this doesn't look
sane in general...

Or I misunderstood you again?

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-17 Thread Paul E. McKenney
On Wed, Oct 17, 2012 at 06:59:02PM +0200, Oleg Nesterov wrote:
> On 10/16, Linus Torvalds wrote:
> >
> > On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov  wrote:
> > > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> > > it allows multiple writers too, just "read" and "write" are mutually
> > > exclusive.
> >
> > So those semantics just don't sound sane. It's also not what any kind
> > of normal "rw" lock ever does.
> 
> Yes, this is not usual.
> 
> And initially I made brw_sem which allows only 1 writer, but then
> I changed this patch.
> 
> > So can you explain why these particular insane semantics are useful,
> > and what for?
> 
> To allow multiple uprobe_register/unregister at the same time. Mostly
> to not add the "regression", currently this is possible.
> 
> It is not that I think this is terribly important, but still. And
> personally I think that "multiple writers" is not necessarily insane
> in general. Suppose you have the complex object/subsystem, the readers
> can use a single brw_mutex to access it "lockless", start_read() is
> very cheap.
> 
> But start_write() is slow. Multiple writers can use fine-grained locking
> inside the start_write/end_write section and do not block each other.

Strangely enough, the old VAXCluster locking primitives allowed this
sort of thing.  The brw_start_read() would be a "protected read", and
brw_start_write() would be a "concurrent write".  Even more interesting,
they gave the same advice you give -- concurrent writes should use
fine-grained locking to protect the actual accesses.

It seems like it should be possible to come up with better names,
but I cannot think of any at the moment.

Thanx, Paul

PS.  For the sufficiently masochistic, here is the exclusion table
 for the six VAXCluster locking modes:

        NL  CR  CW  PR  PW  EX
    NL
    CR                      X
    CW              X   X   X
    PR          X       X   X
    PW          X   X   X   X
    EX      X   X   X   X   X

"X" means that the pair of modes exclude each other,
otherwise the lock may be held in both of the modes
simultaneously.


Modes:

NL: Null, or "not held".
CR: Concurrent read.
CW: Concurrent write.
PR: Protected read.
PW: Protected write.
EX: Exclusive.


A reader-writer lock could use protected read for readers and either of
protected write or exclusive for writers, the difference between protected
write and exclusive being irrelevant in the absence of concurrent readers.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-17 Thread Paul E. McKenney
On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote:
> On 10/16, Paul E. McKenney wrote:
> >
> > On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote:
> > > >
> > > > I believe that you need smp_mb() here.
> > >
> > > I don't understand why...
> > >
> > > > The wake_up_all()'s memory barriers
> > > > do not suffice because some other reader might have awakened the writer
> > > > between this_cpu_dec() and wake_up_all().
> > >
> > > But __wake_up(q) takes q->lock? And the same lock is taken by
> > > prepare_to_wait(), so how can the writer miss the result of _dec?
> >
> > Suppose that the writer arrives and sees that the value of the counter
> > is zero,
> 
> after synchronize_sched(). So there are no readers (but perhaps there
> are brw_end_read's in flight which already decremented read_ctr)

But the preempt_disable() region only covers read acquisition.  So
synchronize_sched() waits only for all the brw_start_read() calls to
reach the preempt_enable() -- it cannot wait for all the resulting
readers to reach the corresponding brw_end_read().

> > and thus never sleeps, and so is also not awakened?
> 
> and why do we need wakeup in this case?

To get the memory barriers required to keep the critical sections
ordered -- to ensure that everyone sees the reader's critical section
as ending before the writer's critical section starts.

> > > > void brw_end_read(struct brw_mutex *brw)
> > > > {
> > > >         if (unlikely(atomic_read(&brw->write_ctr))) {
> > > >                 smp_mb();
> > > >                 this_cpu_dec(*brw->read_ctr);
> > > >                 wake_up_all(&brw->write_waitq);
> > >
> > > Hmm... still can't understand.
> > >
> > > It seems that this mb() is needed to ensure that brw_end_read() can't
> > > miss write_ctr != 0.
> > >
> > > But we do not care unless the writer already does wait_event(). And
> > > before it does wait_event() it calls synchronize_sched() after it sets
> > > write_ctr != 0. Doesn't this mean that after that any preempt-disabled
> > > section must see write_ctr != 0 ?
> > >
> > > This code actually checks write_ctr after preempt_disable + enable,
> > > but I think this doesn't matter?
> > >
> > > Paul, most probably I misunderstood you. Could you spell please?
> >
> > Let me try outlining the sequence of events that I am worried about...
> >
> > 1.  Task A invokes brw_start_read().  There is no writer, so it
> > takes the fastpath.
> >
> > 2.  Task B invokes brw_start_write(), atomically increments
> > &brw->write_ctr, and executes synchronize_sched().
> >
> > 3.  Task A invokes brw_end_read() and does this_cpu_dec().
> 
> OK. And to simplify this discussion, suppose that A invoked
> brw_start_read() on CPU_0 and thus incremented read_ctr[0], and
> then it migrates to CPU_1 and brw_end_read() uses read_ctr[1].
> 
> My understanding was, brw_start_write() must see read_ctr[0] == 1
> after synchronize_sched().

Yep.  But it makes absolutely no guarantee about ordering of the
decrement of read_ctr[1].

> > 4.  Task B invokes wait_event(), which invokes brw_read_ctr()
> > and sees the result as zero.
> 
> So my understanding is completely wrong? I thought that after
> synchronize_sched() we should see the result of any operation
> which were done inside the preempt-disable section.

We should indeed.  But the decrement of read_ctr[1] is not done within
the preempt_disable() section, and the guarantee therefore does not
apply to it.  This means that there is no guarantee that Task A's
read-side critical section will be ordered before Task B's read-side
critical section.

Now, maybe you don't need that guarantee, but if you don't, I am missing
what exactly these primitives are doing for you.

> No?
> 
> Hmm. Suppose that we have long A = B = STOP = 0, and
> 
>   void func(void)
>   {
>           preempt_disable();
>           if (!STOP) {
>                   A = 1;
>                   B = 1;
>           }
>           preempt_enable();
>   }
> 
> Now, you are saying that this code
> 
>   STOP = 1;
> 
>   synchronize_sched();
> 
>   BUG_ON(A != B);
> 
> is not correct? (yes, yes, this example is not very good).

Yep.  Assuming no other modifications to A and B, at the point of
the BUG_ON(), we should have A==1 and B==1.

The thing is that the preempt_disable() in your patch only covers
brw_start_read(), but not brw_end_read().  So the decrement (along with
the rest of the read-side critical section) is unordered with respect
to the write-side critical section started by the brw_start_write().

> The comment above synchronize_sched() says:
> 
>   return ... after all currently executing
>   rcu-sched read-side critical sections have completed.
> 
> But if this code is wrong, then what does "completed" actually mean?
> I thought that it also means "all memory operations have completed",
> but this is not true?

From what I can see, your interpretation of synchronize_sched() is
correct.

The problem is that brw_end_read() isn't within the relevant
rcu-sched read-side critical section.

Or that I am confused

Thanx, Paul


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-17 Thread Oleg Nesterov
On 10/16, Linus Torvalds wrote:
>
> On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov  wrote:
> > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> > it allows multiple writers too, just "read" and "write" are mutually
> > exclusive.
>
> So those semantics just don't sound sane. It's also not what any kind
> of normal "rw" lock ever does.

Yes, this is not usual.

And initially I made brw_sem which allows only 1 writer, but then
I changed this patch.

> So can you explain why these particular insane semantics are useful,
> and what for?

To allow multiple uprobe_register/unregister at the same time. Mostly
to not add the "regression", currently this is possible.

It is not that I think this is terribly important, but still. And
personally I think that "multiple writers" is not necessarily insane
in general. Suppose you have the complex object/subsystem, the readers
can use a single brw_mutex to access it "lockless", start_read() is
very cheap.

But start_write() is slow. Multiple writers can use fine-grained locking
inside the start_write/end_write section and do not block each other.
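
(A sketch of what that would look like -- mine, for illustration; the
inner lock stands in for whatever fine-grained locking the writers use:)

        static struct brw_mutex brw;    /* assumed initialized elsewhere */
        static DEFINE_SPINLOCK(inner);

        void writer(void)
        {
                brw_start_write(&brw);  /* excludes readers only */
                spin_lock(&inner);      /* writers serialize among themselves */
                /* update the shared structure */
                spin_unlock(&inner);
                brw_end_write(&brw);
        }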

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-17 Thread Oleg Nesterov
On 10/16, Paul E. McKenney wrote:
>
> On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote:
> > >
> > > I believe that you need smp_mb() here.
> >
> > I don't understand why...
> >
> > > The wake_up_all()'s memory barriers
> > > do not suffice because some other reader might have awakened the writer
> > > between this_cpu_dec() and wake_up_all().
> >
> > But __wake_up(q) takes q->lock? And the same lock is taken by
> > prepare_to_wait(), so how can the writer miss the result of _dec?
>
> Suppose that the writer arrives and sees that the value of the counter
> is zero,

after synchronize_sched(). So there are no readers (but perhaps there
are brw_end_read's in flight which already decremented read_ctr)

> and thus never sleeps, and so is also not awakened?

and why do we need wakeup in this case?

> > > void brw_end_read(struct brw_mutex *brw)
> > > {
> > >         if (unlikely(atomic_read(&brw->write_ctr))) {
> > >                 smp_mb();
> > >                 this_cpu_dec(*brw->read_ctr);
> > >                 wake_up_all(&brw->write_waitq);
> >
> > Hmm... still can't understand.
> >
> > It seems that this mb() is needed to ensure that brw_end_read() can't
> > miss write_ctr != 0.
> >
> > But we do not care unless the writer already does wait_event(). And
> > before it does wait_event() it calls synchronize_sched() after it sets
> > write_ctr != 0. Doesn't this mean that after that any preempt-disabled
> > section must see write_ctr != 0 ?
> >
> > This code actually checks write_ctr after preempt_disable + enable,
> > but I think this doesn't matter?
> >
> > Paul, most probably I misunderstood you. Could you spell please?
>
> Let me try outlining the sequence of events that I am worried about...
>
> 1.  Task A invokes brw_start_read().  There is no writer, so it
>   takes the fastpath.
>
> 2.  Task B invokes brw_start_write(), atomically increments
>     &brw->write_ctr, and executes synchronize_sched().
>
> 3.  Task A invokes brw_end_read() and does this_cpu_dec().

OK. And to simplify this discussion, suppose that A invoked
brw_start_read() on CPU_0 and thus incremented read_ctr[0], and
then it migrates to CPU_1 and brw_end_read() uses read_ctr[1].

My understanding was, brw_start_write() must see read_ctr[0] == 1
after synchronize_sched().

> 4.  Task B invokes wait_event(), which invokes brw_read_ctr()
>   and sees the result as zero.

So my understanding is completely wrong? I thought that after
synchronize_sched() we should see the result of any operation
which were done inside the preempt-disable section.

No?

Hmm. Suppose that we have long A = B = STOP = 0, and

        void func(void)
        {
                preempt_disable();
                if (!STOP) {
                        A = 1;
                        B = 1;
                }
                preempt_enable();
        }

Now, you are saying that this code

STOP = 1;

synchronize_sched();

BUG_ON(A != B);

is not correct? (yes, yes, this example is not very good).

The comment above synchronize_sched() says:

return ... after all currently executing
rcu-sched read-side critical sections have completed.

But if this code is wrong, then what does "completed" actually mean?
I thought that it also means "all memory operations have completed",
but this is not true?

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-17 Thread Oleg Nesterov
On 10/16, Paul E. McKenney wrote:

 On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote:
  
   I believe that you need smp_mb() here.
 
  I don't understand why...
 
   The wake_up_all()'s memory barriers
   do not suffice because some other reader might have awakened the writer
   between this_cpu_dec() and wake_up_all().
 
  But __wake_up(q) takes q-lock? And the same lock is taken by
  prepare_to_wait(), so how can the writer miss the result of _dec?

 Suppose that the writer arrives and sees that the value of the counter
 is zero,

after synchronize_sched(). So there are no readers (but perhaps there
are brw_end_read's in flight which already decremented read_ctr)

 and thus never sleeps, and so is also not awakened?

and why do we need wakeup in this case?

   void brw_end_read(struct brw_mutex *brw)
   {
 if (unlikely(atomic_read(brw-write_ctr))) {
 smp_mb();
 this_cpu_dec(*brw-read_ctr);
 wake_up_all(brw-write_waitq);
 
  Hmm... still can't understand.
 
  It seems that this mb() is needed to ensure that brw_end_read() can't
  miss write_ctr != 0.
 
  But we do not care unless the writer already does wait_event(). And
  before it does wait_event() it calls synchronize_sched() after it sets
  write_ctr != 0. Doesn't this mean that after that any preempt-disabled
  section must see write_ctr != 0 ?
 
  This code actually checks write_ctr after preempt_disable + enable,
  but I think this doesn't matter?
 
  Paul, most probably I misunderstood you. Could you spell please?

 Let me try outlining the sequence of events that I am worried about...

 1.Task A invokes brw_start_read().  There is no writer, so it
   takes the fastpath.

 2.Task B invokes brw_start_write(), atomically increments
   brw-write_ctr, and executes synchronize_sched().

 3.Task A invokes brw_end_read() and does this_cpu_dec().

OK. And to simplify this discussion, suppose that A invoked
brw_start_read() on CPU_0 and thus incremented read_ctr[0], and
then it migrates to CPU_1 and brw_end_read() uses read_ctr[1].

My understanding was, brw_start_write() must see read_ctr[0] == 1
after synchronize_sched().

 4.Task B invokes wait_event(), which invokes brw_read_ctr()
   and sees the result as zero.

So my understanding is completely wrong? I thought that after
synchronize_sched() we should see the result of any operation
which were done inside the preempt-disable section.

No?

Hmm. Suppose that we have long A = B = STOP = 0, and

void func(void)
{
preempt_disable();
if (!STOP) {
A = 1;
B = 1;
}
preempt_enable();
}

Now, you are saying that this code

STOP = 1;

synchronize_sched();

BUG_ON(A != B);

is not correct? (yes, yes, this example is not very good).

The comment above synchronize_sched() says:

return ... after all currently executing
rcu-sched read-side critical sections have completed.

But if this code is wrong, then what completed actually means?
I thought that it also means all memory operations have completed,
but this is not true?

Oleg.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-17 Thread Oleg Nesterov
On 10/16, Linus Torvalds wrote:

 On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov o...@redhat.com wrote:
  This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
  it allows multiple writers too, just read and write are mutually
  exclusive.

 So those semantics just don't sound sane. It's also not what any kind
 of normal rw lock ever does.

Yes, this is not usual.

And initially I made brw_sem which allows only 1 writer, but then
I changed this patch.

 So can you explain why these particular insane semantics are useful,
 and what for?

To allow multiple uprobe_register/unregister at the same time. Mostly
to not add the regression, currently this is possible.

It is not that I think this is terribly important, but still. And
personally I think that multiple writers is not necessarily insane
in general. Suppose you have the complex object/subsystem, the readers
can use a single brw_mutex to access it lockless, start_read() is
very cheap.

But start_write() is slow. Multiple writes can use the fine-grained
inside the start_write/end_write section and do not block each other.

Oleg.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-17 Thread Paul E. McKenney
On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote:
 On 10/16, Paul E. McKenney wrote:
 
  On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote:
   
I believe that you need smp_mb() here.
  
   I don't understand why...
  
The wake_up_all()'s memory barriers
do not suffice because some other reader might have awakened the writer
between this_cpu_dec() and wake_up_all().
  
   But __wake_up(q) takes q-lock? And the same lock is taken by
   prepare_to_wait(), so how can the writer miss the result of _dec?
 
  Suppose that the writer arrives and sees that the value of the counter
  is zero,
 
 after synchronize_sched(). So there are no readers (but perhaps there
 are brw_end_read's in flight which already decremented read_ctr)

But the preempt_disable() region only covers read acquisition.  So
synchronize_sched() waits only for all the brw_start_read() calls to
reach the preempt_enable() -- it cannot wait for all the resulting
readers to reach the corresponding brw_end_read().

  and thus never sleeps, and so is also not awakened?
 
 and why do we need wakeup in this case?

To get the memory barriers required to keep the critical sections
ordered -- to ensure that everyone sees the reader's critical section
as ending before the writer's critical section starts.

void brw_end_read(struct brw_mutex *brw)
{
if (unlikely(atomic_read(brw-write_ctr))) {
smp_mb();
this_cpu_dec(*brw-read_ctr);
wake_up_all(brw-write_waitq);
  
   Hmm... still can't understand.
  
   It seems that this mb() is needed to ensure that brw_end_read() can't
   miss write_ctr != 0.
  
   But we do not care unless the writer already does wait_event(). And
   before it does wait_event() it calls synchronize_sched() after it sets
   write_ctr != 0. Doesn't this mean that after that any preempt-disabled
   section must see write_ctr != 0 ?
  
   This code actually checks write_ctr after preempt_disable + enable,
   but I think this doesn't matter?
  
   Paul, most probably I misunderstood you. Could you spell please?
 
  Let me try outlining the sequence of events that I am worried about...
 
  1.  Task A invokes brw_start_read().  There is no writer, so it
  takes the fastpath.
 
  2.  Task B invokes brw_start_write(), atomically increments
  brw-write_ctr, and executes synchronize_sched().
 
  3.  Task A invokes brw_end_read() and does this_cpu_dec().
 
 OK. And to simplify this discussion, suppose that A invoked
 brw_start_read() on CPU_0 and thus incremented read_ctr[0], and
 then it migrates to CPU_1 and brw_end_read() uses read_ctr[1].
 
 My understanding was, brw_start_write() must see read_ctr[0] == 1
 after synchronize_sched().

Yep.  But it makes absolutely no guarantee about ordering of the
decrement of read_ctr[1].

  4.  Task B invokes wait_event(), which invokes brw_read_ctr()
  and sees the result as zero.
 
 So my understanding is completely wrong? I thought that after
 synchronize_sched() we should see the result of any operation
 which were done inside the preempt-disable section.

We should indeed.  But the decrement of read_ctr[1] is not done within
the preempt_disable() section, and the guarantee therefore does not
apply to it.  This means that there is no guarantee that Task A's
read-side critical section will be ordered before Task B's read-side
critical section.

Now, maybe you don't need that guarantee, but if you don't, I am missing
what exactly these primitives are doing for you.

 No?
 
 Hmm. Suppose that we have long A = B = STOP = 0, and
 
   void func(void)
   {
   preempt_disable();
   if (!STOP) {
   A = 1;
   B = 1;
   }
   preempt_enable();
   }
 
 Now, you are saying that this code
 
   STOP = 1;
 
   synchronize_sched();
 
   BUG_ON(A != B);
 
 is not correct? (yes, yes, this example is not very good).

Yep.  Assuming no other modifications to A and B, at the point of
the BUG_ON(), we should have A==1 and B==1.

The thing is that the preempt_disable() in your patch only covers
brw_start_read(), but not brw_end_read().  So the decrement (along with
the rest of the read-side critical section) is unordered with respect
to the write-side critical section started by the brw_start_write().

 The comment above synchronize_sched() says:
 
   return ... after all currently executing
   rcu-sched read-side critical sections have completed.
 
 But if this code is wrong, then what completed actually means?
 I thought that it also means all memory operations have completed,
 but this is not true?

From what I can see, your interpretation of synchronize_sched() is
correct.

The problem is that brw_end_read() isn't within the relevant
rcu-sched read-side critical section.

Or that I am confused

Thanx, Paul

 Oleg.
 
 --

Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-17 Thread Paul E. McKenney
On Wed, Oct 17, 2012 at 06:59:02PM +0200, Oleg Nesterov wrote:
 On 10/16, Linus Torvalds wrote:
 
  On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov o...@redhat.com wrote:
   This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
   it allows multiple writers too, just read and write are mutually
   exclusive.
 
  So those semantics just don't sound sane. It's also not what any kind
  of normal rw lock ever does.
 
 Yes, this is not usual.
 
 And initially I made brw_sem which allows only 1 writer, but then
 I changed this patch.
 
  So can you explain why these particular insane semantics are useful,
  and what for?
 
 To allow multiple uprobe_register/unregister at the same time. Mostly
 to not add the regression, currently this is possible.
 
 It is not that I think this is terribly important, but still. And
 personally I think that multiple writers is not necessarily insane
 in general. Suppose you have the complex object/subsystem, the readers
 can use a single brw_mutex to access it lockless, start_read() is
 very cheap.
 
 But start_write() is slow. Multiple writes can use the fine-grained
 inside the start_write/end_write section and do not block each other.

Strangely enough, the old VAXCluster locking primitives allowed this
sort of thing.  The brw_start_read() would be a protected read, and
brw_start_write() would be a concurrent write.  Even more interesting,
they gave the same advice you give -- concurrent writes should use
fine-grained locking to protect the actual accesses.

It seems like it should be possible to come up with better names,
but I cannot think of any at the moment.

Thanx, Paul

PS.  For the sufficiently masochistic, here is the exclusion table
 for the six VAXCluster locking modes:

NL  CR  CW  PR  PW  EX
NL
CR  X
CW  X   X   X
PR  X   X   X
PW  X   X   X   X
EX  X   X   X   X   X

X means that the pair of modes exclude each other,
otherwise the lock may be held in both of the modes
simultaneously.


Modes:

NL: Null, or not held.
CR: Concurrent read.
CW: Concurrent write.
PR: Protected read.
PW: Protected write.
EX: Exclusive.


A reader-writer lock could use protected read for readers and either of
protected write or exclusive for writers, the difference between protected
write and exclusive being irrelevant in the absence of concurrent readers.
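
For concreteness, the exclusion table above encoded as a lookup (a
sketch only; dlm_mode, excludes and modes_conflict are made-up names):

enum dlm_mode { NL, CR, CW, PR, PW, EX };

static const unsigned char excludes[6][6] = {
        /*        NL CR CW PR PW EX */
        /* NL */ { 0, 0, 0, 0, 0, 0 },
        /* CR */ { 0, 0, 0, 0, 0, 1 },
        /* CW */ { 0, 0, 0, 1, 1, 1 },
        /* PR */ { 0, 0, 1, 0, 1, 1 },
        /* PW */ { 0, 0, 1, 1, 1, 1 },
        /* EX */ { 0, 1, 1, 1, 1, 1 },
};

static int modes_conflict(enum dlm_mode a, enum dlm_mode b)
{
        return excludes[a][b];
}

With the mapping suggested earlier (brw readers take PR, brw writers
take CW), modes_conflict(PR, CW) is nonzero while modes_conflict(PR, PR)
and modes_conflict(CW, CW) are zero, which is exactly brw_mutex's
semantics: read and write mutually exclusive, multiple writers allowed.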

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-16 Thread Linus Torvalds
On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov  wrote:
> This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> it allows multiple writers too, just "read" and "write" are mutually
> exclusive.

So those semantics just don't sound sane. It's also not what any kind
of normal "rw" lock ever does.

So can you explain why these particular insane semantics are useful,
and what for?

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-16 Thread Paul E. McKenney
On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote:
> Paul, thanks for looking!
> 
> On 10/15, Paul E. McKenney wrote:
> >
> > > +void brw_start_read(struct brw_mutex *brw)
> > > +{
> > > + for (;;) {
> > > + bool done = false;
> > > +
> > > + preempt_disable();
> > > + if (likely(!atomic_read(&brw->write_ctr))) {
> > > + __this_cpu_inc(*brw->read_ctr);
> > > + done = true;
> > > + }
> >
> > brw_start_read() is not recursive -- attempting to call it recursively
> > can result in deadlock if a writer has shown up in the meantime.
> 
> Yes, yes, it is not recursive. Like rw_semaphore.
> 
> > Which is often OK, but not sure what you intended.
> 
> I forgot to document this in the changelog.

Hey, I had to ask.  ;-)

> > > +void brw_end_read(struct brw_mutex *brw)
> > > +{
> >
> > I believe that you need smp_mb() here.
> 
> I don't understand why...
> 
> > The wake_up_all()'s memory barriers
> > do not suffice because some other reader might have awakened the writer
> > between this_cpu_dec() and wake_up_all().
> 
> But __wake_up(q) takes q->lock? And the same lock is taken by
> prepare_to_wait(), so how can the writer miss the result of _dec?

Suppose that the writer arrives and sees that the value of the counter
is zero, and thus never sleeps, and so is also not awakened?  Unless I
am missing something, there are no memory barriers in that case.

Which means that you also need an smp_mb() after the wait_event()
in the writer, now that I think on it.

> > > + this_cpu_dec(*brw->read_ctr);
> > > +
> > > + if (unlikely(atomic_read(&brw->write_ctr)))
> > > + wake_up_all(&brw->write_waitq);
> > > +}
> >
> > Of course, it would be good to avoid smp_mb on the fast path.  Here is
> > one way to avoid it:
> >
> > void brw_end_read(struct brw_mutex *brw)
> > {
> > if (unlikely(atomic_read(&brw->write_ctr))) {
> > smp_mb();
> > this_cpu_dec(*brw->read_ctr);
> > wake_up_all(&brw->write_waitq);
> 
> Hmm... still can't understand.
> 
> It seems that this mb() is needed to ensure that brw_end_read() can't
> miss write_ctr != 0.
> 
> But we do not care unless the writer already does wait_event(). And
> before it does wait_event() it calls synchronize_sched() after it sets
> write_ctr != 0. Doesn't this mean that after that any preempt-disabled
> section must see write_ctr != 0 ?
> 
> This code actually checks write_ctr after preempt_disable + enable,
> but I think this doesn't matter?
> 
> Paul, most probably I misunderstood you. Could you spell please?

Let me try outlining the sequence of events that I am worried about...

1.  Task A invokes brw_start_read().  There is no writer, so it
takes the fastpath.

2.  Task B invokes brw_start_write(), atomically increments
brw->write_ctr, and executes synchronize_sched().

3.  Task A invokes brw_end_read() and does this_cpu_dec().

4.  Task B invokes wait_event(), which invokes brw_read_ctr()
and sees the result as zero.  Therefore, Task B does
not sleep, does not acquire locks, and does not execute
any memory barriers.  As a result, ordering is not
guaranteed between Task A's read-side critical section
and Task B's upcoming write-side critical section.

So I believe that you need smp_mb() in both brw_end_read() and
brw_start_write().

Sigh...  It is quite possible that you also need an smp_mb() in
brw_start_read(), but let's start with just the scenario above.

So, does the above scenario show a problem, or am I confused?
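
For concreteness, a sketch of the writer-side fix being argued for here
(illustrative only, not the posted patch; it pairs with the smp_mb()
suggested above for brw_end_read()):

void brw_start_write(struct brw_mutex *brw)
{
        atomic_inc(&brw->write_ctr);
        synchronize_sched();
        wait_event(brw->write_waitq, brw_read_ctr(brw) == 0);
        /*
         * Order the counter check above against the write-side
         * critical section, even when wait_event() never sleeps
         * (step 4 of the scenario above).
         */
        smp_mb();
}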

> > > +void brw_start_write(struct brw_mutex *brw)
> > > +{
> > > + atomic_inc(&brw->write_ctr);
> > > + synchronize_sched();
> > > + /*
> > > +  * Thereafter brw_*_read() must see write_ctr != 0,
> > > +  * and we should see the result of __this_cpu_inc().
> > > +  */
> > > + wait_event(brw->write_waitq, brw_read_ctr(brw) == 0);
> >
> > This looks like it allows multiple writers to proceed concurrently.
> > They both increment, do a synchronize_sched(), do the wait_event(),
> > and then are both awakened by the last reader.
> 
> Yes. From the changelog:
> 
>   Unlike rw_semaphore it allows multiple writers too,
>   just "read" and "write" are mutually exclusive.

OK, color me blind!  ;-)

> > Was that the intent?  (The implementation of brw_end_write() makes
> > it look like it is in fact the intent.)
> 
> Please look at 2/2.
> 
> Multiple uprobe_register() or uprobe_unregister() can run at the
> same time to install/remove the system-wide breakpoint, and
> brw_start_write() is used to block dup_mmap() to avoid the race.
> But they do not block each other.

Ah, makes sense, thank you!

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-16 Thread Oleg Nesterov
Paul, thanks for looking!

On 10/15, Paul E. McKenney wrote:
>
> > +void brw_start_read(struct brw_mutex *brw)
> > +{
> > +   for (;;) {
> > +   bool done = false;
> > +
> > +   preempt_disable();
> > +   if (likely(!atomic_read(&brw->write_ctr))) {
> > +   __this_cpu_inc(*brw->read_ctr);
> > +   done = true;
> > +   }
>
> brw_start_read() is not recursive -- attempting to call it recursively
> can result in deadlock if a writer has shown up in the meantime.

Yes, yes, it is not recursive. Like rw_semaphore.

> Which is often OK, but not sure what you intended.

I forgot to document this in the changelog.

> > +void brw_end_read(struct brw_mutex *brw)
> > +{
>
> I believe that you need smp_mb() here.

I don't understand why...

> The wake_up_all()'s memory barriers
> do not suffice because some other reader might have awakened the writer
> between this_cpu_dec() and wake_up_all().

But __wake_up(q) takes q->lock? And the same lock is taken by
prepare_to_wait(), so how can the writer miss the result of _dec?

> > +   this_cpu_dec(*brw->read_ctr);
> > +
> > +   if (unlikely(atomic_read(&brw->write_ctr)))
> > +   wake_up_all(&brw->write_waitq);
> > +}
>
> Of course, it would be good to avoid smp_mb on the fast path.  Here is
> one way to avoid it:
>
> void brw_end_read(struct brw_mutex *brw)
> {
> >   if (unlikely(atomic_read(&brw->write_ctr))) {
> >   smp_mb();
> >   this_cpu_dec(*brw->read_ctr);
> >   wake_up_all(&brw->write_waitq);

Hmm... still can't understand.

It seems that this mb() is needed to ensure that brw_end_read() can't
miss write_ctr != 0.

But we do not care unless the writer already does wait_event(). And
before it does wait_event() it calls synchronize_sched() after it sets
write_ctr != 0. Doesn't this mean that after that any preempt-disabled
section must see write_ctr != 0 ?

This code actually checks write_ctr after preempt_disable + enable,
but I think this doesn't matter?

Paul, most probably I misunderstood you. Could you spell please?

> > +void brw_start_write(struct brw_mutex *brw)
> > +{
> > +   atomic_inc(&brw->write_ctr);
> > +   synchronize_sched();
> > +   /*
> > +* Thereafter brw_*_read() must see write_ctr != 0,
> > +* and we should see the result of __this_cpu_inc().
> > +*/
> > +   wait_event(brw->write_waitq, brw_read_ctr(brw) == 0);
>
> This looks like it allows multiple writers to proceed concurrently.
> They both increment, do a synchronize_sched(), do the wait_event(),
> and then are both awakened by the last reader.

Yes. From the changelog:

Unlike rw_semaphore it allows multiple writers too,
just "read" and "write" are mutually exclusive.

> Was that the intent?  (The implementation of brw_end_write() makes
> it look like it is in fact the intent.)

Please look at 2/2.

Multiple uprobe_register() or uprobe_unregister() can run at the
same time to install/remove the system-wide breakpoint, and
brw_start_write() is used to block dup_mmap() to avoid the race.
But they do not block each other.
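
Schematically, the usage described for 2/2 looks like the following
(a simplified sketch with hypothetical names; the real patch differs):

static struct brw_mutex dup_mmap_mutex;

void uprobe_register_sketch(void)
{
        brw_start_write(&dup_mmap_mutex);  /* blocks dup_mmap() only;
                                              other registers proceed */
        /* ... install the system-wide breakpoint ... */
        brw_end_write(&dup_mmap_mutex);
}

int dup_mmap_sketch(void)
{
        brw_start_read(&dup_mmap_mutex);   /* cheap per-cpu increment */
        /* ... copy the mm ... */
        brw_end_read(&dup_mmap_mutex);
        return 0;
}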

Thanks!

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-15 Thread Paul E. McKenney
On Mon, Oct 15, 2012 at 09:10:18PM +0200, Oleg Nesterov wrote:
> This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> it allows multiple writers too, just "read" and "write" are mutually
> exclusive.
> 
> brw_start_read() and brw_end_read() are extremely cheap, they only do
> this_cpu_inc(read_ctr) + atomic_read() if there are no waiting writers.
> 
> OTOH it is write-biased, any brw_start_write() blocks the new readers.
> But "write" is slow, it does synchronize_sched() to serialize with
> preempt_disable() in brw_start_read(), and wait_event(write_waitq) can
> have a lot of extra wakeups before percpu-counter-sum becomes zero.

A few questions and comments below, as always.

Thanx, Paul

> Signed-off-by: Oleg Nesterov o...@redhat.com
> ---
>  include/linux/brw_mutex.h |   22 +++
>  lib/Makefile  |2 +-
>  lib/brw_mutex.c   |   67 +
>  3 files changed, 90 insertions(+), 1 deletions(-)
>  create mode 100644 include/linux/brw_mutex.h
>  create mode 100644 lib/brw_mutex.c
> 
> diff --git a/include/linux/brw_mutex.h b/include/linux/brw_mutex.h
> new file mode 100644
> index 000..16b8d5f
> --- /dev/null
> +++ b/include/linux/brw_mutex.h
> @@ -0,0 +1,22 @@
> +#ifndef _LINUX_BRW_MUTEX_H
> +#define _LINUX_BRW_MUTEX_H
> +
> +#include <linux/percpu.h>
> +#include <linux/wait.h>
> +
> +struct brw_mutex {
> + long __percpu   *read_ctr;
> + atomic_twrite_ctr;
> + wait_queue_head_t   read_waitq;
> + wait_queue_head_t   write_waitq;
> +};
> +
> +extern int brw_mutex_init(struct brw_mutex *brw);
> +
> +extern void brw_start_read(struct brw_mutex *brw);
> +extern void brw_end_read(struct brw_mutex *brw);
> +
> +extern void brw_start_write(struct brw_mutex *brw);
> +extern void brw_end_write(struct brw_mutex *brw);
> +
> +#endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 3128e35..18f2876 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>idr.o int_sqrt.o extable.o \
>sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> -  is_single_threaded.o plist.o decompress.o
> +  is_single_threaded.o plist.o decompress.o brw_mutex.o
> 
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/brw_mutex.c b/lib/brw_mutex.c
> new file mode 100644
> index 000..41984a6
> --- /dev/null
> +++ b/lib/brw_mutex.c
> @@ -0,0 +1,67 @@
> +#include <linux/brw_mutex.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>
> +
> +int brw_mutex_init(struct brw_mutex *brw)
> +{
> + atomic_set(&brw->write_ctr, 0);
> + init_waitqueue_head(&brw->read_waitq);
> + init_waitqueue_head(&brw->write_waitq);
> + brw->read_ctr = alloc_percpu(long);
> + return brw->read_ctr ? 0 : -ENOMEM;
> +}
> +
> +void brw_start_read(struct brw_mutex *brw)
> +{
> + for (;;) {
> + bool done = false;
> +
> + preempt_disable();
> + if (likely(!atomic_read(&brw->write_ctr))) {
> + __this_cpu_inc(*brw->read_ctr);
> + done = true;
> + }

brw_start_read() is not recursive -- attempting to call it recursively
can result in deadlock if a writer has shown up in the meantime.

Which is often OK, but not sure what you intended.
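
The deadlock, sketched with a hypothetical caller: a nested
brw_start_read() taken after a writer has bumped write_ctr spins in
__wait_event(), while the writer waits for the never-released outer
read to drop the per-cpu sum to zero, so neither side advances.

void nested_reader(struct brw_mutex *brw)
{
        brw_start_read(brw);    /* outer read taken, read_ctr sum >= 1 */
        /* ... a writer runs brw_start_write() on another CPU ... */
        brw_start_read(brw);    /* sees write_ctr != 0, sleeps until
                                   the writer finishes: never happens */
        brw_end_read(brw);
        brw_end_read(brw);
}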

> + preempt_enable();
> +
> + if (likely(done))
> + break;
> +
> + __wait_event(brw->read_waitq, !atomic_read(&brw->write_ctr));
> + }
> +}
> +
> +void brw_end_read(struct brw_mutex *brw)
> +{

I believe that you need smp_mb() here.  The wake_up_all()'s memory barriers
do not suffice because some other reader might have awakened the writer
between this_cpu_dec() and wake_up_all().  IIRC, this smp_mb() is also
needed if the timing is such that the writer does not actually block.

> + this_cpu_dec(*brw->read_ctr);
> +
> + if (unlikely(atomic_read(&brw->write_ctr)))
> + wake_up_all(&brw->write_waitq);
> +}

Of course, it would be good to avoid smp_mb on the fast path.  Here is
one way to avoid it:

void brw_end_read(struct brw_mutex *brw)
{
if (unlikely(atomic_read(&brw->write_ctr))) {
smp_mb();
this_cpu_dec(*brw->read_ctr);
wake_up_all(&brw->write_waitq);
} else {
this_cpu_dec(*brw->read_ctr);
}
}

> +static inline long brw_read_ctr(struct brw_mutex *brw)
> +{
> + long sum = 0;
> + int cpu;
> +
> + for_each_possible_cpu(cpu)
> + sum += per_cpu(*brw->read_ctr, cpu);
> +
> + return sum;
> +}
> +
> +void brw_start_write(struct brw_mutex *brw)
> +{
> + atomic_inc(&brw->write_ctr);
> + synchronize_sched();
> + /*
> +  * Thereafter brw_*_read() must see write_ctr != 0,
> +  * and we should see the result of __this_cpu_inc().
> +  */
> + wait_event(brw->write_waitq, brw_read_ctr(brw) == 0);

This looks like it allows multiple writers to proceed concurrently.
They both increment, do a synchronize_sched(), do the wait_event(),
and then are both awakened by the last reader.

Was that the intent?  (The implementation of brw_end_write() makes
it look like it is in fact the intent.)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] brw_mutex: big read-write mutex

2012-10-15 Thread Oleg Nesterov
This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
it allows multiple writers too, just "read" and "write" are mutually
exclusive.

brw_start_read() and brw_end_read() are extremely cheap, they only do
this_cpu_inc(read_ctr) + atomic_read() if there are no waiting writers.

OTOH it is write-biased, any brw_start_write() blocks the new readers.
But "write" is slow, it does synchronize_sched() to serialize with
preempt_disable() in brw_start_read(), and wait_event(write_waitq) can
have a lot of extra wakeups before percpu-counter-sum becomes zero.

Signed-off-by: Oleg Nesterov o...@redhat.com
---
 include/linux/brw_mutex.h |   22 +++
 lib/Makefile  |2 +-
 lib/brw_mutex.c   |   67 +
 3 files changed, 90 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/brw_mutex.h
 create mode 100644 lib/brw_mutex.c

diff --git a/include/linux/brw_mutex.h b/include/linux/brw_mutex.h
new file mode 100644
index 000..16b8d5f
--- /dev/null
+++ b/include/linux/brw_mutex.h
@@ -0,0 +1,22 @@
+#ifndef _LINUX_BRW_MUTEX_H
+#define _LINUX_BRW_MUTEX_H
+
+#include <linux/percpu.h>
+#include <linux/wait.h>
+
+struct brw_mutex {
+   long __percpu   *read_ctr;
+   atomic_twrite_ctr;
+   wait_queue_head_t   read_waitq;
+   wait_queue_head_t   write_waitq;
+};
+
+extern int brw_mutex_init(struct brw_mutex *brw);
+
+extern void brw_start_read(struct brw_mutex *brw);
+extern void brw_end_read(struct brw_mutex *brw);
+
+extern void brw_start_write(struct brw_mutex *brw);
+extern void brw_end_write(struct brw_mutex *brw);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 3128e35..18f2876 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 idr.o int_sqrt.o extable.o \
 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-is_single_threaded.o plist.o decompress.o
+is_single_threaded.o plist.o decompress.o brw_mutex.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/brw_mutex.c b/lib/brw_mutex.c
new file mode 100644
index 000..41984a6
--- /dev/null
+++ b/lib/brw_mutex.c
@@ -0,0 +1,67 @@
+#include <linux/brw_mutex.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int brw_mutex_init(struct brw_mutex *brw)
+{
+   atomic_set(&brw->write_ctr, 0);
+   init_waitqueue_head(&brw->read_waitq);
+   init_waitqueue_head(&brw->write_waitq);
+   brw->read_ctr = alloc_percpu(long);
+   return brw->read_ctr ? 0 : -ENOMEM;
+}
+
+void brw_start_read(struct brw_mutex *brw)
+{
+   for (;;) {
+   bool done = false;
+
+   preempt_disable();
+   if (likely(!atomic_read(&brw->write_ctr))) {
+   __this_cpu_inc(*brw->read_ctr);
+   done = true;
+   }
+   preempt_enable();
+
+   if (likely(done))
+   break;
+
+   __wait_event(brw->read_waitq, !atomic_read(&brw->write_ctr));
+   }
+}
+
+void brw_end_read(struct brw_mutex *brw)
+{
+   this_cpu_dec(*brw->read_ctr);
+
+   if (unlikely(atomic_read(&brw->write_ctr)))
+   wake_up_all(&brw->write_waitq);
+}
+
+static inline long brw_read_ctr(struct brw_mutex *brw)
+{
+   long sum = 0;
+   int cpu;
+
+   for_each_possible_cpu(cpu)
+   sum += per_cpu(*brw->read_ctr, cpu);
+
+   return sum;
+}
+
+void brw_start_write(struct brw_mutex *brw)
+{
+   atomic_inc(&brw->write_ctr);
+   synchronize_sched();
+   /*
+* Thereafter brw_*_read() must see write_ctr != 0,
+* and we should see the result of __this_cpu_inc().
+*/
+   wait_event(brw->write_waitq, brw_read_ctr(brw) == 0);
+}
+
+void brw_end_write(struct brw_mutex *brw)
+{
+   if (atomic_dec_and_test(&brw->write_ctr))
+   wake_up_all(&brw->read_waitq);
+}
-- 
1.5.5.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

