Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()

2018-03-15 Thread Peter Zijlstra
On Thu, Mar 15, 2018 at 02:45:20PM +, David Howells wrote:
> Peter Zijlstra  wrote:
> 
> > Does the below address things sufficiently clearly?
> 
> Yep.

Thanks!

> > +wait_queue_head_t *__var_waitqueue(void *p)
> > +{
> > +   if (BITS_PER_LONG == 64) {
> > +   unsigned long q = (unsigned long)p;
> > +
> > +   return bit_waitqueue((void *)(q & ~1), q & 1);
> > +   }
> > +   return bit_waitqueue(p, 0);
> > +}
> 
> You might be better off not using bit_waitqueue() but rather doing the
> calculation directly, since you don't actually have a bit number.

Yes, I did that in patch 11. The initial version uses the exact same
stuff that wait_on_atomic_t() uses, to avoid spurious changes.
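
For reference, doing the calculation directly could look roughly like the
sketch below. It assumes the bit_wait_table[] array and WAIT_TABLE_BITS
constant that kernel/sched/wait_bit.c already uses for its hashed wait
queues; this is only an illustration, not necessarily what patch 11 does:

#include <linux/hash.h>
#include <linux/wait.h>

/*
 * Sketch only: hash the variable's address straight into the shared
 * wait table instead of going through bit_waitqueue()'s (word, bit)
 * interface.
 */
wait_queue_head_t *__var_waitqueue(void *p)
{
        /* hash_ptr() folds the pointer down to WAIT_TABLE_BITS bits. */
        return bit_wait_table + hash_ptr(p, WAIT_TABLE_BITS);
}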


Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()

2018-03-15 Thread David Howells
Peter Zijlstra  wrote:

> Does the below address things sufficiently clearly?

Yep.

> +wait_queue_head_t *__var_waitqueue(void *p)
> +{
> + if (BITS_PER_LONG == 64) {
> + unsigned long q = (unsigned long)p;
> +
> + return bit_waitqueue((void *)(q & ~1), q & 1);
> + }
> + return bit_waitqueue(p, 0);
> +}

You might be better off not using bit_waitqueue() but rather doing the
calculation directly, since you don't actually have a bit number.

David


Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()

2018-03-15 Thread Peter Zijlstra
On Thu, Mar 15, 2018 at 09:58:42AM +, David Howells wrote:
> Peter Zijlstra  wrote:
> 
> > > > Argh, no no no.. That whole wait_for_atomic_t thing is a giant
> > > > trainwreck already and now you're making it worse still.
> 
> Your patch description needs to say why this isn't a trainwreck when you
> consider wait_for_atomic_t() to be one since it does things in a very similar
> way.

Does the below address things sufficiently clearly?

---

Subject: sched/wait: Introduce wait_var_event()
From: Peter Zijlstra 
Date: Thu Mar 15 11:40:33 CET 2018

As a replacement for the wait_on_atomic_t() API, provide the
wait_var_event() API.

The wait_var_event() API is based on the very same hashed-waitqueue
idea, but doesn't care about the type (atomic_t) or the specific
condition (atomic_read() == 0). IOW, it's much more widely
applicable/flexible.

It shares all the benefits/disadvantages of a hashed-waitqueue
approach with the existing wait_on_atomic_t/wait_on_bit() APIs.

The API is modeled after the existing wait_event() API, but instead of
taking a wait_queue_head, it takes an address. This address is then
hashed to obtain a wait_queue_head from the bit_wait_table.

Similar to the wait_event() API, it takes a condition expression as its
second argument and will wait until this expression becomes true.

The following are (mostly) identical replacements:

wait_on_atomic_t(&my_atomic, atomic_t_wait, TASK_UNINTERRUPTIBLE);
wake_up_atomic_t(&my_atomic);

wait_var_event(&my_atomic, !atomic_read(&my_atomic));
wake_up_var(&my_atomic);

The only difference is that wake_up_var() is an unconditional wakeup
and doesn't check the previously hard-coded (atomic_read() == 0)
condition here. This is of little consequence, since most callers are
already conditional on atomic_dec_and_test(), and the ones that are
not are trivial to make so.
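
As an illustration of that conversion, a caller that already drops its
references with atomic_dec_and_test() could end up looking like the
sketch below (the struct and helper names are made up for the example):

#include <linux/atomic.h>
#include <linux/wait_bit.h>

/* Hypothetical object with an in-flight I/O counter. */
struct foo {
        atomic_t inflight;
};

/* Waker side: only wake when the count actually hits zero, mirroring
 * what the old wake_up_atomic_t() path relied on. */
static void foo_put_io(struct foo *f)
{
        if (atomic_dec_and_test(&f->inflight))
                wake_up_var(&f->inflight);
}

/* Waiter side: direct replacement for
 * wait_on_atomic_t(&f->inflight, atomic_t_wait, TASK_UNINTERRUPTIBLE). */
static void foo_wait_idle(struct foo *f)
{
        wait_var_event(&f->inflight, !atomic_read(&f->inflight));
}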

Cc: David Howells 
Tested-by: Dan Williams 
Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/wait_bit.h |   70 +++
 kernel/sched/wait_bit.c  |   48 
 2 files changed, 118 insertions(+)

--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -262,4 +262,74 @@ int wait_on_atomic_t(atomic_t *val, wait
return out_of_line_wait_on_atomic_t(val, action, mode);
 }
 
+extern void init_wait_var_entry(struct wait_bit_queue_entry *wbq_entry, void *var, int flags);
+extern void wake_up_var(void *var);
+extern wait_queue_head_t *__var_waitqueue(void *p);
+
+#define ___wait_var_event(var, condition, state, exclusive, ret, cmd)  \
+({ \
+   __label__ __out;\
+   struct wait_queue_head *__wq_head = __var_waitqueue(var);   \
+   struct wait_bit_queue_entry __wbq_entry;\
+   long __ret = ret; /* explicit shadow */ \
+   \
+   init_wait_var_entry(&__wbq_entry, var,  \
+   exclusive ? WQ_FLAG_EXCLUSIVE : 0); \
+   for (;;) {  \
+   long __int = prepare_to_wait_event(__wq_head,   \
+  &__wbq_entry.wq_entry, \
+  state);  \
+   if (condition)  \
+   break;  \
+   \
+   if (___wait_is_interruptible(state) && __int) { \
+   __ret = __int;  \
+   goto __out; \
+   }   \
+   \
+   cmd;\
+   }   \
+   finish_wait(__wq_head, &__wbq_entry.wq_entry);  \
+__out: __ret;  \
+})
+
+#define __wait_var_event(var, condition)   \
+   ___wait_var_event(var, condition, TASK_UNINTERRUPTIBLE, 0, 0,   \
+ schedule())
+
+#define wait_var_event(var, condition) \
+do {   \
+   might_sleep();  \
+   if (condition)  \
+   break;  \
+   __wait_var_event(var, condition);   \
+} while (0)

Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()

2018-03-15 Thread Peter Zijlstra
On Thu, Mar 15, 2018 at 09:58:42AM +, David Howells wrote:
> Peter Zijlstra  wrote:
> 
> > > > Argh, no no no.. That whole wait_for_atomic_t thing is a giant
> > > > trainwreck already and now you're making it worse still.
> 
> Your patch description needs to say why this isn't a trainwreck when you
> consider wait_for_atomic_t() to be one since it does things in a very similar
> way.

Yeah, still writing changelogs..



Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()

2018-03-15 Thread David Howells
Peter Zijlstra  wrote:

> > > Argh, no no no.. That whole wait_for_atomic_t thing is a giant
> > > trainwreck already and now you're making it worse still.

Your patch description needs to say why this isn't a trainwreck when you
consider wait_for_atomic_t() to be one since it does things in a very similar
way.

David


Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()

2018-03-14 Thread Dan Williams
On Tue, Mar 13, 2018 at 3:20 AM, Peter Zijlstra  wrote:
> On Sun, Mar 11, 2018 at 10:15:55AM -0700, Dan Williams wrote:
>> On Sun, Mar 11, 2018 at 4:27 AM, Peter Zijlstra  wrote:
>> > On Fri, Mar 09, 2018 at 10:55:32PM -0800, Dan Williams wrote:
>> >> Add a generic facility for awaiting an atomic_t to reach a value of 1.
>> >>
>> >> Page reference counts typically need to reach 0 to be considered a
>> >> free / inactive page. However, ZONE_DEVICE pages allocated via
>> >> devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
>> >> done at init time to assign pages to the page allocator is skipped.
>> >>
>> >> These pages will have their reference count elevated > 1 by
>> >> get_user_pages() when they are under DMA. In order to coordinate DMA to
>> >> these pages vs filesystem operations like hole-punch and truncate, the
>> >> filesystem-dax implementation needs to capture the DMA-idle event (i.e.
>> >> the 2 to 1 count transition).
>> >>
>> >> For now, this implementation does not have functional behavior change,
>> >> follow-on patches will add waiters for these page-idle events.
>> >
>> > Argh, no no no.. That whole wait_for_atomic_t thing is a giant
>> > trainwreck already and now you're making it worse still.
>> >
>> > Please have a look here:
>> >
>> >   
>> > https://lkml.kernel.org/r/20171101190644.chwhfpoz3ywxx...@hirez.programming.kicks-ass.net
>>
>> That thread seems to be worried about the object disappearing the
>> moment its reference count reaches a target. That isn't the case with
>> the memmap / struct page objects for ZONE_DEVICE pages. I understand
>> wait_for_atomic_one() is broken in the general case, but as far as I
>> can see it works fine specifically for ZONE_DEVICE page busy tracking,
>> just not generic object lifetime.
>
> How's this, compile tested (x86_64-allmodconfig) only.
>
> This allows you to write:
>
> wait_var_event(&my_atomic, atomic_read(&my_atomic) == 1);

This works for me, you can add

Tested-by: Dan Williams 

...to the upstream version.

Can we add this new API as an immutable commit in the tip/sched/core
tree, so I can base my fix on it? The wait_for_atomic_t removal can then
come in follow-on patches.
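
For context, the DMA-idle coordination Dan describes for ZONE_DEVICE
pages would sit on top of this API roughly as sketched below; the helper
names are hypothetical and not taken from his series:

#include <linux/mm.h>
#include <linux/wait_bit.h>

/* A ZONE_DEVICE page is considered idle once only its baseline
 * reference remains, i.e. after the 2 -> 1 refcount transition. */
static bool dax_page_idle(struct page *page)
{
        return page_ref_count(page) == 1;
}

/* Reference-drop path (waker): once a DMA user puts the page, notify
 * anyone waiting on its refcount. */
static void dax_put_dma_page(struct page *page)
{
        put_page(page);
        wake_up_var(&page->_refcount);
}

/* Truncate/hole-punch path (waiter): block until DMA has completed. */
static void dax_wait_page_idle(struct page *page)
{
        wait_var_event(&page->_refcount, dax_page_idle(page));
}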


Re: [RFC][PATCH] sched/wait_bit: Introduce wait_var_event()/wake_up_var()

2018-03-13 Thread Dan Williams
On Tue, Mar 13, 2018 at 3:20 AM, Peter Zijlstra  wrote:
> On Sun, Mar 11, 2018 at 10:15:55AM -0700, Dan Williams wrote:
>> On Sun, Mar 11, 2018 at 4:27 AM, Peter Zijlstra  wrote:
>> > On Fri, Mar 09, 2018 at 10:55:32PM -0800, Dan Williams wrote:
>> >> Add a generic facility for awaiting an atomic_t to reach a value of 1.
>> >>
>> >> Page reference counts typically need to reach 0 to be considered a
>> >> free / inactive page. However, ZONE_DEVICE pages allocated via
>> >> devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
>> >> done at init time to assign pages to the page allocator is skipped.
>> >>
>> >> These pages will have their reference count elevated > 1 by
>> >> get_user_pages() when they are under DMA. In order to coordinate DMA to
>> >> these pages vs filesystem operations like hole-punch and truncate, the
>> >> filesystem-dax implementation needs to capture the DMA-idle event (i.e.
>> >> the 2 to 1 count transition).
>> >>
>> >> For now, this implementation does not have functional behavior change,
>> >> follow-on patches will add waiters for these page-idle events.
>> >
>> > Argh, no no no.. That whole wait_for_atomic_t thing is a giant
>> > trainwreck already and now you're making it worse still.
>> >
>> > Please have a look here:
>> >
>> >   
>> > https://lkml.kernel.org/r/20171101190644.chwhfpoz3ywxx...@hirez.programming.kicks-ass.net
>>
>> That thread seems to be worried about the object disappearing the
>> moment its reference count reaches a target. That isn't the case with
>> the memmap / struct page objects for ZONE_DEVICE pages. I understand
>> wait_for_atomic_one() is broken in the general case, but as far as I
>> can see it works fine specifically for ZONE_DEVICE page busy tracking,
>> just not generic object lifetime.
>
> How's this, compile tested (x86_64-allmodconfig) only.
>
> This allows you to write:
>
> wait_var_event(&my_atomic, atomic_read(&my_atomic) == 1);

Nice!

I'll give this a shot. I will need to add
wait_var_event_interruptible(), but other than that it looks workable
to me.
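
For reference, an interruptible variant can be layered on the same
___wait_var_event() helper from the patch above; a sketch modeled on the
wait_event_interruptible() pattern, not necessarily the exact form that
ends up merged:

/* Sleep in TASK_INTERRUPTIBLE so a pending signal breaks the wait;
 * returns 0, or -ERESTARTSYS if interrupted, like
 * wait_event_interruptible(). */
#define __wait_var_event_interruptible(var, condition)                 \
        ___wait_var_event(var, condition, TASK_INTERRUPTIBLE, 0, 0,    \
                          schedule())

#define wait_var_event_interruptible(var, condition)                   \
({                                                                      \
        long __ret = 0;                                                 \
        might_sleep();                                                  \
        if (!(condition))                                               \
                __ret = __wait_var_event_interruptible(var, condition);\
        __ret;                                                          \
})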

