RE: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-08 Thread David Laight


From: Zebediah Figura
> Sent: 08 March 2021 16:18
> 
> On 3/3/21 6:42 PM, André Almeida wrote:
> >   ** The wait on multiple problem
> >
> >   The use case lies in the Wine implementation of the Windows NT interface
> >   WaitMultipleObjects. This Windows API function allows a thread to sleep
> >   waiting on the first of a set of event sources (mutexes, timers, signal,
> >   console input, etc) to signal.  Considering this is a primitive
> >   synchronization operation for Windows applications, being able to quickly
> >   signal events on the producer side, and quickly go to sleep on the
> >   consumer side is essential for good performance of those running over Wine.
> 
> It's probably worth pointing out, for better or for worse, while this is
> *a* use case, it's also limited to an out-of-tree patch set/forked
> versions of Wine. I'm currently working on a different approach that
> should be upstreamable to Wine proper, as detailed in [1].
> 
> [1]
> https://lore.kernel.org/lkml/f4cc1a38-1441-62f8-47e4-0c67f5ad1...@codeweavers.com/

* NtPulseEvent can't work right. We badly emulate it by setting and then
immediately resetting the event, but due to the above gap between poll()
and read(), most threads end up missing the wakeup anyway.

As you stated later, PulseEvent() is completely broken anyway.
At least one of the problems is that in order to complete an async io
(and all io is async) the final 'copy_to_user' must be done in the
context of the initiating thread.
So if the thread is in WaitMultipleObjects (it usually is) and an async io
completes (eg receive data on a TCP connection) the thread stops waiting
while the io completion callback is done.
If a PulseEvent happens during that window then it is lost.
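
A minimal sketch of how a set-then-reset pulse gets lost, assuming events
are modelled with eventfd() and checked with poll(). That model and the
demo_* names are assumptions made for illustration, not a description of
Wine's or Windows' actual implementation. A waiter that is not blocked at
the moment of the pulse finds nothing when it looks afterwards:

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical pulse emulation: set the event, then reset it immediately. */
static int demo_pulse(int efd)
{
    uint64_t v = 1;

    if (write(efd, &v, sizeof v) != (ssize_t)sizeof v)  /* set */
        return -1;
    if (read(efd, &v, sizeof v) != (ssize_t)sizeof v)   /* reset */
        return -1;
    return 0;
}

/* Non-blocking check: 1 if the event is signaled right now, else 0. */
static int demo_is_signaled(int efd)
{
    struct pollfd p = { .fd = efd, .events = POLLIN };

    return poll(&p, 1, 0) == 1 && (p.revents & POLLIN) != 0;
}

/*
 * The "waiter" here was busy while the pulse happened and only checks
 * afterwards. Returns 1 if the pulse was missed, 0 if it was seen,
 * -1 on error.
 */
int demo_pulse_is_lost(void)
{
    int efd = eventfd(0, EFD_NONBLOCK);
    int seen;

    if (efd < 0)
        return -1;
    if (demo_pulse(efd) < 0) {
        close(efd);
        return -1;
    }
    seen = demo_is_signaled(efd);
    close(efd);
    return !seen;
}
```

The sketch is single threaded on purpose: the loss does not depend on
scheduling, only on the waiter not being asleep at the instant of the pulse.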

Mind you there was (maybe still is) a bug in WMO on 64-bit Windows
that means the process completely misses io completion callbacks
if (I think) they happen while the process is being scheduled.
There is a loop in WMO that fails to recover because interrupts
are disabled, and a 30 second timer that unblocks things.
I had to add code to write to the ioapic to request the hardware
interrupt to unblock everything :-)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-08 Thread Zebediah Figura

On 3/3/21 6:42 PM, André Almeida wrote:

>   ** The wait on multiple problem
>
>   The use case lies in the Wine implementation of the Windows NT interface
>   WaitMultipleObjects. This Windows API function allows a thread to sleep
>   waiting on the first of a set of event sources (mutexes, timers, signal,
>   console input, etc) to signal.  Considering this is a primitive
>   synchronization operation for Windows applications, being able to quickly
>   signal events on the producer side, and quickly go to sleep on the
>   consumer side is essential for good performance of those running over Wine.


It's probably worth pointing out, for better or for worse, while this is 
*a* use case, it's also limited to an out-of-tree patch set/forked 
versions of Wine. I'm currently working on a different approach that 
should be upstreamable to Wine proper, as detailed in [1].


[1] 
https://lore.kernel.org/lkml/f4cc1a38-1441-62f8-47e4-0c67f5ad1...@codeweavers.com/


Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-08 Thread Stefan Metzmacher
On 08.03.21 at 12:11, David Laight wrote:
> From: Stefan Metzmacher
>> Sent: 07 March 2021 11:35
>>
>> Hi André,
>>>  ** The wait on multiple problem
>>>
>>>  The use case lies in the Wine implementation of the Windows NT interface
>>>  WaitMultipleObjects. This Windows API function allows a thread to sleep
>>>  waiting on the first of a set of event sources (mutexes, timers, signal,
>>>  console input, etc) to signal.
> 
> They are all events.
> You can only wait on either events or sockets (using select).
> There is a socket api to signal an event when data arrives (etc).
> There is also the insane (these days) restriction of 64 events.

Ok.

>> With that in mind would it be good to have some interaction with epoll (and 
>> similar calls)?
> 
> Or hook something up so that pollwakeup can kick a futex as well
> as waking up poll() and adding an event to epoll().

I guess as FUTEX_FD was already there and was removed we can stop this 
discussion.

If there will ever be the need for an async call, I guess an io_uring based
one would be the best...

metze



Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-08 Thread Stefan Metzmacher


On 07.03.21 at 12:56, Daurnimator wrote:
> On Sun, 7 Mar 2021 at 22:35, Stefan Metzmacher  wrote:
>> Instead of having a blocked futex_waitv() waiting on an fd (maybe a generic 
>> eventfd() or a new futex2fd())
>> would be a better interface?
> 
> Like bring back FUTEX_FD? (which was removed back in 2.6.25)

Ah, ok, yes something like that.

But as that was removed because of races, it might not be a good idea to bring
it back.

metze


RE: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-08 Thread David Laight
From: Stefan Metzmacher
> Sent: 07 March 2021 11:35
> 
> Hi André,
> >  ** The wait on multiple problem
> >
> >  The use case lies in the Wine implementation of the Windows NT interface
> >  WaitMultipleObjects. This Windows API function allows a thread to sleep
> >  waiting on the first of a set of event sources (mutexes, timers, signal,
> >  console input, etc) to signal.

They are all events.
You can only wait on either events or sockets (using select).
There is a socket api to signal an event when data arrives (etc).
There is also the insane (these days) restriction of 64 events.

> With that in mind would it be good to have some interaction with epoll (and 
> similar calls)?

Or hook something up so that pollwakeup can kick a futex as well
as waking up poll() and adding an event to epoll().

David



Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-07 Thread Daurnimator
On Sun, 7 Mar 2021 at 22:35, Stefan Metzmacher  wrote:
> Instead of having a blocked futex_waitv() waiting on an fd (maybe a generic 
> eventfd() or a new futex2fd())
> would be a better interface?

Like bring back FUTEX_FD? (which was removed back in 2.6.25)


Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-07 Thread Stefan Metzmacher


Hi André,
>  ** The wait on multiple problem
> 
>  The use case lies in the Wine implementation of the Windows NT interface
>  WaitMultipleObjects. This Windows API function allows a thread to sleep
>  waiting on the first of a set of event sources (mutexes, timers, signal,
>  console input, etc) to signal.  

With that in mind would it be good to have some interaction with epoll (and 
similar calls)?

Instead of having a blocked futex_waitv(), would waiting on an fd (maybe a
generic eventfd() or a new futex2fd()) be a better interface?

Or instead introduce an IORING_OP_FUTEX2_WAITV? Then the futex_waitv logic
could wait in an io-wq kernel thread...

I guess the io_uring way would mean we could have that in mind as a future
addition, which can be implemented later...

metze


Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-05 Thread Peter Oskolkov
On Fri, Mar 5, 2021 at 12:03 PM Peter Oskolkov  wrote:
>
> Hi André!
>
> On Thu, Mar 4, 2021 at 10:58 AM André Almeida  
> wrote:
> >
> > Hi Peter,
> >
> > On 04/03/21 at 02:44, Peter Oskolkov wrote:
> > > On Wed, Mar 3, 2021 at 5:22 PM André Almeida  
> > > wrote:
> > >>
> > >> Hi,
> > >>
> > >> This patch series introduces the futex2 syscalls.
> > >>
> > >> * FAQ
> > >>
> > >>   ** "And what's about FUTEX_64?"
> > >>
> > >>   By supporting 64 bit futexes, the kernel structure for futex would
> > >>   need to have a 64 bit field for the value, and that could defeat one of
> > >>   the purposes of having different sized futexes in the first place:
> > >>   supporting smaller ones to decrease memory usage. This might be
> > >>   something that could be disabled for 32bit archs (and even for
> > >>   CONFIG_BASE_SMALL).
> > >>
> > >>   Which use case would benefit for FUTEX_64? Does it worth the 
> > >> trade-offs?
> > >
> > > The ability to store a pointer value on 64bit platforms is an
> > > important use case.
> > > Imagine a simple producer/consumer scenario, with the producer updating
> > > some shared memory data and waking the consumer. Storing the pointer
> > > in the futex makes it so that only one shared memory location needs to be
> > > accessed "atomically", etc. With two atomics synchronization becomes
> > > more involved (= slower).
> > >
> >
> > So the idea is to, instead of doing this:
> >
> > T1:
> > atomic_set(&shm_addr, buffer_addr);
> > atomic_set(&futex, 0);
> > futex_wake(&futex, 1);
> >
> > T2:
> > consume(shm_addr);
> >
> > To do that:
> >
> > T1:
> > atomic_set(&futex, buffer_addr);
> > futex_wake(&futex, 1);
> >
> > T2:
> > consume(futex);
> >
> > Right?
>
> More like this:
>
> T1 (producer):
> while (true) {
> ptr = get_new_data();
> atomic_set(&futex, ptr);
> futex_wake(&futex, 1);
> }
>
> T2 (consumer):
> some_data *prev = NULL;
> while (true) {
>   futex_wait(&futex, prev);
>   some_data *next = atomic_get(&futex);
>   if (next == prev) continue;  /* spurious wakeup */
>
>   consume_data(next);
>   prev = next;
> }

Or an even more complete example:

T1 (producer):
while (true) {
    next = get_new_data();
    atomic_set(&futex, next);
    futex_wake(&futex, 1);

    /* wait for the consumer */
    prev = next;
    do {
        next = atomic_get(&futex);
        futex_wait(&futex, prev);
    } while (next != NULL);
}

T2 (consumer):
some_data *prev = NULL;
while (true) {
    futex_wait(&futex, prev);
    some_data *next = atomic_get(&futex);
    if (next == prev) continue;  /* spurious wakeup */

    consume_data(next);
    atomic_set(&futex, NULL);  /* prev stays NULL: the slot is empty again */
    futex_wake(&futex, 1); /* signal we can consume more */
}
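
For reference, a runnable sketch of the same handshake on today's 32-bit
futex(). Because the futex word cannot hold a 64-bit pointer, the sketch
publishes a slot index (index + 1, with 0 meaning "empty") instead; that
indirection is what a pointer-sized futex would avoid. The demo_* names,
the item count and the payload values are assumptions made for
illustration.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define DEMO_ITEMS 5

static _Atomic uint32_t demo_slot;   /* 0 = empty, else index + 1 */
static int demo_items[DEMO_ITEMS];
static long demo_sum;

static void demo_futex_wait(_Atomic uint32_t *addr, uint32_t expected)
{
    /* Sleeps only if *addr still equals expected. */
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

static void demo_futex_wake(_Atomic uint32_t *addr, int n)
{
    syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, n, NULL, NULL, 0);
}

static void *demo_producer(void *arg)
{
    (void)arg;
    for (uint32_t i = 0; i < DEMO_ITEMS; i++) {
        demo_items[i] = (int)(i + 1) * 10;
        atomic_store(&demo_slot, i + 1);      /* publish */
        demo_futex_wake(&demo_slot, 1);

        /* wait for the consumer to empty the slot */
        while (atomic_load(&demo_slot) != 0)
            demo_futex_wait(&demo_slot, i + 1);
    }
    return NULL;
}

static void *demo_consumer(void *arg)
{
    (void)arg;
    for (int n = 0; n < DEMO_ITEMS; ) {
        uint32_t v = atomic_load(&demo_slot);

        if (v == 0) {                          /* empty or spurious wakeup */
            demo_futex_wait(&demo_slot, 0);
            continue;
        }
        demo_sum += demo_items[v - 1];
        n++;
        atomic_store(&demo_slot, 0);           /* signal we can consume more */
        demo_futex_wake(&demo_slot, 1);
    }
    return NULL;
}

/* Runs one producer and one consumer; returns the sum of consumed items. */
long demo_run(void)
{
    pthread_t p, c;

    demo_sum = 0;
    atomic_store(&demo_slot, 0);
    pthread_create(&c, NULL, demo_consumer, NULL);
    pthread_create(&p, NULL, demo_producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return demo_sum;
}
```

With only two threads there is never more than one sleeper on the word, so
waking a single waiter is enough in both directions.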

>
>
>
> >
> > I'll try to write a small test to see how the perf numbers looks like.


Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-05 Thread Peter Oskolkov
Hi André!

On Thu, Mar 4, 2021 at 10:58 AM André Almeida  wrote:
>
> Hi Peter,
>
> On 04/03/21 at 02:44, Peter Oskolkov wrote:
> > On Wed, Mar 3, 2021 at 5:22 PM André Almeida  
> > wrote:
> >>
> >> Hi,
> >>
> >> This patch series introduces the futex2 syscalls.
> >>
> >> * FAQ
> >>
> >>   ** "And what's about FUTEX_64?"
> >>
> >>   By supporting 64 bit futexes, the kernel structure for futex would
> >>   need to have a 64 bit field for the value, and that could defeat one of
> >>   the purposes of having different sized futexes in the first place:
> >>   supporting smaller ones to decrease memory usage. This might be
> >>   something that could be disabled for 32bit archs (and even for
> >>   CONFIG_BASE_SMALL).
> >>
> >>   Which use case would benefit for FUTEX_64? Does it worth the trade-offs?
> >
> > The ability to store a pointer value on 64bit platforms is an
> > important use case.
> > Imagine a simple producer/consumer scenario, with the producer updating
> > some shared memory data and waking the consumer. Storing the pointer
> > in the futex makes it so that only one shared memory location needs to be
> > accessed "atomically", etc. With two atomics synchronization becomes
> > more involved (= slower).
> >
>
> So the idea is to, instead of doing this:
>
> T1:
> atomic_set(&shm_addr, buffer_addr);
> atomic_set(&futex, 0);
> futex_wake(&futex, 1);
>
> T2:
> consume(shm_addr);
>
> To do that:
>
> T1:
> atomic_set(&futex, buffer_addr);
> futex_wake(&futex, 1);
>
> T2:
> consume(futex);
>
> Right?

More like this:

T1 (producer):
while (true) {
    ptr = get_new_data();
    atomic_set(&futex, ptr);
    futex_wake(&futex, 1);
}

T2 (consumer):
some_data *prev = NULL;
while (true) {
    futex_wait(&futex, prev);
    some_data *next = atomic_get(&futex);
    if (next == prev) continue;  /* spurious wakeup */

    consume_data(next);
    prev = next;
}



>
> I'll try to write a small test to see how the perf numbers looks like.


Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-04 Thread André Almeida

Hi Ted,

On 04/03/21 at 12:01, Theodore Ts'o wrote:
> On Wed, Mar 03, 2021 at 09:42:06PM -0300, André Almeida wrote:
> >  ** Performance
> >
> >  - For comparing futex() and futex2() performance, I used the artificial
> >    benchmarks implemented at perf (wake, wake-parallel, hash and
> >    requeue). The setup was 200 runs for each test and using 8, 80, 800,
> >    8000 for the number of threads, Note that for this test, I'm not using
> >    patch 14 ("kernel: Enable waitpid() for futex2") , for reasons explained
> >    at "The patchset" section.
>
> How heavily contended where the benchmarks?  One of the benefits of
> the original futex was that no system call was necessary in the happy
> path when the lock is uncontended.

futex2 has the same design in that aspect, no syscall is needed in the
happy path. Did something in the cover letter give the impression that
this is not the case? I would like to reword it to clarify this.

> Especially on a non-NUMA system
> (which are the far more common case), since that's where relying on a
> single memory access was a huge win for the original futex.  I would
> expect that futex2 will fare worse in this particular case, since it
> requires a system call entry for all operations --- the question is
> how large is the delta in this worst case (for futex2) and best case
> (for futex) scenario.
>
> Cheers,
>
> - Ted



Thanks,
André
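
To make the "happy path" concrete, here is a minimal sketch of a
three-state mutex built on the existing futex() syscall (not on the futex2
calls from this series). The uncontended lock and unlock are each a single
atomic operation in userspace, and the kernel is entered only under
contention. The demo_* names and the syscall counter are assumptions added
for illustration.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock states: 0 = unlocked, 1 = locked, 2 = locked, maybe with waiters. */
typedef _Atomic uint32_t demo_mutex;

/* Counts kernel entries so the fast path can be observed. */
static atomic_ulong demo_syscalls;

static long demo_futex(demo_mutex *m, int op, uint32_t val)
{
    atomic_fetch_add(&demo_syscalls, 1);
    return syscall(SYS_futex, m, op, val, NULL, NULL, 0);
}

void demo_lock(demo_mutex *m)
{
    uint32_t c = 0;

    /* Happy path: one compare-and-swap, no syscall. */
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;

    /* Contended: advertise a waiter, then sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(m, 2);
    while (c != 0) {
        demo_futex(m, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange(m, 2);
    }
}

void demo_unlock(demo_mutex *m)
{
    /* Enter the kernel only if someone may be sleeping. */
    if (atomic_exchange(m, 0) == 2)
        demo_futex(m, FUTEX_WAKE_PRIVATE, 1);
}

unsigned long demo_syscall_count(void)
{
    return atomic_load(&demo_syscalls);
}
```

An uncontended demo_lock()/demo_unlock() pair leaves demo_syscall_count()
unchanged, which is the property being discussed above.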


Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-04 Thread André Almeida

Hi Peter,

On 04/03/21 at 02:44, Peter Oskolkov wrote:
> On Wed, Mar 3, 2021 at 5:22 PM André Almeida  wrote:
> >
> > Hi,
> >
> > This patch series introduces the futex2 syscalls.
> >
> > * FAQ
> >
> >   ** "And what's about FUTEX_64?"
> >
> >   By supporting 64 bit futexes, the kernel structure for futex would
> >   need to have a 64 bit field for the value, and that could defeat one of
> >   the purposes of having different sized futexes in the first place:
> >   supporting smaller ones to decrease memory usage. This might be
> >   something that could be disabled for 32bit archs (and even for
> >   CONFIG_BASE_SMALL).
> >
> >   Which use case would benefit for FUTEX_64? Does it worth the trade-offs?
>
> The ability to store a pointer value on 64bit platforms is an
> important use case.
> Imagine a simple producer/consumer scenario, with the producer updating
> some shared memory data and waking the consumer. Storing the pointer
> in the futex makes it so that only one shared memory location needs to be
> accessed "atomically", etc. With two atomics synchronization becomes
> more involved (= slower).



So the idea is to, instead of doing this:

T1:
atomic_set(&shm_addr, buffer_addr);
atomic_set(&futex, 0);
futex_wake(&futex, 1);

T2:
consume(shm_addr);

To do that:

T1:
atomic_set(&futex, buffer_addr);
futex_wake(&futex, 1);

T2:
consume(futex);

Right?

I'll try to write a small test to see what the perf numbers look like.


Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-04 Thread Theodore Ts'o
On Wed, Mar 03, 2021 at 09:42:06PM -0300, André Almeida wrote:
>  ** Performance
> 
>  - For comparing futex() and futex2() performance, I used the artificial
>    benchmarks implemented at perf (wake, wake-parallel, hash and
>    requeue). The setup was 200 runs for each test and using 8, 80, 800,
>    8000 for the number of threads, Note that for this test, I'm not using
>    patch 14 ("kernel: Enable waitpid() for futex2") , for reasons explained
>    at "The patchset" section.

How heavily contended were the benchmarks?  One of the benefits of
the original futex was that no system call was necessary in the happy
path when the lock is uncontended.  This matters especially on non-NUMA
systems (which are by far the more common case), since that's where
relying on a single memory access was a huge win for the original futex.
I would expect that futex2 will fare worse in this particular case, since it
requires a system call entry for all operations --- the question is
how large the delta is between this worst case (for futex2) and best case
(for futex) scenario.

Cheers,

- Ted


Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-03 Thread Peter Oskolkov
On Wed, Mar 3, 2021 at 5:22 PM André Almeida  wrote:
>
> Hi,
>
> This patch series introduces the futex2 syscalls.
>
> * What happened to the current futex()?
>
> For some years now, developers have been trying to add new features to
> futex, but maintainers have been reluctant to accept then, given the
> multiplexed interface full of legacy features and tricky to do big
> changes. Some problems that people tried to address with patchsets are:
> NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2].
> NUMA, for instance, just doesn't fit the current API in a reasonable
> way. Considering that, it's not possible to merge new features into the
> current futex.
>
>  ** The NUMA problem
>
>  At the current implementation, all futex kernel side infrastructure is
>  stored on a single node. Given that, all futex() calls issued by
>  processors that aren't located on that node will have a memory access
>  penalty when doing it.
>
>  ** The 32bit sized futex problem
>
>  Embedded systems or anything with memory constrains would benefit of
>  using smaller sizes for the futex userspace integer. Also, a mutex
>  implementation can be done using just three values, so 8 bits is enough
>  for various scenarios.
>
>  ** The wait on multiple problem
>
>  The use case lies in the Wine implementation of the Windows NT interface
>  WaitMultipleObjects. This Windows API function allows a thread to sleep
>  waiting on the first of a set of event sources (mutexes, timers, signal,
>  console input, etc) to signal.  Considering this is a primitive
>  synchronization operation for Windows applications, being able to quickly
>  signal events on the producer side, and quickly go to sleep on the
>  consumer side is essential for good performance of those running over Wine.
>
> [0] https://lore.kernel.org/lkml/20160505204230.932454...@linutronix.de/
> [1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskaru...@web.de/
> [2] 
> https://lore.kernel.org/lkml/20200213214525.183689-1-andrealm...@collabora.com/
>
> * The solution
>
> As proposed by Peter Zijlstra and Florian Weimer[3], a new interface
> is required to solve this, which must be designed with those features in
> mind. futex2() is that interface. As opposed to the current multiplexed
> interface, the new one should have one syscall per operation. This will
> allow the maintainability of the API if it gets extended, and will help
> users with type checking of arguments.
>
> In particular, the new interface is extended to support the ability to
> wait on any of a list of futexes at a time, which could be seen as a
> vectored extension of the FUTEX_WAIT semantics.
>
> [3] 
> https://lore.kernel.org/lkml/20200303120050.gc2...@hirez.programming.kicks-ass.net/
>
> * The interface
>
> The new interface can be seen in details in the following patches, but
> this is a high level summary of what the interface can do:
>
>  - Supports wake/wait semantics, as in futex()
>  - Supports requeue operations, similarly as FUTEX_CMP_REQUEUE, but with
>individual flags for each address
>  - Supports waiting for a vector of futexes, using a new syscall named
>futex_waitv()
>  - Supports variable sized futexes (8bits, 16bits and 32bits)
>  - Supports NUMA-awareness operations, where the user can specify on
>which memory node would like to operate
>
> * Implementation
>
> The internal implementation follows a similar design to the original futex.
> Given that we want to replicate the same external behavior of current
> futex, this should be somewhat expected. For some functions, like the
> init and the code to get a shared key, I literally copied code and
> comments from kernel/futex.c. I decided to do so instead of exposing the
> original function as a public function since in that way we can freely
> modify our implementation if required, without any impact on old futex.
> Also, the comments precisely describes the details and corner cases of
> the implementation.
>
> Each patch contains a brief description of implementation, but patch 6
> "docs: locking: futex2: Add documentation" adds a more complete document
> about it.
>
> * The patchset
>
> This patchset can be also found at my git tree:
>
> https://gitlab.collabora.com/tonyk/linux/-/tree/futex2-dev
>
>   - Patch 1: Implements wait/wake, and the basics foundations of futex2
>
>   - Patches 2-4: Implement the remaining features (shared, waitv, requeue).
>
>   - Patch 5:  Adds the x86_x32 ABI handling. I kept it in a separated
> patch since I'm not sure if x86_x32 is still a thing, or if it should
> return -ENOSYS.
>
>   - Patch 6: Add a documentation file which details the interface and
> the internal implementation.
>
>   - Patches 7-13: Selftests for all operations along with perf
> support for futex2.
>
>   - Patch 14: While working on porting glibc for futex2, I found out
> that there's a futex_wake() call at the user thread exit path, if
> that thread was created with clone(.

[RFC PATCH v2 00/13] Add futex2 syscall

2021-03-03 Thread André Almeida
Hi,

This patch series introduces the futex2 syscalls.

* What happened to the current futex()?

For some years now, developers have been trying to add new features to
futex, but maintainers have been reluctant to accept them, given that the
multiplexed interface is full of legacy features and makes big changes
tricky. Some problems that people tried to address with patchsets are:
NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2].
NUMA, for instance, just doesn't fit the current API in a reasonable
way. Considering that, it's not possible to merge new features into the
current futex.

 ** The NUMA problem

 In the current implementation, all futex kernel side infrastructure is
 stored on a single node. Given that, all futex() calls issued by
 processors that aren't located on that node will incur a memory access
 penalty.

 ** The 32bit sized futex problem

 Embedded systems or anything with memory constraints would benefit from
 using smaller sizes for the futex userspace integer. Also, a mutex
 implementation can be done using just three values, so 8 bits is enough
 for various scenarios.

 ** The wait on multiple problem

 The use case lies in the Wine implementation of the Windows NT interface
 WaitMultipleObjects. This Windows API function allows a thread to sleep
 waiting on the first of a set of event sources (mutexes, timers, signal,
 console input, etc) to signal.  Considering this is a primitive
 synchronization operation for Windows applications, being able to quickly
 signal events on the producer side, and quickly go to sleep on the
 consumer side is essential for good performance of applications running over Wine.

[0] https://lore.kernel.org/lkml/20160505204230.932454...@linutronix.de/
[1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskaru...@web.de/
[2] 
https://lore.kernel.org/lkml/20200213214525.183689-1-andrealm...@collabora.com/
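
For comparison, a minimal sketch of "wait for the first of N events" built
from syscalls that exist today, assuming each event is modelled as an
eventfd(). That modelling, the demo_* names and the 64-entry cap are
assumptions for illustration; this is not the futex_waitv() interface from
the series. Each successful wait costs at least two kernel entries, one
poll() to sleep and one read() to consume the event.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define DEMO_MAX_EVENTS 64

/* Signal one event. Returns 0 on success, -1 on error. */
int demo_signal(int fd)
{
    uint64_t v = 1;

    return write(fd, &v, sizeof v) == (ssize_t)sizeof v ? 0 : -1;
}

/*
 * Block until one of the n non-blocking eventfds is signaled, consume it,
 * and return its index. Returns -1 on error.
 */
int demo_wait_any(const int *fds, int n)
{
    struct pollfd pfds[DEMO_MAX_EVENTS];

    if (n <= 0 || n > DEMO_MAX_EVENTS)
        return -1;
    for (int i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    for (;;) {
        if (poll(pfds, (nfds_t)n, -1) < 0)     /* kernel entry 1: sleep */
            return -1;
        for (int i = 0; i < n; i++) {
            uint64_t v;

            if ((pfds[i].revents & POLLIN) &&
                read(fds[i], &v, sizeof v) == (ssize_t)sizeof v)
                return i;                      /* kernel entry 2: consume */
            /* Someone else may have consumed it first; keep looking. */
        }
    }
}
```

The window between poll() returning and read() consuming the event is
where another waiter can get in first, which is why the loop retries
instead of assuming the event is still there.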

* The solution

As proposed by Peter Zijlstra and Florian Weimer[3], a new interface
is required to solve this, which must be designed with those features in
mind. futex2() is that interface. As opposed to the current multiplexed
interface, the new one should have one syscall per operation. This will
keep the API maintainable if it gets extended, and will help
users with type checking of arguments.
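
To illustrate the type checking point, here is a sketch of typed userspace
wrappers around the existing multiplexed futex() syscall. The demo_* names
are assumptions for illustration and are not the syscalls added by this
series. Note how FUTEX_CMP_REQUEUE has to pass its requeue limit through
the argument slot that FUTEX_WAIT uses for a timeout pointer, so the
compiler cannot catch a mix-up at the raw syscall level.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Wake up to nr_wake waiters on uaddr; returns the number woken. */
static inline long demo_futex_wake(uint32_t *uaddr, int nr_wake)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE_PRIVATE, nr_wake,
                   NULL, NULL, 0);
}

/*
 * Sleep while *uaddr == expected. A NULL timeout waits indefinitely.
 * Returns -1 with errno set to EAGAIN if the value already differs.
 */
static inline long demo_futex_wait(uint32_t *uaddr, uint32_t expected,
                                   const struct timespec *timeout)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, expected,
                   timeout, NULL, 0);
}

/*
 * If *from == expected, wake up to nr_wake waiters on 'from' and move up
 * to nr_requeue of the remaining ones to 'to'. The fourth syscall argument
 * is the requeue limit here, not a timeout pointer.
 */
static inline long demo_futex_cmp_requeue(uint32_t *from, uint32_t *to,
                                          int nr_wake, int nr_requeue,
                                          uint32_t expected)
{
    return syscall(SYS_futex, from, FUTEX_CMP_REQUEUE_PRIVATE, nr_wake,
                   (unsigned long)nr_requeue, to, expected);
}
```

With one entry point per operation, each parameter has a single meaning and
a single type, which is the benefit the paragraph above describes.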

In particular, the new interface is extended to support the ability to
wait on any of a list of futexes at a time, which could be seen as a
vectored extension of the FUTEX_WAIT semantics.

[3] 
https://lore.kernel.org/lkml/20200303120050.gc2...@hirez.programming.kicks-ass.net/

* The interface

The new interface can be seen in detail in the following patches, but
this is a high level summary of what the interface can do:

 - Supports wake/wait semantics, as in futex()
 - Supports requeue operations, similar to FUTEX_CMP_REQUEUE, but with
   individual flags for each address
 - Supports waiting for a vector of futexes, using a new syscall named
   futex_waitv()
 - Supports variable sized futexes (8bits, 16bits and 32bits)
 - Supports NUMA-awareness operations, where the user can specify on
   which memory node they would like to operate

* Implementation

The internal implementation follows a similar design to the original futex.
Given that we want to replicate the same external behavior of current
futex, this should be somewhat expected. For some functions, like the
init and the code to get a shared key, I literally copied code and
comments from kernel/futex.c. I decided to do so instead of exposing the
original function as a public function since in that way we can freely
modify our implementation if required, without any impact on old futex.
Also, the comments precisely describe the details and corner cases of
the implementation.

Each patch contains a brief description of implementation, but patch 6
"docs: locking: futex2: Add documentation" adds a more complete document
about it.

* The patchset

This patchset can be also found at my git tree:

https://gitlab.collabora.com/tonyk/linux/-/tree/futex2-dev

  - Patch 1: Implements wait/wake, and the basic foundations of futex2

  - Patches 2-4: Implement the remaining features (shared, waitv, requeue).

  - Patch 5:  Adds the x86_x32 ABI handling. I kept it in a separate
    patch since I'm not sure if x86_x32 is still a thing, or if it should
    return -ENOSYS.

  - Patch 6: Adds a documentation file which details the interface and
    the internal implementation.

  - Patches 7-13: Selftests for all operations along with perf
    support for futex2.

  - Patch 14: While working on porting glibc for futex2, I found out
that there's a futex_wake() call at the user thread exit path, if
that thread was created with clone(..., CLONE_CHILD_CLEARTID, ...). In
order to make pthreads work with futex2, it was required to add
this patch. Note that this is more a proof-of-concept of what we
will need to do in future, rather than part of the interface and
shouldn't