Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-26 Thread Waiman Long

On 09/23/2016 09:02 AM, Thomas Gleixner wrote:

On Thu, 22 Sep 2016, Waiman Long wrote:

Locking was done mostly by lock stealing. This is where most of the
performance benefit comes from, not optimistic spinning.

What does the lock latency distribution of all this look like, and how fair
is the whole thing?

The TO futexes are unfair as can be seen from the min/max thread times listed
above. It took the fastest thread 0.07s to complete all the locking
operations, whereas the slowest one needed 2.65s. However, the situation
reverses when I changed the critical section to a 1us sleep. In this case,

1us sleep is going to add another syscall and therefore scheduling, so what?

Or did you just extend the critical section busy time?


The 1us sleep will cause the spinning to stop and make all the waiters
sleep. This is to simulate the extreme case where the TO futexes may not
have a performance advantage.





there will be no optimistic spinning. The performance results for 100k locking
operations were listed below.

                wait-wake futex   PI futex    TO futex
                ---------------   --------    --------
max time             0.06s          9.32s       4.76s
                     ^^^^^

Yes, wait-wake futex is the unfair one in this case.

min time             5.59s          9.36s       5.62s
average time         3.25s          9.35s       5.41s

In this case, the TO futexes are fairer but perform worse than the wait-wake
futexes. That is because the lock handoff mechanism limits the amount of lock
stealing in the TO futexes while the wait-wake futexes have no such
restriction. When I disabled lock handoff, the TO futexes would then perform
similarly to the wait-wake futexes.

So the benefit of these newfangled futexes is only there for extremely short
critical sections and a gazillion threads fighting for the same futex,
right?


Not really. Lock stealing will help performance when a gazillion
threads are fighting for the same futex. Optimistic spinning will help to
reduce the lock transfer latency because the waiter isn't sleeping no
matter the number of threads. One set of data that I haven't shown so
far is that the performance delta between wait-wake and TO futexes
actually increases as the critical section is lengthened. This is
because for a short critical section, the waiters of a wait-wake futex may
not actually go to sleep because of the latency introduced by the code
that has to be run before they do a final check to see if the futex
value has changed before going to sleep. The longer the critical section,
the higher the chance that they actually sleep, and hence the worse their
performance gets relative to the TO futexes.


For example, with a critical section of 50 pause instructions instead
of 5, the performance gain is about 5X instead of the roughly 1.6X seen
with the shorter critical section.
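
The critical section in that microbenchmark is just a fixed run of pause
instructions between lock and unlock, along these lines (a sketch of the
shape only, with hypothetical lock()/unlock() wrappers standing in for the
futex variant under test; not the actual benchmark source):

#include <stdatomic.h>

/*
 * Rough sketch of one benchmark thread: 'csect_len' pause instructions
 * form the critical section (5 in the numbers above, 50 in the longer
 * run).  lock()/unlock() are hypothetical wrappers for whichever
 * futex-based lock is being measured.
 */
static void bench_thread(atomic_int *futex_word, long ops, int csect_len)
{
        while (ops--) {
                lock(futex_word);                 /* hypothetical wrapper */
                for (int i = 0; i < csect_len; i++)
                        __builtin_ia32_pause();   /* x86 'pause' instruction */
                unlock(futex_word);               /* hypothetical wrapper */
        }
}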



I really wonder how the average programmer should pick the right flavour,
not to talk about any useful decision for something like glibc to pick the
proper one.


I would say that TO futexes will have better performance in most cases.
Of course, I still need to run some real-world benchmarks to quantify
the effect of the new futexes. I am hoping to get suggestions on a good
set of benchmarks to run.


Cheers,
Longman


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-23 Thread Thomas Gleixner
On Thu, 22 Sep 2016, Waiman Long wrote:
> On 09/22/2016 04:38 PM, Thomas Gleixner wrote:
> > >                 wait-wake futex   PI futex    TO futex
> > >                 ---------------   --------    --------
> > > max time             3.49s         50.91s       2.65s
> > > min time             3.24s         50.84s       0.07s
> > > average time         3.41s         50.90s       1.84s
> > > sys time           7m22.4s         55.73s     2m32.9s
> > That's really interesting. Do you have any explanation for these massive
> > system time differences?
> 
> For the wait-wake futexes, the waiters will be kicked out with EAGAIN without
> sleeping as long as the futex value changes before they acquire the hash
> bucket spinlock. These waiters will attempt to acquire the lock again in the
> user space. If that fails, they will call FUTEX_WAIT again.
> 
> In the above test case, the critical section is pretty short and about 4/5 of
> the FUTEX_WAIT calls resulted in an EAGAIN error. So there are a lot more
> user-space/kernel transitions than the TO futexes. I think the difference in
> the number of FUTEX_WAIT calls vs. the FUTEX_LOCK_TO calls causes the
> difference between their sys times.

Ok.

> As for the PI futex, it is a strictly sleeping lock and so won't use that
> much sys time.

That's clear.

> > > Locking was done mostly by lock stealing. This is where most of the
> > > performance benefit comes from, not optimistic spinning.
> > What does the lock latency distribution of all this look like, and how fair
> > is the whole thing?
> 
> The TO futexes are unfair as can be seen from the min/max thread times listed
> above. It took the fastest thread 0.07s to complete all the locking
> operations, whereas the slowest one needed 2.65s. However, the situation
> reverses when I changed the critical section to a 1us sleep. In this case,

1us sleep is going to add another syscall and therefore scheduling, so what?

Or did you just extend the critical section busy time?

> there will be no optimistic spinning. The performance results for 100k locking
> operations were listed below.
> 
>                 wait-wake futex   PI futex    TO futex
>                 ---------------   --------    --------
> max time             0.06s          9.32s       4.76s
                       ^^^^^
> min time             5.59s          9.36s       5.62s
> average time         3.25s          9.35s       5.41s
> 
> In this case, the TO futexes are fairer but perform worse than the wait-wake
> futexes. That is because the lock handoff mechanism limits the amount of lock
> stealing in the TO futexes while the wait-wake futexes have no such
> restriction. When I disabled lock handoff, the TO futexes would then perform
> similarly to the wait-wake futexes.

So the benefit of these newfangled futexes is only there for extremely short
critical sections and a gazillion threads fighting for the same futex,
right?

I really wonder how the average programmer should pick the right flavour,
not to talk about any useful decision for something like glibc to pick the
proper one.

Thanks,

tglx

 


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Waiman Long

On 09/22/2016 05:41 PM, Thomas Gleixner wrote:

On Thu, 22 Sep 2016, Davidlohr Bueso wrote:

On Thu, 22 Sep 2016, Waiman Long wrote:


BTW, my initial attempt for the new futex was to use the same workflow as
the PI futexes, but use mutex which has optimistic spinning instead of
rt_mutex.

Btw, Thomas, do you still have any interest pursuing this for rtmutexes from
-rt into mainline? If so I can resend the patches from a while ago.

Certainly yes. My faint memory tells me that there was some potential issue
due to boosting the owner only if it gets scheduled out, but I might be
wrong.


It is tricky to add optimistic spinning to rtmutexes because of the need
to observe process priorities. It is certainly possible to make the top
waiter spin, but then I am not sure how much performance gain we would
get with just that.


Cheers,
Longman


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Waiman Long

On 09/22/2016 04:38 PM, Thomas Gleixner wrote:

On Thu, 22 Sep 2016, Waiman Long wrote:

BTW, my initial attempt for the new futex was to use the same workflow as the
PI futexes, but use mutex which has optimistic spinning instead of rt_mutex.
That version can double the throughput compared with PI futexes but is still far
short of what can be achieved with wait-wake futexes. Looking at the performance
figures from the patch:

                wait-wake futex   PI futex    TO futex
                ---------------   --------    --------
max time             3.49s         50.91s       2.65s
min time             3.24s         50.84s       0.07s
average time         3.41s         50.90s       1.84s
sys time           7m22.4s         55.73s     2m32.9s

That's really interesting. Do you have any explanation for these massive
system time differences?


For the wait-wake futexes, the waiters will be kicked out with EAGAIN 
without sleeping as long as the futex value changes before they acquire 
the hash bucket spinlock. These waiters will attempt to acquire the lock 
again in the user space. If that fails, they will call FUTEX_WAIT again.


In the above test case, the critical section is pretty short and about
4/5 of the FUTEX_WAIT calls resulted in an EAGAIN error. So there are a
lot more user-space/kernel transitions than with the TO futexes. I think
the difference in the number of FUTEX_WAIT calls vs. the FUTEX_LOCK_TO
calls accounts for the difference in their sys times. As for the PI futex,
it is a strictly sleeping lock and so won't use that much sys time.
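
For reference, that user-space retry loop is basically the classic 0/1/2
wait-wake mutex protocol (a simplified sketch of that well-known scheme,
not the benchmark's exact code):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

/* Futex word: 0 = unlocked, 1 = locked, 2 = locked, waiters possible */
static void ww_lock(atomic_int *f)
{
        int zero = 0;

        /* Uncontended fast path: 0 -> 1 in user space, no syscall. */
        if (atomic_compare_exchange_strong(f, &zero, 1))
                return;

        for (;;) {
                /* Mark the lock contended; if it was free we just took it. */
                if (atomic_exchange(f, 2) == 0)
                        return;

                /*
                 * If the value is no longer 2 by the time the kernel checks
                 * it, FUTEX_WAIT returns EAGAIN without sleeping and we loop
                 * right back through another syscall -- the extra
                 * user-space/kernel transitions mentioned above.
                 */
                syscall(SYS_futex, f, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
        }
}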



lock count       3,090,294      9,999,813     698,318
unlock count     3,268,896      9,999,814         134

The problem with a PI-futex-like version is that almost all the lock/unlock
operations were done in the kernel, which added overhead and latency. Now
looking at the numbers for the TO futexes, less than 1/10 of the lock
operations were done in the kernel, and the number of unlocks was
insignificant. Locking was done mostly by lock stealing. This is where
most of the performance benefit comes from, not optimistic spinning.

What does the lock latency distribution of all this look like, and how fair
is the whole thing?


The TO futexes are unfair as can be seen from the min/max thread times 
listed above. It took the fastest thread 0.07s to complete all the 
locking operations, whereas the slowest one needed 2.65s. However, the 
situation reverses when I changed the critical section to a 1us sleep. 
In this case, there will be no optimistic spinning. The performance 
results for 100k locking operations were listed below.


                wait-wake futex   PI futex    TO futex
                ---------------   --------    --------
max time             0.06s          9.32s       4.76s
min time             5.59s          9.36s       5.62s
average time         3.25s          9.35s       5.41s

In this case, the TO futexes are fairer but perform worse than the
wait-wake futexes. That is because the lock handoff mechanism limits the
amount of lock stealing in the TO futexes while the wait-wake futexes
have no such restriction. When I disabled lock handoff, the TO futexes
would then perform similarly to the wait-wake futexes.





This is also the reason that a lock handoff mechanism is implemented to
prevent lock starvation which is likely to happen without one.

Where is that lock handoff mechanism?




In the futex_state object, there is a new field handoff_pid. It is set
when the threshold count in futex_spin_on_owner() becomes negative. When
this field is set, the unlocker will change the futex word to that value
instead of clearing it to 0, which would let others steal the lock. I will
separate out the lock handoff into its own patch in the next revision to
highlight it.
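
Roughly, the unlock side then does the following (a simplified sketch of
the idea, not the actual patch code; the wakeup, fault handling and the
FUTEX_WAITERS bit manipulation are all omitted):

/*
 * Simplified sketch of the TO unlock idea (not the actual patch code).
 * state->handoff_pid is 0 in the normal case; futex_spin_on_owner()
 * sets it to a starved waiter's PID once its threshold count goes
 * negative.
 */
static void futex_to_unlock_sketch(u32 __user *uaddr, struct futex_state *state)
{
        u32 newval = READ_ONCE(state->handoff_pid);

        /*
         * Writing the waiter's PID instead of 0 hands the lock directly
         * to that waiter; writing 0 leaves the lock free for anyone,
         * including new lockers in user space, to steal.
         */
        if (put_user(newval, uaddr))
                return;         /* fault handling omitted in this sketch */

        /* ... wake up the relevant waiter (omitted) ... */
}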


Cheers,
Longman


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Thomas Gleixner
On Thu, 22 Sep 2016, Davidlohr Bueso wrote:
> On Thu, 22 Sep 2016, Waiman Long wrote:
> 
> > BTW, my initial attempt for the new futex was to use the same workflow as
> > the PI futexes, but use mutex which has optimistic spinning instead of
> > rt_mutex.
> 
> Btw, Thomas, do you still have any interest pursuing this for rtmutexes from
> -rt into mainline? If so I can resend the patches from a while ago.

Certainly yes. My faint memory tells me that there was some potential issue
due to boosting the owner only if it gets scheduled out, but I might be
wrong.

Thanks,

tglx


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Davidlohr Bueso

On Thu, 22 Sep 2016, Waiman Long wrote:

BTW, my initial attempt for the new futex was to use the same workflow 
as the PI futexes, but use mutex which has optimistic spinning instead 
of rt_mutex.


Btw, Thomas, do you still have any interest pursuing this for rtmutexes from
-rt into mainline? If so I can resend the patches from a while ago.

Thanks,
Davidlohr


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Waiman Long

On 09/22/2016 04:26 PM, Thomas Gleixner wrote:

On Thu, 22 Sep 2016, Waiman Long wrote:

On 09/22/2016 09:34 AM, Thomas Gleixner wrote:

On Thu, 22 Sep 2016, Peter Zijlstra wrote:

I'd leave out the TO part entirely (or only mention it in changelogs).

That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.

That brings me to a different question:

How is user space going to support this, i.e. is this some extra magic for
code which implements its own locking primitives or is there going to be a
wide use via e.g. glibc.

There are some applications that use futex(2) directly to implement their
synchronization primitives. For those applications, they will need to modify
their code to detect the presence of the new futexes. They can then use the
new futexes if available and use wait-wake futexes if not.

That's what I suspected. Did you talk to the other folks who complain about
futex performance (database, JVM, etc.) and play their own games with user
space spinlocks and whatever?


I am also part of the team that helps large application vendors tune
their application performance on our large SMP systems. Those
application vendors tend to use futex directly instead of relying on
glibc. We have seen that spinlock contention in the futex code can
sometimes be a significant portion of the CPU cycles consumed, depending
on the workloads being run. We have been providing suggestions on best
practices for using futexes, but there is only so much you can do by
tuning their locking code implementation. That is why I am also looking
for ways to improve the performance of the futex code in the kernel.





I am also planning to take a look at the pthread_mutex* APIs to see if they
can be modified to use the new futexes later on when the patch becomes more
mature.

Please involve glibc people who are interested in the futex stuff early and
discuss the concept before it's set in stone for your particular usecase.
  


Sure, I will start to do some prototyping and performance testing with 
glibc and then engage those folks about that.



Also what's the reason that we can't do probabilistic spinning for
FUTEX_WAIT and have to add yet another specialized variant of futexes?


The main reason is that a FUTEX_WAIT waiter has no idea who the owner of the
futex is. We usually do spinning when the lock owner is running and abort when
it goes to sleep. We can't do that for FUTEX_WAIT.

Fair enough. This wants to be spelled out in the changelog and explained a
bit more detailed.


I will.

Cheers,
Longman


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Waiman Long

On 09/22/2016 04:08 PM, Waiman Long wrote:

On 09/22/2016 11:11 AM, Davidlohr Bueso wrote:

On Thu, 22 Sep 2016, Thomas Gleixner wrote:


On Thu, 22 Sep 2016, Davidlohr Bueso wrote:

On Thu, 22 Sep 2016, Thomas Gleixner wrote:
> Also what's the reason that we can't do probabilistic spinning for
> FUTEX_WAIT and have to add yet another specialized variant of futexes?


Where would this leave the respective FUTEX_WAKE? A nop? Probably have to
differentiate the fact that the queue was empty, but there was a spinning,
instead of straightforward returning 0.


Sorry, but I really can't parse this answer.

Can you folks please communicate with proper and coherent explanations
instead of throwing a few gnawed off bones in my direction?


I actually think that FUTEX_WAIT is the better/nicer approach. But my immediate
question above was how to handle the FUTEX_WAKE counter-part. If we want to
maintain current FIFO ordering for wakeups, now with WAIT spinners this will
create lock stealing scenarios (including if we even guard against starvation).
Or we could reduce the scope of spinners, due to the restrictions, similar to
the top-waiter only being able to spin for rtmutexes. This of course will hurt
the effectiveness of spinning in FUTEX_WAIT in the first place.


Actually, there can be a lot of lock stealing going on with the
wait-wake futexes. If the critical section is short enough, many of
the lock waiters can be waiting in the hash bucket spinlock queue and
not sleeping yet while the futex value changes. As a result, they will
exit the futex syscall and go back to user space with EAGAIN, where one
of them may get the lock. So we can't assume that they will get the lock
in FIFO order anyway.


BTW, my initial attempt for the new futex was to use the same workflow 
as the PI futexes, but use mutex which has optimistic spinning instead 
of rt_mutex. That version can double the throughput compared with PI
futexes but is still far short of what can be achieved with wait-wake
futexes. Looking at the performance figures from the patch:


                wait-wake futex   PI futex    TO futex
                ---------------   --------    --------
max time             3.49s         50.91s       2.65s
min time             3.24s         50.84s       0.07s
average time         3.41s         50.90s       1.84s
sys time           7m22.4s         55.73s     2m32.9s
lock count       3,090,294      9,999,813     698,318
unlock count     3,268,896      9,999,814         134

The problem with a PI-futex-like version is that almost all the
lock/unlock operations were done in the kernel, which added overhead and
latency. Now looking at the numbers for the TO futexes, less than 1/10
of the lock operations were done in the kernel, and the number of unlocks
was insignificant. Locking was done mostly by lock stealing. This is where
most of the performance benefit comes from, not optimistic spinning.


This is also the reason that a lock handoff mechanism is implemented to 
prevent lock starvation which is likely to happen without one.


Cheers,
Longman




Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Thomas Gleixner
On Thu, 22 Sep 2016, Waiman Long wrote:
> BTW, my initial attempt for the new futex was to use the same workflow as the
> PI futexes, but use mutex which has optimistic spinning instead of rt_mutex.
> That version can double the throughput compared with PI futexes but is still far
> short of what can be achieved with wait-wake futexes. Looking at the performance
> figures from the patch:
> 
>                 wait-wake futex   PI futex    TO futex
>                 ---------------   --------    --------
> max time             3.49s         50.91s       2.65s
> min time             3.24s         50.84s       0.07s
> average time         3.41s         50.90s       1.84s
> sys time           7m22.4s         55.73s     2m32.9s

That's really interesting. Do you have any explanation for these massive
system time differences?

> lock count       3,090,294      9,999,813     698,318
> unlock count     3,268,896      9,999,814         134
> 
> The problem with a PI-futex-like version is that almost all the lock/unlock
> operations were done in the kernel, which added overhead and latency. Now
> looking at the numbers for the TO futexes, less than 1/10 of the lock
> operations were done in the kernel, and the number of unlocks was insignificant.
> Locking was done mostly by lock stealing. This is where most of the
> performance benefit comes from, not optimistic spinning.

What does the lock latency distribution of all this look like, and how fair
is the whole thing?

> This is also the reason that a lock handoff mechanism is implemented to
> prevent lock starvation which is likely to happen without one.

Where is that lock handoff mechanism?

Thanks,

tglx


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Thomas Gleixner
On Thu, 22 Sep 2016, Waiman Long wrote:
> On 09/22/2016 09:34 AM, Thomas Gleixner wrote:
> > On Thu, 22 Sep 2016, Peter Zijlstra wrote:
> > > 
> > > I'd leave out the TO part entirely (or only mention it in changelogs).
> > > 
> > > That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.
> > That brings me to a different question:
> > 
> > How is user space going to support this, i.e. is this some extra magic for
> > code which implements its own locking primitives or is there going to be a
> > wide use via e.g. glibc.
> 
> There are some applications that use futex(2) directly to implement their
> synchronization primitives. For those applications, they will need to modify
> their code to detect the presence of the new futexes. They can then use the
> new futexes if available and use wait-wake futexes if not.

That's what I suspected. Did you talk to the other folks who complain about
futex performance (database, JVM, etc.) and play their own games with user
space spinlocks and whatever?
 
> I am also planning to take a look at the pthread_mutex* APIs to see if they
> can be modified to use the new futexes later on when the patch becomes more
> mature.

Please involve glibc people who are interested in the futex stuff early and
discuss the concept before it's set in stone for your particular usecase.
 
> > Also what's the reason that we can't do probabilistic spinning for
> > FUTEX_WAIT and have to add yet another specialized variant of futexes?
> > 
> 
> The main reason is that a FUTEX_WAIT waiter has no idea who the owner of the
> futex is. We usually do spinning when the lock owner is running and abort when
> it goes to sleep. We can't do that for FUTEX_WAIT.

Fair enough. This wants to be spelled out in the changelog and explained a
bit more detailed.

Thanks,

tglx


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Waiman Long

On 09/22/2016 11:11 AM, Davidlohr Bueso wrote:

On Thu, 22 Sep 2016, Thomas Gleixner wrote:


On Thu, 22 Sep 2016, Davidlohr Bueso wrote:

On Thu, 22 Sep 2016, Thomas Gleixner wrote:
> Also what's the reason that we can't do probabilistic spinning for
> FUTEX_WAIT and have to add yet another specialized variant of futexes?


Where would this leave the respective FUTEX_WAKE? A nop? Probably have to
differentiate the fact that the queue was empty, but there was a spinning,
instead of straightforward returning 0.


Sorry, but I really can't parse this answer.

Can you folks please communicate with proper and coherent explanations
instead of throwing a few gnawed off bones in my direction?


I actually think that FUTEX_WAIT is the better/nicer approach. But my immediate
question above was how to handle the FUTEX_WAKE counter-part. If we want to
maintain current FIFO ordering for wakeups, now with WAIT spinners this will
create lock stealing scenarios (including if we even guard against starvation).
Or we could reduce the scope of spinners, due to the restrictions, similar to
the top-waiter only being able to spin for rtmutexes. This of course will hurt
the effectiveness of spinning in FUTEX_WAIT in the first place.


Actually, there can be a lot of lock stealing going on with the
wait-wake futexes. If the critical section is short enough, many of the
lock waiters can be waiting in the hash bucket spinlock queue and not
sleeping yet while the futex value changes. As a result, they will exit
the futex syscall and go back to user space with EAGAIN, where one of them
may get the lock. So we can't assume that they will get the lock in
FIFO order anyway.


Another immediate thought was situations where we have spinner(s) and the
wait queue is empty; the WAKE should also have to acknowledge that
situation, as just returning 0 would indicate that there are actually no
waiters on the futex.


I would say that adding optimistic spinning to FUTEX_WAIT would be a
major change, and I don't think it would be less complex than adding a new
futex type like the TO futexes, while it would introduce a fair amount of
risk of breaking existing applications.


Cheers,
Longman



Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Waiman Long

On 09/22/2016 09:34 AM, Thomas Gleixner wrote:

On Thu, 22 Sep 2016, Peter Zijlstra wrote:


I'd leave out the TO part entirely (or only mention it in changelogs).

That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.

That brings me to a different question:

How is user space going to support this, i.e. is this some extra magic for
code which implements its own locking primitives or is there going to be a
wide use via e.g. glibc.


There are some applications that use futex(2) directly to implement 
their synchronization primitives. For those applications, they will need 
to modify their code to detect the presence of the new futexes. They can 
then use the new futexes if available and use wait-wake futexes if not.
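
Detection can be a one-time probe of the new opcode at startup. A sketch
of what I have in mind (not part of this patch; it assumes the new op
validates its arguments before doing anything, as the existing ops do, so
an unknown opcode is the only thing that returns ENOSYS):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>

#ifndef FUTEX_LOCK_TO
#define FUTEX_LOCK_TO   13      /* value added by this series */
#endif

static int have_to_futex;       /* 1 if FUTEX_LOCK_TO is available */

static void detect_to_futex(void)
{
        /*
         * An unknown futex opcode fails with ENOSYS before uaddr is even
         * looked at, so probing with a NULL address distinguishes "op not
         * implemented" (ENOSYS) from "op implemented, bad address"
         * (typically EFAULT) without taking any lock as a side effect.
         */
        if (syscall(SYS_futex, NULL, FUTEX_LOCK_TO | FUTEX_PRIVATE_FLAG,
                    0, NULL, NULL, 0) == -1 && errno == ENOSYS)
                have_to_futex = 0;  /* fall back to FUTEX_WAIT/FUTEX_WAKE */
        else
                have_to_futex = 1;
}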


I am also planning to take a look at the pthread_mutex* APIs to see if 
they can be modified to use the new futexes later on when the patch 
becomes more mature.



Also what's the reason that we can't do probabilistic spinning for
FUTEX_WAIT and have to add yet another specialized variant of futexes?



The main reason is that a FUTEX_WAIT waiter has no idea who the owner of 
the futex is. We usually do spinning when the lock owner is running and 
abort when it goes to sleep. We can't do that for FUTEX_WAIT.
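
The key point is that the kernel can only make that call when the lock
protocol identifies the owner, as the PI/TO protocols do by storing the
owner's TID in the futex word. Roughly (a much simplified sketch, not the
actual futex_spin_on_owner() in this patch; lock_is_available() is a
hypothetical placeholder):

/*
 * Much simplified sketch of owner-aware spinning (not the actual
 * futex_spin_on_owner() in this patch).  It only works because the lock
 * protocol records who the owner is; a FUTEX_WAIT word carries no owner
 * information, so there is nothing to base this decision on.
 */
static int spin_on_owner_sketch(struct futex_state *state)
{
        struct task_struct *owner = READ_ONCE(state->owner);

        while (owner && READ_ONCE(owner->on_cpu)) {
                if (need_resched())
                        return -EAGAIN;         /* stop spinning, reschedule */

                if (lock_is_available(state))   /* hypothetical placeholder */
                        return 0;

                cpu_relax();
                owner = READ_ONCE(state->owner);
        }

        /* Owner is sleeping (or unknown): give up spinning and block. */
        return -EBUSY;
}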


Cheers,
Longman


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Waiman Long

On 09/22/2016 09:23 AM, Peter Zijlstra wrote:

On Tue, Sep 20, 2016 at 09:42:41AM -0400, Waiman Long wrote:

+/*
+ * Spinning threshold before enabling lock handoff.
+ * Each sleep will decrement the threshold by 1/32 of the start value.
+ */
+#define TO_SPIN_THRESHOLD  (1 << 13)
+#define TO_SLEEP_DECREMENT (TO_SPIN_THRESHOLD/32)

Argh, we should really get rid of those stupid numbers. Wasn't there a
patch set working on implementing paravirt functions that would make all
this fixable in a sane way?


These thresholds are not paravirt related. They are there to determine
when we should activate explicit lock handoff because a waiter has been
waiting for too long. Yes, the choice is arbitrary. The numbers are on the
high side, as setting them too low will limit lock stealing and hence
overall performance. The FUTEX_WAITERS bit is only turned on when either
the top waiter is sleeping or a lock handoff is needed. Otherwise, it is
off to encourage lock stealing in user space as well as in the kernel.
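
Roughly, the way the budget drives the handoff looks like this (a
condensed sketch, not the actual futex_spin_on_owner() code; the helpers
marked hypothetical stand in for the real cmpxchg/wait logic):

/*
 * Condensed sketch of how the spin budget drives lock handoff (not the
 * actual patch code).  A cheap spin iteration costs 1, a sleep costs
 * TO_SLEEP_DECREMENT (1/32 of the start value); once the budget goes
 * negative the waiter publishes its PID so that the next unlock hands
 * the lock over instead of clearing the futex word to 0.
 */
static void futex_spin_sketch(u32 __user *uaddr, struct futex_state *state)
{
        int budget = TO_SPIN_THRESHOLD;

        for (;;) {
                if (try_to_acquire(uaddr, state))       /* hypothetical helper */
                        return;

                if (budget < 0)
                        WRITE_ONCE(state->handoff_pid, task_pid_vnr(current));

                if (owner_is_running(state)) {          /* hypothetical helper */
                        budget--;
                        cpu_relax();
                } else {
                        /*
                         * Only now is FUTEX_WAITERS set, so lock stealing
                         * stays cheap while everybody is still spinning.
                         */
                        set_waiters_bit(uaddr);         /* hypothetical helper */
                        budget -= TO_SLEEP_DECREMENT;
                        wait_for_wakeup(state);         /* hypothetical helper */
                }
        }
}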


Cheers,
Longman




Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Davidlohr Bueso

On Thu, 22 Sep 2016, Thomas Gleixner wrote:


On Thu, 22 Sep 2016, Davidlohr Bueso wrote:

On Thu, 22 Sep 2016, Thomas Gleixner wrote:
> Also what's the reason that we can't do probabilistic spinning for
> FUTEX_WAIT and have to add yet another specialized variant of futexes?

Where would this leave the respective FUTEX_WAKE? A nop? Probably have to
differentiate the fact that the queue was empty, but there was a spinning,
instead of straightforward returning 0.


Sorry, but I really can't parse this answer.

Can you folks please communicate with proper and coherent explanations
instead of throwing a few gnawed off bones in my direction?


I actually think that FUTEX_WAIT is the better/nicer approach. But my immediate
question above was how to handle the FUTEX_WAKE counter-part. If we want to
maintain current FIFO ordering for wakeups, now with WAIT spinners this will
create lock stealing scenarios (including if we even guard against starvation).
Or we could reduce the scope of spinners, due to the restrictions, similar to
the top-waiter only being able to spin for rtmutexes. This of course will hurt
the effectiveness of spinning in FUTEX_WAIT in the first place.

Another immediate thought was situations where we have spinner(s) and the
wait queue is empty; the WAKE should also have to acknowledge that
situation, as just returning 0 would indicate that there are actually no
waiters on the futex.

Thanks,
Davidlohr


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Davidlohr Bueso

On Thu, 22 Sep 2016, Thomas Gleixner wrote:


On Thu, 22 Sep 2016, Peter Zijlstra wrote:

On Wed, Sep 21, 2016 at 07:37:34PM -0400, Waiman Long wrote:
> On 09/21/2016 02:59 AM, Mike Galbraith wrote:
> >On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> >>This patch introduces a new futex implementation called
> >>throughput-optimized (TO) futexes.
> >nit: 'TO' sounds way too much like timeout... TP?  You even use 'to' as
> >shorthand for timeout in the next patch.
>
> I agree. I am not that satisfied with the TO name. So I will change it to TP
> in my next revision of the patch. Thanks for the suggestion.

I'd leave out the TO part entirely (or only mention it in changelogs).

That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.


That brings me to a different question:

How is user space going to support this, i.e. is this some extra magic for
code which implements its own locking primitives or is there going to be a
wide use via e.g. glibc.

Also what's the reason that we can't do probabilistic spinning for
FUTEX_WAIT and have to add yet another specialized variant of futexes?


Where would this leave the respective FUTEX_WAKE? A nop? Probably have to
differentiate the fact that the queue was empty, but there was a spinning,
instead of straightforward returning 0.

Thanks,
Davidlohr


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Thomas Gleixner
On Thu, 22 Sep 2016, Peter Zijlstra wrote:
> On Wed, Sep 21, 2016 at 07:37:34PM -0400, Waiman Long wrote:
> > On 09/21/2016 02:59 AM, Mike Galbraith wrote:
> > >On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> > >>This patch introduces a new futex implementation called
> > >>throughput-optimized (TO) futexes.
> > >nit: 'TO' sounds way too much like timeout... TP?  You even use 'to' as
> > >shorthand for timeout in the next patch.
> > 
> > I agree. I am not that satisfied with the TO name. So I will change it to TP
> > in my next revision of the patch. Thanks for the suggestion.
> 
> I'd leave out the TO part entirely (or only mention it in changelogs).
> 
> That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.

That brings me to a different question:

How is user space going to support this, i.e. is this some extra magic for
code which implements its own locking primitives or is there going to be a
wide use via e.g. glibc.

Also what's the reason that we can't do probabilistic spinning for
FUTEX_WAIT and have to add yet another specialized variant of futexes?

Thanks,

tglx



Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Waiman Long

On 09/22/2016 03:49 AM, Peter Zijlstra wrote:

On Wed, Sep 21, 2016 at 07:37:34PM -0400, Waiman Long wrote:

On 09/21/2016 02:59 AM, Mike Galbraith wrote:

On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:

This patch introduces a new futex implementation called
throughput-optimized (TO) futexes.

nit: 'TO' sounds way too much like timeout... TP?  You even use 'to' as
shorthand for timeout in the next patch.

I agree. I am not that satisfied with the TO name. So I will change it to TP
in my next revision of the patch. Thanks for the suggestion.

I'd leave out the TO part entirely (or only mention it in changelogs).

That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.


I was following the convention for the PI futexes where we have 
FUTEX_LOCK_PI and FUTEX_UNLOCK_PI. If you think it is OK to promote this 
new futex as default for mutex type locks and use the PI futexes only 
when needed, I am certainly OK to drop the suffix.


Cheers,
Longman


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Peter Zijlstra
On Wed, Sep 21, 2016 at 07:37:34PM -0400, Waiman Long wrote:
> On 09/21/2016 02:59 AM, Mike Galbraith wrote:
> >On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> >>This patch introduces a new futex implementation called
> >>throughput-optimized (TO) futexes.
> >nit: 'TO' sounds way too much like timeout... TP?  You even use 'to' as
> >shorthand for timeout in the next patch.
> 
> I agree. I am not that satisfied with the TO name. So I will change it to TP
> in my next revision of the patch. Thanks for the suggestion.

I'd leave out the TO part entirely (or only mention it in changelogs).

That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.


Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-21 Thread Waiman Long

On 09/21/2016 02:59 AM, Mike Galbraith wrote:

On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:

This patch introduces a new futex implementation called
throughput-optimized (TO) futexes.

nit: 'TO' sounds way too much like timeout... TP?  You even use 'to' as
shorthand for timeout in the next patch.


I agree. I am not that satisfied with the TO name. So I will change it 
to TP in my next revision of the patch. Thanks for the suggestion.


Cheers,
Longman



Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-21 Thread Mike Galbraith
On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> This patch introduces a new futex implementation called
> throughput-optimized (TO) futexes.

nit: 'TO' sounds way too much like timeout... TP?  You even use 'to' as
shorthand for timeout in the next patch.

>    /*
> -   * Wake robust non-PI futexes here. The wakeup of
> -   * PI futexes happens in exit_pi_state():
> +   * Wake robust non-PI/TO futexes here. The wakeup of
> +   * PI/TO futexes happens in exit_pi_state():
>    */

nit: comment attracted my eye during high speed scroll-by.

-Mike


[RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

2016-09-20 Thread Waiman Long
This patch introduces a new futex implementation called
throughput-optimized (TO) futexes. The goal of this new futex
type is to maximize locking throughput at the expense of fairness
and deterministic latency. Its throughput is higher than that of
the wait-wake futexes especially on systems with a large number of
CPUs and the lock owners are unlikely to sleep. The downside is the
increase in the response time variance. It also implements a lock
hand-off mechanism to make sure that lock starvation won't happen.

Using a futex locking microbenchmark to do 10 million locking
operations, with 5 pause instructions as the critical section, by 256
threads (10M/256 locking ops each) on a 4-socket 72-core 144-thread
Haswell-EX system, the results of the benchmark runs were as follows:

                wait-wake futex   PI futex    TO futex
                ---------------   --------    --------
max time             3.49s         50.91s       2.65s
min time             3.24s         50.84s       0.07s
average time         3.41s         50.90s       1.84s
sys time           7m22.4s         55.73s     2m32.9s
lock count       3,090,294      9,999,813     698,318
unlock count     3,268,896      9,999,814         134

The lock and unlock counts above show the actual numbers of futex(2)
lock and unlock syscalls that were being issued.

Signed-off-by: Waiman Long 
---
 include/uapi/linux/futex.h |4 +
 kernel/futex.c |  634 +++-
 2 files changed, 627 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 0b1f716..e7deaf3 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,6 +20,8 @@
 #define FUTEX_WAKE_BITSET  10
 #define FUTEX_WAIT_REQUEUE_PI  11
 #define FUTEX_CMP_REQUEUE_PI   12
+#define FUTEX_LOCK_TO  13
+#define FUTEX_UNLOCK_TO14
 
 #define FUTEX_PRIVATE_FLAG 128
 #define FUTEX_CLOCK_REALTIME   256
@@ -39,6 +41,8 @@
 FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE   (FUTEX_CMP_REQUEUE_PI | \
 FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_TO_PRIVATE  (FUTEX_LOCK_TO | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_TO_PRIVATE  (FUTEX_UNLOCK_TO | FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
diff --git a/kernel/futex.c b/kernel/futex.c
index f8bb93f..7daba56 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -191,6 +191,11 @@ int __read_mostly futex_cmpxchg_enabled;
 #define FLAGS_CLOCKRT  0x02
 #define FLAGS_HAS_TIMEOUT  0x04
 
+enum futex_type {
+   TYPE_PI = 0,
+   TYPE_TO,
+};
+
 /*
  * Futex state object:
  *  - Priority Inheritance state
@@ -203,13 +208,30 @@ struct futex_state {
struct list_head list;
 
/*
-* The PI object:
+* Linking into the owning hashed bucket (TO futexes only)
+*/
+   struct list_head hb_list;
+
+   /*
+* The PI or mutex object:
 */
-   struct rt_mutex pi_mutex;
+   union {
+   struct rt_mutex pi_mutex;
+   struct mutex mutex;
+   };
 
+   /*
+* For PI futex, owner is the task that owns the futex.
+* For TO futex, owner is the mutex lock holder that is either
+* spinning on the futex owner or sleeping.
+*/
struct task_struct *owner;
atomic_t refcount;
 
+   enum futex_type type;
+
+   u32 handoff_pid;/* TO only, PID for lock hand-off */
+
union futex_key key;
 };
 
@@ -262,6 +284,7 @@ struct futex_hash_bucket {
atomic_t waiters;
spinlock_t lock;
struct plist_head chain;
+   struct list_head fs_list;   /* List of futex state objects */
 } cacheline_aligned_in_smp;
 
 /*
@@ -801,8 +824,11 @@ static int refill_futex_state_cache(void)
return -ENOMEM;
 
INIT_LIST_HEAD(&state->list);
+   INIT_LIST_HEAD(&state->hb_list);
+
/* pi_mutex gets initialized later */
state->owner = NULL;
+   state->handoff_pid = 0;
atomic_set(&state->refcount, 1);
state->key = FUTEX_KEY_INIT;
 
@@ -836,10 +862,10 @@ static void put_futex_state(struct futex_state *state)
return;
 
/*
-* If state->owner is NULL, the owner is most probably dying
-* and has cleaned up the futex state already
+* If state->owner is NULL and the type is TYPE_PI, the owner
+* is most probably dying and has cleaned up the state already
 */
-   if (state->owner) {
+   if (state->owner && (state->type == TYPE_PI)) {
raw_spin_lock_irq(&state->owner->pi_lock);
list_del_init(&state->list);
raw_spin_unlock_irq(&state->owner->pi_lock);
@@ -847,6 +873,10 @@ static void put_futex_state(struct futex_state *state)