Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes

2016-09-23 Thread Thomas Gleixner
On Thu, 22 Sep 2016, Waiman Long wrote:
> On 09/22/2016 09:32 AM, Thomas Gleixner wrote:
> > > + WARN_ON(!pi_state->owner);
> > > + if (pi_state->owner)
> > > + wake_up_process(pi_state->owner);
> > And what handles or sanity checks the state->hb_list ???
> > 
> 
> The exit_pi_state_list() function doesn't need to deal with state->hb_list.
> The hb_list is used to locate the futex state, but the futex owner doesn't
> have a reference to the futex state. So it won't need to decrement it and
> potentially free it.

Ok. Though that wants comments as well.
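For illustration, a comment of the kind asked for here could look roughly
like the sketch below, which merely paraphrases the explanation quoted
above (it is not taken from the patch):

	/*
	 * A TO futex state sits on the hash bucket's hb_list purely so
	 * that waiters can look it up.  The exiting futex owner holds no
	 * reference on the state, so this exit path must neither drop a
	 * reference nor unhash or free the state; the waiters that do
	 * hold references release them via put_futex_state().
	 */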
 
> > > +/*
> > > + * Spinning threshold for futex word without setting FUTEX_WAITERS.
> > > + */
> > > +#define FUTEX_SPIN_THRESHOLD (1 << 10)
> > That number is pulled out of thin air or what is the rationale for choosing
> > 1024?
> 
> It is kind of arbitrary. I want a value large enough to encourage lock
> stealing, but not so large that it may take too long to get the lock. I
> will elaborate more on the choice in the comment.

I'm not really happy about these heuristics. The chosen value fits a
particular machine and scenario. Now try to do the same on some slow ARM
SoC and the 1024 loops are going to hurt badly.
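For reference, a fixed threshold of this kind is typically applied in a
loop along the following lines. This is only a sketch of the idea under
discussion, not the patch's actual code; futex_trylock_to() is the helper
quoted later in this thread and owner_running() is an invented placeholder:

	for (loop = FUTEX_SPIN_THRESHOLD; loop > 0; loop--) {
		/* Returns nonzero once the lock is taken or a fault occurs. */
		if (futex_trylock_to(uaddr, vpid, &uval, false, &ret))
			return ret;
		/* Stop spinning once the owner is no longer on a CPU. */
		if (!owner_running(otask))
			break;
		cpu_relax();
	}
	/*
	 * Threshold exhausted or owner asleep: set FUTEX_WAITERS and sleep.
	 * The wall-clock time burned here is iterations times per-iteration
	 * cost, which is why a fixed 1024 behaves very differently on a big
	 * x86 server than on a slow ARM SoC.
	 */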
 
> Sorry, I forgot to use the kerneldoc notation. It is an RFC patch, and my
> focus was to make it work. I haven't spent much time in cleaning up the
> comment. I will do that in the next version.

RFC is not an excuse for an unreadable mess. This stuff is complex enough
that it should be in your own interest to make it well documented and
readable. If it takes a reviewer more time to disentangle the code jungle
than to think about the concept and verify the lifetime and serialization
rules, etc., then something is very wrong. And it's wasting my time and
yours. Not that I care much about your time waste :)

When Sebastian and I worked on the attached futexes, we spent more time
tidying up the code and writing documentation than on actually coding,
debugging and optimizing it.

Futexes have a gazillion corner cases and pitfalls, and your stuff, even if
it seems to be simpler, has the same issues.

> > > + WARN_ON_ONCE(!(uval & FUTEX_TID_MASK));
> > > +
> > > + if ((uval & FUTEX_TID_MASK) != opid) {
> > > + /*
> > > +  * Get the new task structure
> > > +  */
> > > + if (otask)
> > > + put_task_struct(otask);
> > > +
> > > + WARN_ON_ONCE(on_owner_list);
> > > + opid  = uval & FUTEX_TID_MASK;
> > > + otask = futex_find_get_task(opid);
> > > + }
> > Can you pretty please split out functions for this otask handling? This
> > for loop does not fit on a single screen and can't be understood in one go.
> 
> Yes, this is the largest function in the new code, I will try to add helper
> functions to make it easier to understand.

You should have done that in the first place. Because even when you develop
and experiment, it's insane to try to follow a loop with several nesting
levels that occupies _two_ screens.

It's your choice how you develop, but offering something like this for
review is just an affront.
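As an illustration of the kind of split being asked for, the owner-task
refresh in the quoted hunk could become a small helper along these lines
(hypothetical name and shape, not taken from the patch):

	static struct task_struct *to_refresh_owner(u32 uval, u32 *opid,
						    struct task_struct *otask)
	{
		u32 pid = uval & FUTEX_TID_MASK;

		/* Owner TID unchanged: keep the task reference we have. */
		if (pid == *opid)
			return otask;

		/* Drop the stale reference and look up the new owner. */
		if (otask)
			put_task_struct(otask);

		*opid = pid;
		return futex_find_get_task(pid);
	}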

> > > + pr_info("futex_spin_on_owner: pid %d grabs futex from pid %d (%s)!\n",
> > > + vpid, opid, otask ? "dying" : "invalid");
> > Really useful information to confuse sysadmins. A proper comment explaining
> > the issue and the implications and how we deal with it would have been the
> > right thing to do...
> 
> Yes, the message may be too cryptic. I will make it more clear in the next
> version.

No, this message is not too cryptic. It's entirely useless and does not
need to be made clearer. It needs to be replaced by a comment explaining
the issue. This is a legitimate thing to happen, and there is absolutely no
value in spamming dmesg about something that is expected and dealt with.

> > > + ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
> > > + if (unlikely(ret))
> > > + goto out;
> > > +
> > > + hb = hash_futex(&key);
> > > + spin_lock(&hb->lock);
> > Why are you using the global hash for this? As we have shown, the global
> > hash has horrible performance due to cross-node access and potential hash
> > bucket lock contention from unrelated processes. If we add something new
> > which is performance optimized, then we pretty please make it use separate,
> > process-private storage. I don't see any point in optimizing this for
> > process-shared futexes.
> 
> I used the hash lock for managing the futex state objects only. I don't
> need it for any other purpose. It is true that a hash collision may mean
> another wait-wake futex is using the same hash bucket, which may cause a
> performance problem. I think I will add another spinlock in the hash
> bucket just for TO futexes. In this way, a collision with a wait-wake
> futex wo

Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Waiman Long

On 09/22/2016 09:32 AM, Thomas Gleixner wrote:

> On Tue, 6 Sep 2016, Waiman Long wrote:
>
> > +enum futex_type {
> > +	TYPE_PI = 0,
> > +	TYPE_TO,
> > +};
>
> Please introduce the futex_type magic and the related changes to the pi
> code in a separate patch so it can be verified independently.
>
> It's sad that one has to explain that to you over and over 

I didn't break it out because the changes to the PI code were pretty
small. I will break it out in the next version.



> > @@ -836,10 +859,10 @@ static void put_futex_state(struct futex_state *state)
> > 	return;
> >
> > 	/*
> > -	 * If state->owner is NULL, the owner is most probably dying
> > -	 * and has cleaned up the futex state already
> > +	 * If state->owner is NULL and the type is TYPE_PI, the owner
> > +	 * is most probably dying and has cleaned up the state already
> > 	 */
> > -	if (state->owner) {
> > +	if (state->owner && (state->type == TYPE_PI)) {
> > 		raw_spin_lock_irq(&state->owner->pi_lock);
> > 		list_del_init(&state->list);
> > 		raw_spin_unlock_irq(&state->owner->pi_lock);
> > @@ -847,6 +870,11 @@ static void put_futex_state(struct futex_state *state)
> > 		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
> > 	}
> >
> > +	/*
> > +	 * Dequeue it from the HB futex state list.
> > +	 */
> > +	list_del_init(&state->hb_list);
>
> The comment above this list_del() is really pointless. I can see that from
> the code itself.
>
> Aside of that: Why do you need separate list heads? You explain the
> separate list somewhere in that big comment below, but it should be
> explained at the point where you add it to the state and the hash bucket.

Sure. Will fix the comment.


> > 	if (current->pi_state_cache)
> > 		kfree(state);
> > 	else {
> > @@ -919,13 +947,24 @@ void exit_pi_state_list(struct task_struct *curr)
> > 			continue;
> > 		}
> >
> > -		WARN_ON(pi_state->owner != curr);
> > 		WARN_ON(list_empty(&pi_state->list));
> > +		if (pi_state->type == TYPE_PI) {
> > +			WARN_ON(pi_state->owner != curr);
> > +			pi_state->owner = NULL;
> > +		}
> > 		list_del_init(&pi_state->list);
> > -		pi_state->owner = NULL;
> > 		raw_spin_unlock_irq(&curr->pi_lock);
> >
> > -		rt_mutex_unlock(&pi_state->pi_mutex);
> > +		if (pi_state->type == TYPE_PI)
>
> lacks curly braces

Yes, you are right.

> > +			rt_mutex_unlock(&pi_state->pi_mutex);
> > +		else if (pi_state->type == TYPE_TO) {
> > +			/*
> > +			 * Need to wakeup the mutex owner.
> > +			 */
>
> Another completely useless comment. Because you tell what you do, but not
> WHY.

Will elaborate on why the wakeup is needed here.

> > +			WARN_ON(!pi_state->owner);
> > +			if (pi_state->owner)
> > +				wake_up_process(pi_state->owner);
>
> And what handles or sanity checks the state->hb_list ???

The exit_pi_state_list() function doesn't need to deal with
state->hb_list. The hb_list is used to locate the futex state, but the
futex owner doesn't have a reference to the futex state. So it won't
need to decrement it and potentially free it.



> > +/*
> > + * Try to lock the userspace futex word (0 => vpid).
> > + *
> > + * Return: 1 if lock acquired or an error happens, 0 if not.
> > + *	   The status code will be 0 if no error, or < 0 if an error happens.
> > + *	   *puval will contain the latest futex value when trylock fails.
> > + *
> > + * The waiter flag, if set, will make it ignore the FUTEX_WAITERS bit.
> > + * The HB spinlock should NOT be held while calling this function. A
> > + * successful lock acquisition will clear the waiter and died bits.
> > + */
> > +static inline int futex_trylock_to(u32 __user *uaddr, u32 vpid, u32 *puval,
> > +				   const bool waiter, int *status)
> > +{
> > +	u32 uval;
> > +
> > +	*status = 0;
> > +
> > +	if (unlikely(get_user(uval, uaddr)))
> > +		goto efault;
> > +
> > +	*puval = uval;
> > +
> > +	if (waiter ? (uval & FUTEX_TID_MASK) : uval)
> > +		return 0;	/* Trylock fails */
>
> Please do not use tail comments. They are hard to parse.

OK, will move the comment up.

> > +
> > +	if (unlikely(futex_atomic_cmpxchg_inatomic(puval, uaddr, uval, vpid)))
> > +		goto efault;
> > +
> > +	return *puval == uval;
> > +
> > +efault:
> > +	*status = -EFAULT;
> > +	return 1;
> > +}
>
> Do we really need another variant of cmpxchg and why do you need that extra
> status? What's wrong in doing the magic in the return value?

This is not another variant of cmpxchg. It is the cmpxchg used by
cmpxchg_futex_value_locked(). The only difference is that page faults
are disabled in the locked version. I call
futex_atomic_cmpxchg_inatomic() directly because it is called without
the HB spinlock, so I don't need to disable page faults. I will add a
separate patch to introduce 
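For context, cmpxchg_futex_value_locked() in the existing futex code is
essentially a page-fault-disabled wrapper around the same primitive,
roughly as sketched here:

	static int cmpxchg_futex_value_locked(u32 *curval, u32 __user *uaddr,
					      u32 uval, u32 newval)
	{
		int ret;

		pagefault_disable();
		ret = futex_atomic_cmpxchg_inatomic(curval, uaddr, uval, newval);
		pagefault_enable();

		return ret;
	}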

Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes

2016-09-22 Thread Thomas Gleixner
On Tue, 6 Sep 2016, Waiman Long wrote:
> +enum futex_type {
> + TYPE_PI = 0,
> + TYPE_TO,
> +};

Please introduce the futex_type magic and the related changes to the pi
code in a separate patch so it can be verified independently.

It's sad that one has to explain that to you over and over 

> @@ -836,10 +859,10 @@ static void put_futex_state(struct futex_state *state)
>   return;
>  
>   /*
> -  * If state->owner is NULL, the owner is most probably dying
> -  * and has cleaned up the futex state already
> +  * If state->owner is NULL and the type is TYPE_PI, the owner
> +  * is most probably dying and has cleaned up the state already
>*/
> - if (state->owner) {
> + if (state->owner && (state->type == TYPE_PI)) {
>   raw_spin_lock_irq(&state->owner->pi_lock);
>   list_del_init(&state->list);
>   raw_spin_unlock_irq(&state->owner->pi_lock);
> @@ -847,6 +870,11 @@ static void put_futex_state(struct futex_state *state)
>   rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
>   }
>  
> + /*
> +  * Dequeue it from the HB futex state list.
> +  */
> + list_del_init(&state->hb_list);

The comment above this list_del() is really pointless. I can see that from
the code itself.

Aside of that: Why do you need separate list heads? You explain the
separate list somewhere in that big comment below, but it should be
explained at the point where you add it to the state and the hash bucket.

>   if (current->pi_state_cache)
>   kfree(state);
>   else {
> @@ -919,13 +947,24 @@ void exit_pi_state_list(struct task_struct *curr)
>   continue;
>   }
>  
> - WARN_ON(pi_state->owner != curr);
>   WARN_ON(list_empty(&pi_state->list));
> + if (pi_state->type == TYPE_PI) {
> + WARN_ON(pi_state->owner != curr);
> + pi_state->owner = NULL;
> + }
>   list_del_init(&pi_state->list);
> - pi_state->owner = NULL;
>   raw_spin_unlock_irq(&curr->pi_lock);
>  
> - rt_mutex_unlock(&pi_state->pi_mutex);
> + if (pi_state->type == TYPE_PI)

lacks curly braces

> + rt_mutex_unlock(&pi_state->pi_mutex);
> + else if (pi_state->type == TYPE_TO) {
> + /*
> +  * Need to wakeup the mutex owner.
> +  */

Another completely useless comment. Because you tell what you do, but not
WHY.

> + WARN_ON(!pi_state->owner);
> + if (pi_state->owner)
> + wake_up_process(pi_state->owner);

And what handles or sanity checks the state->hb_list ???

> +/*
> + * Try to lock the userspace futex word (0 => vpid).
> + *
> + * Return: 1 if lock acquired or an error happens, 0 if not.
> + *  The status code will be 0 if no error, or < 0 if an error happens.
> + *  *puval will contain the latest futex value when trylock fails.
> + *
> + * The waiter flag, if set, will make it ignore the FUTEX_WAITERS bit.
> + * The HB spinlock should NOT be held while calling this function. A
> + * successful lock acquisition will clear the waiter and died bits.
> + */
> +static inline int futex_trylock_to(u32 __user *uaddr, u32 vpid, u32 *puval,
> +const bool waiter, int *status)
> +{
> + u32 uval;
> +
> + *status = 0;
> +
> + if (unlikely(get_user(uval, uaddr)))
> + goto efault;
> +
> + *puval = uval;
> +
> + if (waiter ? (uval & FUTEX_TID_MASK) : uval)
> + return 0;   /* Trylock fails */

Please do not use tail comments. They are hard to parse. 

> +
> + if (unlikely(futex_atomic_cmpxchg_inatomic(puval, uaddr, uval, vpid)))
> + goto efault;
> +
> + return *puval == uval;
> +
> +efault:
> + *status = -EFAULT;
> + return 1;
> +}

Do we really need another variant of cmpxchg and why do you need that extra
status? What's wrong in doing the magic in the return value?

This is utter crap, really.
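For illustration, folding the status into the return value could look
roughly like the hypothetical rework below (derived from the quoted
helper, not code from the patch): return -EFAULT on a fault, 0 if the
trylock fails, and 1 if the lock was acquired.

	static inline int futex_trylock_to(u32 __user *uaddr, u32 vpid,
					   u32 *puval, const bool waiter)
	{
		u32 uval;

		if (unlikely(get_user(uval, uaddr)))
			return -EFAULT;

		*puval = uval;

		/* A nonzero owner TID (or any nonzero word) fails the trylock. */
		if (waiter ? (uval & FUTEX_TID_MASK) : uval)
			return 0;

		if (unlikely(futex_atomic_cmpxchg_inatomic(puval, uaddr,
							   uval, vpid)))
			return -EFAULT;

		return *puval == uval;
	}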

> +
> +/*
> + * Spinning threshold for futex word without setting FUTEX_WAITERS.
> + */
> +#define FUTEX_SPIN_THRESHOLD (1 << 10)

That number is pulled out of thin air or what is the rationale for choosing
1024?

> +/*
> + * Spin on the futex word while the futex owner is active. Otherwise, set
> + * the FUTEX_WAITERS bit and go to sleep. As we take a reference to the futex
> + * owner's task structure, we don't need to use RCU to ensure that the task
> + * structure is valid. The function will directly grab the lock if the
> + * owner is dying or the pid is invalid. That should take care of the problem
> + * of dead lock owners unless the pid wraps around and the perceived owner is
> + * not the real owner.
> + *
> + * Return: 0 if futex acquired, < 0 if an error happens.

If you document functions then pl
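For illustration, the same header in kerneldoc notation (which was already
promised earlier in the thread) might look like the sketch below; the
function name is taken from the pr_info() quoted above and the parameter
names are assumed rather than copied from the patch:

	/**
	 * futex_spin_on_owner - spin on a TO futex while its owner is running
	 * @uaddr:	userspace address of the futex word
	 * @vpid:	TID to install in the futex word on acquisition
	 * @state:	futex state object for this futex (assumed parameter)
	 *
	 * Spin on the futex word while the futex owner is active; otherwise
	 * set the FUTEX_WAITERS bit and go to sleep.  A reference is taken on
	 * the owner's task structure, so RCU is not needed to keep it valid.
	 * The lock is grabbed directly if the owner is dying or the pid is
	 * invalid, which handles dead lock owners unless the pid wraps around
	 * and the perceived owner is not the real one.
	 *
	 * Return: 0 if the futex was acquired, < 0 if an error happens.
	 */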