Re: Futex queue_me/get_user ordering
Ingo Molnar wrote:
> * Jakub Jelinek <[EMAIL PROTECTED]> wrote:
>
> > The futex man pages that have been around for years (certainly since
> > mid 2002) certainly don't document FUTEX_WAIT as token passing
> > operation, but as atomic operation:
> >
> > Say http://www.icewalkers.com/Linux/ManPages/futex-2.html
>
> besides this documented-behavior argument, i dont think futexes should
> be degraded into waitqueues

I give in... Depending on atomicity makes it impossible for an
application which is linked with NPTL and Glibc to write an
NPTL-compatible "wait on two locks" function. I'm not saying that's a
very clean thing to want, but it's a conceptual loss, and I'm
disappointed I seem to be the only one noticing it.

On the other hand, I was mistaken to think it makes it impossible to
write an emulation of synchronous futex() in terms of asynchronous
futex().* In fact it makes it impossible to do so using the existing
FUTEX_FD, but it would be possible if there were a FUTEX_FD2 added
somewhere down the line.

* - The reason you would do this is if you were writing
userspace-threading for any reason, and you had to include an emulation
of synchronous futex() in terms of async futex, because there are some
libraries which might run on top of the userspace-threading which use
futex in an application-dependent way.

> - in fact, to solve some of the known
> performance problems the opposite will have to happen: e.g. i believe
> that in the future we'll need to enable the kernel-side futex code to
> actually modify the futex variable. I.e. atomicity of the read in
> FUTEX_WAIT is an absolute must, and is only the first step.

Some of those performance problems can be solved already by better use
of FUTEX_REQUEUE instead of FUTEX_WAKE.

> [ the double-context-switch problem in cond_signal() that Jamie
>   mentioned is precisely one such case: pthread semantics force us that
>   the wakeup of the wakee _must_ happen while still holding the internal
>   lock. So we cannot just delay the wakeup to outside the glibc critical
>   section. This double context-switch could be avoided if the 'release
>   internal lock and wake up wakee' operation could be done atomically
>   within the kernel. (A sane default 'userspace unlock' operation on a
>   machine word could be defined .. e.g. decrement-to-zero.) ]

Did you not see the solution I gave last November, using FUTEX_REQUEUE?
See: http://lkml.org/lkml/2004/11/29/201

I spent a _lot_ of time figuring it out, but everyone was too busy to
confirm that it worked. It would improve performance in a number of
cases. I hope that it does not get ignored yet again.

There _may_ be cases where more complex futex operations are needed, but
we should try the better algorithms that use the existing futex
operations before adding new ones.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: Futex queue_me/get_user ordering
* Jakub Jelinek <[EMAIL PROTECTED]> wrote:

> The futex man pages that have been around for years (certainly since
> mid 2002) certainly don't document FUTEX_WAIT as token passing
> operation, but as atomic operation:
>
> Say http://www.icewalkers.com/Linux/ManPages/futex-2.html

besides this documented-behavior argument, i dont think futexes should
be degraded into waitqueues - in fact, to solve some of the known
performance problems the opposite will have to happen: e.g. i believe
that in the future we'll need to enable the kernel-side futex code to
actually modify the futex variable. I.e. atomicity of the read in
FUTEX_WAIT is an absolute must, and is only the first step.

[ the double-context-switch problem in cond_signal() that Jamie
  mentioned is precisely one such case: pthread semantics force us that
  the wakeup of the wakee _must_ happen while still holding the internal
  lock. So we cannot just delay the wakeup to outside the glibc critical
  section. This double context-switch could be avoided if the 'release
  internal lock and wake up wakee' operation could be done atomically
  within the kernel. (A sane default 'userspace unlock' operation on a
  machine word could be defined .. e.g. decrement-to-zero.) ]

so i'm very much in favor of your patch - it fixes a real bug and is
also the right step forward. We'll need more locking code in the kernel
to remove fundamental limitations of userspace (such as no ability to
control preemption), not less.

i've tested your latest patch (from today) on x86 and it boots/works
fine with Fedora userspace, where futexes do get utilized, and ran a
few tests as well. (Andrew - might make sense to include in the next
-mm so that we get some feel of stability, while the conceptual
discussion continues?)

	Ingo

--

this patch makes FUTEX_WAIT atomic again.
Signed-off-by: Jakub Jelinek <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

--- linux/kernel/futex.c.orig
+++ linux/kernel/futex.c
@@ -97,7 +97,6 @@ struct futex_q {
  */
 struct futex_hash_bucket {
        spinlock_t lock;
-       unsigned int nqueued;
        struct list_head chain;
 };
@@ -265,7 +264,6 @@ static inline int get_futex_value_locked
        inc_preempt_count();
        ret = __copy_from_user_inatomic(dest, from, sizeof(int));
        dec_preempt_count();
-       preempt_check_resched();

        return ret ? -EFAULT : 0;
 }
@@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u
        struct list_head *head1;
        struct futex_q *this, *next;
        int ret, drop_count = 0;
-       unsigned int nqueued;

 retry:
        down_read(&current->mm->mmap_sem);
@@ -354,23 +351,22 @@ static int futex_requeue(unsigned long u
        bh1 = hash_futex(&key1);
        bh2 = hash_futex(&key2);

-       nqueued = bh1->nqueued;
+       if (bh1 < bh2)
+               spin_lock(&bh1->lock);
+       spin_lock(&bh2->lock);
+       if (bh1 > bh2)
+               spin_lock(&bh1->lock);
+
        if (likely(valp != NULL)) {
                int curval;

-               /* In order to avoid doing get_user while
-                  holding bh1->lock and bh2->lock, nqueued
-                  (monotonically increasing field) must be first
-                  read, then *uaddr1 fetched from userland and
-                  after acquiring lock nqueued field compared with
-                  the stored value.  The smp_mb () below
-                  makes sure that bh1->nqueued is read from memory
-                  before *uaddr1.  */
-               smp_mb();
-
                ret = get_futex_value_locked(&curval, (int __user *)uaddr1);

                if (unlikely(ret)) {
+                       spin_unlock(&bh1->lock);
+                       if (bh1 != bh2)
+                               spin_unlock(&bh2->lock);
+
                        /* If we would have faulted, release mmap_sem, fault
                         * it in and start all over again.
                         */
@@ -385,21 +381,10 @@ static int futex_requeue(unsigned long u
                }

                if (curval != *valp) {
                        ret = -EAGAIN;
-                       goto out;
+                       goto out_unlock;
                }
        }

-       if (bh1 < bh2)
-               spin_lock(&bh1->lock);
-       spin_lock(&bh2->lock);
-       if (bh1 > bh2)
-               spin_lock(&bh1->lock);
-
-       if (unlikely(nqueued != bh1->nqueued && valp != NULL)) {
-               ret = -EAGAIN;
-               goto out_unlock;
-       }
-
        head1 = &bh1->chain;
        list_for_each_entry_safe(this, next, head1, list) {
                if (!match_futex (&this->key, &key1))
@@ -435,13 +420,9 @@ out:
        return ret;
 }

-/*
- * queue_me and unqueue_me must be called as a pair, each
- * exactly once.  They are called with the hashed spinlock held.
- */
-
 /* The key must be already stored in q->key. */
-static void queue_me(struct futex_q *q, int fd, struct fi
Re: Futex queue_me/get_user ordering
On Thu, Mar 17, 2005 at 03:20:31PM +, Jamie Lokier wrote:
> > [numerous instances of...]
> > +       preempt_check_resched();
>
> Not required.  The spin unlocks will do this.

Here is the updated patch with those removed (all of them are preceded
by spin_unlock), and with the out_unqueue label and the following
unused code removed too.

--- linux-2.6.11/kernel/futex.c.jj	2005-03-17 04:42:29.0 -0500
+++ linux-2.6.11/kernel/futex.c	2005-03-18 05:45:29.0 -0500
@@ -97,7 +97,6 @@ struct futex_q {
  */
 struct futex_hash_bucket {
        spinlock_t lock;
-       unsigned int nqueued;
        struct list_head chain;
 };
@@ -265,7 +264,6 @@ static inline int get_futex_value_locked
        inc_preempt_count();
        ret = __copy_from_user_inatomic(dest, from, sizeof(int));
        dec_preempt_count();
-       preempt_check_resched();

        return ret ? -EFAULT : 0;
 }
@@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u
        struct list_head *head1;
        struct futex_q *this, *next;
        int ret, drop_count = 0;
-       unsigned int nqueued;

 retry:
        down_read(&current->mm->mmap_sem);
@@ -354,23 +351,22 @@ static int futex_requeue(unsigned long u
        bh1 = hash_futex(&key1);
        bh2 = hash_futex(&key2);

-       nqueued = bh1->nqueued;
+       if (bh1 < bh2)
+               spin_lock(&bh1->lock);
+       spin_lock(&bh2->lock);
+       if (bh1 > bh2)
+               spin_lock(&bh1->lock);
+
        if (likely(valp != NULL)) {
                int curval;

-               /* In order to avoid doing get_user while
-                  holding bh1->lock and bh2->lock, nqueued
-                  (monotonically increasing field) must be first
-                  read, then *uaddr1 fetched from userland and
-                  after acquiring lock nqueued field compared with
-                  the stored value.  The smp_mb () below
-                  makes sure that bh1->nqueued is read from memory
-                  before *uaddr1.  */
-               smp_mb();
-
                ret = get_futex_value_locked(&curval, (int __user *)uaddr1);

                if (unlikely(ret)) {
+                       spin_unlock(&bh1->lock);
+                       if (bh1 != bh2)
+                               spin_unlock(&bh2->lock);
+
                        /* If we would have faulted, release mmap_sem, fault
                         * it in and start all over again.
                         */
@@ -385,21 +381,10 @@ static int futex_requeue(unsigned long u
                }

                if (curval != *valp) {
                        ret = -EAGAIN;
-                       goto out;
+                       goto out_unlock;
                }
        }

-       if (bh1 < bh2)
-               spin_lock(&bh1->lock);
-       spin_lock(&bh2->lock);
-       if (bh1 > bh2)
-               spin_lock(&bh1->lock);
-
-       if (unlikely(nqueued != bh1->nqueued && valp != NULL)) {
-               ret = -EAGAIN;
-               goto out_unlock;
-       }
-
        head1 = &bh1->chain;
        list_for_each_entry_safe(this, next, head1, list) {
                if (!match_futex (&this->key, &key1))
@@ -435,13 +420,9 @@ out:
        return ret;
 }

-/*
- * queue_me and unqueue_me must be called as a pair, each
- * exactly once.  They are called with the hashed spinlock held.
- */
-
 /* The key must be already stored in q->key. */
-static void queue_me(struct futex_q *q, int fd, struct file *filp)
+static inline struct futex_hash_bucket *
+queue_lock(struct futex_q *q, int fd, struct file *filp)
 {
        struct futex_hash_bucket *bh;
@@ -455,11 +436,35 @@ static void queue_me(struct futex_q *q,
        q->lock_ptr = &bh->lock;

        spin_lock(&bh->lock);
-       bh->nqueued++;
+       return bh;
+}
+
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh)
+{
        list_add_tail(&q->list, &bh->chain);
        spin_unlock(&bh->lock);
 }
+
+static inline void
+queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh)
+{
+       spin_unlock(&bh->lock);
+       drop_key_refs(&q->key);
+}
+
+/*
+ * queue_me and unqueue_me must be called as a pair, each
+ * exactly once.  They are called with the hashed spinlock held.
+ */
+
+/* The key must be already stored in q->key. */
+static void queue_me(struct futex_q *q, int fd, struct file *filp)
+{
+       struct futex_hash_bucket *bh;
+       bh = queue_lock(q, fd, filp);
+       __queue_me(q, bh);
+}
+
 /* Return 1 if we were still queued (ie.
    0 means we were woken) */
 static int unqueue_me(struct futex_q *q)
 {
@@ -503,6 +508,7 @@ static int futex_wait(unsigned long uadd
        DECLARE_WAITQUEUE(wait, current);
        int ret, curval;
        struct futex_q q;
+       struct futex_hash_bucket *bh;

 retry:
        down_read(&current->mm->mmap_sem);
@@ -511,7 +517,7 @@ static int futex_wait(unsigned long uadd
        if (unlikely(ret != 0))
                goto out_release_sem;

-       queue_me(&q, -1,
Re: Futex queue_me/get_user ordering
On Thu, Mar 17, 2005 at 03:20:31PM +, Jamie Lokier wrote:
> If you change futex_wait to be "atomic", and then have userspace locks
> which _depend_ on that atomicity, it becomes impossible to wait on
> multiple of those locks, or make poll-driven state machines which can
> wait on those locks.

The futex man pages that have been around for years (certainly since
mid 2002) certainly don't document FUTEX_WAIT as a token passing
operation, but as an atomic operation:

Say http://www.icewalkers.com/Linux/ManPages/futex-2.html

       FUTEX_WAIT
              This operation atomically verifies that the futex address
              still contains the value given, and sleeps awaiting
              FUTEX_WAKE on this futex address. If the timeout argument
              is non-NULL, its contents describe the maximum duration
              of the wait, which is infinite otherwise. For futex(4),
              this call is executed if decrementing the count gave a
              negative value (indicating contention), and will sleep
              until another process releases the futex and executes the
              FUTEX_WAKE operation.

RETURN VALUE
       FUTEX_WAIT
              Returns 0 if the process was woken by a FUTEX_WAKE call.
              In case of timeout, ETIMEDOUT is returned. If the futex
              was not equal to the expected value, the operation
              returns EWOULDBLOCK. Signals (or other spurious wakeups)
              cause FUTEX_WAIT to return EINTR.

so there very well might be programs other than glibc that depend on
this behaviour. Given that in most cases the race is not hit every day
(after all, we have been living with it for several years), they
probably wouldn't know there is a problem like that.

> You can do userspace threading and simulate most blocking system calls
> by making them non-blocking and using poll().

Sure, but then you need to write your own locking as well and can just
use the token passing property of futexes there.

> It's not a _huge_ loss, but considering it's only Glibc which is
> demanding this and futexes have another property, token-passing, which
> Glibc could be using instead - why not use it?

Because that requires the requeue being done with the cv lock held,
which means an extra context switch.

> > @@ -265,7 +264,6 @@ static inline int get_futex_value_locked
> >         inc_preempt_count();
> >         ret = __copy_from_user_inatomic(dest, from, sizeof(int));
> >         dec_preempt_count();
> > -       preempt_check_resched();
> >
> >         return ret ? -EFAULT : 0;
> >  }
>
> inc_preempt_count() and dec_preempt_count() aren't needed, as
> preemption is disabled by the queue spinlocks. So
> get_futex_value_locked isn't needed any more: with the spinlocks held,
> __get_user will do.

They aren't needed if CONFIG_PREEMPT. But with !CONFIG_PREEMPT, they
are IMHO still needed, as spin_lock/spin_unlock call
preempt_{disable,enable}, which is a nop if !CONFIG_PREEMPT.
__get_user can't be used though, it should be __get_user_inatomic (or
__copy_from_user_inatomic if the former doesn't exist).

> > [numerous instances of...]
> > +       preempt_check_resched();
>
> Not required.  The spin unlocks will do this.

True, preempt_check_resched() is a nop if !CONFIG_PREEMPT, and for
CONFIG_PREEMPT spin_unlock will handle it. Will remove them from the
patch.

> > But with the recent changes to futex.c I think kernel can ensure
> > atomicity for free.
>
> I agree it would probably not slow the kernel, but I would _strongly_
> prefer that Glibc were fixed to use the token-passing property, if
> Glibc is the driving intention behind this patch - instead of this
> becoming a semantic that application-level users of futex (like
> database and IPC libraries) come to depend on and which can't be
> decomposed into a multiple-waiting form.
>
> (I admit that the kernel code does look nicer with
> get_futex_value_locked gone, though).
>
> By the way, do you know of Scott Snyder's recent work on fixing Glibc
> in this way?  He bumped into one of Glibc's currently broken corner
> cases, fixed it (according to the algorithm I gave in November), and
> reported that it works fine with the fix.

I certainly haven't seen his patch.
	Jakub
Re: Futex queue_me/get_user ordering
Jakub Jelinek wrote:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/0953.html
>
> Your argument in November was that you don't want to slow down the
> kernel and that userland must be able to cope with the
> non-atomicity of futex syscall.

Those were two of them. But my other main concern is conceptual.

Right now, a futex_wait call is roughly equivalent to add_wait_queue,
which is quite versatile. It means anything you can do with one futex,
you can extend to multiple futexes (e.g. waiting on more than one
lock), and you can do asynchronously (e.g. futex_wait can be
implemented in userspace as futex_fd[1] + poll[2], and therefore things
like poll-driven state machines where one of the state machines wants
to wait on a lock are possible).

[1] Ulrich was mistaken in his paper to say futex_fd needs to check a
word to be useful; userspace is supposed to check the word after
futex_fd and before polling or waiting on it. This is more useful
because it extends to multiple futexes.

[2] Actually it can't right now, because of a flaw in futex_fd's poll
function, but that could be fixed. The _principle_ is sound.

If you change futex_wait to be "atomic", and then have userspace locks
which _depend_ on that atomicity, it becomes impossible to wait on
multiple of those locks, or make poll-driven state machines which can
wait on those locks.

There are applications and libraries which use futex, not just for
threading but for things like database locks in files. You can do
userspace threading and simulate most blocking system calls by making
them non-blocking and using poll().

(I'm not saying anything against NPTL by this, by the way - NPTL is a
very good general purpose library - but there are occasions when an
application wants to do its own equivalent of simulated blocking
system calls for one reason or another. My favourite being research
into inter-thread JIT-optimisation in an environment like valgrind.)
Right now, in principle, futex_wait is among the system calls which can
be simulated by making them non-blocking (= futex_fd) and using
poll()[2]. Which means programs using futex themselves can be subject
to interesting thread optimisations by code which knows nothing about
the program (similar to valgrind).

If you change futex_wait to be "atomic", then it would be _impossible_
to take some random 3rd party library which is using that futex_wait,
and convert its blocking system calls to use poll-driven state machines
instead.

I think taking that away would be a great conceptual loss. It's not a
_huge_ loss, but considering it's only Glibc which is demanding this,
and futexes have another property, token-passing, which Glibc could be
using instead - why not use it?

That said, let's look at your patch.

> It would simplify requeue implementation (getting rid of the nqueued
> field),

The change to FUTEX_REQUEUE2 is an improvement :)  nqueued is an
abomination, like the rest of FUTEX_REQUEUE2 :)

> @@ -265,7 +264,6 @@ static inline int get_futex_value_locked
>         inc_preempt_count();
>         ret = __copy_from_user_inatomic(dest, from, sizeof(int));
>         dec_preempt_count();
> -       preempt_check_resched();
>
>         return ret ? -EFAULT : 0;
>  }

inc_preempt_count() and dec_preempt_count() aren't needed, as
preemption is disabled by the queue spinlocks. So
get_futex_value_locked isn't needed any more: with the spinlocks held,
__get_user will do.

> [numerous instances of...]
> +       preempt_check_resched();

Not required. The spin unlocks will do this.

> But with the recent changes to futex.c I think kernel can ensure
> atomicity for free.
I agree it would probably not slow the kernel, but I would _strongly_
prefer that Glibc were fixed to use the token-passing property, if
Glibc is the driving intention behind this patch - instead of this
becoming a semantic that application-level users of futex (like
database and IPC libraries) come to depend on and which can't be
decomposed into a multiple-waiting form.

(I admit that the kernel code does look nicer with
get_futex_value_locked gone, though.)

By the way, do you know of Scott Snyder's recent work on fixing Glibc
in this way? He bumped into one of Glibc's currently broken corner
cases, fixed it (according to the algorithm I gave in November), and
reported that it works fine with the fix.

-- Jamie
Re: Futex queue_me/get_user ordering
On Thu, Nov 18, 2004 at 02:47:26PM -0500, Jakub Jelinek wrote:
> The scenario described in futex_wait-fix.patch IMHO can happen even
> if all calls to pthread_cond_signal are done with mutex held around
> it, i.e. (A and B are waiters, X and Y are signallers;
> [total/wake/woken] is the condvar bookkeeping, (N) the futex value,
> {...} the kernel wait queue):
>
>   A: pthread_mutex_lock (&mtx);
>   A: pthread_cond_wait (&cv, &mtx);
>   A:   - mtx release *)
>   A:   total++                          [1/0/0] (0) {}
>   X: pthread_mutex_lock (&mtx);
>   X: pthread_cond_signal (&cv);
>   X:   - wake++                         [1/1/0] (1) {}
>   X:   FUTEX_WAKE, 1 (returns, nothing is queued)
>   X: pthread_mutex_unlock (&mtx);
>   B: pthread_mutex_lock (&mtx);
>   B: pthread_cond_wait (&cv, &mtx);
>   B:   - mtx release *)
>   B:   total++                          [2/1/0] (1) {}
>   A:   FUTEX_WAIT, 0
>   A:   queue_me                         [2/1/0] (1) {A}
>   A:   0 != 1
>   B:   FUTEX_WAIT, 1
>   B:   queue_me                         [2/1/0] (1) {A,B}
>   B:   1 == 1
>   Y: pthread_mutex_lock (&mtx);
>   Y: pthread_cond_signal (&cv);
>   Y:   - wake++                         [2/2/0] (2) {A,B}
>   Y:   FUTEX_WAKE, 1 (unqueues incorrectly A)
>                                         [2/2/0] (2) {B}
>   Y: pthread_mutex_unlock (&mtx);
>   A:   try to dequeue but already dequeued
>        (would normally return EWOULDBLOCK here,
>        but as unqueue_me failed, returns 0)
>   A:   woken++                          [2/2/1] (2) {B}
>   B:   schedule_timeout (forever)
>   A:   - mtx reacquire
>   A: pthread_cond_wait returns
>   A: pthread_mutex_unlock (&mtx);
>
> ---
> the code would like to say pthread_mutex_unlock (&mtx);
> and pthread_exit here, but never reaches there.

...

http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/0953.html

Your argument in November was that you don't want to slow down the
kernel and that userland must be able to cope with the non-atomicity
of the futex syscall. But with the recent changes to futex.c I think
the kernel can ensure atomicity for free.

With get_futex_value_locked doing the user access in_atomic () and
repeating if that failed, I think it would be just a matter of
something as in the patch below (totally untested though). It would
simplify the requeue implementation (getting rid of the nqueued
field), as well as never enqueue a futex in futex_wait until the
*uaddr == val uaccess check has shown it should be enqueued.
And I don't think the kernel will be any slower because of that: in the
common case where get_futex_value_locked does not cause a mm fault
(userland typically accessed that memory a few cycles before the
syscall), the futex_wait change is just about doing the first half of
queue_me before the user access and the second half after it.

--- linux-2.6.11/kernel/futex.c.jj	2005-03-17 04:42:29.0 -0500
+++ linux-2.6.11/kernel/futex.c	2005-03-17 05:13:45.0 -0500
@@ -97,7 +97,6 @@ struct futex_q {
  */
 struct futex_hash_bucket {
        spinlock_t lock;
-       unsigned int nqueued;
        struct list_head chain;
 };
@@ -265,7 +264,6 @@ static inline int get_futex_value_locked
        inc_preempt_count();
        ret = __copy_from_user_inatomic(dest, from, sizeof(int));
        dec_preempt_count();
-       preempt_check_resched();

        return ret ? -EFAULT : 0;
 }
@@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u
        struct list_head *head1;
        struct futex_q *this, *next;
        int ret, drop_count = 0;
-       unsigned int nqueued;

 retry:
        down_read(&current->mm->mmap_sem);
@@ -354,23 +351,24 @@ static int futex_requeue(unsigned long u
        bh1 = hash_futex(&key1);
        bh2 = hash_futex(&key2);

-       nqueued = bh1->nqueued;
+       if (bh1 < bh2)
+               spin_lock(&bh1->lock);
+       spin_lock(&bh2->lock);
+       if (bh1 > bh2)
+               spin_lock(&bh1->lock);
+
        if (likely(valp != NULL)) {
                int curval;

-               /* In order to avoid doing get_user while
-                  holding bh1->lock and bh2->lock, nqueued
-                  (monotonically increasing field) must be first
-                  read, then *uaddr1 fetched from userland and
-                  after acquiring lock nqueued field compared with
-                  the stored value.  The smp_mb () below
-                  makes sure that bh1->nqueued is read from memory
-                  before *uaddr1.  */
-               smp_mb();
-
                ret = get_futex_value_locked(&curval, (int __user *)uaddr1);

                if (unlikely(ret)) {
+                       spin_unlock(&bh1->lock);
+                       if (bh1 != bh2)
+                               spin_unlock(&bh2->lock);
+
+                       preempt_check_resched();
+