[PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic
This first patch introduces the XXX_PRIVATE futex operations.

When a process uses an XXX_PRIVATE futex primitive, the kernel can avoid:

- taking a read lock on mmap_sem to find the vma that contains the futex and to learn whether it is associated with an inode (shared) or the mm (private to the process);
- taking a reference on the found inode or mm.

Even though mmap_sem is a rw_semaphore, up_read()/down_read() perform atomic ops on mmap_sem, dirtying its cache line:

- Lots of cache line ping-pongs on SMP configurations. mmap_sem is also extensively used by mm code (page faults, mmap()/munmap()), so highly threaded processes can suffer from mmap_sem contention. mmap_sem is used by oprofile code as well; enabling oprofile hurts threaded programs because of contention on the mmap_sem cache line.

- atomic_inc()/atomic_dec() on the inode or mm refcounter is another cache line ping-pong on SMP. It also increases mmap_sem hold time because of cache misses.

This first patch is possible because, for a process using PTHREAD_PROCESS_PRIVATE futexes, we only need to distinguish futexes by their virtual address, whatever the underlying mm storage is. The case of multiple virtual addresses mapped to the same physical address is just insane: "Dont do it on PROCESS_PRIVATE futexes, please ?"

If glibc wants to exploit this new infrastructure, it should use the new _PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and be prepared to fall back to the old subcommands on old kernels. Using one global variable holding FUTEX_PRIVATE_FLAG or 0 should be OK, so that at most one syscall fails.

Compatibility with old applications is preserved: they still hit the scalability problems, but new applications can fly :)

Note : SHARED futexes can be used by old binaries *and* new binaries, because both will use the old subcommands.

Note : the vast majority of futexes should be using the PROCESS_PRIVATE semantic, as this is the default. Almost all applications should benefit from this change (new kernel and updated libc).

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
---
 include/linux/futex.h |   12 +
 kernel/futex.c        |  273 +---
 2 files changed, 188 insertions(+), 97 deletions(-)

--- linux-2.6.21-rc3/kernel/futex.c	2007-03-13 13:22:31.0 +0100
+++ linux-2.6.21-rc3-ed/kernel/futex.c	2007-03-15 18:30:15.0 +0100
@@ -16,6 +16,9 @@
  * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <[EMAIL PROTECTED]>
  * Copyright (C) 2006 Timesys Corp., Thomas Gleixner <[EMAIL PROTECTED]>
  *
+ * Introduction of PRIVATE futexes by Eric Dumazet
+ * Copyright (C) 2007 Eric Dumazet <[EMAIL PROTECTED]>
+ *
  * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  * enough at me, Linus for the original (flawed) idea, Matthew
  * Kirkwood for proof-of-concept implementation.
@@ -60,8 +63,18 @@
  * Don't rearrange members without looking at hash_futex().
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
+ * We use the two low order bits of offset to tell what is the kind of key :
+ *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
+ *       (no reference on an inode or mm)
+ *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *       mapped on a file (reference on the underlying inode)
+ *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *       (but private mapping on an mm, and reference taken on it)
  */
+
+#define OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
+#define OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+
 union futex_key {
 	struct {
 		unsigned long pgoff;
@@ -129,9 +142,6 @@ struct futex_q {
 	struct task_struct *task;
 };
 
-/*
- * Split the global futex_lock into every hash list lock.
- */
 struct futex_hash_bucket {
 	spinlock_t lock;
 	struct list_head chain;
@@ -175,7 +185,8 @@ static inline int match_futex(union fute
  *
  * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
  */
-static int get_futex_key(u32 __user *uaddr, union futex_key *key)
+static int get_futex_key(u32 __user *uaddr, union futex_key *key,
+			 struct rw_semaphore *shared)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
@@ -192,6 +203,22 @@ static int get_futex_key(u32 __user *uad
 	address -= key->both.offset;
 
 	/*
+	 * PROCESS_PRIVATE futexes are fast.
+	 * As the mm cannot disappear under us and the 'key' only needs
+	 * virtual address, we dont even have to find the underlying vma.
+	 * Note : We do have to check 'address' is a valid user address,
+	 *        but access_ok() should be faster than find_vma()
+	 * Note : At this point, address points to the start of page
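The two-low-order-bits key encoding introduced by the hunk around OFF_INODE/OFF_MMSHARED can be restated in plain userspace C. This is only an illustration of the scheme; key_kind() and its strings are invented for this sketch and are not part of the patch:

```c
#define OFF_INODE    1	/* bit 0 set: key holds a reference on an inode */
#define OFF_MMSHARED 2	/* bit 1 set: key holds a reference on an mm */

/*
 * offset is the futex's offset inside its page, aligned to sizeof(u32),
 * so bits 0-1 are always zero and are free to encode the key type.
 */
static const char *key_kind(unsigned long offset)
{
	switch (offset & (OFF_INODE | OFF_MMSHARED)) {
	case 0:
		return "process private (no reference taken)";
	case OFF_INODE:
		return "shared, file backed (inode reference)";
	case OFF_MMSHARED:
		return "shared, anonymous mapping (mm reference)";
	default:
		return "invalid";	/* both bits set: never happens */
	}
}
```

The private case is the whole point of the patch: since neither bit is set, no refcounted object backs the key, so nothing has to be pinned or released when the key is created or dropped.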
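The glibc fallback scheme sketched in the changelog (one global variable, FUTEX_PRIVATE_FLAG or 0, at most one failing syscall) could look roughly like this minimal userspace sketch. The FUTEX_PRIVATE_FLAG fallback value and the futex_wait_private() helper are assumptions for illustration, not the actual glibc code:

```c
#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback definition in case the installed headers predate the patch;
 * the real value comes from the updated include/linux/futex.h. */
#ifndef FUTEX_PRIVATE_FLAG
#define FUTEX_PRIVATE_FLAG 128
#endif

/* Start optimistic; cleared once if the kernel rejects the private
 * subcommand, so only one syscall ever fails on an old kernel. */
static int futex_private_flag = FUTEX_PRIVATE_FLAG;

static long sys_futex(int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Wait on a PTHREAD_PROCESS_PRIVATE futex, falling back to the old
 * subcommand when the kernel lacks XXX_PRIVATE support. */
static long futex_wait_private(int *uaddr, int val)
{
	long ret = sys_futex(uaddr, FUTEX_WAIT | futex_private_flag, val);

	if (ret == -1 && errno == ENOSYS && futex_private_flag) {
		futex_private_flag = 0;	/* old kernel: disable for good */
		ret = sys_futex(uaddr, FUTEX_WAIT, val);
	}
	return ret;
}
```

Since futex_private_flag is simply OR-ed into the opcode, the same helper works unchanged on both old and new kernels once the single probing syscall has run.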