[PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic

2007-03-15 Thread Eric Dumazet

This first patch introduces XXX_PRIVATE futex operations.

When a process uses an XXX_PRIVATE futex primitive, the kernel can avoid
taking a read lock on mmap_sem to find the vma that contains the futex
and to learn whether it is associated with an inode (shared) or with the mm
(private to the process).

We also avoid taking a reference on the found inode or the mm.

Even though mmap_sem is a rw_semaphore, down_read()/up_read() do atomic
 ops on mmap_sem and dirty its cache line :
- Lots of cache line ping-pong on SMP configurations.

 mmap_sem is also extensively used by mm code (page faults, mmap()/munmap()).
 Highly threaded processes might suffer from mmap_sem contention.

 mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
 programs because of contention on the mmap_sem cache line.

- Using atomic_inc()/atomic_dec() on the inode or mm reference counter
 is also a cache line ping-pong on SMP. It also increases mmap_sem hold time
 because of cache misses.

This first patch is possible because, for one process using
PTHREAD_PROCESS_PRIVATE futexes, we only need to distinguish futexes by their
virtual address, no matter what the underlying mm storage is. The case of
multiple virtual addresses mapped on the same physical address is just insane :
"Don't do it on PROCESS_PRIVATE futexes, please ?"
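
To illustrate the point about virtual addresses, here is a minimal user-space
sketch of how a private key could be built and compared. It is not the kernel
code: struct private_key, make_private_key(), private_keys_match() and
SKETCH_PAGE_SIZE are hypothetical names, and the 4 KiB page size is an
assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Hypothetical model of a PROCESS_PRIVATE key: no vma, no inode. */
struct private_key {
	const void *mm;		/* stands in for current->mm */
	uintptr_t page_addr;	/* virtual address of the page start */
	unsigned int offset;	/* offset of the futex word in the page */
};

/* Returns 0 on success, -1 if uaddr is not aligned on a u32 boundary. */
static int make_private_key(const void *mm, uintptr_t uaddr,
			    struct private_key *key)
{
	if (uaddr % sizeof(uint32_t) != 0)
		return -1;
	key->mm = mm;
	key->offset = (unsigned int)(uaddr % SKETCH_PAGE_SIZE);
	key->page_addr = uaddr - key->offset;
	return 0;
}

/* Two private futexes are the same iff mm and virtual address match. */
static bool private_keys_match(const struct private_key *a,
			       const struct private_key *b)
{
	return a->mm == b->mm &&
	       a->page_addr == b->page_addr &&
	       a->offset == b->offset;
}
```

Nothing here needs the vma or the backing storage, which is why the private
path can skip mmap_sem. It is also why two virtual mappings of the same
physical page would yield two different keys.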

If glibc wants to exploit this new infrastructure, it should use the new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and
be prepared to fall back on the old subcommands for old kernels. Using one
global variable holding either FUTEX_PRIVATE_FLAG or 0 should be OK, so that
only one syscall might fail.
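
A rough sketch of that fallback, as a libc might do it. This is only an
illustration: the syscall is injected as a function pointer returning a
negative errno (kernel convention) so the logic can be exercised without a
real kernel, fake_old_kernel() is a hypothetical stand-in for a kernel that
does not know the _PRIVATE subcommands, and the value 128 for the flag is an
assumption taken from mainline headers, not from this mail.

```c
#include <errno.h>

#define SKETCH_FUTEX_WAKE		1	/* FUTEX_WAKE subcommand */
#define SKETCH_FUTEX_PRIVATE_FLAG	128	/* assumption, see lead-in */

/* Injected futex call: returns >= 0 on success, -errno on failure. */
typedef long (*futex_fn)(int *uaddr, int op, int val);

/* The one global variable: FUTEX_PRIVATE_FLAG, or 0 on old kernels. */
static int futex_private_flag = SKETCH_FUTEX_PRIVATE_FLAG;

static long futex_wake_private(futex_fn do_futex, int *uaddr, int nr)
{
	long ret = do_futex(uaddr, SKETCH_FUTEX_WAKE | futex_private_flag, nr);

	if (ret == -ENOSYS && futex_private_flag != 0) {
		/* Old kernel: remember it, retry with the old subcommand. */
		futex_private_flag = 0;
		ret = do_futex(uaddr, SKETCH_FUTEX_WAKE, nr);
	}
	return ret;
}

/* Hypothetical old kernel: rejects any op carrying the private flag. */
static int fake_calls;

static long fake_old_kernel(int *uaddr, int op, int val)
{
	(void)uaddr;
	fake_calls++;
	if (op & SKETCH_FUTEX_PRIVATE_FLAG)
		return -ENOSYS;
	return val;	/* pretend 'val' waiters were woken */
}
```

After the first failure the flag stays at 0, so later calls go straight to
the old subcommand and no further syscall is wasted.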

Compatibility with old applications is preserved : they still hit the
scalability problems, but new applications can fly :)

Note : SHARED futexes can be used by old binaries *and* new binaries,
because both binaries will use the old subcommands.

Note : The vast majority of futexes should be using the PROCESS_PRIVATE
semantic, as this is the default semantic. Almost all applications should
benefit from this change (new kernel and updated libc).
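
For reviewers, the key tagging documented in the futex.c comment of this patch
(two low order bits of the offset) can be modelled in user space as below.
OFF_INODE and OFF_MMSHARED are the names used in the patch. The enum and the
three helpers are hypothetical and only show why a u32-aligned offset leaves
two bits free.

```c
#define OFF_INODE	1	/* bit 0: key has a reference on inode */
#define OFF_MMSHARED	2	/* bit 1: key has a reference on mm */

enum key_kind { KEY_PRIVATE, KEY_SHARED_INODE, KEY_SHARED_MM };

/* Tag a u32-aligned offset with the key kind; -1 if misaligned. */
static int encode_offset(int offset, enum key_kind kind)
{
	if (offset & 3)
		return -1;
	if (kind == KEY_SHARED_INODE)
		return offset | OFF_INODE;
	if (kind == KEY_SHARED_MM)
		return offset | OFF_MMSHARED;
	return offset;		/* 00: private, no reference held */
}

/* Recover the kind of key from the two low order bits. */
static enum key_kind key_kind_of(int offset)
{
	if (offset & OFF_INODE)
		return KEY_SHARED_INODE;
	if (offset & OFF_MMSHARED)
		return KEY_SHARED_MM;
	return KEY_PRIVATE;
}

/* Strip the tag bits to get the real offset back. */
static int plain_offset(int offset)
{
	return offset & ~(OFF_INODE | OFF_MMSHARED);
}
```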

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
---
 include/linux/futex.h |   12 +
 kernel/futex.c        |  273 +---
 2 files changed, 188 insertions(+), 97 deletions(-)
--- linux-2.6.21-rc3/kernel/futex.c 2007-03-13 13:22:31.0 +0100
+++ linux-2.6.21-rc3-ed/kernel/futex.c  2007-03-15 18:30:15.0 +0100
@@ -16,6 +16,9 @@
  *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <[EMAIL PROTECTED]>
  *  Copyright (C) 2006 Timesys Corp., Thomas Gleixner <[EMAIL PROTECTED]>
  *
+ *  Introduction of PRIVATE futexes by Eric Dumazet
+ *  Copyright (C) 2007 Eric Dumazet <[EMAIL PROTECTED]>
+ *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -60,8 +63,18 @@
  * Don't rearrange members without looking at hash_futex().
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
+ * We use the two low order bits of offset to tell what is the kind of key :
+ *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
+ *   (no reference on an inode or mm)
+ *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
+ * mapped on a file (reference on the underlying inode)
+ *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *   (but private mapping on an mm, and reference taken on it)
  */
+
+#define OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
+#define OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+
 union futex_key {
struct {
unsigned long pgoff;
@@ -129,9 +142,6 @@ struct futex_q {
struct task_struct *task;
 };
 
-/*
- * Split the global futex_lock into every hash list lock.
- */
 struct futex_hash_bucket {
spinlock_t  lock;
struct list_head   chain;
@@ -175,7 +185,8 @@ static inline int match_futex(union fute
  *
  * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
  */
-static int get_futex_key(u32 __user *uaddr, union futex_key *key)
+static int get_futex_key(u32 __user *uaddr, union futex_key *key,
+   struct rw_semaphore *shared)
 {
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
@@ -192,6 +203,22 @@ static int get_futex_key(u32 __user *uad
address -= key->both.offset;
 
/*
+* PROCESS_PRIVATE futexes are fast.
+* As the mm cannot disappear under us and the 'key' only needs
+* virtual address, we dont even have to find the underlying vma.
+* Note : We do have to check 'address' is a valid user address,
+*but access_ok() should be faster than find_vma()
+* Note : At this point, address points to the start of page