Re: [PATCH v9 1/5] qrwlock: A queue read/write lock implementation

2014-01-21 Thread Waiman Long

On 01/20/2014 10:21 AM, Peter Zijlstra wrote:

On Tue, Jan 14, 2014 at 11:44:03PM -0500, Waiman Long wrote:

+#ifndef arch_mutex_cpu_relax
+# define arch_mutex_cpu_relax() cpu_relax()
+#endif

Include <linux/mutex.h>



Will do so.


+#ifndef smp_load_acquire
+# ifdef CONFIG_X86
+#   define smp_load_acquire(p) \
+   ({  \
+   typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+   barrier();  \
+   ___p1;  \
+   })
+# else
+#   define smp_load_acquire(p) \
+   ({  \
+   typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+   smp_mb();   \
+   ___p1;  \
+   })
+# endif
+#endif
+
+#ifndef smp_store_release
+# ifdef CONFIG_X86
+#   define smp_store_release(p, v) \
+   do {\
+   barrier();  \
+   ACCESS_ONCE(*p) = v;\
+   } while (0)
+# else
+#   define smp_store_release(p, v) \
+   do {\
+   smp_mb();   \
+   ACCESS_ONCE(*p) = v;\
+   } while (0)
+# endif
+#endif

Remove these.


Will do that.


+/*
+ * If an xadd (exchange-add) macro isn't available, simulate one with
+ * the atomic_add_return() function.
+ */
+#ifdef xadd
+# define qrw_xadd(rw, inc) xadd(&(rw).rwc, inc)
+#else
+# define qrw_xadd(rw, inc) (u32)(atomic_add_return(inc, &(rw).rwa) - inc)
+#endif

Is GCC really so stupid that you cannot always use the
atomic_add_return()? The x86 atomic_add_return is i + xadd(), so you'll
end up with:

  i + xadd() - i

Surely it can just remove the two i terms?


I guess gcc should do the right thing. I will remove the macro.
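
As a quick user-space illustration (GCC __atomic builtins rather than the
kernel API; the helper names are made up for the sketch), the compiler does
fold the cancelling terms, so an unconditional atomic_add_return()-based
qrw_xadd() should cost nothing extra:

#include <stdint.h>

/* Same shape as atomic_add_return(): add and return the NEW value. */
static inline uint32_t add_return(uint32_t inc, uint32_t *p)
{
	return __atomic_add_fetch(p, inc, __ATOMIC_SEQ_CST);
}

/*
 * What qrw_xadd() wants is the OLD value.  Since (old + inc) - inc == old,
 * GCC emits a single lock xadd here, just as a native xadd() macro would.
 */
static inline uint32_t xadd_like(uint32_t inc, uint32_t *p)
{
	return add_return(inc, p) - inc;
}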


+/**
+ * wait_in_queue - Add to queue and wait until it is at the head
+ * @lock: Pointer to queue rwlock structure
+ * @node: Node pointer to be added to the queue
+ */
+static inline void wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
+{
+   struct qrwnode *prev;
+
+   node->next = NULL;
+   node->wait = true;
+   prev = xchg(&lock->waitq, node);
+   if (prev) {
+   prev->next = node;
+   /*
+* Wait until the waiting flag is off
+*/
+   while (smp_load_acquire(&node->wait))
+   arch_mutex_cpu_relax();
+   }
+}

Please rebase on top of the MCS lock patches such that this is gone.


I would like to keep this as long as the MCS patches have not been 
merged into tip. However, I will take it out if the MCS patches are 
already in by the time I need to revise the qrwlock patches.
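
For context, a rough sketch of the hand-off step that completes what
wait_in_queue() starts is shown below. The helper name and the details are
hypothetical, not code quoted from the patch:

static inline void signal_next(struct qrwlock *lock, struct qrwnode *node)
{
	struct qrwnode *next;

	/* If nobody queued behind us, try to swing the tail back to empty. */
	if (!node->next && cmpxchg(&lock->waitq, node, NULL) == node)
		return;

	/* Otherwise wait for the successor to finish linking itself in. */
	while (!(next = ACCESS_ONCE(node->next)))
		arch_mutex_cpu_relax();

	/* Pairs with the acquire-load spin in wait_in_queue(). */
	smp_store_release(&next->wait, false);
}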


-Longman


Re: [PATCH v9 1/5] qrwlock: A queue read/write lock implementation

2014-01-20 Thread Peter Zijlstra
On Tue, Jan 14, 2014 at 11:44:03PM -0500, Waiman Long wrote:
> +#ifndef arch_mutex_cpu_relax
> +# define arch_mutex_cpu_relax() cpu_relax()
> +#endif

Include <linux/mutex.h>


> +#ifndef smp_load_acquire
> +# ifdef CONFIG_X86
> +#   define smp_load_acquire(p)   \
> + ({  \
> + typeof(*p) ___p1 = ACCESS_ONCE(*p); \
> + barrier();  \
> + ___p1;  \
> + })
> +# else
> +#   define smp_load_acquire(p)   \
> + ({  \
> + typeof(*p) ___p1 = ACCESS_ONCE(*p); \
> + smp_mb();   \
> + ___p1;  \
> + })
> +# endif
> +#endif
> +
> +#ifndef smp_store_release
> +# ifdef CONFIG_X86
> +#   define smp_store_release(p, v)   \
> + do {\
> + barrier();  \
> + ACCESS_ONCE(*p) = v;\
> + } while (0)
> +# else
> +#   define smp_store_release(p, v)   \
> + do {\
> + smp_mb();   \
> + ACCESS_ONCE(*p) = v;\
> + } while (0)
> +# endif
> +#endif

Remove these.

> +/*
> + * If an xadd (exchange-add) macro isn't available, simulate one with
> + * the atomic_add_return() function.
> + */
> +#ifdef xadd
> +# define qrw_xadd(rw, inc)   xadd(&(rw).rwc, inc)
> +#else
> +# define qrw_xadd(rw, inc)   (u32)(atomic_add_return(inc, &(rw).rwa) - inc)
> +#endif

Is GCC really so stupid that you cannot always use the
atomic_add_return()? The x86 atomic_add_return is i + xadd(), so you'll
end up with:

 i + xadd() - i

Surely it can just remove the two i terms?

> +/**
> + * wait_in_queue - Add to queue and wait until it is at the head
> + * @lock: Pointer to queue rwlock structure
> + * @node: Node pointer to be added to the queue
> + */
> +static inline void wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
> +{
> + struct qrwnode *prev;
> +
> + node->next = NULL;
> + node->wait = true;
> + prev = xchg(&lock->waitq, node);
> + if (prev) {
> + prev->next = node;
> + /*
> +  * Wait until the waiting flag is off
> +  */
> + while (smp_load_acquire(&node->wait))
> + arch_mutex_cpu_relax();
> + }
> +}

Please rebase on top of the MCS lock patches such that this is gone.


[PATCH v9 1/5] qrwlock: A queue read/write lock implementation

2014-01-14 Thread Waiman Long
This patch introduces a new read/write lock implementation that puts
waiting readers and writers into a queue instead of actively contending
for the lock like the current read/write lock implementation. This will
improve performance in highly contended situations by reducing cache
line bouncing.

The queue read/write lock (qrwlock) is a fair lock: although there is
still a slight chance of lock stealing if a reader or writer arrives at
just the right moment, lock granting is otherwise done in FIFO order.
As a result, the maximum time a task can wait before it acquires the
lock is bounded.

Internally, however, there is a second type of reader that tries to
steal the lock aggressively: it simply increments the reader count and
waits until the writer releases the lock. The transition to an
aggressive reader happens in the read lock slowpath when:

 1. the lock is taken in an interrupt context, or
 2. a reader comes to the head of the wait queue and sees
    the release of a write lock.

The queue read lock is safe to use in an interrupt context (softirq
or hardirq) as it switches to being an aggressive reader in such an
environment, allowing recursive read locking (sketched below).
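
A rough sketch of that slowpath decision is shown below. It is illustrative
only: QW_WMASK, the cnts field and the signal_next() helper are assumptions,
not the actual patch code.

#define QW_WMASK	0xff	/* hypothetical writer bits within rwc */

static void queue_read_lock_slowpath_sketch(struct qrwlock *lock)
{
	struct qrwnode node;

	if (in_interrupt()) {
		/*
		 * Case 1: aggressive reader.  The fastpath has already added
		 * us to the reader count, so just spin until the writer bits
		 * clear.  This is what makes recursive read locking from
		 * softirq/hardirq context safe.
		 */
		while (ACCESS_ONCE(lock->cnts.rwc) & QW_WMASK)
			arch_mutex_cpu_relax();
		return;
	}

	/*
	 * Case 2: ordinary reader.  Join the FIFO wait queue; once at the
	 * head and seeing the write lock being released, it likewise just
	 * spins for the writer bits before taking the lock and handing the
	 * queue head to the next waiter.
	 */
	wait_in_queue(lock, &node);
	while (ACCESS_ONCE(lock->cnts.rwc) & QW_WMASK)
		arch_mutex_cpu_relax();
	signal_next(lock, &node);	/* hypothetical hand-off helper */
}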

The only downside of the queue rwlock is a size increase in the lock
structure: 4 bytes for 32-bit systems and 12 bytes for 64-bit
systems.
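
For reference, a layout consistent with those numbers is sketched below.
The rwa, rwc, waitq, next and wait names appear in the code quoted
elsewhere in this thread; the exact arrangement here is an assumption:

struct qrwnode {
	struct qrwnode	*next;
	bool		 wait;		/* true while queued behind other waiters */
};

struct qrwlock {
	union qrwcnts {
		atomic_t	rwa;	/* atomic view, for atomic_add_return() */
		u32		rwc;	/* raw 32-bit view, for xadd() */
	} cnts;				/* reader count + writer flag, 4 bytes */
	struct qrwnode	*waitq;		/* tail of the wait queue */
};

Assuming a 4-byte baseline rwlock, that gives 4 + 4 = 8 bytes on 32-bit
(+4) and, with alignment padding before the pointer, 4 + 4 + 8 = 16 bytes
on 64-bit (+12).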

In terms of single-thread performance (no contention), a 256K
lock/unlock loop was run on 2.4GHz and 2.93GHz Westmere x86-64
CPUs. The following table shows the average time (in ns) for a single
lock/unlock sequence (including the looping and timing overhead):

Lock Type          2.4GHz   2.93GHz
---------          ------   -------
Ticket spinlock     14.9      12.3
Read lock           17.0      13.5
Write lock          17.0      13.5
Queue read lock     16.0      13.4
Queue write lock     9.2       7.8

The queue read lock is slightly slower than the ticket spinlock but
slightly faster than the regular read lock. The queue write lock,
however, is the fastest of all: it is almost twice as fast as the
write lock and about 1.5x the speed of the spinlock.

With lock contention, the speed of each individual lock/unlock
operation matters less than the contention-induced delays.

To investigate the performance characteristics of the queue rwlock
compared with the regular rwlock, Ingo's anon_vmas patch that converts
rwsem to rwlock was applied to a 3.12 kernel. This kernel was then
tested under the following 3 conditions:

 1) Plain 3.12
 2) Ingo's patch
 3) Ingo's patch + qrwlock

Each of the 3 kernels was booted up twice, with and without the
"idle=poll" kernel parameter, which keeps the CPUs in the C0 state
while idling instead of in a more energy-saving sleep state.  The
jobs-per-minute (JPM) results of the AIM7 high_systime workload at
1500 users on an 8-socket 80-core DL980 (HT off) were:

 Kernel        JPMs          %Change from (1)
 ------        ----          ----------------
   1       145704/227295            -
   2       229750/236066        +58%/+3.8%
   3       240062/248606        +65%/+9.4%

The first JPM number is without the "idle=poll" kernel parameter,
the second number is with that parameter. It can be seen that most
of the performance benefit of converting rwsem to rwlock actually
comes from the latency improvement of not needing to wake up a CPU
from a deep sleep state when work is available.

The use of non-sleeping locks did improve performance by eliminating
the context-switching cost. Using the queue rwlock almost tripled
that performance gain. The gain was reduced somewhat by lock
fairness, which was to be expected.

Looking at the perf profiles (with idle=poll) below, we can clearly see
that other bottlenecks were constraining the performance improvement.

Perf profile of kernel (2):

 18.65%   reaim  [kernel.kallsyms]  [k] __write_lock_failed
  9.00%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  5.21% swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  3.08%   reaim  [kernel.kallsyms]  [k] mspin_lock
  2.50%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
  2.00%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  1.29%   reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  1.21%   reaim  [kernel.kallsyms]  [k] __read_lock_failed
  1.12%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock
  1.10%   reaim  [kernel.kallsyms]  [k] perf_event_aux
  1.09%    true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave

Perf profile of kernel (3):

 20.14% swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  7.94%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  5.41%   reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
  5.01%   reaim  [kernel.kallsyms]  [k] mspin_lock
  2.12%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
  2.07%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  1.58%   reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
 
