Re: [PATCH v9 1/5] qrwlock: A queue read/write lock implementation
On 01/20/2014 10:21 AM, Peter Zijlstra wrote:
> On Tue, Jan 14, 2014 at 11:44:03PM -0500, Waiman Long wrote:
>> +#ifndef arch_mutex_cpu_relax
>> +# define arch_mutex_cpu_relax() cpu_relax()
>> +#endif
>
> Include <linux/mutex.h>

Will do so.

>> +#ifndef smp_load_acquire
>> +# ifdef CONFIG_X86
>> +#   define smp_load_acquire(p)				\
>> +	({							\
>> +		typeof(*p) ___p1 = ACCESS_ONCE(*p);		\
>> +		barrier();					\
>> +		___p1;						\
>> +	})
>> +# else
>> +#   define smp_load_acquire(p)				\
>> +	({							\
>> +		typeof(*p) ___p1 = ACCESS_ONCE(*p);		\
>> +		smp_mb();					\
>> +		___p1;						\
>> +	})
>> +# endif
>> +#endif
>> +
>> +#ifndef smp_store_release
>> +# ifdef CONFIG_X86
>> +#   define smp_store_release(p, v)			\
>> +	do {						\
>> +		barrier();				\
>> +		ACCESS_ONCE(*p) = v;			\
>> +	} while (0)
>> +# else
>> +#   define smp_store_release(p, v)			\
>> +	do {						\
>> +		smp_mb();				\
>> +		ACCESS_ONCE(*p) = v;			\
>> +	} while (0)
>> +# endif
>> +#endif
>
> Remove these.

Will do that.

>> +/*
>> + * If an xadd (exchange-add) macro isn't available, simulate one with
>> + * the atomic_add_return() function.
>> + */
>> +#ifdef xadd
>> +# define qrw_xadd(rw, inc)	xadd(&(rw).rwc, inc)
>> +#else
>> +# define qrw_xadd(rw, inc)	(u32)(atomic_add_return(inc, &(rw).rwa) - inc)
>> +#endif
>
> Is GCC really so stupid that you cannot always use the
> atomic_add_return()? The x86 atomic_add_return is i + xadd(), so you'll
> end up with:
>
>   i + xadd() - i
>
> Surely it can just remove the two i terms?

I guess gcc should do the right thing. I will remove the macro.

>> +/**
>> + * wait_in_queue - Add to queue and wait until it is at the head
>> + * @lock: Pointer to queue rwlock structure
>> + * @node: Node pointer to be added to the queue
>> + */
>> +static inline void wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
>> +{
>> +	struct qrwnode *prev;
>> +
>> +	node->next = NULL;
>> +	node->wait = true;
>> +	prev = xchg(&lock->waitq, node);
>> +	if (prev) {
>> +		prev->next = node;
>> +		/*
>> +		 * Wait until the waiting flag is off
>> +		 */
>> +		while (smp_load_acquire(&node->wait))
>> +			arch_mutex_cpu_relax();
>> +	}
>> +}
>
> Please rebase on top of the MCS lock patches such that this is gone.

I would like to keep this as long as the MCS patches have not been merged
into tip. However, I will take that out if the MCS patches are in when I
need to revise the qrwlock patches.

-Longman
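[Editorial note: for reference, wait_in_queue() above implements the classic
MCS-style handoff that the MCS lock patches factor out: each waiter spins
only on a flag in its own node, and the predecessor clears that flag to pass
ownership along. The following is a minimal user-space sketch of that
pattern using GCC atomic builtins; the struct and function names here are
made up for illustration and are not the kernel's API.]

#include <stdbool.h>
#include <stddef.h>

struct qnode {
	struct qnode *next;
	bool wait;
};

/* Tail pointer of the wait queue; NULL means the queue is empty. */
static struct qnode *queue_tail;

static void queue_wait(struct qnode *node)
{
	struct qnode *prev;

	node->next = NULL;
	node->wait = true;
	/* Atomically become the new tail and fetch the old tail. */
	prev = __atomic_exchange_n(&queue_tail, node, __ATOMIC_ACQ_REL);
	if (prev) {
		prev->next = node;
		/* Spin only on our own node until the predecessor hands over. */
		while (__atomic_load_n(&node->wait, __ATOMIC_ACQUIRE))
			;	/* arch_mutex_cpu_relax() in the kernel code */
	}
}

static void queue_pass(struct qnode *node)
{
	struct qnode *next = node->next;

	if (!next) {
		/* No visible successor: try to mark the queue empty. */
		struct qnode *expected = node;

		if (__atomic_compare_exchange_n(&queue_tail, &expected, NULL,
						false, __ATOMIC_RELEASE,
						__ATOMIC_RELAXED))
			return;
		/* A successor is enqueueing; wait for its next pointer. */
		while (!(next = __atomic_load_n(&node->next, __ATOMIC_ACQUIRE)))
			;
	}
	/* Release the successor: it stops spinning in queue_wait(). */
	__atomic_store_n(&next->wait, false, __ATOMIC_RELEASE);
}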
Re: [PATCH v9 1/5] qrwlock: A queue read/write lock implementation
On Tue, Jan 14, 2014 at 11:44:03PM -0500, Waiman Long wrote:
> +#ifndef arch_mutex_cpu_relax
> +# define arch_mutex_cpu_relax() cpu_relax()
> +#endif

Include <linux/mutex.h>

> +#ifndef smp_load_acquire
> +# ifdef CONFIG_X86
> +#   define smp_load_acquire(p)				\
> +	({							\
> +		typeof(*p) ___p1 = ACCESS_ONCE(*p);		\
> +		barrier();					\
> +		___p1;						\
> +	})
> +# else
> +#   define smp_load_acquire(p)				\
> +	({							\
> +		typeof(*p) ___p1 = ACCESS_ONCE(*p);		\
> +		smp_mb();					\
> +		___p1;						\
> +	})
> +# endif
> +#endif
> +
> +#ifndef smp_store_release
> +# ifdef CONFIG_X86
> +#   define smp_store_release(p, v)			\
> +	do {						\
> +		barrier();				\
> +		ACCESS_ONCE(*p) = v;			\
> +	} while (0)
> +# else
> +#   define smp_store_release(p, v)			\
> +	do {						\
> +		smp_mb();				\
> +		ACCESS_ONCE(*p) = v;			\
> +	} while (0)
> +# endif
> +#endif

Remove these.

> +/*
> + * If an xadd (exchange-add) macro isn't available, simulate one with
> + * the atomic_add_return() function.
> + */
> +#ifdef xadd
> +# define qrw_xadd(rw, inc)	xadd(&(rw).rwc, inc)
> +#else
> +# define qrw_xadd(rw, inc)	(u32)(atomic_add_return(inc, &(rw).rwa) - inc)
> +#endif

Is GCC really so stupid that you cannot always use the
atomic_add_return()? The x86 atomic_add_return is i + xadd(), so you'll
end up with:

  i + xadd() - i

Surely it can just remove the two i terms?

> +/**
> + * wait_in_queue - Add to queue and wait until it is at the head
> + * @lock: Pointer to queue rwlock structure
> + * @node: Node pointer to be added to the queue
> + */
> +static inline void wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
> +{
> +	struct qrwnode *prev;
> +
> +	node->next = NULL;
> +	node->wait = true;
> +	prev = xchg(&lock->waitq, node);
> +	if (prev) {
> +		prev->next = node;
> +		/*
> +		 * Wait until the waiting flag is off
> +		 */
> +		while (smp_load_acquire(&node->wait))
> +			arch_mutex_cpu_relax();
> +	}
> +}

Please rebase on top of the MCS lock patches such that this is gone.
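[Editorial note: to spell out the point about the two i terms, on x86
atomic_add_return(i, v) is built as i + xadd(&v->counter, i), so subtracting
i again leaves just the value fetched by the xadd and the compiler can cancel
the additions. A rough user-space illustration with GCC builtins; the _sim
names are invented here and __sync_fetch_and_add merely stands in for the
kernel's xadd():]

#include <stdint.h>

static inline uint32_t xadd_sim(uint32_t *p, uint32_t inc)
{
	/* Returns the old value, like the x86 xadd instruction. */
	return __sync_fetch_and_add(p, inc);
}

static inline uint32_t atomic_add_return_sim(uint32_t *p, uint32_t inc)
{
	/* x86 atomic_add_return() is essentially inc + xadd(p, inc). */
	return inc + xadd_sim(p, inc);
}

/*
 * The two spellings of qrw_xadd() should therefore compile to the same code:
 *   xadd_sim(p, inc)                    -> old value
 *   atomic_add_return_sim(p, inc) - inc -> (inc + old value) - inc
 */
static inline uint32_t qrw_xadd_sim(uint32_t *p, uint32_t inc)
{
	return atomic_add_return_sim(p, inc) - inc;
}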
[PATCH v9 1/5] qrwlock: A queue read/write lock implementation
This patch introduces a new read/write lock implementation that puts
waiting readers and writers into a queue instead of actively contending
for the lock like the current read/write lock implementation. This
improves performance in highly contended situations by reducing cache
line bouncing.

The queue read/write lock (qrwlock) is a fair lock even though there is
still a slight chance of lock stealing if a reader or writer comes at the
right moment. Other than that, lock granting is done in a FIFO manner. As
a result, it is possible to determine a maximum time period after which
the waiting is over and the lock can be acquired.

Internally, however, there is a second type of reader which tries to
steal the lock aggressively. Such readers simply increment the reader
count and wait until the writer releases the lock. The transition to
aggressive reader happens in the read lock slowpath when:

 1. the lock is taken in an interrupt context, or
 2. a reader comes to the head of the wait queue and sees the release of
    a write lock.

The queue read lock is safe to use in an interrupt context (softirq or
hardirq) as it will switch to become an aggressive reader in such an
environment, allowing recursive read locks.

The only downside of the queue rwlock is the size increase in the lock
structure by 4 bytes for 32-bit systems and by 12 bytes for 64-bit
systems.

In terms of single-thread performance (no contention), a 256K lock/unlock
loop was run on 2.4GHz and 2.93GHz Westmere x86-64 CPUs. The following
table shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):

  Lock Type           2.4GHz    2.93GHz
  ---------           ------    -------
  Ticket spinlock      14.9       12.3
  Read lock            17.0       13.5
  Write lock           17.0       13.5
  Queue read lock      16.0       13.4
  Queue write lock      9.2        7.8

The queue read lock is slightly slower than the spinlock, but is slightly
faster than the read lock. The queue write lock, however, is the fastest
of all. It is almost twice as fast as the write lock and about 1.5X the
speed of the spinlock.

With lock contention, the speed of each individual lock/unlock function
is less important than the amount of contention-induced delay. To
investigate the performance characteristics of the queue rwlock compared
with the regular rwlock, Ingo's anon_vmas patch that converts rwsem to
rwlock was applied to a 3.12 kernel. This kernel was then tested under
the following 3 conditions:

 1) Plain 3.12
 2) Ingo's patch
 3) Ingo's patch + qrwlock

Each of the 3 kernels was booted up twice, with and without the
"idle=poll" kernel parameter which keeps the CPUs in C0 state while
idling instead of a more energy-saving sleep state. The jobs per minute
(JPM) results of the AIM7 high_systime workload at 1500 users on an
8-socket 80-core DL980 (HT off) were:

  Kernel       JPMs          %Change from (1)
  ------       ----          ----------------
    1       145704/227295           -
    2       229750/236066       +58%/+3.8%
    3       240062/248606       +65%/+9.4%

The first JPM number is without the "idle=poll" kernel parameter, the
second number is with that parameter. It can be seen that most of the
performance benefit of converting rwsem to rwlock actually comes from the
latency improvement of not needing to wake up a CPU from a deep sleep
state when work is available. The use of non-sleeping locks did improve
performance by eliminating the context switching cost. With "idle=poll",
switching to the queue rwlock almost tripled that gain (+3.8% to +9.4%).
The gain was reduced somewhat by the fairness of the lock, which was to
be expected. Looking at the perf profiles (with idle=poll) below, we can
clearly see that other bottlenecks were constraining the performance
improvement.
Perf profile of kernel (2):

  18.65%   reaim  [kernel.kallsyms]  [k] __write_lock_failed
   9.00%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   5.21% swapper  [kernel.kallsyms]  [k] cpu_idle_loop
   3.08%   reaim  [kernel.kallsyms]  [k] mspin_lock
   2.50%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
   2.00%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   1.29%   reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
   1.21%   reaim  [kernel.kallsyms]  [k] __read_lock_failed
   1.12%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock
   1.10%   reaim  [kernel.kallsyms]  [k] perf_event_aux
   1.09%    true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave

Perf profile of kernel (3):

  20.14% swapper  [kernel.kallsyms]  [k] cpu_idle_loop
   7.94%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   5.41%   reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
   5.01%   reaim  [kernel.kallsyms]  [k] mspin_lock
   2.12%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
   2.07%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   1.58%   reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
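[Editorial note: to make the "aggressive reader" behaviour described above
more concrete, here is a much simplified user-space sketch of a combined
reader-count/writer-flag word of the kind the qrwlock uses. The constants,
field layout, and function names are invented for illustration only; the
real algorithm and fast/slow path split live in the qrwlock patches
themselves.]

#include <stdint.h>

/* Illustrative split: low byte is the writer flag, upper bytes count readers. */
#define QRW_WRITER_MASK   0x000000ffU
#define QRW_READER_BIAS   0x00000100U

struct qrwlock_sim {
	uint32_t cnts;
};

/*
 * Aggressive reader: unconditionally add a reader bias, then simply wait
 * for any writer that currently holds the lock to drop the writer flag.
 * Because the reader never queues behind a writer, recursive read locking
 * from interrupt context cannot deadlock against an interrupted reader.
 */
static void read_lock_aggressive(struct qrwlock_sim *l)
{
	__atomic_add_fetch(&l->cnts, QRW_READER_BIAS, __ATOMIC_ACQUIRE);
	while (__atomic_load_n(&l->cnts, __ATOMIC_ACQUIRE) & QRW_WRITER_MASK)
		;	/* cpu_relax() in the kernel */
}

static void read_unlock_sim(struct qrwlock_sim *l)
{
	__atomic_sub_fetch(&l->cnts, QRW_READER_BIAS, __ATOMIC_RELEASE);
}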