Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
On Fri, 16 Jan 2015 05:40:59 -0800 Eric Dumazet wrote:

> I made the same observation about 3 years ago, on old cpus.

Thank you for letting me know. I was thinking I was going insane! (Yeah,
yeah, there are lots of people who will still say that I've already gone
insane, but at least I know my memory is still intact.)

-- Steve
Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
On Thu, 2015-01-15 at 23:07 -0500, Steven Rostedt wrote:
> On Thu, 15 Jan 2015 21:57:58 -0600 (CST)
> Christoph Lameter wrote:
>
> > > I get:
> > >
> > >   mov    %gs:0x18(%rax),%rdx
> > >
> > > Looks to me that %gs is used.
> >
> > %gs is used as a segment prefix. That does not add significant cycles.
> > Retrieving the content of %gs and loading it into another register
> > would be expensive in terms of cpu cycles.
>
> OK, maybe that's what I saw in my previous benchmarks. Again, that was
> a while ago.

I made the same observation about 3 years ago, on old cpus.
Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
On Thu, 15 Jan 2015 21:57:58 -0600 (CST) Christoph Lameter wrote:

> > I get:
> >
> >   mov    %gs:0x18(%rax),%rdx
> >
> > Looks to me that %gs is used.
>
> %gs is used as a segment prefix. That does not add significant cycles.
> Retrieving the content of %gs and loading it into another register
> would be expensive in terms of cpu cycles.

OK, maybe that's what I saw in my previous benchmarks. Again, that was
a while ago.

-- Steve
Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
On Thu, 15 Jan 2015 22:51:30 -0500 Steven Rostedt wrote:

> I haven't done benchmarks in a while, so perhaps accessing the %gs
> segment isn't as expensive as I saw it before. I'll have to profile
> function tracing on my i7 and see where things are slow again.

I just ran it on my i7, and yeah, the %gs access isn't much worse than
any of the other instructions. I had an old box that recently died that
I did my last benchmarks on, so that was probably why it made such a
difference.

-- Steve
Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
> I get:
>
>   mov    %gs:0x18(%rax),%rdx
>
> Looks to me that %gs is used.

%gs is used as a segment prefix. That does not add significant cycles.
Retrieving the content of %gs and loading it into another register
would be expensive in terms of cpu cycles.
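To illustrate the distinction, a minimal sketch rather than the actual SLUB
source (demo_counter and read_demo_counter are made-up names for
illustration; the point is what the compiler emits for a per-cpu read):

    #include <linux/percpu.h>

    /* A per-cpu counter, purely for illustration (not from mm/slub.c). */
    static DEFINE_PER_CPU(unsigned long, demo_counter);

    /*
     * Reading it compiles to a single gs-prefixed load, something like
     *
     *     mov    %gs:demo_counter,%rax
     *
     * The segment override only redirects the base used for the
     * effective-address calculation; the value of %gs is never copied
     * into a general-purpose register.
     */
    static unsigned long read_demo_counter(void)
    {
        return this_cpu_read(demo_counter);
    }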
Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
On Thu, 15 Jan 2015 21:27:14 -0600 (CST) Christoph Lameter wrote:

> The %gs register is not used since the address of the per cpu area is
> available as one of the first fields in the per cpu areas.

Have you disassembled your code? Looking at put_cpu_partial() from
3.19-rc3 where it does:

	oldpage = this_cpu_read(s->cpu_slab->partial);

I get:

	mov    %gs:0x18(%rax),%rdx

Looks to me that %gs is used.

I haven't done benchmarks in a while, so perhaps accessing the %gs
segment isn't as expensive as I saw it before. I'll have to profile
function tracing on my i7 and see where things are slow again.

-- Steve
Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
On Thu, 15 Jan 2015, Andrew Morton wrote:

> > I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free
> > in CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)
>
> I'm surprised. preempt_disable/enable are pretty fast. I wonder why
> this makes a measurable difference. Perhaps preempt_enable()'s call to
> preempt_schedule() added pain?

The rest of the fastpath is already highly optimized. That is why
something like preempt enable/disable has such a disproportionate
effect.
Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
On Thu, 15 Jan 2015, Steven Rostedt wrote:

> While profiling function tracing I discovered that accessing preempt_count
> was actually quite expensive, even just to read. But it may not be as
> bad since Peter Zijlstra converted preempt_count to a per_cpu variable.
> Although, IIRC, the perf profiling showed that the access to the %gs
> register was where the time was being consumed, which is what
> I believe per_cpu variables still use.

The %gs register is not used since the address of the per cpu area is
available as one of the first fields in the per cpu areas.
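A hedged sketch of the mechanism being described, in case it helps: on
x86-64 the current cpu's per-cpu offset is itself stored in a per-cpu
variable (this_cpu_off), so building a per-cpu pointer is a gs-relative
load of that offset plus an add. The macro below is a simplification of
the arch/x86 behaviour, not the exact source; my_this_cpu_ptr is an
illustrative name, and sparse address-space annotations are ignored:

    #include <linux/percpu.h>

    /*
     * Simplified sketch of what this_cpu_ptr() boils down to on x86-64.
     * The per-cpu offset lives near the start of each per-cpu area, so
     * it is fetched with one gs-prefixed mov and added to the variable's
     * link-time address; %gs itself is never read out.
     */
    #define my_this_cpu_ptr(ptr)                                   \
        ((typeof(ptr))((unsigned long)(ptr) +                      \
                       this_cpu_read(this_cpu_off)))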
Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
On Thu, 15 Jan 2015 17:16:34 -0800 Andrew Morton wrote:

> > I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free
> > in CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)
>
> I'm surprised. preempt_disable/enable are pretty fast. I wonder why
> this makes a measurable difference. Perhaps preempt_enable()'s call
> to preempt_schedule() added pain?

While profiling function tracing I discovered that accessing preempt_count
was actually quite expensive, even just to read. But it may not be as
bad since Peter Zijlstra converted preempt_count to a per_cpu variable.
Although, IIRC, the perf profiling showed that the access to the %gs
register was where the time was being consumed, which is what
I believe per_cpu variables still use.

-- Steve
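For reference, after that conversion the x86 preempt count is an ordinary
per-cpu int and reading it is a single gs-relative load. Roughly, as a
simplified sketch of arch/x86/include/asm/preempt.h rather than the
verbatim source:

    DECLARE_PER_CPU(int, __preempt_count);

    static __always_inline int preempt_count(void)
    {
        /* one gs-prefixed mov; no function call, no stack access */
        return raw_cpu_read_4(__preempt_count) & ~PREEMPT_NEED_RESCHED;
    }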
Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
On Thu, 15 Jan 2015 16:40:32 +0900 Joonsoo Kim wrote:

> We had to insert a preempt enable/disable in the fastpath a while ago
> in order to guarantee that tid and kmem_cache_cpu are retrieved on the
> same cpu. It is a problem only for CONFIG_PREEMPT, where the scheduler
> can move the process to another cpu while the data is being retrieved.
>
> Now, I have reached a solution that removes preempt enable/disable from
> the fastpath. If tid matches kmem_cache_cpu's tid after tid and
> kmem_cache_cpu are retrieved by separate this_cpu operations, it means
> that they were retrieved on the same cpu. If they don't match, we just
> have to retry.
>
> With this guarantee, preemption enable/disable isn't needed at all, even
> with CONFIG_PREEMPT, so this patch removes it.
>
> I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free
> in CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)

I'm surprised. preempt_disable/enable are pretty fast. I wonder why
this makes a measurable difference. Perhaps preempt_enable()'s call to
preempt_schedule() added pain?
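For context on that last question: with CONFIG_PREEMPT, preempt_enable()
is more than a decrement; it also has to test the resulting count and
possibly reschedule. Roughly, as a simplified sketch of the
include/linux/preempt.h definition of that era rather than the verbatim
source:

    #define preempt_enable()                                       \
    do {                                                           \
        barrier();                                                 \
        if (unlikely(preempt_count_dec_and_test()))                \
            __preempt_schedule();                                  \
    } while (0)

So beyond the per-cpu counter updates, every kmem_cache_alloc/free pair
carries a conditional branch into the scheduler, which is the overhead
the patch removes from the fastpath.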
Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
On Thu, 15 Jan 2015 16:40:32 +0900 Joonsoo Kim wrote:

[...]
>
> I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free
> in CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)
>
> Below is the result of Christoph's slab_test reported by
> Jesper Dangaard Brouer.
>
[...]

Acked-by: Jesper Dangaard Brouer

> Acked-by: Christoph Lameter
> Tested-by: Jesper Dangaard Brouer
> Signed-off-by: Joonsoo Kim
> ---
>  mm/slub.c | 35 +++
>  1 file changed, 23 insertions(+), 12 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index fe376fe..ceee1d7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2398,13 +2398,24 @@ redo:
[...]
> 	 */
> -	preempt_disable();
> -	c = this_cpu_ptr(s->cpu_slab);
> +	do {
> +		tid = this_cpu_read(s->cpu_slab->tid);
> +		c = this_cpu_ptr(s->cpu_slab);
> +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> +
> +	/*
> +	 * Irqless object alloc/free alogorithm used here depends on sequence

Spelling of algorithm contains a typo ^^

> +	 * of fetching cpu_slab's data. tid should be fetched before anything
> +	 * on c to guarantee that object and page associated with previous tid
> +	 * won't be used with current tid. If we fetch tid first, object and
> +	 * page could be one associated with next tid and our alloc/free
> +	 * request will be failed. In this case, we will retry. So, no problem.
> +	 */
> +	barrier();

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
[PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
We had to insert a preempt enable/disable in the fastpath a while ago
in order to guarantee that tid and kmem_cache_cpu are retrieved on the
same cpu. It is a problem only for CONFIG_PREEMPT, where the scheduler
can move the process to another cpu while the data is being retrieved.

Now, I have reached a solution that removes preempt enable/disable from
the fastpath. If tid matches kmem_cache_cpu's tid after tid and
kmem_cache_cpu are retrieved by separate this_cpu operations, it means
that they were retrieved on the same cpu. If they don't match, we just
have to retry.

With this guarantee, preemption enable/disable isn't needed at all, even
with CONFIG_PREEMPT, so this patch removes it.

I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free
in CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)

Below is the result of Christoph's slab_test reported by
Jesper Dangaard Brouer.

* Before

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
1 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
1 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
1 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
1 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
1 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
1 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
1 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
1 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
1 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
1 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
1 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
1 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles

2. Kmalloc: alloc/free test
1 times kmalloc(8)/kfree -> 68 cycles
1 times kmalloc(16)/kfree -> 68 cycles
1 times kmalloc(32)/kfree -> 69 cycles
1 times kmalloc(64)/kfree -> 68 cycles
1 times kmalloc(128)/kfree -> 68 cycles
1 times kmalloc(256)/kfree -> 68 cycles
1 times kmalloc(512)/kfree -> 74 cycles
1 times kmalloc(1024)/kfree -> 75 cycles
1 times kmalloc(2048)/kfree -> 74 cycles
1 times kmalloc(4096)/kfree -> 74 cycles
1 times kmalloc(8192)/kfree -> 75 cycles
1 times kmalloc(16384)/kfree -> 510 cycles

* After

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
1 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
1 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
1 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
1 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
1 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
1 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
1 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
1 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
1 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
1 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
1 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
1 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles

2. Kmalloc: alloc/free test
1 times kmalloc(8)/kfree -> 65 cycles
1 times kmalloc(16)/kfree -> 66 cycles
1 times kmalloc(32)/kfree -> 65 cycles
1 times kmalloc(64)/kfree -> 66 cycles
1 times kmalloc(128)/kfree -> 66 cycles
1 times kmalloc(256)/kfree -> 71 cycles
1 times kmalloc(512)/kfree -> 72 cycles
1 times kmalloc(1024)/kfree -> 71 cycles
1 times kmalloc(2048)/kfree -> 71 cycles
1 times kmalloc(4096)/kfree -> 71 cycles
1 times kmalloc(8192)/kfree -> 65 cycles
1 times kmalloc(16384)/kfree -> 511 cycles

Most of the results are better than before. Note that this change
slightly worsens performance in !CONFIG_PREEMPT, roughly 0.3%.
Implementing each case separately would help performance, but, since
it's so marginal, I didn't do that. Keeping the same code for all cases
also helps maintenance.

Change from v1: add comment about barrier() usage

Acked-by: Christoph Lameter
Tested-by: Jesper Dangaard Brouer
Signed-off-by: Joonsoo Kim
---
 mm/slub.c | 35 +++
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index fe376fe..ceee1d7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2398,13 +2398,24 @@ redo:
 	 * reading from one cpu area. That does not matter as long
 	 * as we end up on the original cpu again when doing the cmpxchg.
 	 *
-	 * Preemption is disabled for the retrieval of the tid because that
-	 * must occur from the current processor. We cannot allow rescheduling
-	 * on a different processor between the determination of the pointer
-	 * and the retrieval of the tid.
+	 * We should guarantee that tid and kmem_cache are retrieved on
+	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
+	 * to check if it is matched or not.
 	 */
-	preempt_disable();
-	c = this_cpu_pt
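Condensing the hunk (as quoted in full in Jesper's reply above) into a
standalone sketch of the new allocation-fastpath prologue — this is not
the complete function, and the this_cpu_cmpxchg_double()-based allocation
step that follows it is omitted:

    struct kmem_cache_cpu *c;
    unsigned long tid;

    /*
     * Fetch tid and the per-cpu kmem_cache_cpu without disabling
     * preemption; if we migrated between the two reads, the tids differ
     * (under CONFIG_PREEMPT) and we simply retry.
     */
    do {
        tid = this_cpu_read(s->cpu_slab->tid);
        c = this_cpu_ptr(s->cpu_slab);
    } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

    /*
     * Order the tid read against the later reads of c's fields; any cpu
     * change or preemption after this point is then caught by the
     * cmpxchg on (freelist, tid) further down the fastpath.
     */
    barrier();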