Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

2015-01-16 Thread Steven Rostedt
On Fri, 16 Jan 2015 05:40:59 -0800
Eric Dumazet  wrote:

> I made the same observation about 3 years ago, on old cpus.
> 

Thank you for letting me know. I was thinking I was going insane!

(yeah yeah, there are lots of people who will still say that I've already
gone insane, but at least I know my memory is still intact)

-- Steve


Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

2015-01-16 Thread Eric Dumazet
On Thu, 2015-01-15 at 23:07 -0500, Steven Rostedt wrote:
> On Thu, 15 Jan 2015 21:57:58 -0600 (CST)
> Christoph Lameter  wrote:
> 
> > > I get:
> > >
> > >   mov    %gs:0x18(%rax),%rdx
> > >
> > > Looks to me like %gs is used.
> > 
> > %gs is used as a segment prefix. That does not add significant cycles.
> > Retrieving the content of %gs and loading it into another register
> > would be expensive in terms of cpu cycles.
> 
> OK, maybe that's what I saw in my previous benchmarks. Again, that was
> a while ago.

I made the same observation about 3 years ago, on old cpus.




Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

2015-01-15 Thread Steven Rostedt
On Thu, 15 Jan 2015 21:57:58 -0600 (CST)
Christoph Lameter  wrote:

> > I get:
> >
> > mov    %gs:0x18(%rax),%rdx
> >
> > Looks to me like %gs is used.
> 
> %gs is used as a segment prefix. That does not add significant cycles.
> Retrieving the content of %gs and loading it into another register
> would be expensive in terms of cpu cycles.

OK, maybe that's what I saw in my previous benchmarks. Again, that was
a while ago.

-- Steve


Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

2015-01-15 Thread Steven Rostedt
On Thu, 15 Jan 2015 22:51:30 -0500
Steven Rostedt  wrote:

> 
> I haven't done benchmarks in a while, so perhaps accessing the %gs
> segment isn't as expensive as it was when I last measured it. I'll have
> to profile function tracing on my i7 and see where things are slow again.

I just ran it on my i7, and yeah, the %gs access isn't much worse than
any of the other instructions. My last benchmarks were done on an old
box that recently died, so that is probably why it made such a
difference back then.

-- Steve



Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

2015-01-15 Thread Christoph Lameter
> I get:
>
>   mov    %gs:0x18(%rax),%rdx
>
> Looks to me like %gs is used.

%gs is used as a segment prefix. That does not add significant cycles.
Retrieving the content of %gs and loading it into another register would
be expensive in terms of cpu cycles.
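
As a concrete illustration of that point, here is a small userspace
codegen sketch. It is an analogy only: it uses GCC's __seg_gs named
address space rather than the kernel's this_cpu_read() machinery, and
the struct and function names are made up for this example.

/* Build with: gcc -O2 -S gs_prefix.c  (x86-64, gcc >= 6) */
struct pcpu_slab {
	unsigned long pad[3];
	void *partial;		/* ends up at offset 0x18 */
};

#ifdef __SEG_GS
void *read_partial(const struct pcpu_slab __seg_gs *c)
{
	/*
	 * Compiles to a single %gs-prefixed load, e.g.
	 *   mov %gs:0x18(%rdi),%rax
	 * in objdump notation.  The segment prefix itself is essentially
	 * free; what would be costly is materializing the GS base into a
	 * general-purpose register (e.g. via an MSR read) on every access.
	 */
	return c->partial;
}
#endif

That prefixed load has the same shape as the mov %gs:0x18(%rax),%rdx
quoted from put_cpu_partial() elsewhere in this thread.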



Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

2015-01-15 Thread Steven Rostedt
On Thu, 15 Jan 2015 21:27:14 -0600 (CST)
Christoph Lameter  wrote:

> 
> The %gs register is not used since the address of the per cpu area is
> available as one of the first fields in the per cpu areas.

Have you disassembled your code?

Looking at put_cpu_partial() from 3.19-rc3 where it does:

oldpage = this_cpu_read(s->cpu_slab->partial);

I get:

mov    %gs:0x18(%rax),%rdx

Looks to me like %gs is used.


I haven't done benchmarks in a while, so perhaps accessing the %gs
segment isn't as expensive as it was when I last measured it. I'll have
to profile function tracing on my i7 and see where things are slow again.


-- Steve


Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

2015-01-15 Thread Christoph Lameter
On Thu, 15 Jan 2015, Andrew Morton wrote:

> > I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
> > with CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)
>
> I'm surprised.  preempt_disable/enable are pretty fast.  I wonder why
> this makes a measurable difference.  Perhaps preempt_enable()'s call to
> preempt_schedule() added pain?

The rest of the fastpath is already highly optimized. That is why
something like preempt enable/disable has such a disproportionate effect.



Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

2015-01-15 Thread Christoph Lameter
On Thu, 15 Jan 2015, Steven Rostedt wrote:

> While profiling function tracing, I discovered that accessing preempt_count
> was actually quite expensive, even just to read. But it may not be as
> bad now that Peter Zijlstra has converted preempt_count to a per_cpu
> variable. Although, IIRC, the perf profiling showed that the access to
> the %gs register was where the time was being consumed, and that is what
> I believe per_cpu variables still use.

The %gs register is not used since the address of the per cpu area is
available as one of the first fields in the per cpu areas.



Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

2015-01-15 Thread Steven Rostedt
On Thu, 15 Jan 2015 17:16:34 -0800
Andrew Morton  wrote:

> > I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
> > with CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)
> 
> I'm surprised.  preempt_disable/enable are pretty fast.  I wonder why
> this makes a measurable difference.  Perhaps preempt_enable()'s call
> to preempt_schedule() added pain?

While profiling function tracing, I discovered that accessing preempt_count
was actually quite expensive, even just to read. But it may not be as
bad now that Peter Zijlstra has converted preempt_count to a per_cpu
variable. Although, IIRC, the perf profiling showed that the access to
the %gs register was where the time was being consumed, and that is what
I believe per_cpu variables still use.

-- Steve


Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

2015-01-15 Thread Andrew Morton
On Thu, 15 Jan 2015 16:40:32 +0900 Joonsoo Kim  wrote:

> We had to insert a preempt enable/disable in the fastpath a while ago
> in order to guarantee that tid and kmem_cache_cpu are retrieved on the
> same cpu. This is a problem only for CONFIG_PREEMPT, where the scheduler
> can move the process to another cpu while the data is being retrieved.
> 
> Now I have reached a solution that removes the preempt enable/disable
> from the fastpath. If tid matches kmem_cache_cpu's tid after tid and
> kmem_cache_cpu have been retrieved by separate this_cpu operations, they
> were retrieved on the same cpu. If they don't match, we just retry.
> 
> With this guarantee, preemption enable/disable isn't needed at all, even
> with CONFIG_PREEMPT, so this patch removes it.
> 
> I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
> with CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)

I'm surprised.  preempt_disable/enable are pretty fast.  I wonder why
this makes a measurable difference.  Perhaps preempt_enable()'s call to
preempt_schedule() added pain?
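
For reference, with CONFIG_PREEMPT the enable side is more than a bare
decrement. Roughly, paraphrasing include/linux/preempt.h of that era
(treat this as a sketch, not the exact source):

/*
 * preempt_count_dec_and_test() returns true when the count drops to zero
 * and a reschedule is pending, so every preempt_enable() in the fastpath
 * carries a test plus a conditional call into the scheduler, not just a
 * decrement.
 */
#define preempt_disable() \
do { \
	preempt_count_inc(); \
	barrier(); \
} while (0)

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)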



Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

2015-01-15 Thread Jesper Dangaard Brouer
On Thu, 15 Jan 2015 16:40:32 +0900
Joonsoo Kim  wrote:

[...]
> 
> I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
> with CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)
> 
> Below are the results of Christoph's slab_test, reported by
> Jesper Dangaard Brouer.
>
[...]

Acked-by: Jesper Dangaard Brouer 

> Acked-by: Christoph Lameter 
> Tested-by: Jesper Dangaard Brouer 
> Signed-off-by: Joonsoo Kim 
> ---
>  mm/slub.c |   35 +++++++++++++++++++++++------------
>  1 file changed, 23 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index fe376fe..ceee1d7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2398,13 +2398,24 @@ redo:
[...]
> 	 */
> -	preempt_disable();
> -	c = this_cpu_ptr(s->cpu_slab);
> +	do {
> +		tid = this_cpu_read(s->cpu_slab->tid);
> +		c = this_cpu_ptr(s->cpu_slab);
> +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> +
> +	/*
> +	 * Irqless object alloc/free alogorithm used here depends on sequence

Spelling of algorithm contains a typo ^^ 

> +	 * of fetching cpu_slab's data. tid should be fetched before anything
> +	 * on c to guarantee that object and page associated with previous tid
> +	 * won't be used with current tid. If we fetch tid first, object and
> +	 * page could be one associated with next tid and our alloc/free
> +	 * request will be failed. In this case, we will retry. So, no problem.
> +	 */
> +	barrier();

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


[PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

2015-01-14 Thread Joonsoo Kim
We had to insert a preempt enable/disable in the fastpath a while ago
in order to guarantee that tid and kmem_cache_cpu are retrieved on the
same cpu. This is a problem only for CONFIG_PREEMPT, where the scheduler
can move the process to another cpu while the data is being retrieved.

Now I have reached a solution that removes the preempt enable/disable
from the fastpath. If tid matches kmem_cache_cpu's tid after tid and
kmem_cache_cpu have been retrieved by separate this_cpu operations, they
were retrieved on the same cpu. If they don't match, we just retry.

With this guarantee, preemption enable/disable isn't needed at all, even
with CONFIG_PREEMPT, so this patch removes it.
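
As a rough illustration of the retry scheme described above, here is a
small sketch. The names (struct cpu_slab, pcpu_read_tid(), pcpu_slab(),
slab_cmpxchg(), alloc_fastpath_sketch()) are stand-ins for this example
only, not the real SLUB/this_cpu API; the actual change is in the diff
below.

#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct kmem_cache_cpu. */
struct cpu_slab {
	unsigned long tid;	/* bumped on every cmpxchg and on migration */
	void *freelist;
};

/*
 * Hypothetical helpers standing in for this_cpu_read(), this_cpu_ptr()
 * and the tid-versioned cmpxchg the real fastpath uses.
 */
unsigned long pcpu_read_tid(void);
struct cpu_slab *pcpu_slab(void);
bool slab_cmpxchg(struct cpu_slab *c, unsigned long tid, void *old_object);

void *alloc_fastpath_sketch(void)
{
	struct cpu_slab *c;
	unsigned long tid;
	void *object;

	do {
		/*
		 * tid and c are fetched with two separate per-cpu reads, so
		 * a preemption in between may leave them from different
		 * cpus.  If the tids disagree, simply read both again --
		 * this replaces the old preempt_disable()/preempt_enable().
		 */
		tid = pcpu_read_tid();
		c = pcpu_slab();
	} while (tid != c->tid);

	object = c->freelist;
	if (!object)
		return NULL;	/* the real code falls back to the slowpath */

	/*
	 * The versioned cmpxchg only succeeds if freelist and tid are still
	 * the values we sampled, i.e. we are still on the cpu whose tid we
	 * read and nothing raced with us; on failure the real code retries.
	 */
	if (!slab_cmpxchg(c, tid, object))
		return NULL;

	return object;
}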

I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
with CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)

Below are the results of Christoph's slab_test, reported by
Jesper Dangaard Brouer.

* Before

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 68 cycles
 10000 times kmalloc(16)/kfree -> 68 cycles
 10000 times kmalloc(32)/kfree -> 69 cycles
 10000 times kmalloc(64)/kfree -> 68 cycles
 10000 times kmalloc(128)/kfree -> 68 cycles
 10000 times kmalloc(256)/kfree -> 68 cycles
 10000 times kmalloc(512)/kfree -> 74 cycles
 10000 times kmalloc(1024)/kfree -> 75 cycles
 10000 times kmalloc(2048)/kfree -> 74 cycles
 10000 times kmalloc(4096)/kfree -> 74 cycles
 10000 times kmalloc(8192)/kfree -> 75 cycles
 10000 times kmalloc(16384)/kfree -> 510 cycles

* After

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 65 cycles
 10000 times kmalloc(16)/kfree -> 66 cycles
 10000 times kmalloc(32)/kfree -> 65 cycles
 10000 times kmalloc(64)/kfree -> 66 cycles
 10000 times kmalloc(128)/kfree -> 66 cycles
 10000 times kmalloc(256)/kfree -> 71 cycles
 10000 times kmalloc(512)/kfree -> 72 cycles
 10000 times kmalloc(1024)/kfree -> 71 cycles
 10000 times kmalloc(2048)/kfree -> 71 cycles
 10000 times kmalloc(4096)/kfree -> 71 cycles
 10000 times kmalloc(8192)/kfree -> 65 cycles
 10000 times kmalloc(16384)/kfree -> 511 cycles

Most of the results are better than before.

Note that this change slightly worsens performance in !CONFIG_PREEMPT,
by roughly 0.3%. Implementing each case separately would help performance,
but, since the difference is so marginal, I didn't do that. Keeping the
same code for all cases also helps maintenance.

Change from v1: add comment about barrier() usage

Acked-by: Christoph Lameter 
Tested-by: Jesper Dangaard Brouer 
Signed-off-by: Joonsoo Kim 
---
 mm/slub.c |   35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index fe376fe..ceee1d7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2398,13 +2398,24 @@ redo:
 	 * reading from one cpu area. That does not matter as long
 	 * as we end up on the original cpu again when doing the cmpxchg.
 	 *
-	 * Preemption is disabled for the retrieval of the tid because that
-	 * must occur from the current processor. We cannot allow rescheduling
-	 * on a different processor between the determination of the pointer
-	 * and the retrieval of the tid.
+	 * We should guarantee that tid and kmem_cache are retrieved on
+	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
+	 * to check if it is matched or not.
 	 */
-	preempt_disable();
-	c = this_cpu_ptr(s->cpu_slab);