Re: [Q] Default SLAB allocator

2012-10-19 Thread Eric Dumazet
On Fri, 2012-10-19 at 09:03 +0900, JoonSoo Kim wrote:
> Hello, Eric.
> Thank you very much for your kind comment on my question.
> I have one more question related to the network subsystem.
> Please let me know what I'm misunderstanding.
> 
> 2012/10/14 Eric Dumazet :
> > In latest kernels, skb->head no longer use kmalloc()/kfree(), so SLAB vs
> > SLUB is less a concern for network loads.
> >
> > In 3.7, (commit 69b08f62e17) we use fragments of order-3 pages to
> > populate skb->head.
> 
> You mentioned that in the latest kernels skb->head no longer uses kmalloc()/kfree().

I didn't have time to fully explain what was going on, only to give some
general ideas/hints.

Only incoming skbs, delivered by the NIC, are built this way.

I plan to extend this to some other kinds of frames, for example TCP ACKs.
(They have a short life, so using __netdev_alloc_frag makes sense.)

But when an application does a tcp_sendmsg(), we use GFP_KERNEL
allocations and thus still use kmalloc().
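
Very roughly, the two paths look like this (a sketch only, not the exact
kernel code: netdev_alloc_frag(), build_skb() and alloc_skb() are the real
entry points involved, but the rx_one_frame()/tx_one_frame() wrappers and
the size computation are just for illustration):

#include <linux/skbuff.h>

/* RX: skb->head is carved out of a per-cpu page fragment (an order-3
 * page in 3.7); no kmalloc()/kfree() for the head. */
static struct sk_buff *rx_one_frame(unsigned int len)
{
	unsigned int sz = SKB_DATA_ALIGN(NET_SKB_PAD + len) +
			  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	void *data = netdev_alloc_frag(sz);

	return data ? build_skb(data, sz) : NULL;
}

/* TX (e.g. from tcp_sendmsg()): skb->head still comes from kmalloc(). */
static struct sk_buff *tx_one_frame(unsigned int len)
{
	return alloc_skb(len, GFP_KERNEL);
}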

> But then why does the result of David's "netperf RR" test on v3.6 still
> differ depending on the chosen allocator?

Because outgoing skbs still use kmalloc() for their skb->head.

RR sends one frame and receives one frame for each transaction.

So with 3.5, each RR transaction using a NIC needs 3 kmalloc() calls
instead of the 4 needed on previous kernels.

Note that loopback traffic is different, since we do 2 kmalloc() calls per
transaction, and there is no difference on 3.5 for this kind of network
load.

> As far as I know, __netdev_alloc_frag was introduced in v3.5, so
> I'm just confused.
> Does this test use __netdev_alloc_skb with "__GFP_WAIT | GFP_DMA"?
> 
> Does a normal network workload use __netdev_alloc_skb with
> "__GFP_WAIT | GFP_DMA"?
> 

Not especially.



Re: [Q] Default SLAB allocator

2012-10-18 Thread JoonSoo Kim
Hello, Eric.
Thank you very much for your kind comment on my question.
I have one more question related to the network subsystem.
Please let me know what I'm misunderstanding.

2012/10/14 Eric Dumazet :
> In latest kernels, skb->head no longer use kmalloc()/kfree(), so SLAB vs
> SLUB is less a concern for network loads.
>
> In 3.7, (commit 69b08f62e17) we use fragments of order-3 pages to
> populate skb->head.

You mentioned that in the latest kernels skb->head no longer uses kmalloc()/kfree().
But then why does the result of David's "netperf RR" test on v3.6 still
differ depending on the chosen allocator?
As far as I know, __netdev_alloc_frag was introduced in v3.5, so
I'm just confused.
Does this test use __netdev_alloc_skb with "__GFP_WAIT | GFP_DMA"?

Does a normal network workload use __netdev_alloc_skb with
"__GFP_WAIT | GFP_DMA"?

Thanks!

Re: [Q] Default SLAB allocator

2012-10-17 Thread Shentino
On Wed, Oct 17, 2012 at 1:33 PM, Tim Bird  wrote:
> On 10/17/2012 12:20 PM, Shentino wrote:
>> Potentially stupid question
>>
>> But is SLAB the one where all objects per cache have a fixed size and
>> thus you don't have any bookkeeping overhead for the actual
>> allocations?
>>
>> I remember something about one of the allocation mechanisms being
>> designed for caches of fixed sized objects to minimize the need for
>> bookkeeping.
>
> I wouldn't say "don't have _any_ bookkeeping", but minimizing the
> bookkeeping is indeed part of the SLAB goals.
>
> However, that is for objects that are allocated at fixed size.
> kmalloc is (currently) a thin wrapper over the slab system,
> and it maps non-power-of-two allocations onto slabs that are
> power-of-two sized.

...yuck?

> So, for example a string that is 18 bytes long
> will be allocated out of a slab with 32-byte objects.  This
> is the wastage that we're talking about here.  "Overhead" may
> have been the wrong word on my part, as that may imply overhead
> in the actual slab mechanisms, rather than just slop in the
> data area.

Data slop (both for alignment and for making room for per-allocation
bookkeeping overhead, as is often done with userspace malloc arenas) is
precisely what I was referring to here.

Thanks for the answers; I was curious.

> As an aside...
>
> Is there a canonical glossary for memory-related terms?  What
> is the correct term for the difference between what is requested
> and what is actually returned by the allocator?  I've been
> calling it alternately "wastage" or "overhead", but maybe there's
> a more official term?
>
> I looked here: http://www.memorymanagement.org/glossary/
> but didn't find exactly what I was looking for.  The closest
> things I found were "internal fragmentation" and
> "padding", but those didn't seem to exactly describe
> the situation here.

Another stupid question:

Is it possible to have both SLAB for fixed-size objects and something
like SLOB or SLUB standing alongside with a different pool for
variable-size allocations a la kmalloc?

My hunch is that handling the two cases with separate methods may get
the best of both worlds.  Or kmalloc could be layered on something that
gets huge blocks from slab and slices them up in ways more amenable to
avoiding power-of-2 slop.

I'm no memory geek, so just my two cents.

>  -- Tim
>
> =
> Tim Bird
> Architecture Group Chair, CE Workgroup of the Linux Foundation
> Senior Staff Engineer, Sony Network Entertainment
> =


Re: [Q] Default SLAB allocator

2012-10-17 Thread Ezequiel Garcia
On Wed, Oct 17, 2012 at 5:58 PM, Tim Bird  wrote:
> On 10/17/2012 12:13 PM, Eric Dumazet wrote:
>> On Wed, 2012-10-17 at 11:45 -0700, Tim Bird wrote:
>>
>>> 8G is a small web server?  The RAM budget for Linux on one of
>>> Sony's cameras was 10M.  We're not merely not in the same ballpark -
>>> you're in a ballpark and I'm trimming bonsai trees... :-)
>>>
>>
>> Even laptops in 2012 have +4GB of ram.
>>
>> (Maybe not Sony laptops, I have to double check ?)
>>
>> Yes, servers do have more ram than laptops.
>>
>> (Maybe not Sony servers, I have to double check ?)
>
> I wouldn't know.  I suspect they are running 4GB+
> like everyone else.
>
>>>> # grep Slab /proc/meminfo
>>>> Slab: 351592 kB
>>>>
>>>> # egrep "kmalloc-32|kmalloc-16|kmalloc-8" /proc/slabinfo
>>>> kmalloc-32         11332  12544     32  128    1 : tunables    0    0    0 : slabdata     98     98      0
>>>> kmalloc-16          5888   5888     16  256    1 : tunables    0    0    0 : slabdata     23     23      0
>>>> kmalloc-8          76563  82432      8  512    1 : tunables    0    0    0 : slabdata    161    161      0
>>>>
>>>> Really, some waste on these small objects is pure noise on SMP hosts.
>>> In this example, it appears that if all kmalloc-8's were pushed into 
>>> 32-byte slabs,
>>> we'd lose about 1.8 meg due to pure slab overhead.  This would not be noise
>>> on my system.
>> I said :
>>
>> 
>> I would remove small kmalloc-XX caches, as sharing a cache line
>> is sometime dangerous for performance, because of false sharing.
>>
>> They make sense only for very small hosts
>> 
>>
>> I think your 10M cameras are very tiny hosts.
>
> I agree.  Actually, I'm currently doing research for
> items with smaller memory footprints that this.  My current
> target is devices with 4M RAM and 8M NOR flash.
> Undoubtedly this is different than what a lot of other
> people are doing with Linux.
>
>> Using SLUB on them might not be the best choice.
> Indeed. :-)
>

I think the above assertion still needs some updated measurements.

Is SLUB really a bad choice? Is SLAB the best choice? Or is this a
SLOB use case?

I've been trying to answer these questions, again focusing on
memory-constrained tiny hosts.
If anyone has some insight, I would very much like to hear it.

Ezequiel


Re: [Q] Default SLAB allocator

2012-10-17 Thread Tim Bird
On 10/17/2012 12:13 PM, Eric Dumazet wrote:
> On Wed, 2012-10-17 at 11:45 -0700, Tim Bird wrote:
> 
>> 8G is a small web server?  The RAM budget for Linux on one of
>> Sony's cameras was 10M.  We're not merely not in the same ballpark -
>> you're in a ballpark and I'm trimming bonsai trees... :-)
>>
> 
> Even laptops in 2012 have +4GB of ram.
> 
> (Maybe not Sony laptops, I have to double check ?)
> 
> Yes, servers do have more ram than laptops.
> 
> (Maybe not Sony servers, I have to double check ?)

I wouldn't know.  I suspect they are running 4GB+
like everyone else.

>>> # grep Slab /proc/meminfo
>>> Slab: 351592 kB
>>>
>>> # egrep "kmalloc-32|kmalloc-16|kmalloc-8" /proc/slabinfo 
>>> kmalloc-32         11332  12544     32  128    1 : tunables    0    0    0 : slabdata     98     98      0
>>> kmalloc-16          5888   5888     16  256    1 : tunables    0    0    0 : slabdata     23     23      0
>>> kmalloc-8          76563  82432      8  512    1 : tunables    0    0    0 : slabdata    161    161      0
>>>
>>> Really, some waste on these small objects is pure noise on SMP hosts.
>> In this example, it appears that if all kmalloc-8's were pushed into 32-byte 
>> slabs,
>> we'd lose about 1.8 meg due to pure slab overhead.  This would not be noise
>> on my system.
> I said :
> 
> 
> I would remove small kmalloc-XX caches, as sharing a cache line
> is sometime dangerous for performance, because of false sharing.
> 
> They make sense only for very small hosts
> 
> 
> I think your 10M cameras are very tiny hosts.

I agree.  Actually, I'm currently doing research for
items with smaller memory footprints than this.  My current
target is devices with 4M RAM and 8M NOR flash.
Undoubtedly this is different from what a lot of other
people are doing with Linux.

> Using SLUB on them might not be the best choice.
Indeed. :-)

I'm still interested in the dynamics of the slab sizes
and how they impact performance, how much memory is wasted, etc.
I think there are a few "power-of-two-and-a-half" kmalloc
slabs (e.g. kmalloc-192).  Are these configurable anywhere?

Anyway, I greatly appreciate the discussion.

> First time I ran linux, years ago, it was on 486SX machines with 8M of
> memory (or maybe less, I dont remember exactly). But I no longer use
> this class of machines with recent kernels.

I ran a web server on an 8M machine that had an uptime of over 2 years,
but that was in the mid-90's.  Ahhh - those were the days...
 -- Tim

=
Tim Bird
Architecture Group Chair, CE Workgroup of the Linux Foundation
Senior Staff Engineer, Sony Network Entertainment
=



Re: [Q] Default SLAB allocator

2012-10-17 Thread Tim Bird
On 10/17/2012 12:20 PM, Shentino wrote:
> Potentially stupid question
> 
> But is SLAB the one where all objects per cache have a fixed size and
> thus you don't have any bookkeeping overhead for the actual
> allocations?
> 
> I remember something about one of the allocation mechanisms being
> designed for caches of fixed sized objects to minimize the need for
> bookkeeping.

I wouldn't say "don't have _any_ bookkeeping", but minimizing the
bookkeeping is indeed part of the SLAB goals.

However, that is for objects that are allocated at a fixed size.
kmalloc is (currently) a thin wrapper over the slab system,
and it maps non-power-of-two allocations onto power-of-two
sized slabs.  So, for example, a string that is 18 bytes long
will be allocated out of a slab with 32-byte objects.  This
is the wastage that we're talking about here.  "Overhead" may
have been the wrong word on my part, as that may imply overhead
in the actual slab mechanisms, rather than just slop in the
data area.
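
One way to see that slop from kernel code (hypothetical snippet, not from
this thread; ksize() reports the size of the slab object that actually
backs a kmalloc() allocation):

#include <linux/slab.h>
#include <linux/printk.h>

static void show_kmalloc_slop(void)
{
	/* e.g. an 18-byte string */
	char *s = kmalloc(18, GFP_KERNEL);

	if (s) {
		/* With the default SLAB/SLUB caches this typically prints 32. */
		pr_info("requested 18, usable size %zu\n", ksize(s));
		kfree(s);
	}
}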

As an aside...

Is there a canonical glossary for memory-related terms?  What
is the correct term for the difference between what is requested
and what is actually returned by the allocator?  I've been
calling it alternately "wastage" or "overhead", but maybe there's
a more official term?

I looked here: http://www.memorymanagement.org/glossary/
but didn't find exactly what I was looking for.  The closest
things I found were "internal fragmentation" and
"padding", but those didn't seem to exactly describe
the situation here.
 -- Tim

=
Tim Bird
Architecture Group Chair, CE Workgroup of the Linux Foundation
Senior Staff Engineer, Sony Network Entertainment
=



Re: [Q] Default SLAB allocator

2012-10-17 Thread Shentino
On Wed, Oct 17, 2012 at 12:13 PM, Eric Dumazet  wrote:
> On Wed, 2012-10-17 at 11:45 -0700, Tim Bird wrote:
>
>> 8G is a small web server?  The RAM budget for Linux on one of
>> Sony's cameras was 10M.  We're not merely not in the same ballpark -
>> you're in a ballpark and I'm trimming bonsai trees... :-)
>>
>
> Even laptops in 2012 have +4GB of ram.
>
> (Maybe not Sony laptops, I have to double check ?)
>
> Yes, servers do have more ram than laptops.
>
> (Maybe not Sony servers, I have to double check ?)
>
>> > # grep Slab /proc/meminfo
>> > Slab: 351592 kB
>> >
>> > # egrep "kmalloc-32|kmalloc-16|kmalloc-8" /proc/slabinfo
>> > kmalloc-32         11332  12544     32  128    1 : tunables    0    0    0 : slabdata     98     98      0
>> > kmalloc-16          5888   5888     16  256    1 : tunables    0    0    0 : slabdata     23     23      0
>> > kmalloc-8          76563  82432      8  512    1 : tunables    0    0    0 : slabdata    161    161      0
>> >
>> > Really, some waste on these small objects is pure noise on SMP hosts.
>> In this example, it appears that if all kmalloc-8's were pushed into 32-byte 
>> slabs,
>> we'd lose about 1.8 meg due to pure slab overhead.  This would not be noise
>> on my system.
>
>
> I said :
>
> 
> I would remove small kmalloc-XX caches, as sharing a cache line
> is sometime dangerous for performance, because of false sharing.
>
> They make sense only for very small hosts
> 
>
> I think your 10M cameras are very tiny hosts.
>
> Using SLUB on them might not be the best choice.
>
> First time I ran linux, years ago, it was on 486SX machines with 8M of
> memory (or maybe less, I dont remember exactly). But I no longer use
> this class of machines with recent kernels.
>
> # size vmlinux
>     text    data     bss      dec    hex filename
> 10290631 1278976 1896448 13466055 cd79c7 vmlinux
>
>

Potentially stupid question:

But is SLAB the one where all objects per cache have a fixed size, and
thus you don't have any bookkeeping overhead for the actual
allocations?

I remember something about one of the allocation mechanisms being
designed for caches of fixed-size objects to minimize the need for
bookkeeping.


Re: [Q] Default SLAB allocator

2012-10-17 Thread Eric Dumazet
On Wed, 2012-10-17 at 11:45 -0700, Tim Bird wrote:

> 8G is a small web server?  The RAM budget for Linux on one of
> Sony's cameras was 10M.  We're not merely not in the same ballpark -
> you're in a ballpark and I'm trimming bonsai trees... :-)
> 

Even laptops in 2012 have 4GB+ of RAM.

(Maybe not Sony laptops, I have to double check ?)

Yes, servers do have more ram than laptops.

(Maybe not Sony servers, I have to double check ?)

> > # grep Slab /proc/meminfo
> > Slab: 351592 kB
> > 
> > # egrep "kmalloc-32|kmalloc-16|kmalloc-8" /proc/slabinfo 
> > kmalloc-32         11332  12544     32  128    1 : tunables    0    0    0 : slabdata     98     98      0
> > kmalloc-16          5888   5888     16  256    1 : tunables    0    0    0 : slabdata     23     23      0
> > kmalloc-8          76563  82432      8  512    1 : tunables    0    0    0 : slabdata    161    161      0
> > 
> > Really, some waste on these small objects is pure noise on SMP hosts.
> In this example, it appears that if all kmalloc-8's were pushed into 32-byte 
> slabs,
> we'd lose about 1.8 meg due to pure slab overhead.  This would not be noise
> on my system.


I said :


I would remove small kmalloc-XX caches, as sharing a cache line
is sometimes dangerous for performance, because of false sharing.

They make sense only for very small hosts


I think your 10M cameras are very tiny hosts.

Using SLUB on them might not be the best choice.

The first time I ran Linux, years ago, it was on 486SX machines with 8M of
memory (or maybe less, I don't remember exactly). But I no longer use
this class of machine with recent kernels.

# size vmlinux
    text    data     bss      dec    hex filename
10290631 1278976 1896448 13466055 cd79c7 vmlinux




Re: [Q] Default SLAB allocator

2012-10-17 Thread Tim Bird
On 10/16/2012 12:16 PM, Eric Dumazet wrote:
> On Tue, 2012-10-16 at 15:27 -0300, Ezequiel Garcia wrote:
> 
>> Yes, we have some numbers:
>>
>> http://elinux.org/Kernel_dynamic_memory_analysis#Kmalloc_objects
>>
>> Are they too informal? I can add some details...
>>
>> They've been measured on a **very** minimal setup, almost every option
>> is stripped out, except from initramfs, sysfs, and trace.
>>
>> On this scenario, strings allocated for file names and directories
>> created by sysfs
>> are quite noticeable, being 4-16 bytes, and produce a lot of fragmentation 
>> from
>> that 32 byte cache at SLAB.
>>
>> Is an option to enable small caches on SLUB and SLAB worth it?
> 
> Random small web server :
> 
> # free
>              total       used       free     shared    buffers     cached
> Mem:       7884536    5412572    2471964          0     155440    1803340
> -/+ buffers/cache:    3453792    4430744
> Swap:      2438140      51164    2386976

8G is a small web server?  The RAM budget for Linux on one of
Sony's cameras was 10M.  We're not merely not in the same ballpark -
you're in a ballpark and I'm trimming bonsai trees... :-)

> # grep Slab /proc/meminfo
> Slab: 351592 kB
> 
> # egrep "kmalloc-32|kmalloc-16|kmalloc-8" /proc/slabinfo 
> kmalloc-32         11332  12544     32  128    1 : tunables    0    0    0 : slabdata     98     98      0
> kmalloc-16          5888   5888     16  256    1 : tunables    0    0    0 : slabdata     23     23      0
> kmalloc-8          76563  82432      8  512    1 : tunables    0    0    0 : slabdata    161    161      0
> 
> Really, some waste on these small objects is pure noise on SMP hosts.
In this example, it appears that if all kmalloc-8's were pushed into
32-byte slabs, we'd lose about 1.8 meg due to pure slab overhead.  This
would not be noise on my system.
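
(Arithmetic behind that estimate, using the counts from the slabinfo
snippet above: 76563 live kmalloc-8 objects x 24 extra bytes each is
roughly 1.8 MB.)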

> (Waste on bigger objects is probably more important by orders of magnitude)

Maybe.

I need to run some measurements on systems that are more similar to what
we're deploying in products.  I'll see if I can share them.
 -- Tim

=
Tim Bird
Architecture Group Chair, CE Workgroup of the Linux Foundation
Senior Staff Engineer, Sony Network Entertainment
=


Re: [Q] Default SLAB allocator

2012-10-16 Thread Eric Dumazet
On Tue, 2012-10-16 at 15:27 -0300, Ezequiel Garcia wrote:

> Yes, we have some numbers:
> 
> http://elinux.org/Kernel_dynamic_memory_analysis#Kmalloc_objects
> 
> Are they too informal? I can add some details...
> 
> They've been measured on a **very** minimal setup, almost every option
> is stripped out, except from initramfs, sysfs, and trace.
> 
> On this scenario, strings allocated for file names and directories
> created by sysfs
> are quite noticeable, being 4-16 bytes, and produce a lot of fragmentation 
> from
> that 32 byte cache at SLAB.
> 
> Is an option to enable small caches on SLUB and SLAB worth it?

Random small web server :

# free
             total       used       free     shared    buffers     cached
Mem:       7884536    5412572    2471964          0     155440    1803340
-/+ buffers/cache:    3453792    4430744
Swap:      2438140      51164    2386976

# grep Slab /proc/meminfo
Slab: 351592 kB

# egrep "kmalloc-32|kmalloc-16|kmalloc-8" /proc/slabinfo 
kmalloc-32         11332  12544     32  128    1 : tunables    0    0    0 : slabdata     98     98      0
kmalloc-16          5888   5888     16  256    1 : tunables    0    0    0 : slabdata     23     23      0
kmalloc-8          76563  82432      8  512    1 : tunables    0    0    0 : slabdata    161    161      0

Really, some waste on these small objects is pure noise on SMP hosts.

(Waste on bigger objects is probably more important by orders of magnitude)






Re: [Q] Default SLAB allocator

2012-10-16 Thread Christoph Lameter
On Thu, 11 Oct 2012, Ezequiel Garcia wrote:

> * Is SLAB a proper choice? Or is it just historical and never
> re-evaluated?
> * Does the average embedded guy know which allocator to choose
>   and what's the impact on his platform?

My current idea on this subject is to get to a point where we have
a generic slab allocator framework that allows us to provide any
object layout we want. This would simplify handling the new slab allocators
that seem to crop up frequently. Maybe even allow the specification of the
storage layout when the slab is created. Depending on how the memory is
used, there may be different object layouts that are most advantageous.




Re: [Q] Default SLAB allocator

2012-10-16 Thread Christoph Lameter
On Tue, 16 Oct 2012, Ezequiel Garcia wrote:

> It might be worth reminding that very small systems can use SLOB
> allocator, which does not suffer from this kind of fragmentation.

Well, I have never seen non-experimental systems that use SLOB. Others
have claimed they exist.




Re: [Q] Default SLAB allocator

2012-10-16 Thread Christoph Lameter
On Mon, 15 Oct 2012, David Rientjes wrote:

> This type of workload that really exhibits the problem with remote freeing
> would suggest that the design of slub itself is the problem here.

There is a tradeoff here between spatial data locality and temporal
locality. Slub always frees to the queue associated with the slab page
that the object originated from and therefore restores spatial data
locality. It will always serve all objects available in a slab page
before moving on to the next. Within a slab page it can consider temporal
locality.

Slab considers temporal locality more important and will not return
objects to the originating slab pages until they are no longer in use. It
(ideally) will serve objects in the order they were freed. This breaks
down in the NUMA case, and the allocator got into a pretty bizarre queueing
configuration (with lots and lots of queues) as a result of our attempt to
preserve the free/alloc order per NUMA node (look at the alien caches,
for example). Slub is an alternative to that approach.

Slab also has the problem of queue handling overhead due to the attempt to
throw objects out of the queues that are likely no longer cache hot. Every
few seconds it needs to run queue cleaning through all queues that exist
on the system. How accurately it tracks the actual cache hotness of objects
is not clear.



Re: [Q] Default SLAB allocator

2012-10-16 Thread Ezequiel Garcia
On Tue, Oct 16, 2012 at 3:44 PM, Tim Bird  wrote:
> On 10/16/2012 11:27 AM, Ezequiel Garcia wrote:
>> On Tue, Oct 16, 2012 at 3:07 PM, Tim Bird  wrote:
>>> On 10/16/2012 05:56 AM, Eric Dumazet wrote:
 On Tue, 2012-10-16 at 09:35 -0300, Ezequiel Garcia wrote:

> Now, returning to the fragmentation. The problem with SLAB is that
> its smaller cache available for kmalloced objects is 32 bytes;
> while SLUB allows 8, 16, 24 ...
>
> Perhaps adding smaller caches to SLAB might make sense?
> Is there any strong reason for NOT doing this?

 I would remove small kmalloc-XX caches, as sharing a cache line
 is sometime dangerous for performance, because of false sharing.

 They make sense only for very small hosts.
>>>
>>> That's interesting...
>>>
>>> It would be good to measure the performance/size tradeoff here.
>>> I'm interested in very small systems, and it might be worth
>>> the tradeoff, depending on how bad the performance is.  Maybe
>>> a new config option would be useful (I can hear the groans now... :-)
>>>
>>> Ezequiel - do you have any measurements of how much memory
>>> is wasted by 32-byte kmalloc allocations for smaller objects,
>>> in the tests you've been doing?
>>
>> Yes, we have some numbers:
>>
>> http://elinux.org/Kernel_dynamic_memory_analysis#Kmalloc_objects
>>
>> Are they too informal? I can add some details...
>
>
>> They've been measured on a **very** minimal setup, almost every option
>> is stripped out, except from initramfs, sysfs, and trace.
>>
>> On this scenario, strings allocated for file names and directories
>> created by sysfs
>> are quite noticeable, being 4-16 bytes, and produce a lot of fragmentation 
>> from
>> that 32 byte cache at SLAB.
>
> The detail I'm interested in is the amount of wastage for a
> "common" workload, for each of the SLxB systems.  Are we talking a
> few K, or 10's or 100's of K?  It sounds like it's all from short strings.
> Are there other things using the 32-byte kmalloc cache, that waste
> a lot of memory (in aggregate) as well?
>

A more "Common" workload is one of the next items on my queue.


> Does your tool indicate a specific callsite (or small set of callsites)
> where these small allocations are made?  It sounds like it's in the filesystem
> and would be content-driven (by the length of filenames)?
>

That's right. And, IMHO, the problem is precisely that the allocation
size is content-driven.


Ezequiel


Re: [Q] Default SLAB allocator

2012-10-16 Thread Tim Bird
On 10/16/2012 11:27 AM, Ezequiel Garcia wrote:
> On Tue, Oct 16, 2012 at 3:07 PM, Tim Bird  wrote:
>> On 10/16/2012 05:56 AM, Eric Dumazet wrote:
>>> On Tue, 2012-10-16 at 09:35 -0300, Ezequiel Garcia wrote:
>>>
 Now, returning to the fragmentation. The problem with SLAB is that
 its smaller cache available for kmalloced objects is 32 bytes;
 while SLUB allows 8, 16, 24 ...

 Perhaps adding smaller caches to SLAB might make sense?
 Is there any strong reason for NOT doing this?
>>>
>>> I would remove small kmalloc-XX caches, as sharing a cache line
>>> is sometime dangerous for performance, because of false sharing.
>>>
>>> They make sense only for very small hosts.
>>
>> That's interesting...
>>
>> It would be good to measure the performance/size tradeoff here.
>> I'm interested in very small systems, and it might be worth
>> the tradeoff, depending on how bad the performance is.  Maybe
>> a new config option would be useful (I can hear the groans now... :-)
>>
>> Ezequiel - do you have any measurements of how much memory
>> is wasted by 32-byte kmalloc allocations for smaller objects,
>> in the tests you've been doing?
> 
> Yes, we have some numbers:
> 
> http://elinux.org/Kernel_dynamic_memory_analysis#Kmalloc_objects
> 
> Are they too informal? I can add some details...


> They've been measured on a **very** minimal setup, almost every option
> is stripped out, except from initramfs, sysfs, and trace.
> 
> On this scenario, strings allocated for file names and directories
> created by sysfs
> are quite noticeable, being 4-16 bytes, and produce a lot of fragmentation 
> from
> that 32 byte cache at SLAB.

The detail I'm interested in is the amount of wastage for a
"common" workload, for each of the SLxB systems.  Are we talking a
few K, or 10's or 100's of K?  It sounds like it's all from short strings.
Are there other things using the 32-byte kmalloc cache that waste
a lot of memory (in aggregate) as well?

Does your tool indicate a specific callsite (or small set of callsites)
where these small allocations are made?  It sounds like it's in the filesystem
and would be content-driven (by the length of filenames)?

This might be an issue particularly for cameras, where all the generated
filenames are 8.3 (and will be for the foreseeable future).

> Is an option to enable small caches on SLUB and SLAB worth it?
I'll have to do some measurements to see.  I'm guessing the option
itself would be pretty trivial to implement?
 -- Tim

=
Tim Bird
Architecture Group Chair, CE Workgroup of the Linux Foundation
Senior Staff Engineer, Sony Network Entertainment
=



Re: [Q] Default SLAB allocator

2012-10-16 Thread Ezequiel Garcia
On Tue, Oct 16, 2012 at 3:07 PM, Tim Bird  wrote:
> On 10/16/2012 05:56 AM, Eric Dumazet wrote:
>> On Tue, 2012-10-16 at 09:35 -0300, Ezequiel Garcia wrote:
>>
>>> Now, returning to the fragmentation. The problem with SLAB is that
>>> its smaller cache available for kmalloced objects is 32 bytes;
>>> while SLUB allows 8, 16, 24 ...
>>>
>>> Perhaps adding smaller caches to SLAB might make sense?
>>> Is there any strong reason for NOT doing this?
>>
>> I would remove small kmalloc-XX caches, as sharing a cache line
>> is sometime dangerous for performance, because of false sharing.
>>
>> They make sense only for very small hosts.
>
> That's interesting...
>
> It would be good to measure the performance/size tradeoff here.
> I'm interested in very small systems, and it might be worth
> the tradeoff, depending on how bad the performance is.  Maybe
> a new config option would be useful (I can hear the groans now... :-)
>

It might be worth remembering that very small systems can use the SLOB
allocator, which does not suffer from this kind of fragmentation.

Ezequiel


Re: [Q] Default SLAB allocator

2012-10-16 Thread Ezequiel Garcia
On Tue, Oct 16, 2012 at 3:07 PM, Tim Bird  wrote:
> On 10/16/2012 05:56 AM, Eric Dumazet wrote:
>> On Tue, 2012-10-16 at 09:35 -0300, Ezequiel Garcia wrote:
>>
>>> Now, returning to the fragmentation. The problem with SLAB is that
>>> its smaller cache available for kmalloced objects is 32 bytes;
>>> while SLUB allows 8, 16, 24 ...
>>>
>>> Perhaps adding smaller caches to SLAB might make sense?
>>> Is there any strong reason for NOT doing this?
>>
>> I would remove small kmalloc-XX caches, as sharing a cache line
>> is sometime dangerous for performance, because of false sharing.
>>
>> They make sense only for very small hosts.
>
> That's interesting...
>
> It would be good to measure the performance/size tradeoff here.
> I'm interested in very small systems, and it might be worth
> the tradeoff, depending on how bad the performance is.  Maybe
> a new config option would be useful (I can hear the groans now... :-)
>
> Ezequiel - do you have any measurements of how much memory
> is wasted by 32-byte kmalloc allocations for smaller objects,
> in the tests you've been doing?

Yes, we have some numbers:

http://elinux.org/Kernel_dynamic_memory_analysis#Kmalloc_objects

Are they too informal? I can add some details...

They've been measured on a **very** minimal setup; almost every option
is stripped out, except for initramfs, sysfs, and trace.

In this scenario, strings allocated for file names and directories
created by sysfs are quite noticeable, being 4-16 bytes, and they produce
a lot of fragmentation from that 32-byte cache in SLAB.

Is an option to enable small caches on SLUB and SLAB worth it?

Ezequiel


Re: [Q] Default SLAB allocator

2012-10-16 Thread Tim Bird
On 10/16/2012 05:56 AM, Eric Dumazet wrote:
> On Tue, 2012-10-16 at 09:35 -0300, Ezequiel Garcia wrote:
> 
>> Now, returning to the fragmentation. The problem with SLAB is that
>> its smaller cache available for kmalloced objects is 32 bytes;
>> while SLUB allows 8, 16, 24 ...
>>
>> Perhaps adding smaller caches to SLAB might make sense?
>> Is there any strong reason for NOT doing this?
> 
> I would remove small kmalloc-XX caches, as sharing a cache line
> is sometime dangerous for performance, because of false sharing.
> 
> They make sense only for very small hosts.

That's interesting...

It would be good to measure the performance/size tradeoff here.
I'm interested in very small systems, and it might be worth
the tradeoff, depending on how bad the performance is.  Maybe
a new config option would be useful (I can hear the groans now... :-)

Ezequiel - do you have any measurements of how much memory
is wasted by 32-byte kmalloc allocations for smaller objects,
in the tests you've been doing?
 -- Tim


=
Tim Bird
Architecture Group Chair, CE Workgroup of the Linux Foundation
Senior Staff Engineer, Sony Network Entertainment
=



Re: [Q] Default SLAB allocator

2012-10-16 Thread Eric Dumazet
On Tue, 2012-10-16 at 09:35 -0300, Ezequiel Garcia wrote:

> Now, returning to the fragmentation. The problem with SLAB is that
> its smaller cache available for kmalloced objects is 32 bytes;
> while SLUB allows 8, 16, 24 ...
> 
> Perhaps adding smaller caches to SLAB might make sense?
> Is there any strong reason for NOT doing this?

I would remove small kmalloc-XX caches, as sharing a cache line
is sometimes dangerous for performance, because of false sharing.

They make sense only for very small hosts.
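
To illustrate the false-sharing risk (a made-up sketch, not code from this
thread):

#include <linux/slab.h>
#include <linux/types.h>

/* With a kmalloc-8 cache, eight consecutive 8-byte objects can sit in
 * one 64-byte cache line.  If two of them are hot counters updated by
 * different CPUs, every update bounces the shared line between those
 * CPUs (false sharing), even though the objects are unrelated. */
struct tiny_counter {
	u64 hits;			/* 8 bytes -> kmalloc-8 */
};

static struct tiny_counter *new_counter(void)
{
	/* Two counters allocated back to back may share a cache line;
	 * larger or cache-line-aligned objects make that less likely,
	 * at the cost of per-object waste. */
	return kmalloc(sizeof(struct tiny_counter), GFP_KERNEL);
}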





Re: [Q] Default SLAB allocator

2012-10-16 Thread Ezequiel Garcia
David,

On Mon, Oct 15, 2012 at 9:46 PM, David Rientjes  wrote:
> On Sat, 13 Oct 2012, Ezequiel Garcia wrote:
>
>> But SLAB suffers from a lot more internal fragmentation than SLUB,
>> which I guess is a known fact. So memory-constrained devices
>> would waste more memory by using SLAB.
>
> Even with slub's per-cpu partial lists?

I'm not considering that, but rather plain fragmentation: the difference
between requested and allocated, per object.
Admittedly, perhaps this is a naive analysis.

However, devices where this matters would have only one cpu, right?
So the overhead imposed by per-cpu data shouldn't have that much impact.

Studying the difference in overhead imposed by the allocators is
something that's still on my TODO.

Now, returning to the fragmentation. The problem with SLAB is that
its smallest cache available for kmalloc'ed objects is 32 bytes,
while SLUB allows 8, 16, 24 ...
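
(For example, under that per-object view a 10-byte allocation wastes 22
bytes in SLAB's kmalloc-32 cache, but only 6 bytes in SLUB's kmalloc-16.)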

Perhaps adding smaller caches to SLAB might make sense?
Is there any strong reason for NOT doing this?

Thanks,

Ezequiel


Re: [Q] Default SLAB allocator

2012-10-16 Thread Eric Dumazet
On Tue, 2012-10-16 at 10:28 +0900, JoonSoo Kim wrote:
> Hello, Eric.
> 
> 2012/10/14 Eric Dumazet :
> > SLUB was really bad in the common workload you describe (allocations
> > done by one cpu, freeing done by other cpus), because all kfree() hit
> > the slow path and cpus contend in __slab_free() in the loop guarded by
> > cmpxchg_double_slab(). SLAB has a cache for this, while SLUB directly
> > hit the main "struct page" to add the freed object to freelist.
> 
> Could you elaborate more on how 'netperf RR' makes kernel "allocations
> done by one cpu, freeing done by other cpus", please?
> I don't have enough background in the network subsystem, so I'm just curious.
> 

Common network load is to have one cpu A handling device interrupts
doing the memory allocations to hold incoming frames,
and queueing skbs to various sockets.

These sockets are read by other cpus (if the cpu A is fully used to
service softirqs under high load), so the kfree() are done by other
cpus.

Each incoming frame uses one sk_buff, allocated from skbuff_head_cache
kmemcache (256 bytes on x86_64)

# ls -l /sys/kernel/slab/skbuff_head_cache
lrwxrwxrwx 1 root root 0 oct.  16
08:50 /sys/kernel/slab/skbuff_head_cache -> :t-256

# cat /sys/kernel/slab/skbuff_head_cache/objs_per_slab 
32

On a configuration with 24 cpus and one cpu servicing network, we may
have 23 cpus doing the frees roughly at the same time, all competing in 
__slab_free() on the same page. This contention increases if we increase
the slub page order (as recommended by SLUB hackers).

To reproduce this kind of workload without a real NIC, we probably need
some test module, using one thread doing allocations, and other threads
doing the free.
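
A minimal sketch of such a test module could look like this (purely
illustrative, not an existing in-tree test; all names are made up and
error handling is trimmed).  One kthread pinned to CPU 0 does the
allocations, the way a cpu servicing NIC interrupts would, and a few
other kthreads do nothing but free, so every kmem_cache_free() takes
the remote-free path:

/* remote_free_test.c -- hypothetical allocation/remote-free stress module */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/llist.h>
#include <linux/atomic.h>
#include <linux/cpumask.h>

struct item {
	struct llist_node node;
	char payload[240];		/* roughly skbuff_head_cache sized */
};

static struct kmem_cache *test_cache;
static LLIST_HEAD(pool);		/* producer pushes, consumers pop */
static atomic_t in_flight = ATOMIC_INIT(0);
static struct task_struct *alloc_task;
static struct task_struct *free_task[3];

static int alloc_fn(void *unused)	/* runs on cpu 0 only */
{
	while (!kthread_should_stop()) {
		struct item *it;

		if (atomic_read(&in_flight) < 10000) {	/* crude throttle */
			it = kmem_cache_alloc(test_cache, GFP_KERNEL);
			if (it) {
				llist_add(&it->node, &pool);
				atomic_inc(&in_flight);
			}
		}
		cond_resched();
	}
	return 0;
}

static int free_fn(void *unused)	/* runs on the other cpus */
{
	while (!kthread_should_stop()) {
		struct llist_node *pos = llist_del_all(&pool);

		while (pos) {
			struct item *it = llist_entry(pos, struct item, node);

			pos = pos->next;
			kmem_cache_free(test_cache, it);	/* remote free */
			atomic_dec(&in_flight);
		}
		cond_resched();
	}
	return 0;
}

static int __init remote_free_init(void)
{
	int i;

	test_cache = kmem_cache_create("remote_free_test",
				       sizeof(struct item), 0, 0, NULL);
	if (!test_cache)
		return -ENOMEM;

	/* allocator pinned to cpu 0, like a cpu servicing NIC interrupts */
	alloc_task = kthread_create(alloc_fn, NULL, "rf_alloc");
	kthread_bind(alloc_task, 0);
	wake_up_process(alloc_task);

	/* freeing threads spread over the other cpus (all cpu 0 on a UP box) */
	for (i = 0; i < ARRAY_SIZE(free_task); i++) {
		free_task[i] = kthread_create(free_fn, NULL, "rf_free/%d", i);
		kthread_bind(free_task[i], (i + 1) % num_online_cpus());
		wake_up_process(free_task[i]);
	}
	return 0;
}

static void __exit remote_free_exit(void)
{
	struct llist_node *pos;
	int i;

	kthread_stop(alloc_task);
	for (i = 0; i < ARRAY_SIZE(free_task); i++)
		kthread_stop(free_task[i]);

	pos = llist_del_all(&pool);	/* drain whatever is left */
	while (pos) {
		struct item *it = llist_entry(pos, struct item, node);

		pos = pos->next;
		kmem_cache_free(test_cache, it);
	}
	kmem_cache_destroy(test_cache);
}

module_init(remote_free_init);
module_exit(remote_free_exit);
MODULE_LICENSE("GPL");

With CONFIG_SLUB_STATS enabled, the free_slowpath counter for this cache
under /sys/kernel/slab/ should then climb on the freeing cpus, which is
exactly the contention described above.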

> > I played some months ago adding a percpu associative cache to SLUB, then
> > just moved on other strategy.
> >
> > (Idea for this per cpu cache was to build a temporary free list of
> > objects to batch accesses to struct page)
> 
> Is this implemented and submitted?
> If it is, could you tell me the link for the patches?

It was implemented in February but not submitted at that time.

The following rebase probably has some issues with slab debug, but seems
to work.

 include/linux/slub_def.h |   22 ++
 mm/slub.c|  127 +++--
 2 files changed, 131 insertions(+), 18 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index df448ad..9e5b91c 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -41,8 +41,30 @@ enum stat_item {
CPU_PARTIAL_FREE,   /* Refill cpu partial on free */
CPU_PARTIAL_NODE,   /* Refill cpu partial from node partial */
CPU_PARTIAL_DRAIN,  /* Drain cpu partial to node partial */
+   FREE_CACHED,/* free delayed in secondary freelist, 
cumulative counter */
+   FREE_CACHED_ITEMS,  /* items in victim cache */
NR_SLUB_STAT_ITEMS };
 
+/**
+ * struct slub_cache_desc - victim cache descriptor 
+ * @page: slab page
+ * @objects_head: head of freed objects list
+ * @objects_tail: tail of freed objects list
+ * @count: number of objects in list
+ *
+ * freed objects in slow path are managed into an associative cache,
+ * to reduce contention on @page->freelist
+ */
+struct slub_cache_desc {
+   struct page *page;
+   void**objects_head;
+   void**objects_tail;
+   int count;
+};
+
+#define NR_SLUB_PCPU_CACHE_SHIFT 6
+#define NR_SLUB_PCPU_CACHE (1 << NR_SLUB_PCPU_CACHE_SHIFT)
+
 struct kmem_cache_cpu {
void **freelist;/* Pointer to next available object */
unsigned long tid;  /* Globally unique transaction id */
diff --git a/mm/slub.c b/mm/slub.c
index a0d6984..30a6d72 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -221,6 +222,14 @@ static inline void stat(const struct kmem_cache *s, enum 
stat_item si)
 #endif
 }
 
+static inline void stat_add(const struct kmem_cache *s, enum stat_item si,
+   int cnt)
+{
+#ifdef CONFIG_SLUB_STATS
+   __this_cpu_add(s->cpu_slab->stat[si], cnt);
+#endif
+}
+
 /
  * Core slab cache functions
  ***/
@@ -1993,6 +2002,8 @@ static inline void flush_slab(struct kmem_cache *s, 
struct kmem_cache_cpu *c)
c->freelist = NULL;
 }
 
+static void victim_cache_flush(struct kmem_cache *s, int cpu);
+
 /*
  * Flush cpu slab.
  *
@@ -2006,6 +2017,7 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, 
int cpu)
if (c->page)
flush_slab(s, c);
 
+   victim_cache_flush(s, cpu);
unfreeze_partials(s);
}
 }
@@ -2446,38 +2458,34 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
 #endif
 
 /*
- * Slow patch handling. This may still be called 
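
Purely as an illustration of the idea above (this is not the missing part
of the patch; the victim[] array in kmem_cache_cpu and free_one_batch()
are made-up names), the per-cpu associative cache could be used from the
free slow path roughly like this:

/* caller is assumed to have preemption disabled */
static void cache_remote_free(struct kmem_cache *s, struct page *page,
			      void *object)
{
	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
	/* pick one of the NR_SLUB_PCPU_CACHE buckets by hashing the page */
	struct slub_cache_desc *desc =
		&c->victim[hash_ptr(page, NR_SLUB_PCPU_CACHE_SHIFT)];

	if (desc->page != page) {
		/*
		 * Bucket holds pending frees for another slab page: flush
		 * them with a single cmpxchg_double_slab() for the whole
		 * batch, then start a new batch for this page.
		 */
		if (desc->page)
			free_one_batch(s, desc);	/* hypothetical helper */
		desc->page = page;
		desc->objects_head = desc->objects_tail = object;
		desc->count = 1;
	} else {
		/* same page: just chain the object onto the pending batch */
		set_freepointer(s, desc->objects_tail, object);
		desc->objects_tail = object;
		desc->count++;
	}
	stat(s, FREE_CACHED);
}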


Re: [Q] Default SLAB allocator

2012-10-16 Thread Ezequiel Garcia
On Tue, Oct 16, 2012 at 3:07 PM, Tim Bird <tim.b...@am.sony.com> wrote:
> On 10/16/2012 05:56 AM, Eric Dumazet wrote:
>> On Tue, 2012-10-16 at 09:35 -0300, Ezequiel Garcia wrote:
>>
>>> Now, returning to the fragmentation. The problem with SLAB is that
>>> its smallest cache available for kmalloc'ed objects is 32 bytes;
>>> while SLUB allows 8, 16, 24 ...
>>>
>>> Perhaps adding smaller caches to SLAB might make sense?
>>> Is there any strong reason for NOT doing this?
>>
>> I would remove small kmalloc-XX caches, as sharing a cache line
>> is sometimes dangerous for performance, because of false sharing.
>>
>> They make sense only for very small hosts.
>
> That's interesting...
>
> It would be good to measure the performance/size tradeoff here.
> I'm interested in very small systems, and it might be worth
> the tradeoff, depending on how bad the performance is.  Maybe
> a new config option would be useful (I can hear the groans now... :-)
>
> Ezequiel - do you have any measurements of how much memory
> is wasted by 32-byte kmalloc allocations for smaller objects,
> in the tests you've been doing?

Yes, we have some numbers:

http://elinux.org/Kernel_dynamic_memory_analysis#Kmalloc_objects

Are they too informal? I can add some details...

They've been measured on a **very** minimal setup, almost every option
is stripped out, except for initramfs, sysfs, and trace.

In this scenario, strings allocated for file names and directories
created by sysfs are quite noticeable: being 4-16 bytes, they produce
a lot of fragmentation from that 32-byte cache in SLAB.

Is an option to enable small caches on SLUB and SLAB worth it?

Ezequiel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-16 Thread Ezequiel Garcia
On Tue, Oct 16, 2012 at 3:07 PM, Tim Bird <tim.b...@am.sony.com> wrote:
> On 10/16/2012 05:56 AM, Eric Dumazet wrote:
>> On Tue, 2012-10-16 at 09:35 -0300, Ezequiel Garcia wrote:
>>
>>> Now, returning to the fragmentation. The problem with SLAB is that
>>> its smallest cache available for kmalloc'ed objects is 32 bytes;
>>> while SLUB allows 8, 16, 24 ...
>>>
>>> Perhaps adding smaller caches to SLAB might make sense?
>>> Is there any strong reason for NOT doing this?
>>
>> I would remove small kmalloc-XX caches, as sharing a cache line
>> is sometimes dangerous for performance, because of false sharing.
>>
>> They make sense only for very small hosts.
>
> That's interesting...
>
> It would be good to measure the performance/size tradeoff here.
> I'm interested in very small systems, and it might be worth
> the tradeoff, depending on how bad the performance is.  Maybe
> a new config option would be useful (I can hear the groans now... :-)

It might be worth remembering that very small systems can use the SLOB
allocator, which does not suffer from this kind of fragmentation.

Ezequiel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-16 Thread Tim Bird
On 10/16/2012 11:27 AM, Ezequiel Garcia wrote:
> On Tue, Oct 16, 2012 at 3:07 PM, Tim Bird <tim.b...@am.sony.com> wrote:
>> On 10/16/2012 05:56 AM, Eric Dumazet wrote:
>>> On Tue, 2012-10-16 at 09:35 -0300, Ezequiel Garcia wrote:
>>>
>>>> Now, returning to the fragmentation. The problem with SLAB is that
>>>> its smallest cache available for kmalloc'ed objects is 32 bytes;
>>>> while SLUB allows 8, 16, 24 ...
>>>>
>>>> Perhaps adding smaller caches to SLAB might make sense?
>>>> Is there any strong reason for NOT doing this?
>>>
>>> I would remove small kmalloc-XX caches, as sharing a cache line
>>> is sometimes dangerous for performance, because of false sharing.
>>>
>>> They make sense only for very small hosts.
>>
>> That's interesting...
>>
>> It would be good to measure the performance/size tradeoff here.
>> I'm interested in very small systems, and it might be worth
>> the tradeoff, depending on how bad the performance is.  Maybe
>> a new config option would be useful (I can hear the groans now... :-)
>>
>> Ezequiel - do you have any measurements of how much memory
>> is wasted by 32-byte kmalloc allocations for smaller objects,
>> in the tests you've been doing?
>
> Yes, we have some numbers:
>
> http://elinux.org/Kernel_dynamic_memory_analysis#Kmalloc_objects
>
> Are they too informal? I can add some details...
>
> They've been measured on a **very** minimal setup, almost every option
> is stripped out, except for initramfs, sysfs, and trace.
>
> In this scenario, strings allocated for file names and directories
> created by sysfs are quite noticeable: being 4-16 bytes, they produce
> a lot of fragmentation from that 32-byte cache in SLAB.

The detail I'm interested in is the amount of wastage for a
common workload, for each of the SLxB systems.  Are we talking a
few K, or 10's or 100's of K?  It sounds like it's all from short strings.
Are there other things using the 32-byte kmalloc cache, that waste
a lot of memory (in aggregate) as well?

Does your tool indicate a specific callsite (or small set of callsites)
where these small allocations are made?  It sounds like it's in the filesystem
and would be content-driven (by the length of filenames)?

This might be an issue particularly for cameras, where all the generated
filenames are 8.3 (and will be for the foreseeable future).

> Is an option to enable small caches on SLUB and SLAB worth it?

I'll have to do some measurements to see.  I'm guessing the option
itself would be pretty trivial to implement?
 -- Tim

=
Tim Bird
Architecture Group Chair, CE Workgroup of the Linux Foundation
Senior Staff Engineer, Sony Network Entertainment
=

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-16 Thread Ezequiel Garcia
On Tue, Oct 16, 2012 at 3:44 PM, Tim Bird <tim.b...@am.sony.com> wrote:
> On 10/16/2012 11:27 AM, Ezequiel Garcia wrote:
>> On Tue, Oct 16, 2012 at 3:07 PM, Tim Bird <tim.b...@am.sony.com> wrote:
>>> On 10/16/2012 05:56 AM, Eric Dumazet wrote:
>>>> On Tue, 2012-10-16 at 09:35 -0300, Ezequiel Garcia wrote:
>>>>
>>>>> Now, returning to the fragmentation. The problem with SLAB is that
>>>>> its smallest cache available for kmalloc'ed objects is 32 bytes;
>>>>> while SLUB allows 8, 16, 24 ...
>>>>>
>>>>> Perhaps adding smaller caches to SLAB might make sense?
>>>>> Is there any strong reason for NOT doing this?
>>>>
>>>> I would remove small kmalloc-XX caches, as sharing a cache line
>>>> is sometimes dangerous for performance, because of false sharing.
>>>>
>>>> They make sense only for very small hosts.
>>>
>>> That's interesting...
>>>
>>> It would be good to measure the performance/size tradeoff here.
>>> I'm interested in very small systems, and it might be worth
>>> the tradeoff, depending on how bad the performance is.  Maybe
>>> a new config option would be useful (I can hear the groans now... :-)
>>>
>>> Ezequiel - do you have any measurements of how much memory
>>> is wasted by 32-byte kmalloc allocations for smaller objects,
>>> in the tests you've been doing?
>>
>> Yes, we have some numbers:
>>
>> http://elinux.org/Kernel_dynamic_memory_analysis#Kmalloc_objects
>>
>> Are they too informal? I can add some details...
>>
>> They've been measured on a **very** minimal setup, almost every option
>> is stripped out, except for initramfs, sysfs, and trace.
>>
>> In this scenario, strings allocated for file names and directories
>> created by sysfs are quite noticeable: being 4-16 bytes, they produce
>> a lot of fragmentation from that 32-byte cache in SLAB.
>
> The detail I'm interested in is the amount of wastage for a
> common workload, for each of the SLxB systems.  Are we talking a
> few K, or 10's or 100's of K?  It sounds like it's all from short strings.
> Are there other things using the 32-byte kmalloc cache, that waste
> a lot of memory (in aggregate) as well?

A more common workload is one of the next items on my queue.

> Does your tool indicate a specific callsite (or small set of callsites)
> where these small allocations are made?  It sounds like it's in the filesystem
> and would be content-driven (by the length of filenames)?


That's right. And, IMHO, the problem is precisely that the allocation
size is content-driven.


Ezequiel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-16 Thread Christoph Lameter
On Mon, 15 Oct 2012, David Rientjes wrote:

> This type of workload, which really exhibits the problem with remote freeing,
> suggests that the design of slub itself is the problem here.

There is a tradeoff here between spatial data locality and temporal
locality. Slub always frees to the queue associated with the slab page
that the object originated from and therefore restores spatial data
locality. It will always serve all objects available in a slab page
before moving onto the next. Within a slab page it can consider temporal
locality.

Slab considers temporal locality more important and will not return
objects to the originating slab pages until they are no longer in use. It
(ideally) will serve objects in the order they were freed. This breaks
down in the NUMA case, and the allocator got into a pretty bizarre queueing
configuration (with lots and lots of queues) as a result of our attempt to
preserve the free/alloc order per NUMA node (look at the alien caches,
f.e.). Slub is an alternative to that approach.

Slab also has the problem of queue handling overhead due to the attempt to
throw objects out of the queues that are likely no longer cache hot. Every
few seconds it needs to run queue cleaning through all the queues that exist
on the system. How accurately it tracks the actual cache hotness of objects
is not clear.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-16 Thread Christoph Lameter
On Tue, 16 Oct 2012, Ezequiel Garcia wrote:

> It might be worth remembering that very small systems can use the SLOB
> allocator, which does not suffer from this kind of fragmentation.

Well, I have never seen non-experimental systems that use SLOB. Others
have claimed they exist.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-16 Thread Christoph Lameter
On Thu, 11 Oct 2012, Ezequiel Garcia wrote:

> * Is SLAB a proper choice? Or is it just historical and never been
>   re-evaluated?
> * Does the average embedded guy know which allocator to choose
>   and what's the impact on his platform?

My current thinking on this subject is to get to a point where we have
a generic slab allocator framework that allows us to provide any
object layout we want. This will simplify handling the new slab allocators
that seem to crop up frequently. Maybe even allow the specification of the
storage layout when the slab is created. Depending on how the memory is
used, there may be different object layouts that are most advantageous.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-16 Thread Eric Dumazet
On Tue, 2012-10-16 at 15:27 -0300, Ezequiel Garcia wrote:

> Yes, we have some numbers:
>
> http://elinux.org/Kernel_dynamic_memory_analysis#Kmalloc_objects
>
> Are they too informal? I can add some details...
>
> They've been measured on a **very** minimal setup, almost every option
> is stripped out, except for initramfs, sysfs, and trace.
>
> In this scenario, strings allocated for file names and directories
> created by sysfs are quite noticeable: being 4-16 bytes, they produce
> a lot of fragmentation from that 32-byte cache in SLAB.
>
> Is an option to enable small caches on SLUB and SLAB worth it?

Random small web server:

# free
             total       used       free     shared    buffers     cached
Mem:       7884536    5412572    2471964          0     155440    1803340
-/+ buffers/cache:    3453792    4430744
Swap:      2438140      51164    2386976

# grep Slab /proc/meminfo
Slab:           351592 kB

# egrep "kmalloc-32|kmalloc-16|kmalloc-8" /proc/slabinfo
kmalloc-32         11332  12544     32  128    1 : tunables    0    0    0 : slabdata     98     98      0
kmalloc-16          5888   5888     16  256    1 : tunables    0    0    0 : slabdata     23     23      0
kmalloc-8          76563  82432      8  512    1 : tunables    0    0    0 : slabdata    161    161      0

Really, some waste on these small objects is pure noise on SMP hosts.

(Waste on bigger objects is probably more important by orders of magnitude)
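
Doing the arithmetic on those numbers (assuming 4 KiB pages, one page per
slab as shown): kmalloc-32 occupies 98 slabs = ~392 KiB, kmalloc-16
occupies 23 slabs = ~92 KiB, and kmalloc-8 occupies 161 slabs = ~644 KiB.
Even if every byte in those three caches were pure waste, that is about
1.1 MiB out of 351592 kB of total slab memory here -- roughly 0.3%.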




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-15 Thread JoonSoo Kim
Hello, Eric.

2012/10/14 Eric Dumazet :
> SLUB was really bad in the common workload you describe (allocations
> done by one cpu, freeing done by other cpus), because all kfree() hit
> the slow path and cpus contend in __slab_free() in the loop guarded by
> cmpxchg_double_slab(). SLAB has a cache for this, while SLUB directly
> hit the main "struct page" to add the freed object to freelist.

Could you elaborate more on how 'netperf RR' makes kernel "allocations
done by one cpu, freeing done by other cpus", please?
I don't have enough background in the network subsystem, so I'm just curious.

> I played some months ago adding a percpu associative cache to SLUB, then
> just moved on other strategy.
>
> (Idea for this per cpu cache was to build a temporary free list of
> objects to batch accesses to struct page)

Is this implemented and submitted?
If it is, could you tell me the link for the patches?

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-15 Thread David Rientjes
On Sat, 13 Oct 2012, Ezequiel Garcia wrote:

> But SLAB suffers from a lot more internal fragmentation than SLUB,
> which I guess is a known fact. So memory-constrained devices
> would waste more memory by using SLAB.

Even with slub's per-cpu partial lists?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-15 Thread David Rientjes
On Sat, 13 Oct 2012, David Rientjes wrote:

> This was in August when preparing for LinuxCon, I tested netperf TCP_RR on 
> two 64GB machines (one client, one server), four nodes each, with thread 
> counts in multiples of the number of cores.  SLUB does a comparable job, 
> but once we have the number of threads equal to three times the number 
> of cores, it degrades almost linearly.  I'll run it again next week and 
> get some numbers on 3.6.
> 

On 3.6, I tested CONFIG_SLAB (no CONFIG_DEBUG_SLAB) vs.
CONFIG_SLUB and CONFIG_SLUB_DEBUG (no CONFIG_SLUB_DEBUG_ON or 
CONFIG_SLUB_STATS), which are the defconfigs for both allocators.

Using netperf-2.4.5 and two machines both with 16 cores (4 cores/node) and 
32GB of memory each (one client, one netserver), here are the results:

threads    SLAB    SLUB
 16 115408  114477 (-0.8%)
 32 214664  209582 (-2.4%)
 48 297414  290552 (-2.3%)
 64 372207  360177 (-3.2%)
 80 435872  421674 (-3.3%)
 96 490927  472547 (-3.7%)
112 543685  522593 (-3.9%)
128 586026  564078 (-3.7%)
144 630320  604681 (-4.1%)
160 671953  639643 (-4.8%)

It seems that slub has improved because of the per-cpu partial lists, 
which truly makes the "unqueued" allocator queued, by significantly 
increasing the amount of memory that the allocator uses.  However, the 
netperf benchmark still regresses significantly and is still a non-
starter for us.

This type of workload, which really exhibits the problem with remote freeing,
suggests that the design of slub itself is the problem here.
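
(For reference: a TCP_RR run of this kind is typically invoked as
something like "netperf -H <server> -t TCP_RR -l 60" per client thread;
the exact options used here aren't stated, so treat that only as an
illustration of the request/response pattern being measured.)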
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-13 Thread Eric Dumazet
On Sat, 2012-10-13 at 02:51 -0700, David Rientjes wrote:
> On Thu, 11 Oct 2012, Andi Kleen wrote:
> 
> > When did you last test? Our regressions had disappeared a few kernels
> > ago.
> > 
> 
> This was in August when preparing for LinuxCon, I tested netperf TCP_RR on 
> two 64GB machines (one client, one server), four nodes each, with thread 
> counts in multiples of the number of cores.  SLUB does a comparable job, 
> but once we have the number of threads equal to three times the number 
> of cores, it degrades almost linearly.  I'll run it again next week and 
> get some numbers on 3.6.

In latest kernels, skb->head no longer use kmalloc()/kfree(), so SLAB vs
SLUB is less a concern for network loads.

In 3.7, (commit 69b08f62e17) we use fragments of order-3 pages to
populate skb->head.
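
A much simplified sketch of the page-fragment idea (not the actual
net/core code, just an illustration; frag_cache/frag_alloc are made-up
names): a per-cpu order-3 page is carved into chunks, and each chunk
holds a page reference, so the compound page only goes back to the buddy
allocator once every fragment has been released:

struct frag_cache {
	struct page *page;	/* current order-3 (32KB) page */
	unsigned int offset;	/* next free byte inside it */
};

/* size is assumed to be much smaller than 32KB */
static void *frag_alloc(struct frag_cache *fc, unsigned int size)
{
	if (!fc->page || fc->offset + size > (PAGE_SIZE << 3)) {
		if (fc->page)
			put_page(fc->page);	/* drop the cache's own reference */
		fc->page = alloc_pages(GFP_ATOMIC | __GFP_COMP, 3);
		if (!fc->page)
			return NULL;
		fc->offset = 0;
	}
	get_page(fc->page);		/* reference owned by the new fragment */
	fc->offset += size;
	return page_address(fc->page) + fc->offset - size;
}

A fragment is later released with put_page(virt_to_head_page(ptr)), which
is why the page is allocated as a compound (__GFP_COMP) page; no kmalloc()
or kfree() is involved on this path at all.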

SLUB was really bad in the common workload you describe (allocations
done by one cpu, freeing done by other cpus), because all kfree() hit
the slow path and cpus contend in __slab_free() in the loop guarded by
cmpxchg_double_slab(). SLAB has a cache for this, while SLUB directly
hit the main "struct page" to add the freed object to freelist.

I played some months ago adding a percpu associative cache to SLUB, then
just moved on other strategy.

(Idea for this per cpu cache was to build a temporary free list of
objects to batch accesses to struct page)



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-13 Thread Ezequiel Garcia
Hi David,

On Sat, Oct 13, 2012 at 6:54 AM, David Rientjes  wrote:
> On Fri, 12 Oct 2012, Ezequiel Garcia wrote:
>
>> >> SLUB is a non-starter for us and incurs a >10% performance degradation in
>> >> netperf TCP_RR.
>> >
>>
>> Where are you seeing that?
>>
>
> In my benchmarking results.
>
>> Notice that many defconfigs are for embedded devices,
>> and many of them say "use SLAB"; I wonder if that's right.
>>
>
> If a device doesn't require the smallest memory footprint possible (SLOB)
> then SLAB is the right choice when there's a limited amount of memory;
> SLUB requires higher order pages for the best performance (on my desktop
> system running with CONFIG_SLUB, over 50% of the slab caches default to be
> high order).
>

But SLAB suffers from a lot more internal fragmentation than SLUB,
which I guess is a known fact. So memory-constrained devices
would waste more memory by using SLAB.
I must admit I didn't look at page order (but I will now).


>> Is there any intention to replace SLAB by SLUB?
>
> There may be an intent, but it'll be nacked as long as there's a
> performance degradation.
>
>> In that case it could make sense to change defconfigs, although
>> it wouldn't be based on any actual tests.
>>
>
> Um, you can't just go changing defconfigs without doing some due diligence
> in ensuring it won't be deterimental for those users.

Yeah, it would be very interesting to compare the slab allocators on at
least some of those platforms.


Ezequiel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-13 Thread David Rientjes
On Fri, 12 Oct 2012, Ezequiel Garcia wrote:

> >> SLUB is a non-starter for us and incurs a >10% performance degradation in
> >> netperf TCP_RR.
> >
> 
> Where are you seeing that?
> 

In my benchmarking results.

> Notice that many defconfigs are for embedded devices,
> and many of them say "use SLAB"; I wonder if that's right.
> 

If a device doesn't require the smallest memory footprint possible (SLOB) 
then SLAB is the right choice when there's a limited amount of memory; 
SLUB requires higher order pages for the best performance (on my desktop 
system running with CONFIG_SLUB, over 50% of the slab caches default to be 
high order).

> Is there any intention to replace SLAB by SLUB?

There may be an intent, but it'll be nacked as long as there's a 
performance degradation.

> In that case it could make sense to change defconfigs, although
> it wouldn't be based on any actual tests.
> 

Um, you can't just go changing defconfigs without doing some due diligence 
in ensuring it won't be detrimental for those users.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-13 Thread David Rientjes
On Thu, 11 Oct 2012, Andi Kleen wrote:

> When did you last test? Our regressions had disappeared a few kernels
> ago.
> 

This was in August when preparing for LinuxCon, I tested netperf TCP_RR on 
two 64GB machines (one client, one server), four nodes each, with thread 
counts in multiples of the number of cores.  SLUB does a comparable job, 
but once we have the number of threads equal to three times the number 
of cores, it degrades almost linearly.  I'll run it again next week and 
get some numbers on 3.6.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-12 Thread Ezequiel Garcia
Hi,

On Thu, Oct 11, 2012 at 8:10 PM, Andi Kleen  wrote:
> David Rientjes  writes:
>
>> On Thu, 11 Oct 2012, Andi Kleen wrote:
>>
>>> > While I've always thought SLUB was the default and recommended allocator,
>>> > I'm surprised to find that it's not always the case:
>>>
>>> iirc the main performance reasons for slab over slub have mostly
>>> disappeared, so in theory slab could be finally deprecated now.
>>>
>>
>> SLUB is a non-starter for us and incurs a >10% performance degradation in
>> netperf TCP_RR.
>

Where are you seeing that?

Notice that many defconfigs are for embedded devices,
and many of them say "use SLAB"; I wonder if that's right.

Is there any intention to replace SLAB by SLUB?
In that case it could make sense to change defconfigs, although
it wouldn't be based on any actual tests.

Ezequiel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-11 Thread Andi Kleen
David Rientjes  writes:

> On Thu, 11 Oct 2012, Andi Kleen wrote:
>
>> > While I've always thought SLUB was the default and recommended allocator,
>> > I'm surprised to find that it's not always the case:
>> 
>> iirc the main performance reasons for slab over slub have mostly
>> disappeared, so in theory slab could be finally deprecated now.
>> 
>
> SLUB is a non-starter for us and incurs a >10% performance degradation in 
> netperf TCP_RR.

When did you last test? Our regressions had disappeared a few kernels
ago.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-11 Thread David Rientjes
On Thu, 11 Oct 2012, Andi Kleen wrote:

> > While I've always thought SLUB was the default and recommended allocator,
> > I'm surprised to find that it's not always the case:
> 
> iirc the main performance reasons for slab over slub have mostly
> disappeared, so in theory slab could be finally deprecated now.
> 

SLUB is a non-starter for us and incurs a >10% performance degradation in 
netperf TCP_RR.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Q] Default SLAB allocator

2012-10-11 Thread Andi Kleen
Ezequiel Garcia  writes:

> Hello,
>
> While I've always thought SLUB was the default and recommended allocator,
> I'm surprised to find that it's not always the case:

iirc the main performance reasons for slab over slub have mostly
disappeared, so in theory slab could be finally deprecated now.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Q] Default SLAB allocator

2012-10-11 Thread Ezequiel Garcia
Hello,

While I've always thought SLUB was the default and recommended allocator,
I'm surprised to find that it's not always the case:

$ find arch/*/configs -name "*defconfig" | wc -l
452

$ grep -r "SLOB=y" arch/*/configs/ | wc -l
11

$ grep -r "SLAB=y" arch/*/configs/ | wc -l
245

This shows that, SLUB being the default, there are actually more
defconfigs that choose SLAB.
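
(For reference, the choice is made at build time under "Choose SLAB
allocator" in the kernel's general setup menu, so exactly one of
CONFIG_SLAB, CONFIG_SLUB or CONFIG_SLOB ends up set per defconfig;
something like "grep 'CONFIG_SL[AOU]B=y' .config" shows what a given
build uses.)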

I wonder...

* Is SLAB a proper choice? Or is it just historical and never been re-evaluated?
* Does the average embedded guy know which allocator to choose
  and what's the impact on his platform?

Thanks,

   Ezequiel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

