Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Jörn Engel
On Thu, 10 January 2008 11:49:25 -0600, Matt Mackall wrote:
> 
> b) grouping objects of the same -type- (not size) together should mean
> they have similar lifetimes and thereby keep fragmentation low
> 
> (b) is known to be false, you just have to look at our dcache and icache
> pinning.

(b) is half-true, actually.  The grouping by lifetime makes a lot of
sense.  LogFS has a similar problem to slabs (only full segments are
useful, a single object can pin the segment).  And when I grouped my
objects very roughly by their life expectancy, the impact was *HUGE*!

In both cases, you want slabs/segments that are either close to 100%
full or close to 0% full.  It matters a lot when you have to move
objects around and I would bet it matters even more when you cannot move
objects and the slab just remains pinned.

So just because the type alone is a relatively bad heuristic for life
expectancy does not make the concept false.  Bonwick was onto something.
He just failed in picking a good heuristic.  Quite likely spreading by
type was even a bonus when slab was developed, because even such a crude
heuristic is slightly better than completely randomized lifetimes.

I've been meaning to split the dentry cache into 2-3 separate ones for a
while and kept spending my time elsewhere.  But I remain convinced that
this will make a measurable difference.

Jörn

-- 
Never argue with idiots - first they drag you down to their level,
then they beat you with experience.
-- unknown


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Christoph Lameter
On Thu, 10 Jan 2008, Andi Kleen wrote:

> I did essentially that for my GBpages hugetlbfs patchkit. GB pages are already
> beyond MAX_ORDER and increasing MAX_ORDER didn't seem attractive because
> it would require aligning the zones all to 1GB, which would be quite nasty.

I am very, very interested in that work and I could not find it when I 
looked for it a couple of weeks back. Can you send me a copy?



Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Christoph Lameter
On Thu, 10 Jan 2008, Matt Mackall wrote:

> Well, I think we'd still have the same page size, in the sense that we'd
> have a struct page for every hardware page and we'd still have hardware
> page-sized pages in the page cache. We'd just change how we allocated
> them. Right now we've got a stack that looks like:

We would not change the hardware page. Cannot do that. But we would have 
preferential treatment for 64k and 2M pages in the page allocator?

>  buddy / page allocator
>  SL*B allocator
>  kmalloc
> 
> And we'd change that to:
> 
>  buddy allocator
>  SL*B allocator
>  page allocator / kmalloc
> 
> So get_free_page() would still hand you back a hardware page, it would
> just do it through SL*B.

Hmm. Not sure what effect this would have. We already have the pcp's 
that have a similar effect.
 
> >  It would decrease listlock effect drastically for SLUB.
> 
> Not sure what you're referring to here.

Allocating in 64k chunks means 16 times fewer basic allocation blocks for 
the slab allocators to manage, so the metadata maintained by the 
allocators shrinks by the same factor. SLUB only needs to touch the 
list_lock (in some situations, like a free to a non-cpu slab) if a block 
becomes completely empty or goes from fully allocated to partially 
allocated. The larger the block size, the more objects there are in a 
block, and the fewer of these actions need a per-node lock.
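
To make the locking pattern concrete, a minimal sketch of the free path 
described above, with invented type and helper names rather than the real 
mm/slub.c internals: the per-node list_lock is taken only on the empty 
and full-to-partial transitions, so blocks holding more objects hit it 
proportionally less often.

	/* Sketch only: all names here are illustrative. */
	struct sketch_slab {
		struct sketch_node *node;	/* per-node data holding list_lock */
		unsigned int inuse;		/* objects currently allocated */
		unsigned int objects;		/* total objects per block */
	};

	static void sketch_free(struct sketch_slab *slab, void *object)
	{
		push_to_freelist(slab, object);	/* common case: no list_lock */

		if (--slab->inuse == 0) {
			/* block became completely empty */
			spin_lock(&slab->node->list_lock);
			remove_from_partial_list(slab);
			spin_unlock(&slab->node->list_lock);
			discard_block(slab);
		} else if (slab->inuse == slab->objects - 1) {
			/* fully allocated -> partially allocated */
			spin_lock(&slab->node->list_lock);
			add_to_partial_list(slab);
			spin_unlock(&slab->node->list_lock);
		}
	}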



Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Andi Kleen
>  - huge pages (superpages for those crazy db people)
> 
>Just a simple linked list of these things is fine, we'd never care 
>about coalescing large pages together anyway.

I did essentially that for my GBpages hugetlbfs patchkit. GB pages are already
beyond MAX_ORDER and increasing MAX_ORDER didn't seem attractive because
it would require aligning the zones all to 1GB, which would be quite nasty.

So it just grabbed them out of bootmem early and managed them in a 
per-node list.

Not sure it's a good idea for 2MB pages though.

-Andi


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Christoph Lameter
On Thu, 10 Jan 2008, Linus Torvalds wrote:

> It's not even clear that a buddy allocator even for the high-order pages 
> is at all the right choice. Almost nobody actually wants >64kB blocks, and 
> the ones that *do* want bigger allocations tend to want *much* bigger 
> ones, so it's quite possible that it could be worth it to have something 
> like a three-level allocator:

Excellent! I am definitely on board with this.

>  - huge pages (superpages for those crazy db people)
> 
>Just a simple linked list of these things is fine, we'd never care 
>about coalescing large pages together anyway.
> 
>  - "large pages" (on the order of ~64kB) - with *perhaps* a buddy bitmap 
>setup to try to coalesce back into huge-pages, but more likely just 
>admitting that you'd need something like migration to ever get back a 
>hugepage that got split into large-pages.
> 
>So maybe a simple bitmap allocator per huge-page for large pages. Say 
>you have a 4MB huge-page, and just a 64-bit free-bitmap per huge-page 
>when you split it into large pages.
> 
>  - slab/slub/slob for anything else, and "get_free_page()" ends up being 
>    just a shorthand for saying "naturally aligned kmalloc of size 
>    "PAGE_SIZE << order""
> and maybe it would all work out ok. 

Hmmm... a 3-level allocator? Basically we would have BASE_PAGE, 
STANDARD_PAGE, and HUGE_PAGE? We could simply extend the page allocator to 
have 3 pcp lists for these sizes and go from there?

Thinking about the arches this would mean

        BASE_PAGE   STANDARD_PAGE   HUGE_PAGE
x86_64  4k          64k             2M
i386    4k          16k             4M
ia64    16k         256k            1G

?



Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Matt Mackall

On Thu, 2008-01-10 at 11:24 -0800, Christoph Lameter wrote:
> On Thu, 10 Jan 2008, Matt Mackall wrote:
> 
> > One idea I've been kicking around is pushing the boundary for the buddy
> > allocator back a bit (to 64k, say) and using SL*B under that. The page
> > allocators would call into buddy for larger than 64k (rare!) and SL*B
> > otherwise. This would let us greatly improve our handling of things like
> > task structs and skbs and possibly also things like 8k stacks and jumbo
> > frames. As SL*B would never be competing with the page allocator for
> > contiguous pages (the buddy allocator's granularity would be 64k), I
> > don't think this would exacerbate the page-level fragmentation issues.
> 
> This would create another large page size (and that would have my 
> enthusiastic support).

Well, I think we'd still have the same page size, in the sense that we'd
have a struct page for every hardware page and we'd still have hardware
page-sized pages in the page cache. We'd just change how we allocated
them. Right now we've got a stack that looks like:

 buddy / page allocator
 SL*B allocator
 kmalloc

And we'd change that to:

 buddy allocator
 SL*B allocator
 page allocator / kmalloc

So get_free_page() would still hand you back a hardware page, it would
just do it through SL*B.
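
As a one-line sketch of that layering (assuming SL*B hands back naturally 
aligned power-of-two objects, which today's allocators do not guarantee):

	/* Sketch only: the page allocator becomes a thin wrapper over SL*B,
	 * which in turn carves its memory out of 64k buddy chunks. */
	unsigned long sketch_get_free_page(gfp_t gfp_mask)
	{
		return (unsigned long)kmalloc(PAGE_SIZE, gfp_mask);
	}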

>  It would decrease listlock effect drastically for SLUB.

Not sure what you're referring to here.

> However, isn't this basically confessing that the page allocator is not 
> efficient for 4k page allocations?

Well, I wasn't thinking of doing this for any performance reasons. But
there certainly could be some.
-- 
Mathematics is the supreme nostalgia of our time.



Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Linus Torvalds


On Thu, 10 Jan 2008, Matt Mackall wrote:
> 
> One idea I've been kicking around is pushing the boundary for the buddy
> allocator back a bit (to 64k, say) and using SL*B under that. The page
> allocators would call into buddy for larger than 64k (rare!) and SL*B
> otherwise. This would let us greatly improve our handling of things like
> task structs and skbs and possibly also things like 8k stacks and jumbo
> frames.

Yes, something like that may well be reasonable. It could possibly solve 
some of the issues for bigger page cache sizes too, but one issue is that 
many things actually end up having those power-of-two alignment 
constraints too - so an 8kB allocation would often still have to be 
naturally aligned, which then removes some of the freedom.

> Crazy?

It sounds like it might be worth trying out - there's just no way to know 
how well it would work. Buddy allocators sure as hell have problems too, 
no question about that. It's not like the page allocator is perfect.

It's not even clear that a buddy allocator even for the high-order pages 
is at all the right choice. Almost nobody actually wants >64kB blocks, and 
the ones that *do* want bigger allocations tend to want *much* bigger 
ones, so it's quite possible that it could be worth it to have something 
like a three-level allocator:

 - huge pages (superpages for those crazy db people)

   Just a simple linked list of these things is fine, we'd never care 
   about coalescing large pages together anyway.

 - "large pages" (on the order of ~64kB) - with *perhaps* a buddy bitmap 
   setup to try to coalesce back into huge-pages, but more likely just 
   admitting that you'd need something like migration to ever get back a 
   hugepage that got split into large-pages.

   So maybe a simple bitmap allocator per huge-page for large pages. Say 
   you have a 4MB huge-page, and just a 64-bit free-bitmap per huge-page 
   when you split it into large pages (see the sketch below).

 - slab/slub/slob for anything else, and "get_free_page()" ends up being 
   just a shorthand for saying "naturally aligned kmalloc of size 
   "PAGE_SIZE

Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Christoph Lameter
On Thu, 10 Jan 2008, Matt Mackall wrote:

> > I agree. Crap too. We removed the destructors. The constructors are needed 
> > so that objects in slab pages always have a definite state. That is f.e.
> > necessary for slab defragmentation because it has to be able to inspect an 
> > object at an arbitrary time and either remove it or move it to another 
> > slab page.
> 
> Are you saying that the state of -freed- objects matters for your active
> defragmentation? That's odd.

The state of the object immediately after it is allocated matters for a 
defrag solution. A kmalloc leads to an object in an undetermined state if 
you have no constructor. Code will then initialize the object, but defrag 
f.e. must be able to inspect the object before that. This means either that 
the freed object has a defined state or that kmalloc establishes that state 
before the object is marked as allocated.

> > Constructors also make sense because the initialization of a cache object 
> > may be expensive. Initializing list heads and spinlocks can take some code 
> > and that code can be omitted if objects have a definite state when they 
> > are free. We saw that when measuring the buffer_head constructors effect 
> > on performance.
> 
> Hmm. SLOB proves that you don't need to segregate objects based on
> constructors, so you could combine even slabs that have constructors and
> just delay construction until allocation. I'm surprised constructors
> have a measurable advantage..

That does not work if you need to inspect allocated objects at any time 
for a defrag solution. All objects in a defragmentable slab need to have a 
consistent object state when allocated. If you have some without 
constructors, then these objects have no defined state and may contain 
arbitrary bytes.



Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Christoph Lameter
On Thu, 10 Jan 2008, Matt Mackall wrote:

> One idea I've been kicking around is pushing the boundary for the buddy
> allocator back a bit (to 64k, say) and using SL*B under that. The page
> allocators would call into buddy for larger than 64k (rare!) and SL*B
> otherwise. This would let us greatly improve our handling of things like
> task structs and skbs and possibly also things like 8k stacks and jumbo
> frames. As SL*B would never be competing with the page allocator for
> contiguous pages (the buddy allocator's granularity would be 64k), I
> don't think this would exacerbate the page-level fragmentation issues.

This would create another large page size (and that would have my 
enthusiastic support). It would decrease listlock effect drastically for 
SLUB. Even the initial simplistic implementation of SLUB was superior on 
the database transaction tests (I think it was up ~1%) on IA64 from the 
get-go, likely due to the larger 16k page size there. The larger page size 
could also be used for the page cache (ducking and running). A 64k 
page size that could be allocated without zone locks would be very good 
for SLUB.

However, isn't this basically confessing that the page allocator is not 
efficient for 4k page allocations? I have seen some weaknesses there. The 
overhead in the allocation path in particular is bad, and I was thinking 
about applying the same ideas used in SLUB to the page allocator in order 
to bring the cycle count down from 500-1000 to 60 or so. Since both SLUB 
and SLOB use the page allocator for allocs >PAGE_SIZE, this would not only 
benefit the general kernel but also the slab allocations.


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Matt Mackall

On Thu, 2008-01-10 at 11:16 -0800, Christoph Lameter wrote:
> On Thu, 10 Jan 2008, Matt Mackall wrote:
> 
> > Here I'm going to differ with you. The premises of the SLAB concept
> > (from the original paper) are: 
> > 
> > a) fragmentation of conventional allocators gets worse over time
> 
> Even fragmentation of SLAB/SLUB gets worse over time. That is why we need 
> a defrag solution.
> 
> > b) grouping objects of the same -type- (not size) together should mean
> > they have similar lifetimes and thereby keep fragmentation low
> 
> I agree that is crap. The lifetimes argument is mostly only exploitable in 
> benchmarks. That is why SLUB just groups them by size if possible.
> 
> > d) constructors and destructors are cache-friendly
> 
> I agree. Crap too. We removed the destructors. The constructors are needed 
> so that objects in slab pages always have a definite state. That is f.e.
> necessary for slab defragmentation because it has to be able to inspect an 
> object at an arbitrary time and either remove it or move it to another 
> slab page.

Are you saying that the state of -freed- objects matters for your active
defragmentation? That's odd.

> Constructors also make sense because the initialization of a cache object 
> may be expensive. Initializing list heads and spinlocks can take some code 
> and that code can be omitted if objects have a definite state when they 
> are free. We saw that when measuring the buffer_head constructors effect 
> on performance.

Hmm. SLOB proves that you don't need to segregate objects based on
constructors, so you could combine even slabs that have constructors and
just delay construction until allocation. I'm surprised constructors
have a measurable advantage..

-- 
Mathematics is the supreme nostalgia of our time.



Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Christoph Lameter
On Thu, 10 Jan 2008, Matt Mackall wrote:

> Here I'm going to differ with you. The premises of the SLAB concept
> (from the original paper) are: 
> 
> a) fragmentation of conventional allocators gets worse over time

Even fragmentation of SLAB/SLUB gets worse over time. That is why we need 
a defrag solution.

> b) grouping objects of the same -type- (not size) together should mean
> they have similar lifetimes and thereby keep fragmentation low

I agree that is crap. The lifetimes argument is mostly only exploitable in 
benchmarks. That is why SLUB just groups them by size if possible.

> d) constructors and destructors are cache-friendly

I agree. Crap too. We removed the destructors. The constructors are needed 
so that objects in slab pages always have a definite state. That is f.e.
necessary for slab defragmentation because it has to be able to inspect an 
object at an arbitrary time and either remove it or move it to another 
slab page.

Constructors also make sense because the initialization of a cache object 
may be expensive. Initializing list heads and spinlocks can take some code 
and that code can be omitted if objects have a definite state when they 
are free. We saw that when measuring the buffer_head constructor's effect 
on performance.
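
For illustration, a sketch of such a constructor (invented type; the 
one-argument ctor signature is a simplification of the actual 
kmem_cache_create() interface): the ctor runs when a slab page is 
populated, so free objects always carry initialized list heads and locks, 
allocation can skip that work, and a defrag pass can safely inspect any 
object in the page.

	struct example_object {
		struct list_head lru;
		spinlock_t lock;
	};

	static void example_ctor(void *p)
	{
		struct example_object *obj = p;

		INIT_LIST_HEAD(&obj->lru);
		spin_lock_init(&obj->lock);
	}

	/* cache = kmem_cache_create("example", sizeof(struct example_object),
	 *			      0, 0, example_ctor);
	 * Freeing must return objects to this constructed state for the
	 * invariant to hold. */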


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Matt Mackall

On Thu, 2008-01-10 at 10:28 -0800, Linus Torvalds wrote:
> 
> On Thu, 10 Jan 2008, Matt Mackall wrote:
> > > 
> > > (I'm not a fan of slabs per se - I think all the constructor/destructor 
> > > crap is just that: total crap - but the size/type binning is a big deal, 
> > > and I think SLOB was naïve to think a pure first-fit makes any sense. Now 
> > > you guys are size-binning by just two or three bins, and it seems to make 
> > > a difference for some loads, but compared to SLUB/SLAB it's a total hack).
> > 
> > Here I'm going to differ with you. The premises of the SLAB concept
> > (from the original paper) are: 
> 
> I really don't think we differ.
> 
> The advantage of slab was largely the binning by type. Everything else was 
> just a big crock. SLUB does the binning better, by really just making the 
> type binning be about what really matters - the *size* of the type.
> 
> So my argument was that the type/size binning makes sense (size more so 
> than type), but the rest of the original Sun arguments for why slab was 
> such a great idea were basically just the crap.
> 
> Hard type binning was a mistake (but needed by slab due to the idiotic 
> notion that constructors/destructors are "good for caches" - bleargh). I 
> suspect that hard size binning is a mistake too (ie there are probably 
> cases where you do want to split unused bigger size areas), but the fact 
> that all of our allocators are two-level (with the page allocator acting 
> as a size-agnostic free space) may help it somewhat.
> 
> And yes, I do agree that any current allocator has problems with the big 
> sizes that don't fit well into a page or two (like task_struct). That 
> said, most of those don't have lots of allocations under many normal 
> circumstances (even if there are uses that will really blow them up).
> 
> The *big* slab users at least for me tend to be ext3_inode_cache and 
> dentry. Everything else is orders of magnitude less. And of the two bad 
> ones, ext3_inode_cache is the bad one at 700+ bytes or whatever (resulting 
> in ~10% fragmentation just due to the page thing, regardless of whether 
> you use an order-0 or order-1 page allocation).
> 
> Of course, dentries fit better in a page (due to being smaller), but then 
> the bigger number of dentries per page make it harder to actually free 
> pages, so then you get fragmentation from that. Oh well. You can't win.

One idea I've been kicking around is pushing the boundary for the buddy
allocator back a bit (to 64k, say) and using SL*B under that. The page
allocators would call into buddy for larger than 64k (rare!) and SL*B
otherwise. This would let us greatly improve our handling of things like
task structs and skbs and possibly also things like 8k stacks and jumbo
frames. As SL*B would never be competing with the page allocator for
contiguous pages (the buddy allocator's granularity would be 64k), I
don't think this would exacerbate the page-level fragmentation issues.

Crazy?

-- 
Mathematics is the supreme nostalgia of our time.



Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Linus Torvalds


On Thu, 10 Jan 2008, Matt Mackall wrote:
> > 
> > (I'm not a fan of slabs per se - I think all the constructor/destructor 
> > crap is just that: total crap - but the size/type binning is a big deal, 
> > and I think SLOB was naïve to think a pure first-fit makes any sense. Now 
> > you guys are size-binning by just two or three bins, and it seems to make 
> > a difference for some loads, but compared to SLUB/SLAB it's a total hack).
> 
> Here I'm going to differ with you. The premises of the SLAB concept
> (from the original paper) are: 

I really don't think we differ.

The advantage of slab was largely the binning by type. Everything else was 
just a big crock. SLUB does the binning better, by really just making the 
type binning be about what really matters - the *size* of the type.

So my argument was that the type/size binning makes sense (size more so 
than type), but the rest of the original Sun arguments for why slab was 
such a great idea were basically just the crap.

Hard type binning was a mistake (but needed by slab due to the idiotic 
notion that constructors/destructors are "good for caches" - bleargh). I 
suspect that hard size binning is a mistake too (ie there are probably 
cases where you do want to split unused bigger size areas), but the fact 
that all of our allocators are two-level (with the page allocator acting 
as a size-agnostic free space) may help it somewhat.

And yes, I do agree that any current allocator has problems with the big 
sizes that don't fit well into a page or two (like task_struct). That 
said, most of those don't have lots of allocations under many normal 
circumstances (even if there are uses that will really blow them up).

The *big* slab users at least for me tend to be ext3_inode_cache and 
dentry. Everything else is orders of magnitude less. And of the two bad 
ones, ext3_inode_cache is the bad one at 700+ bytes or whatever (resulting 
in ~10% fragmentation just due to the page thing, regardless of whether 
you use an order-0 or order-1 page allocation).

Of course, dentries fit better in a page (due to being smaller), but then 
the bigger number of dentries per page make it harder to actually free 
pages, so then you get fragmentation from that. Oh well. You can't win.

Linus


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Andi Kleen
> I would suggest that if you guys are really serious about memory use, try 
> to do a size-based heap thing, and do best-fit in that heap. Or just some 

IIRC best-fit usually also has some nasty long-term fragmentation 
behaviour. That is why it is usually not used.

-Andi


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Matt Mackall

On Thu, 2008-01-10 at 08:13 -0800, Linus Torvalds wrote:
> 
> On Thu, 10 Jan 2008, Pekka J Enberg wrote:
> > 
> > We probably don't have the same version of GCC, which perhaps affects 
> > memory layout (struct sizes) and thus allocation patterns?
> 
> No, struct sizes will not change with compiler versions - that would 
> create binary incompatibilities for libraries etc.
> 
> So apart from the kernel itself working around some old gcc bugs by making 
> spinlocks have different size depending on the compiler version, sizes of 
> structures should be the same (as long as the configuration is the same, 
> of course).
> 
> However, a greedy first-fit allocator will be *very* sensitive to 
> allocation pattern differences, so timing will probably make a big 
> difference. In contrast, SLUB and SLAB both use fixed sizes per allocation 
> queue, which makes them almost totally impervious to any allocation 
> pattern from different allocation sizes (they still end up caring about 
> the pattern *within* one size, but those tend to be much stabler).
> 
> There really is a reason why traditional heaps with first-fit are almost 
> never used for any real loads.
> 
> (I'm not a fan of slabs per se - I think all the constructor/destructor 
> crap is just that: total crap - but the size/type binning is a big deal, 
> and I think SLOB was naïve to think a pure first-fit makes any sense. Now 
> you guys are size-binning by just two or three bins, and it seems to make 
> a difference for some loads, but compared to SLUB/SLAB it's a total hack).

Here I'm going to differ with you. The premises of the SLAB concept
(from the original paper) are: 

a) fragmentation of conventional allocators gets worse over time
b) grouping objects of the same -type- (not size) together should mean
they have similar lifetimes and thereby keep fragmentation low
c) slabs can be O(1)
d) constructors and destructors are cache-friendly

There's some truth to (a), but the problem has been quite overstated; 
pre-SLAB Linux kernels and countless other systems ran for years. And 
(b) is known to be false; you just have to look at our dcache and icache 
pinning. (c) of course is a whole separate argument and often trumps the 
others. And (d) is pretty much crap now too.

And as it happens, SLOB basically always beats SLAB on size.

SLUB only wins when it starts merging caches of different -types- based
on -size-, effectively throwing out the whole (b) concept. Which is
good, because it's wrong. So SLUB starts to look a lot like a purely
size-binned allocator.

> I would suggest that if you guys are really serious about memory use, try 
> to do a size-based heap thing, and do best-fit in that heap. Or just some 
> really simple size-based binning, eg
> 
>	if (size > 2*PAGE_SIZE)
>		goto page_allocator;

I think both SLOB and SLUB punt everything >= PAGE_SIZE off to the page
allocator.

>	bin = lookup_bin[(size+31) >> 5]
> 
> or whatever. Because first-fit is *known* to be bad.

It is known to be crummy, but it is NOT known to actually be worse than
the SLAB/SLUB approach when you consider both internal and external
fragmentation. Power-of-two ala SLAB kmalloc basically guarantees your
internal fragmentation will be 30% or so (a 600-byte object, say, has to
occupy a 1024-byte slot).

In fact, first-fit is known to be pretty good if the ratio of object
size to arena size is reasonable. I've already shown a system booting
with 6% overhead compared to SLUB's 35% (and SLAB's nearly 70%). The
fragmentation measurements for the small object list are in fact quite
nice. Not the best benchmark, but it certainly shows that there's
substantial room for improvement.

Where it hurts is larger objects (task structs, SKBs), which are also a
problem for SLAB/SLUB and I don't think any variation on the search
order is going to help there. If we threw 64k pages at it, those
problems might very well disappear. Hell, it's pretty tempting to throw
vmalloc at it, especially on small boxes..

> And try to change it to address-ordered first-fit or something (which is 
> much more complex than just plain LIFO, but hey, that's life).

Will think about that. Most of the literature here is of limited
usefulness - even Knuth didn't look at 4k heaps, let alone collections
of them.

> I haven't checked much, but I *think* SLOB is just basic first-fit 
> (perhaps the "next-fit" variation?) Next-fit is known to be EVEN WORSE 
> than the simple first-fit when it comes to fragmentation (so no, Knuth was 
> not always right - let's face it, much of Knuth is simply outdated).

The SLOB allocator is inherently two-level. I'm using a next-fit-like
approach to decide which heap (page) to use and first-fit within that
heap.
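
Roughly like this sketch, with invented names (mm/slob.c is the real 
thing):

	/* Sketch only: next-fit across pages, first-fit inside a page. */
	static struct heap_page *cursor;	/* page that last satisfied an alloc */

	static void *sketch_alloc(size_t size)
	{
		struct heap_page *h = cursor;
		int i;

		for (i = 0; i < nr_heap_pages; i++, h = next_heap_page(h)) {
			void *obj = first_fit_in_page(h, size);	/* scan free blocks */
			if (obj) {
				cursor = h;	/* resume here on the next alloc */
				return obj;
			}
		}
		return first_fit_in_page(grow_new_page(), size);	/* all pages full */
	}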

-- 
Mathematics is the supreme nostalgia of our time.



Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Linus Torvalds


On Thu, 10 Jan 2008, Pekka J Enberg wrote:
> 
> We probably don't have the same version of GCC, which perhaps affects 
> memory layout (struct sizes) and thus allocation patterns?

No, struct sizes will not change with compiler versions - that would 
create binary incompatibilities for libraries etc.

So apart from the kernel itself working around some old gcc bugs by making 
spinlocks have different size depending on the compiler version, sizes of 
structures should be the same (as long as the configuration is the same, 
of course).

However, a greedy first-fit allocator will be *very* sensitive to 
allocation pattern differences, so timing will probably make a big 
difference. In contrast, SLUB and SLAB both use fixed sizes per allocation 
queue, which makes them almost totally impervious to any allocation 
pattern from different allocation sizes (they still end up caring about 
the pattern *within* one size, but those tend to be much stabler).

There really is a reason why traditional heaps with first-fit are almost 
never used for any real loads.

(I'm not a fan of slabs per se - I think all the constructor/destructor 
crap is just that: total crap - but the size/type binning is a big deal, 
and I think SLOB was naïve to think a pure first-fit makes any sense. Now 
you guys are size-binning by just two or three bins, and it seems to make 
a difference for some loads, but compared to SLUB/SLAB it's a total hack).

I would suggest that if you guys are really serious about memory use, try 
to do a size-based heap thing, and do best-fit in that heap. Or just some 
really simple size-based binning, eg

	if (size > 2*PAGE_SIZE)
		goto page_allocator;
	bin = lookup_bin[(size+31) >> 5];

or whatever. Because first-fit is *known* to be bad.
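
Fleshed out slightly as a sketch (the lookup table and free-list helpers 
here are invented): 32-byte bin granularity up to 2*PAGE_SIZE, each bin a 
free list of equal-sized chunks.

	#define MAX_BINNED	(2 * PAGE_SIZE)
	#define NR_BINS		((MAX_BINNED >> 5) + 1)	/* 32-byte granularity */

	static unsigned int lookup_bin[NR_BINS];	/* rounded size -> bin, set up at init */
	static struct free_list bins[NR_BINS];

	static void *binned_alloc(size_t size)
	{
		unsigned int bin;

		if (size > MAX_BINNED)
			return page_allocator_alloc(size);	/* rare large case */

		bin = lookup_bin[(size + 31) >> 5];
		return take_first(&bins[bin]);	/* refill from a fresh page when empty */
	}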

And try to change it to address-ordered first-fit or something (which is 
much more complex than just plain LIFO, but hey, that's life).

I haven't checked much, but I *think* SLOB is just basic first-fit 
(perhaps the "next-fit" variation?) Next-fit is known to be EVEN WORSE 
than the simple first-fit when it comes to fragmentation (so no, Knuth was 
not always right - let's face it, much of Knuth is simply outdated).

Linus


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Matt Mackall

On Thu, 2008-01-10 at 12:54 +0200, Pekka J Enberg wrote:
> Hi Matt,
> 
> On Thu, 10 Jan 2008, Pekka J Enberg wrote:
> > I'll double check the results for SLUB next but it seems obvious that your 
> > patches are a net gain for SLOB and should be applied. One problem though 
> > with SLOB seems to be that its memory efficiency is not so stable. Any 
> > ideas why that is?

We're seeing different numbers in each allocator, indicating that the
ordering of allocations is not stable. When fragmentation occurs, it
magnifies the underlying instability. On my config, where the split list
combats fragmentation extremely effectively, the stability is quite
good.

Perhaps I'll add a printk to allocs and generate some size histograms.
lguest + grep = relayfs for dummies.

> Ok, I did that. The numbers are stable and reproducible. In fact, the 
> average for SLUB is within 1 KB of the previous numbers. So, we have the 
> same .config, the same userspace, and the same hypervisor, so what's the 
> difference here?

I've got:

gcc version 4.2.3 20080102 (prerelease) (Debian 4.2.2-5)
BusyBox v1.2.1 (2006.11.01-11:21+) Built-in shell (ash)

-- 
Mathematics is the supreme nostalgia of our time.



Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Pekka J Enberg
Hi Matt,

On Thu, 10 Jan 2008, Pekka J Enberg wrote:
> I'll double check the results for SLUB next but it seems obvious that your 
> patches are a net gain for SLOB and should be applied. One problem though 
> with SLOB seems to be that its memory efficiency is not so stable. Any 
> ideas why that is?

Ok, I did that. The numbers are stable and reproducible. In fact, the 
average for SLUB is within 1 KB of the previous numbers. So, we have the 
same .config, the same userspace, and the same hypervisor, so what's the 
difference here?

We probably don't have the same version of GCC, which perhaps affects 
memory layout (struct sizes) and thus allocation patterns? I have included 
ver_linux from my laptop here:

Linux haji 2.6.24-rc6 #21 SMP Thu Jan 10 12:30:59 EET 2008 i686 GNU/Linux
 
Gnu C  4.1.3
Gnu make   3.81
binutils   2.18
util-linux 2.13
mount  2.13
module-init-tools  3.3-pre2
e2fsprogs  1.40.2
reiserfsprogs  3.6.19
pcmciautils014
PPP2.4.4
Linux C Library2.6.1
Dynamic linker (ldd)   2.6.1
Procps 3.2.7
Net-tools  1.60
Console-tools  0.2.3
Sh-utils   5.97
udev   113
wireless-tools 29
Modules Loaded af_packet ipv6 binfmt_misc rfcomm l2cap uinput radeon 
drm acpi_cpufreq cpufreq_userspace cpufreq_stats cpufreq_powersave 
cpufreq_ondemand freq_table cpufreq_conservative dock container joydev 
snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm 
snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq 
snd_timer snd_seq_device pcmcia psmouse snd ipw2200 serio_raw pcspkr hci_usb 
soundcore snd_page_alloc bluetooth ieee80211 ieee80211_crypt yenta_socket 
rsrc_nonstatic pcmcia_core iTCO_wdt iTCO_vendor_support video output shpchp 
pci_hotplug intel_agp button agpgart thinkpad_acpi nvram evdev sg sd_mod 
ata_piix floppy ata_generic libata scsi_mod e1000 ehci_hcd uhci_hcd usbcore 
thermal processor fan fuse

Pekka


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Pekka J Enberg
On Wed, 9 Jan 2008, Matt Mackall wrote:
> 
> slob: split free list by size
> 

[snip]

Reviewed-by: Pekka Enberg <[EMAIL PROTECTED]>


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Pekka J Enberg
Hi Matt,

On Wed, 9 Jan 2008, Matt Mackall wrote:
> Huh, that's a fairly negligible change on your system. Is that with or
> without the earlier patch? That doesn't appear to change much here.
> Guess I'll have to clean up my stats patch and send it to you.

Ok, if I apply both of the patches, I get better results for SLOB:

[ the minimum, maximum, and average are captured from 10 individual runs ]

                        Total    Free (kB)                   Used (kB)
                        (kB)     min     max     average     min    max    average
  SLUB (no debug)       26536    23868   23892   23877.6     2644   2668   2658.4
  SLOB (both patches)   26548    23612   23860   23766.4     2688   2936   2781.6
  SLOB (two lists)      26548    23456   23708   23603.2     2840   3092   2944.8
  SLOB (vanilla)        26548    23472   23640   23579.6     2908   3076   2968.4
  SLAB (no debug)       26544    23316   23364   23343.2     3180   3228   3200.8
  SLOB (merge fix)      26548    23260   23728   23385.2     2820   3288   3162.8
  SLUB (with debug)     26484    23120   23136   23127.2     3348   3364   3356.8

I'll double check the results for SLUB next but it seems obvious that your 
patches are a net gain for SLOB and should be applied. One problem though 
with SLOB seems to be that its memory efficiency is not so stable. Any 
ideas why that is?

Pekka
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Pekka J Enberg
Hi Matt,

On Wed, 9 Jan 2008, Matt Mackall wrote:
 Huh, that's a fairly negligible change on your system. Is that with or
 without the earlier patch? That doesn't appear to change much here.
 Guess I'll have to clean up my stats patch and send it to you.

Ok, if I apply both of the patches, I get better results for SLOB:

[ the minimum, maximum, and average are captured from 10 individual runs ]

Total   Free (kB) Used (kB)
(kB)min   max   average   min  max  average
  SLUB (no debug)   26536   23868 23892 23877.6   2644 2668 2658.4
  SLOB (both patches)   26548   23612 23860 23766.4   2688 2936 2781.6
  SLOB (two lists)  26548   23456 23708 23603.2   2840 3092 2944.8
  SLOB (vanilla)26548   23472 23640 23579.6   2908 3076 2968.4
  SLAB (no debug)   26544   23316 23364 23343.2   3180 3228 3200.8
  SLOB (merge fix)  26548   23260 23728 23385.2   2820 3288 3162.8
  SLUB (with debug) 26484   23120 23136 23127.2   3348 3364 3356.8

I'll double check the results for SLUB next but it seems obvious that your 
patches are a net gain for SLOB and should be applied. One problem though 
with SLOB seems to be that its memory efficiency is not so stable. Any 
ideas why that is?

Pekka
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Pekka J Enberg
On Wed, 9 Jan 2008, Matt Mackall wrote:
 
 slob: split free list by size
 

[snip]

Reviewed-by: Pekka Enberg [EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Pekka J Enberg
Hi Matt,

On Thu, 10 Jan 2008, Pekka J Enberg wrote:
 I'll double check the results for SLUB next but it seems obvious that your 
 patches are a net gain for SLOB and should be applied. One problem though 
 with SLOB seems to be that its memory efficiency is not so stable. Any 
 ideas why that is?

Ok, I did that. The number are stable and reproducible. In fact, the 
average for SLUB is within 1 KB of the previous numbers. So, we have the 
same .config, the same userspace, and the same hypervisor, so what's the 
difference here?

We probably don't have the same version of GCC which perhaps affects 
memory layout (struct sizes) and thus allocation patterns? I have included 
ver_linux from my laptop here:

Linux haji 2.6.24-rc6 #21 SMP Thu Jan 10 12:30:59 EET 2008 i686 GNU/Linux
 
Gnu C  4.1.3
Gnu make   3.81
binutils   2.18
util-linux 2.13
mount  2.13
module-init-tools  3.3-pre2
e2fsprogs  1.40.2
reiserfsprogs  3.6.19
pcmciautils014
PPP2.4.4
Linux C Library2.6.1
Dynamic linker (ldd)   2.6.1
Procps 3.2.7
Net-tools  1.60
Console-tools  0.2.3
Sh-utils   5.97
udev   113
wireless-tools 29
Modules Loaded af_packet ipv6 binfmt_misc rfcomm l2cap uinput radeon 
drm acpi_cpufreq cpufreq_userspace cpufreq_stats cpufreq_powersave 
cpufreq_ondemand freq_table cpufreq_conservative dock container joydev 
snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm 
snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq 
snd_timer snd_seq_device pcmcia psmouse snd ipw2200 serio_raw pcspkr hci_usb 
soundcore snd_page_alloc bluetooth ieee80211 ieee80211_crypt yenta_socket 
rsrc_nonstatic pcmcia_core iTCO_wdt iTCO_vendor_support video output shpchp 
pci_hotplug intel_agp button agpgart thinkpad_acpi nvram evdev sg sd_mod 
ata_piix floppy ata_generic libata scsi_mod e1000 ehci_hcd uhci_hcd usbcore 
thermal processor fan fuse

Pekka
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Matt Mackall

On Thu, 2008-01-10 at 12:54 +0200, Pekka J Enberg wrote:
 Hi Matt,
 
 On Thu, 10 Jan 2008, Pekka J Enberg wrote:
  I'll double check the results for SLUB next but it seems obvious that your 
  patches are a net gain for SLOB and should be applied. One problem though 
  with SLOB seems to be that its memory efficiency is not so stable. Any 
  ideas why that is?

We're seeing different numbers in each allocator indicating that the
ordering of allocations is not stable. When fragmentation occurs, it
magnifies the underlying instability. On my config, where the split list
combats fragmentation extremely effectively, the stability is quite
good.

Perhaps I'll add a printk to allocs and generate some size histograms.
lguest + grep = relayfs for dummies.

 Ok, I did that. The number are stable and reproducible. In fact, the 
 average for SLUB is within 1 KB of the previous numbers. So, we have the 
 same .config, the same userspace, and the same hypervisor, so what's the 
 difference here?

I've got:

gcc version 4.2.3 20080102 (prerelease) (Debian 4.2.2-5)
BusyBox v1.2.1 (2006.11.01-11:21+) Built-in shell (ash)

-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Linus Torvalds


On Thu, 10 Jan 2008, Pekka J Enberg wrote:
 
 We probably don't have the same version of GCC which perhaps affects 
 memory layout (struct sizes) and thus allocation patterns?

No, struct sizes will not change with compiler versions - that would 
create binary incompatibilities for libraries etc.

So apart from the kernel itself working around some old gcc bugs by making 
spinlocks have different size depending on the compiler version, sizes of 
structures should be the same (as long as the configuration is the same, 
of course).

However, a greedy first-fit allocator will be *very* sensitive to 
allocation pattern differences, so timing will probably make a big 
difference. In contrast, SLUB and SLAB both use fixed sizes per allocation 
queue, which makes them almost totally impervious to any allocation 
pattern from different allocation sizes (they still end up caring about 
the pattern *within* one size, but those tend to be much stabler).

There really is a reason why traditional heaps with first-fit are almost 
never used for any real loads.

(I'm not a fan of slabs per se - I think all the constructor/destructor 
crap is just that: total crap - but the size/type binning is a big deal, 
and I think SLOB was naïve to think a pure first-fit makes any sense. Now 
you guys are size-binning by just two or three bins, and it seems to make 
a difference for some loads, but compared to SLUB/SLAB it's a total hack).

I would suggest that if you guys are really serious about memory use, try 
to do a size-based heap thing, and do best-fit in that heap. Or just some 
really simple size-based binning, eg

if (size  2*PAGE_SIZE)
goto page_allocator;
bin = lookup_bin[(size+31)  5];

or whatever. Because first-fit is *known* to be bad.

At try to change it to address-ordered first-fit or something (which is 
much more complex than just plain LIFO, but hey, that's life).

I haven't checked much, but I *think* SLOB is just basic first-fit 
(perhaps the next-fit variation?) Next-fit is known to be EVEN WORSE 
than the simple first-fit when it comes to fragmentation (so no, Knuth was 
not always right - let's face it, much of Knuth is simply outdated).

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Matt Mackall

On Thu, 2008-01-10 at 08:13 -0800, Linus Torvalds wrote:
 
 On Thu, 10 Jan 2008, Pekka J Enberg wrote:
  
  We probably don't have the same version of GCC which perhaps affects 
  memory layout (struct sizes) and thus allocation patterns?
 
 No, struct sizes will not change with compiler versions - that would 
 create binary incompatibilities for libraries etc.
 
 So apart from the kernel itself working around some old gcc bugs by making 
 spinlocks have different size depending on the compiler version, sizes of 
 structures should be the same (as long as the configuration is the same, 
 of course).
 
 However, a greedy first-fit allocator will be *very* sensitive to 
 allocation pattern differences, so timing will probably make a big 
 difference. In contrast, SLUB and SLAB both use fixed sizes per allocation 
 queue, which makes them almost totally impervious to any allocation 
 pattern from different allocation sizes (they still end up caring about 
 the pattern *within* one size, but those tend to be much stabler).
 
 There really is a reason why traditional heaps with first-fit are almost 
 never used for any real loads.
 
 (I'm not a fan of slabs per se - I think all the constructor/destructor 
 crap is just that: total crap - but the size/type binning is a big deal, 
 and I think SLOB was naïve to think a pure first-fit makes any sense. Now 
 you guys are size-binning by just two or three bins, and it seems to make 
 a difference for some loads, but compared to SLUB/SLAB it's a total hack).

Here I'm going to differ with you. The premises of the SLAB concept
(from the original paper) are: 

a) fragmentation of conventional allocators gets worse over time
b) grouping objects of the same -type- (not size) together should mean
they have similar lifetimes and thereby keep fragmentation low
c) slabs can be O(1)
d) constructors and destructors are cache-friendly

There's some truth to (a), but the problem has been quite overstated,
pre-SLAB Linux kernels and countless other systems run for years. And
(b) is known to be false, you just have to look at our dcache and icache
pinning. (c) of course is a whole separate argument and often trumps the
others. And (d) is pretty much crap now too.

And as it happens, SLOB basically always beats SLAB on size.

SLUB only wins when it starts merging caches of different -types- based
on -size-, effectively throwing out the whole (b) concept. Which is
good, because its wrong. So SLUB starts to look a lot like a purely
size-binned allocator.

 I would suggest that if you guys are really serious about memory use, try 
 to do a size-based heap thing, and do best-fit in that heap. Or just some 
 really simple size-based binning, eg
 
   if (size  2*PAGE_SIZE)
   goto page_allocator;

I think both SLOB and SLUB punt everything = PAGE_SIZE off to the page
allocator.

   bin = lookup_bin[(size+31)  5]
 
 or whatever. Because first-fit is *known* to be bad.

It is known to be crummy, but it is NOT known to actually be worse than
the SLAB/SLUB approach when you consider both internal and external
fragmentation. Power-of-two ala SLAB kmalloc basically guarantees your
internal fragmentation will be 30% or so.

In fact, first-fit is known to be pretty good if the ratio of object
size to arena size is reasonable. I've already shown a system booting
with 6% overhead compared to SLUB's 35% (and SLAB's nearly 70%). The
fragmentation measurements for the small object list are in fact quite
nice. Not the best benchmark, but it certainly shows that there's
substantial room for improvement.

Where it hurts is larger objects (task structs, SKBs), which are also a
problem for SLAB/SLUB and I don't think any variation on the search
order is going to help there. If we threw 64k pages at it, those
problems might very well disappear. Hell, it's pretty tempting to throw
vmalloc at it, especially on small boxes..

 At try to change it to address-ordered first-fit or something (which is 
 much more complex than just plain LIFO, but hey, that's life).

Will think about that. Most of the literature here is of limited
usefulness - even Knuth didn't look at 4k heaps, let alone collections
of them.

 I haven't checked much, but I *think* SLOB is just basic first-fit 
 (perhaps the next-fit variation?) Next-fit is known to be EVEN WORSE 
 than the simple first-fit when it comes to fragmentation (so no, Knuth was 
 not always right - let's face it, much of Knuth is simply outdated).

The SLOB allocator is inherently two-level. I'm using a next-fit-like
approach to decide which heap (page) to use and first-fit within that
heap.

-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Andi Kleen
 I would suggest that if you guys are really serious about memory use, try 
 to do a size-based heap thing, and do best-fit in that heap. Or just some 

iirc best fit usually also has some nasty long term fragmentation behaviour.
That is why it is usually also not used.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Linus Torvalds


On Thu, 10 Jan 2008, Matt Mackall wrote:
  
  (I'm not a fan of slabs per se - I think all the constructor/destructor 
  crap is just that: total crap - but the size/type binning is a big deal, 
  and I think SLOB was naïve to think a pure first-fit makes any sense. Now 
  you guys are size-binning by just two or three bins, and it seems to make 
  a difference for some loads, but compared to SLUB/SLAB it's a total hack).
 
 Here I'm going to differ with you. The premises of the SLAB concept
 (from the original paper) are: 

I really don't think we differ.

The advantage of slab was largely the binning by type. Everything else was 
just a big crock. SLUB does the binning better, by really just making the 
type binning be about what really matters - the *size* of the type.

So my argument was that the type/size binning makes sense (size more so 
than type), but the rest of the original Sun arguments for why slab was 
such a great idea were basically just the crap.

Hard type binning was a mistake (but needed by slab due to the idiotic 
notion that constructors/destructors are good for caches - bleargh). I 
suspect that hard size binning is a mistake too (ie there are probably 
cases where you do want to split unused bigger size areas), but the fact 
that all of our allocators are two-level (with the page allocator acting 
as a size-agnostic free space) may help it somewhat.

And yes, I do agree that any current allocator has problems with the big 
sizes that don't fit well into a page or two (like task_struct). That 
said, most of those don't have lots of allocations under many normal 
circumstances (even if there are uses that will really blow them up).

The *big* slab users at least for me tend to be ext3_inode_cache and 
dentry. Everything else is orders of magnitude less. And of the two bad 
ones, ext3_inode_cache is the bad one at 700+ bytes or whatever (resulting 
in ~10% fragmentation just due to the page thing, regardless of whether 
you use an order-0 or order-1 page allocation).

Of course, dentries fit better in a page (due to being smaller), but then 
the bigger number of dentries per page make it harder to actually free 
pages, so then you get fragmentation from that. Oh well. You can't win.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Matt Mackall

On Thu, 2008-01-10 at 10:28 -0800, Linus Torvalds wrote:
 
> On Thu, 10 Jan 2008, Matt Mackall wrote:
> > 
> > > (I'm not a fan of slabs per se - I think all the constructor/destructor 
> > > crap is just that: total crap - but the size/type binning is a big deal, 
> > > and I think SLOB was naïve to think a pure first-fit makes any sense. Now 
> > > you guys are size-binning by just two or three bins, and it seems to make 
> > > a difference for some loads, but compared to SLUB/SLAB it's a total hack).
> > 
> > Here I'm going to differ with you. The premises of the SLAB concept
> > (from the original paper) are: 
> 
> I really don't think we differ.
> 
> The advantage of slab was largely the binning by type. Everything else was 
> just a big crock. SLUB does the binning better, by really just making the 
> type binning be about what really matters - the *size* of the type.
> 
> So my argument was that the type/size binning makes sense (size more so 
> than type), but the rest of the original Sun arguments for why slab was 
> such a great idea were basically just the crap.
> 
> Hard type binning was a mistake (but needed by slab due to the idiotic 
> notion that constructors/destructors are good for caches - bleargh). I 
> suspect that hard size binning is a mistake too (ie there are probably 
> cases where you do want to split unused bigger size areas), but the fact 
> that all of our allocators are two-level (with the page allocator acting 
> as a size-agnostic free space) may help it somewhat.
> 
> And yes, I do agree that any current allocator has problems with the big 
> sizes that don't fit well into a page or two (like task_struct). That 
> said, most of those don't have lots of allocations under many normal 
> circumstances (even if there are uses that will really blow them up).
> 
> The *big* slab users at least for me tend to be ext3_inode_cache and 
> dentry. Everything else is orders of magnitude less. And of the two bad 
> ones, ext3_inode_cache is the bad one at 700+ bytes or whatever (resulting 
> in ~10% fragmentation just due to the page thing, regardless of whether 
> you use an order-0 or order-1 page allocation).
> 
> Of course, dentries fit better in a page (due to being smaller), but then 
> the bigger number of dentries per page makes it harder to actually free 
> pages, so then you get fragmentation from that. Oh well. You can't win.

One idea I've been kicking around is pushing the boundary for the buddy
allocator back a bit (to 64k, say) and using SL*B under that. The page
allocators would call into buddy for larger than 64k (rare!) and SL*B
otherwise. This would let us greatly improve our handling of things like
task structs and skbs and possibly also things like 8k stacks and jumbo
frames. As SL*B would never be competing with the page allocator for
contiguous pages (the buddy allocator's granularity would be 64k), I
don't think this would exacerbate the page-level fragmentation issues.

Crazy?
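
In code terms, the proposal might look something like this sketch
(buddy_alloc() and slab_alloc() are made-up names to make the layering
concrete; this is not from any patch):

/* Sketch only: buddy manages nothing smaller than 64k; everything
 * below that, including single hardware pages, comes from SL*B. */
#define BUDDY_MIN_ORDER 4		/* 64k with 4k hardware pages */

static void *generic_alloc(size_t size, gfp_t gfp)
{
	if (size >= (PAGE_SIZE << BUDDY_MIN_ORDER))
		return buddy_alloc(size, gfp);	/* rare: 64k and up */
	return slab_alloc(size, gfp);		/* carved from 64k chunks */
}

/* get_free_page() would then be shorthand for a naturally aligned
 * PAGE_SIZE allocation served by the slab layer: */
static unsigned long sketch_get_free_page(gfp_t gfp)
{
	return (unsigned long)slab_alloc(PAGE_SIZE, gfp);
}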

-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Christoph Lameter
On Thu, 10 Jan 2008, Matt Mackall wrote:

> Here I'm going to differ with you. The premises of the SLAB concept
> (from the original paper) are: 
> 
> a) fragmentation of conventional allocators gets worse over time

Even fragmentation of SLAB/SLUB gets worse over time. That is why we need 
a defrag solution.

> b) grouping objects of the same -type- (not size) together should mean
> they have similar lifetimes and thereby keep fragmentation low

I agree that is crap. The lifetimes argument is mostly only exploitable in 
benchmarks. That is why SLUB just groups them by size if possible.

> d) constructors and destructors are cache-friendly

I agree. Crap too. We removed the destructors. The constructors are needed 
so that objects in slab pages always have a definite state. That is e.g.
necessary for slab defragmentation because it has to be able to inspect an 
object at an arbitrary time and either remove it or move it to another 
slab page.

Constructors also make sense because the initialization of a cache object 
may be expensive. Initializing list heads and spinlocks can take some code 
and that code can be omitted if objects have a definite state when they 
are free. We saw that when measuring the buffer_head constructor's effect 
on performance.
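
To make that concrete, here is a sketch of such a constructor (the struct
and cache are invented for illustration; the ctor prototype is the
2.6.24-era kmem_cache_create() one):

#include <linux/slab.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/init.h>

struct example_obj {
	struct list_head lru;
	spinlock_t       lock;
};

/* Runs when a slab page is populated, not on every allocation, so a
 * free object already has its list head and lock in a definite state. */
static void example_ctor(struct kmem_cache *cachep, void *p)
{
	struct example_obj *obj = p;

	INIT_LIST_HEAD(&obj->lru);
	spin_lock_init(&obj->lock);
}

static struct kmem_cache *example_cache;

static int __init example_init(void)
{
	example_cache = kmem_cache_create("example_obj",
			sizeof(struct example_obj), 0,
			SLAB_HWCACHE_ALIGN, example_ctor);
	return example_cache ? 0 : -ENOMEM;
}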
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Christoph Lameter
On Thu, 10 Jan 2008, Matt Mackall wrote:

> One idea I've been kicking around is pushing the boundary for the buddy
> allocator back a bit (to 64k, say) and using SL*B under that. The page
> allocators would call into buddy for larger than 64k (rare!) and SL*B
> otherwise. This would let us greatly improve our handling of things like
> task structs and skbs and possibly also things like 8k stacks and jumbo
> frames. As SL*B would never be competing with the page allocator for
> contiguous pages (the buddy allocator's granularity would be 64k), I
> don't think this would exacerbate the page-level fragmentation issues.

This would create another large page size (and that would have my 
enthusiastic support). It would decrease the listlock effect drastically for 
SLUB. Even the initial simplistic implementation of SLUB was superior on 
the database transaction tests (I think it was up ~1%) on IA64 from the 
get-go, likely due to the larger 16k page size there. The larger page size 
could also be used for the page cache (ducking and running). A 64k 
page size that could be allocated without zone locks would be very good 
for SLUB.

However, isn't this basically confessing that the page allocator is not 
efficient for 4k page allocations? I have seen some weaknesses there. The 
overhead in the allocation path in particular is bad and I was thinking 
about applying the same ideas used in SLUB also to the page allocator in 
order to bring the cycle count down from 500-1000 to 60 or so. Since both 
SLUB and SLOB use the page allocator for allocs > PAGE_SIZE, this would not 
only benefit the general kernel but also the slab allocations.
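
For illustration, the kind of fast path that could reach that cycle range
might look like this sketch (hypothetical; page_freelist and
slow_alloc_page() are made-up names, and the per-cpu lists are assumed to
be initialized and refilled elsewhere):

/* Per-cpu LIFO of free order-0 pages: the common case touches no
 * zone lock at all, SLUB-style. */
static DEFINE_PER_CPU(struct list_head, page_freelist);

static struct page *fast_alloc_page(gfp_t gfp)
{
	struct list_head *list = &get_cpu_var(page_freelist);
	struct page *page = NULL;

	if (!list_empty(list)) {
		page = list_first_entry(list, struct page, lru);
		list_del(&page->lru);		/* no zone->lock taken */
	}
	put_cpu_var(page_freelist);

	return page ? page : slow_alloc_page(gfp);	/* locked fallback */
}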
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Matt Mackall

On Thu, 2008-01-10 at 11:16 -0800, Christoph Lameter wrote:
> On Thu, 10 Jan 2008, Matt Mackall wrote:
> 
> > Here I'm going to differ with you. The premises of the SLAB concept
> > (from the original paper) are: 
> > 
> > a) fragmentation of conventional allocators gets worse over time
> 
> Even fragmentation of SLAB/SLUB gets worse over time. That is why we need 
> a defrag solution.
> 
> > b) grouping objects of the same -type- (not size) together should mean
> > they have similar lifetimes and thereby keep fragmentation low
> 
> I agree that is crap. The lifetimes argument is mostly only exploitable in 
> benchmarks. That is why SLUB just groups them by size if possible.
> 
> > d) constructors and destructors are cache-friendly
> 
> I agree. Crap too. We removed the destructors. The constructors are needed 
> so that objects in slab pages always have a definite state. That is e.g.
> necessary for slab defragmentation because it has to be able to inspect an 
> object at an arbitrary time and either remove it or move it to another 
> slab page.

Are you saying that the state of -freed- objects matters for your active
defragmentation? That's odd.

> Constructors also make sense because the initialization of a cache object 
> may be expensive. Initializing list heads and spinlocks can take some code 
> and that code can be omitted if objects have a definite state when they 
> are free. We saw that when measuring the buffer_head constructor's effect 
> on performance.

Hmm. SLOB proves that you don't need to segregate objects based on
constructors, so you could combine even slabs that have constructors and
just delay construction until allocation. I'm surprised constructors
have a measurable advantage...
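
A sketch of that delayed-construction idea (hypothetical code; block_alloc()
is a made-up "hand me any free block" primitive, and fields like c->size and
c->ctor are assumed to exist with the 2.6.24 ctor signature):

/* Run the constructor at allocation time instead of when the slab
 * page is filled, so ctor and non-ctor objects could share pages,
 * at the cost of paying for construction on every allocation. */
static void *lazy_alloc(struct kmem_cache *c, gfp_t gfp)
{
	void *obj = block_alloc(c->size, gfp);

	if (obj && c->ctor)
		c->ctor(c, obj);	/* construct on demand */
	return obj;
}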

-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Christoph Lameter
On Thu, 10 Jan 2008, Matt Mackall wrote:

> > I agree. Crap too. We removed the destructors. The constructors are needed 
> > so that objects in slab pages always have a definite state. That is e.g.
> > necessary for slab defragmentation because it has to be able to inspect an 
> > object at an arbitrary time and either remove it or move it to another 
> > slab page.
> 
> Are you saying that the state of -freed- objects matters for your active
> defragmentation? That's odd.

The state of the object immediately after it is allocated matters for a 
defrag solution. A kmalloc leads to an object in an undetermined state if 
you have no constructor. Code will then initialize the object, but defrag, 
for example, must be able to inspect the object before that happens. This 
means either that the freed object has a defined state or that kmalloc 
establishes that state before the object is marked as allocated.

> > Constructors also make sense because the initialization of a cache object 
> > may be expensive. Initializing list heads and spinlocks can take some code 
> > and that code can be omitted if objects have a definite state when they 
> > are free. We saw that when measuring the buffer_head constructor's effect 
> > on performance.
> 
> Hmm. SLOB proves that you don't need to segregate objects based on
> constructors, so you could combine even slabs that have constructors and
> just delay construction until allocation. I'm surprised constructors
> have a measurable advantage...

That does not work if you need to inspect allocated objects at any time 
for a defrag solution. All objects in a defragmentable slab need to have a 
consistent object state when allocated. If some objects have no 
constructor, then they have no defined state and may contain arbitrary 
bytes.
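
For illustration, the kind of check a defrag pass has to make could look
like this sketch (the object, its fields, and relocate_object() are all
invented; the point is only that the fields must already be valid the
moment the object is marked allocated):

struct example_obj {
	struct list_head lru;
	atomic_t refcount;
};

static int try_to_unpin(struct example_obj *obj)
{
	/* Only safe because a constructor guarantees these fields are
	 * initialized before the object can ever be seen as allocated: */
	if (atomic_read(&obj->refcount) > 1 || !list_empty(&obj->lru))
		return -EBUSY;		/* pinned: the slab page stays */

	relocate_object(obj);		/* made-up move helper */
	return 0;
}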

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Christoph Lameter
On Thu, 10 Jan 2008, Matt Mackall wrote:

> Well, I think we'd still have the same page size, in the sense that we'd
> have a struct page for every hardware page and we'd still have hardware
> page-sized pages in the page cache. We'd just change how we allocated
> them. Right now we've got a stack that looks like:

We would not change the hardware page. Cannot do that. But we would have 
preferential treatment for 64k and 2M pages in the page allocator?

>  buddy / page allocator
>  SL*B allocator
>  kmalloc
> 
> And we'd change that to:
> 
>  buddy allocator
>  SL*B allocator
>  page allocator / kmalloc
> 
> So get_free_page() would still hand you back a hardware page, it would
> just do it through SL*B.

Hmm. Not sure what effect this would have. We already have the pcp's 
that have a similar effect.

> > It would decrease the listlock effect drastically for SLUB.
> 
> Not sure what you're referring to here.

Allocations in 64k chunks mean 16 times fewer basic allocation blocks to 
manage for the slab allocators, so the metadata to be maintained by the 
allocators is reduced by that factor. SLUB only needs to touch the 
list_lock (in some situations, like a free to a non-cpu slab) if a block 
becomes completely empty or goes from fully allocated to partially 
allocated. The larger the block size, the more objects there are in a 
block and the fewer of these lock-protected actions are needed.
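
As a back-of-the-envelope illustration (numbers mine, assuming 256-byte
objects):

   4k block:   4096 / 256 =  16 objects per block
  64k block:  65536 / 256 = 256 objects per block

so there are 65536 / 4096 = 16 times fewer blocks for the same object
population, and correspondingly fewer empty/partial/full transitions that
have to take the per-node list_lock.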

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Christoph Lameter
On Thu, 10 Jan 2008, Linus Torvalds wrote:

> It's not even clear that a buddy allocator even for the high-order pages 
> is at all the right choice. Almost nobody actually wants >64kB blocks, and 
> the ones that *do* want bigger allocations tend to want *much* bigger 
> ones, so it's quite possible that it could be worth it to have something 
> like a three-level allocator:

Excellent! I am definitely on board with this.

>  - huge pages (superpages for those crazy db people)

Just a simple linked list of these things is fine, we'd never care 
about coalescing large pages together anyway.

>  - large pages (on the order of ~64kB) - with *perhaps* a buddy bitmap 
>    setup to try to coalesce back into huge-pages, but more likely just 
>    admitting that you'd need something like migration to ever get back a 
>    hugepage that got split into large-pages.

So maybe a simple bitmap allocator per huge-page for large pages. Say 
you have a 4MB huge-page, and just a 64-bit free-bitmap per huge-page 
when you split it into large pages.

>  - slab/slub/slob for anything else, and get_free_page() ends up being 
>    just a shorthand for saying naturally aligned kmalloc of size 
>    PAGE_SIZE<<order

> and maybe it would all work out ok. 

Hmmm... a 3 level allocator? Basically we would have BASE_PAGE 
STANDARD_PAGE and HUGE_PAGE? We could simply extend the page allocator to 
have 3 pcp lists for these sizes and go from there?

Thinking about the arches this would mean

        BASE_PAGE   STANDARD_PAGE   HUGE_PAGE
x86_64  4k          64k             2M
i386    4k          16k             4M
ia64    16k         256k            1G

?
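
For what it's worth, the 64-bit free-bitmap idea quoted above is small
enough to sketch in full (standalone illustration, not kernel code;
4MB / 64k = 64 slots, so a single 64-bit mask suffices):

/* One 4MB huge page split into 64 x 64k large pages, one bit each. */
struct huge_page {
	unsigned long base;		/* base address of the 4MB page */
	unsigned long long free_mask;	/* bit set => 64k slot is free */
};

#define LARGE_PAGE_SHIFT 16		/* 64k */

static unsigned long large_page_alloc(struct huge_page *hp)
{
	int slot;

	if (!hp->free_mask)
		return 0;			/* full: split another huge page */
	slot = __builtin_ctzll(hp->free_mask);	/* lowest free slot */
	hp->free_mask &= ~(1ULL << slot);
	return hp->base + ((unsigned long)slot << LARGE_PAGE_SHIFT);
}

static void large_page_free(struct huge_page *hp, unsigned long addr)
{
	hp->free_mask |= 1ULL << ((addr - hp->base) >> LARGE_PAGE_SHIFT);
	/* free_mask == ~0ULL would mean the whole huge page could be
	 * coalesced back again. */
}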

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Andi Kleen
> >  - huge pages (superpages for those crazy db people)
> 
> Just a simple linked list of these things is fine, we'd never care 
> about coalescing large pages together anyway.

I did essentially that for my GBpages hugetlbfs patchkit. GB pages are already
beyond MAX_ORDER, and increasing MAX_ORDER didn't seem attractive because
it would require aligning the zones all to 1GB, which would be quite nasty.

So it just grabbed them out of bootmem early and managed them in a 
per node list.

Not sure it's a good idea for 2MB pages though.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-09 Thread Matt Mackall

On Thu, 2008-01-10 at 00:43 +0200, Pekka J Enberg wrote:
> Hi Matt,
> 
> On Wed, 9 Jan 2008, Matt Mackall wrote:
> > I kicked this around for a while, slept on it, and then came up with
> > this little hack first thing this morning:
> > 
> > 
> > slob: split free list by size
> > 
> 
> [snip]
> 
> > And the results are fairly miraculous, so please double-check them on
> > your setup. The resulting statistics change to this:
> 
> [snip]
>  
> > So the average jumped by 714k from before the patch, became much more
> > stable, and beat SLUB by 287k. There are also 7 perfectly filled pages
> > now, up from 1 before. And we can't get a whole lot better than this:
> > we're using 259 pages for 244 pages of actual data, so our total
> > overhead is only 6%! For comparison, SLUB's using about 70 pages more
> > for the same data, so its total overhead appears to be about 35%.
> 
> Unfortunately I only see a slight improvement to SLOB (but it still gets 
> beaten by SLUB):
> 
> [ the minimum, maximum, and average are captured from 10 individual runs ]
> 
>                                     Free (kB)                 Used (kB)
>                       Total (kB)   min     max     average    min   max   average
>   SLUB (no debug)       26536     23868   23892   23877.6    2644  2668  2658.4
>   SLOB (patched)        26548     23456   23708   23603.2    2840  3092  2944.8
>   SLOB (vanilla)        26548     23472   23640   23579.6    2908  3076  2968.4
>   SLAB (no debug)       26544     23316   23364   23343.2    3180  3228  3200.8
>   SLUB (with debug)     26484     23120   23136   23127.2    3348  3364  3356.8

With your kernel config and my lguest+busybox setup, I get:

SLUB:
MemFree: 24208 kB
MemFree: 24212 kB
MemFree: 24212 kB
MemFree: 24212 kB
MemFree: 24216 kB
MemFree: 24216 kB
MemFree: 24220 kB
MemFree: 24220 kB
MemFree: 24224 kB
MemFree: 24232 kB
avg: 24217.2

SLOB with two lists:
MemFree: 24204 kB
MemFree: 24260 kB
MemFree: 24260 kB
MemFree: 24276 kB
MemFree: 24288 kB
MemFree: 24292 kB
MemFree: 24312 kB
MemFree: 24320 kB
MemFree: 24336 kB
MemFree: 24396 kB
avg: 24294.4

Not sure why this result is so different from yours.

Hacked this up to three lists to experiment and we now have:
MemFree: 24348 kB
MemFree: 24372 kB
MemFree: 24372 kB
MemFree: 24372 kB
MemFree: 24372 kB
MemFree: 24380 kB
MemFree: 24384 kB
MemFree: 24404 kB
MemFree: 24404 kB
MemFree: 24408 kB
avg: 24344.4

Even the last version is still using about 250 pages of storage for 209
pages of data, so it's got about 20% overhead still.



-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-09 Thread Matt Mackall

On Thu, 2008-01-10 at 00:43 +0200, Pekka J Enberg wrote:
> Hi Matt,
> 
> On Wed, 9 Jan 2008, Matt Mackall wrote:
> > I kicked this around for a while, slept on it, and then came up with
> > this little hack first thing this morning:
> > 
> > 
> > slob: split free list by size
> > 
> 
> [snip]
> 
> > And the results are fairly miraculous, so please double-check them on
> > your setup. The resulting statistics change to this:
> 
> [snip]
>  
> > So the average jumped by 714k from before the patch, became much more
> > stable, and beat SLUB by 287k. There are also 7 perfectly filled pages
> > now, up from 1 before. And we can't get a whole lot better than this:
> > we're using 259 pages for 244 pages of actual data, so our total
> > overhead is only 6%! For comparison, SLUB's using about 70 pages more
> > for the same data, so its total overhead appears to be about 35%.
> 
> Unfortunately I only see a slight improvement to SLOB (but it still gets 
> beaten by SLUB):
> 
> [ the minimum, maximum, and average are captured from 10 individual runs ]
> 
>                                     Free (kB)                 Used (kB)
>                       Total (kB)   min     max     average    min   max   average
>   SLUB (no debug)       26536     23868   23892   23877.6    2644  2668  2658.4
>   SLOB (patched)        26548     23456   23708   23603.2    2840  3092  2944.8
>   SLOB (vanilla)        26548     23472   23640   23579.6    2908  3076  2968.4
>   SLAB (no debug)       26544     23316   23364   23343.2    3180  3228  3200.8
>   SLUB (with debug)     26484     23120   23136   23127.2    3348  3364  3356.8

Huh, that's a fairly negligible change on your system. Is that with or
without the earlier patch? That doesn't appear to change much here.
Guess I'll have to clean up my stats patch and send it to you.

> What .config did you use? What kind of user-space do you have? (I am still 
> using the exact same configuration I described in the first mail.)

I'm using lguest with my Thinkpad config (I'll send that separately),
with busybox and init=doit like your setup. Was having trouble getting
lguest going with your config, but I may have found the problem.

-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-09 Thread Pekka J Enberg
Hi Matt,

On Wed, 9 Jan 2008, Matt Mackall wrote:
> I kicked this around for a while, slept on it, and then came up with
> this little hack first thing this morning:
> 
> 
> slob: split free list by size
> 

[snip]

> And the results are fairly miraculous, so please double-check them on
> your setup. The resulting statistics change to this:

[snip]
 
> So the average jumped by 714k from before the patch, became much more
> stable, and beat SLUB by 287k. There are also 7 perfectly filled pages
> now, up from 1 before. And we can't get a whole lot better than this:
> we're using 259 pages for 244 pages of actual data, so our total
> overhead is only 6%! For comparison, SLUB's using about 70 pages more
> for the same data, so its total overhead appears to be about 35%.

Unfortunately I only see a slight improvement to SLOB (but it still gets 
beaten by SLUB):

[ the minimum, maximum, and average are captured from 10 individual runs ]

                                    Free (kB)                 Used (kB)
                      Total (kB)   min     max     average    min   max   average
  SLUB (no debug)       26536     23868   23892   23877.6    2644  2668  2658.4
  SLOB (patched)        26548     23456   23708   23603.2    2840  3092  2944.8
  SLOB (vanilla)        26548     23472   23640   23579.6    2908  3076  2968.4
  SLAB (no debug)       26544     23316   23364   23343.2    3180  3228  3200.8
  SLUB (with debug)     26484     23120   23136   23127.2    3348  3364  3356.8

What .config did you use? What kind of user-space do you have? (I am still 
using the exact same configuration I described in the first mail.)

Pekka
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-09 Thread Matt Mackall

On Mon, 2008-01-07 at 20:06 +0200, Pekka J Enberg wrote:
> Hi Matt,
> 
> On Sun, 6 Jan 2008, Matt Mackall wrote:
> > I don't have any particular "terrible" workloads for SLUB. But my
> > attempts to simply boot with all three allocators to init=/bin/bash in,
> > say, lguest show a fair margin for SLOB.
> 
> Sorry, I once again have bad news ;-). I did some testing with
> 
>   lguest --block=rootfile 32 /boot/vmlinuz-2.6.24-rc6 root=/dev/vda init=doit
> 
> where rootfile is
> 
>   http://uml.nagafix.co.uk/BusyBox-1.5.0/BusyBox-1.5.0-x86-root_fs.bz2
> 
> and the "doit" script in the guest passed as init= is just
> 
>   #!/bin/sh
>   mount -t proc proc /proc
>   cat /proc/meminfo | grep MemTotal
>   cat /proc/meminfo | grep MemFree
>   cat /proc/meminfo | grep Slab
> 
> and the results are:
> 
> [ the minimum, maximum, and average are captured from 10 individual runs ]
> 
>                                     Free (kB)                 Used (kB)
>                       Total (kB)   min     max     average    min   max   average
>   SLUB (no debug)       26536     23868   23892   23877.6    2644  2668  2658.4
>   SLOB                  26548     23472   23640   23579.6    2908  3076  2968.4
>   SLAB (no debug)       26544     23316   23364   23343.2    3180  3228  3200.8
>   SLUB (with debug)     26484     23120   23136   23127.2    3348  3364  3356.8
> 
> So it seems that on average SLUB uses 300 kilobytes *less memory* (!) (which is
> roughly 1% of total memory available) after boot than SLOB for my
> configuration.
> 
> One possible explanation is that the high internal fragmentation (space
> allocated but not used) of SLUB kmalloc() only affects short-lived allocations
> and thus does not show up in the more permanent memory footprint.  Likewise, it
> could be that SLOB has higher external fragmentation (small blocks that are
> unavailable for allocation) from which SLUB does not suffer.  Dunno, haven't
> investigated as my results are contradictory to yours.

Yep, you were definitely onto something here. Here's what I got with 10
runs of SLUB on my setup:

MemFree: 55780 kB
MemFree: 55780 kB
MemFree: 55780 kB
MemFree: 55784 kB
MemFree: 55788 kB
MemFree: 55788 kB
MemFree: 55788 kB
MemFree: 55792 kB
MemFree: 55796 kB
MemFree: 55800 kB
Avg: 55787.6

And with SLOB + my defrag fix:

MemFree: 55236 kB
MemFree: 55284 kB
MemFree: 55292 kB
MemFree: 55304 kB
MemFree: 55384 kB
MemFree: 55388 kB
MemFree: 55412 kB
MemFree: 55420 kB
MemFree: 55436 kB
MemFree: 55444 kB
Avg: 55360.0

Ouch!

So I added a bunch of statistics gathering:

counted pages 409 unused 185242 biggest 1005 fragments 1416
slob pages 410 allocs 22528 frees 12109 active 10419 allocated 932647
  page scans 11249 block scans 40650
kmallocs 10247 active 5450 allocated 3484680 overhead 48836
bigpages 827 active 17
total 427 used 245

The first line tells us about SLOB's free list, which has 409 pages,
185k unused, spread into 1416 fragments. The average fragment is 130
bytes.

The next tells us that we've got 410 total SLOB pages (1 is fully
allocated), we've done 22k allocs, 12k frees, have 10k allocations
active, and 932k total memory allocated (including kmallocs). That means
our average SLOB allocation is ~90 bytes.

The kmallocs line tells us we've done 10k allocs, and 5k of them are not
yet freed. Since boot, we requested 3.48MiB of kmalloc (without padding)
and added on 49k of padding. Thus the average kmalloc is 340 bytes and
has 4.77 bytes of padding (about 1.4% overhead, quite good!).

SLAB and kmalloc objects >= 4k are handed straight to the page allocator
(same as SLUB), of which there are 17 active pages.

So in total, SLOB is using 427 pages for what optimally could fit in 245
pages. In other words, external fragmentation is horrible.
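
Spelling out the arithmetic behind those averages:

  average free fragment:  185242 / 1416   = ~130 bytes
  average SLOB alloc:     932647 / 10419  = ~90 bytes
  average kmalloc:        3484680 / 10247 = ~340 bytes
  kmalloc padding:        48836 / 10247   = ~4.77 bytes (~1.4%)
  page-level overhead:    427 / 245       = ~1.74, i.e. ~74% more pages
                          than the data strictly needs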

I kicked this around for a while, slept on it, and then came up with
this little hack first thing this morning:


slob: split free list by size

diff -r 6901ca355181 mm/slob.c
--- a/mm/slob.c Tue Jan 08 21:01:15 2008 -0600
+++ b/mm/slob.c Wed Jan 09 12:31:59 2008 -0600
@@ -112,7 +112,9 @@ static inline void free_slob_page(struct
 /*
  * All (partially) free slob pages go on this list.
  */
-static LIST_HEAD(free_slob_pages);
+#define SLOB_BREAK_POINT 300
+static LIST_HEAD(free_slob_pages_big);
+static LIST_HEAD(free_slob_pages_small);
 
 /*
  * slob_page: True for all slob pages (false for bigblock pages)
@@ -140,9 +142,9 @@ static inline int slob_page_free(struct 
 	return test_bit(PG_private, &sp->flags);
 }
 
-static inline void set_slob_page_free(struct slob_page *sp)
+static inline void set_slob_page_free(struct slob_page *sp, struct list_head *list)
 {
-	list_add(&sp->list, &free_slob_pages);
+	list_add(&sp->list, list);
 	__set_bit(PG_private, &sp->flags);
 }
 
@@ -294,12 +296,18 @@ static void *slob_alloc(size_t size, gfp
 {
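
The hunk is cut off above. As a reader's guide, the shape the list selection
presumably takes inside slob_alloc(), reconstructed from the visible hunks
(guesswork, not the actual remainder of the patch):

/* Guesswork sketch: pick the right free list so small requests never
 * have to wade through pages dominated by big blocks, and vice versa. */
struct list_head *list;

if (size < SLOB_BREAK_POINT)
	list = &free_slob_pages_small;
else
	list = &free_slob_pages_big;

/* ... then scan 'list' for a fit, and pass it to
 * set_slob_page_free(sp, list) when a fresh page is added. */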
