Re: [PATCH] JBD slab cleanups

2007-09-18 Thread Christoph Hellwig
On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
 Here is the incremental small cleanup patch. 
 
 Remove kmalloc usages in jbd/jbd2 and consistently use 
 jbd_kmalloc/jbd2_malloc.

Shouldn't we kill jbd_kmalloc instead?



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Mel Gorman
On (17/09/07 15:00), Christoph Lameter didst pronounce:
 On Sun, 16 Sep 2007, Nick Piggin wrote:
 
  I don't know how it would prevent fragmentation from building up
  anyway. It's commonly the case that potentially unmovable objects
  are allowed to fill up all of ram (dentries, inodes, etc).
 
 Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from 
 ZONE_MOVABLE and thus the memory that can be allocated for them is 
 limited.
 

As Nick points out, having to configure something makes it a #2
solution. However, I at least am ok with that. ZONE_MOVABLE is a get-out
clause to be able to control fragmentation no matter what the workload is
as it gives hard guarantees. Even when ZONE_MOVABLE is replaced by some
mechanism in grouping pages by mobility to force a number of blocks to be
MIGRATE_MOVABLE_ONLY, the emergency option will exist.

We still lack data on what sort of workloads really benefit from large
blocks (assuming there are any that cannot also be solved by improving
order-0). With Christoph's approach + grouping pages by mobility +
ZONE_MOVABLE-if-it-screws-up, people can start collecting that data over the
course of the next few months while we're waiting for fsblock or software
pagesize to mature.

Do we really need to keep discussing this as no new point has been made in a
while? Can we at least take out the non-contentious parts of Christoph's
patches such as the page cache macros and do something with them?

-- 
Mel "tired of typing" Gorman
Part-time PhD Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Jörn Engel
On Tue, 18 September 2007 11:00:40 +0100, Mel Gorman wrote:
 
 We still lack data on what sort of workloads really benefit from large
 blocks

Compressing filesystems like jffs2 and logfs gain a better compression
ratio with larger blocks.  Going from 4KiB to 64KiB gave somewhere
around 10% benefit iirc.  Test data was a 128MiB qemu root filesystem.

Granted, the same could be achieved by adding some extra code and a few
bounce buffers to the filesystem.  How such a hack would perform I'd
prefer not to find out, though. :)

Jörn

-- 
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug McIlroy


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread David Chinner
On Tue, Sep 18, 2007 at 11:00:40AM +0100, Mel Gorman wrote:
 We still lack data on what sort of workloads really benefit from large
 blocks (assuming there are any that cannot also be solved by improving
 order-0).

No we don't. All workloads benefit from larger block sizes when
you've got a btree tracking 20 million inodes and a create has to
search that tree for a free inode.  The tree gets much wider and
hence we take fewer disk seeks to traverse the tree. Same for large
directories, btrees tracking free space, etc - everything goes
faster with a larger filesystem block size because we spend less
time doing metadata I/O.
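
To make the tree-height effect concrete, here's a quick user-space
sketch (the 16-byte entry size is an assumption for illustration,
not the real XFS on-disk format):

	/* Back-of-envelope: btree fanout and height vs. block size. */
	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		const double nrecords = 20e6;	/* 20 million inodes */
		const int blocksizes[] = { 4096, 16384, 65536 };

		for (int i = 0; i < 3; i++) {
			/* assumed 16 bytes per key/pointer entry */
			double fanout = blocksizes[i] / 16.0;
			double height = ceil(log(nrecords) / log(fanout));

			printf("%6d-byte blocks: fanout %4.0f, height %.0f\n",
			       blocksizes[i], fanout, height);
		}
		return 0;
	}

Each tree level that goes away is one less metadata read - and on a
cold cache one less seek - per lookup.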

And the other advantage is that sequential I/O speeds also tend to
increase with larger block sizes. e.g. XFS on an Altix (16k pages)
using 16k block size is about 20-25% faster on writes than 4k block
size. See the graphs at the top of page 12:

http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

The benefits are really about scalability, especially with terabyte
sized disks now on the market.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [2/3] 2.6.23-rc6: known regressions v2

2007-09-18 Thread Jan Kara
 FS
 
 Subject : hanging ext3 dbench tests
 References  : http://lkml.org/lkml/2007/9/11/176
 Last known good : ?
 Submitter   : Andy Whitcroft [EMAIL PROTECTED]
 Caused-By   : ?
 Handled-By  : ?
 Status  : under test -- unreproducible at present
  Yep... Hard to do anything until Andy is able to reproduce it at least
once more to gather needed info.

 Subject : umount triggers a warning in jfs and takes almost a minute
 References  : http://lkml.org/lkml/2007/9/4/73
 Last known good : ?
 Submitter   : Oliver Neukum [EMAIL PROTECTED]
 Caused-By   : ?
 Handled-By  : ?
 Status  : unknown
  I thought Shaggy asked Oliver about some details (and he did not
answer so far) so I'd assume Shaggy is handling this.


Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [2/3] 2.6.23-rc6: known regressions v2

2007-09-18 Thread Dave Kleikamp
On Tue, 2007-09-18 at 16:24 +0200, Jan Kara wrote:

  Subject : umount triggers a warning in jfs and takes almost a minute
  References  : http://lkml.org/lkml/2007/9/4/73
  Last known good : ?
  Submitter   : Oliver Neukum [EMAIL PROTECTED]
  Caused-By   : ?
  Handled-By  : ?
  Status  : unknown
   I thought Shaggy asked Oliver about some details (and he did not
 answer so far) so I'd assume Shaggy is handling this.

I've put it on the back-burner since I haven't heard back from Oliver.
I still haven't found out whether or not it is a regression.  I'm not
too concerned about fixing it right away.  I don't think jfs on flash is
very important.

-- 
David Kleikamp
IBM Linux Technology Center



Re: [2/3] 2.6.23-rc6: known regressions v2

2007-09-18 Thread Oliver Neukum
On Tuesday 18 September 2007, Jan Kara wrote:
  Subject         : umount triggers a warning in jfs and takes almost a minute
  References      : http://lkml.org/lkml/2007/9/4/73
  Last known good : ?
  Submitter       : Oliver Neukum [EMAIL PROTECTED]
  Caused-By       : ?
  Handled-By      : ?
  Status          : unknown
   I thought Shaggy asked Oliver about some details (and he did not
 answer so far) so I'd assume Shaggy is handling this.

I was without access to the hardware for a week. Testing will start
tomorrow.

Regards
Oliver



Re: [PATCH] JBD slab cleanups

2007-09-18 Thread Mingming Cao
On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
 On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
  Here is the incremental small cleanup patch. 
  
  Remove kmalloc usages in jbd/jbd2 and consistently use 
  jbd_kmalloc/jbd2_malloc.
 
 Shouldn't we kill jbd_kmalloc instead?
 

It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
places to handle memory (de)allocation (page size) via kmalloc/kfree, so
in the future, if we need to change memory allocation in jbd (e.g. not
using kmalloc or using a different flag), we don't need to touch every
place in the jbd code calling jbd_kmalloc.
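
For reference, the wrapper being discussed is only a thin layer over
kmalloc/kfree. A simplified sketch (the real __jbd_kmalloc, quoted
later in this thread, also threads an oom-retry flag through):

	/* One choke point for jbd allocation policy: changing the
	 * allocator or the GFP flags later means touching one place. */
	static inline void *jbd_kmalloc(size_t size, gfp_t flags)
	{
		return kmalloc(size, flags);
	}

	static inline void jbd_free(void *ptr, size_t size)
	{
		kfree(ptr);	/* size unused today; kept for future allocators */
	}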

Regards,
Mingming



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Nick Piggin
On Tuesday 18 September 2007 08:00, Christoph Lameter wrote:
 On Sun, 16 Sep 2007, Nick Piggin wrote:
  I don't know how it would prevent fragmentation from building up
  anyway. It's commonly the case that potentially unmovable objects
  are allowed to fill up all of ram (dentries, inodes, etc).

 Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from
 ZONE_MOVABLE and thus the memory that can be allocated for them is
 limited.

Why would ZONE_MOVABLE require that movable objects should be moved
out of the way for unmovable ones? It never _has_ any unmovable objects in
it. Quite obviously we were not talking about reserve zones.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Nick Piggin
On Tuesday 18 September 2007 08:21, Christoph Lameter wrote:
 On Sun, 16 Sep 2007, Nick Piggin wrote:
So if you argue that vmap is a downside, then please tell me how you
consider the -ENOMEM of your approach to be better?
  
   That is again pretty undifferentiated. Are we talking about low page
 
  In general.

 There is no -ENOMEM approach. Lower order page allocation (<
 PAGE_ALLOC_COSTLY_ORDER) will reclaim and in the worst case the OOM killer
 will be activated.

ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer
as the solution to the fragmentation problem? It would only have been funnier
if you had said to reboot every so often when memory gets fragmented :)


 That is the nature of the failures that we saw early in 
 the year when this was first merged into mm.

   With the ZONE_MOVABLE you can remove the unmovable objects into a
   defined pool then higher order success rates become reasonable.
 
  OK, if you rely on reserve pools, then it is not 1st class support and
  hence it is a non-solution to VM and IO scalability problems.

 ZONE_MOVABLE creates two memory pools in a machine. One of them is for
 movable and one for unmovable. That is in 2.6.23. So 2.6.23 has no first
 call support for order 0 pages?

What?


If, by special software layer, you mean the vmap/vunmap support in
fsblock, let's see... that's probably all of a hundred or two lines.
Contrast that with anti-fragmentation, lumpy reclaim, higher order
pagecache and its new special mmap layer... Hmm, seems like a no
brainer to me. You really still want to pursue the extra layer
argument as a point against fsblock here?
  
   Yes sure. Your code could not live without these approaches. Without the
 
  Actually: your code is the one that relies on higher order allocations.
  Now you're trying to turn that into an argument against fsblock?

 fsblock also needs contiguous pages in order to have a beneficial
 effect that we seem to be looking for.

Keyword: relies.


   antifragmentation measures your fsblock code would not be very
   successful in getting the larger contiguous segments you need to
   improve performance.
 
  Completely wrong. *I* don't need to do any of that to improve performance.
  Actually the VM is well tuned for order-0 pages, and so seeing as I have
  sane hardware, 4K pagecache works beautifully for me.

 Sure the system works fine as is. Not sure why we would need fsblock then.

Large block filesystem.


   (There is no new mmap layer, the higher order pagecache is simply the
   old API with set_blocksize expanded).
 
  Yes you add another layer in the userspace mapping code to handle higher
  order pagecache.

 That would imply a new API or something? I do not see it.

I was not implying a new API.


   Why: It is the same approach that you use.
 
  Again, rubbish.

 Ok the logical conclusion from the above is that you think your approach
 is rubbish 

The logical conclusion is that _they are not the same approach_!


 Is there some way you could cool down a bit? 

I'm not upset, but what you were saying was rubbish, plain and simple. The
number of times we've gone in circles, I most likely have already explained
this, several times, in a more polite manner.

And I know you're more than capable to understand at least the concept
behind fsblock, even without time to work through the exact details. What
are you expecting me to say, after all this back and forth, when you come
up with things like "[fsblock] is not a generic change but special to the
block layer", and then claim that fsblock is the same as allocating virtual
compound pages with vmalloc as a fallback for higher order allocs?

What I will say is that fsblock still has a relatively long way to go, so
maybe that's your reason for not looking at it. And yes, when fsblock is
in a better state to actually perform useful comparisons, that will be a
much better time to have these debates. But in that case, just say so :)
then I can go away and do more constructive work on it instead of filling
people's inboxes.

I believe the fsblock approach is the best one, but it's not without problems
and complexities, so I'm quite ready for it to be proven incorrect, not
performant, or otherwise rejected.

I'm going on holiday for 2 weeks. I'll try to stay away from email, and
particularly this thread.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Nick Piggin
On Tuesday 18 September 2007 08:05, Christoph Lameter wrote:
 On Sun, 16 Sep 2007, Nick Piggin wrote:
fsblock doesn't need any of those hacks, of course.
  
   Nor does mine for the low orders that we are considering. For order >
   MAX_ORDER this is unavoidable since the page allocator cannot manage
   such large pages. It can be used for lower order if there are issues
   (that I have not seen yet).
 
  Or we can just avoid all doubt (and doesn't have arbitrary limitations
  according to what you think might be reasonable or how well the
  system actually behaves).

 We can avoid all doubt in this patchset as well by adding support for
 fallback to a vmalloced compound page.

How would you do a vmapped fallback in your patchset? How would
you keep track of pages 2..N if they don't exist in the radix tree?
What if they don't even exist in the kernel's linear mapping? It seems
you would also require more special casing in the fault path and special
casing in the block layer to do this.

It's not a trivial problem you can just brush away by handwaving. Let's
see... you could add another field in struct page to store the vmap
virtual address, and set a new flag to indicate that constituent
page N can be found via vmalloc_to_page(page->vaddr + N*PAGE_SIZE).
Then add more special casing to the block layer and fault path etc. to handle
these new non-contiguous compound pages. I guess you must have thought
about it much harder than the 2 minutes I just did then, so you must have a
much nicer solution...
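
(For the record, the scheme sketched above would look something like
this - a hypothetical helper, with the stashed vmap address and a
PageVcompound flag as in the patches posted later in this digest:)

	/* Find constituent page N of a maybe-virtual compound page. */
	static inline struct page *compound_nth_page(struct page *head, int n)
	{
		if (likely(!PageVcompound(head)))
			return head + n;	/* physically contiguous */

		/* non-contiguous fallback: walk the vmap page tables */
		return vmalloc_to_page((void *)(head->private + n * PAGE_SIZE));
	}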

But even so, you're still trying very hard to avoid touching the filesystems
or buffer layer while advocating instead to squeeze the complexity out into
the vm and block layer. I don't agree that is the right thing to do. Sure it
is _easier_, because we know the VM.

I don't argue that fsblock large block support is trivial. But you are first
asserting that it is too complicated and then trying to address one of the
issues it solves by introducing complexity elsewhere.


Re: [PATCH] JBD slab cleanups

2007-09-18 Thread Dave Kleikamp
On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote:
 On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
  On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
   Here is the incremental small cleanup patch. 
   
   Remove kmalloc usages in jbd/jbd2 and consistently use 
   jbd_kmalloc/jbd2_malloc.
  
  Shouldn't we kill jbd_kmalloc instead?
  
 
 It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
 places to handle memory (de)allocation (page size) via kmalloc/kfree, so
 in the future, if we need to change memory allocation in jbd (e.g. not
 using kmalloc or using a different flag), we don't need to touch every
 place in the jbd code calling jbd_kmalloc.

I disagree.  Why would jbd need to globally change the way it allocates
memory?  It currently uses kmalloc (and jbd_kmalloc) for allocating a
variety of structures.  Having to change one particular instance won't
necessarily mean we want to change all of them.  Adding unnecessary
wrappers only obfuscates the code, making it harder to understand.  You
wouldn't want every subsystem to have its own *_kmalloc() that took
different arguments.  Besides, there aren't that many calls to kmalloc
and kfree in the jbd code, so there wouldn't be much pain in changing
GFP flags or whatever, if it ever needed to be done.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Tue, 18 Sep 2007, Nick Piggin wrote:
 
 ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer
 as the solution to the fragmentation problem? It would only have been funnier
 if you had said to reboot every so often when memory gets fragmented :)

Can we please stop this *idiotic* thread.

Nick, you and some others seem to be arguing based on a totally flawed 
base, namely:
 - we can guarantee anything at all in the VM
 - we even care about the 16kB blocksize
 - second-class citizenry is bad

The fact is, *none* of those things are true. The VM doesn't guarantee 
anything, and is already very much about statistics in many places. You 
seem to be arguing as if Christoph was introducing something new and 
unacceptable, when it's largely just more of the same.

And the fact is, nobody but SGI customers would ever want the 16kB 
blocksize. IOW - NONE OF THIS MATTERS!

Can you guys stop this inane thread already, or at least take it private 
between you guys, instead of forcing everybody else to listen in on your 
flamefest.

Linus


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Andrea Arcangeli
On Tue, Sep 18, 2007 at 11:30:17AM -0700, Linus Torvalds wrote:
 The fact is, *none* of those things are true. The VM doesn't guarantee 
 anything, and is already very much about statistics in many places. You 

Many? I can't recall anything besides PF_MEMALLOC and the decision
that the VM is oom. Those are the only two gray areas... the safety
margin is large enough that nobody notices the lack of black-and-white
solution.

So instead of working to provide guarantees for the above two gray
spots, we're making everything weaker; that's the wrong direction as
far as I can tell, especially if we're going to mess up big time the
common code in a backwards way only for those few users of those few
I/O devices out there.

In general, every time reliability has a lower priority than performance,
I have a hard time enjoying it.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Andrea Arcangeli
On Mon, Sep 17, 2007 at 12:56:07AM +0200, Goswin von Brederlow wrote:
 When has free ever given any useful free number? I can perfectly
 well allocate another gigabyte of memory despite free saying 25MB. But
 that is because I know that the buffer/cached are not locked in.

Well, as you said you know that buffer/cached are not locked in. If
/proc/meminfo would be rubbish like you seem to imply in the first
line, why would we ever bother to export that information and even
waste time writing a binary that parse it for admins?

 On the other hand, 1GB can instantly vanish when I start a xen domain,
 and anything relying on the free value would lose.

Actually, you'd better check meminfo or free before starting a 1G Xen domain!

 The only sensible thing for an application concerned with swapping is
 to watch the swapping and then reduce itself, not the amount
 free. Although I wish there were some kernel interface to get a
 pressure value of how valuable free pages would be right now. I would
 like that for fuse so a userspace filesystem can do caching without
 crippling the kernel.

Repeated drop caches + free can help.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Tue, 18 Sep 2007, Andrea Arcangeli wrote:
 
 Many? I can't recall anything besides PF_MEMALLOC and the decision
 that the VM is oom.

*All* of the buddy bitmaps, *all* of the GFP_ATOMIC, *all* of the zone 
watermarks, everything that we depend on every single day, is in the end 
just about statistically workable.

We do 1- and 2-order allocations all the time, and we know they work. 
Yet Nick (and this whole *idiotic* thread) has all been about how they 
cannot work.

 In general, every time reliability has a lower priority than performance,
 I have a hard time enjoying it.

This is not about performance. Never has been. It's about SGI wanting a 
way out of their current 16kB mess.

The way to fix performance is to move to x86-64, and use 4kB pages and be 
happy. However, the SGI people want a 16kB (and possibly bigger) 
crap-option for their people who are (often _already_) running some 
special case situation that nobody else cares about.

It's not about performance. If it was, they would never have used ia64 
in the first place.  It's about special-case users that do odd things.

Nobody sane would *ever* argue for 16kB+ blocksizes in general. 

Linus

PS. Yes, I realize that there's a lot of insane people out there. However, 
we generally don't do kernel design decisions based on them. But we can 
pat the insane users on the head and say "we won't guarantee it works, but 
if you eat your prozac, and don't bother us, go do your stupid things".


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Christoph Lameter
On Tue, 18 Sep 2007, Nick Piggin wrote:

 On Tuesday 18 September 2007 08:00, Christoph Lameter wrote:
  On Sun, 16 Sep 2007, Nick Piggin wrote:
   I don't know how it would prevent fragmentation from building up
   anyway. It's commonly the case that potentially unmovable objects
   are allowed to fill up all of ram (dentries, inodes, etc).
 
  Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from
  ZONE_MOVABLE and thus the memory that can be allocated for them is
  limited.
 
 Why would ZONE_MOVABLE require that movable objects should be moved
 out of the way for unmovable ones? It never _has_ any unmovable objects in
 it. Quite obviously we were not talking about reserve zones.

This was a response to your statement that all of memory could be filled
up by unmovable objects. Which cannot occur if the memory for unmovable
objects is limited. Not sure what you mean by reserves? Mel's reserves?
The reserves for unmovable objects established by ZONE_MOVABLE?




Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Christoph Lameter
On Tue, 18 Sep 2007, Nick Piggin wrote:

  We can avoid all doubt in this patchset as well by adding support for
  fallback to a vmalloced compound page.
 
 How would you do a vmapped fallback in your patchset? How would
 you keep track of pages 2..N if they don't exist in the radix tree?

Through the vmalloc structures and through the conventions established for 
compound pages?

 What if they don't even exist in the kernel's linear mapping? It seems
 you would also require more special casing in the fault path and special
 casing in the block layer to do this.

Well yeah there is some sucky part about vmapping things (same as in yours,
possibly more in mine since it's general and not specific to the page 
cache). On the other hand a generic vcompound fallback will allow us to 
use the page allocator in many places where we currently have to use 
vmalloc because the allocations are too big. It will allow us to get rid 
of most of the vmalloc uses and thereby reduce TLB pressure somewhat.
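
The freeing side of such conversions stays simple; callers that may
hold either kind of memory can do what patch [03/17] in this series
makes cxgb_free_mem() and ntfs_free() do:

	/* Sketch: free memory that may be page-allocator or vmalloc backed */
	static void mixed_free(void *addr)
	{
		if (is_vmalloc_addr(addr))
			vfree(addr);
		else
			kfree(addr);
	}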

The vcompound patchset is almost ready. Maybe bits and pieces may 
even help fsblock.


Re: [PATCH] JBD slab cleanups

2007-09-18 Thread Mingming Cao
On Tue, 2007-09-18 at 13:04 -0500, Dave Kleikamp wrote:
 On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote:
  On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
   On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
Here is the incremental small cleanup patch. 

Remove kmalloc usages in jbd/jbd2 and consistently use 
jbd_kmalloc/jbd2_malloc.
   
   Shouldn't we kill jbd_kmalloc instead?
   
  
  It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
  places to handle memory (de)allocation (page size) via kmalloc/kfree, so
  in the future, if we need to change memory allocation in jbd (e.g. not
  using kmalloc or using a different flag), we don't need to touch every
  place in the jbd code calling jbd_kmalloc.
 
 I disagree.  Why would jbd need to globally change the way it allocates
 memory?  It currently uses kmalloc (and jbd_kmalloc) for allocating a
 variety of structures.  Having to change one particular instance won't
 necessarily mean we want to change all of them.  Adding unnecessary
 wrappers only obfuscates the code making it harder to understand.  You
 wouldn't want every subsystem to have it's own *_kmalloc() that took
 different arguments.  Besides, there aren't that many calls to kmalloc
 and kfree in the jbd code, so there wouldn't be much pain in changing
 GFP flags or whatever, if it ever needed to be done.
 
 Shaggy

Okay, points taken. Here is the updated patch to get rid of slab
management and jbd_kmalloc from jbd entirely. This patch is intended to
replace the patch in the mm tree. Andrew, could you pick up this one
instead?

Thanks,

Mingming


jbd/jbd2: JBD memory allocation cleanups

From: Christoph Lameter [EMAIL PROTECTED]

JBD: Replace slab allocations with page cache allocations

JBD allocates memory for committed_data and frozen_data from slab. However,
JBD should not pass slab pages down to the block layer. Use page allocator 
pages instead. This will also prepare JBD for the large blocksize patchset.


Also, this patch cleans up jbd_kmalloc and replaces it with kmalloc directly.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]

---
 fs/jbd/commit.c   |6 +--
 fs/jbd/journal.c  |   99 ++
 fs/jbd/transaction.c  |   12 +++---
 fs/jbd2/commit.c  |6 +--
 fs/jbd2/journal.c |   99 ++
 fs/jbd2/transaction.c |   18 -
 include/linux/jbd.h   |   18 +
 include/linux/jbd2.h  |   21 +-
 8 files changed, 52 insertions(+), 227 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-18 17:19:01.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-18 17:51:21.0 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
 
 /*
  * Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
char *tmp;
 
jbd_unlock_bh_state(bh_in);
-   tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+   tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
-   jbd_slab_free(tmp, bh_in->b_size);
+   jbd_free(tmp, bh_in->b_size);
goto repeat;
}
 
@@ -654,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
 
-   journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+   journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
}
}
 
-   /*
-* Create a slab for this blocksize
-*/
-   err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
-   if (err)
-   return err;
-
/* Let the recovery code check whether it needs to recover any
 * data from the journal. */
if (journal_recover(journal))
@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
 }
 
 /*
- * Simple support for retrying memory allocations.  Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-   return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Wed, 19 Sep 2007, Nathan Scott wrote:
 
 FWIW (and I hate to let reality get in the way of a good conspiracy) -
 all SGI systems have always defaulted to using 4K blocksize filesystems;

Yes. And I've been told that:

 there's very few customers who would use larger

.. who apparently would like to move to x86-64. That was what people 
implied at the kernel summit.

 especially as the Linux
 kernel limitations in this area are well known.  There's no 16K mess
 that SGI is trying to clean up here (and SGI have offered both IA64 and
 x86_64 systems for some time now, so not sure how you came up with that
 whacko theory).

Well, if that is the case, then I vote that we drop the whole patch-series 
entirely. It clearly has no reason for existing at all.

There is *no* valid reason for 16kB blocksizes unless you have legacy 
issues. The performance issues have nothing to do with the block-size, and 
should be solvable by just making sure that your stupid "state of the art" 
crap SCSI controller gets contiguous physical memory, which is best done 
in the read-ahead code.

So get your stories straight, people.

Linus


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Nathan Scott
On Tue, 2007-09-18 at 12:44 -0700, Linus Torvalds wrote:
 This is not about performance. Never has been. It's about SGI wanting a 
 way out of their current 16kB mess.

Pass the crack pipe, Linus?

 The way to fix performance is to move to x86-64, and use 4kB pages and be 
 happy. However, the SGI people want a 16kB (and possibly bigger) 
 crap-option for their people who are (often _already_) running some 
 special case situation that nobody else cares about.

FWIW (and I hate to let reality get in the way of a good conspiracy) -
all SGI systems have always defaulted to using 4K blocksize filesystems;
there's very few customers who would use larger, especially as the Linux
kernel limitations in this area are well known.  There's no 16K mess
that SGI is trying to clean up here (and SGI have offered both IA64 and
x86_64 systems for some time now, so not sure how you came up with that
whacko theory).

 It's not about performance. If it was, they would never have used ia64

For SGI it really is about optimising ondisk layouts for some workloads
and large filesystems, and has nothing to do with IA64.  Read the paper
Dave sent out earlier, it's quite interesting.

For other people, like AntonA, who has also been asking for this
functionality literally for years (and ended up trying to do his own
thing inside NTFS IIRC) it's to be able to access existing filesystems
from other operating systems.  Here's a more recent discussion, I know
Anton had discussed it several times on fsdevel before this 2005 post
too:   http://oss.sgi.com/archives/xfs/2005-01/msg00126.html

Although I'm sure others exist, I've never worked on any platform other
than Linux that doesn't support filesystem block sizes larger than the
pagesize.  It's one thing to stick your head in the sand about the need
for this feature; it's another thing entirely to try to pass it off as an
SGI mess, sorry.

I do entirely support the sentiment to stop this pissing match and get
on with fixing the problem though.

cheers.

--
Nathan



Re: [PATCH] JBD slab cleanups

2007-09-18 Thread Andrew Morton
On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao [EMAIL PROTECTED] wrote:

 JBD: Replace slab allocations with page cache allocations
 
 JBD allocates memory for committed_data and frozen_data from slab. However,
 JBD should not pass slab pages down to the block layer. Use page allocator 
 pages instead. This will also prepare JBD for the large blocksize patchset.
 
 
 Also, this patch cleans up jbd_kmalloc and replaces it with kmalloc directly.

__GFP_NOFAIL should only be used when we have no way of recovering
from failure.  The allocation in journal_init_common() (at least)
_can_ recover and hence really shouldn't be using __GFP_NOFAIL.

(Actually, nothing in the kernel should be using __GFP_NOFAIL.  It is 
there as a marker which says "we really shouldn't be doing this but
we don't know how to fix it").

So sometime it'd be good if you could review all the __GFP_NOFAILs in
there and see if we can remove some, thanks.
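
Concretely, for the journal_init_common() hunk quoted earlier, the
failure path already exists, so the flag can simply be dropped - a
sketch of what that would look like:

	/* Allocation can fail here; the caller copes with NULL. */
	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
	if (!journal)
		goto fail;	/* recoverable - no __GFP_NOFAIL needed */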


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Nathan Scott
On Tue, 2007-09-18 at 18:06 -0700, Linus Torvalds wrote:
 There is *no* valid reason for 16kB blocksizes unless you have legacy 
 issues.

That's not correct.

 The performance issues have nothing to do with the block-size, and 

We must be thinking of different performance issues.

 should be solvable by just making sure that your stupid "state of the art"
 crap SCSI controller gets contiguous physical memory, which is best done
 in the read-ahead code. 

SCSI controllers have nothing to do with improving ondisk layout, which
is the performance issue I've been referring to.

cheers.

--
Nathan



[04/17] vmalloc: clean up page array indexing

2007-09-18 Thread Christoph Lameter
The page array is repeatedly indexed both in vunmap and vmalloc_area_node().
Add a temporary variable to make it easier to read (and easier to patch
later).

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/vmalloc.c |   16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/vmalloc.c
===
--- linux-2.6.orig/mm/vmalloc.c 2007-09-18 13:22:16.0 -0700
+++ linux-2.6/mm/vmalloc.c  2007-09-18 13:22:17.0 -0700
@@ -383,8 +383,10 @@ static void __vunmap(const void *addr, i
int i;
 
for (i = 0; i < area->nr_pages; i++) {
-   BUG_ON(!area->pages[i]);
-   __free_page(area->pages[i]);
+   struct page *page = area->pages[i];
+
+   BUG_ON(!page);
+   __free_page(page);
}
 
if (area->flags & VM_VPAGES)
@@ -488,15 +490,19 @@ void *__vmalloc_area_node(struct vm_stru
}
 
for (i = 0; i < area->nr_pages; i++) {
+   struct page *page;
+
if (node < 0)
-   area->pages[i] = alloc_page(gfp_mask);
+   page = alloc_page(gfp_mask);
else
-   area->pages[i] = alloc_pages_node(node, gfp_mask, 0);
-   if (unlikely(!area->pages[i])) {
+   page = alloc_pages_node(node, gfp_mask, 0);
+
+   if (unlikely(!page)) {
/* Successfully allocated i pages, free them in __vunmap() */
area->nr_pages = i;
goto fail;
}
+   area->pages[i] = page;
}
 
if (map_vm_area(area, prot, &pages))

-- 


[00/17] [RFC] Virtual Compound Page Support

2007-09-18 Thread Christoph Lameter
Currently there is a strong tendency to avoid larger page allocations in
the kernel because of past fragmentation issues and the current
defragmentation methods are still evolving. It is not clear to what extent
they can provide reliable allocations for higher order pages (plus the
definition of "reliable" seems to be in the eye of the beholder).

Currently we use vmalloc allocations in many locations to provide a safe
way to allocate larger arrays. That is due to the danger of higher order
allocations failing. Virtual Compound pages allow the use of regular
page allocator allocations that will fall back only if there is an actual
problem with acquiring a higher order page.

This patch set provides a way for a higher page allocation to fall back.
Instead of a physically contiguous page a virtually contiguous page
is provided. The functionality of the vmalloc layer is used to provide
the necessary page tables and control structures to establish a virtually
contiguous area.

Advantages:

- If higher order allocations are failing then virtual compound pages
  consisting of a series of order-0 pages can stand in for those
  allocations.

- Reliability as long as the vmalloc layer can provide virtual mappings.

- Ability to reduce the use of vmalloc layer significantly by using
  physically contiguous memory instead of virtual contiguous memory.
  Most uses of vmalloc() can be converted to page allocator calls.

- The use of physically contiguous memory instead of vmalloc may allow the
  use of larger TLB entries, thus reducing TLB pressure. It also reduces the
  need for page table walks.

Disadvantages:

- In order to use fall back, the logic accessing the memory must be
  aware that the memory could be backed by a virtual mapping and take
  precautions. virt_to_page() and page_address() may not work and
  vmalloc_to_page() and vmalloc_address() (introduced through this
  patch set) may have to be called instead (see the sketch after this
  list).

- Virtual mappings are less efficient than physical mappings.
  Performance will drop once virtual fall back occurs.

- Virtual mappings have more memory overhead. vm_area control structures
  page tables, page arrays etc need to be allocated and managed to provide
  virtual mappings.
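
A sketch of the access rule from the first disadvantage above
(vmalloc_address() is the variant this patch set introduces; caching
the address in page->private comes in patch [08/17]):

	/* Get a usable kernel virtual address for a maybe-virtual page. */
	static inline void *vcompound_kaddr(struct page *page)
	{
		if (unlikely(PageVcompound(page)))
			return (void *)page->private;	/* stashed vmap address */
		return page_address(page);		/* normal linear mapping */
	}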

The patchset provides this functionality in stages. Stage 1 introduces
the basic fall back mechanism necessary to replace vmalloc allocations
with

alloc_pages(GFP_VFALLBACK, order)

which signifies to the page allocator that a higher order is to be found
but a virtual mapping may stand in if there is an issue with fragmentation.
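
The conversion pattern for existing vmalloc users is then mechanical
(a sketch; patch [11/17] below does exactly this for the zone wait table):

	/* Before: always virtually mapped, one TLB entry per 4k page */
	table = vmalloc(alloc_size);

	/* After: physically contiguous when possible, vmap only on fallback */
	table = (void *)__get_free_pages(GFP_VFALLBACK, get_order(alloc_size));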

Stage 1 functionality does not allow allocation and freeing of virtual
mappings from interrupt contexts.

The stage 1 series ends with the conversion of a few key uses of vmalloc
in the VM to alloc_pages() for the allocation of sparsemem's memmap table
and the wait table in each zone. Other uses of vmalloc could be converted
in the same way.


Stage 2 functionality enhances the fallback even more allowing allocation
and frees in interrupt context.

SLUB is then modified to use the virtual mappings for slab caches
that are marked with SLAB_VFALLBACK. If a slab cache is marked this way
then we drop all the restraints regarding page order and allocate
good large memory areas that fit lots of objects so that we rarely
have to use the slow paths.

Two slab caches--the dentry cache and the buffer_heads--are then flagged
that way. Others could be converted in the same way.

The patch set also provides a debugging aid through setting

CONFIG_VFALLBACK_ALWAYS

If set then all GFP_VFALLBACK allocations fall back to the virtual
mappings. This is useful for verification tests. The testing of this
patch set was done by enabling that option and compiling a kernel.


Stage 3 functionality could be the adding of support for the large
buffer size patchset. Not done yet and not sure if it would be useful
to do.

Much of this patchset may only be needed for special cases in which the
existing defragmentation methods fail for some reason. It may be better to
have the system operate without such a safety net and make sure that the
page allocator can return large orders in a reliable way.

The initial idea for this patchset came from Nick Piggin's fsblock
and from his arguments about reliability and guarantees. Since his
fsblock uses the virtual mappings I think it is legitimate to
generalize the use of virtual mappings to support higher order
allocations in this way. The application of these ideas to the large
block size patchset etc. is straightforward. If wanted I can base
the next rev of the largebuffer patchset on this one and implement
fallback.

Contrary to Nick, I still doubt that any of this provides a guarantee.
Having said that, I have to deal with various failure scenarios in the VM
daily and I'd certainly like to see it work in a more reliable manner.

IMHO getting rid of the various workarounds to deal with the small 4k
pages and avoiding additional layers that group these pages in subsystem
specific 

[02/17] Vmalloc: add const

2007-09-18 Thread Christoph Lameter
Make vmalloc functions work the same way as kfree() and friends that
take a const void * argument.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/vmalloc.h |   10 +-
 mm/vmalloc.c|   16 
 2 files changed, 13 insertions(+), 13 deletions(-)

Index: linux-2.6/mm/vmalloc.c
===
--- linux-2.6.orig/mm/vmalloc.c 2007-09-18 18:34:06.0 -0700
+++ linux-2.6/mm/vmalloc.c  2007-09-18 18:34:33.0 -0700
@@ -169,7 +169,7 @@ EXPORT_SYMBOL_GPL(map_vm_area);
 /*
  * Map a vmalloc()-space virtual address to the physical page.
  */
-struct page *vmalloc_to_page(void *vmalloc_addr)
+struct page *vmalloc_to_page(const void *vmalloc_addr)
 {
unsigned long addr = (unsigned long) vmalloc_addr;
struct page *page = NULL;
@@ -198,7 +198,7 @@ EXPORT_SYMBOL(vmalloc_to_page);
 /*
  * Map a vmalloc()-space virtual address to the physical page frame number.
  */
-unsigned long vmalloc_to_pfn(void *vmalloc_addr)
+unsigned long vmalloc_to_pfn(const void *vmalloc_addr)
 {
return page_to_pfn(vmalloc_to_page(vmalloc_addr));
 }
@@ -305,7 +305,7 @@ struct vm_struct *get_vm_area_node(unsig
 }
 
 /* Caller must hold vmlist_lock */
-static struct vm_struct *__find_vm_area(void *addr)
+static struct vm_struct *__find_vm_area(const void *addr)
 {
struct vm_struct *tmp;
 
@@ -318,7 +318,7 @@ static struct vm_struct *__find_vm_area(
 }
 
 /* Caller must hold vmlist_lock */
-static struct vm_struct *__remove_vm_area(void *addr)
+static struct vm_struct *__remove_vm_area(const void *addr)
 {
struct vm_struct **p, *tmp;
 
@@ -347,7 +347,7 @@ found:
  * This function returns the found VM area, but using it is NOT safe
  * on SMP machines, except for its size or flags.
  */
-struct vm_struct *remove_vm_area(void *addr)
+struct vm_struct *remove_vm_area(const void *addr)
 {
struct vm_struct *v;
	write_lock(&vmlist_lock);
@@ -356,7 +356,7 @@ struct vm_struct *remove_vm_area(void *a
return v;
 }
 
-static void __vunmap(void *addr, int deallocate_pages)
+static void __vunmap(const void *addr, int deallocate_pages)
 {
struct vm_struct *area;
 
@@ -407,7 +407,7 @@ static void __vunmap(void *addr, int dea
  *
  * Must not be called in interrupt context.
  */
-void vfree(void *addr)
+void vfree(const void *addr)
 {
BUG_ON(in_interrupt());
__vunmap(addr, 1);
@@ -423,7 +423,7 @@ EXPORT_SYMBOL(vfree);
  *
  * Must not be called in interrupt context.
  */
-void vunmap(void *addr)
+void vunmap(const void *addr)
 {
BUG_ON(in_interrupt());
__vunmap(addr, 0);
Index: linux-2.6/include/linux/vmalloc.h
===
--- linux-2.6.orig/include/linux/vmalloc.h	2007-09-18 18:34:24.0 -0700
+++ linux-2.6/include/linux/vmalloc.h	2007-09-18 18:35:03.0 -0700
@@ -45,11 +45,11 @@ extern void *vmalloc_32_user(unsigned lo
 extern void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot);
 extern void *__vmalloc_area(struct vm_struct *area, gfp_t gfp_mask,
pgprot_t prot);
-extern void vfree(void *addr);
+extern void vfree(const void *addr);
 
 extern void *vmap(struct page **pages, unsigned int count,
unsigned long flags, pgprot_t prot);
-extern void vunmap(void *addr);
+extern void vunmap(const void *addr);
 
 extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
@@ -71,7 +71,7 @@ extern struct vm_struct *__get_vm_area(u
 extern struct vm_struct *get_vm_area_node(unsigned long size,
  unsigned long flags, int node,
  gfp_t gfp_mask);
-extern struct vm_struct *remove_vm_area(void *addr);
+extern struct vm_struct *remove_vm_area(const void *addr);
 
 extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
struct page ***pages);
@@ -82,8 +82,8 @@ extern struct vm_struct *alloc_vm_area(s
 extern void free_vm_area(struct vm_struct *area);
 
 /* Determine page struct from address */
-struct page *vmalloc_to_page(void *addr);
-unsigned long vmalloc_to_pfn(void *addr);
+struct page *vmalloc_to_page(const void *addr);
+unsigned long vmalloc_to_pfn(const void *addr);
 
 /*
 * Internals.  Don't use..

-- 


[01/17] Vmalloc: Move vmalloc_to_page to mm/vmalloc.

2007-09-18 Thread Christoph Lameter
We already have page table manipulation for vmalloc in vmalloc.c. Move the
vmalloc_to_page() function there as well. Also move the related definitions
from include/linux/mm.h.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/mm.h  |2 --
 include/linux/vmalloc.h |4 
 mm/memory.c |   40 
 mm/vmalloc.c|   38 ++
 4 files changed, 42 insertions(+), 42 deletions(-)

Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2007-09-18 18:33:56.0 -0700
+++ linux-2.6/mm/memory.c   2007-09-18 18:34:06.0 -0700
@@ -2727,46 +2727,6 @@ int make_pages_present(unsigned long add
return ret == len ? 0 : -1;
 }
 
-/* 
- * Map a vmalloc()-space virtual address to the physical page.
- */
-struct page * vmalloc_to_page(void * vmalloc_addr)
-{
-   unsigned long addr = (unsigned long) vmalloc_addr;
-   struct page *page = NULL;
-   pgd_t *pgd = pgd_offset_k(addr);
-   pud_t *pud;
-   pmd_t *pmd;
-   pte_t *ptep, pte;
-  
-   if (!pgd_none(*pgd)) {
-   pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd)) {
-   ptep = pte_offset_map(pmd, addr);
-   pte = *ptep;
-   if (pte_present(pte))
-   page = pte_page(pte);
-   pte_unmap(ptep);
-   }
-   }
-   }
-   return page;
-}
-
-EXPORT_SYMBOL(vmalloc_to_page);
-
-/*
- * Map a vmalloc()-space virtual address to the physical page frame number.
- */
-unsigned long vmalloc_to_pfn(void * vmalloc_addr)
-{
-   return page_to_pfn(vmalloc_to_page(vmalloc_addr));
-}
-
-EXPORT_SYMBOL(vmalloc_to_pfn);
-
 #if !defined(__HAVE_ARCH_GATE_AREA)
 
 #if defined(AT_SYSINFO_EHDR)
Index: linux-2.6/mm/vmalloc.c
===
--- linux-2.6.orig/mm/vmalloc.c 2007-09-18 18:33:56.0 -0700
+++ linux-2.6/mm/vmalloc.c  2007-09-18 18:34:06.0 -0700
@@ -166,6 +166,44 @@ int map_vm_area(struct vm_struct *area, 
 }
 EXPORT_SYMBOL_GPL(map_vm_area);
 
+/*
+ * Map a vmalloc()-space virtual address to the physical page.
+ */
+struct page *vmalloc_to_page(void *vmalloc_addr)
+{
+   unsigned long addr = (unsigned long) vmalloc_addr;
+   struct page *page = NULL;
+   pgd_t *pgd = pgd_offset_k(addr);
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep, pte;
+
+   if (!pgd_none(*pgd)) {
+   pud = pud_offset(pgd, addr);
+   if (!pud_none(*pud)) {
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_none(*pmd)) {
+   ptep = pte_offset_map(pmd, addr);
+   pte = *ptep;
+   if (pte_present(pte))
+   page = pte_page(pte);
+   pte_unmap(ptep);
+   }
+   }
+   }
+   return page;
+}
+EXPORT_SYMBOL(vmalloc_to_page);
+
+/*
+ * Map a vmalloc()-space virtual address to the physical page frame number.
+ */
+unsigned long vmalloc_to_pfn(void *vmalloc_addr)
+{
+   return page_to_pfn(vmalloc_to_page(vmalloc_addr));
+}
+EXPORT_SYMBOL(vmalloc_to_pfn);
+
 static struct vm_struct *__get_vm_area_node(unsigned long size,
					unsigned long flags,
					unsigned long start, unsigned long end,
					int node, gfp_t gfp_mask)
Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2007-09-18 18:33:56.0 -0700
+++ linux-2.6/include/linux/mm.h	2007-09-18 18:34:06.0 -0700
@@ -1160,8 +1160,6 @@ static inline unsigned long vma_pages(st
 
 pgprot_t vm_get_page_prot(unsigned long vm_flags);
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
-struct page *vmalloc_to_page(void *addr);
-unsigned long vmalloc_to_pfn(void *addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
Index: linux-2.6/include/linux/vmalloc.h
===
--- linux-2.6.orig/include/linux/vmalloc.h	2007-09-18 18:33:57.0 -0700
+++ linux-2.6/include/linux/vmalloc.h	2007-09-18 18:34:24.0 -0700
@@ -81,6 +81,10 @@ extern void unmap_kernel_range(unsigned 
 extern struct vm_struct *alloc_vm_area(size_t size);
 extern void 

[03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries

2007-09-18 Thread Christoph Lameter
This test is used in a couple of places. Add a version to vmalloc.h
and replace the other checks.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 drivers/net/cxgb3/cxgb3_offload.c |4 +---
 fs/ntfs/malloc.h  |3 +--
 fs/proc/kcore.c   |2 +-
 fs/xfs/linux-2.6/kmem.c   |3 +--
 fs/xfs/linux-2.6/xfs_buf.c|3 +--
 include/linux/mm.h|8 
 mm/sparse.c   |   10 +-
 7 files changed, 14 insertions(+), 19 deletions(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2007-09-17 21:46:06.0 -0700
+++ linux-2.6/include/linux/mm.h2007-09-17 23:56:54.0 -0700
@@ -1158,6 +1158,14 @@ static inline unsigned long vma_pages(st
	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 }
 
+/* Determine if an address is within the vmalloc range */
+static inline int is_vmalloc_addr(const void *x)
+{
+   unsigned long addr = (unsigned long)x;
+
+   return addr >= VMALLOC_START && addr < VMALLOC_END;
+}
+
 pgprot_t vm_get_page_prot(unsigned long vm_flags);
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
Index: linux-2.6/mm/sparse.c
===
--- linux-2.6.orig/mm/sparse.c  2007-09-17 21:45:24.0 -0700
+++ linux-2.6/mm/sparse.c   2007-09-17 23:56:26.0 -0700
@@ -289,17 +289,9 @@ got_map_ptr:
return ret;
 }
 
-static int vaddr_in_vmalloc_area(void *addr)
-{
-   if (addr >= (void *)VMALLOC_START &&
-   addr < (void *)VMALLOC_END)
-   return 1;
-   return 0;
-}
-
 static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
 {
-   if (vaddr_in_vmalloc_area(memmap))
+   if (is_vmalloc_addr(memmap))
vfree(memmap);
else
free_pages((unsigned long)memmap,
Index: linux-2.6/drivers/net/cxgb3/cxgb3_offload.c
===
--- linux-2.6.orig/drivers/net/cxgb3/cxgb3_offload.c	2007-09-17 21:45:24.0 -0700
+++ linux-2.6/drivers/net/cxgb3/cxgb3_offload.c	2007-09-17 21:46:06.0 -0700
@@ -1035,9 +1035,7 @@ void *cxgb_alloc_mem(unsigned long size)
  */
 void cxgb_free_mem(void *addr)
 {
-   unsigned long p = (unsigned long)addr;
-
-   if (p >= VMALLOC_START && p < VMALLOC_END)
+   if (is_vmalloc_addr(addr))
vfree(addr);
else
kfree(addr);
Index: linux-2.6/fs/ntfs/malloc.h
===
--- linux-2.6.orig/fs/ntfs/malloc.h 2007-09-17 21:45:24.0 -0700
+++ linux-2.6/fs/ntfs/malloc.h  2007-09-17 21:46:06.0 -0700
@@ -85,8 +85,7 @@ static inline void *ntfs_malloc_nofs_nof
 
 static inline void ntfs_free(void *addr)
 {
-   if (likely(((unsigned long)addr < VMALLOC_START) ||
-   ((unsigned long)addr >= VMALLOC_END ))) {
+   if (!is_vmalloc_addr(addr)) {
kfree(addr);
/* free_page((unsigned long)addr); */
return;
Index: linux-2.6/fs/proc/kcore.c
===
--- linux-2.6.orig/fs/proc/kcore.c  2007-09-17 21:45:24.0 -0700
+++ linux-2.6/fs/proc/kcore.c   2007-09-17 21:46:06.0 -0700
@@ -325,7 +325,7 @@ read_kcore(struct file *file, char __use
if (m == NULL) {
if (clear_user(buffer, tsz))
return -EFAULT;
-   } else if ((start >= VMALLOC_START) && (start < VMALLOC_END)) {
+   } else if (is_vmalloc_addr((void *)start)) {
char * elf_buf;
struct vm_struct *m;
unsigned long curstart = start;
Index: linux-2.6/fs/xfs/linux-2.6/kmem.c
===
--- linux-2.6.orig/fs/xfs/linux-2.6/kmem.c	2007-09-17 21:45:24.0 -0700
+++ linux-2.6/fs/xfs/linux-2.6/kmem.c	2007-09-17 21:46:06.0 -0700
@@ -92,8 +92,7 @@ kmem_zalloc_greedy(size_t *size, size_t 
 void
 kmem_free(void *ptr, size_t size)
 {
-   if (((unsigned long)ptr < VMALLOC_START) ||
-   ((unsigned long)ptr >= VMALLOC_END)) {
+   if (!is_vmalloc_addr(ptr)) {
kfree(ptr);
} else {
vfree(ptr);
Index: linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
===
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_buf.c	2007-09-17 21:45:24.0 -0700
+++ linux-2.6/fs/xfs/linux-2.6/xfs_buf.c	2007-09-17 21:46:06.0 -0700
@@ -696,8 +696,7 @@ static inline struct page *
 mem_to_page(
void*addr)
 {
-   if 

[08/17] Pass vmalloc address in page-private

2007-09-18 Thread Christoph Lameter
Avoid expensive lookups of virtual addresses from page structs by
storing the vmalloc address in page->private. We can then avoid
the vmalloc_address() call in __get_free_pages() and get_zeroed_page()
and simply return page->private.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/page_alloc.c |   15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c  2007-09-18 18:35:55.0 -0700
+++ linux-2.6/mm/page_alloc.c   2007-09-18 18:36:01.0 -0700
@@ -1276,6 +1276,11 @@ struct page *vcompound_alloc(gfp_t gfp_m
if (!addr)
goto abort;
 
+   /*
+    * Give the caller a chance to avoid an expensive vmalloc_addr()
+    * call.
+    */
+   pages[0]->private = (unsigned long)addr;
return pages[0];
 
 abort:
@@ -1534,6 +1539,8 @@ fastcall unsigned long __get_free_pages(
page = alloc_pages(gfp_mask, order);
if (!page)
return 0;
+   if (unlikely(PageVcompound(page)))
+   return page->private;
return (unsigned long) page_address(page);
 }
 
@@ -1550,9 +1557,11 @@ fastcall unsigned long get_zeroed_page(g
	VM_BUG_ON((gfp_mask & __GFP_HIGHMEM) != 0);
 
page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
-   if (page)
-   return (unsigned long) page_address(page);
-   return 0;
+   if (!page)
+   return 0;
+   if (unlikely(PageVcompound(page)))
+   return page->private;
+   return (unsigned long) page_address(page);
 }
 
 EXPORT_SYMBOL(get_zeroed_page);

-- 


[11/17] GFP_VFALLBACK for zone wait table.

2007-09-18 Thread Christoph Lameter
Currently we have to use vmalloc for the zone wait table, possibly generating
the need for lots of TLB entries to access the tables. We can now use
GFP_VFALLBACK to attempt the use of a physically contiguous page that can
then be covered by the large kernel TLB entries.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/page_alloc.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c  2007-09-18 14:29:05.0 -0700
+++ linux-2.6/mm/page_alloc.c   2007-09-18 14:29:10.0 -0700
@@ -2572,7 +2572,9 @@ int zone_wait_table_init(struct zone *zo
 * To use this new node's memory, further consideration will be
 * necessary.
 */
-   zone->wait_table = (wait_queue_head_t *)vmalloc(alloc_size);
+   zone->wait_table = (wait_queue_head_t *)
+   __get_free_pages(GFP_VFALLBACK,
+   get_order(alloc_size));
}
if (!zone->wait_table)
return -ENOMEM;



[12/17] Virtual Compound page allocation from interrupt context.

2007-09-18 Thread Christoph Lameter
In an interrupt context we cannot wait for the vmlist_lock in
__get_vm_area_node(). So use a trylock instead. If the trylock fails
then the atomic allocation will fail and subsequently be retried.

This only works because the flush_cache_vunmap used on the
allocation path never performs any IPIs, in contrast to the
flush_tlb_... functions used on the free path.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/vmalloc.c |   10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/vmalloc.c
===
--- linux-2.6.orig/mm/vmalloc.c 2007-09-18 10:52:11.0 -0700
+++ linux-2.6/mm/vmalloc.c  2007-09-18 10:54:21.0 -0700
@@ -289,7 +289,6 @@ static struct vm_struct *__get_vm_area_n
unsigned long align = 1;
unsigned long addr;
 
-   BUG_ON(in_interrupt());
if (flags & VM_IOREMAP) {
int bit = fls(size);
 
@@ -314,7 +313,14 @@ static struct vm_struct *__get_vm_area_n
 */
size += PAGE_SIZE;
 
-   write_lock(&vmlist_lock);
+   if (gfp_mask & __GFP_WAIT)
+   write_lock(&vmlist_lock);
+   else {
+   if (!write_trylock(&vmlist_lock)) {
+   kfree(area);
+   return NULL;
+   }
+   }
for (p = &vmlist; (tmp = *p) != NULL ;p = &tmp->next) {
if ((unsigned long)tmp->addr < addr) {
if((unsigned long)tmp->addr + tmp->size >= addr)



[13/17] Virtual compound page freeing in interrupt context

2007-09-18 Thread Christoph Lameter
If we are in an interrupt context then simply defer the free via a workqueue.

In an interrupt context it is not possible to use vmalloc_address() to determine
the vmalloc address. So add a variant that does that too.

Removing a virtual mapping *must* be done with interrupts enabled
since tlb_xx functions are called that rely on interrupts for
processor to processor communication.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/page_alloc.c |   23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c  2007-09-18 20:10:55.0 -0700
+++ linux-2.6/mm/page_alloc.c   2007-09-18 20:11:40.0 -0700
@@ -1297,7 +1297,12 @@ abort:
return NULL;
 }
 
-static void vcompound_free(void *addr)
+/*
+ * Virtual Compound freeing functions. This is complicated by the vmalloc
+ * layer not being able to free virtual allocations when interrupts are
+ * disabled. So we defer the frees via a workqueue if necessary.
+ */
+static void __vcompound_free(void *addr)
 {
struct page **pages = vunmap(addr);
int i;
@@ -1320,6 +1325,22 @@ static void vcompound_free(void *addr)
kfree(pages);
 }
 
+static void vcompound_free_work(struct work_struct *w)
+{
+   __vcompound_free((void *)w);
+}
+
+static void vcompound_free(void *addr)
+{
+   if (in_interrupt()) {
+   struct work_struct *w = addr;
+
+   INIT_WORK(w, vcompound_free_work);
+   schedule_work(w);
+   } else
+   __vcompound_free(addr);
+}
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */



[14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area

2007-09-18 Thread Christoph Lameter
If bit waitqueue is passed a virtual address then it must use
vmalloc_to_page instead of virt_to_page to get to the page struct.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 kernel/wait.c |   10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/wait.c
===
--- linux-2.6.orig/kernel/wait.c2007-09-18 19:19:27.0 -0700
+++ linux-2.6/kernel/wait.c 2007-09-18 20:10:39.0 -0700
@@ -9,6 +9,7 @@
 #include <linux/mm.h>
 #include <linux/wait.h>
 #include <linux/hash.h>
+#include <linux/vmalloc.h>
 
 void init_waitqueue_head(wait_queue_head_t *q)
 {
@@ -245,9 +246,16 @@ EXPORT_SYMBOL(wake_up_bit);
 fastcall wait_queue_head_t *bit_waitqueue(void *word, int bit)
 {
const int shift = BITS_PER_LONG == 32 ? 5 : 6;
-   const struct zone *zone = page_zone(virt_to_page(word));
unsigned long val = (unsigned long)word << shift | bit;
+   struct page *page;
+   struct zone *zone;
 
+   if (is_vmalloc_addr(word))
+   page = vmalloc_to_page(word)
+   else
+   page = virt_to_page(word);
+
+   zone = page_zone(page);
return zone->wait_table[hash_long(val, zone->wait_table_bits)];
 }
 EXPORT_SYMBOL(bit_waitqueue);



[17/17] Allow virtual fallback for dentries

2007-09-18 Thread Christoph Lameter
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 fs/dcache.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/dcache.c
===
--- linux-2.6.orig/fs/dcache.c  2007-09-18 18:42:19.0 -0700
+++ linux-2.6/fs/dcache.c   2007-09-18 18:42:55.0 -0700
@@ -2118,7 +2118,8 @@ static void __init dcache_init(unsigned 
 * of the dcache. 
 */
dentry_cache = KMEM_CACHE(dentry,
-   SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
+   SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|
+   SLAB_VFALLBACK);

register_shrinker(dcache_shrinker);
 



[16/17] Allow virtual fallback for buffer_heads

2007-09-18 Thread Christoph Lameter
This is particularly useful for large I/Os because it will allow > 100
allocs from the SLUB fast path without having to go to the page allocator.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 fs/buffer.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c  2007-09-18 15:44:37.0 -0700
+++ linux-2.6/fs/buffer.c   2007-09-18 15:44:51.0 -0700
@@ -3008,7 +3008,8 @@ void __init buffer_init(void)
int nrpages;
 
bh_cachep = KMEM_CACHE(buffer_head,
-   SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
+   SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|
+   SLAB_VFALLBACK);
 
/*
 * Limit the bh occupancy to 10% of ZONE_NORMAL



[15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-18 Thread Christoph Lameter
SLAB_VFALLBACK can be specified for selected slab caches. If fallback is
available then the conservative settings for higher order allocations are
overridden. We then request an order that can accommodate at minimum
100 objects. The size of an individual slab allocation is allowed to reach
up to 256k (order 6 on i386, order 4 on IA64).
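
To make the order selection concrete (numbers are illustrative, not
from the patch): for a cache of 256-byte objects, 100 objects need
25600 bytes, so on a 4KiB-page machine the smallest qualifying slab is
order 3:

	order = get_order(100 * 256);	/* 25600 bytes -> order 3, a 32KiB slab */

which stays well under the 256k cap mentioned above.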

Implementing fallback requires special handling of virtual mappings in
the free path. However, the impact is minimal since we already check
whether the address is NULL or ZERO_SIZE_PTR. No additional cachelines are
touched if we do not fall back. However, if we need to handle a virtual
compound page then we walk the kernel page table in the free paths to
determine the page struct.

We also need special handling in the allocation paths since the virtual
addresses cannot be obtained via page_address(). SLUB exploits that
page->private is set to the vmalloc address to avoid a costly
vmalloc_address().

However, for diagnostics there is still the need to determine the
vmalloc address from the page struct. There we must use the costly
vmalloc_address().

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/slab.h |1 
 include/linux/slub_def.h |1 
 mm/slub.c|   83 ---
 3 files changed, 60 insertions(+), 25 deletions(-)

Index: linux-2.6/include/linux/slab.h
===
--- linux-2.6.orig/include/linux/slab.h 2007-09-18 17:03:30.0 -0700
+++ linux-2.6/include/linux/slab.h  2007-09-18 17:07:39.0 -0700
@@ -19,6 +19,7 @@
  * The ones marked DEBUG are only valid if CONFIG_SLAB_DEBUG is set.
  */
 #define SLAB_DEBUG_FREE0x0100UL/* DEBUG: Perform (expensive) checks on free */
+#define SLAB_VFALLBACK 0x0200UL/* May fall back to vmalloc */
 #define SLAB_RED_ZONE  0x0400UL/* DEBUG: Red zone objs in a cache */
 #define SLAB_POISON0x0800UL/* DEBUG: Poison objects */
 #define SLAB_HWCACHE_ALIGN 0x2000UL/* Align objs on cache lines */
Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-09-18 17:03:30.0 -0700
+++ linux-2.6/mm/slub.c 2007-09-18 18:13:38.0 -0700
@@ -20,6 +20,7 @@
 #include linux/mempolicy.h
 #include linux/ctype.h
 #include linux/kallsyms.h
+#include <linux/vmalloc.h>
 
 /*
  * Lock order:
@@ -277,6 +278,26 @@ static inline struct kmem_cache_node *ge
 #endif
 }
 
+static inline void *slab_address(struct page *page)
+{
+   if (unlikely(PageVcompound(page)))
+   return vmalloc_address(page);
+   else
+   return page_address(page);
+}
+
+static inline struct page *virt_to_slab(const void *addr)
+{
+   struct page *page;
+
+   if (unlikely(is_vmalloc_addr(addr)))
+   page = vmalloc_to_page(addr);
+   else
+   page = virt_to_page(addr);
+
+   return compound_head(page);
+}
+
 static inline int check_valid_pointer(struct kmem_cache *s,
struct page *page, const void *object)
 {
@@ -285,7 +306,7 @@ static inline int check_valid_pointer(st
if (!object)
return 1;
 
-   base = page_address(page);
+   base = slab_address(page);
if (object < base || object >= base + s->objects * s->size ||
(object - base) % s->size) {
return 0;
@@ -470,7 +491,7 @@ static void slab_fix(struct kmem_cache *
 static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
 {
unsigned int off;   /* Offset of last byte */
-   u8 *addr = page_address(page);
+   u8 *addr = slab_address(page);
 
print_tracking(s, p);
 
@@ -648,7 +669,7 @@ static int slab_pad_check(struct kmem_ca
if (!(s->flags & SLAB_POISON))
return 1;
 
-   start = page_address(page);
+   start = slab_address(page);
end = start + (PAGE_SIZE << s->order);
length = s->objects * s->size;
remainder = end - (start + length);
@@ -1040,11 +1061,7 @@ static struct page *allocate_slab(struct
struct page * page;
int pages = 1 << s->order;
 
-   if (s->order)
-   flags |= __GFP_COMP;
-
-   if (s->flags & SLAB_CACHE_DMA)
-   flags |= SLUB_DMA;
+   flags |= s->gfpflags;
 
if (node == -1)
page = alloc_pages(flags, s->order);
@@ -1098,7 +1115,11 @@ static struct page *new_slab(struct kmem
SLAB_STORE_USER | SLAB_TRACE))
SetSlabDebug(page);
 
-   start = page_address(page);
+   if (!PageVcompound(page))
+   start = slab_address(page);
+   else
+   start = (void *)page->private;
+
end = start + s->objects * s->size;
 
if (unlikely(s->flags & SLAB_POISON))
@@ -1130,7 +1151,7 @@ static void __free_slab(struct kmem_cach

[09/17] VFALLBACK: Debugging aid

2007-09-18 Thread Christoph Lameter
Virtual fallbacks are rare and thus subtle bugs may creep in if we do not
test the fallbacks. CONFIG_VFALLBACK_ALWAYS makes all GFP_VFALLBACK
allocations fall back to virtual mapping.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 lib/Kconfig.debug |   11 +++
 mm/page_alloc.c   |9 +
 2 files changed, 20 insertions(+)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c  2007-09-18 19:19:34.0 -0700
+++ linux-2.6/mm/page_alloc.c   2007-09-18 20:16:26.0 -0700
@@ -1205,7 +1205,16 @@ zonelist_scan:
goto this_zone_full;
}
}
+#ifdef CONFIG_VFALLBACK_ALWAYS
+   if ((gfp_mask & __GFP_VFALLBACK) &&
+   system_state == SYSTEM_RUNNING) {
+   struct page *vcompound_alloc(gfp_t, int,
+   struct zonelist *, unsigned long);
 
+   page = vcompound_alloc(gfp_mask, order,
+   zonelist, alloc_flags);
+   } else
+#endif
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
if (page)
break;
Index: linux-2.6/lib/Kconfig.debug
===
--- linux-2.6.orig/lib/Kconfig.debug2007-09-18 19:19:28.0 -0700
+++ linux-2.6/lib/Kconfig.debug 2007-09-18 19:19:34.0 -0700
@@ -105,6 +105,17 @@ config DETECT_SOFTLOCKUP
   can be detected via the NMI-watchdog, on platforms that
   support it.)
 
+config VFALLBACK_ALWAYS
+   bool "Always fall back to Virtual Compound pages"
+   default y
+   help
+ Virtual compound pages are only allocated if there is no linear
+ memory available. They are a fallback and errors created by the
+ use of virtual mappings instead of linear ones may not surface
+ because of their infrequent use. This option makes every
+ allocation that allows a fallback to a virtual mapping use
+ the virtual mapping. May have a significant performance impact.
+
 config SCHED_DEBUG
bool "Collect scheduler debugging info"
depends on DEBUG_KERNEL && PROC_FS



[10/17] Use GFP_VFALLBACK for sparsemem.

2007-09-18 Thread Christoph Lameter
Sparsemem currently attempts first to do a physically contiguous mapping
and then falls back to vmalloc. The same thing can now be accomplished
using GFP_VFALLBACK.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/sparse.c |   23 +++
 1 file changed, 3 insertions(+), 20 deletions(-)

Index: linux-2.6/mm/sparse.c
===
--- linux-2.6.orig/mm/sparse.c  2007-09-18 13:21:44.0 -0700
+++ linux-2.6/mm/sparse.c   2007-09-18 13:28:43.0 -0700
@@ -269,32 +269,15 @@ void __init sparse_init(void)
 #ifdef CONFIG_MEMORY_HOTPLUG
 static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
 {
-   struct page *page, *ret;
unsigned long memmap_size = sizeof(struct page) * nr_pages;
 
-   page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
-   if (page)
-   goto got_map_page;
-
-   ret = vmalloc(memmap_size);
-   if (ret)
-   goto got_map_ptr;
-
-   return NULL;
-got_map_page:
-   ret = (struct page *)pfn_to_kaddr(page_to_pfn(page));
-got_map_ptr:
-   memset(ret, 0, memmap_size);
-
-   return ret;
+   return (struct page *)__get_free_pages(GFP_VFALLBACK|__GFP_ZERO,
+   get_order(memmap_size));
 }
 
 static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
 {
-   if (is_vmalloc_addr(memmap))
-   vfree(memmap);
-   else
-   free_pages((unsigned long)memmap,
+   free_pages((unsigned long)memmap,
   get_order(sizeof(struct page) * nr_pages));
 }
 



[07/17] GFP_VFALLBACK: Allow fallback of compound pages to virtual mappings

2007-09-18 Thread Christoph Lameter
This adds a new gfp flag

__GFP_VFALLBACK

If specified during a higher order allocation then the system will fall
back to vmap and attempt to create a virtually contiguous area instead of
a physically contiguous area. In many cases the virtually contiguous area
can stand in for the physically contiguous area (with some loss of
performance).

The pages used for VFALLBACK are marked with a new flag
PageVcompound(page). The mark is necessary since we have to know upon
free if we have to destroy a virtual mapping. No additional flag is
consumed through the use of PG_swapcache together with PG_compound
(similar to PageHead() and PageTail()).
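
A hedged usage sketch (caller and buffer size are hypothetical;
GFP_VFALLBACK is assumed to combine GFP_KERNEL with __GFP_VFALLBACK, as
its use for the zone wait table elsewhere in the series suggests):

	/* 64KiB buffer that may come back physically or virtually contiguous */
	unsigned long addr = __get_free_pages(GFP_VFALLBACK, 4);

	if (addr) {
		/* ... use it like any other linear allocation ... */
		free_pages(addr, 4);	/* free path detects PageVcompound() */
	}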

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/gfp.h|5 +
 include/linux/page-flags.h |   18 +++
 mm/page_alloc.c|  113 ++---
 3 files changed, 130 insertions(+), 6 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c  2007-09-18 17:03:54.0 -0700
+++ linux-2.6/mm/page_alloc.c   2007-09-18 18:25:46.0 -0700
@@ -1230,6 +1230,86 @@ try_next_zone:
 }
 
 /*
+ * Virtual Compound Page support.
+ *
+ * Virtual Compound Pages are used to fall back to order 0 allocations if large
+ * linear mappings are not available and __GFP_VFALLBACK is set. They are
+ * formatted according to compound page conventions. I.e. following
+ * page->first_page if PageTail(page) is set can be used to determine the
+ * head page.
+ */
+struct page *vcompound_alloc(gfp_t gfp_mask, int order,
+   struct zonelist *zonelist, unsigned long alloc_flags)
+{
+   void *addr;
+   struct page *page;
+   int i;
+   int nr_pages = 1 << order;
+   struct page **pages = kzalloc((nr_pages + 1) * sizeof(struct page *),
+   gfp_mask & GFP_LEVEL_MASK);
+
+   if (!pages)
+   return NULL;
+
+   for (i = 0; i < nr_pages; i++) {
+   page = get_page_from_freelist(gfp_mask & ~__GFP_VFALLBACK,
+   0, zonelist, alloc_flags);
+   if (!page)
+   goto abort;
+
+   /* Sets PageCompound which makes PageHead(page) true */
+   __SetPageVcompound(page);
+   if (i) {
+   page->first_page = pages[0];
+   __SetPageTail(page);
+   }
+   pages[i] = page;
+   }
+
+   addr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
+   if (!addr)
+   goto abort;
+
+   return pages[0];
+
+abort:
+   for (i = 0; i < nr_pages; i++) {
+   page = pages[i];
+   if (!page)
+   continue;
+   __ClearPageTail(page);
+   __ClearPageHead(page);
+   __ClearPageVcompound(page);
+   __free_page(page);
+   }
+   kfree(pages);
+   return NULL;
+}
+
+static void vcompound_free(void *addr)
+{
+   struct page **pages = vunmap(addr);
+   int i;
+
+   /*
+* First page will have zero refcount since it maintains state
+* for the compound and was decremented before we got here.
+*/
+   __ClearPageHead(pages[0]);
+   __ClearPageVcompound(pages[0]);
+   free_hot_page(pages[0]);
+
+   for (i = 1; pages[i]; i++) {
+   struct page *page = pages[i];
+
+   __ClearPageTail(page);
+   __ClearPageVcompound(page);
+   __free_page(page);
+   }
+   kfree(pages);
+}
+
+/*
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page * fastcall
@@ -1324,12 +1404,12 @@ nofail_alloc:
goto nofail_alloc;
}
}
-   goto nopage;
+   goto try_vcompound;
}
 
/* Atomic allocations - we can't balance anything */
if (!wait)
-   goto nopage;
+   goto try_vcompound;
 
cond_resched();
 
@@ -1360,6 +1440,11 @@ nofail_alloc:
 */
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+
+   if (!page && order && (gfp_mask & __GFP_VFALLBACK))
+   page = vcompound_alloc(gfp_mask, order,
+   zonelist, alloc_flags);
+
if (page)
goto got_pg;
 
@@ -1391,6 +1476,14 @@ nofail_alloc:
goto rebalance;
}
 
+try_vcompound:
+   /* Last chance before failing the allocation */
+   if (order && (gfp_mask & __GFP_VFALLBACK)) {
+   page = vcompound_alloc(gfp_mask, order,
+   zonelist, alloc_flags);
+   if (page)
+   goto got_pg;
+   }
 nopage:

[06/17] vmalloc_address(): Determine vmalloc address from page struct

2007-09-18 Thread Christoph Lameter
Sometimes we need to figure out which vmalloc address is in use
for a certain page struct. There is no easy way to figure out
the vmalloc address from the page struct. So simply search through
the kernel page table to find the address. This is a fairly expensive
process. Use sparingly (or provide a better implementation).
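
Given that cost, a hypothetical caller would cache the result rather
than calling it repeatedly (sketch, not from the patch):

	void *addr = vmalloc_address(page);	/* linear walk of vmalloc ptes */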

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/vmalloc.h |3 +
 mm/vmalloc.c|   77 
 2 files changed, 80 insertions(+)

Index: linux-2.6/mm/vmalloc.c
===
--- linux-2.6.orig/mm/vmalloc.c 2007-09-18 18:35:13.0 -0700
+++ linux-2.6/mm/vmalloc.c  2007-09-18 18:35:18.0 -0700
@@ -196,6 +196,83 @@ struct page *vmalloc_to_page(const void 
 EXPORT_SYMBOL(vmalloc_to_page);
 
 /*
+ * Determine vmalloc address from a page struct.
+ *
+ * Linear search through all ptes of the vmalloc area.
+ */
+static unsigned long vaddr_pte_range(pmd_t *pmd, unsigned long addr,
+   unsigned long end, unsigned long pfn)
+{
+   pte_t *pte;
+
+   pte = pte_offset_kernel(pmd, addr);
+   do {
+   pte_t ptent = *pte;
+   if (pte_present(ptent) && pte_pfn(ptent) == pfn)
+   return addr;
+   } while (pte++, addr += PAGE_SIZE, addr != end);
+   return 0;
+}
+
+static inline unsigned long vaddr_pmd_range(pud_t *pud, unsigned long addr,
+   unsigned long end, unsigned long pfn)
+{
+   pmd_t *pmd;
+   unsigned long next;
+   unsigned long n;
+
+   pmd = pmd_offset(pud, addr);
+   do {
+   next = pmd_addr_end(addr, end);
+   if (pmd_none_or_clear_bad(pmd))
+   continue;
+   n = vaddr_pte_range(pmd, addr, next, pfn);
+   if (n)
+   return n;
+   } while (pmd++, addr = next, addr != end);
+   return 0;
+}
+
+static inline unsigned long vaddr_pud_range(pgd_t *pgd, unsigned long addr,
+   unsigned long end, unsigned long pfn)
+{
+   pud_t *pud;
+   unsigned long next;
+   unsigned long n;
+
+   pud = pud_offset(pgd, addr);
+   do {
+   next = pud_addr_end(addr, end);
+   if (pud_none_or_clear_bad(pud))
+   continue;
+   n = vaddr_pmd_range(pud, addr, next, pfn);
+   if (n)
+   return n;
+   } while (pud++, addr = next, addr != end);
+   return 0;
+}
+
+void *vmalloc_address(struct page *page)
+{
+   pgd_t *pgd;
+   unsigned long next, n;
+   unsigned long addr = VMALLOC_START;
+   unsigned long pfn = page_to_pfn(page);
+
+   pgd = pgd_offset_k(VMALLOC_START);
+   do {
+   next = pgd_addr_end(addr, VMALLOC_END);
+   if (pgd_none_or_clear_bad(pgd))
+   continue;
+   n = vaddr_pud_range(pgd, addr, next, pfn);
+   if (n)
+   return (void *)n;
+   } while (pgd++, addr = next, addr < VMALLOC_END);
+   return NULL;
+}
+EXPORT_SYMBOL(vmalloc_address);
+
+/*
  * Map a vmalloc()-space virtual address to the physical page frame number.
  */
 unsigned long vmalloc_to_pfn(const void *vmalloc_addr)
Index: linux-2.6/include/linux/vmalloc.h
===
--- linux-2.6.orig/include/linux/vmalloc.h  2007-09-18 18:35:13.0 -0700
+++ linux-2.6/include/linux/vmalloc.h   2007-09-18 18:35:48.0 -0700
@@ -85,6 +85,9 @@ extern void free_vm_area(struct vm_struc
 struct page *vmalloc_to_page(const void *addr);
 unsigned long vmalloc_to_pfn(const void *addr);
 
+/* Determine address from page struct pointer */
+void *vmalloc_address(struct page *);
+
 /*
  * Internals.  Dont't use..
  */



[05/17] vunmap: return page array

2007-09-18 Thread Christoph Lameter
Make vunmap return the page array that was used at vmap. This is useful
if one has no structures to track the page array but simply stores the
virtual address somewhere. The disposition of the page array can be
decided upon after vunmap. vfree() may now also be used instead of
vunmap which will release the page array after vunmap'ping it.
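
A sketch of the resulting calling convention (hypothetical caller,
allocation and error handling elided):

	struct page **pages = kmalloc(count * sizeof(*pages), GFP_KERNEL);
	/* ... allocate and fill pages[0..count-1] ... */
	void *addr = vmap(pages, count, VM_MAP, PAGE_KERNEL);

	/* ... use the mapping ... */

	pages = vunmap(addr);		/* the same array passed to vmap() */
	for (i = 0; i < count; i++)
		__free_page(pages[i]);
	kfree(pages);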

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/vmalloc.h |2 +-
 mm/vmalloc.c|   26 --
 2 files changed, 17 insertions(+), 11 deletions(-)

Index: linux-2.6/include/linux/vmalloc.h
===
--- linux-2.6.orig/include/linux/vmalloc.h  2007-09-18 13:22:56.0 -0700
+++ linux-2.6/include/linux/vmalloc.h   2007-09-18 13:22:57.0 -0700
@@ -49,7 +49,7 @@ extern void vfree(const void *addr);
 
 extern void *vmap(struct page **pages, unsigned int count,
unsigned long flags, pgprot_t prot);
-extern void vunmap(const void *addr);
+extern struct page **vunmap(const void *addr);
 
 extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
Index: linux-2.6/mm/vmalloc.c
===
--- linux-2.6.orig/mm/vmalloc.c 2007-09-18 13:22:56.0 -0700
+++ linux-2.6/mm/vmalloc.c  2007-09-18 13:22:57.0 -0700
@@ -356,17 +356,18 @@ struct vm_struct *remove_vm_area(const v
return v;
 }
 
-static void __vunmap(const void *addr, int deallocate_pages)
+static struct page **__vunmap(const void *addr, int deallocate_pages)
 {
struct vm_struct *area;
+   struct page **pages;
 
if (!addr)
-   return;
+   return NULL;
 
if ((PAGE_SIZE-1) & (unsigned long)addr) {
printk(KERN_ERR "Trying to vfree() bad address (%p)\n", addr);
WARN_ON(1);
-   return;
+   return NULL;
}
 
area = remove_vm_area(addr);
@@ -374,29 +375,30 @@ static void __vunmap(const void *addr, i
printk(KERN_ERR "Trying to vfree() nonexistent vm area (%p)\n",
addr);
WARN_ON(1);
-   return;
+   return NULL;
}
 
+   pages = area->pages;
debug_check_no_locks_freed(addr, area->size);
 
if (deallocate_pages) {
int i;
 
for (i = 0; i < area->nr_pages; i++) {
-   struct page *page = area->pages[i];
+   struct page *page = pages[i];
 
BUG_ON(!page);
__free_page(page);
}
 
if (area->flags & VM_VPAGES)
-   vfree(area->pages);
+   vfree(pages);
else
-   kfree(area->pages);
+   kfree(pages);
}
 
kfree(area);
-   return;
+   return pages;
 }
 
 /**
@@ -424,11 +426,13 @@ EXPORT_SYMBOL(vfree);
  * which was created from the page array passed to vmap().
  *
  * Must not be called in interrupt context.
+ *
+ * Returns a pointer to the array of pointers to page structs
  */
-void vunmap(const void *addr)
+struct page **vunmap(const void *addr)
 {
BUG_ON(in_interrupt());
-   __vunmap(addr, 0);
+   return __vunmap(addr, 0);
 }
 EXPORT_SYMBOL(vunmap);
 
@@ -453,6 +457,8 @@ void *vmap(struct page **pages, unsigned
area = get_vm_area((count << PAGE_SHIFT), flags);
if (!area)
return NULL;
+   area->pages = pages;
+   area->nr_pages = count;
if (map_vm_area(area, prot, pages)) {
vunmap(area-addr);
return NULL;



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Wed, 19 Sep 2007, Rene Herman wrote:
 
 Well, not so sure about that. What if one of your expected uses for example is
 video data storage -- lots of data, especially for multiple streams, and needs
 still relatively fast machinery. Why would you care for the overhead of
 _small_ blocks?

.. so work with an extent-based filesystem instead.

16k blocks are total idiocy. If this wasn't about a "support legacy
customers", I think the whole patch-series has been a total waste of time.

Linus


Re: [14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area

2007-09-18 Thread Gabriel C
Christoph Lameter wrote:

  
 + if (is_vmalloc_addr(word))
 + page = vmalloc_to_page(word)
^^
Missing ' ; '

 + else
 + page = virt_to_page(word);
 +
 + zone = page_zone(page);
   return zone->wait_table[hash_long(val, zone->wait_table_bits)];
  }
  EXPORT_SYMBOL(bit_waitqueue);
 

Regards,

Gabriel


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Rene Herman

On 09/18/2007 09:44 PM, Linus Torvalds wrote:


Nobody sane would *ever* argue for 16kB+ blocksizes in general.


Well, not so sure about that. What if one of your expected uses for example 
is video data storage -- lots of data, especially for multiple streams, and 
needs still relatively fast machinery. Why would you care for the overhead 
of _small_ blocks?


Okay, maybe that's covered in the "in general" but it's not extremely oddball
either...


Rene.



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Rene Herman

On 09/19/2007 05:50 AM, Linus Torvalds wrote:


On Wed, 19 Sep 2007, Rene Herman wrote:



Well, not so sure about that. What if one of your expected uses for example is
video data storage -- lots of data, especially for multiple streams, and needs
still relatively fast machinery. Why would you care for the overhead of
_small_ blocks?


.. so work with an extent-based filesystem instead.

16k blocks are total idiocy. If this wasn't about a "support legacy
customers", I think the whole patch-series has been a total waste of time.


Admittedly, extent-based might not be a particularly bad answer at least to 
the I/O side of the equation...


I do feel larger blocksizes continue to make sense in general though. Packet 
writing on CD/DVD is a problem already today since the hardware needs 32K or 
64K blocks and I'd expect to see more of these and similiar situations when 
flash gets (even) more popular which it sort of inevitably is going to be.


Rene.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Wed, 19 Sep 2007, Rene Herman wrote:
 
 I do feel larger blocksizes continue to make sense in general though. Packet
 writing on CD/DVD is a problem already today since the hardware needs 32K or
 64K blocks and I'd expect to see more of these and similar situations when
 flash gets (even) more popular which it sort of inevitably is going to be.

.. that's what scatter-gather exists for.

What's so hard with just realizing that physical memory isn't contiguous?

It's why we have MMU's. It's why we have scatter-gather. 

Linus


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Rene Herman

On 09/19/2007 06:33 AM, Linus Torvalds wrote:


On Wed, 19 Sep 2007, Rene Herman wrote:



I do feel larger blocksizes continue to make sense in general though. Packet
writing on CD/DVD is a problem already today since the hardware needs 32K or
64K blocks and I'd expect to see more of these and similar situations when
flash gets (even) more popular which it sort of inevitably is going to be.


.. that's what scatter-gather exists for.

What's so hard with just realizing that physical memory isn't contiguous?

It's why we have MMU's. It's why we have scatter-gather. 


So if I understood that right, you'd suggest to deal with devices with 
larger physical blocksizes at some level above the current blocklayer.


Not familiar enough with either block or fs to be able to argue that 
effectively...


Rene.



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread David Chinner
On Tue, Sep 18, 2007 at 06:06:52PM -0700, Linus Torvalds wrote:
   especially as the Linux
  kernel limitations in this area are well known.  There's no 16K mess
  that SGI is trying to clean up here (and SGI have offered both IA64 and
  x86_64 systems for some time now, so not sure how you came up with that
  whacko theory).
 
 Well, if that is the case, then I vote that we drop the whole patch-series 
 entirely. It clearly has no reason for existing at all.
 
 There is *no* valid reason for 16kB blocksizes unless you have legacy 
 issues.

Ok, let's step back for a moment and look at a basic, fundamental
constraint of disks - seek capacity. A decade ago, a terabyte of
filesystem had 30 disks behind it - a seek capacity of about
6000 seeks/s. Nowadays, that's a single disk with a seek
capacity of about 200/s. We're going *rapidly* backwards in
terms of seek capacity per terabyte of storage.

Now fill that terabyte of storage and index it in the most efficient
way - let's say btrees are used because lots of filesystems use
them. Hence the depth of the tree is roughly O(log n / log m) where m is
the fanout, which scales with the btree block size.  Effectively, btree
depth = seek count on lookup of any object.

When the filesystem had a capacity of 6,000 seeks/s, we didn't
really care if the indexes used 4k blocks or not - the storage
subsystem had an excess of seek capacity to deal with
less-than-optimal indexing. Now we have over an order of magnitude
less seeks to expend in index operations *for the same amount of
data* so we are really starting to care about minimising the
number of seeks in our indexing mechanisms and allocations.

We can play tricks in index compaction to reduce the number of
interior nodes of the tree (like hashed indexing in the XFS & ext3
htree directories) but that still only gets us so far in reducing
seeks and doesn't help at all for tree traversals. That leaves us
with the btree block size as the only factor we can further vary to
reduce the depth of the tree. i.e. m.

So we want to increase the filesystem block size to improve the
efficiency of our indexing. That improvement in efficiency
translates directly into better performance on seek constrained
storage subsystems.
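
To put rough, illustrative numbers on that (assuming ~16 byte btree
records and 10^9 objects to index):

	4KiB blocks:  fanout ~256  => depth ~ log(10^9)/log(256)  ~ 3.7, i.e. 4 seeks
	64KiB blocks: fanout ~4096 => depth ~ log(10^9)/log(4096) ~ 2.5, i.e. 3 seeks

Saving a seek per lookup matters a lot when the whole terabyte only
has ~200 seeks/s to spend.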

The problem is this: to alter the fundamental block size of the
filesystem we also need to alter the data block size and that is
exactly the piece that linux does not support right now.  So while
we have the capability to use large block sizes in certain
filesystems, we can't use that capability until the data path
supports it.

To summarise, large block size support in the filesystem is not
about legacy issues. It's about trying to cope with the rapid
expansion of storage capabilities of modern hardware where we have
to index much, much more data with a corresponding decrease in
the seek capability of the hardware.

 So get your stories straight, people.

Ok, so let's set the record straight. There were 3 justifications
for using *large pages* to *support* large filesystem block sizes.
The justifications for the variable order page cache with large
pages were:

1. little code change needed in the filesystems
- still true

2. Increased I/O sizes on 4k page machines (the SCSI
   controller problem)
- redundant thanks to Jens Axboe's quick work

3. avoiding the need for vmap() as it has great
   overhead and does not scale
- Nick is starting to work on that and has
   already had good results.

Everyone seems to be focussing on #2 as the entire justification for
large block sizes in filesystems and that this is an SGI problem.
Nothing could be further from the truth - the truth is that large
pages solved multiple problems in one go. We now have a different,
better solution #2, so please, please stop using that as some
justification for claiming filesystems don't need large block sizes.

However, all this doesn't change the fact that we have a major storage
scalability crunch coming in the next few years. Disk capacity is
likely to continue to double every 12 months for the next 3 or 4
years. Large block size support is only one mechanism we need to
help cope with this trend.

The variable order page cache with large pages was a means to an end
- it's not the only solution to this problem and I'm extremely happy
to see that there is progress on multiple fronts.  That's the
strength of the Linux community showing through.  In the end, I
really don't care how we end up supporting large filesystem block
sizes in the page cache - all I care about is that we end up
supporting it as efficiently and generically as we possibly can.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group