Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-28 Thread Nick Piggin
On Thursday 20 September 2007 11:38, David Chinner wrote:
> On Wed, Sep 19, 2007 at 04:04:30PM +0200, Andrea Arcangeli wrote:

> > Plus of course you don't like fsblock because it requires work to
> > adapt a fs to it, I can't argue about that.
>
> No, I don't like fsblock because it is inherently a "structure
> per filesystem block" construct, just like buggerheads. You
> still need to allocate millions of them when you have millions
> dirty pages around. Rather than type it all out again, read
> the fsblocks thread from here:

I don't think there is anything inherently wrong with a structure
per filesystem block construct, in the places where you want to
have a handle on that information.

In the data path of a lot of filesystems, it's not really useful, no
question.

But the block / buffer head concept is there and is useful for many
things obviously. fsblock, I believe, improves on it; maybe even to
the point where there won't be too much reason for many
filesystems to convert their data paths to something different (eg.
nobh mode, or an extent block mapping).


> http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2
>
> FWIW, with Chris Mason's extent-based block mapping (which btrfs
> is using and Christoph Hellwig is porting XFS over to) we completely
> remove buggerheads from XFS and so fsblock would be a pretty
> major step backwards for us if Chris's work goes into mainline.

If you don't need to manipulate or manage the pagecache on a
per-block basis anyway, then you shouldn't need fsblock (or anything
else particularly special) to do higher order block sizes.

If you do sometimes need to, then fsblock *may* be a way you can
remove vmap code from your filesystem and share it with generic
code...


> But, I'm not going to argue endlessly for one solution or another;
> I'm happy to see different solutions being chased, so may the
> best VM win ;)

Agreed :)



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-24 Thread Christoph Lameter
On Fri, 21 Sep 2007, Hugh Dickins wrote:

> I've found some fixes needed on top of your Large Blocksize Support
> patches: I'll send those to you in a moment.  Looks like you didn't
> try much swapping!

yup. Thanks for looking at it.

> 
> I only managed to get ext2 working with larger blocksizes:
> reiserfs -b 8192 wouldn't mount ("reiserfs_fill_super: can not find
> reiserfs on /dev/sdb1"); ext3 gave me mysterious errors ("JBD: tar
> wants too many credits", even after adding JBD patches that you
> turned out to be depending on); and I didn't try ext4 or xfs
> (I'm guessing the latter has been quite well tested ;)

Yes, there were issues with the first releases of the JBD patches. The 
current crop in mm is fine but much of that may have bypassed this list.





Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-24 Thread Andrea Arcangeli
On Sun, Sep 23, 2007 at 08:56:39AM +0200, Goswin von Brederlow wrote:
> As a user I know it because I didn't put a kernel source into /tmp. A
> program can't reasonably know that.

Various apps require you (admin/user) to tune the size of their
caches. Seems like you never tried to set up a database, oh well.

> Xen has its own memory pool and can quite aggressively reclaim memory
> from dom0 when needed. I just meant to say that the number in

The whole point is what happens if there's not enough ram, of course...
this is why you should check.

> /proc/meminfo can change in a second so it is not much use knowing
> what it said last minute.

The numbers will change depending on what's running on your
system. It's up to you to know; plus I normally keep vmstat running
in the background to see how the cache/free levels change over
time. Those numbers are worthless if they could be fragmented...

> I would kill any program that does that to find out how much free ram
> the system has.

The admin should do that if he's unsure, not a program of course!


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-24 Thread Kyle Moffett

On Sep 23, 2007, at 02:22:12, Goswin von Brederlow wrote:

> [EMAIL PROTECTED] (Mel Gorman) writes:
>
>> On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
>>> But when you already have say 10% of the ram in mixed groups then
>>> it is a sign that external fragmentation is happening and some time
>>> should be spent on moving movable objects.
>>
>> I'll play around with it on the side and see what sort of results
>> I get.  I won't be pushing anything any time soon in relation to
>> this though.  For now, I don't intend to fiddle more with grouping
>> pages by mobility for something that may or may not be of benefit
>> to a feature that hasn't been widely tested with what exists today.
>
> I watched the videos you posted. A nice and quite clear improvement
> with and without your logic. Kudos.
>
> When you play around with it may I suggest a change to the display
> of the memory information. I think it would be valuable to use a
> Hilbert Curve to arrange the pages into pixels. Like this:
>
> # #  0  3
> # #
> ###  1  2
>
> ### ###  0 1 E F
>   # #
> ### ###  3 2 D C
> # #
> # ### #  4 7 8 B
> # # # #
> ### ###  5 6 9 A


Here's an excellent example of a 0-255 numbered Hilbert curve used
to enumerate the various top-level allocations of IPv4 space:

http://xkcd.com/195/
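
For reference, here is a minimal C version of the standard iterative
Hilbert d2xy mapping (a sketch, not code from this thread), which places
page number d at pixel (x,y); with n=4 it reproduces the 4x4 figure
quoted above, with x as the column and y as the row:

#include <stdio.h>

/* rotate/flip a quadrant appropriately */
static void rot(int n, int *x, int *y, int rx, int ry)
{
        if (ry == 0) {
                if (rx == 1) {
                        *x = n - 1 - *x;
                        *y = n - 1 - *y;
                }
                /* swap x and y */
                int t = *x;
                *x = *y;
                *y = t;
        }
}

/* convert a distance d along the curve into (x,y); n is the side
 * length of the square, a power of two */
static void d2xy(int n, int d, int *x, int *y)
{
        int rx, ry, t = d;

        *x = *y = 0;
        for (int s = 1; s < n; s *= 2) {
                rx = 1 & (t / 2);
                ry = 1 & (t ^ rx);
                rot(s, x, y, rx, ry);
                *x += s * rx;
                *y += s * ry;
                t /= 4;
        }
}

int main(void)
{
        int x, y;

        for (int d = 0; d < 16; d++) {
                d2xy(4, d, &x, &y);
                printf("page 0x%X -> pixel (col %d, row %d)\n", d, x, y);
        }
        return 0;
}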

Cheers,
Kyle Moffett



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-23 Thread Jörn Engel
On Sun, 16 September 2007 11:44:09 -0700, Linus Torvalds wrote:
> On Sun, 16 Sep 2007, Jörn Engel wrote:
> > 
> > My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc.
> > which are pinned for their entire lifetime and another for regular
> > files/inodes.  One could take a three-way approach and have
> > always-pinned, often-pinned and rarely-pinned.
> > 
> > We won't get never-pinned that way.
> 
> That sounds pretty good. The problem, of course, is that most of the time, 
> the actual dentry allocation itself is done before you really know which 
> case the dentry will be in, and the natural place for actually giving the 
> dentry lifetime hint is *not* at "d_alloc()", but when we "instantiate" 
> it with d_add() or d_instantiate().
> 
> [...]
> 
> And yes, you'd end up with the reallocation overhead quite often, but at 
> least it would now happen only when filling in a dentry, not in the 
> (*much* more critical) cached lookup path.

There may be another approach.  We could create a never-pinned cache,
without trying hard to keep it full.  Instead of moving a hot dentry at
dput() time, we move a cold one from the end of lru.  And if the lru
list is short, we just chicken out.

Our definition of "short lru list" can either be based on a ratio of
pinned to unpinned dentries or on a metric of cache hits vs. cache
misses.  I tend to dislike the cache hit metric, because updatedb would
cause tons of misses and result in the same mess we have right now.

With this double cache, we have a source of slabs to cheaply reap under
memory pressure, but still have a performance advantage (memcpy beats
disk io by orders of magnitude).
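
A toy userspace model of that policy (a sketch only, not kernel code; the
ratio used to define a "short" lru list is an arbitrary placeholder):

#include <stdbool.h>
#include <stdio.h>

/* Treat the lru as "short" when unpinned entries don't outnumber pinned
 * ones by at least this factor; a cache hit/miss metric could be used
 * here instead. */
#define MIN_UNPINNED_PER_PINNED 2

struct dcache_stats {
        unsigned long pinned;     /* mount points, ramfs/tmpfs/sysfs, ... */
        unsigned long unpinned;   /* regular files/inodes */
};

static bool lru_is_short(const struct dcache_stats *s)
{
        return s->unpinned < MIN_UNPINNED_PER_PINNED * s->pinned;
}

/* Called where dput() would otherwise act on the hot dentry: move one
 * cold entry from the lru tail into the never-pinned cache instead, or
 * chicken out if the lru is short. */
static bool move_cold_to_never_pinned(struct dcache_stats *s)
{
        if (lru_is_short(s))
                return false;
        s->unpinned--;      /* stands in for moving the lru tail entry */
        return true;
}

int main(void)
{
        struct dcache_stats s = { .pinned = 100, .unpinned = 150 };

        printf("migrate a cold dentry: %s\n",
               move_cold_to_never_pinned(&s) ? "yes" : "chickened out");
        return 0;
}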

Jörn

-- 
The story so far:
In the beginning the Universe was created.  This has made a lot
of people very angry and been widely regarded as a bad move.
-- Douglas Adams


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-23 Thread Goswin von Brederlow
Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> On Mon, Sep 17, 2007 at 12:56:07AM +0200, Goswin von Brederlow wrote:
>> When has free ever given any useful "free" number? I can perfectly
>> well allocate another gigabyte of memory despite free saying 25MB. But
>> that is because I know that the buffer/cached are not locked in.
>
> Well, as you said you know that buffer/cached are not locked in. If
> /proc/meminfo were rubbish like you seem to imply in the first
> line, why would we ever bother to export that information and even
> waste time writing a binary that parses it for admins?

As a user I know it because I didn't put a kernel source into /tmp. A
program can't reasonably know that.

>> On the other hand 1GB can instantly vanish when I start a Xen domain
>> and anything relying on the free value would lose.
>
> Actually you'd better check meminfo or free before starting a 1G Xen domain!!

Xen has its own memory pool and can quite aggressively reclaim memory
from dom0 when needed. I just meant to say that the number in
/proc/meminfo can change in a second so it is not much use knowing
what it said last minute.
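
For illustration only (not from the thread): the kind of check being
talked about is just reading /proc/meminfo before doing something large,
with the caveat from above that the values can be stale a second later:

#include <stdio.h>

int main(void)
{
        char line[256];
        unsigned long memfree = 0, buffers = 0, cached = 0;
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f) {
                perror("/proc/meminfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                sscanf(line, "MemFree: %lu kB", &memfree);
                sscanf(line, "Buffers: %lu kB", &buffers);
                sscanf(line, "Cached: %lu kB", &cached);
        }
        fclose(f);

        /* free(1) adds buffers/cache for the same reason: they are not
         * locked in and can usually be reclaimed. */
        printf("free %lu kB, easily reclaimable ~%lu kB\n",
               memfree, buffers + cached);
        return 0;
}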

>> The only sensible thing for an application concerned with swapping is
>> to watch the swapping and then reduce itself. Not the amount
>> free. Although I wish there were some kernel interface to get a
>> pressure value of how valuable free pages would be right now. I would
>> like that for fuse so a userspace filesystem can do caching without
>> crippling the kernel.
>
> Repeated drop caches + free can help.

I would kill any program that does that to find out how much free ram
the system has.

MfG
Goswin


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-23 Thread Goswin von Brederlow
[EMAIL PROTECTED] (Mel Gorman) writes:

> On (17/09/07 00:38), Goswin von Brederlow didst pronounce:
>> [EMAIL PROTECTED] (Mel Gorman) writes:
>> 
>> > On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
>> >> Mel Gorman <[EMAIL PROTECTED]> writes:
>> >> Looking at my
>> >> little test program evicting movable objects from a mixed group should
>> >> not be that expensive as it doesn't happen often.
>> >
>> > It happens regularly if the size of the block you need to keep clean is
>> > lower than min_free_kbytes. In the case of hugepages, that was always
>> > the case.
>> 
>> That assumes that the number of groups allocated for unmovable objects
>> will continuously grow and shrink.
>
> They do grow and shrink. The number of pagetables in use changes for
> example.

By whole groups' worth? And do full groups get freed, unmixed and
filled by movable objects?

>> I'm assuming it will level off at
>> some size for long times (hours) under normal operations.
>
> It doesn't unless you assume the system remains in a steady state for its
> lifetime. Things like updatedb tend to throw a spanner into the works.

Moved to cron weekly here. And even normally it is only once a day. So
what if it starts moving some pages while updatedb runs? If it isn't
too braindead it will reclaim some dentries updatedb has created and
left for good. It should just cause the dentry cache to be smaller at
no cost. I'm not calling that normal operations. That is a once-a-day
special. What I don't want is to spend 1% of cpu time copying
pages. That would be unacceptable. Copying 1000 pages per updatedb run
would be trivial on the other hand.

>> There should
>> be some buffering of a few groups to be held back in reserve when it
>> shrinks to prevent the scenario that the size is just at a group
>> boundary and always grows/shrinks by 1 group.
>> 
>
> And what size should this group be that all workloads function?

1 is enough to prevent jittering. If you don't hold a group back and
you are exactly at a group boundary then alternately allocating and
freeing one page would result in a group allocation and freeing every
time. With one group in reserve you only get a group allocation or
freeing when a group's worth of change has happened.

This assumes that changing the type and/or state of a group is
expensive. Takes time or locks or some such. Otherwise just let it
jitter.
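
A toy model of that hysteresis argument (a sketch; group size and
iteration count are made up), showing how one group held in reserve stops
a workload sitting exactly on a group boundary from converting a group on
every single allocation and free:

#include <stdio.h>

#define PAGES_PER_GROUP 512

struct pool {
        int groups;        /* groups currently dedicated to this type */
        int reserve;       /* groups held back before giving one up */
        int conversions;   /* how often a group changed hands */
};

static void update(struct pool *p, long pages_in_use)
{
        int needed = (pages_in_use + PAGES_PER_GROUP - 1) / PAGES_PER_GROUP;

        if (needed > p->groups) {
                p->groups = needed;              /* take another group */
                p->conversions++;
        } else if (needed + p->reserve < p->groups) {
                p->groups = needed + p->reserve; /* give one back */
                p->conversions++;
        }
}

int main(void)
{
        struct pool no_reserve = { 1, 0, 0 }, one_reserve = { 1, 1, 0 };
        long base = PAGES_PER_GROUP;             /* exactly on a boundary */

        /* alternately allocate and free one page */
        for (int i = 0; i < 1000; i++) {
                update(&no_reserve, base + (i & 1));
                update(&one_reserve, base + (i & 1));
        }
        printf("conversions: %d without reserve, %d with one group in reserve\n",
               no_reserve.conversions, one_reserve.conversions);
        return 0;
}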

>> >> So if
>> >> you evict movable objects from mixed group when needed all the
>> >> pagetable pages would end up in the same mixed group slowly taking it
>> >> over completly. No fragmentation at all. See how essential that
>> >> feature is. :)
>> >> 
>> >
>> > To move pages, there must be enough blocks free. That is where
>> > min_free_kbytes had to come in. If you cared only about keeping 64KB
>> > chunks free, it makes sense but it didn't in the context of hugepages.
>> 
>> I'm more concerned with keeping the little unmovable things out of the
>> way. Those are the things that will fragment the memory and prevent
>> any huge pages from being available even with moving other stuff out of the
>> way.
>
> That's fair, just not cheap

That is the price you pay. To allocate 2MB of ram you have to have 2MB
of free ram or make it free. There is no way around that. Moving
pages means that you can actually get those 2MB even if the price
is high and that you have more choice deciding what to throw away or
swap out. I would rather have a 2MB malloc take some time than have it
fail because the kernel doesn't feel like it.

>> Can you tell me how? I would like to do the same.
>> 
>
> They were generated using trace_allocmap kernel module in
> http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.81-rc2.tar.gz
> in combination with frag-display in the same package.  However, in the
> current version against current -mm's, it'll identify some movable pages
> wrong. Specifically, it will appear to be mixing movable pages with slab
> pages and it doesn't identify SLUB pages properly at all (SLUB came after
> the last revision of this tool). I need to bring an annotation patch up to
> date before it can generate the images correctly.

Thanks. I will test that out and see what I get on a few lustre
servers and clients. That is probably quite a different workload from
what you test.

MfG
Goswin


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-23 Thread Goswin von Brederlow
[EMAIL PROTECTED] (Mel Gorman) writes:

> On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
>> But when you already have say 10% of the ram in mixed groups then it
>> is a sign that external fragmentation is happening and some time should
>> be spent on moving movable objects.
>> 
>
> I'll play around with it on the side and see what sort of results I get.
> I won't be pushing anything any time soon in relation to this though.
> For now, I don't intend to fiddle more with grouping pages by mobility
> for something that may or may not be of benefit to a feature that hasn't
> been widely tested with what exists today.

I watched the videos you posted. A nice and quite clear improvement
with and without your logic. Kudos.

When you play around with it may I suggest a change to the display of
the memory information. I think it would be valuable to use a Hilbert
Curve to arrange the pages into pixels. Like this:

# #  0  3
# #
###  1  2

### ###  0 1 E F
  # #
### ###  3 2 D C
# #
# ### #  4 7 8 B
# # # #
### ###  5 6 9 A
+---+---+
# # # # |00 03 04 05|3A 3B 3C 3F|
# #   # #   # # |   |   |
### ### ### ### |01 02 07 06|39 38 3D 3E|
# # |   |   |
### ### ### ### |0E 0D 08 09|36 37 32 31|
# #   # #   # # |   |   |
# # # # |0F 0C 0B 0A|35 34 33 30|
# # +-+-+   |
### ### ### |10 11|1E 1F|20 21 2E 2F|
  # # # #   | | |   |
### ### ### ### |13 12|1D 1C|23 22 2D 2C|
# # # # | +-+   |
# ### # # ### # |14 17|18 1B|24 27 28 2B|
# # # # # # # # | | |   |
### ### ### ### |15 16|19 1A|25 26 29 2A|
+-+-+---+

I've drawn in allocations for 16, 8, 4, 5, 32 pages in that order in
the last one. The idea is to get nearby pages visually near in the
output and into an area instead of lines. Easier on the eye. It also
manages to always draw aligned order(x) blocks as squares or rectangles
(even or odd order).

>> Maybe instead of reserving one could say that you can have up to 6
>> groups of space
>
> And if the groups are 1GB in size? I tried something like this already.
> It didn't work out well at the time although I could revisit.

You adjust group size with the number of groups total. You would not
use 1GB Huge Pages on a 2GB ram system. You could try 2MB groups. I
think for most current systems we are lucky there. 2MB groups fit
hardware support and give a large but not too large number of groups
to work with.

But you only need to stick to hardware-suitable group sizes for huge
tlb support, right? For better I/O and such you could have 512KB groups
if that size gives a reasonable number of groups total.

>> not used by unmovable objects before aggressive moving
>> starts. I don't quite see why you NEED reserving as long as there is
>> enough space free altogether in case something needs moving.
>
> hence, increase min_free_kbytes.

Which is different from reserving a full group as it does not count
fragmented space as lost.

>> 1 group
>> worth of space free might be plenty to move stuff to. Note that all
>> the virtual pages can be stuffed in every little free space there is
>> and reassembled by the MMU. There is no space lost there.
>> 
>
> What you suggest sounds similar to having a type MIGRATE_MIXED where you
> allocate from when the preferred lists are full. It became a sizing
> problem that never really worked out. As I said, I can try again.

Not really. I'm saying we should actively defragment mixed groups
during allocation, always doing as little as possible, when a certain
level of external fragmentation is reached. A MIGRATE_MIXED sounds
like giving up completely if things get bad enough. Compare it to a
cheap network switch going into hub mode when its arp table runs full.
If you ever had that then you know how bad that is.

>> But until one tries one can't say.
>> 
>> MfG
>> Goswin
>> 
>> PS: How do allocations pick groups?
>
> Using GFP flags to identify the type.

That is the type of group, not which one.

>> Could one use the oldest group
>> dedicated to each MIGRATE_TYPE?
>
> Age is difficult to determine so probably not.

Put the uptime as sort key into each group header on creation or type
change. Then sort the partially used groups by that key. A heap will do
fine and be fast.
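
A minimal sketch of that bookkeeping (an assumption of what the key could
look like, not actual allocator code); a heap keyed the same way would
avoid the full sort:

#include <stdio.h>
#include <stdlib.h>

struct group_header {
        unsigned long created;    /* uptime stamp at creation or type change */
        int migrate_type;
        int pages_used;
};

static int by_age(const void *a, const void *b)
{
        const struct group_header *ga = a, *gb = b;

        return (ga->created > gb->created) - (ga->created < gb->created);
}

int main(void)
{
        struct group_header partial[3] = {
                { 300, 0, 17 }, { 100, 0, 250 }, { 200, 0, 3 },
        };

        /* oldest group first, so allocations of this type prefer it */
        qsort(partial, 3, sizeof(partial[0]), by_age);
        for (int i = 0; i < 3; i++)
                printf("group created at %lu, %d pages used\n",
                       partial[i].created, partial[i].pages_used);
        return 0;
}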

>> Or lowest address for unmovable and
>> highest address for movable? Something to better keep the two out of
>> each other's way.
>
> We bias the location of unmovable and reclaimable allocations already. It's
> not done for movable because it wasn't necessary (as they are easily
> reclaimed or moved anyway).

Except that is never done, so it doesn't count.

MfG
Goswin

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-22 Thread Goswin von Brederlow
[EMAIL PROTECTED] (Mel Gorman) writes:

> On (16/09/07 23:31), Andrea Arcangeli didst pronounce:
>> On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
>> Allocating ptes from slab is fairly simple but I think it would be
>> better to allocate ptes in PAGE_SIZE (64k) chunks and preallocate the
>> nearby ptes in the per-task local pagetable tree, to reduce the number
>> of locks taken and not to enter the slab at all for that.
>
> It runs the risk of pinning up to 60K of data per task that is unusable for
> any other purpose. On average, it'll be more like 32K but worth keeping
> in mind.

Two things to both of you respectively.

Why should we try to stay out of the pte slab? Isn't the slab exactly
made for this thing? To efficiently handle a large number of equal-size
objects for quick allocation and deallocation? If it is a locking
problem then there should be a per-cpu cache of ptes. Say 0-32
ptes. If you run out you allocate 16 from slab. When you overflow you
free 16 (which would give you your 64k allocations but in multiple
objects).

As for the wastage: every pte page can map 2MB on amd64, 4MB on i386,
8MB on sparc (?). A 64k pte chunk would be 32MB, 64MB and 32MB (?)
respectively. For the sbrk() and mmap() usage from glibc malloc() that
would be fine as they grow linearly and the mmap() call in glibc could
be made to align to those chunks. But for a program like rtorrent
using mmap to bring in chunks of a 4GB file this looks disastrous.

>> In fact we
>> could allocate the 4 levels (or anyway more than one level) in one
>> single alloc_pages(0) and track the leftovers in the mm (or similar).

Personally I would really go with a per-cpu cache. When mapping a page
reserve 4 tables. Then you walk the tree and add entries as
needed. And last you release 0-4 unused entries to the cache.
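
A toy userspace model of that per-cpu cache (a sketch, not kernel code;
the 0-32 cache size and batch of 16 follow the numbers suggested above,
with malloc() standing in for the slab):

#include <stdio.h>
#include <stdlib.h>

#define CACHE_MAX 32
#define BATCH     16

struct pte_cache {
        void *page[CACHE_MAX];
        int nr;
};

static void *slab_alloc(void)  { return malloc(4096); }
static void slab_free(void *p) { free(p); }

static void *pte_alloc(struct pte_cache *pc)
{
        if (pc->nr == 0)                       /* ran dry: refill a batch */
                while (pc->nr < BATCH)
                        pc->page[pc->nr++] = slab_alloc();
        return pc->page[--pc->nr];
}

static void pte_free(struct pte_cache *pc, void *p)
{
        if (pc->nr == CACHE_MAX)               /* overflow: give a batch back */
                while (pc->nr > CACHE_MAX - BATCH)
                        slab_free(pc->page[--pc->nr]);
        pc->page[pc->nr++] = p;
}

int main(void)
{
        struct pte_cache pc = { .nr = 0 };
        void *p = pte_alloc(&pc);

        pte_free(&pc, p);
        printf("%d pte pages cached\n", pc.nr);
        return 0;
}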

MfG
Goswin


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-21 Thread Hugh Dickins
On Thu, 20 Sep 2007, Christoph Lameter wrote:
> On Thu, 20 Sep 2007, David Chinner wrote:
> > > Disagree, the mmap side is not a little change.
> > 
> > That's not in the filesystem, though. ;)
> 
> And it's really only a minimal change for some function to loop over all 
> 4k pages and elsewhere index the right 4k subpage.

I agree with you on that: the changes you had to make to support mmap
were _much_ less bothersome than I'd been fearing, and I'm surprised
some people still see that side of it as a sticking point.

But I've kept very quiet because I remain quite ambivalent about the
patchset: I'm somewhere on the spectrum between you and Nick, shifting
my position from hour to hour.  Don't expect any decisiveness from me.

In some senses I'm even further off the scale away from you: I'm
dubious even of Nick and Andrew's belief in opportunistic contiguity.  
Just how hard should we be trying for contiguity?  How far should we
go in sacrificing our earlier "LRU" principles?  It's easy to bump
up PAGE_ALLOC_COSTLY_ORDER, but what price do we pay when we do?

I agree with those who want to see how the competing approaches
work out in practice: which is frustrating for you, yes, because
you are so close to ready.  (I've not glanced at virtual compound,
but had been wondering in that direction before you suggested it.)

I do think your patchset is, for the time being at least, a nice
out-of-tree set, and it's grand to be able to bring a filesystem
from another arch with larger pagesize and get at the data from it.

I've found some fixes needed on top of your Large Blocksize Support
patches: I'll send those to you in a moment.  Looks like you didn't
try much swapping!

I only managed to get ext2 working with larger blocksizes:
reiserfs -b 8192 wouldn't mount ("reiserfs_fill_super: can not find
reiserfs on /dev/sdb1"); ext3 gave me mysterious errors ("JBD: tar
wants too many credits", even after adding JBD patches that you
turned out to be depending on); and I didn't try ext4 or xfs
(I'm guessing the latter has been quite well tested ;)

Hugh


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-20 Thread Christoph Lameter
On Thu, 20 Sep 2007, Andrea Arcangeli wrote:

> The only point of this largepage stuff is to go an extra mile to save
> a bit more of cpu vs a strict vmap based solution (fsblock of course
> will be smart enough that if it notices the PAGE_SIZE is >= blocksize
> it doesn't need to run any vmap at all and it can just use the direct
> mapping, so vmap translates in 1 branch only to check the blocksize
> variable, PAGE_SIZE is immediate in the .text at compile time). But if

Hmmm... Are you not keeping up with things? Heard of virtual compound
pages? They only require a vmap when the page allocator fails a
larger-order allocation (which we have established is rare to the point
of nonexistence). The next rev of large blocksize will use that for
fallback. Plus vcompounds can be used to get rid of most uses of vmalloc,
reducing the need for virtual mapping in general.

Largeblock is a general solution for managing large data sets using a 
single page struct. See the original message that started this thread. It 
can be used for various other subsystems like vcompound can.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-20 Thread Christoph Lameter
On Thu, 20 Sep 2007, David Chinner wrote:

> > Disagree, the mmap side is not a little change.
> 
> That's not in the filesystem, though. ;)

And it's really only a minimal change for some function to loop over all 
4k pages and elsewhere index the right 4k subpage.



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-20 Thread Andrea Arcangeli
On Thu, Sep 20, 2007 at 11:38:21AM +1000, David Chinner wrote:
> Sure, and that's what I meant when I said VPC + large pages was
> a means to the end, not the only solution to the problem.

The whole point is that it's not an end, it's an end to your own
fs-centric view only (which is fair enough, sure), but I watch the whole
VM, not just the pagecache...

The same way the fs-centric view will hope to get this little bit of
further optimization from largepages to reach "the end", my VM-wide
view wants the same little bit of optimization for *everything*
including tmpfs and anonymous memory, slab etc.! This is clearly why
config-page-shift is better...

If you're ok not to be on the edge and you want a generic rpm image
that runs quite optimally for any workload, then 4k+fsblock is just
fine of course. But if we go on the edge we should aim for the _very_
end for the whole VM, not just for "the end of the pagecache on
certain files". Especially when the complexity involved in the mmap
code is similar, and it will reject heavily if we merge this
not-very-end solution that only reaches "the end" for the pagecache.

> No, I don't like fsblock because it is inherently a "structure
> per filesystem block" construct, just like buggerheads. You
> still need to allocate millions of them when you have millions
> dirty pages around. Rather than type it all out again, read
> the fsblocks thread from here:
> 
> http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2

Thanks for the pointer!

> FWIW, with Chris Mason's extent-based block mapping (which btrfs
> is using and Christoph Hellwig is porting XFS over to) we completely
> remove buggerheads from XFS and so fsblock would be a pretty
> major step backwards for us if Chris's work goes into mainline.

I tend to agree: if we change it, fsblock should support extents if
that's what you need on xfs to support range-locking etc... Whatever
happens in the vfs should please all existing fs without people needing
to go their own way again... Or replace fsblock with Chris's block
mapping. Frankly I didn't see Chris's code so I cannot comment
further. But your complaints sound sensible. We certainly want to
avoid lowlevel filesystems getting smarter again than the vfs. The
brainy stuff should be in the vfs!

> That's not in the filesystem, though. ;)
> 
> However, I agree that if you don't have mmap then it's not
> worthwhile and the changes for VPC aren't trivial.

Yep.

> 
> > >   3. avoiding the need for vmap() as it has great
> > >  overhead and does not scale
> > >   -> Nick is starting to work on that and has
> > >  already had good results.
> > 
> > Frankly I don't follow this vmap thing. Can you elaborate?
> 
> We current support metadata blocks larger than page size for
> certain types of metadata in XFS. e.g. directory blocks.
> This however, requires vmap()ing a bunch of individual,
> non-contiguous pages out of a block device address space
> in exactly the fashion that was proposed by Nick with fsblock
> originally.
> 
> vmap() has severe scalability problems - read this subthread
> of this discussion between Nick and myself:
> 
> http://lkml.org/lkml/2007/9/11/508

So the idea of vmap is that it's much simpler to have a contiguous
virtual address space for a large blocksize than to find the right
b_data[index] once you exceed PAGE_SIZE...

The global tlb flush with IPIs would kill performance, you can forget
any global mapping here. The only chance to do this would be like we
do with kmap_atomic per-cpu on highmem, with preempt_disable (for the
enjoyment of the rt folks out there ;). What's the problem of having
it per-cpu? Is this what fsblock already does? You just have to
allocate a new virtual range of size number-of-entries-in-vmap * blocksize
every time you mount a new fs. Then instead of calling kmap you call
vmap and vunmap when you're finished. That should provide decent
performance, especially with physically indexed caches.

Anything more heavyweight than what I suggested is probably overkill,
even vmalloc_to_page.
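
A loose userspace analogy of that idea (illustration only; the kernel
side would use vmap/vunmap over kernel virtual space, possibly per-cpu,
not mmap): reserve one contiguous virtual range, then map scattered
page-sized chunks of a file into consecutive slots so the large block can
be addressed linearly. File name and offsets here are arbitrary.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK   4096
#define NCHUNKS 4

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/bin/sh";
        off_t src[NCHUNKS] = { 0, 8 * CHUNK, 2 * CHUNK, 5 * CHUNK };
        int fd = open(path, O_RDONLY);
        char *va;

        if (fd < 0)
                return 1;
        /* reserve a contiguous virtual range; backing replaced below */
        va = mmap(NULL, NCHUNKS * CHUNK, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (va == MAP_FAILED)
                return 1;
        /* the "vmap": map non-contiguous chunks into consecutive slots */
        for (int i = 0; i < NCHUNKS; i++)
                if (mmap(va + i * CHUNK, CHUNK, PROT_READ,
                         MAP_PRIVATE | MAP_FIXED, fd, src[i]) == MAP_FAILED)
                        return 1;
        printf("%s mapped as one %d byte block at %p, first byte 0x%02x\n",
               path, NCHUNKS * CHUNK, (void *)va, (unsigned char)va[0]);
        /* the "vunmap" */
        munmap(va, NCHUNKS * CHUNK);
        close(fd);
        return 0;
}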

> Hmm - so you'll need page cache tail packing as well in that case
> to prevent memory being wasted on small files. That means any way
> we look at it (VPC+mmap or config-page-shift+fsblock+pctails)
> we've got some non-trivial VM  modifications to make. 

Hmm no, the point of config-page-shift is that if you really need to
reach "the very end", you probably don't care about wasting some
memory, because either your workload can't fit in cache, or it fits in
cache regardless, or you're not wasting memory because you work with
large files...

The only point of this largepage stuff is to go an extra mile to save
a bit more cpu vs a strict vmap-based solution (fsblock of course
will be smart enough that if it notices the PAGE_SIZE is >= blocksize
it doesn't need to run any vmap at all and it can just use the direct
mapping, so vmap translates in 1 branch only to check the blocksize
variable, PAGE_SIZE is immediate in the .text at compile time). But if

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-20 Thread Christoph Lameter
On Thu, 20 Sep 2007, David Chinner wrote:

> > Disagree, the mmap side is not a little change.
> 
> That's not in the filesystem, though. ;)

And it's really only a minimal change for some function to loop over all 
4k pages and elsewhere index the right 4k subpage.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-20 Thread Christoph Lameter
On Thu, 20 Sep 2007, Andrea Arcangeli wrote:

> The only point of this largepage stuff is to go an extra mile to save
> a bit more of cpu vs a strict vmap based solution (fsblock of course
> will be smart enough that if it notices the PAGE_SIZE is >= blocksize
> it doesn't need to run any vmap at all and it can just use the direct
> mapping, so vmap translates in 1 branch only to check the blocksize
> variable, PAGE_SIZE is immediate in the .text at compile time). But if

Hmmm.. You are not keeping up with things? Heard of virtual compound 
pages? They only require a vmap when the page allocator fails a 
larger order allocation (which we have established is rare to the point 
of nonexistence). The next rev of large blocksize will use that for 
fallback. Plus vcompounds can be used to get rid of most uses of vmalloc 
reducing the need for virtual mapping in general.
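
A bare-bones sketch of that fallback idea (not the actual virtual compound
patch; vcompound_alloc() and its bookkeeping are invented here, a lowmem gfp
mask is assumed, and teardown is omitted):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *vcompound_alloc(gfp_t gfp, unsigned int order)
{
        unsigned int i, nr = 1U << order;
        struct page *page, **pages;
        void *addr;

        /* Common case: the buddy allocator hands back contiguous memory. */
        page = alloc_pages(gfp, order);
        if (page)
                return page_address(page);

        /* Rare fallback: order-0 pages made virtually contiguous. */
        pages = kmalloc(nr * sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return NULL;
        for (i = 0; i < nr; i++) {
                pages[i] = alloc_page(gfp);
                if (!pages[i])
                        goto out_free;
        }
        addr = vmap(pages, nr, VM_MAP, PAGE_KERNEL);
        if (!addr)
                goto out_free;
        /* The constituent pages can later be found again with
         * vmalloc_to_page(), so the temporary array is not kept. */
        kfree(pages);
        return addr;

out_free:
        while (i--)
                __free_page(pages[i]);
        kfree(pages);
        return NULL;
}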

Largeblock is a general solution for managing large data sets using a 
single page struct. See the original message that started this thread. It 
can be used for various other subsystems like vcompound can.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-19 Thread David Chinner
On Wed, Sep 19, 2007 at 04:04:30PM +0200, Andrea Arcangeli wrote:
> On Wed, Sep 19, 2007 at 03:09:10PM +1000, David Chinner wrote:
> > Ok, let's step back for a moment and look at a basic, fundamental
> > constraint of disks - seek capacity. A decade ago, a terabyte of
> > filesystem had 30 disks behind it - a seek capacity of about
> > 6000 seeks/s. Nowdays, that's a single disk with a seek
> > capacity of about 200/s. We're going *rapidly* backwards in
> > terms of seek capacity per terabyte of storage.
> > 
> > Now fill that terabyte of storage and index it in the most efficient
> > way - let's say btrees are used because lots of filesystems use
> > them. Hence the depth of the tree is roughly O((log n)/m) where m is
> > a factor of the btree block size.  Effectively, btree depth = seek
> > count on lookup of any object.
> 
> I agree. btrees will clearly benefit if the nodes are larger. We've an
> excess of disk capacity and an huge gap between seeking and contiguous
> bandwidth.
> 
> You don't need largepages for this, fsblocks is enough.

Sure, and that's what I meant when I said VPC + large pages was
a means to the end, not the only solution to the problem.

> Plus of course you don't like fsblock because it requires work to
> adapt a fs to it, I can't argue about that.

No, I don't like fsblock because it is inherently a "structure
per filesystem block" construct, just like buggerheads. You
still need to allocate millions of them when you have millions
dirty pages around. Rather than type it all out again, read
the fsblocks thread from here:

http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2

FWIW, with Chris mason's extent-based block mapping (which btrfs
is using and Christoph Hellwig is porting XFS over to) we completely
remove buggerheads from XFS and so fsblock would be a pretty
major step backwards for us if Chris's work goes into mainline.

> > Ok, so let's set the record straight. There were 3 justifications
> > for using *large pages* to *support* large filesystem block sizes
> > The justifications for the variable order page cache with large
> > pages were:
> > 
> > 1. little code change needed in the filesystems
> > -> still true
> 
> Disagree, the mmap side is not a little change.

That's not in the filesystem, though. ;)

However, I agree that if you don't have mmap then it's not
worthwhile and the changes for VPC aren't trivial.

> > 3. avoiding the need for vmap() as it has great
> >overhead and does not scale
> > -> Nick is starting to work on that and has
> >already had good results.
> 
> Frankly I don't follow this vmap thing. Can you elaborate?

We current support metadata blocks larger than page size for
certain types of metadata in XFS. e.g. directory blocks.
This however, requires vmap()ing a bunch of individual,
non-contiguous pages out of a block device address space
in exactly the fashion that was proposed by Nick with fsblock
originally.

vmap() has severe scalability problems - read this subthread
of this discussion between Nick and myself:

http://lkml.org/lkml/2007/9/11/508

> > Everyone seems to be focussing on #2 as the entire justification for
> > large block sizes in filesystems and that this is an "SGI" problem.
> 
> I agree it's not a SGI problem and this is why I want a design that
> has a _slight chance_ to improve performance on x86-64 too. If
> variable order page cache will provide any further improvement on top
> of fsblock will be only because your I/O device isn't fast with small
> sg entries.

<sigh>

There we go - back to the bloody I/O devices. Can ppl please stop
bringing this up because it *is not an issue any more*.

> config-page-shift + fsblock IMHO is the way to go for x86-64, with one
> additional 64k PAGE_SIZE rpm. config-page-shift will stack nicely on
> top of fsblocks.

Hmm - so you'll need page cache tail packing as well in that case
to prevent memory being wasted on small files. That means any way
we look at it (VPC+mmap or config-page-shift+fsblock+pctails)
we've got some non-trivial VM  modifications to make. 

If VPC can be separated from the large contiguous page requirement
(i.e. virtually mapped compound page support), I still think it
comes out on top because it doesn't require every filesystem to be
modified and you can use standard pages where they are optimal
(i.e. on filesystems where block size <= PAGE_SIZE).

But, I'm not going to argue endlessly for one solution or another;
I'm happy to see different solutions being chased, so may the
best VM win ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-19 Thread Andrea Arcangeli
On Wed, Sep 19, 2007 at 03:09:10PM +1000, David Chinner wrote:
> Ok, let's step back for a moment and look at a basic, fundamental
> constraint of disks - seek capacity. A decade ago, a terabyte of
> filesystem had 30 disks behind it - a seek capacity of about
> 6000 seeks/s. Nowdays, that's a single disk with a seek
> capacity of about 200/s. We're going *rapidly* backwards in
> terms of seek capacity per terabyte of storage.
> 
> Now fill that terabyte of storage and index it in the most efficient
> way - let's say btrees are used because lots of filesystems use
> them. Hence the depth of the tree is roughly O((log n)/m) where m is
> a factor of the btree block size.  Effectively, btree depth = seek
> count on lookup of any object.

I agree. btrees will clearly benefit if the nodes are larger. We've an
excess of disk capacity and a huge gap between seeking and contiguous
bandwidth.

You don't need largepages for this, fsblocks is enough.

Largepages for you are a further improvement to reduce the number of SG
entries and potentially reduce cpu utilization a bit (not much
though: only the pagecache works with largepages, and especially with
small sized random I/O you'll be taking the radix tree lock the same
number of times...).

Plus of course you don't like fsblock because it requires work to
adapt a fs to it, I can't argue about that.

> Ok, so let's set the record straight. There were 3 justifications
> for using *large pages* to *support* large filesystem block sizes
> The justifications for the variable order page cache with large
> pages were:
> 
>   1. little code change needed in the filesystems
>   -> still true

Disagree, the mmap side is not a little change. If you do it just for
the not-mmapped I/O that truly is a hack, but then frankly I would
prefer only the read/write hack (without mmap) so it will not clash
heavily with my stuff and it'll be quicker to nuke it out of the
kernel later.

>   3. avoiding the need for vmap() as it has great
>  overhead and does not scale
>   -> Nick is starting to work on that and has
>  already had good results.

Frankly I don't follow this vmap thing. Can you elaborate? Is this
about allowing the blkdev pagecache for metadata to go in highmemory?
Is that the kmap thing? I think we can stick to a direct mapped b_data
and avoid all overhead of converting a struct page to a virtual
address. It takes the same 64bit size anyway in ram and we avoid one
layer of indirection and many modifications. If we wanted to switch to
kmap for blkdev pagecache we should have done years ago, now it's far
too late to worry about it.

> Everyone seems to be focussing on #2 as the entire justification for
> large block sizes in filesystems and that this is an "SGI" problem.

I agree it's not an SGI problem and this is why I want a design that
has a _slight chance_ to improve performance on x86-64 too. If the
variable order page cache provides any further improvement on top
of fsblock, it will only be because your I/O device isn't fast with small
sg entries.

For the I/O layout fsblock is more than enough, but I don't think
your variable order page cache will help in any significant way on
x86-64. Furthermore, the complexity of handling page faults on largepages
is almost equivalent to the complexity of config-page-shift, but
config-page-shift gives you the full cpu-saving benefits that you can
never remotely hope to achieve with a variable order page cache.

config-page-shift + fsblock IMHO is the way to go for x86-64, with one
additional 64k PAGE_SIZE rpm. config-page-shift will stack nicely on
top of fsblocks.

fsblock will provide the guarantee of "mounting" all fs anywhere no
matter which config-page-shift you selected at compile time, as well
as dvd writing. Then config-page-shift will provide the cpu
optimization on all fronts, not just for the pagecache I/O for the
large ram systems, without fragmentation issues and with 100%
reliability in the "free" numbers (not working by luck). That's all we
need as far as I can tell.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-19 Thread Nick Piggin
On Wednesday 19 September 2007 04:30, Linus Torvalds wrote:
> On Tue, 18 Sep 2007, Nick Piggin wrote:
> > ROFL! Yeah of course, how could I have forgotten about our trusty OOM
> > killer as the solution to the fragmentation problem? It would only have
> > been funnier if you had said to reboot every so often when memory gets
> > fragmented :)
>
> Can we please stop this *idiotic* thread.
>
> Nick, you and some others seem to be arguing based on a totally flawed
> base, namely:
>  - we can guarantee anything at all in the VM
>  - we even care about the 16kB blocksize
>  - second-class citizenry is "bad"
>
> The fact is, *none* of those things are true. The VM doesn't guarantee
> anything, and is already very much about statistics in many places. You
> seem to be arguing as if Christoph was introducing something new and
> unacceptable, when it's largely just more of the same.

I will stop this idiotic thread.

However, at the VM and/or vm/fs things we had, I was happy enough
for this thing of Christoph's to get merged. Actually I didn't even care
if it had mmap support, so long as it solved their problem.

But a solution to the general problem of VM and IO scalability, it is not.
IMO.


> And the fact is, nobody but SGI customers would ever want the 16kB
> blocksize. IOW - NONE OF THIS MATTERS!

Maybe. Maybe not.


> Can you guys stop this inane thread already, or at least take it private
> between you guys, instead of forcing everybody else to listen in on your
> flamefest.

Will do. Sorry.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-19 Thread Alex Tomas
On 9/19/07, David Chinner <[EMAIL PROTECTED]> wrote:
> The problem is this: to alter the fundamental block size of the
> filesystem we also need to alter the data block size and that is
> exactly the piece that linux does not support right now.  So while
> we have the capability to use large block sizes in certain
> filesystems, we can't use that capability until the data path
> supports it.

it's much simpler to teach the fs to understand multipage data (like
multipage bitmap scan, multipage extent search, etc) than to deal with mm
fragmentation, IMHO. At the same time you don't bust IO traffic with
unused space.

-- 
thanks, Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread David Chinner
On Tue, Sep 18, 2007 at 06:06:52PM -0700, Linus Torvalds wrote:
> >  especially as the Linux
> > kernel limitations in this area are well known.  There's no "16K mess"
> > that SGI is trying to clean up here (and SGI have offered both IA64 and
> > x86_64 systems for some time now, so not sure how you came up with that
> > whacko theory).
> 
> Well, if that is the case, then I vote that we drop the whole patch-series 
> entirely. It clearly has no reason for existing at all.
> 
> There is *no* valid reason for 16kB blocksizes unless you have legacy 
> issues.

Ok, let's step back for a moment and look at a basic, fundamental
constraint of disks - seek capacity. A decade ago, a terabyte of
filesystem had 30 disks behind it - a seek capacity of about
6000 seeks/s. Nowadays, that's a single disk with a seek
capacity of about 200/s. We're going *rapidly* backwards in
terms of seek capacity per terabyte of storage.

Now fill that terabyte of storage and index it in the most efficient
way - let's say btrees are used because lots of filesystems use
them. Hence the depth of the tree is roughly O((log n)/m) where m is
a factor of the btree block size.  Effectively, btree depth = seek
count on lookup of any object.
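
As a rough back-of-the-envelope illustration of that point (the record count
and index entry size below are assumptions picked only for the example, not
numbers from this thread):

#include <math.h>
#include <stdio.h>

int main(void)
{
        const double nrecords = 1e9;   /* objects to index (assumed) */
        const double entry = 16.0;     /* bytes per key+pointer (assumed) */
        const int blocks[] = { 4096, 16384, 65536 };

        for (int i = 0; i < 3; i++) {
                double fanout = blocks[i] / entry;
                double depth = ceil(log(nrecords) / log(fanout));
                printf("%6d-byte blocks: fanout ~%4.0f, depth (seeks) ~%.0f\n",
                       blocks[i], fanout, depth);
        }
        return 0;
}

Larger btree blocks mean a larger fanout, a shallower tree, and hence fewer
seeks per lookup, which is the effect being argued for here.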

When the filesystem had a capacity of 6,000 seeks/s, we didn't
really care if the indexes used 4k blocks or not - the storage
subsystem had an excess of seek capacity to deal with
less-than-optimal indexing. Now we have over an order of magnitude
fewer seeks to expend in index operations *for the same amount of
data* so we are really starting to care about minimising the
number of seeks in our indexing mechanisms and allocations.

We can play tricks in index compaction to reduce the number of
interior nodes of the tree (like hashed indexing in the XFS and ext3
htree directories) but that still only gets us so far in reducing
seeks and doesn't help at all for tree traversals. That leaves us
with the btree block size as the only factor we can further vary to
reduce the depth of the tree. i.e. "m".

So we want to increase the filesystem block size to improve the
efficiency of our indexing. That improvement in efficiency
translates directly into better performance on seek constrained
storage subsystems.

The problem is this: to alter the fundamental block size of the
filesystem we also need to alter the data block size and that is
exactly the piece that linux does not support right now.  So while
we have the capability to use large block sizes in certain
filesystems, we can't use that capability until the data path
supports it.

To summarise, large block size support in the filesystem is not
about "legacy" issues. It's about trying to cope with the rapid
expansion of storage capabilities of modern hardware where we have
to index much, much more data with a corresponding decrease in
the seek capability of the hardware.

> So get your stories straight, people.

Ok, so let's set the record straight. There were 3 justifications
for using *large pages* to *support* large filesystem block sizes
The justifications for the variable order page cache with large
pages were:

1. little code change needed in the filesystems
-> still true

2. Increased I/O sizes on 4k page machines (the "SCSI
   controller problem")
-> redundant thanks to Jens Axboe's quick work

3. avoiding the need for vmap() as it has great
   overhead and does not scale
-> Nick is starting to work on that and has
   already had good results.

Everyone seems to be focussing on #2 as the entire justification for
large block sizes in filesystems and that this is an "SGI" problem.
Nothing could be further from the truth - the truth is that large
pages solved multiple problems in one go. We now have a different,
better solution for #2, so please, please stop using that as some
justification for claiming filesystems don't need large block sizes.

However, all this doesn't change the fact that we have a major storage
scalability crunch coming in the next few years. Disk capacity is
likely to continue to double every 12 months for the next 3 or 4
years. Large block size support is only one mechanism we need to
help cope with this trend.

The variable order page cache with large pages was a means to an end
- it's not the only solution to this problem and I'm extremely happy
to see that there is progress on multiple fronts.  That's the
strength of the Linux community showing through.  In the end, I
really don't care how we end up supporting large filesystem block
sizes in the page cache - all I care about is that we end up
supporting it as efficiently and generically as we possibly can.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Rene Herman

On 09/19/2007 06:33 AM, Linus Torvalds wrote:

> On Wed, 19 Sep 2007, Rene Herman wrote:
>
> > I do feel larger blocksizes continue to make sense in general though. Packet
> > writing on CD/DVD is a problem already today since the hardware needs 32K or
> > 64K blocks and I'd expect to see more of these and similiar situations when
> > flash gets (even) more popular which it sort of inevitably is going to be.
>
> .. that's what scatter-gather exists for.
>
> What's so hard with just realizing that physical memory isn't contiguous?
>
> It's why we have MMU's. It's why we have scatter-gather.


So if I understood that right, you'd suggest to deal with devices with 
larger physical blocksizes at some level above the current blocklayer.


Not familiar enough with either block or fs to be able to argue that 
effectively...


Rene.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Wed, 19 Sep 2007, Rene Herman wrote:
> 
> I do feel larger blocksizes continue to make sense in general though. Packet
> writing on CD/DVD is a problem already today since the hardware needs 32K or
> 64K blocks and I'd expect to see more of these and similiar situations when
> flash gets (even) more popular which it sort of inevitably is going to be.

.. that's what scatter-gather exists for.

What's so hard with just realizing that physical memory isn't contiguous?

It's why we have MMU's. It's why we have scatter-gather. 

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Rene Herman

On 09/19/2007 05:50 AM, Linus Torvalds wrote:

> On Wed, 19 Sep 2007, Rene Herman wrote:
>
> > Well, not so sure about that. What if one of your expected uses for example is
> > video data storage -- lots of data, especially for multiple streams, and needs
> > still relatively fast machinery. Why would you care for the overhead af
> > _small_ blocks?
>
> .. so work with an extent-based filesystem instead.
>
> 16k blocks are total idiocy. If this wasn't about a "support legacy 
> customers", I think the whole patch-series has been a total waste of time.


Admittedly, extent-based might not be a particularly bad answer at least to 
the I/O side of the equation...


I do feel larger blocksizes continue to make sense in general though. Packet 
writing on CD/DVD is a problem already today since the hardware needs 32K or 
64K blocks and I'd expect to see more of these and similar situations when 
flash gets (even) more popular which it sort of inevitably is going to be.


Rene.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Wed, 19 Sep 2007, Rene Herman wrote:
> 
> Well, not so sure about that. What if one of your expected uses for example is
> video data storage -- lots of data, especially for multiple streams, and needs
> still relatively fast machinery. Why would you care for the overhead af
> _small_ blocks?

.. so work with an extent-based filesystem instead.

16k blocks are total idiocy. If this wasn't about a "support legacy 
customers", I think the whole patch-series has been a total waste of time.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Rene Herman

On 09/18/2007 09:44 PM, Linus Torvalds wrote:

> Nobody sane would *ever* argue for 16kB+ blocksizes in general.

Well, not so sure about that. What if one of your expected uses for example 
is video data storage -- lots of data, especially for multiple streams, and 
needs still relatively fast machinery. Why would you care for the overhead 
of _small_ blocks?

Okay, maybe that's covered in the "in general" but it's not extremely oddball 
either...


Rene.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Nathan Scott
On Tue, 2007-09-18 at 18:06 -0700, Linus Torvalds wrote:
> There is *no* valid reason for 16kB blocksizes unless you have legacy 
> issues.

That's not correct.

> The performance issues have nothing to do with the block-size, and 

We must be thinking of different performance issues.

> should be solvable by just making sure that your stupid "state of the
> art" 
> crap SCSI controller gets contiguous physical memory, which is best
> done 
> in the read-ahead code. 

SCSI controllers have nothing to do with improving ondisk layout, which
is the performance issue I've been referring to.

cheers.

--
Nathan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Nathan Scott
On Tue, 2007-09-18 at 12:44 -0700, Linus Torvalds wrote:
> This is not about performance. Never has been. It's about SGI wanting a 
> way out of their current 16kB mess.

Pass the crack pipe, Linus?

> The way to fix performance is to move to x86-64, and use 4kB pages and be 
> happy. However, the SGI people want a 16kB (and possibly bigger) 
> crap-option for their people who are (often _already_) running some 
> special case situation that nobody else cares about.

FWIW (and I hate to let reality get in the way of a good conspiracy) -
all SGI systems have always defaulted to using 4K blocksize filesystems;
there's very few customers who would use larger, especially as the Linux
kernel limitations in this area are well known.  There's no "16K mess"
that SGI is trying to clean up here (and SGI have offered both IA64 and
x86_64 systems for some time now, so not sure how you came up with that
whacko theory).

> It's not about "performance". If it was, they would never have used ia64

For SGI it really is about optimising ondisk layouts for some workloads
and large filesystems, and has nothing to do with IA64.  Read the paper
Dave sent out earlier, it's quite interesting.

For other people, like AntonA, who has also been asking for this
functionality literally for years (and ended up trying to do his own
thing inside NTFS IIRC) it's to be able to access existing filesystems
from other operating systems.  Here's a more recent discussion, I know
Anton had discussed it several times on fsdevel before this 2005 post
too:   http://oss.sgi.com/archives/xfs/2005-01/msg00126.html

Although I'm sure others exist, I've never worked on any platform other
than Linux that doesn't support filesystem block sizes larger than the
pagesize.  It's one thing to stick your head in the sand about the need
for this feature, it's another thing entirely to try to pass it off as an
"SGI mess", sorry.

I do entirely support the sentiment to stop this pissing match and get
on with fixing the problem though.

cheers.

--
Nathan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Wed, 19 Sep 2007, Nathan Scott wrote:
> 
> FWIW (and I hate to let reality get in the way of a good conspiracy) -
> all SGI systems have always defaulted to using 4K blocksize filesystems;

Yes. And I've been told that:

> there's very few customers who would use larger

.. who apparently would like to  move to x86-64. That was what people 
implied at the kernel summit.

>especially as the Linux
> kernel limitations in this area are well known.  There's no "16K mess"
> that SGI is trying to clean up here (and SGI have offered both IA64 and
> x86_64 systems for some time now, so not sure how you came up with that
> whacko theory).

Well, if that is the case, then I vote that we drop the whole patch-series 
entirely. It clearly has no reason for existing at all.

There is *no* valid reason for 16kB blocksizes unless you have legacy 
issues. The performance issues have nothing to do with the block-size, and 
should be solvable by just making sure that your stupid "state of the art" 
crap SCSI controller gets contiguous physical memory, which is best done 
in the read-ahead code.

So get your stories straight, people.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Christoph Lameter
On Tue, 18 Sep 2007, Nick Piggin wrote:

> > We can avoid all doubt in this patchset as well by adding support for
> > fallback to a vmalloced compound page.
> 
> How would you do a vmapped fallback in your patchset? How would
> you keep track of pages 2..N if they don't exist in the radix tree?

Through the vmalloc structures and through the conventions established for 
compound pages?

> What if they don't even exist in the kernel's linear mapping? It seems
> you would also require more special casing in the fault path and special
> casing in the block layer to do this.

Well yeah there is some sucky part about vmapping things (same as in yours,
possibly more in mine since it's general and not specific to the page 
cache). On the other hand a generic vcompound fallback will allow us to 
use the page allocator in many places where we currently have to use 
vmalloc because the allocations are too big. It will allow us to get rid 
of most of the vmalloc uses and thereby reduce TLB pressure somewhat.

The vcompound patchset is almost ready. Maybe bits and pieces may 
even help fsblock.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Christoph Lameter
On Tue, 18 Sep 2007, Nick Piggin wrote:

> On Tuesday 18 September 2007 08:00, Christoph Lameter wrote:
> > On Sun, 16 Sep 2007, Nick Piggin wrote:
> > > I don't know how it would prevent fragmentation from building up
> > > anyway. It's commonly the case that potentially unmovable objects
> > > are allowed to fill up all of ram (dentries, inodes, etc).
> >
> > Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from
> > ZONE_MOVABLE and thus the memory that can be allocated for them is
> > limited.
> 
> Why would ZONE_MOVABLE require that "movable objects should be moved
> out of the way for unmovable ones"? It never _has_ any unmovable objects in
> it. Quite obviously we were not talking about reserve zones.

This was a response to your statement that all of memory could be filled up
by unmovable objects, which cannot occur if the memory for unmovable objects
is limited. Not sure what you mean by reserves? Mel's reserves? The reserves
for unmovable objects established by ZONE_MOVABLE?


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Tue, 18 Sep 2007, Andrea Arcangeli wrote:
> 
> Many? I can't recall anything besides PF_MEMALLOC and the decision
> that the VM is oom.

*All* of the buddy bitmaps, *all* of the GFP_ATOMIC, *all* of the zone 
watermarks, everything that we depend on every single day, is in the end 
just about statistically workable.

We do 1- and 2-order allocations all the time, and we "know" they work. 
Yet Nick (and this whole *idiotic* thread) has all been about how they 
cannot work.

> In general every time reliability has a low priority than performance
> I've an hard time to enjoy it.

This is not about performance. Never has been. It's about SGI wanting a 
way out of their current 16kB mess.

The way to fix performance is to move to x86-64, and use 4kB pages and be 
happy. However, the SGI people want a 16kB (and possibly bigger) 
crap-option for their people who are (often _already_) running some 
special case situation that nobody else cares about.

It's not about "performance". If it was, they would never have used ia64 
in the first place.  It's about special-case users that do odd things.

Nobody sane would *ever* argue for 16kB+ blocksizes in general. 

Linus

PS. Yes, I realize that there's a lot of insane people out there. However, 
we generally don't do kernel design decisions based on them. But we can 
pat the insane users on the head and say "we won't guarantee it works, but 
if you eat your prozac, and don't bother us, go do your stupid things".
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Andrea Arcangeli
On Mon, Sep 17, 2007 at 12:56:07AM +0200, Goswin von Brederlow wrote:
> When has free ever given any useful "free" number? I can perfectly
> fine allocate another gigabyte of memory despite free saying 25MB. But
> that is because I know that the buffer/cached are not locked in.

Well, as you said you know that buffer/cached are not locked in. If
/proc/meminfo were rubbish like you seem to imply in the first
line, why would we ever bother to export that information and even
waste time writing a binary that parses it for admins?

> On the other hand 1GB can instantly vanish when I start a xen domain
> and anything relying on the free value would loose.

Actually you'd better check meminfo or free before starting a 1G Xen domain!

> The only sensible thing for an application concerned with swapping is
> to watch the swapping and then reduce itself. Not the amount
> free. Although I wish there were some kernel interface to get a
> pressure value of how valuable free pages would be right now. I would
> like that for fuse so a userspace filesystem can do caching without
> crippling the kernel.

Repeated drop caches + free can help.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Andrea Arcangeli
On Tue, Sep 18, 2007 at 11:30:17AM -0700, Linus Torvalds wrote:
> The fact is, *none* of those things are true. The VM doesn't guarantee 
> anything, and is already very much about statistics in many places. You 

Many? I can't recall anything besides PF_MEMALLOC and the decision
that the VM is oom. Those are the only two gray areas... the safety
margin is large enough that nobody notices the lack of a black-and-white
solution.

So instead of working to provide guarantees for the above two gray
spots, we're making everything weaker. That's the wrong direction as
far as I can tell, especially if we're going to mess up the common
code big time, in a backwards way, only for those few users of those
few I/O devices out there.

In general, every time reliability gets lower priority than performance
I have a hard time enjoying it.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Tue, 18 Sep 2007, Nick Piggin wrote:
> 
> ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer
> as the solution to the fragmentation problem? It would only have been funnier
> if you had said to reboot every so often when memory gets fragmented :)

Can we please stop this *idiotic* thread.

Nick, you and some others seem to be arguing based on a totally flawed 
base, namely:
 - we can guarantee anything at all in the VM
 - we even care about the 16kB blocksize
 - second-class citizenry is "bad"

The fact is, *none* of those things are true. The VM doesn't guarantee 
anything, and is already very much about statistics in many places. You 
seem to be arguing as if Christoph was introducing something new and 
unacceptable, when it's largely just more of the same.

And the fact is, nobody but SGI customers would ever want the 16kB 
blocksize. IOW - NONE OF THIS MATTERS!

Can you guys stop this inane thread already, or at least take it private 
between you guys, instead of forcing everybody else to listen in on your 
flamefest.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Nick Piggin
On Tuesday 18 September 2007 08:21, Christoph Lameter wrote:
> On Sun, 16 Sep 2007, Nick Piggin wrote:
> > > > So if you argue that vmap is a downside, then please tell me how you
> > > > consider the -ENOMEM of your approach to be better?
> > >
> > > That is again pretty undifferentiated. Are we talking about low page
> >
> > In general.
>
> There is no -ENOMEM approach. Lower order page allocation (<
> PAGE_ALLOC_COSTLY_ORDER) will reclaim and in the worst case the OOM killer
> will be activated.

ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer
as the solution to the fragmentation problem? It would only have been funnier
if you had said to reboot every so often when memory gets fragmented :)


> That is the nature of the failures that we saw early in 
> the year when this was first merged into mm.
>
> > > With the ZONE_MOVABLE you can remove the unmovable objects into a
> > > defined pool then higher order success rates become reasonable.
> >
> > OK, if you rely on reserve pools, then it is not 1st class support and
> > hence it is a non-solution to VM and IO scalability problems.
>
> ZONE_MOVABLE creates two memory pools in a machine. One of them is for
> movable and one for unmovable. That is in 2.6.23. So 2.6.23 has no first
> call support for order 0 pages?

What?


> > > > If, by special software layer, you mean the vmap/vunmap support in
> > > > fsblock, let's see... that's probably all of a hundred or two lines.
> > > > Contrast that with anti-fragmentation, lumpy reclaim, higher order
> > > > pagecache and its new special mmap layer... Hmm, seems like a no
> > > > brainer to me. You really still want to persue the "extra layer"
> > > > argument as a point against fsblock here?
> > >
> > > Yes sure. You code could not live without these approaches. Without the
> >
> > Actually: your code is the one that relies on higher order allocations.
> > Now you're trying to turn that into an argument against fsblock?
>
> fsblock also needs contiguous pages in order to have a beneficial
> effect that we seem to be looking for.

Keyword: relies.


> > > antifragmentation measures your fsblock code would not be very
> > > successful in getting the larger contiguous segments you need to
> > > improve performance.
> >
> > Completely wrong. *I* don't need to do any of that to improve performance.
> > Actually the VM is well tuned for order-0 pages, and so seeing as I have
> > sane hardware, 4K pagecache works beautifully for me.
>
> Sure the system works fine as is. Not sure why we would need fsblock then.

Large block filesystem.


> > > (There is no new mmap layer, the higher order pagecache is simply the
> > > old API with set_blocksize expanded).
> >
> > Yes you add another layer in the userspace mapping code to handle higher
> > order pagecache.
>
> That would imply a new API or something? I do not see it.

I was not implying a new API.


> > > Why: It is the same approach that you use.
> >
> > Again, rubbish.
>
> Ok the logical conclusion from the above is that you think your approach
> is rubbish 

The logical conclusion is that _they are not the same approach_!


> Is there some way you could cool down a bit? 

I'm not upset, but what you were saying was rubbish, plain and simple. Given
the number of times we've gone in circles, I have most likely already explained
this, several times, in a more polite manner.

And I know you're more than capable to understand at least the concept
behind fsblock, even without time to work through the exact details. What
are you expecting me to say, after all this back and forth, when you come
up with things like "[fsblock] is not a generic change but special to the
block layer", and then claim that fsblock is the same as allocating "virtual
compound pages" with vmalloc as a fallback for higher order allocs.

What I will say is that fsblock still has a relatively long way to go, so
maybe that's your reason for not looking at it. And yes, when fsblock is
in a better state to actually perform useful comparisons with, that will be a
much better time to have these debates. But in that case, just say so :)
then I can go away and do more constructive work on it instead of filling
people's inboxes.

I believe the fsblock approach is the best one, but it's not without problems
and complexities, so I'm quite ready for it to be proven incorrect, not
performant, or otherwise rejected.

I'm going on holiday for 2 weeks. I'll try to stay away from email, and
particularly this thread.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Nick Piggin
On Tuesday 18 September 2007 08:00, Christoph Lameter wrote:
> On Sun, 16 Sep 2007, Nick Piggin wrote:
> > I don't know how it would prevent fragmentation from building up
> > anyway. It's commonly the case that potentially unmovable objects
> > are allowed to fill up all of ram (dentries, inodes, etc).
>
> Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from
> ZONE_MOVABLE and thus the memory that can be allocated for them is
> limited.

Why would ZONE_MOVABLE require that "movable objects should be moved
out of the way for unmovable ones"? It never _has_ any unmovable objects in
it. Quite obviously we were not talking about reserve zones.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Nick Piggin
On Tuesday 18 September 2007 08:05, Christoph Lameter wrote:
> On Sun, 16 Sep 2007, Nick Piggin wrote:
> > > > fsblock doesn't need any of those hacks, of course.
> > >
> > > Nor does mine for the low orders that we are considering. For order >
> > > MAX_ORDER this is unavoidable since the page allocator cannot manage
> > > such large pages. It can be used for lower order if there are issues
> > > (that I have not seen yet).
> >
> > Or we can just avoid all doubt (and doesn't have arbitrary limitations
> > according to what you think might be reasonable or how well the
> > system actually behaves).
>
> We can avoid all doubt in this patchset as well by adding support for
> fallback to a vmalloced compound page.

How would you do a vmapped fallback in your patchset? How would
you keep track of pages 2..N if they don't exist in the radix tree?
What if they don't even exist in the kernel's linear mapping? It seems
you would also require more special casing in the fault path and special
casing in the block layer to do this.

It's not a trivial problem you can just brush away by handwaving. Let's
see... you could add another field in struct page to store the vmap
virtual address, and set a new flag to indicate that constituent
page N can be found via vmalloc_to_page(page->vaddr + N*PAGE_SIZE).
Then add more special casing to the block layer and fault path etc. to handle
these new non-contiguous compound pages. I guess you must have thought
about it much harder than the 2 minutes I just did then, so you must have a
much nicer solution...
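
To make that concrete, here is a minimal userspace sketch of the hypothetical
scheme described above; the names (vaddr, PG_vcompound, vcompound_page_address)
are invented purely for illustration and exist in neither the kernel nor
either patchset:

/* Hypothetical sketch only: models the "store the vmap address in the head
 * struct page" idea, outside the kernel. */
#include <stdio.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

struct page {
	unsigned long flags;
	void *vaddr;		/* head page remembers the vmap base address */
};

#define PG_vcompound (1UL << 0)	/* set when pages 1..N-1 are not contiguous */

/* Find the virtual address of constituent page N of a compound page. */
static void *vcompound_page_address(const struct page *head, size_t n)
{
	if (!(head->flags & PG_vcompound))
		return NULL;	/* physically contiguous case handled elsewhere */
	return (char *)head->vaddr + n * PAGE_SIZE;
}

int main(void)
{
	struct page head = {
		.flags = PG_vcompound,
		.vaddr = (void *)0xc9000000UL,	/* pretend vmap area address */
	};
	size_t n;

	for (n = 0; n < 4; n++)
		printf("constituent page %zu -> va %p\n",
		       n, vcompound_page_address(&head, n));
	return 0;
}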

But even so, you're still trying very hard to avoid touching the filesystems
or buffer layer while advocating instead to squeeze the complexity out into
the vm and block layer. I don't agree that is the right thing to do. Sure it
is _easier_, because we know the VM.

I don't argue that fsblock large block support is trivial. But you are first
asserting that it is too complicated and then trying to address one of the
issues it solves by introducing complexity elsewhere.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread David Chinner
On Tue, Sep 18, 2007 at 11:00:40AM +0100, Mel Gorman wrote:
> We still lack data on what sort of workloads really benefit from large
> blocks (assuming there are any that cannot also be solved by improving
> order-0).

No we don't. All workloads benefit from larger block sizes when
you've got a btree tracking 20 million inodes and a create has to
search that tree for a free inode.  The tree gets much wider and
hence we take fewer disk seeks to traverse the tree. Same for large
directories, btree's tracking free space, etc - everything goes
faster with a larger filesystem block size because we spent less
time doing metadata I/O.

And the other advantage is that sequential I/O speeds also tend to
increase with larger block sizes. e.g. XFS on an Altix (16k pages)
using 16k block size is about 20-25% faster on writes than 4k block
size. See the graphs at the top of page 12:

http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

The benefits are really about scalability, especially with terabyte sized
disks now on the market.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Jörn Engel
On Tue, 18 September 2007 11:00:40 +0100, Mel Gorman wrote:
> 
> We still lack data on what sort of workloads really benefit from large
> blocks

Compressing filesystems like jffs2 and logfs gain better compression
ratio with larger blocks.  Going from 4KiB to 64KiB gave somewhere
around 10% benefit iirc.  Testdata was a 128MiB qemu root filesystem.

Granted, the same could be achieved by adding some extra code and a few
bounce buffers to the filesystem.  How such a hack would perform I'd
prefer not to find out, though. :)
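
For a rough feel of why that happens, here is a small userspace sketch (the
zlib calls are real, but the input data and chunk sizes are made-up test
values, not anything measured from jffs2 or logfs); it compresses the same
buffer in 4KiB chunks and then in 64KiB chunks and prints the totals:

/* Build with: cc chunkdemo.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

static unsigned long compress_in_chunks(const unsigned char *buf, size_t len,
					 size_t chunk)
{
	unsigned long total = 0;
	size_t off;

	for (off = 0; off < len; off += chunk) {
		size_t n = len - off < chunk ? len - off : chunk;
		uLongf out_len = compressBound(n);
		unsigned char *out = malloc(out_len);

		if (!out || compress(out, &out_len, buf + off, n) != Z_OK)
			exit(1);
		total += out_len;	/* compressed size of this block */
		free(out);
	}
	return total;
}

int main(void)
{
	size_t len = 1 << 20, i;	/* 1 MiB of mildly redundant data */
	unsigned char *buf = malloc(len);

	for (i = 0; i < len; i++)
		buf[i] = (unsigned char)((i / 64) % 17 + i % 7);

	printf("4KiB blocks : %lu bytes\n", compress_in_chunks(buf, len, 4096));
	printf("64KiB blocks: %lu bytes\n", compress_in_chunks(buf, len, 65536));
	free(buf);
	return 0;
}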

Jörn

-- 
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug MacIlroy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Mel Gorman
On (17/09/07 15:00), Christoph Lameter didst pronounce:
> On Sun, 16 Sep 2007, Nick Piggin wrote:
> 
> > I don't know how it would prevent fragmentation from building up
> > anyway. It's commonly the case that potentially unmovable objects
> > are allowed to fill up all of ram (dentries, inodes, etc).
> 
> Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from 
> ZONE_MOVABLE and thus the memory that can be allocated for them is 
> limited.
> 

As Nick points out, having to configure something makes it a #2
solution. However, I at least am ok with that. ZONE_MOVABLE is a get-out
clause to be able to control fragmentation no matter what the workload is
as it gives hard guarantees. Even when ZONE_MOVABLE is replaced by some
mechanism in grouping pages by mobility to force a number of blocks to be
MIGRATE_MOVABLE_ONLY, the emergency option will still exist.
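
(For reference, the size of that movable pool is an explicit boot-time choice;
if I'm remembering the 2.6.23 boot parameters right, it looks something like
the following, with the sizes purely illustrative:

	kernelcore=2G    keep 2G usable for unmovable/kernel allocations,
	                 the rest of memory becomes ZONE_MOVABLE
	movablecore=6G   or specify the ZONE_MOVABLE size directly

which is exactly the "having to configure something" that Nick objects to.)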

We still lack data on what sort of workloads really benefit from large
blocks (assuming there are any that cannot also be solved by improving
order-0). With Christoph's approach + grouping pages by mobility +
ZONE_MOVABLE-if-it-screws-up, people can start collecting that data over the
course of the next few months while we're waiting for fsblock or software
pagesize to mature.

Do we really need to keep discussing this, as no new point has been made in a
while? Can we at least take out the non-contentious parts of Christoph's
patches such as the page cache macros and do something with them?

-- 
Mel "tired of typing" Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Tue, 18 Sep 2007, Nick Piggin wrote:
>
> ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer
> as the solution to the fragmentation problem? It would only have been funnier
> if you had said to reboot every so often when memory gets fragmented :)

Can we please stop this *idiotic* thread.

Nick, you and some others seem to be arguing based on a totally flawed 
base, namely:
 - we can guarantee anything at all in the VM
 - we even care about the 16kB blocksize
 - second-class citizenry is bad

The fact is, *none* of those things are true. The VM doesn't guarantee 
anything, and is already very much about statistics in many places. You 
seem to be arguing as if Christoph was introducing something new and 
unacceptable, when it's largely just more of the same.

And the fact is, nobody but SGI customers would ever want the 16kB 
blocksize. IOW - NONE OF THIS MATTERS!

Can you guys stop this inane thread already, or at least take it private 
between you guys, instead of forcing everybody else to listen in on your 
flamefest.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Andrea Arcangeli
On Tue, Sep 18, 2007 at 11:30:17AM -0700, Linus Torvalds wrote:
> The fact is, *none* of those things are true. The VM doesn't guarantee 
> anything, and is already very much about statistics in many places. You 

Many? I can't recall anything besides PF_MEMALLOC and the decision
that the VM is oom. Those are the only two gray areas... the safety
margin is large enough that nobody notices the lack of black-and-white
solution.

So instead of working to provide guarantees for the above two gray
spots, we're making everything weaker; that's the wrong direction as
far as I can tell, especially if we're going to mess up the common
code big time in a backwards way only for those few users of those few
I/O devices out there.

In general, every time reliability has a lower priority than performance,
I have a hard time enjoying it.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Andrea Arcangeli
On Mon, Sep 17, 2007 at 12:56:07AM +0200, Goswin von Brederlow wrote:
> When has free ever given any useful free number? I can perfectly
> fine allocate another gigabyte of memory despite free saying 25MB. But
> that is because I know that the buffer/cached are not locked in.

Well, as you said, you know that buffer/cached are not locked in. If
/proc/meminfo were rubbish like you seem to imply in the first
line, why would we ever bother to export that information and even
waste time writing a binary that parses it for admins?

> On the other hand 1GB can instantly vanish when I start a xen domain
> and anything relying on the free value would lose.

Actually you'd better check meminfo or free before starting a 1G Xen domain!!

> The only sensible thing for an application concerned with swapping is
> to watch the swapping and then reduce itself. Not the amount
> free. Although I wish there were some kernel interface to get a
> pressure value of how valuable free pages would be right now. I would
> like that for fuse so a userspace filesystem can do caching without
> crippling the kernel.

Repeated drop caches + free can help.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Tue, 18 Sep 2007, Andrea Arcangeli wrote:
> 
> Many? I can't recall anything besides PF_MEMALLOC and the decision
> that the VM is oom.

*All* of the buddy bitmaps, *all* of the GFP_ATOMIC, *all* of the zone 
watermarks, everything that we depend on every single day, is in the end 
just about statistically workable.

We do 1- and 2-order allocations all the time, and we know they work. 
Yet Nick (and this whole *idiotic* thread) has all been about how they 
cannot work.

> In general, every time reliability has a lower priority than performance,
> I have a hard time enjoying it.

This is not about performance. Never has been. It's about SGI wanting a 
way out of their current 16kB mess.

The way to fix performance is to move to x86-64, and use 4kB pages and be 
happy. However, the SGI people want a 16kB (and possibly bigger) 
crap-option for their people who are (often _already_) running some 
special case situation that nobody else cares about.

It's not about performance. If it was, they would never have used ia64 
in the first place.  It's about special-case users that do odd things.

Nobody sane would *ever* argue for 16kB+ blocksizes in general. 

Linus

PS. Yes, I realize that there's a lot of insane people out there. However, 
we generally don't do kernel design decisions based on them. But we can 
pat the insane users on the head and say "we won't guarantee it works, but 
if you eat your prozac, and don't bother us, go do your stupid things".
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Christoph Lameter
On Tue, 18 Sep 2007, Nick Piggin wrote:

> On Tuesday 18 September 2007 08:00, Christoph Lameter wrote:
> > On Sun, 16 Sep 2007, Nick Piggin wrote:
> > > I don't know how it would prevent fragmentation from building up
> > > anyway. It's commonly the case that potentially unmovable objects
> > > are allowed to fill up all of ram (dentries, inodes, etc).
> >
> > Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from
> > ZONE_MOVABLE and thus the memory that can be allocated for them is
> > limited.
>
> Why would ZONE_MOVABLE require that "movable objects should be moved
> out of the way for unmovable ones"? It never _has_ any unmovable objects in
> it. Quite obviously we were not talking about reserve zones.

This was a response to your statement "all of memory could be filled up by
unmovable objects". Which cannot occur if the memory for unmovable objects is
limited. Not sure what you mean by "reserves"? Mel's reserves? The reserves
for unmovable objects established by ZONE_MOVABLE?


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Christoph Lameter
On Tue, 18 Sep 2007, Nick Piggin wrote:

> > We can avoid all doubt in this patchset as well by adding support for
> > fallback to a vmalloced compound page.
> 
> How would you do a vmapped fallback in your patchset? How would
> you keep track of pages 2..N if they don't exist in the radix tree?

Through the vmalloc structures and through the conventions established for 
compound pages?

> What if they don't even exist in the kernel's linear mapping? It seems
> you would also require more special casing in the fault path and special
> casing in the block layer to do this.

Well yeah there is some sucky part about vmapping things (same as in yours,
possibly more in mine since it's general and not specific to the page 
cache). On the other hand a generic vcompound fallback will allow us to 
use the page allocator in many places where we currently have to use 
vmalloc because the allocations are too big. It will allow us to get rid 
of most of the vmalloc uses and thereby reduce TLB pressure somewhat.

The vcompound patchset is almost ready. Maybe bits and pieces may 
even help fsblock.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Wed, 19 Sep 2007, Nathan Scott wrote:
> 
> FWIW (and I hate to let reality get in the way of a good conspiracy) -
> all SGI systems have always defaulted to using 4K blocksize filesystems;

Yes. And I've been told that:

 "there's very few customers who would use larger"

.. who apparently would like to move to x86-64. That was what people 
implied at the kernel summit.

> especially as the Linux
> kernel limitations in this area are well known.  There's no 16K mess
> that SGI is trying to clean up here (and SGI have offered both IA64 and
> x86_64 systems for some time now, so not sure how you came up with that
> whacko theory).

Well, if that is the case, then I vote that we drop the whole patch-series 
entirely. It clearly has no reason for existing at all.

There is *no* valid reason for 16kB blocksizes unless you have legacy 
issues. The performance issues have nothing to do with the block-size, and 
should be solvable by just making sure that your stupid state of the art 
crap SCSI controller gets contiguous physical memory, which is best done 
in the read-ahead code.

So get your stories straight, people.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Nathan Scott
On Tue, 2007-09-18 at 12:44 -0700, Linus Torvalds wrote:
> This is not about performance. Never has been. It's about SGI wanting a 
> way out of their current 16kB mess.

Pass the crack pipe, Linus?

> The way to fix performance is to move to x86-64, and use 4kB pages and be 
> happy. However, the SGI people want a 16kB (and possibly bigger) 
> crap-option for their people who are (often _already_) running some 
> special case situation that nobody else cares about.

FWIW (and I hate to let reality get in the way of a good conspiracy) -
all SGI systems have always defaulted to using 4K blocksize filesystems;
there's very few customers who would use larger, especially as the Linux
kernel limitations in this area are well known.  There's no 16K mess
that SGI is trying to clean up here (and SGI have offered both IA64 and
x86_64 systems for some time now, so not sure how you came up with that
whacko theory).

> It's not about performance. If it was, they would never have used ia64

For SGI it really is about optimising ondisk layouts for some workloads
and large filesystems, and has nothing to do with IA64.  Read the paper
Dave sent out earlier, it's quite interesting.

For other people, like AntonA, who has also been asking for this
functionality literally for years (and ended up trying to do his own
thing inside NTFS IIRC) it's to be able to access existing filesystems
from other operating systems.  Here's a more recent discussion, I know
Anton had discussed it several times on fsdevel before this 2005 post
too:   http://oss.sgi.com/archives/xfs/2005-01/msg00126.html

Although I'm sure others exist, I've never worked on any platform other
than Linux that doesn't support filesystem block sizes larger than the
pagesize.  It's one thing to stick your head in the sand about the need
for this feature, it's another thing entirely to try to pass it off as an
SGI mess, sorry.

I do entirely support the sentiment to stop this pissing match and get
on with fixing the problem though.

cheers.

--
Nathan

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Nathan Scott
On Tue, 2007-09-18 at 18:06 -0700, Linus Torvalds wrote:
> There is *no* valid reason for 16kB blocksizes unless you have legacy 
> issues.

That's not correct.

> The performance issues have nothing to do with the block-size, and 

We must be thinking of different performance issues.

> should be solvable by just making sure that your stupid state of the art 
> crap SCSI controller gets contiguous physical memory, which is best done 
> in the read-ahead code. 

SCSI controllers have nothing to do with improving ondisk layout, which
is the performance issue I've been referring to.

cheers.

--
Nathan

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Rene Herman

On 09/18/2007 09:44 PM, Linus Torvalds wrote:


> Nobody sane would *ever* argue for 16kB+ blocksizes in general.


Well, not so sure about that. What if one of your expected uses for example 
is video data storage -- lots of data, especially for multiple streams, and 
needs still relatively fast machinery. Why would you care for the overhead 
of _small_ blocks?


Okay, maybe that's covered in the "in general" but it's not extremely oddball 
either...


Rene.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Wed, 19 Sep 2007, Rene Herman wrote:
>
> Well, not so sure about that. What if one of your expected uses for example is
> video data storage -- lots of data, especially for multiple streams, and needs
> still relatively fast machinery. Why would you care for the overhead of
> _small_ blocks?

.. so work with an extent-based filesystem instead.

16k blocks are total idiocy. If this wasn't about supporting legacy 
customers, I think the whole patch-series has been a total waste of time.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Rene Herman

On 09/19/2007 05:50 AM, Linus Torvalds wrote:


> On Wed, 19 Sep 2007, Rene Herman wrote:
>
> > Well, not so sure about that. What if one of your expected uses for example is
> > video data storage -- lots of data, especially for multiple streams, and needs
> > still relatively fast machinery. Why would you care for the overhead of
> > _small_ blocks?
>
> .. so work with an extent-based filesystem instead.
>
> 16k blocks are total idiocy. If this wasn't about supporting legacy 
> customers, I think the whole patch-series has been a total waste of time.


Admittedly, extent-based might not be a particularly bad answer at least to 
the I/O side of the equation...


I do feel larger blocksizes continue to make sense in general though. Packet 
writing on CD/DVD is a problem already today since the hardware needs 32K or 
64K blocks and I'd expect to see more of these and similar situations when 
flash gets (even) more popular which it sort of inevitably is going to be.


Rene.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Wed, 19 Sep 2007, Rene Herman wrote:
>
> I do feel larger blocksizes continue to make sense in general though. Packet
> writing on CD/DVD is a problem already today since the hardware needs 32K or
> 64K blocks and I'd expect to see more of these and similar situations when
> flash gets (even) more popular which it sort of inevitably is going to be.

.. that's what scatter-gather exists for.

What's so hard with just realizing that physical memory isn't contiguous?

It's why we have MMU's. It's why we have scatter-gather. 

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Rene Herman

On 09/19/2007 06:33 AM, Linus Torvalds wrote:


> On Wed, 19 Sep 2007, Rene Herman wrote:
>
> > I do feel larger blocksizes continue to make sense in general though. Packet
> > writing on CD/DVD is a problem already today since the hardware needs 32K or
> > 64K blocks and I'd expect to see more of these and similar situations when
> > flash gets (even) more popular which it sort of inevitably is going to be.
>
> .. that's what scatter-gather exists for.
>
> What's so hard with just realizing that physical memory isn't contiguous?
>
> It's why we have MMU's. It's why we have scatter-gather. 


So if I understood that right, you'd suggest dealing with devices with 
larger physical blocksizes at some level above the current block layer.


Not familiar enough with either block or fs to be able to argue that 
effectively...


Rene.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread David Chinner
On Tue, Sep 18, 2007 at 06:06:52PM -0700, Linus Torvalds wrote:
> > especially as the Linux
> > kernel limitations in this area are well known.  There's no 16K mess
> > that SGI is trying to clean up here (and SGI have offered both IA64 and
> > x86_64 systems for some time now, so not sure how you came up with that
> > whacko theory).
> 
> Well, if that is the case, then I vote that we drop the whole patch-series 
> entirely. It clearly has no reason for existing at all.
> 
> There is *no* valid reason for 16kB blocksizes unless you have legacy 
> issues.

Ok, let's step back for a moment and look at a basic, fundamental
constraint of disks - seek capacity. A decade ago, a terabyte of
filesystem had 30 disks behind it - a seek capacity of about
6000 seeks/s. Nowadays, that's a single disk with a seek
capacity of about 200/s. We're going *rapidly* backwards in
terms of seek capacity per terabyte of storage.

Now fill that terabyte of storage and index it in the most efficient
way - let's say btrees are used because lots of filesystems use
them. Hence the depth of the tree is roughly O((log n)/m) where m is
a factor of the btree block size.  Effectively, btree depth = seek
count on lookup of any object.

When the filesystem had a capacity of 6,000 seeks/s, we didn't
really care if the indexes used 4k blocks or not - the storage
subsystem had an excess of seek capacity to deal with
less-than-optimal indexing. Now we have over an order of magnitude
less seeks to expend in index operations *for the same amount of
data* so we are really starting to care about minimising the
number of seeks in our indexing mechanisms and allocations.

We can play tricks in index compaction to reduce the number of
interior nodes of the tree (like hashed indexing in the XFS and ext3
htree directories) but that still only gets us so far in reducing
seeks and doesn't help at all for tree traversals. That leaves us
with the btree block size as the only factor we can further vary to
reduce the depth of the tree. i.e. m.

So we want to increase the filesystem block size to improve the
efficiency of our indexing. That improvement in efficiency
translates directly into better performance on seek constrained
storage subsystems.
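
To put rough numbers on that relationship, here is a trivial userspace
calculation (the per-block fan-out figures are invented for the example, not
real XFS numbers) showing how tree depth - and hence seeks per lookup - falls
as the block size grows:

/* Build with: cc depth.c -lm */
#include <stdio.h>
#include <math.h>

static unsigned depth(double nobjects, double fanout)
{
	return (unsigned)ceil(log(nobjects) / log(fanout));
}

int main(void)
{
	double nobjects = 20e6;	/* e.g. 20 million inodes in a btree index */
	struct { const char *bsize; double fanout; } cfg[] = {
		{ "4k",  200 },		/* assumed keys per 4k block */
		{ "16k", 800 },
		{ "64k", 3200 },
	};
	unsigned i;

	for (i = 0; i < 3; i++)
		printf("%4s blocks: depth ~ %u levels (seeks per lookup)\n",
		       cfg[i].bsize, depth(nobjects, cfg[i].fanout));
	return 0;
}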

The problem is this: to alter the fundamental block size of the
filesystem we also need to alter the data block size and that is
exactly the piece that linux does not support right now.  So while
we have the capability to use large block sizes in certain
filesystems, we can't use that capability until the data path
supports it.

To summarise, large block size support in the filesystem is not
about legacy issues. It's about trying to cope with the rapid
expansion of storage capabilities of modern hardware where we have
to index much, much more data with a corresponding decrease in
the seek capability of the hardware.

> So get your stories straight, people.

Ok, so let's set the record straight. There were 3 justifications
for using *large pages* to *support* large filesystem block sizes.
The justifications for the variable order page cache with large
pages were:

1. little code change needed in the filesystems
- still true

2. Increased I/O sizes on 4k page machines (the SCSI
   controller problem)
- redundant thanks to Jens Axboe's quick work

3. avoiding the need for vmap() as it has great
   overhead and does not scale
- Nick is starting to work on that and has
   already had good results.

Everyone seems to be focussing on #2 as the entire justification for
large block sizes in filesystems and that this is an SGI problem.
Nothing could be further from the truth - the truth is that large
pages solved multiple problems in one go. We now have a different,
better solution to #2, so please, please stop using that as some
justification for claiming filesystems don't need large block sizes.

However, all this doesn't change the fact that we have a major storage
scalability crunch coming in the next few years. Disk capacity is
likely to continue to double every 12 months for the next 3 or 4
years. Large block size support is only one mechanism we need to
help cope with this trend.

The variable order page cache with large pages was a means to an end
- it's not the only solution to this problem and I'm extremely happy
to see that there is progress on multiple fronts.  That's the
strength of the Linux community showing through.  In the end, I
really don't care how we end up supporting large filesystem block
sizes in the page cache - all I care about is that we end up
supporting it as efficiently and generically as we possibly can.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Christoph Lameter
On Sun, 16 Sep 2007, Nick Piggin wrote:

> > > So if you argue that vmap is a downside, then please tell me how you
> > > consider the -ENOMEM of your approach to be better?
> >
> > That is again pretty undifferentiated. Are we talking about low page
> 
> In general.

There is no -ENOMEM approach. Lower order page allocation (< 
PAGE_ALLOC_COSTLY_ORDER) will reclaim and in the worst case the OOM killer 
will be activated. That is the nature of the failures that we saw early in 
the year when this was first merged into mm.

> > With the ZONE_MOVABLE you can remove the unmovable objects into a defined
> > pool then higher order success rates become reasonable.
> 
> OK, if you rely on reserve pools, then it is not 1st class support and hence
> it is a non-solution to VM and IO scalability problems.

ZONE_MOVABLE creates two memory pools in a machine. One of them is for 
movable and one for unmovable. That is in 2.6.23. So 2.6.23 has no first 
class support for order 0 pages?

> > > If, by special software layer, you mean the vmap/vunmap support in
> > > fsblock, let's see... that's probably all of a hundred or two lines.
> > > Contrast that with anti-fragmentation, lumpy reclaim, higher order
> > > pagecache and its new special mmap layer... Hmm, seems like a no
> > > brainer to me. You really still want to pursue the "extra layer"
> > > argument as a point against fsblock here?
> >
> > Yes sure. Your code could not live without these approaches. Without the
> 
> Actually: your code is the one that relies on higher order allocations. Now
> you're trying to turn that into an argument against fsblock?

fsblock also needs contiguous pages in order to have a beneficial 
effect that we seem to be looking for.

> > antifragmentation measures your fsblock code would not be very successful
> > in getting the larger contiguous segments you need to improve performance.
> 
> Completely wrong. *I* don't need to do any of that to improve performance.
> Actually the VM is well tuned for order-0 pages, and so seeing as I have
> sane hardware, 4K pagecache works beautifully for me.

Sure the system works fine as is. Not sure why we would need fsblock then.

> > (There is no new mmap layer, the higher order pagecache is simply the old
> > API with set_blocksize expanded).
> 
> Yes you add another layer in the userspace mapping code to handle higher
> order pagecache.

That would imply a new API or something? I do not see it.

> > Why: It is the same approach that you use.
> 
> Again, rubbish.

Ok the logical conclusion from the above is that you think your approach 
is rubbish. Is there some way you could cool down a bit?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Christoph Lameter
On Mon, 17 Sep 2007, Bernd Schmidt wrote:

> Christoph Lameter wrote:
> > True. That is why we want to limit the number of unmovable allocations and
> > that is why ZONE_MOVABLE exists to limit those. However, unmovable
> > allocations are already rare today. The overwhelming majority of allocations
> > are movable and reclaimable. You can see that f.e. by looking at
> > /proc/meminfo and see how high SUnreclaim: is (does not catch everything but
> > its a good indicator).
> 
> Just to inject another factor into the discussion, please remember that Linux
> also runs on nommu systems, where things like user space allocations are
> neither movable nor reclaimable.

Hmmm However, sorting of the allocations would result in avoiding 
defragmentation to some degree?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Christoph Lameter
On Sun, 16 Sep 2007, Nick Piggin wrote:

> > > fsblock doesn't need any of those hacks, of course.
> >
> > Nor does mine for the low orders that we are considering. For order >
> > MAX_ORDER this is unavoidable since the page allocator cannot manage such
> > large pages. It can be used for lower order if there are issues (that I
> > have not seen yet).
> 
> Or we can just avoid all doubt (and doesn't have arbitrary limitations
> according to what you think might be reasonable or how well the
> system actually behaves).

We can avoid all doubt in this patchset as well by adding support for 
fallback to a vmalloced compound page.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Christoph Lameter
On Sun, 16 Sep 2007, Jörn Engel wrote:

> I bet!  My (false) assumption was the same as Goswin's.  If non-movable
> pages are clearly separated from movable ones and will evict movable
> ones before polluting further mixed superpages, Nick's scenario would be
> nearly infinitely impossible.
> 
> Assumption doesn't reflect current code.  Enforcing this assumption
> would cost extra overhead.  The amount of effort to make Christoph's
> approach work reliably seems substantial and I have no idea whether it
> would be worth it.

My approach is based on Mel's code and is already working the way you 
describe. Page cache allocs are marked __GFP_MOVABLE by Mel's work.



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Christoph Lameter
On Sun, 16 Sep 2007, Nick Piggin wrote:

> I don't know how it would prevent fragmentation from building up
> anyway. It's commonly the case that potentially unmovable objects
> are allowed to fill up all of ram (dentries, inodes, etc).

Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from 
ZONE_MOVABLE and thus the memory that can be allocated for them is 
limited.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Nick Piggin
On Monday 17 September 2007 14:07, David Chinner wrote:
> On Fri, Sep 14, 2007 at 06:48:55AM +1000, Nick Piggin wrote:

> > OK, the vunmap batching code wipes your TLB flushing and IPIs off
> > the table. Diffstat below, but the TLB portions are here (besides that
> > _everything_ is probably lower due to less TLB misses caused by the
> > TLB flushing):
> >
> >   -170   -99.4% sn2_send_IPI
> >   -343  -100.0% sn_send_IPI_phys
> > -17911   -99.9% smp_call_function
> >
> > Total performance went up by 30% on a 64-way system (248 seconds to
> > 172 seconds to run parallel finds over different huge directories).
>
> Good start, Nick ;)

I didn't have the chance to test against a 16K directory block size to find
the "optimal" performance, but it is something I will do (I'm sure it will be
still a _lot_ faster than 172 seconds :)).


> >  23012  54790.5% _read_lock
> >   9427   329.0% __get_vm_area_node
> >   5792 0.0% __find_vm_area
> >   1590  53000.0% __vunmap
>
> 
>
> _read_lock? I though vmap() and vunmap() only took the vmlist_lock in
> write mode.

Yeah, it is a slight change... the lazy vunmap only has to take it for read.
In practice, I'm not sure that it helps a great deal because everything else
still takes the lock for write. But that explains why it's popped up in the
profile.
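
For anyone trying to follow the vmap discussion, the batching being measured
above can be modelled in a few lines of userspace C; the threshold and the
"flush" counter are invented stand-ins, and the real code defers and merges
the actual TLB flushes rather than bumping a counter:

#include <stdio.h>

#define LAZY_MAX 64		/* assumed batch threshold */

struct vrange { unsigned long start, end; };

static struct vrange lazy[LAZY_MAX];
static unsigned nlazy;
static unsigned flushes;	/* stands in for a global TLB flush + IPIs */

static void flush_lazy(void)
{
	if (!nlazy)
		return;
	flushes++;		/* one expensive operation covers the batch */
	nlazy = 0;
}

static void lazy_vunmap(unsigned long start, unsigned long end)
{
	if (nlazy == LAZY_MAX)	/* batch full: pay for the flush now */
		flush_lazy();
	lazy[nlazy].start = start;
	lazy[nlazy].end = end;
	nlazy++;
}

int main(void)
{
	unsigned i;

	for (i = 0; i < 1000; i++)
		lazy_vunmap(i * 0x10000UL, i * 0x10000UL + 0x4000UL);
	flush_lazy();
	printf("1000 vunmaps cost %u global flushes instead of 1000\n", flushes);
	return 0;
}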


> > Next I have some patches to scale the vmap locks and data
> > structures better, but they're not quite ready yet. This looks like it
> > should result in a further speedup of several times when combined
> > with the TLB flushing reductions here...
>
> Sounds promising - when you have patches that work, send them my
> way and I'll run some tests on them.

Still away from home (for the next 2 weeks), so I'll be going a bit slow :P
I'm thinking about various scalable locking schemes and I'll definitely
ping you when I've made a bit of progress.

Thanks,
Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Nick Piggin
On Saturday 15 September 2007 04:08, Christoph Lameter wrote:
> On Fri, 14 Sep 2007, Nick Piggin wrote:
> > However fsblock can do everything that higher order pagecache can
> > do in terms of avoiding vmap and giving contiguous memory to block
> > devices by opportunistically allocating higher orders of pages, and
> > falling back to vmap if they cannot be satisfied.
>
> fsblock is restricted to the page cache and cannot be used in other
> contexts where subsystems can benefit from larger linear memory.

Unless you believe higher order pagecache is not restricted to the
pagecache, can we please just stick on topic of fsblock vs higher
order pagecache?
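
As an aside for readers, the strategy quoted above (try a physically
contiguous higher-order allocation first, fall back to mapping individual
pages together) can be sketched as a toy userspace model; the failing
allocator, the names and the 50% failure rate are all invented for
illustration and are not fsblock code:

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

struct block {
	int nr_pages;
	int contiguous;		/* 1: one physically contiguous chunk      */
	void **pages;		/* else: single pages, "vmapped" together  */
	void *base;
};

static void *alloc_contig(int order)
{
	/* pretend fragmentation makes higher orders fail half the time */
	if (order > 0 && rand() % 2)
		return NULL;
	return malloc((size_t)PAGE_SIZE << order);
}

static struct block *alloc_block(int order)
{
	struct block *b = calloc(1, sizeof(*b));
	int i;

	b->nr_pages = 1 << order;
	b->base = alloc_contig(order);
	if (b->base) {
		b->contiguous = 1;
		return b;
	}
	/* fallback: gather order-0 pages, map them virtually contiguous */
	b->pages = calloc(b->nr_pages, sizeof(void *));
	for (i = 0; i < b->nr_pages; i++)
		b->pages[i] = malloc(PAGE_SIZE);
	b->base = b->pages[0];	/* stand-in for the vmap'd address */
	return b;
}

int main(void)
{
	int i, vmapped = 0;

	for (i = 0; i < 100; i++) {
		struct block *b = alloc_block(2);	/* 16k block on 4k pages */
		if (!b->contiguous)
			vmapped++;
		/* teardown omitted to keep the sketch short */
	}
	printf("%d of 100 blocks needed the vmap fallback\n", vmapped);
	return 0;
}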


> > So if you argue that vmap is a downside, then please tell me how you
> > consider the -ENOMEM of your approach to be better?
>
> That is again pretty undifferentiated. Are we talking about low page

In general.


> orders? There we will reclaim all of the reclaimable memory before getting
> an -ENOMEM. Given the quantities of pages on today's machines--a 1 TB machine
> has 256 million 4k pages--and the unmovable ratios we see today it
> would require a very strange setup to get an allocation failure while
> still be able to allocate order 0 pages.

So much handwaving. 1TB machines without "very strange setups"
(where very strange is something arbitrarily defined by you) I guess make
up 0.001% of Linux installations.


> With the ZONE_MOVABLE you can remove the unmovable objects into a defined
> pool then higher order success rates become reasonable.

OK, if you rely on reserve pools, then it is not 1st class support and hence
it is a non-solution to VM and IO scalability problems.


> > If, by special software layer, you mean the vmap/vunmap support in
> > fsblock, let's see... that's probably all of a hundred or two lines.
> > Contrast that with anti-fragmentation, lumpy reclaim, higher order
> > pagecache and its new special mmap layer... Hmm, seems like a no
> > brainer to me. You really still want to pursue the "extra layer"
> > argument as a point against fsblock here?
>
> Yes sure. Your code could not live without these approaches. Without the

Actually: your code is the one that relies on higher order allocations. Now
you're trying to turn that into an argument against fsblock?


> antifragmentation measures your fsblock code would not be very successful
> in getting the larger contiguous segments you need to improve performance.

Completely wrong. *I* don't need to do any of that to improve performance.
Actually the VM is well tuned for order-0 pages, and so seeing as I have
sane hardware, 4K pagecache works beautifully for me.

My point was this: fsblock does not preclude your using such measures to
fix the performance of your hardware that's broken with 4K pages. And it
would allow higher order allocations to fail gracefully.


> (There is no new mmap layer, the higher order pagecache is simply the old
> API with set_blocksize expanded).

Yes you add another layer in the userspace mapping code to handle higher
order pagecache.


> > Of course I wouldn't state that. On the contrary, I categorically state
> > that I have already solved it :)
>
> Well then I guess that you have not read the requirements...

I'm not talking about solving your problem of poor hardware. I'm talking
about supporting higher order block sizes in the kernel.


> > > Because it has already been rejected in another form and adds more
> >
> > You have rejected it. But they are bogus reasons, as I showed above.
>
> Thats not me. I am working on this because many of the filesystem people
> have repeatedly asked me to do this. I am no expert on filesystems.

Yes it is you. You brought up reasons in this thread and I said why I thought
they were bogus. If you think you can now forget about my shooting them
down by saying you aren't an expert in filesystems, then you shouldn't have
brought them up in the first place. Either stand by your arguments or don't.


> > You also describe some other real (if lesser) issues like number of page
> > structs to manage in the pagecache. But this is hardly enough to reject
> > my patch now... for every downside you can point out in my approach, I
> > can point out one in yours.
> >
> > - fsblock doesn't require changes to virtual memory layer
>
> Therefore it is not a generic change but special to the block layer. So
> other subsystems still have to deal with the single page issues on
> their own.

Rubbish. fsblock doesn't touch a single line in the block layer.


> > > Maybe we could get to something like a hybrid that avoids some of these
> > > issues? Add support so that something like a virtual compound page can be
> > > handled transparently in the filesystem layer with special casing if
> > > such a beast reaches the block layer?
> >
> > That's conceptually much worse, IMO.
>
> Why: It is the same approach that you use.

Again, rubbish.

> If it is barely ever used and 
> satisfies your concern then I am fine with it.

Right below this line is where I told you I am _not_ fine with it.

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Nick Piggin
On Monday 17 September 2007 04:13, Mel Gorman wrote:
> On (15/09/07 14:14), Goswin von Brederlow didst pronounce:

> > I keep coming back to the fact that movable objects should be moved
> > out of the way for unmovable ones. Anything else just allows
> > fragmentation to build up.
>
> This is easily achieved, just really really expensive because of the
> amount of copying that would have to take place. It would also compel
> that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely
> MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is
> a lot of free memory to keep around which is why fragmentation avoidance
> doesn't do it.

I don't know how it would prevent fragmentation from building up
anyway. It's commonly the case that potentially unmovable objects
are allowed to fill up all of ram (dentries, inodes, etc).

And of course,  if you craft your exploit nicely with help from higher
ordered unmovable memory (eg. mm structs or unix sockets), then
you don't even need to fill all memory with unmovables before you
can have them take over all groups.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Nick Piggin
On Saturday 15 September 2007 03:52, Christoph Lameter wrote:
> On Fri, 14 Sep 2007, Nick Piggin wrote:
> > > > [*] ok, this isn't quite true because if you can actually put a hard
> > > > limit on unmovable allocations then anti-frag will fundamentally help
> > > > -- get back to me on that when you get patches to move most of the
> > > > obvious ones.
> > >
> > > We have this hard limit using ZONE_MOVABLE in 2.6.23.
> >
> > So we're back to 2nd class support.
>
> 2nd class support for me means a feature that is not enabled by default
> but that can be enabled in order to increase performance. The 2nd class
> support is there because we are not yet sure about the maturity of the
> memory allocation methods.

I'd rather an approach that does not require all these hacks.


> > > Reserve pools as handled (by the not yet available) large page pool
> > > patches (which again has altogether another purpose) are not a limit.
> > > The reserve pools are used to provide a minimum of higher order pages
> > > that is not broken down in order to ensure that a minimum number of the
> > > desired order of pages is even available in your worst case scenario.
> > > Mainly I think that is needed during the period when memory
> > > defragmentation is still under development.
> >
> > fsblock doesn't need any of those hacks, of course.
>
> Nor does mine for the low orders that we are considering. For order >
> MAX_ORDER this is unavoidable since the page allocator cannot manage such
> large pages. It can be used for lower order if there are issues (that I
> have not seen yet).

Or we can just avoid all doubt (and not have arbitrary limitations imposed
according to what you think might be reasonable or how well the
system actually behaves).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Bernd Schmidt

Christoph Lameter wrote:
> True. That is why we want to limit the number of unmovable allocations and
> that is why ZONE_MOVABLE exists to limit those. However, unmovable
> allocations are already rare today. The overwhelming majority of
> allocations are movable and reclaimable. You can see that f.e. by looking
> at /proc/meminfo and see how high SUnreclaim: is (does not catch
> everything but it's a good indicator).


Just to inject another factor into the discussion, please remember that 
Linux also runs on nommu systems, where things like user space 
allocations are neither movable nor reclaimable.



Bernd

--
This footer brought to you by insane German lawmakers.
Analog Devices GmbH  Wilhelm-Wagenfeld-Str. 6  80807 Muenchen
Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368
Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Mel Gorman
On (16/09/07 23:31), Andrea Arcangeli didst pronounce:
> On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
> > The 16MB is the size of a hugepage, the size of interest as far as I am
> > concerned. Your idea makes sense for large block support, but much less
> > for huge pages because you are incurring a cost in the general case for
> > something that may not be used.
> 
> Sorry for the misunderstanding, I totally agree!
> 

Great. It's clear that we had different use cases in mind when we were
poking holes in each approach.

> > There is nothing to say that both can't be done. Raise the size of
> > order-0 for large block support and continue trying to group the block
> > to make hugepage allocations likely succeed during the lifetime of the
> > system.
> 
> Sure, I completely agree.
> 
> > At the risk of repeating, your approach will be adding a new and
> > significant dimension to the internal fragmentation problem where a
> > kernel allocation may fail because the larger order-0 pages are all
> > being pinned by userspace pages.
> 
> This is exactly correct, some memory will be wasted. It'll reach 0
> free memory more quickly depending on which kind of applications are
> being run.
> 

I look forward to seeing how you deal with it. When/if you get to trying
to move pages out of slabs, I suggest you take a look at the Memory
Compaction patches or the memory unplug patches for simple examples of
how to use  page migration.

> > It improves the probabilty of hugepage allocations working because the
> > blocks with slab pages can be targetted and cleared if necessary.
> 
> Agreed.
> 
> > That suggestion was aimed at the large block support more than
> > hugepages. It helps large blocks because we'll be allocating and freeing
> > as more or less the same size. It certainly is easier to set
> > slub_min_order to the same size as what is needed for large blocks in
> > the system than introducing the necessary mechanics to allocate
> > pagetable pages and userspace pages from slab.
> 
> Allocating userpages from slab in 4k chunks with a 64k PAGE_SIZE is
> really complex indeed. I'm not planning for that in the short term but
> it remains a possibility to make the kernel more generic. Perhaps it
> could be worth it...
> 

Perhaps.

> Allocating ptes from slab is fairly simple but I think it would be
> better to allocate ptes in PAGE_SIZE (64k) chunks and preallocate the
> nearby ptes in the per-task local pagetable tree, to reduce the number
> of locks taken and not to enter the slab at all for that.

It runs the risk of pinning up to 60K of data per task that is unusable for
any other purpose. On average, it'll be more like 32K but worth keeping
in mind.

> In fact we
> could allocate the 4 levels (or anyway more than one level) in one
> single alloc_pages(0) and track the leftovers in the mm (or similar).
> 
> > I'm not sure what you are getting at here. I think it would make more
> > sense if you said "when you read /proc/buddyinfo, you know the order-0
> > pages are really free for use with large blocks" with your approach.
> 
> I'm unsure who reads /proc/buddyinfo (that can change a lot and that
> is not very significant information if the vm can defrag well inside
> the reclaim code),

I read it although as you say, it's difficult to know what will happen if
you try and reclaim memory. It's why there is also a /proc/pagetypeinfo so
one can see the number of movable blocks that exist. That leads to better
guessing. In -mm, you can also see the number of mixed blocks but that will
not be available in mainline.

> but it's not much different and it's more about
> knowing the real meaning of /proc/meminfo, freeable (unmapped) cache,
> anon ram, and free memory.
> 
> The idea is that to succeed an mmap over a large xfs file with
> mlockall being invoked, those largepages must become available or
> it'll be oom despite there are still 512M free... I'm quite sure
> admins will get confused if they get oom killer invoked with lots of
> ram still free.
> 
> The overcommit feature will also break, just to make an example (so
> much for overcommit 2 guaranteeing -ENOMEM retvals instead of oom
> killage ;).
> 
> > All this aside, there is nothing mutually exclusive with what you are
> > proposing and what grouping pages by mobility does. Your stuff can exist
> > even if grouping pages by mobility is in place. In fact, it'll help us by
> > giving an important comparison point as grouping pages by mobility can be
> > trivially disabled with a one-liner for the purposes of testing. If your
> > approach is brought to being a general solution that also helps hugepage
> > allocation, then we can revisit grouping pages by mobility by comparing
> > kernels with it enabled and without.
> 
> Yes, I totally agree. It sounds worthwhile to have a good defrag logic
> in the VM. Even allocating the kernel stack in today's kernels should be
> able to benefit from your work. It's just comparing a fork() failure
> (something that will happen with ulimit -n too and that apps must be
> able to deal with) with an I/O failure that 

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Mel Gorman
On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
> [EMAIL PROTECTED] (Mel Gorman) writes:
> 
> > On (15/09/07 14:14), Goswin von Brederlow didst pronounce:
> >> Andrew Morton <[EMAIL PROTECTED]> writes:
> >> 
> >> > On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[EMAIL PROTECTED]> wrote:
> >> >
> >> >> While I agree with your concern, those numbers are quite silly.  The
> >> >> chances of 99.8% of pages being free and the remaining 0.2% being
> >> >> perfectly spread across all 2MB large_pages are lower than those of SHA1
> >> >> creating a collision.
> >> >
> >> > Actually it'd be pretty easy to craft an application which allocates
> >> > seven pages for pagecache, then one for <something>, then seven for
> >> > pagecache, then one for <something>, etc.
> >> >
> >> > I've had test apps which do that sort of thing accidentally.  The result
> >> > wasn't pretty.
> >> 
> >> Except that the application's 7 pages are movable and the <something>
> >> would have to be unmovable. And then they should not share the same
> >> memory region. At least they should never be allowed to interleave in
> >> such a pattern on a larger scale.
> >> 
> >
> > It is actually really easy to force regions to never share. At the
> > moment, there is a fallback list that determines a preference for what
> > block to mix.
> >
> > The reason why this isn't enforced is the cost of moving. On x86 and
> > x86_64, a block of interest is usually 2MB or 4MB. Clearing out one of
> > those pages to prevent any mixing would be bad enough. On PowerPC, it's
> > potentially 16MB. On IA64, it's 1GB.
> >
> > As this was fragmentation avoidance, not guarantees, the decision was
> > made to not strictly enforce the types of pages within a block as the
> > cost cannot be made back unless the system was making aggressive use of
> > large pages. This is not the case with Linux.
> 
> I don't say the group should never be mixed. The movable objects could
> be moved out on demand. If 64k get allocated then up to 64k get
> moved.

This type of action makes sense in the context of Andrea's approach and
large blocks. I don't think it makes sense today to do it in the general
case, at least not yet.

> That would reduce the impact as the kernel does not hang while
> it moves 2MB or even 1GB. It also allows objects to be freed and the
> space reused in the unmovable and mixed groups. There could also be a
> certain number or percentage of mixed groups allowed to further
> increase the chance of movable objects freeing themselves from mixed
> groups.
> 
> But when you already have say 10% of the ram in mixed groups then it
> is a sign that external fragmentation is happening and some time should
> be spent on moving movable objects.
> 

I'll play around with it on the side and see what sort of results I get.
I won't be pushing anything any time soon in relation to this though.
For now, I don't intend to fiddle more with grouping pages by mobility
for something that may or may not be of benefit to a feature that hasn't
been widely tested with what exists today.

> >> The only way a fragmentation catastrophe can be (provably) avoided is
> >> by having so few unmovable objects that size + max waste << ram
> >> size. The smaller the better. Allowing movable and unmovable objects
> >> to mix means that max waste goes way up. In your example waste would
> >> be 7*size. With a 2MB upper order limit it would be 511*size.
> >> 
> >> I keep coming back to the fact that movable objects should be moved
> >> out of the way for unmovable ones. Anything else just allows
> >> fragmentation to build up.
> >> 
> >
> > This is easily achieved, just really really expensive because of the
> > amount of copying that would have to take place. It would also compel
> > that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely
> > MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is
> > a lot of free memory to keep around which is why fragmentation avoidance
> > doesn't do it.
> 
> In your sample graphics you had 1152 groups. Reserving a few of those
> doesn't sound too bad.

No. On those systems, I would suggest setting min_free_kbytes to a
higher value instead. Doesn't work as well on IA-64.

> And how many migrate types do we talk about? So far we only had movable
> and unmovable.

Movable, unmovable, reclaimable and reserve in the current incarnation
of grouping pages by mobility.
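
For reference, the migrate types listed here look roughly like the following
in the -mm patches of the time (treat the exact names and comments as
illustrative rather than authoritative):

enum {
        MIGRATE_UNMOVABLE,      /* kernel allocations that cannot be moved   */
        MIGRATE_RECLAIMABLE,    /* e.g. dentry/inode caches, freeable later  */
        MIGRATE_MOVABLE,        /* user pages that can be migrated or paged  */
        MIGRATE_RESERVE,        /* blocks kept back around min_free_kbytes   */
        MIGRATE_TYPES
};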

> I would split unmovable into
> short term (caches, I/O pages) and long term (task structures,
> dentries).

Mostly done as you suggest already. Dentries are considered reclaimable, not
long-lived, and are grouped with things like inode caches for example.

> Reserving 6 groups for short term unmovable and long term
> unmovable would be 1% of ram in your situation.
> 

More groups = more cost although very easy to add them. A mixed type
used to exist but was removed again because it couldn't be proved to be
useful at the time.

> Maybe instead of reserving one could say that you can have up to 6
> groups of space

And if the groups are 1GB in size? I tried something like this already.
It didn't work out well at the time although I could revisit.

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Mel Gorman
On (17/09/07 00:48), Goswin von Brederlow didst pronounce:
> [EMAIL PROTECTED] (Mel Gorman) writes:
> 
> > On (16/09/07 17:08), Andrea Arcangeli didst pronounce:
> >> zooming in I see red pixels all over the squares mixed with green
> >> pixels in the same square. This is exactly what happens with the
> >> variable order page cache and that's why it provides zero guarantees
> >> in terms of how much ram is really "free" (free as in "available").
> >> 
> >
> > This picture is not grouping pages by mobility so that is hardly a
> > surprise. This picture is not running grouping pages by mobility. This is
> > what the normal kernel looks like. Look at the videos in
> > http://www.skynet.ie/~mel/anti-frag/2007-02-28 and see how list-based
> > compares to vanilla. These are from February when there was less control
> > over mixing blocks than there is today.
> >
> > In the current version mixing occurs in the lower blocks as much as possible
> > not the upper ones. So there are a number of mixed blocks but the number is
> > kept to a minimum.
> >
> > The number of mixed blocks could have been enforced as 0, but I felt it was
> > better in the general case to fragment rather than regress performance.
> > That may be different for large blocks where you will want to take the
> > enforcement steps.
> 
> I agree that 0 is a bad value. But so is infinity. There should be
> some mixing but not a lot. You say "kept to a minimum". Is that
> actively done or does it already happen by itself? Hopefully the latter,
> which would be just splendid.
> 

Happens by itself due to biasing mixing blocks at lower PFNs. The exact
number is unknown. We used to track it a long time ago but not any more.

> >> With config-page-shift mmap works on 4k chunks but it's always backed
> >> by 64k or any other largesize that you choosed at compile time. And if
> 
> But would mapping a random 4K page out of a file then consume 64k?
> That sounds like an awful lot of internal fragmentation. I hope the
> unaligned bits and pieces get put into a slab or something as you
> suggested previously.
> 

This is a possibility but Andrea seems confident he can handle it.

> >> the virtual alignment of mmap matches the physical alignment of the
> >> physical largepage and is >= PAGE_SIZE (software PAGE_SIZE I mean) we
> >> could use the 62nd bit of the pte to use a 64k tlb (if future cpus
> >> will allow that). Nick also suggested to still set all ptes equal to
> >> make life easier for the tlb miss microcode.
> 
> It is too bad that existing amd64 CPUs only allow such large physical
> pages. But it kind of makes sense to cut away a full level of page
> tables for the next bigger size each.
> 

Yep on both counts.

> >> > big you can make it. I don't think my system with 1GB ram would work
> >> > so well with 2MB order 0 pages. But I wasn't referring to that but to
> >> > the picture.
> >> 
> >> Sure! 2M is sure way excessive for a 1G system, 64k most certainly
> >> too, of course unless you're running a db or a multimedia streaming
> >> service, in which case it should be ideal.
> 
> rtorrent, Xemacs/gnus, bash, xterm, zsh, make, gcc, galeon and the
> occasional mplayer.
> 
> I would mostly be concerned how rtorrent's totally random access of
> mmapped files negatively impacts such a 64k page system.
> 

For what it's worth, the last allocation failure that occurred with
grouping pages by mobility was order-1 atomic failures for a wireless
network card when bittorrent was running. You're likely right in that
torrents will be an interesting workload in terms of fragmentation.

-- 
-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Mel Gorman
On (17/09/07 00:38), Goswin von Brederlow didst pronounce:
> [EMAIL PROTECTED] (Mel Gorman) writes:
> 
> > On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
> >> Mel Gorman <[EMAIL PROTECTED]> writes:
> >> 
> >> > On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
> >> >> Nick Piggin <[EMAIL PROTECTED]> writes:
> >> >> 
> >> >> > In my attack, I cause the kernel to allocate lots of unmovable
> >> >> > allocations and deplete movable groups. I theoretically then only
> >> >> > need to keep a small number (1/2^N) of these allocations around in
> >> >> > order to DoS a page allocation of order N.
> >> >> 
> >> >> I'm assuming that when an unmovable allocation hijacks a movable group
> >> >> any further unmovable alloc will evict movable objects out of that
> >> >> group before hijacking another one. right?
> >> >> 
> >> >
> >> > No eviction takes place. If an unmovable allocation gets placed in a
> >> > movable group, then steps are taken to ensure that future unmovable
> >> > allocations will take place in the same range (these decisions take
> >> > place in __rmqueue_fallback()). When choosing a movable block to
> >> > pollute, it will also choose the lowest possible block in PFN terms to
> >> > steal so that fragmentation pollution will be as confined as possible.
> >> > Evicting the unmovable pages would be one of those expensive steps that
> >> > have been avoided to date.
> >> 
> >> But then you can have all blocks filled with movable data, free 4K in
> >> one group, allocate 4K unmovable to take over the group, free 4k in
> >> the next group, take that group and so on. You can end with 4k
> >> unmovable in every 64k easily by accident.
> >> 
> >
> > As the mixing takes place at the lowest possible block, it's
> > exceptionally difficult to trigger this. Possible, but exceptionally
> > difficult.
> 
> Why is it difficult?
> 

Unless mlock() is being used, it is difficult to place the pages in the
way you suggest. Feel free to put together a test program that forces an
adverse fragmentation situation, it'll be useful in the future for comparing
reliability of any large block solution.
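
To put rough numbers on the attack quoted above (illustrative only): with 4k
pages and 2MB pageblocks, an order-9 allocation needs 512 contiguous free
pages, so pinning a single unmovable 4k page in every 2MB block -- about
1/512, or 0.2% of RAM, which is roughly 2MB on a 1GB machine -- is already
enough to make every order-9 allocation fail. That is the 1/2^N figure Nick
refers to.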

> When user space allocates memory wouldn't it get it contiguously?

Not unless it's using libhugetlbfs or it's very very early in the
lifetime of the system. Even then, another process faulting at the same
time will break it up.

> I mean
> that is one of the goals, to use larger contiguous allocations and map
> them with a single page table entry where possible, right?

It's a goal ultimately but not what we do right now. There have been
suggestions of allocating the contiguous pages optimistically in the
fault path and later promoting with an arch-specific hook but it's
vapourware right now.

> And then
> you can roughly predict where an munmap() would free a page.
> 
> Say the application does map a few GB of file, uses madvise to tell
> the kernel it needs a 2MB block (to get a contiguous 2MB chunk
> mapped), waits for it and then munmaps 4K in there. A 4k hole for some
> unmovable object to fill.

With grouping pages by mobility, that 4K hole would be on the movable
free lists. To get an unmovable allocation in there, the system needs to
be under considerable stress. Even just raising min_free_kbytes a bit
would make it considerably harder.

With the standard kernel, it would be easier to place as you suggest.

> If you can then trigger the creation of an
> unmovable object as well (stat some file?) and loop you will fill the
> ram quickly. Maybe it only works in 10% of cases but then you just do it 10
> times as often.
> 
> Over long times it could occur naturally. This is just to demonstrate
> it with malice.
> 

Try writing such a program. I'd be interested in reading it.
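
For what it's worth, a rough userspace sketch of the kind of program being
asked for here might look like the following. Everything in it is
illustrative: the 2MB pageblock assumption, the file paths, and the idea that
creating files is enough to provoke unmovable/reclaimable allocations into the
freshly punched holes are all assumptions, which is exactly what such a test
would have to verify:

/* Illustrative adverse-fragmentation tester; compile as a normal C program. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BLOCK   (2UL * 1024 * 1024)   /* assumed pageblock size */
#define NBLOCKS 256                   /* covers 512MB of address space */

int main(void)
{
        long pagesize = sysconf(_SC_PAGESIZE);
        char *map;
        char name[64];
        int i;

        /* Fill memory with movable (anonymous) pages... */
        map = mmap(NULL, NBLOCKS * BLOCK, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(map, 1, NBLOCKS * BLOCK);

        for (i = 0; i < NBLOCKS; i++) {
                /* ...punch a page-sized hole into each 2MB region... */
                munmap(map + i * BLOCK, pagesize);

                /* ...and try to provoke an unmovable or reclaimable
                 * allocation (dentries, inodes) while the hole is fresh. */
                snprintf(name, sizeof(name), "/tmp/frag-%d", i);
                close(open(name, O_CREAT | O_RDWR, 0600));
        }

        pause();        /* keep the rest mapped while inspecting buddyinfo */
        return 0;
}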

> > As I have stated repeatedly, the guarantees can be made but potential
> > hugepage allocation did not justify it. Large blocks might.
> >
> >> There should be a lot of pressure for movable objects to vacate a
> >> mixed group or you do get fragmentation catastrophes.
> >
> > We (Andy Whitcroft and I) did implement something like that. It hooked into
> > kswapd to clean mixed blocks. If the caller could do the cleaning, it
> > did the work instead of kswapd.
> 
> Do you have a graphic like
> http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg
> for that case?
> 

Not at the moment. I don't have any of the patches to accurately reflect
it up to date. The eviction patches have rotted to the point of
unusability.

> >> Looking at my
> >> little test program evicting movable objects from a mixed group should
> >> not be that expensive as it doesn't happen often.
> >
> > It happens regularly if the size of the block you need to keep clean is
> > lower than min_free_kbytes. In the case of hugepages, that was always
> > the case.
> 
> That assumes that the number of groups allocated for unmovable objects
> will continuously grow and shrink.

They do grow and shrink. The number of pagetables in use changes for
example.

> I'm assuming it will level off at some size for long times (hours)
> under normal operations.

It doesn't unless you assume the system remains in a steady state for its
lifetime.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Nick Piggin
On Monday 17 September 2007 14:07, David Chinner wrote:
> On Fri, Sep 14, 2007 at 06:48:55AM +1000, Nick Piggin wrote:
>
> > OK, the vunmap batching code wipes your TLB flushing and IPIs off
> > the table. Diffstat below, but the TLB portions are here (besides that
> > _everything_ is probably lower due to less TLB misses caused by the
> > TLB flushing):
> >
> >     -170   -99.4% sn2_send_IPI
> >     -343  -100.0% sn_send_IPI_phys
> >   -17911   -99.9% smp_call_function
> >
> > Total performance went up by 30% on a 64-way system (248 seconds to
> > 172 seconds to run parallel finds over different huge directories).
>
> Good start, Nick ;)

I didn't have the chance to test against a 16K directory block size to find
the optimal performance, but it is something I will do (I'm sure it will be
still a _lot_ faster than 172 seconds :)).

> >    23012  54790.5% _read_lock
> >     9427    329.0% __get_vm_area_node
> >     5792      0.0% __find_vm_area
> >     1590  53000.0% __vunmap
>
> _read_lock? I thought vmap() and vunmap() only took the vmlist_lock in
> write mode.

Yeah, it is a slight change... the lazy vunmap only has to take it for read.
In practice, I'm not sure that it helps a great deal because everything else
still takes the lock for write. But that explains why it's popped up in the
profile.

> > Next I have some patches to scale the vmap locks and data
> > structures better, but they're not quite ready yet. This looks like it
> > should result in a further speedup of several times when combined
> > with the TLB flushing reductions here...
>
> Sounds promising - when you have patches that work, send them my
> way and I'll run some tests on them.

Still away from home (for the next 2 weeks), so I'll be going a bit slow :P
I'm thinking about various scalable locking schemes and I'll definitely
ping you when I've made a bit of progress.

Thanks,
Nick
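
For readers following the vmap discussion, the lazy-unmap batching described
above roughly takes the following shape. This is a sketch written for
illustration, not Nick's actual patch; free_unmapped_area() is a made-up
placeholder and the threshold is arbitrary:

/*
 * Sketch: defer the TLB flush (and the cross-CPU IPIs it implies) that
 * vunmap would normally do per call, and do one flush per batch instead.
 * Assumes the caller has already torn down the kernel page tables for
 * the area; only the flush and the final freeing are deferred.
 */
static DEFINE_SPINLOCK(lazy_lock);
static LIST_HEAD(lazy_list);
static unsigned long lazy_nr_pages;

#define LAZY_FLUSH_THRESHOLD    1024    /* pages, arbitrary for the sketch */

struct lazy_area {
        struct list_head list;
        unsigned long addr;
        unsigned long nr_pages;
};

static void lazy_flush(void)
{
        struct lazy_area *area, *next;
        LIST_HEAD(local);

        spin_lock(&lazy_lock);
        list_splice_init(&lazy_list, &local);
        lazy_nr_pages = 0;
        spin_unlock(&lazy_lock);

        /* One global flush instead of one per vunmap call; this is what
         * removes the per-call IPIs from the profile. */
        flush_tlb_kernel_range(0, -1UL);

        list_for_each_entry_safe(area, next, &local, list)
                free_unmapped_area(area);       /* placeholder: free vm area + pages */
}

static void lazy_vunmap(struct lazy_area *area)
{
        bool flush;

        spin_lock(&lazy_lock);
        list_add(&area->list, &lazy_list);
        lazy_nr_pages += area->nr_pages;
        flush = lazy_nr_pages > LAZY_FLUSH_THRESHOLD;
        spin_unlock(&lazy_lock);

        if (flush)
                lazy_flush();
}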
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Christoph Lameter
On Sun, 16 Sep 2007, Nick Piggin wrote:

> I don't know how it would prevent fragmentation from building up
> anyway. It's commonly the case that potentially unmovable objects
> are allowed to fill up all of ram (dentries, inodes, etc).

Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from 
ZONE_MOVABLE and thus the memory that can be allocated for them is 
limited.
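
For completeness, and with the caveat that the sizes below are only an
example: ZONE_MOVABLE in 2.6.23 is set up with the kernelcore= (or
movablecore=) boot parameter, e.g.

        kernelcore=1G    (cap the memory usable for kernel/unmovable
                          allocations; the rest of RAM becomes ZONE_MOVABLE)
        movablecore=3G   (or specify the movable portion directly)

Unmovable allocations (slab, dentries, inodes) can then only come from the
non-movable zones, which is the hard limit being referred to here.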



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

