Re: [fuse-devel] [PATCH] fuse: make fuse daemon frozen along with kernel threads

2013-02-08 Thread Goswin von Brederlow
On Thu, Feb 07, 2013 at 10:59:19AM +0100, Miklos Szeredi wrote:
> [CC list restored]
> 
> On Thu, Feb 7, 2013 at 9:41 AM, Goswin von Brederlow  
> wrote:
> > On Wed, Feb 06, 2013 at 10:27:40AM +0100, Miklos Szeredi wrote:
> >> On Wed, Feb 6, 2013 at 2:11 AM, Li Fei  wrote:
> >> >
> >> > There is a well-known issue that freezing will fail if the fuse
> >> > daemon is frozen first with some requests still unhandled, as the
> >> > task using fuse is waiting for the response from the fuse daemon
> >> > and can't be frozen.
> >> >
> >> > To solve the issue above, freeze the fuse daemon after all user
> >> > space processes are frozen, during the kernel-thread freezing phase.
> >> > A PF_FREEZE_DAEMON flag is added to indicate that the current thread
> >> > is the fuse daemon,
> >>
> >> Okay and how do you know which thread, process or processes belong to
> >> the "fuse daemon"?
> >
> > Maybe I'm talking about the wrong thing, but isn't any process having
> > /dev/fuse open "the fuse daemon"? And that doesn't even cover cases
> > where one thread reads requests from the kernel and hands them to
> > worker threads (which do not have /dev/fuse open themselves). Or the
> > fuse request might need mysql to finish a query first.
> >
> > I believe figuring out what processes handle fuse requests is a losing
> > proposition.
> 
> Pretty much.
> 
> >
> >
> > Secondly, how does freezing the daemon later guarantee that it has
> > replied to all pending requests? Or why is leaving it thawed the right
> > decision?
> >
> > Instead the kernel side of fuse should be half frozen and stop sending
> > out new requests. Then it should wait for all pending requests to
> > complete. Then and only then can userspace processes be frozen safely.
> 
> The problem with that is one fuse filesystem might be calling into
> another.  Say two fuse filesystems are mounted at /mnt/A and /mnt/B.
> Process X starts a read request on /mnt/A. This is handled by process
> A, which in turn starts a read request on /mnt/B, which is handled by
> B.  If we freeze the system at the moment when A starts handling the
> request but hasn't yet sent the request to B then things will be stuck
> and the freezing will fail.
> 
> So the order should be:  Freeze the "topmost" fuse filesystems (A in
> the example) first and if all pending requests are completed then
> freeze the next ones in the hierarchy, etc...  This would work if this
> dependency between filesystems were known.  But it's not and again it
> looks like an impossible task.

What is topmost? The kernel can't know that for sure.
 
> The only way to *reliably* freeze fuse filesystems is to let them freeze
> even if there are outstanding requests.  But that's the hardest to
> implement, because then it needs to allow freezing of tasks waiting on
> i_mutex, for example, which is currently not possible.  But this is
> the only way I see that would not have unsolvable corner cases that
> crop up at the worst moment.
> 
> And yes, it would be prudent to  wait some time for pending requests
> to finish before freezing.  But it's not a requirement, things
> *should* work without that: suspending a machine is just like a very
> long pause by the CPU, as long as no hardware is involved.  And with
> fuse filesystems there is no hardware involved directly by definition.
> 
> But I'm open to ideas and at this stage I think even patches that
> improve the situation for the majority of the cases would be
> acceptable, since this is turning out to be a major problem for a lot
> of people.
> 
> Thanks,
> Miklos

For shutdown there is sendsigs.omit.d/ in userspace to avoid the
problem of halting/killing the processes of fuse filesystems (or other
services) prematurely. I guess something similar needs to be done for
freeze: the fuse filesystem has to tell the kernel what is going on.
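
Just to illustrate the mechanism: on Debian the shutdown scripts skip
every pid listed under sendsigs.omit.d/ when they kill everything
else, so a daemon only has to drop a pidfile there. A minimal sketch
in C (the directory and file name are only examples, the exact path
differs between releases):

#include <stdio.h>
#include <unistd.h>

/* Minimal sketch: the fuse daemon drops its pid into sendsigs.omit.d
 * so the shutdown scripts skip it when they kill everything else.
 * Directory and file name are illustrative only. */
static void register_sendsigs_omit(void)
{
        FILE *f = fopen("/run/sendsigs.omit.d/myfusefs.pid", "w");

        if (!f)
                return;                 /* not fatal, just no protection */
        fprintf(f, "%d\n", (int)getpid());
        fclose(f);
}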

MfG
Goswin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: error: Eeek! page_mapcount(page) went negative! (-1) with different process and kernels

2007-10-18 Thread Goswin von Brederlow
Arnaud Fontaine <[EMAIL PROTECTED]> writes:

>> "Dave" == Dave Jones <[EMAIL PROTECTED]> writes:
>
> Dave> Many of these that I've  seen have turned out to be a hardware
> Dave> problem.  Try running memtest86+  on that machine for a while.
> Dave>  It doesn't  catch all  problems, but  it will  highlight more
> Dave> common memory faults.
>
> Hello,
>
> We ran memtest86+ before production, it  was about one month ago. Do you
> think it could come from that anyway?

I find that a lot of the time memtest does not reveal an error. Only
when you combine multiple stress sources, or under random access, do
you get errors. For example compiling a kernel while doing heavy I/O
on the disk. But that might just be me. Errors are rather random
occurrences.

Compiling a kernel repeatedly, and several builds in parallel, is
usually a good test. If it sometimes fails to compile then it is
almost certainly a hardware error.

MfG
Goswin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-23 Thread Goswin von Brederlow
Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> On Mon, Sep 17, 2007 at 12:56:07AM +0200, Goswin von Brederlow wrote:
> >> When has free ever given any useful "free" number? I can perfectly
> >> well allocate another gigabyte of memory despite free saying 25MB.
> >> But that is because I know that the buffers/cached memory is not
> >> locked in.
>
> Well, as you said you know that buffer/cached are not locked in. If
> /proc/meminfo were rubbish like you seem to imply in the first
> line, why would we ever bother to export that information and even
> waste time writing a binary that parses it for admins?

As a user I know it because I didn't put a kernel source into /tmp. A
program can't reasonably know that.

> >> On the other hand 1GB can instantly vanish when I start a xen domain,
> >> and anything relying on the free value would lose.
>
> Actually you'd better check meminfo or free before starting a 1G Xen domain!!

Xen has its own memory pool and can quite aggressively reclaim memory
from dom0 when needed. I just meant to say that the number in
/proc/meminfo can change within a second, so it is not much use
knowing what it said a minute ago.

> >> The only sensible thing for an application concerned with swapping is
> >> to watch the swapping and then reduce itself, not the amount
> >> free. Although I wish there were some kernel interface to get a
> >> pressure value of how valuable free pages are right now. I would
> >> like that for fuse so a userspace filesystem can do caching without
> >> crippling the kernel.
>
> Repeated drop caches + free can help.

I would kill any program that does that to find out how much free ram
the system has.

MfG
Goswin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-23 Thread Goswin von Brederlow
[EMAIL PROTECTED] (Mel Gorman) writes:

> On (17/09/07 00:38), Goswin von Brederlow didst pronounce:
>> [EMAIL PROTECTED] (Mel Gorman) writes:
>> 
>> > On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
>> >> Mel Gorman <[EMAIL PROTECTED]> writes:
>> >> Looking at my
>> >> little test program evicting movable objects from a mixed group should
>> >> not be that expensive as it doesn't happen often.
>> >
>> > It happens regularly if the size of the block you need to keep clean is
>> > lower than min_free_kbytes. In the case of hugepages, that was always
>> > the case.
>> 
> >> That assumes that the number of groups allocated for unmovable objects
> >> will continuously grow and shrink.
>
> They do grow and shrink. The number of pagetables in use changes for
> example.

By whole groups' worth? And do full groups become free, get unmixed
and then get filled by movable objects?

>> I'm assuming it will level off at
>> some size for long times (hours) under normal operations.
>
> It doesn't unless you assume the system remains in a steady state for its
> lifetime. Things like updatedb tend to throw a spanner into the works.

Moved to cron.weekly here. And even normally it is only once a day. So
what if it starts moving some pages while updatedb runs? If it isn't
too braindead it will reclaim some dentries updatedb has created and
left behind for good. It should just cause the dentry cache to be
smaller, at no cost. I'm not calling that normal operation. That is a
once-a-day special. What I don't want is to spend 1% of cpu time
copying pages. That would be unacceptable. Copying 1000 pages per
updatedb run would be trivial on the other hand.

>> There should
>> be some buffering of a few groups to be held back in reserve when it
>> shrinks to prevent the scenario that the size is just at a group
>> boundary and always grows/shrinks by 1 group.
>> 
>
> And what size should this group be that all workloads function?

1 is enough to prevent jittering. If you don't hold a group back and
you are exactly at a group boundary, then alternately allocating and
freeing one page would result in a group allocation and freeing every
time. With one group in reserve you only get a group allocation or
freeing once a group's worth of change has happened.

This assumes that changing the type and/or state of a group is
expensive, taking time or locks or some such. Otherwise just let it
jitter.
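
A little toy model of that argument, with made-up numbers, that just
counts how often a group would change hands with and without one group
held in reserve:

#include <stdio.h>

#define GROUP_PAGES 512                 /* pages per group, illustrative */

struct pool {
        int pages;                      /* pages of this type in use */
        int groups;                     /* groups owned by this type */
        int reserve;                    /* spare groups to hold back */
        int conversions;                /* how often a group changed hands */
};

static void alloc_one(struct pool *p)
{
        if (p->pages == p->groups * GROUP_PAGES) {
                p->groups++;            /* grab another group */
                p->conversions++;
        }
        p->pages++;
}

static void free_one(struct pool *p)
{
        p->pages--;
        /* give a group back only once more than the reserve is free */
        if (p->groups * GROUP_PAGES - p->pages >= (p->reserve + 1) * GROUP_PAGES) {
                p->groups--;
                p->conversions++;
        }
}

int main(void)
{
        struct pool a = { GROUP_PAGES, 1, 0, 0 };       /* no reserve */
        struct pool b = { GROUP_PAGES, 1, 1, 0 };       /* one spare  */

        for (int i = 0; i < 1000; i++) {
                alloc_one(&a); free_one(&a);
                alloc_one(&b); free_one(&b);
        }
        printf("group conversions: no reserve=%d, one spare=%d\n",
               a.conversions, b.conversions);
        return 0;
}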

> >> >> So if
> >> >> you evict movable objects from a mixed group when needed, all the
> >> >> pagetable pages would end up in the same mixed group, slowly taking it
> >> >> over completely. No fragmentation at all. See how essential that
> >> >> feature is. :)
>> >> 
>> >
>> > To move pages, there must be enough blocks free. That is where
>> > min_free_kbytes had to come in. If you cared only about keeping 64KB
>> > chunks free, it makes sense but it didn't in the context of hugepages.
>> 
> >> I'm more concerned with keeping the little unmovable things out of the
> >> way. Those are the things that will fragment the memory and prevent
> >> any huge pages from being available even with moving other stuff out of
> >> the way.
>
> That's fair, just not cheap

That is the price you pay. To allocate 2MB of ram you have to have 2MB
of free ram or make that much free. There is no way around that. Moving
pages means that you can actually get those 2MB even if the price
is high, and that you have more choice in deciding what to throw away
or swap out. I would rather have a 2MB malloc take some time than have
it fail because the kernel doesn't feel like it.

>> Can you tell me how? I would like to do the same.
>> 
>
> They were generated using trace_allocmap kernel module in
> http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.81-rc2.tar.gz
> in combination with frag-display in the same package.  However, in the
> current version against current -mm's, it'll identify some movable pages
> wrong. Specifically, it will appear to be mixing movable pages with slab
> pages and it doesn't identify SLUB pages properly at all (SLUB came after
> the last revision of this tool). I need to bring an annotation patch up to
> date before it can generate the images correctly.

Thanks. I will test that out and see what I get on a few lustre
servers and clients. That is probably quite a different workload from
what you test.

MfG
Goswin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-23 Thread Goswin von Brederlow
[EMAIL PROTECTED] (Mel Gorman) writes:

> On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
>> But when you already have say 10% of the ram in mixed groups then it
>> is a sign the external fragmentation happens and some time should be
>> spend on moving movable objects.
>> 
>
> I'll play around with it on the side and see what sort of results I get.
> I won't be pushing anything any time soon in relation to this though.
> For now, I don't intend to fiddle more with grouping pages by mobility
> for something that may or may not be of benefit to a feature that hasn't
> been widely tested with what exists today.

I watched the videos you posted. A nice and quite clear improvement
with and without your logic. Kudos.

When you play around with it, may I suggest a change to the display of
the memory information? I think it would be valuable to use a Hilbert
curve to arrange the pages into pixels. Like this:

# #  0  3
# #
###  1  2

### ###  0 1 E F
  # #
### ###  3 2 D C
# #
# ### #  4 7 8 B
# # # #
### ###  5 6 9 A
+---+---+
# # # # |00 03 04 05|3A 3B 3C 3F|
# #   # #   # # |   |   |
### ### ### ### |01 02 07 06|39 38 3D 3E|
# # |   |   |
### ### ### ### |0E 0D 08 09|36 37 32 31|
# #   # #   # # |   |   |
# # # # |0F 0C 0B 0A|35 34 33 30|
# # +-+-+   |
### ### ### |10 11|1E 1F|20 21 2E 2F|
  # # # #   | | |   |
### ### ### ### |13 12|1D 1C|23 22 2D 2C|
# # # # | +-+   |
# ### # # ### # |14 17|18 1B|24 27 28 2B|
# # # # # # # # | | |   |
### ### ### ### |15 16|19 1A|25 26 29 2A|
+-+-+---+

I've drawn in allocations for 16, 8, 4, 5, 32 pages in that order in
the last one. The idea is to keep nearby pages visually near each
other in the output, filling an area instead of lines. Easier on the
eye. It also manages to always draw aligned order(x) blocks as squares
or rectangles (even or odd order).
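
For completeness, this is the usual iterative mapping from a position
d along the Hilbert curve to (x, y) pixel coordinates on an n by n
grid (n a power of two); it is the standard textbook algorithm,
nothing kernel specific:

/* Standard Hilbert curve walk: convert distance d along the curve into
 * (x, y) coordinates on an n x n grid, n being a power of two.  Each
 * page number fed through this gives its pixel position. */
void hilbert_d2xy(int n, int d, int *x, int *y)
{
        int rx, ry, t = d;

        *x = *y = 0;
        for (int s = 1; s < n; s *= 2) {
                rx = 1 & (t / 2);
                ry = 1 & (t ^ rx);
                if (ry == 0) {          /* rotate the quadrant */
                        if (rx == 1) {
                                *x = s - 1 - *x;
                                *y = s - 1 - *y;
                        }
                        int tmp = *x; *x = *y; *y = tmp;
                }
                *x += s * rx;
                *y += s * ry;
                t /= 4;
        }
}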

>> Maybe instead of reserving one could say that you can have up to 6
>> groups of space
>
> And if the groups are 1GB in size? I tried something like this already.
> It didn't work out well at the time although I could revisit.

You adjust the group size together with the total number of groups.
You would not use 1GB huge pages on a 2GB ram system. You could try
2MB groups. I think for most current systems we are lucky there: 2MB
groups fit the hardware support and give a large but not too large
number of groups to work with.

But you only need to stick to hardware-suitable group sizes for huge
tlb support, right? For better I/O and such you could have 512KB
groups if that size gives a reasonable number of groups in total.

> >> not used by unmovable objects before aggressive moving
> >> starts. I don't quite see why you NEED reserving as long as there is
> >> enough space free altogether in case something needs moving.
>
> hence, increase min_free_kbytes.

Which is different from reserving a full group as it does not count
fragmented space as lost.

>> 1 group
>> worth of space free might be plenty to move stuff too. Note that all
>> the virtual pages can be stuffed in every little free space there is
>> and reassembled by the MMU. There is no space lost there.
>> 
>
> What you suggest sounds similar to having a type MIGRATE_MIXED where you
> allocate from when the preferred lists are full. It became a sizing
> problem that never really worked out. As I said, I can try again.

Not really. I'm saying we should actively defragment mixed groups
during allocation, and always as little as possible, once a certain
level of external fragmentation is reached. A MIGRATE_MIXED type
sounds like giving up completely if things get bad enough. Compare it
to a cheap network switch going into hub mode when its ARP table runs
full. If you ever had that happen then you know how bad that is.

>> But until one tries one can't say.
>> 
>> MfG
>> Goswin
>> 
>> PS: How do allocations pick groups?
>
> Using GFP flags to identify the type.

That is the type of group, not which one.

>> Could one use the oldest group
>> dedicated to each MIGRATE_TYPE?
>
> Age is difficult to determine so probably not.

Put the uptime as a sort key into each group header on creation or
type change. Then sort the partially used groups by that key. A heap
will do fine and be fast.
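
Roughly like this, with qsort() standing in for the heap and an
invented group header just for illustration:

#include <stdlib.h>

/* Sketch: stamp each group with the uptime at creation or type change
 * and keep the partially used groups of a type sorted by that key, so
 * allocations always pick the oldest one first. */
struct group_hdr {
        unsigned long age_key;          /* uptime/jiffies at creation or type change */
        int migrate_type;
        int free_pages;
};

static int by_age(const void *a, const void *b)
{
        const struct group_hdr *ga = a, *gb = b;

        return (ga->age_key > gb->age_key) - (ga->age_key < gb->age_key);
}

/* allocations of this type would then prefer partial[0], the oldest group */
static void sort_partial_groups(struct group_hdr *partial, size_t n)
{
        qsort(partial, n, sizeof(*partial), by_age);
}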

>> Or lowest address for unmovable and
>> highest address for movable? Something to better keep the two out of
>> each other way.
>
> We bias the location of unmovable and reclaimable allocations already. It's
> not done for movable because it wasn't necessary (as they are easily
> reclaimed or moved anyway).

Except that is never done, so it doesn't count.

MfG
Goswin


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-22 Thread Goswin von Brederlow
[EMAIL PROTECTED] (Mel Gorman) writes:

> On (16/09/07 23:31), Andrea Arcangeli didst pronounce:
>> On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
>> Allocating ptes from slab is fairly simple but I think it would be
>> better to allocate ptes in PAGE_SIZE (64k) chunks and preallocate the
>> nearby ptes in the per-task local pagetable tree, to reduce the number
>> of locks taken and not to enter the slab at all for that.
>
> It runs the risk of pinning up to 60K of data per task that is unusable for
> any other purpose. On average, it'll be more like 32K but worth keeping
> in mind.

Two things to both of you respectively.

Why should we try to stay out of the pte slab? Isn't the slab exactly
made for this: efficiently handling a large number of equal-size
objects for quick allocation and deallocation? If it is a locking
problem then there should be a per-cpu cache of ptes. Say 0-32
ptes. If you run out you allocate 16 from the slab. When you overflow
you free 16 (which would give you your 64k allocations, but in
multiple objects).
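
Something along these lines, where pte_slab_alloc()/pte_slab_free()
are hypothetical stand-ins for whatever the real slab calls would be:

/* Sketch of the per-cpu pte cache: keep up to 32 pte pages per cpu,
 * refill 16 at a time from the slab when empty, flush 16 back when
 * full. */
void *pte_slab_alloc(void);             /* hypothetical slab interface */
void pte_slab_free(void *pte);

#define PTE_CACHE_MAX   32
#define PTE_CACHE_BATCH 16

struct pte_cache {
        void *page[PTE_CACHE_MAX];
        int count;
};

static void *pte_cache_get(struct pte_cache *c)
{
        if (c->count == 0)
                while (c->count < PTE_CACHE_BATCH)
                        c->page[c->count++] = pte_slab_alloc();
        return c->page[--c->count];
}

static void pte_cache_put(struct pte_cache *c, void *pte)
{
        if (c->count == PTE_CACHE_MAX)
                while (c->count > PTE_CACHE_MAX - PTE_CACHE_BATCH)
                        pte_slab_free(c->page[--c->count]);
        c->page[c->count++] = pte;
}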

As for the wastage: every page of ptes can map 2MB on amd64 (512
entries of 4KB each), 4MB on i386, 8MB on sparc (?). A 64k chunk of
pte pages would then cover 32MB, 64MB and 32MB (?) respectively. For
the sbrk() and mmap() usage from glibc malloc() that would be fine, as
they grow linearly and the mmap() call in glibc could be made to align
to those chunks. But for a program like rtorrent using mmap to bring
in chunks of a 4GB file this looks disastrous.

>> Infact we
>> could allocate the 4 levels (or anyway more than one level) in one
>> single alloc_pages(0) and track the leftovers in the mm (or similar).

Personally I would really go with a per-cpu cache. When mapping a
page, reserve 4 tables. Then you walk the tree and add entries as
needed. And last you release the 0-4 unused tables back to the cache.

MfG
Goswin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> You ignore one other bit, when "/usr/bin/free" says 1G is free, with
> config-page-shift it's free no matter what, same goes for not mlocked
> cache. With variable order page cache, /usr/bin/free becomes mostly a
> lie as long as there's no 4k fallback (like fsblock).

% free
             total       used       free     shared    buffers     cached
Mem:       1398784    1372956      25828          0     225224     321504
-/+ buffers/cache:     826228     572556
Swap:      1048568         20    1048548

When has free ever given any useful "free" number? I can perfectly
well allocate another gigabyte of memory despite free saying 25MB.
But that is because I know that the buffers/cached memory is not
locked in.

On the other hand 1GB can instantly vanish when I start a xen domain,
and anything relying on the free value would lose.


The only sensible thing for an application concerned with swapping is
to watch the swapping and then reduce itself, not the amount
free. Although I wish there were some kernel interface to get a
pressure value of how valuable free pages are right now. I would
like that for fuse so a userspace filesystem can do caching without
crippling the kernel.
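
Lacking such an interface, the closest a userspace filesystem can get
today is probably to sample the swap counters itself, something like
this sketch (watching pswpin/pswpout in /proc/vmstat is my idea, the
thresholds are up to the application):

#include <stdio.h>
#include <string.h>

/* Sketch of "watch the swapping": sample pswpin + pswpout from
 * /proc/vmstat and treat a growing value between samples as a hint
 * for the filesystem/application to shrink its own cache. */
static unsigned long swap_activity(void)
{
        unsigned long val, total = 0;
        char key[64];
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return 0;
        while (fscanf(f, "%63s %lu", key, &val) == 2)
                if (!strcmp(key, "pswpin") || !strcmp(key, "pswpout"))
                        total += val;
        fclose(f);
        return total;           /* caller compares with the previous sample */
}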

MfG
Goswin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
[EMAIL PROTECTED] (Mel Gorman) writes:

> On (16/09/07 17:08), Andrea Arcangeli didst pronounce:
>> zooming in I see red pixels all over the squares mized with green
>> pixels in the same square. This is exactly what happens with the
>> variable order page cache and that's why it provides zero guarantees
>> in terms of how much ram is really "free" (free as in "available").
>> 
>
> This picture is not running grouping pages by mobility, so that is
> hardly a surprise. This is what the normal kernel looks like. Look at
> the videos in http://www.skynet.ie/~mel/anti-frag/2007-02-28 and see
> how list-based compares to vanilla. These are from February when there
> was less control over mixing blocks than there is today.
>
> In the current version mixing occurs in the lower blocks as much as possible
> not the upper ones. So there are a number of mixed blocks but the number is
> kept to a minimum.
>
> The number of mixed blocks could have been enforced as 0, but I felt it was
> better in the general case to fragment rather than regress performance.
> That may be different for large blocks where you will want to take the
> enforcement steps.

I agree that 0 is a bad value. But so is infinity. There should be
some mixing but not a lot. You say "kept to a minimum". Is that
actively done or does it already happen by itself? Hopefully the
latter, which would be just splendid.

> >> With config-page-shift mmap works on 4k chunks but it's always backed
> >> by 64k or any other large size that you chose at compile time. And if

But would mapping a random 4K page out of a file then consume 64k?
That sounds like an awful lot of internal fragmentation. I hope the
unaligned bits and pieces get put into a slab or something as you
suggested previously.

>> the virtual alignment of mmap matches the physical alignment of the
>> physical largepage and is >= PAGE_SIZE (software PAGE_SIZE I mean) we
>> could use the 62nd bit of the pte to use a 64k tlb (if future cpus
>> will allow that). Nick also suggested to still set all ptes equal to
>> make life easier for the tlb miss microcode.

It is too bad that existing amd64 CPUs only allow such large physical
pages. But it kind of makes sense to cut away a full level of page
tables for each next bigger size.

>> > big you can make it. I don't think my system with 1GB ram would work
>> > so well with 2MB order 0 pages. But I wasn't refering to that but to
>> > the picture.
>> 
>> Sure! 2M is sure way excessive for a 1G system, 64k most certainly
>> too, of course unless you're running a db or a multimedia streaming
>> service, in which case it should be ideal.

rtorrent, Xemacs/gnus, bash, xterm, zsh, make, gcc, galeon and the
occasional mplayer.

I would mostly be concerned how rtorrent's totally random access of
mmapped files negatively impacts such a 64k page system.

MfG
 Goswin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
Linus Torvalds <[EMAIL PROTECTED]> writes:

> On Sun, 16 Sep 2007, Jörn Engel wrote:
>> 
>> My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc.
>> which are pinned for their entire lifetime and another for regular
>> files/inodes.  One could take a three-way approach and have
>> always-pinned, often-pinned and rarely-pinned.
>> 
>> We won't get never-pinned that way.
>
> That sounds pretty good. The problem, of course, is that most of the time, 
> the actual dentry allocation itself is done before you really know which 
> case the dentry will be in, and the natural place for actually giving the 
> dentry lifetime hint is *not* at "d_alloc()", but when we "instantiate" 
> it with d_add() or d_instantiate().
>
> But it turns out that most of the filesystems we care about already use a 
> special case of "d_add()" that *already* replaces the dentry with another 
> one in some cases: "d_splice_alias()".
>
> So I bet that if we just taught "d_splice_alias()" to look at the inode, 
> and based on the inode just re-allocate the dentry to some other slab 
> cache, we'd already handle a lot of the cases!
>
> And yes, you'd end up with the reallocation overhead quite often, but at 
> least it would now happen only when filling in a dentry, not in the 
> (*much* more critical) cached lookup path.
>
>   Linus

You would only get it for dentries that live long (or your prediction
is awfully wrong) and then the reallocation amortizes over time if you
will. :)

MfG
Goswin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
[EMAIL PROTECTED] (Mel Gorman) writes:

> On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
>> Mel Gorman <[EMAIL PROTECTED]> writes:
>> 
>> > On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
>> >> Nick Piggin <[EMAIL PROTECTED]> writes:
>> >> 
>> >> > In my attack, I cause the kernel to allocate lots of unmovable 
>> >> > allocations
>> >> > and deplete movable groups. I theoretically then only need to keep a
>> >> > small number (1/2^N) of these allocations around in order to DoS a
>> >> > page allocation of order N.
>> >> 
>> >> I'm assuming that when an unmovable allocation hijacks a movable group
>> >> any further unmovable alloc will evict movable objects out of that
>> >> group before hijacking another one. right?
>> >> 
>> >
>> > No eviction takes place. If an unmovable allocation gets placed in a
>> > movable group, then steps are taken to ensure that future unmovable
>> > allocations will take place in the same range (these decisions take
>> > place in __rmqueue_fallback()). When choosing a movable block to
>> > pollute, it will also choose the lowest possible block in PFN terms to
>> > steal so that fragmentation pollution will be as confined as possible.
>> > Evicting the unmovable pages would be one of those expensive steps that
>> > have been avoided to date.
>> 
> >> But then you can have all blocks filled with movable data, free 4K in
> >> one group, allocate 4K unmovable to take over the group, free 4k in
> >> the next group, take that group and so on. You can end up with 4k
> >> unmovable in every 64k easily by accident.
>> 
>
> As the mixing takes place at the lowest possible block, it's
> exceptionally difficult to trigger this. Possible, but exceptionally
> difficult.

Why is it difficult?

When user space allocates memory wouldn't it get it contiguously? I
mean that is one of the goals, to use larger contiguous allocations
and map them with a single page table entry where possible, right? And
then you can roughly predict where an munmap() would free a page.

Say the application maps a few GB of a file, uses madvise to tell
the kernel it needs a 2MB block (to get a contiguous 2MB chunk
mapped), waits for it and then munmaps 4K in there. A 4k hole for some
unmovable object to fill. If you can then trigger the creation of an
unmovable object as well (stat some file?) and loop, you will fill the
ram quickly. Maybe it only works 10% of the time, but then you just do
it 10 times as often.

Over long times it could occur naturally. This is just to demonstrate
it with malice.
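
A rough userspace sketch of that loop (file name and sizes are made
up, and whether the unmovable allocation really lands in the freshly
freed page is of course up to the allocator):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch of the scenario above: map a big file, fault in a 2MB chunk,
 * punch a 4k hole into the mapping and immediately force an unmovable
 * kernel allocation (a dentry via stat()) that may land in the freed
 * page. */
int main(void)
{
        size_t len = 1UL << 30;                 /* 1GB of some big file */
        int fd = open("bigfile", O_RDONLY);
        char *p;
        struct stat st;

        if (fd < 0)
                return 1;
        p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        for (size_t off = 0; off + (2UL << 20) <= len; off += 2UL << 20) {
                madvise(p + off, 2UL << 20, MADV_WILLNEED); /* ask for the chunk  */
                volatile char c = p[off];       /* touch it                       */
                (void)c;
                munmap(p + off, 4096);          /* leave a 4k hole behind         */
                stat("/etc/hostname", &st);     /* trigger an unmovable alloc     */
        }
        return 0;
}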

> As I have stated repeatedly, the guarantees can be made but potential
> hugepage allocation did not justify it. Large blocks might.
>
> >> There should be a lot of pressure for movable objects to vacate a
> >> mixed group or you do get fragmentation catastrophes.
>
> We (Andy Whitcroft and I) did implement something like that. It hooked into
> kswapd to clean mixed blocks. If the caller could do the cleaning, it
> did the work instead of kswapd.

Do you have a graphic like
http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg
for that case?

>> Looking at my
>> little test program evicting movable objects from a mixed group should
>> not be that expensive as it doesn't happen often.
>
> It happens regularly if the size of the block you need to keep clean is
> lower than min_free_kbytes. In the case of hugepages, that was always
> the case.

That assumes that the number of groups allocated for unmovable objects
will continuously grow and shrink. I'm assuming it will level off at
some size for long times (hours) under normal operations. There should
be some buffering of a few groups to be held back in reserve when it
shrinks to prevent the scenario that the size is just at a group
boundary and always grows/shrinks by 1 group.

>> The cost of it
>> should be freeing some pages (or finding free ones in a movable group)
>> and then memcpy.
>
> Freeing pages is not cheap. Copying pages is cheaper but not cheap.

To copy you need a free page as destination. That's all I
meant. Hopefully there will always be a free one, and the actual
freeing is done asynchronously from the copying.

> >> So if
> >> you evict movable objects from a mixed group when needed, all the
> >> pagetable pages would end up in the same mixed group, slowly taking it
> >> over completely. No fragmentation at all. See how essential that
> >> feature is. :)
>> 
>
> To move pages, there must be enough blocks free. That is where
> min_free_kbytes had to come in. If you cared only about keeping 64KB
>

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
Jörn Engel <[EMAIL PROTECTED]> writes:

> On Sun, 16 September 2007 00:30:32 +0200, Andrea Arcangeli wrote:
>> 
>> Movable? I rather assume all slab allocations aren't movable. Then
>> slab defrag can try to tackle on users like dcache and inodes. Keep in
>> mind that with the exception of updatedb, those inodes/dentries will
>> be pinned and you won't move them, which is why I prefer to consider
>> them not movable too... since there's no guarantee they are.
>
> I have been toying with the idea of having separate caches for pinned
> and movable dentries.  Downside of such a patch would be the number of
> memcpy() operations when moving dentries from one cache to the other.
> Upside is that a fair amount of slab cache can be made movable.
> memcpy() is still faster than reading an object from disk.

How probable is it that the dentry is needed again? If you copy it and
it is not needed then you wasted time. If you throw it out and it is
needed then you wasted time too. Depending on the probability, one of
the two is cheaper overall. Ideally I would throw away dentries that
haven't been accessed recently and copy recently used ones.

How much of a system's ram is spent on dentries? How much on task
structures? Does anyone have some stats on that? If it is <10% of the
total ram combined then I don't see much point in moving them. Just
keep them out of the way of user memory so the buddy system can work
effectively.

> Most likely the current reaction to such a patch would be to shoot it
> down due to overhead, so I didn't pursue it.  All I have is an old patch
> to separate never-cached from possibly-cached dentries.  It will
> increase the odds of freeing a slab, but provide no guarantee.
>
> But the point here is: dentries/inodes can be made movable if there are
> clear advantages to it.  Maybe they should?
>
> Jörn

MfG
Goswin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
[EMAIL PROTECTED] (Mel Gorman) writes:

> On (15/09/07 14:14), Goswin von Brederlow didst pronounce:
>> Andrew Morton <[EMAIL PROTECTED]> writes:
>> 
>> > On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[EMAIL PROTECTED]> wrote:
>> >
>> >> While I agree with your concern, those numbers are quite silly.  The
>> >> chances of 99.8% of pages being free and the remaining 0.2% being
>> >> perfectly spread across all 2MB large_pages are lower than those of SHA1
>> >> creating a collision.
>> >
>> > Actually it'd be pretty easy to craft an application which allocates seven
>> > pages for pagecache, then one for , then seven for pagecache, 
>> > then
>> > one for , etc.
>> >
>> > I've had test apps which do that sort of thing accidentally.  The result
>> > wasn't pretty.
>> 
> >> Except that the application's 7 pages are movable and the 
>> would have to be unmovable. And then they should not share the same
>> memory region. At least they should never be allowed to interleave in
>> such a pattern on a larger scale.
>> 
>
> It is actually really easy to force regions to never share. At the
> moment, there is a fallback list that determines a preference for what
> block to mix.
>
> The reason why this isn't enforced is the cost of moving. On x86 and
> x86_64, a block of interest is usually 2MB or 4MB. Clearing out one of
> those pages to prevent any mixing would be bad enough. On PowerPC, it's
> potentially 16MB. On IA64, it's 1GB.
>
> As this was fragmentation avoidance, not guarantees, the decision was
> made to not strictly enforce the types of pages within a block as the
> cost cannot be made back unless the system was making aggressive use of
> large pages. This is not the case with Linux.

I don't say the groups should never be mixed. The movable objects could
be moved out on demand. If 64k gets allocated then up to 64k gets
moved. That would reduce the impact as the kernel does not hang while
it moves 2MB or even 1GB. It also allows objects to be freed and the
space reused in the unmovable and mixed groups. There could also be a
certain number or percentage of mixed groups allowed, to further
increase the chance of movable objects freeing themselves from mixed
groups.

But when you already have say 10% of the ram in mixed groups then it
is a sign that external fragmentation is happening and some time should
be spent on moving movable objects.

> >> The only way a fragmentation catastrophe can be (provably) avoided is
> >> by having so few unmovable objects that size + max waste << ram
> >> size. The smaller the better. Allowing movable and unmovable objects
> >> to mix means that max waste goes way up. In your example waste would
> >> be 7*size. With a 2MB upper order limit it would be 511*size.
>> 
>> I keep coming back to the fact that movable objects should be moved
>> out of the way for unmovable ones. Anything else just allows
>> fragmentation to build up.
>> 
>
> This is easily achieved, just really really expensive because of the
> amount of copying that would have to take place. It would also compel
> that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely
> MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is
> a lot of free memory to keep around which is why fragmentation avoidance
> doesn't do it.

In your sample graphics you had 1152 groups. Reserving a few of those
doesn't sound too bad. And how many migrate types are we talking
about? So far we only had movable and unmovable. I would split
unmovable into short term (caches, I/O pages) and long term (task
structures, dentries). Reserving 6 groups each for short-term and
long-term unmovable would be about 1% of ram in your situation.

Maybe instead of reserving one could say that you can have up to 6
groups of space not used by unmovable objects before aggressive moving
starts. I don't quite see why you NEED reserving as long as there is
enough space free altogether in case something needs moving. 1 group's
worth of space free might be plenty to move stuff to. Note that all
the virtual pages can be stuffed into every little free space there is
and reassembled by the MMU. There is no space lost there.


But until one tries one can't say.

MfG
Goswin

PS: How do allocations pick groups? Could one use the oldest group
dedicated to each MIGRATE_TYPE? Or lowest address for unmovable and
highest address for movable? Something to better keep the two out of
each other's way.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> On Sat, Sep 15, 2007 at 10:14:44PM +0200, Goswin von Brederlow wrote:
>> - Userspace allocates a lot of memory in those slabs.
>
> If with slabs you mean slab/slub, I can't follow, there has never been
> a single byte of userland memory allocated there since ever the slab
> existed in linux.

This and other comments in your reply show me that you completely
misunderstood what I was talking about.

Look at
http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg

The red dots (pinned) are dentries, page tables, kernel stacks,
whatever kernel stuff, right?

The green dots (movable) are mostly userspace pages being mapped
there, right?

What I was referring to is that because movable objects (green dots)
aren't moved out of a mixed group (the boxes) when some unmovable
object needs space, all the groups become mixed over time. That means
the unmovable objects are spread out over all the ram and the buddy
system can't recombine regions when unmovable objects free them. There
will nearly always be some movable objects in the other buddy. The
system of having unmovable and movable groups breaks down and becomes
useless.
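
As a toy illustration of that point (not kernel code, just a model with
64k blocks of 16 4k pages and one leftover object pinned per block):

/* Toy model: RAM as an array of 4k pages grouped into 64k blocks (16 pages
 * each).  Pin one page per block and count how many whole blocks could
 * still be handed out.  Purely illustrative. */
#include <stdio.h>
#include <string.h>

#define PAGES_PER_BLOCK 16          /* 64k block = 16 x 4k pages */
#define NBLOCKS         1024        /* 64MB of toy RAM           */

static char used[NBLOCKS * PAGES_PER_BLOCK];

int main(void)
{
    memset(used, 0, sizeof(used));
    for (int b = 0; b < NBLOCKS; b++)
        used[b * PAGES_PER_BLOCK + 7] = 1;   /* one pinned page per block */

    int free_pages = 0, free_blocks = 0;
    for (int b = 0; b < NBLOCKS; b++) {
        int block_free = 1;
        for (int p = 0; p < PAGES_PER_BLOCK; p++) {
            if (used[b * PAGES_PER_BLOCK + p])
                block_free = 0;
            else
                free_pages++;
        }
        free_blocks += block_free;
    }
    printf("%.1f%% of pages free, but %d of %d 64k blocks available\n",
           100.0 * free_pages / (NBLOCKS * PAGES_PER_BLOCK),
           free_blocks, NBLOCKS);
    return 0;
}
/* Prints: 93.8% of pages free, but 0 of 1024 64k blocks available */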


I'm assuming here that we want the possibility of larger order pages
for unmovable objects (large contiguous regions for DMA for example)
than the smallest order user space gets (or any movable object). If
mmap() still works on 4k page boundaries then those will fragment all
regions into 4k chunks in the worst case.

Obviously if userspace has a minimum order of 64k chunks then it will
never break any region smaller than 64k chunks and will never cause a
fragmentation catastrophe. I know that is very roughly your approach
(make order 0 bigger), and I like it, but it has some limits as to how
big you can make it. I don't think my system with 1GB ram would work
so well with 2MB order 0 pages. But I wasn't referring to that but to
the picture.

MfG
Goswin

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
[EMAIL PROTECTED] (Mel Gorman) writes:

 On (15/09/07 14:14), Goswin von Brederlow didst pronounce:
 Andrew Morton [EMAIL PROTECTED] writes:
 
  On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel [EMAIL PROTECTED] wrote:
 
  While I agree with your concern, those numbers are quite silly.  The
  chances of 99.8% of pages being free and the remaining 0.2% being
  perfectly spread across all 2MB large_pages are lower than those of SHA1
  creating a collision.
 
  Actually it'd be pretty easy to craft an application which allocates seven
  pages for pagecache, then one for something, then seven for pagecache,
  then one for something, etc.
 
  I've had test apps which do that sort of thing accidentally.  The result
  wasn't pretty.
 
 Except that the application's 7 pages are movable and the something
 would have to be unmovable. And then they should not share the same
 memory region. At least they should never be allowed to interleave in
 such a pattern on a larger scale.
 

 It is actually really easy to force regions to never share. At the
 moment, there is a fallback list that determines a preference for what
 block to mix.

 The reason why this isn't enforced is the cost of moving. On x86 and
 x86_64, a block of interest is usually 2MB or 4MB. Clearing out one of
 those pages to prevent any mixing would be bad enough. On PowerPC, it's
 potentially 16MB. On IA64, it's 1GB.

 As this was fragmentation avoidance, not guarantees, the decision was
 made to not strictly enforce the types of pages within a block as the
 cost cannot be made back unless the system was making aggressive use of
 large pages. This is not the case with Linux.

I don't say the group should never be mixed. The movable objects could
be moved out on demand. If 64k get allocated then up to 64k get
moved. That would reduce the impact as the kernel does not hang while
it moves 2MB or even 1GB. It also allows objects to be freed and the
space reused in the unmovable and mixed groups. There could also be a
certain number or percentage of mixed groups allowed to further
increase the chance of movable objects freeing themselves from mixed
groups.

But when you already have, say, 10% of the ram in mixed groups then it
is a sign that external fragmentation is happening and some time should
be spent on moving movable objects.

 The only way a fragmentation catastrophe can be (provably) avoided is
 by having so few unmovable objects that size + max waste << ram
 size. The smaller the better. Allowing movable and unmovable objects
 to mix means that max waste goes way up. In your example waste would
 be 7*size. With a 2MB upper order limit it would be 511*size.
 
 I keep coming back to the fact that movable objects should be moved
 out of the way for unmovable ones. Anything else just allows
 fragmentation to build up.
 

 This is easily achieved, just really really expensive because of the
 amount of copying that would have to take place. It would also require
 min_free_kbytes to be at least one free pageblock (PAGEBLOCK_NR_PAGES
 pages) and likely MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive
 copying. That is a lot of free memory to keep around, which is why
 fragmentation avoidance doesn't do it.

In your sample graphics you had 1152 groups. Reserving a few of those
doesn't sound too bad. And how many migrate types are we talking about? So
far we only had movable and unmovable. I would split unmovable into
short term (caches, I/O pages) and long term (task structures,
dentries). Reserving 6 groups each for short term and long term
unmovable would be about 1% of ram in your situation.

Maybe instead of reserving one could say that you can have up to 6
groups of space not used by unmovable objects before aggressive moving
starts. I don't quite see why you NEED reserving as long as there is
enough space free altogether in case something needs moving. 1 group
worth of free space might be plenty to move stuff to. Note that all
the virtual pages can be stuffed into every little free space there is
and reassembled by the MMU. There is no space lost there.


But until one tries one can't say.

MfG
Goswin

PS: How do allocations pick groups? Could one use the oldest group
dedicated to each MIGRATE_TYPE? Or lowest address for unmovable and
highest address for movable? Something to better keep the two out of
each other's way.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
Jörn Engel [EMAIL PROTECTED] writes:

 On Sun, 16 September 2007 00:30:32 +0200, Andrea Arcangeli wrote:
 
 Movable? I rather assume all slab allocations aren't movable. Then
 slab defrag can try to tackle on users like dcache and inodes. Keep in
 mind that with the exception of updatedb, those inodes/dentries will
 be pinned and you won't move them, which is why I prefer to consider
 them not movable too... since there's no guarantee they are.

 I have been toying with the idea of having separate caches for pinned
 and movable dentries.  Downside of such a patch would be the number of
 memcpy() operations when moving dentries from one cache to the other.
 Upside is that a fair amount of slab cache can be made movable.
 memcpy() is still faster than reading an object from disk.

How probable is it that the dentry is needed again? If you copy it and
it is not needed then you wasted time. If you throw it out and it is
needed then you wasted time too. Depending on the probability one of
the two is cheaper overall. Ideally I would throw away dentries that
haven't been accessed recently and copy recently used ones.
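
The trade-off can be written as a break-even probability. A quick sketch
with invented costs (the real numbers would have to be measured):

/* Back-of-the-envelope: copying a dentry to the movable cache pays off
 * whenever  p_reuse * disk_cost > memcpy_cost.  Both costs below are
 * invented placeholders, not measurements. */
#include <stdio.h>

int main(void)
{
    double memcpy_cost = 0.5e-6;  /* ~0.5us to memcpy a few hundred bytes (assumed) */
    double disk_cost   = 5e-3;    /* ~5ms to re-read the dentry from disk (assumed) */

    double break_even = memcpy_cost / disk_cost;
    printf("copying wins if the reuse probability is > %.4f%%\n",
           100.0 * break_even);
    return 0;
}
/* With these guesses the break-even reuse probability is about 0.01%,
 * i.e. copying is almost always cheaper than risking a disk read. */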

How much of a system's ram is spent on dentries? How much on task
structures? Does anyone have some stats on that? If it is 10% of the
total ram combined then I don't see much point in moving them. Just
keep them out of the way of users' memory so the buddy system can work
effectively.

 Most likely the current reaction to such a patch would be to shoot it
 down due to overhead, so I didn't pursue it.  All I have is an old patch
 to separate never-cached from possibly-cached dentries.  It will
 increase the odds of freeing a slab, but provide no guarantee.

 But the point here is: dentries/inodes can be made movable if there are
 clear advantages to it.  Maybe they should?

 Jörn

MfG
Goswin


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
[EMAIL PROTECTED] (Mel Gorman) writes:

 On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
 Mel Gorman [EMAIL PROTECTED] writes:
 
  On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
  Nick Piggin [EMAIL PROTECTED] writes:
  
   In my attack, I cause the kernel to allocate lots of unmovable allocations
   and deplete movable groups. I theoretically then only need to keep a
   small number (1/2^N) of these allocations around in order to DoS a
   page allocation of order N.
  
  I'm assuming that when an unmovable allocation hijacks a movable group
  any further unmovable alloc will evict movable objects out of that
  group before hijacking another one. right?
  
 
  No eviction takes place. If an unmovable allocation gets placed in a
  movable group, then steps are taken to ensure that future unmovable
  allocations will take place in the same range (these decisions take
  place in __rmqueue_fallback()). When choosing a movable block to
  pollute, it will also choose the lowest possible block in PFN terms to
  steal so that fragmentation pollution will be as confined as possible.
  Evicting the unmovable pages would be one of those expensive steps that
  have been avoided to date.
 
 But then you can have all blocks filled with movable data, free 4K in
 one group, allocate 4K unmovable to take over the group, free 4k in
 the next group, take that group and so on. You can end with 4k
 unmovable in every 64k easily by accident.
 

 As the mixing takes place at the lowest possible block, it's
 exceptionally difficult to trigger this. Possible, but exceptionally
 difficult.

Why is it difficult?

When user space allocates memory wouldn't it get it contiguously? I mean
that is one of the goals, to use larger contiguous allocations and map
them with a single page table entry where possible, right? And then
you can roughly predict where an munmap() would free a page.

Say the application does map a few GB of file, uses madvise to tell
the kernel it needs a 2MB block (to get a contiguous 2MB chunk
mapped), waits for it and then munmaps 4K in there. A 4k hole for some
unmovable object to fill. If you can then trigger the creation of an
unmovable object as well (stat some file?) and loop, you will fill the
ram quickly. Maybe it only works in 10% of cases, but then you just do
it 10 times as often.

Over long times it could occur naturally. This is just to demonstrate
it with malice.
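
A rough sketch of that access pattern, using only standard
mmap/madvise/munmap/stat calls; whether the unmovable allocation actually
lands in the hole depends entirely on the allocator's placement policy,
and the 64k block size is an assumption:

/* Sketch of the attack described above: map a large movable region, punch
 * a 4k hole in each (assumed) 64k block, and provoke a small kernel-side
 * allocation after each hole, hoping it lands there.  Illustrative only. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define BLOCK   (64 * 1024)           /* assumed large block size */
#define REGION  (256UL << 20)         /* 256MB playground         */

int main(void)
{
    char *p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    madvise(p, REGION, MADV_WILLNEED);    /* hint that we want it all     */
    memset(p, 1, REGION);                 /* fault everything in          */

    struct stat st;
    for (unsigned long off = 0; off < REGION; off += BLOCK) {
        munmap(p + off, 4096);            /* 4k hole per 64k block        */
        stat("/etc", &st);                /* "stat some file" as suggested
                                           * above; varying paths would be
                                           * needed to keep allocating     */
    }
    printf("punched %lu holes\n", REGION / BLOCK);
    return 0;
}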

 As I have stated repeatedly, the guarantees can be made but potential
 hugepage allocation did not justify it. Large blocks might.

 There should be a lot of pressure for movable objects to vacate a
 mixed group or you do get fragmentation catastrophes.

 We (Andy Whitcroft and I) did implement something like that. It hooked into
 kswapd to clean mixed blocks. If the caller could do the cleaning, it
 did the work instead of kswapd.

Do you have a graphic like
http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg
for that case?

 Looking at my
 little test program evicting movable objects from a mixed group should
 not be that expensive as it doesn't happen often.

 It happens regularly if the size of the block you need to keep clean is
 lower than min_free_kbytes. In the case of hugepages, that was always
 the case.

That assumes that the number of groups allocated for unmovable objects
will continuously grow and shrink. I'm assuming it will level off at
some size for long times (hours) under normal operations. There should
be some buffering of a few groups held back in reserve for when it
shrinks, to prevent the scenario where the size is just at a group
boundary and always grows/shrinks by 1 group.

 The cost of it
 should be freeing some pages (or finding free ones in a movable group)
 and then memcpy.

 Freeing pages is not cheap. Copying pages is cheaper but not cheap.

To copy you need a free page as destination. That's all I
meant. Hopefully there will always be a free one and the actual freeing
is done asynchronously from the copying.

 So if
 you evict movable objects from mixed groups when needed all the
 pagetable pages would end up in the same mixed group slowly taking it
 over completely. No fragmentation at all. See how essential that
 feature is. :)
 

 To move pages, there must be enough blocks free. That is where
 min_free_kbytes had to come in. If you cared only about keeping 64KB
 chunks free, it makes sense but it didn't in the context of hugepages.

I'm more concerned with keeping the little unmovable things out of the
way. Those are the things that will fragment the memory and prevent
any huge pages from being available even with moving other stuff out of
the way.

It would also already be a big plus to have 64k contiguous chunks for
many operations. Guarantee that the filesystem and block layers can
always get such a page (by means of copying pages out of the way when
needed) and do even larger pages speculatively.

But as you say that is where min_free_kbytes comes in. To have the
chance

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
Linus Torvalds [EMAIL PROTECTED] writes:

 On Sun, 16 Sep 2007, Jörn Engel wrote:
 
 My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc.
 which are pinned for their entire lifetime and another for regular
 files/inodes.  One could take a three-way approach and have
 always-pinned, often-pinned and rarely-pinned.
 
 We won't get never-pinned that way.

 That sounds pretty good. The problem, of course, is that most of the time, 
 the actual dentry allocation itself is done before you really know which 
 case the dentry will be in, and the natural place for actually giving the 
 dentry lifetime hint is *not* at d_alloc(), but when we instantiate 
 it with d_add() or d_instantiate().

 But it turns out that most of the filesystems we care about already use a 
 special case of d_add() that *already* replaces the dentry with another 
 one in some cases: d_splice_alias().

 So I bet that if we just taught d_splice_alias() to look at the inode, 
 and based on the inode just re-allocate the dentry to some other slab 
 cache, we'd already handle a lot of the cases!

 And yes, you'd end up with the reallocation overhead quite often, but at 
 least it would now happen only when filling in a dentry, not in the 
 (*much* more critical) cached lookup path.

   Linus

You would only get it for dentries that live long (or your prediction
is awfully wrong) and then the reallocation amortizes over time if you
will. :)

MfG
Goswin


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
[EMAIL PROTECTED] (Mel Gorman) writes:

 On (16/09/07 17:08), Andrea Arcangeli didst pronounce:
 zooming in I see red pixels all over the squares mixed with green
 pixels in the same square. This is exactly what happens with the
 variable order page cache and that's why it provides zero guarantees
 in terms of how much ram is really free (free as in available).
 

 This picture is not from a kernel grouping pages by mobility, so that is
 hardly a surprise; this is what the normal kernel looks like. Look at the
 videos in http://www.skynet.ie/~mel/anti-frag/2007-02-28 and see how
 list-based compares to vanilla. These are from February when there was
 less control over mixing blocks than there is today.

 In the current version mixing occurs in the lower blocks as much as possible
 not the upper ones. So there are a number of mixed blocks but the number is
 kept to a minimum.

 The number of mixed blocks could have been enforced as 0, but I felt it was
 better in the general case to fragment rather than regress performance.
 That may be different for large blocks where you will want to take the
 enforcement steps.

I agree that 0 is a bad value. But so is infinity. There should be
some mixing but not a lot. You say it is kept to a minimum. Is that
actively done or does it already happen by itself? Hopefully the latter,
which would be just splendid.

 With config-page-shift mmap works on 4k chunks but it's always backed
 by 64k or any other large size that you chose at compile time. And if

But would mapping a random 4K page out of a file then consume 64k?
That sounds like an awful lot of internal fragmentation. I hope the
unaligned bits and pieces get put into a slab or something as you
suggested previously.
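
The worst case of that internal fragmentation is easy to put a number on
(assuming each scattered 4k read really pins a separate 64k backing block,
which is exactly the open question):

/* Worst-case blow-up if every 4k access pins a whole 64k backing block,
 * each access hitting a different block.  Illustrative arithmetic only. */
#include <stdio.h>

int main(void)
{
    const unsigned long soft_page  = 4  * 1024;
    const unsigned long block_size = 64 * 1024;
    const unsigned long touched_mb = 100;   /* 100MB of scattered 4k reads (assumed) */

    unsigned long pages       = touched_mb * 1024 * 1024 / soft_page;
    unsigned long resident_mb = pages * block_size / (1024 * 1024);

    printf("%lu MB of 4k reads could pin up to %lu MB of page cache (%lux blow-up)\n",
           touched_mb, resident_mb, block_size / soft_page);
    return 0;
}
/* Prints: 100 MB of 4k reads could pin up to 1600 MB of page cache (16x blow-up) */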

 the virtual alignment of mmap matches the physical alignment of the
 physical largepage and is >= PAGE_SIZE (software PAGE_SIZE I mean) we
 could use the 62nd bit of the pte to use a 64k tlb (if future cpus
 will allow that). Nick also suggested to still set all ptes equal to
 make life easier for the tlb miss microcode.

It is too bad that existing amd64 CPUs only allow such large physical
pages. But it kind of makes sense to cut away a full level of page
tables for the next bigger size each.

  big you can make it. I don't think my system with 1GB ram would work
  so well with 2MB order 0 pages. But I wasn't refering to that but to
  the picture.
 
 Sure! 2M is sure way excessive for a 1G system, 64k most certainly
 too, of course unless you're running a db or a multimedia streaming
 service, in which case it should be ideal.

rtorrent, Xemacs/gnus, bash, xterm, zsh, make, gcc, galeon and the
occasional mplayer.

I would mostly be concerned how rtorrent's totally random access of
mmapped files negatively impacts such a 64k page system.

MfG
 Goswin


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Goswin von Brederlow
Andrea Arcangeli [EMAIL PROTECTED] writes:

 You ignore one other bit, when /usr/bin/free says 1G is free, with
 config-page-shift it's free no matter what, same goes for not mlocked
 cache. With variable order page cache, /usr/bin/free becomes mostly a
 lie as long as there's no 4k fallback (like fsblock).

% free
             total       used       free     shared    buffers     cached
Mem:       1398784    1372956      25828          0     225224     321504
-/+ buffers/cache:     826228     572556
Swap:      1048568         20    1048548

When has free ever given any useful free number? I can perfectly
fine allocate another gigabyte of memory despite free saying 25MB. But
that is because I know that the buffers/cache are not locked in.

On the other hand 1GB can instantly vanish when I start a xen domain
and anything relying on the free value would lose.


The only sensible thing for an application concerned with swapping is
to watch the swapping and then reduce itself. Not the amount
free. Although I wish there were some kernel interface to get a
pressure value of how valuable free pages would be right now. I would
like that for fuse so a userspace filesystem can do caching without
crippling the kernel.
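
Lacking such an interface, about the best a userspace cache can do today is
watch swap-out activity itself. A minimal sketch reading the pswpout counter
from /proc/vmstat (the counter is real; the idea of a daemon shrinking its
cache on a rising value is just a guess at a policy):

/* Crude swap-pressure probe: read the pswpout counter from /proc/vmstat
 * twice and report how many pages were swapped out in between.  What a
 * caching daemon should do with the number is left open. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long read_pswpout(void)
{
    FILE *f = fopen("/proc/vmstat", "r");
    char key[64];
    long val, out = -1;

    if (!f)
        return -1;
    while (fscanf(f, "%63s %ld", key, &val) == 2)
        if (strcmp(key, "pswpout") == 0)
            out = val;
    fclose(f);
    return out;
}

int main(void)
{
    long before = read_pswpout();
    sleep(5);
    long after = read_pswpout();
    if (before < 0 || after < 0)
        return 1;
    printf("pages swapped out in the last 5s: %ld\n", after - before);
    return 0;
}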

MfG
Goswin


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-15 Thread Goswin von Brederlow
Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> On Sat, Sep 15, 2007 at 02:14:42PM +0200, Goswin von Brederlow wrote:
>> I keep coming back to the fact that movable objects should be moved
>> out of the way for unmovable ones. Anything else just allows
>
> That's incidentally exactly what the slab does, no need to reinvent
> the wheel for that, it's an old problem and there's room for
> optimization in the slab partial-reuse logic too. Just boost the order
> 0 page size and use the slab to get the 4k chunks. The sgi/defrag
> design is backwards.

How does that help? Will slabs move objects around to combine two
partially filled slabs into nearly full one? If not consider this:

- You create a slab for 4k objects based on 64k compound pages.
  (first of all that wastes you a page already for the meta infos)
- Something movable allocates 14 4k pages in there making the slab
  partially filled.
- Something unmovable allocates a 4k page making the slab mixed and
  full.
- Repeat until out of memory.

OR

- Userspace allocates a lot of memory in those slabs.
- Userspace frees one in every 15 4k chunks.
- Userspace forks 1000 times causing an unmovable task structure to
  appear in 1000 slabs. 

MfG
Goswin


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-15 Thread Goswin von Brederlow
Andrew Morton <[EMAIL PROTECTED]> writes:

> On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[EMAIL PROTECTED]> wrote:
>
>> While I agree with your concern, those numbers are quite silly.  The
>> chances of 99.8% of pages being free and the remaining 0.2% being
>> perfectly spread across all 2MB large_pages are lower than those of SHA1
>> creating a collision.
>
> Actually it'd be pretty easy to craft an application which allocates seven
> pages for pagecache, then one for something, then seven for pagecache, then
> one for something, etc.
>
> I've had test apps which do that sort of thing accidentally.  The result
> wasn't pretty.

Except that the application's 7 pages are movable and the something
would have to be unmovable. And then they should not share the same
memory region. At least they should never be allowed to interleave in
such a pattern on a larger scale.

The only way a fragmentation catastrophe can be (provably) avoided is
by having so few unmovable objects that size + max waste << ram
size. The smaller the better. Allowing movable and unmovable objects
to mix means that max waste goes way up. In your example waste would
be 7*size. With a 2MB upper order limit it would be 511*size.

I keep coming back to the fact that movable objects should be moved
out of the way for unmovable ones. Anything else just allows
fragmentation to build up.

MfG
Goswin

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Goswin von Brederlow
Christoph Lameter <[EMAIL PROTECTED]> writes:

> On Fri, 14 Sep 2007, Christoph Lameter wrote:
>
>> an -ENOMEM. Given the quantities of pages on todays machine--a 1 G machine 
>
> s/1G/1T/ Sigh.
>
>> has 256 milllion 4k pages--and the unmovable ratios we see today it 
>
> 256k for 1G.

256k == 64 pages for 1GB ram or 256k pages == 1Mb?

MfG
Goswin


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Goswin von Brederlow
Mel Gorman <[EMAIL PROTECTED]> writes:

> On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
>> Nick Piggin <[EMAIL PROTECTED]> writes:
>> 
>> > In my attack, I cause the kernel to allocate lots of unmovable allocations
>> > and deplete movable groups. I theoretically then only need to keep a
>> > small number (1/2^N) of these allocations around in order to DoS a
>> > page allocation of order N.
>> 
>> I'm assuming that when an unmovable allocation hijacks a movable group
>> any further unmovable alloc will evict movable objects out of that
>> group before hijacking another one. right?
>> 
>
> No eviction takes place. If an unmovable allocation gets placed in a
> movable group, then steps are taken to ensure that future unmovable
> allocations will take place in the same range (these decisions take
> place in __rmqueue_fallback()). When choosing a movable block to
> pollute, it will also choose the lowest possible block in PFN terms to
> steal so that fragmentation pollution will be as confined as possible.
> Evicting the unmovable pages would be one of those expensive steps that
> have been avoided to date.

But then you can have all blocks filled with movable data, free 4K in
one group, allocate 4K unmovable to take over the group, free 4k in
the next group, take that group and so on. You can end with 4k
unmovable in every 64k easily by accident.

There should be a lot of pressure for movable objects to vacate a
mixed group or you do get fragmentation catastrophes. Looking at my
little test program evicting movable objects from a mixed group should
not be that expensive as it doesn't happen often. The cost of it
should be freeing some pages (or finding free ones in a movable group)
and then memcpy. With my simplified simulation it never happens so I
expect it to only happen when the working set changes.

>> > And it doesn't even have to be a DoS. The natural fragmentation
>> > that occurs today in a kernel today has the possibility to slowly push out
>> > the movable groups and give you the same situation.
>> 
>> How would you cause that? Say you do want to purposefully place one
>> unmovable 4k page into every 64k compound page. So you allocate
>> 4K. First 64k page locked. But now, to get 4K into the second 64K page
>> you have to first use up all the rest of the first 64k page. Meaning
>> one 4k chunk, one 8k chunk, one 16k chunk, one 32k chunk. Only then
>> will a new 64k chunk be broken and become locked.
>
> It would be easier early in the boot to mmap a large area and fault it
> in in virtual address order then mlock a page every 64K. Early in
> the system's lifetime, there will be a rough correlation between physical
> and virtual memory.
>
> Without mlock(), the most successful attack will likely mmap() a 60K
> region and fault it in as an attempt to get pagetable pages placed in
> every 64K region. This strategy would not work with grouping pages by
> mobility though as it would group the pagetable pages together.
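
For concreteness, a sketch of the boot-time pattern described above
(RLIMIT_MEMLOCK headroom or CAP_IPC_LOCK is needed, the 64K step is an
assumption, and whether one page per physical 64K range really gets pinned
depends on how well virtual and physical addresses still correlate):

/* Fault a large region in virtual address order, then mlock one 4k page
 * per assumed 64K step and keep the pins alive.  Illustrative only. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define STEP   (64 * 1024)
#define REGION (128UL << 20)

int main(void)
{
    char *p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    memset(p, 1, REGION);                 /* fault in, in address order */
    for (unsigned long off = 0; off < REGION; off += STEP)
        if (mlock(p + off, 4096))         /* pin one 4k page per 64K    */
            perror("mlock");
    printf("pinned %lu pages\n", REGION / STEP);
    pause();                              /* keep the pins alive        */
    return 0;
}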

But even with mlock the virtual pages should still be movable. So if
you evict movable objects from mixed groups when needed all the
pagetable pages would end up in the same mixed group slowly taking it
over completely. No fragmentation at all. See how essential that
feature is. :)

> Targetted attacks on grouping pages by mobility are not very easy and
> not that interesting either. As Nick suggests, the natural fragmentation
> over long periods of time is what is interesting.
>
>> So to get the last 64k chunk used all previous 32k chunks need to be
>> blocked and you need to allocate 32k (or less if more is blocked). For
>> all previous 32k chunks to be blocked every second 16k needs to be
>> blocked. To block the last of those 16k chunks all previous 8k chunks
>> need to be blocked and you need to allocate 8k. For all previous 8k
>> chunks to be blocked every second 4k page needs to be used. To alloc
>> the last of those 4k pages all previous 4k pages need to be used.
>> 
>> So to construct a situation where no contiguous 64k chunk is free you
>> have to allocate total mem - 64k - 32k - 16k - 8k - 4k (or there
>> about) of memory first. Only then could you free memory again while
>> still keeping every 64k page blocked. Does that occur naturally given
>> enough ram to start with?
>> 
>
> I believe it's very difficult to craft an attack that will work in a
> short period of time. An attack that worked on 2.6.22 as well may have
> no success on 2.6.23-rc4-mm1, for example, as grouping pages by mobility
> makes it exceedingly hard to craft an attack unless the attacker
> can mlock large amounts of memory.
>
>> 
>> Too see how bad fr

Re: [RFC PATCH] Add a 'minimal tree install' target

2007-09-14 Thread Goswin von Brederlow
Chris Wedgwood <[EMAIL PROTECTED]> writes:

> This is a somewhat rough first-pass at making a 'minimal tree'
> installation target.  This installs a partial source-tree which you
> can use to build external modules against.  It feels pretty unclean
> but I'm not aware of a much better way to do some of this.
>
> This patch works for me, even when using O=buildtree.  It probably
> needs further cleanups.
>
> Comments?

Ever looked at the debian packages and how they do it? They even split
out common files and specific files from the kernel build. Saves some
space if you build multiple flavours of the same kernel version.

MfG
Goswin


Re: sata & scsi suggestion for make menuconfig

2007-09-14 Thread Goswin von Brederlow
Helge Hafting <[EMAIL PROTECTED]> writes:

> Randy Dunlap wrote:
>> On Fri, 7 Sep 2007 14:48:00 +0200 Folkert van Heusden wrote:
>>
>>
>>> Hi,
>>>
>>> Maybe it is a nice enhancement for make menuconfig to more explicitly
>>> give a pop-up or so when someone selects for example a sata controller
>>> while no 'scsi-disk' support was selected?
>>>
>>
>> I know that it's difficult to get people to read docs & help text,
>> and maybe it is needed in more places, but CONFIG_ATA (SATA/PATA)
>> help text says:
>>
>>   NOTE: ATA enables basic SCSI support; *however*,
>>   'SCSI disk support', 'SCSI tape support', or
>>   'SCSI CDROM support' may also be needed,
>>   depending on your hardware configuration.

Could one duplicate the configure options for scsi disk/tape/cdrom at
that place? The text should then probably read SCSI/SATA disk support
in both places.

MfG
Goswin


Re: O_NOLINK for open()

2007-09-14 Thread Goswin von Brederlow
Brent Casavant <[EMAIL PROTECTED]> writes:

> My (limited) understanding of ptrace is that a parent-child
> relationship is needed between the tracing process and the traced
> process (at least that's what I gather from the man page).  This
> does give cause for concern, and I might have to see what can be
> done to alleviate this concern.  I fully realize that making this
> design completely unassailable is a fool's errand, but closing off
> as many attack vectors as possible seems prudent.

No relationship needed:

strace -p <pid>

MfG
Goswin


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Goswin von Brederlow
Hi,


Nick Piggin <[EMAIL PROTECTED]> writes:

> In my attack, I cause the kernel to allocate lots of unmovable allocations
> and deplete movable groups. I theoretically then only need to keep a
> small number (1/2^N) of these allocations around in order to DoS a
> page allocation of order N.

I'm assuming that when an unmovable allocation hijacks a movable group
any further unmovable alloc will evict movable objects out of that
group before hijacking another one. right?

> And it doesn't even have to be a DoS. The natural fragmentation
> that occurs today in a kernel today has the possibility to slowly push out
> the movable groups and give you the same situation.

How would you cause that? Say you do want to purposefully place one
unmovable 4k page into every 64k compound page. So you allocate
4K. First 64k page locked. But now, to get 4K into the second 64K page
you have to first use up all the rest of the first 64k page. Meaning
one 4k chunk, one 8k chunk, one 16k chunk, one 32k chunk. Only then
will a new 64k chunk be broken and become locked.

So to get the last 64k chunk used all previous 32k chunks need to be
blocked and you need to allocate 32k (or less if more is blocked). For
all previous 32k chunks to be blocked every second 16k needs to be
blocked. To block the last of those 16k chunks all previous 8k chunks
need to be blocked and you need to allocate 8k. For all previous 8k
chunks to be blocked every second 4k page needs to be used. To alloc
the last of those 4k pages all previous 4k pages need to be used.

So to construct a situation where no contiguous 64k chunk is free you
have to allocate total mem - 64k - 32k - 16k - 8k - 4k (or there
about) of memory first. Only then could you free memory again while
still keeping every 64k page blocked. Does that occur naturally given
enough ram to start with?



To see how bad fragmentation could be I wrote a little program to
simulate allocations with the following simplified algorithm:

Memory management:
- Free pages are kept in buckets, one per order, and sorted by address.
- alloc() the front page (smallest address) out of the bucket of the
  right order or recursively splits the next higher bucket.
- free() recursively tries to merge a page with its neighbour and puts
  the result back into the proper bucket (sorted by address).

Allocation and lifetime:
- Every tick a new page is allocated with random order.
- The order is a triangle distribution with max at 0 (throw 2 dice,
  add the eyes, subtract 7, abs() the number).
- The page is scheduled to be freed after X ticks. Where X is nearly
  a Gauss curve centered at 0 with its maximum at total num pages * 1.5.
  (What I actually do is throw 8 dice, sum them up and shift the
   result.)

Display:
I start with a white window. Every page allocation draws a black box
from the address of the page and as wide as the page is big (-1 pixel to
give a separation to the next page). Every page free draws a yellow
box in place of the black one. Yellow to show where a page was in use
at one point while white means the page was never used.

As the time ticks the memory fills up. Quickly at first and then comes
to a stop around 80% filled. And then something interesting
happens. The yellow regions (previously used but now free) start
drifting up. Small pages tend to end up in the lower addresses and big
pages at the higher addresses. The memory defragments itself to some
degree.
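
This is not the original program, but a much-simplified re-implementation of
the described algorithm looks roughly like the sketch below (the dice counts
and lifetime scaling are my own approximations, there is no graphical output,
and it only reports how full the toy memory ends up plus the largest free
block left):

/* Simplified buddy simulation: orders 0..4 (4k..64k), allocations take the
 * lowest-addressed free block, order is triangle-distributed (2 dice),
 * lifetime comes from 8 dice.  All scaling factors are guesses. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_ORDER 4               /* orders 0..4: 4k .. 64k blocks          */
#define NPAGES    4096            /* 4096 order-0 pages = 16MB of toy RAM    */
#define TICKS     100000

static signed char free_ord[NPAGES];   /* order if head of a free block, else -1      */
static signed char alloc_ord[NPAGES];  /* order if head of an allocated block, else -1 */
static int free_at[NPAGES];            /* tick at which an allocated block gets freed  */

static int alloc_block(int order)
{
    for (int i = 0; i < NPAGES; i++) {
        if (free_ord[i] >= order) {
            int o = free_ord[i];
            free_ord[i] = -1;
            while (o > order) {                 /* split, keep the low half */
                o--;
                free_ord[i + (1 << o)] = o;
            }
            return i;
        }
    }
    return -1;                                  /* nothing big enough left  */
}

static void free_block(int i, int order)
{
    while (order < MAX_ORDER) {
        int buddy = i ^ (1 << order);
        if (free_ord[buddy] != order)
            break;                              /* buddy not free / wrong size */
        free_ord[buddy] = -1;                   /* merge with the free buddy   */
        if (buddy < i)
            i = buddy;
        order++;
    }
    free_ord[i] = order;
}

static int dice(int n) { int s = 0; while (n--) s += rand() % 6 + 1; return s; }

int main(void)
{
    srand(42);
    for (int i = 0; i < NPAGES; i++) { free_ord[i] = -1; alloc_ord[i] = -1; }
    for (int i = 0; i < NPAGES; i += 1 << MAX_ORDER)
        free_ord[i] = MAX_ORDER;

    long used = 0;
    for (int tick = 0; tick < TICKS; tick++) {
        for (int i = 0; i < NPAGES; i++)        /* free whatever is due now */
            if (alloc_ord[i] >= 0 && free_at[i] == tick) {
                used -= 1 << alloc_ord[i];
                free_block(i, alloc_ord[i]);
                alloc_ord[i] = -1;
            }
        int order = abs(dice(2) - 7);           /* triangle distribution, max at 0 */
        if (order > MAX_ORDER)
            order = MAX_ORDER;
        int blk = alloc_block(order);
        if (blk >= 0) {
            alloc_ord[blk] = order;
            free_at[blk]  = tick + dice(8) * (NPAGES / 200);  /* crude lifetime */
            used += 1 << order;
        }
    }
    int biggest = -1;
    for (int i = 0; i < NPAGES; i++)
        if (free_ord[i] > biggest)
            biggest = free_ord[i];
    printf("%.1f%% of pages in use, largest free block is order %d\n",
           100.0 * used / NPAGES, biggest);
    return 0;
}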

http://mrvn.homeip.net/fragment/

Simulating 256MB ram and after 1472943 ticks and 530095 4k, 411841 8k,
295296 16k, 176647 32k and 59064 64k allocations you get this:
http://mrvn.homeip.net/fragment/256mb.png

Simulating 1GB ram and after 5881185 ticks  and 2116671 4k, 1645957
8k, 1176994 16k, 705873 32k and 235690 64k allocations you get this:
http://mrvn.homeip.net/fragment/1gb.png

MfG
Goswin

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Goswin von Brederlow
Nick Piggin <[EMAIL PROTECTED]> writes:

> On Tuesday 11 September 2007 22:12, Jörn Engel wrote:
>> On Tue, 11 September 2007 04:52:19 +1000, Nick Piggin wrote:
>> > On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
>> > > 5. VM scalability
>> > >Large block sizes mean less state keeping for the information being
>> > >transferred. For a 1TB file one needs to handle 256 million page
>> > >structs in the VM if one uses 4k page size. A 64k page size reduces
>> > >that amount to 16 million. If the limitation in existing filesystems
>> > >are removed then even higher reductions become possible. For very
>> > >large files like that a page size of 2 MB may be beneficial which
>> > >will reduce the number of page struct to handle to 512k. The
>> > > variable nature of the block size means that the size can be tuned at
>> > > file system creation time for the anticipated needs on a volume.
>> >
>> > The idea that there even _is_ a bug to fail when higher order pages
>> > cannot be allocated was also brushed aside by some people at the
>> > vm/fs summit. I don't know if those people had gone through the
>> > math about this, but it goes somewhat like this: if you use a 64K
>> > page size, you can "run out of memory" with 93% of your pages free.
>> > If you use a 2MB page size, you can fail with 99.8% of your pages
>> > still free. That's 64GB of memory used on a 32TB Altix.
>>
>> While I agree with your concern, those numbers are quite silly.  The
>
> They are the theoretical worst case. Obviously with a non trivially
> sized system and non-DoS workload, they will not be reached.

I would think it should be pretty hard to have only one page out of
each 2MB chunk allocated and non evictable (writeable, swappable or
movable). Wouldn't that require some kernel driver to allocate all
pages and then selectively free them in such a pattern as to keep one
page per 2MB chunk?
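
For reference, the arithmetic behind the worst-case numbers quoted above
(one pinned 4k page per large page):

/* Quick check: one unmovable 4k page per large page makes every large
 * page unusable while almost all memory is still free. */
#include <stdio.h>

static void worst_case(const char *name, double large, double ram_tb)
{
    double frac_used = 4096.0 / large;            /* one 4k page per large page */
    printf("%s: fails with %.1f%% free; on %.0fTB that is %.0fGB in use\n",
           name, 100.0 * (1.0 - frac_used), ram_tb,
           ram_tb * 1024 * frac_used);
}

int main(void)
{
    worst_case("64K pages", 64.0 * 1024, 32);     /* -> ~93.8% free             */
    worst_case("2MB pages", 2048.0 * 1024, 32);   /* -> ~99.8% free, 64GB on 32TB */
    return 0;
}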

Assuming nothing tries to allocate a large chunk of ram while holding
too many locks for the kernel to free it.

>> chances of 99.8% of pages being free and the remaining 0.2% being
>> perfectly spread across all 2MB large_pages are lower than those of SHA1
>> creating a collision.  I don't see anyone abandoning git or rsync, so
>> your extreme example clearly is the wrong one.
>>
>> Again, I agree with your concern, even though your example makes it look
>> silly.
>
> It is not simply a question of once-off chance for an all-at-once layout
> to fail in this way. Fragmentation slowly builds over time, and especially
> if you do actually use higher-order pages for a significant number of
> things (unlike we do today), then the problem will become worse. If you
> have any part of your workload that is affected by fragmentation, then
> it will cause unfragmented regions to eventually be used for fragmentation
> inducing allocations (by definition -- if it did not, eg. then there would be
> no fragmentation problem and no need for Mel's patches).

It might be naive (stop me as soon as I go into dream world) but I
would think there are two kinds of fragmentation:

Hard fragments - physical pages the kernel can't move around
Soft fragments - virtual pages/cache that happen to cause a fragment

I would further assume most ram is used on soft fragments and that the
kernel will free them up by flushing or swapping the data when there
is sufficient need. With defragmentation support the kernel could
prevent some flushing or swapping by moving the data from one
physical page to another. But that would just reduce unnecessary work
and not change the availability of larger pages.

Further I would assume that there are two kinds of hard fragments:
Fragments allocated once at start time and temporary fragments.

At boot time (or when a module is loaded or something) you get a tiny
amount of ram allocated that will remain busy for basically forever. You
get some fragmentation right there that you can never get rid of.

At runtime a lot of pages are allocated and quickly freed again. They
preferably get positioned in regions where there already is
fragmentation, i.e. in regions where there are suitably sized holes
already. They would only break a free 2MB chunk into smaller chunks if
there is no small hole to be found.

Now a trick I would use is to put kernel allocated pages at one end of
the ram and virtual/cache pages at the other end. Small kernel allocs
would find holes at the start of the ram while big allocs would have
to move more to the middle or end of the ram to find a large enough
hole. And virtual/cache pages could always be cleared out to free
large contiguous chunks.

Splitting the two types would prevent freeable and non-freeable regions
from fragmenting each other, always giving us a large pool to pull
compound pages from.

One could also split the ram into regions of different page sizes,
meaning that some large compound pages may not be split below a
certain limit. E.g. some amount of ram would be reserved for chunks
>=64k only. This should be configurable via sys.
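
As a toy illustration of the split I have in mind (my own sketch, nothing
like the real allocator or Mel's patches):

/* Toy model: unmovable kernel allocations grow from the bottom of memory,
 * movable/cache pages from the top, so the space in between stays in one
 * large reclaimable run.  Page numbers only, no freeing, no real kernel API. */
#include <stdio.h>

#define NPAGES 262144L               /* 1 GiB worth of 4k pages */

static long bottom = 0;              /* next free page for unmovable allocs  */
static long top = NPAGES;            /* one past the last free movable page  */

static long toy_alloc(long npages, int movable)
{
    if (top - bottom < npages)
        return -1;                   /* would collide: out of memory */
    if (movable) {
        top -= npages;               /* movable/cache pages grow downward */
        return top;
    }
    bottom += npages;                /* unmovable allocs grow upward */
    return bottom - npages;
}

int main(void)
{
    toy_alloc(64, 0);                /* e.g. a driver's unmovable buffer */
    toy_alloc(200000, 1);            /* page cache filling from the other end */
    printf("contiguous free run left: %ld pages\n", top - bottom);
    return 0;
}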


Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

2007-09-08 Thread Goswin von Brederlow
Nick Piggin <[EMAIL PROTECTED]> writes:

> Lustre should probably have to be ported over to write_begin/write_end in
> order to use it too. With the patches in -mm, if a filesystem is still using
> prepare_write/commit_write, the vm reverts to a safe path which avoids
> the deadlock (and allows multi-seg io copies), but copies the data twice.

Not quite relevant for the performance problem. The situation is like
this:

lustre servers  <-lustre network protocol-> lustre client <-NFS-> desktop

The NFSd problem is on the lustre client that only plays gateway. That
is not to say that the lustre servers or the desktop don't lose
performance due to fragmented writes too, but it isn't that noticeable
there.

> OTOH, this is very likely to go upstream, so your filesystem will need to be
> ported over sooner or later anyway.

Lustre copies the ext3 source from the kernel, patches in some extra
features and renames them during build. So on the one hand it always
breaks whenever someone meddles with the ext3 code. On the other hand
improvements for ext3 get picked up by lustre semi-automatically.

In this case lustre would get the write_begin() function from ext3 and
use it.

MfG
Goswin



Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

2007-09-07 Thread Goswin von Brederlow
Nick Piggin <[EMAIL PROTECTED]> writes:

> On Saturday 08 September 2007 06:01, Goswin von Brederlow wrote:
>> Nick Piggin <[EMAIL PROTECTED]> writes:
>> > So I believe the problem is that for a multi-segment iovec, we currently
>> > prepare_write/commit_write once for each segment, right? We do this
>>
>> It is more complex.
>>
>> Currently a __grab_cache_page, a_ops->prepare_write,
>> filemap_copy_from_user[_iovec] and a_ops->commit_write is done
>> whenever we hit
>>
>>   a) a page boundary
>
> This is required by the prepare_write/commit_write API. The write_begin
> / write_end API is also a page-based one, but in future, we are looking
> at having a more general API but we haven't completely decided on the
> form yet. "perform_write" is one proposal you can look for.
>
>>   b) a segment boundary
>
> This is done, as I said, because of the deadlock issue. While the issue is
> more completely fixed in -mm, a special case for kernel memory (eg. nfsd)
> is in the latest mainline kernels.

Can you tell me where to get the fix from -mm? If it is completely
fixed there then that could make our patch obsolete.

>> Those two cases don't have to, and from the stats basically never,
>> coincide. For NFSd this means we do this TWICE per segment and TWICE
>> per page.
>
> The page boundary doesn't matter so much (well it does for other reasons,
> but we've never been good at them...). The segment boundary means that
> we aren't able to do block sized writes very well and end up doing a lot of
> read-modify-write operations that could be avoided.

Those are extremely costly for lustre. We have tested exporting a
lustre filesystem to NFS. Without fixes we get 40MB/s and with the
fixes it rises to nearly 200MB/s. That is a factor of 5 in speed.

>> > because there is a nasty deadlock in the VM (copy_from_user being
>> > called with a page locked), and copying multiple segs dramatically
>> > increases the chances that one of these copies will cause a page fault
>> > and thus potentially deadlock.
>>
>> What actually locks the page? Is it __grab_cache_page or
>> a_ops->prepare_write?
>
> prepare_write must be given a locked page.

Then that means __grab_cache_page does return a locked page, because
there is nothing between the two calls that would lock it.

>> Note that the patch does not change the number of copy_from_user calls
>> being made nor does it change their arguments. If we need 2 (or more)
>> segments to fill a page we still do 2 seperate calls to
>> filemap_copy_from_user_iovec, both only spanning (part of) one
>> segment.
>>
>> What the patch changes is the number of copy_from_user calls between
>> __grab_cache_page and a_ops->commit_write.
>
> So you're doing all copy_from_user calls within a prepare_write? Then
> you're increasing the chances of deadlock. If not, then you're breaking
> the API contract.

Actually, due to a bug, as you noticed, we do the copy first and then
prepare_write. But fixing that would indeed do multiple copies between
prepare and commit.

>> Copying a full PAGE_SIZE bytes from multiple segments in one go would
>> be a further improvement if that is possible.
>>
>> > The fix you have I don't think can work because a filesystem must be
>> > notified of the modification _before_ it has happened. (If I understand
>> > correctly, you are skipping the prepare_write potentially until after
>> > some data is copied?).
>>
>> Yes. We changed the order of copy_from_user calls and
>> a_ops->prepare_write by mistake. We will rectify that and do the
>> prepare_write for the full page (when possible) before copying the
>> data into the page.
>
> OK, that is what used to be done, but the API is broken due to this
> deadlock. write_begin/write_end fixes it properly.

I'm very interested in that fix.

>> > Anyway, there are fixes for this deadlock in Andrew's -mm tree, but
>> > also a workaround for the NFSD problem in git commit 29dbb3fc. Did
>> > you try a later kernel to see if it is fixed there?
>>
>> Later than 2.6.23-rc5?
>
> No it would be included earlier. The "segment_eq" check should be
> allowing kernel writes (nfsd) to write multiple segments. If you have a
> patch which changes this significantly, then it would indicate the
> existing logic has a problem (or you've got a userspace application doing
> the writev, which should be fixed by the write_begin patches in -mm).

I've got a userspace application doing the writev. To be exact 14% of
the commits were saved by combining multiple segments into a single
prepare_write/commit_write pair. Since the kernel segments don't
fragment anymore in 2.6.23-rc5 those savings must come from user space
stuff.

From the stats posted earlier you can see that there is a substantial
amount of calls with 6 segments all (a lot) smaller than a page. Lots
of calls our patch or the write_begin/end will save.

MfG
Goswin


Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

2007-09-07 Thread Goswin von Brederlow
Nick Piggin <[EMAIL PROTECTED]> writes:

> Anyway, there are fixes for this deadlock in Andrew's -mm tree, but
> also a workaround for the NFSD problem in git commit 29dbb3fc. Did
> you try a later kernel to see if it is fixed there?

I had a chance to look up that commit (git clone took a while so sorry
for writing 2 mails). It is present in 2.6.23-rc5 so I already noticed
it when merging our patch in 2.6.23-rc5.

Upon closer reading of the patch though I see that it will indeed
prevent writes by the nfsd from being split smaller than PAGE_SIZE and
it will cause filemap_copy_from_user[_iovec] to be called with a source
spanning multiple pages.

So the commit 29dbb3fc should have a similar, even slightly better,
gain for the nfsd and other kernel space segments. But it will not
improve writes from user space, where ~14% of the commits were saved
during a day's work for me.


Now I have a question about fault_in_pages_readable(). Can I call that
for multiple pages and then call __grab_cache_page() without risking
that one of the pages gets lost again and causes a deadlock?

MfG
Goswin


Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

2007-09-07 Thread Goswin von Brederlow
Nick Piggin <[EMAIL PROTECTED]> writes:

> On Thursday 06 September 2007 03:41, Bernd Schubert wrote:
> Minor nit: when resubmitting a patch, you should include everything
> (ie. the full changelog of problem statement and fix description) in a
> single mail. It's just a bit easier...

Will do next time.

> So I believe the problem is that for a multi-segment iovec, we currently
> prepare_write/commit_write once for each segment, right? We do this

It is more complex.

Currently a __grab_cache_page, a_ops->prepare_write,
filemap_copy_from_user[_iovec] and a_ops->commit_write is done
whenever we hit

  a) a page boundary
  b) a segment boundary

Those two cases don't have to, and from the stats basically never,
coincide. For NFSd this means we do this TWICE per segment and TWICE
per page.
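
To make the cost concrete, here is a rough userspace model of how many
prepare_write/commit_write cycles a writev causes (illustrative only, not
kernel code; the segment sizes are invented). Starting a new cycle only at
page boundaries is roughly what our patch, and write_begin/write_end, aim at.

#include <stdio.h>
#include <stddef.h>

#define PAGE_SIZE 4096
#define MAX_CUTS  1024

static int count_cycles(const size_t *seg, int nsegs, int split_on_segments)
{
    size_t cut[MAX_CUTS], total = 0;
    int ncuts = 0, unique = 0;

    for (int i = 0; i < nsegs; i++) {
        total += seg[i];
        if (split_on_segments && i < nsegs - 1)
            cut[ncuts++] = total;               /* cut at each segment end */
    }
    for (size_t off = PAGE_SIZE; off < total; off += PAGE_SIZE)
        cut[ncuts++] = off;                     /* cut at each page end */

    for (int i = 0; i < ncuts; i++) {           /* count distinct cut points */
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (cut[j] == cut[i])
                seen = 1;
        if (!seen)
            unique++;
    }
    return unique + 1;                          /* cuts split the write into chunks */
}

int main(void)
{
    size_t seg[] = { 1300, 900, 1500, 700, 1100, 800 }; /* 6 small segments */
    int n = sizeof(seg) / sizeof(seg[0]);

    printf("per segment and page: %d cycles\n", count_cycles(seg, n, 1)); /* 7 */
    printf("per page only:        %d cycles\n", count_cycles(seg, n, 0)); /* 2 */
    return 0;
}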

> because there is a nasty deadlock in the VM (copy_from_user being
> called with a page locked), and copying multiple segs dramatically
> increases the chances that one of these copies will cause a page fault
> and thus potentially deadlock.

What actually locks the page? Is it __grab_cache_page or
a_ops->prepare_write?

Note that the patch does not change the number of copy_from_user calls
being made nor does it change their arguments. If we need 2 (or more)
segments to fill a page we still do 2 separate calls to
filemap_copy_from_user_iovec, both only spanning (part of) one
segment.

What the patch changes is the number of copy_from_user calls between
__grab_cache_page and a_ops->commit_write.

Copying a full PAGE_SIZE bytes from multiple segments in one go would
be a further improvement if that is possible.

> The fix you have I don't think can work because a filesystem must be
> notified of the modification _before_ it has happened. (If I understand
> correctly, you are skipping the prepare_write potentially until after
> some data is copied?).

Yes. We changed the order of copy_from_user calls and
a_ops->prepare_write by mistake. We will rectify that and do the
prepare_write for the full page (when possible) before copying the
data into the page.

> Anyway, there are fixes for this deadlock in Andrew's -mm tree, but
> also a workaround for the NFSD problem in git commit 29dbb3fc. Did
> you try a later kernel to see if it is fixed there?

Later than 2.6.23-rc5?

> Thanks,
> Nick

MfG
Goswin


Re: [fuse-devel] FS block count, size and seek offset?

2007-06-19 Thread Goswin von Brederlow
"David Brown" <[EMAIL PROTECTED]> writes:

>> Why don't you use the existing fuse-unionfs?
>
> I thought about doing this but it would need to be modified somehow
> and even then my users would look to me to fix issues and I don't like
> trying to find hard bugs in other peoples code.
>
> Also, there's a lot of functionality that funionfs has but I don't
> need and the extra code would get in the way attempting to modify or
> debug issues.
>
> What I want is fairly specific and I've not seen anything out there to do it.
>
> Thanks,
> - David Brown

You can still read their code to see how they solved problems you
have.

MfG
Goswin


Re: [fuse-devel] FS block count, size and seek offset?

2007-06-18 Thread Goswin von Brederlow
"David Brown" <[EMAIL PROTECTED]> writes:

> I was looking at various file systems and how they return
> stat.st_blocks and stat.st_size for directories and had some questions
> on how a fuse filesystem is supposed to implement readdir with the
> seek offset when trying to union two directories together.
>
> Is the offset a byte offset into the DIR *dp? or is it the struct
> dirent size (which is relative based on the name of the file) into the
> dir pointer?

I think it is totally your call what you store in it. Only requirement
is that you never pass 0 to the filler function.

off_t is 64 bit so storing a DIR* in it should be no problem. Or a
pointer to

struct UnionDIR {
  DIR *dp;
  struct UnionDIR *next;
};
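
For what it's worth, a sketch of how such a cookie could be used (my own
sketch, assuming the libfuse 2.x readdir/filler signatures; the opendir side
that builds the chain and stores it in fi->fh is hypothetical and not shown):

/* Store a UnionDIR pointer in the readdir offset so the next call can resume. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <dirent.h>
#include <stdint.h>

struct UnionDIR {
    DIR *dp;                  /* branch directory currently being read */
    struct UnionDIR *next;    /* remaining branches of the union       */
};

static int union_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
    struct UnionDIR *head = (struct UnionDIR *)(uintptr_t)fi->fh;
    /* offset is whatever we handed to filler before; 0 means "from the top" */
    struct UnionDIR *pos = offset ? (struct UnionDIR *)(uintptr_t)offset : head;
    struct dirent *de;

    (void)path;
    for (; pos; pos = pos->next)
        while ((de = readdir(pos->dp)) != NULL)
            if (filler(buf, de->d_name, NULL, (off_t)(uintptr_t)pos))
                return 0;     /* buffer full; fuse calls again with that offset */
    return 0;
}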

> Also, if you want to be accurate when you stat a directory that's
> unioned in the fuse file system how many blocks should one return?
> Since each filesystem seems to return different values for size and
> number of blocks for directories. I know I could just say that its not
> supported with my filesystem built using fuse... but I'd like to at
> least try to be accurate.

You could add them up and round it to some common block size (if they
differ). But I don't think it matters and nothing uses that info.

What is more important is the link count. For example find uses the
link count to know how many subdirs a directory has. Once it found
that many it assumes there are no more dirs and saves on stat() calls.
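
As a small illustration of why the count matters (my own sketch; the
"2 + number of subdirs" rule holds for ext3-like filesystems, not
universally, and a union fs reporting a wrong count would break it):

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

static void list_subdirs(const char *dir)
{
    struct stat st;
    struct dirent *de;
    DIR *dp;
    long todo;
    char path[4096];

    if (stat(dir, &st) != 0 || (dp = opendir(dir)) == NULL)
        return;
    todo = (long)st.st_nlink - 2;        /* expected number of subdirectories */
    while (todo > 0 && (de = readdir(dp)) != NULL) {
        if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (stat(path, &st) == 0 && S_ISDIR(st.st_mode)) {
            printf("subdir: %s\n", path);
            todo--;                      /* one fewer left, can stop early */
        }
    }
    closedir(dp);
}

int main(void) { list_subdirs("."); return 0; }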

> Is it accurate to assume that the size or number of blocks returned
> from a stat will be used to pass a seek offset?
>
> When does fuse use the seek offset?

Afaik never. The offset is only stored for the next readdir call but
never used inside fuse.

> These are the number of blocks and size on an empty dir.
> ext3
> size 4096 nblocks 8
> reiserfs
> size 48 nblocks 0
> jfs
> size 1 nblocks 0
> xfs
> size 6 nblocks 0
>
> Any help to figure out how to union two directories and return correct
> values would be helpful.

Why don't you use the existing fuse-unionfs?

> Thanks,
> - David Brown
>
> P.S. maybe a posix filesystem interface manual would be good?

MfG
Goswin

