Re: [PATCH] copy_from_high_bh

2001-07-08 Thread Mark Hemment

On Sun, 8 Jul 2001, Linus Torvalds wrote:
> On Sun, 8 Jul 2001 [EMAIL PROTECTED] wrote:
> >
> >   mm/highmem.c/copy_from_high_bh() blocks interrupts while copying "down"
> > to a bounce buffer, for writing.
> >   This function is only ever called from create_bounce() (which cannot
> > be called from an interrupt context - it may block), and the kmap
> > 'KM_BOUNCE_WRITE' is only used in copy_from_high_bh().
>
> If so it's not just the interrupt disable that is unnecessary, the
> "kmap_atomic()" should also be just a plain "kmap()", I think.

  That might eat through kmap()'s virtual window too quickly.

  I did think about adding a test to see if the page is already mapped,
and falling back to kmap_atomic() if it isn't.  That should give the best
of both worlds?
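
  A rough sketch of the hybrid I mean (illustrative only - the helper name
is made up and this isn't a tested patch; it relies on create_bounce()
only ever passing highmem pages down here):

	static inline void copy_bounce_write(struct buffer_head *to,
					     struct buffer_head *from)
	{
		char *vfrom;

		if (page_address(from->b_page)) {
			/* Page already has a persistent mapping; kmap()
			 * just bumps the pkmap count, no new window. */
			vfrom = kmap(from->b_page);
			memcpy(to->b_data, vfrom + bh_offset(from), to->b_size);
			kunmap(from->b_page);
		} else {
			/* Not currently mapped; use an atomic slot rather
			 * than eating another kmap() virtual window. */
			vfrom = kmap_atomic(from->b_page, KM_BOUNCE_WRITE);
			memcpy(to->b_data, vfrom + bh_offset(from), to->b_size);
			kunmap_atomic(vfrom, KM_BOUNCE_WRITE);
		}
	}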

Mark


Re: [patch] 4GB I/O, cut three

2001-05-30 Thread Mark Hemment


On Wed, 30 May 2001, Jens Axboe wrote:
> On Wed, May 30 2001, Mark Hemment wrote:
> >   This can lead to attempt_merge() releasing the embedded request
> > structure (which, as an exact copy, has the ->q set, so to
> > blkdev_release_request() it looks like a request which originated from
> > the block layer).  This isn't too healthy.
> > 
> >   The fix here is to add a check in __scsi_merge_requests_fn() to check
> > for ->special being non-NULL.
> 
> How about just adding 
> 
>   if (req->cmd != next->cmd
>   || req->rq_dev != next->rq_dev
>   || req->nr_sectors + next->nr_sectors > q->max_sectors
>   || next->sem || req->special)
> return;
> 
> ie check for special too, that would make sense to me. Either way would
> work, but I'd rather make this explicit in the block layer that 'not
> normal' requests are left alone. That includes stuff with the sem set,
> or special.


  Yes, that is an equivalent fix.

  In the original patch I wanted to keep the change local (ie. in the SCSI
layer).  Pushing the check up to the generic block layer makes sense.

  Are you going to push this change to Linus, or should I?
  I'm assuming the other scsi-layer changes in Alan's tree will eventually
be pushed.

Mark




Re: [patch] 4GB I/O, cut three

2001-05-30 Thread Mark Hemment

Hi again, :)

On Tue, 29 May 2001, Jens Axboe wrote:
> Another day, another version.
> 
> Bugs fixed in this version: none
> Known bugs in this version: none
> 
> In other words, it's perfect of course.

  With the scsi-high patch, I'm not sure about the removal of the line
from __scsi_end_request():

req->buffer = bh->b_data;

  A requeued request is not always processed immediately, so new
buffer-heads arriving at the block layer can be merged against it.  A
requeued request is placed at the head of the request list, so
nothing can merge with it - but what if multiple requests are
requeued on the same queue?

  In Linus's tree, requests requeued via the SCSI layer can cause problems
(corruption).  I sent out a patch to cover this a few months back, which
got picked up by Alan (it's in the -ac series - see the changes to
scsi_lib.c and scsi_merge.c), but no one posted any feedback.
  I've included some of the original message below.

Mark


--
From [EMAIL PROTECTED] Sat Mar 31 16:07:14 2001 +0100
Date: Sat, 31 Mar 2001 16:07:13 +0100 (BST)
From: Mark Hemment <[EMAIL PROTECTED]>
Subject: [PATCH] Possible SCSI + block-layer bugs

Hi,

  I've never seen these trigger, but they look theoretically possible.

  When processing the completion of a SCSI request in a bottom-half,
__scsi_end_request() can find all the buffers associated with the request
haven't been completed (ie. leftovers).

  One question is: can this ever happen?
  If it can't, then the code should be removed from __scsi_end_request();
if it can, then there appear to be a few problems.

  The request is re-queued to the block layer via 
scsi_queue_next_request(), which uses the "special" pointer in the request
structure to remember the Scsi_Cmnd associated with the request.  The SCSI
request function is then called, but doesn't guarantee to immediately
process the re-queued request even though it was added at the head (say,
the queue has become plugged).  This can trigger two possible bugs.

  The first is that __scsi_end_request() doesn't decrement the
hard_nr_sectors count in the request.  As the request is back on the
queue, it is possible for newly arriving buffer-heads to merge with the
heads already hanging off the request.  This merging uses the
hard_nr_sectors when calculating both the merged hard_nr_sectors and
nr_sectors counts.
  As the request is at the head, only back-merging can occur, but if
__scsi_end_request() triggers another uncompleted request to be re-queued,
it is possible to get front merging as well.

  The merging of a re-queued request looks safe, except for the
hard_nr_sectors.  This patch corrects the hard_nr_sectors accounting.


  The second bug is from request merging in attempt_merge().

  For a re-queued request, the request structure is the one embedded in
the Scsi_Cmnd (which is a copy of the request taken in the 
scsi_request_fn).
  In attempt_merge(), q->merge_requests_fn() is called to see if the
requests are allowed to merge.  __scsi_merge_requests_fn() checks the
number of segments, etc, but doesn't check if one of the requests is a
re-queued one (ie. no test against ->special).
  This can lead to attempt_merge() releasing the embedded request
structure (which, as an exact copy, has the ->q set, so to
blkdev_release_request() it looks like a request which originated from
the block layer).  This isn't too healthy.

  The fix here is to add a check in __scsi_merge_requests_fn() to check
for ->special being non-NULL.




ll_rw_blk.c and high memory

2001-05-30 Thread Mark Hemment

Hi Jens, all,

  In drivers/block/ll_rw_blk.c:blk_dev_init(), the high and low queued
sectors are calculated from the total number of free pages in all memory
zones.  Shouldn't this calculation be based upon the number of pages upon
which I/O can be done directly (ie. without bounce pages)?

  On a box with gigabytes of memory, high_queued_sectors becomes larger
than the amount of memory upon which block I/O can be directly done - the
test in ll_rw_block() against high_queued_sectors is then never true.  The
throttling of I/O submission happens much earlier in the stack - at
the allocation of a bounce page.  This doesn't seem right.
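
  Something along these lines is what I mean (a rough sketch, not a patch;
the 2/3 split and the helper name are just illustrative):

	/* Size the water marks from memory block I/O can reach without
	 * bounce buffers, instead of from every free page in the box. */
	static void init_queued_sector_limits(void)
	{
		/* KB addressable without bounce pages. */
		unsigned long direct_kb = (nr_free_pages() - nr_free_highpages())
						<< (PAGE_SHIFT - 10);

		high_queued_sectors = (direct_kb * 2 / 3) * 2;	/* 1KB == 2 sectors */
		low_queued_sectors = high_queued_sectors / 3;
	}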

Mark




Re: [patch] 4GB I/O, cut three

2001-05-30 Thread Mark Hemment

On Wed, 30 May 2001, Jens Axboe wrote:
> On Wed, May 30 2001, Mark Hemment wrote:
> > Hi Jens,
> > 
> >   I ran this (well, cut-two) on a 4-way box with 4GB of memory and a
> > modified qlogic fibre channel driver with 32disks hanging off it, without
> > any problems.  The test used was SpecFS 2.0
> 
> Cool, could you send me the qlogic diff? It's the one-liner can_dma32
> chance I'm interested in, I'm just not sure what driver you used :-)

  The qlogic driver is the one from:
http://www.feral.com/isp.html
I find this much more stable than the one already in the kernel.
  It did just need the one-liner change, but as the driver isn't in the
kernel there isn't much point adding the change to your patch. :)


> >   I did change the patch so that bounce-pages always come from the NORMAL
> > zone, hence the ZONE_DMA32 zone isn't needed.  I avoided the new zone, as
> > I'm not 100% sure the VM is capable of keeping the zones it already has
> > balanced - and adding another one might break the camels back.  But as the
> > test box has 4GB, it wasn't bouncing anyway.
> 
> You are right, this is definitely something that needs checking. I
> really want this to work though. Rik, Andrea? Will the balancing handle
> the extra zone?

  In theory it should do - ie. there isn't anything to stop it.

  With NFS loads, over a ported VxFS filesystem, I do see some problems
between the NORMAL and HIGH zones.  Thinking about it, ZONE_DMA32
shouldn't make this any worse.

  Rik, Andrea, quick description of a balancing problem;
Consider a VM which is under load (but not stressed), such that
all zone free-page pools are between their MIN and LOW marks, with
pages in the inactive_clean lists.

The NORMAL zone has non-zero page order allocations thrown at
it.  This causes __alloc_pages() to reap pages from the NORMAL
inactive_clean list until the required buddy is built.  The blind
reaping causes the NORMAL zone to have a large number of free pages
(greater than ->pages_low).

Now, when HIGHMEM allocations come in (for page cache pages), they
skip the HIGH zone and use the NORMAL zone (as it now has plenty
of free pages) - the code at the top of __alloc_pages(), which
checks against ->pages_low.

But the NORMAL zone is usually under more pressure than the HIGH
zone - as many more allocations need ready-mapped memory.  This
causes the page-cache pages from the NORMAL zone to come under
more pressure, and to be "re-cycled" quicker than page-cache pages
in the HIGHMEM zone.

  OK, we shouldn't be throwing too many non-zero page allocations at
__alloc_pages(), but it does happen.
  Also, the problem isn't as bad as it first looks - HIGHMEM page-cache
pages do get "recycled" (reclaimed), but there is a slight imbalance.
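
  To show the path I mean, here is a simplified sketch of the fast-path
zone loop (not the exact __alloc_pages() source):

	zone_t **zone = zonelist->zones;

	for (;;) {
		zone_t *z = *(zone++);
		if (!z)
			break;
		/* HIGHMEM sits at its steady state (free_pages below
		 * pages_low) and gets skipped, while NORMAL, inflated
		 * by the blind reaping above, passes this test and
		 * soaks up HIGHMEM-capable page-cache allocations. */
		if (z->free_pages > z->pages_low) {
			page = rmqueue(z, order);
			if (page)
				return page;
		}
	}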

Mark




Re: [patch] 4GB I/O, cut three

2001-05-30 Thread Mark Hemment

Hi Jens,

  I ran this (well, cut-two) on a 4-way box with 4GB of memory and a
modified qlogic fibre channel driver with 32 disks hanging off it, without
any problems.  The test used was SpecFS 2.0.

  Performance is definitely up - but I can't give an exact number, as the
run with this patch was compiled with -fno-omit-frame-pointer for debugging
any problems.

  I did change the patch so that bounce-pages always come from the NORMAL
zone, hence the ZONE_DMA32 zone isn't needed.  I avoided the new zone, as
I'm not 100% sure the VM is capable of keeping the zones it already has
balanced - and adding another one might break the camel's back.  But as the
test box has 4GB, it wasn't bouncing anyway.

Mark


On Tue, 29 May 2001, Jens Axboe wrote:
> Another day, another version.
> 
> Bugs fixed in this version: none
> Known bugs in this version: none
> 
> In other words, it's perfect of course.
> 
> Changes:
> 
> - Added ide-dma segment coalescing
> - Only print highmem I/O enable info when HIGHMEM is actually set
> 
> Please give it a test spin, especially if you have 1GB of RAM or more.
> You should see something like this when booting:
> 
> hda: enabling highmem I/O
> ...
> SCSI: channel 0, id 0: enabling highmem I/O
> 
> depending on drive configuration etc.
> 
> Plea to maintainers of the different architectures: could you please add
> the arch parts to support this? This includes:
> 
> - memory zoning at init time
> - page_to_bus
> - pci_map_page / pci_unmap_page
> - set_bh_sg
> - KM_BH_IRQ (for HIGHMEM archs)
> 
> I think that's it, feel free to send me questions and (even better)
> patches.



Re: nasty SCSI performance drop-outs (potential test case)

2001-05-11 Thread Mark Hemment


On Fri, 11 May 2001, null wrote: 
> Time to mkfs the same two 5GB LUNs in parallel is 54 seconds.  Hmmm.
> Bandwidth on two CPUs is totally consumed (99.9%) and a third CPU is
> usually consumed by the kupdated process.  Activity lights on the storage
> device are mostly idle during this time.

  I see you've got 1.2GBytes, so you are using HIGHMEM support?

  I submitted a patch a few weeks back, against the buffer cache, which
makes it behave better with HIGHMEM.  The patch improved the time taken to
create large filesystems.

  This got picked up by Alan in his -ac series.  Can't remember exactly
when Alan merged it, but it is definitely in 2.4.4-ac3.  I'd recommend
giving it a try.

Mark





Re: [PATCH] allocation looping + kswapd CPU cycles

2001-05-10 Thread Mark Hemment


On Wed, 9 May 2001, Marcelo Tosatti wrote:
> On Wed, 9 May 2001, Mark Hemment wrote:
> >   Could introduce another allocation flag (__GFP_FAIL?) which is or'ed
> > with a __GFP_WAIT to limit the looping?
> 
> __GFP_FAIL is in the -ac tree already and it is being used by the bounce
> buffer allocation code. 

Thanks for the pointer.

  For non-zero order allocations, the test against __GFP_FAIL is a little
too soon; it would be better after we've tried to reclaim pages from the
inactive-clean list.  Any nasty side effects to this?

  Plus, the code still prevents PF_MEMALLOC processes from using the
inactive-clean list for non-zero order allocations.  As the trend seems to
be to make zero and non-zero allocations 'equivalent', shouldn't this
restriction be lifted?

Mark


Re: [PATCH] allocation looping + kswapd CPU cycles

2001-05-09 Thread Mark Hemment


On Tue, 8 May 2001, David S. Miller wrote: 
> Actually, the change was made because it is illogical to try only
> once on multi-order pages.  Especially because we depend upon order
> 1 pages so much (every task struct allocated).  We depend upon them
> even more so on sparc64 (certain kinds of page tables need to be
> allocated as 1 order pages).
> 
> The old code failed _far_ too easily, it was unacceptable.
> 
> Why put some strange limit in there?  Whatever number you pick
> is arbitrary, and I can probably piece together an allocation
> state where the choosen limit is too small.

  Agreed, but some non-zero order allocations can fall back to other
schemes (such as an emergency buffer, or using vmalloc for a temp
buffer) and don't want to be trapped in __alloc_pages() for too long.

  Could we introduce another allocation flag (__GFP_FAIL?) which is OR'ed
with __GFP_WAIT to limit the looping?
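
  A caller with a fallback might then look something like this (purely
hypothetical code, assuming such a flag existed, for some byte count
'size'):

	void *buf;

	/* Try for physically contiguous pages, but let the allocator
	 * give up rather than loop in __alloc_pages() forever. */
	buf = (void *) __get_free_pages(GFP_KERNEL | __GFP_FAIL,
					get_order(size));
	if (!buf)
		/* Fall back to a virtually contiguous temporary buffer. */
		buf = vmalloc(size);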

> So instead, you could test for the condition that prevents any
> possible forward progress, no?

  Yes, it is possible to detect when kswapd might not make any useful
progress for a failing non-zero order allocation, and to set a global
"force" flag (kswapd_force) to ensure it does something useful.
  For order-1 allocations, that would work.

  For order-2 (and above) it becomes much more difficult as the page
'reap' routines release/process pages based upon age and do not factor in
whether a page may/will buddy (now or in the near future).  This 'blind'
processing of pages can wipe a significant percentage of the page cache
when trying to build a buddy at a high order.

  Of course, no one should be doing really large order allocations and
expecting them to succeed.  But, if they are doing this, the allocation
should at least fail.

Mark


[PATCH] allocation looping + kswapd CPU cycles

2001-05-08 Thread Mark Hemment


  In 2.4.3pre6, the code in page_alloc.c:__alloc_pages() changed from:

try_to_free_pages(gfp_mask);
wakeup_bdflush();
if (!order)
goto try_again;
to
try_to_free_pages(gfp_mask);
wakeup_bdflush();
goto try_again;


  This introduced the effect that a non-zero order, __GFP_WAIT allocation
(without PF_MEMALLOC set) never returns failure.  The allocation keeps
looping in __alloc_pages(), kicking kswapd, until the allocation succeeds.

  If there is plenty of memory in the free-pools and inactive-lists
free_shortage() will return false, causing the state of these
free-pools/inactive-lists not to be 'improved' by kswapd.

  If there is nothing else changing/improving the free-pools or
inactive-lists, the allocation loops forever (kicking kswapd).

  Does anyone know why the 2.4.3pre6 change was made?

  The attached patch (against 2.4.5-pre1) fixes the looping symptom, by
adding a counter and looping only twice for non-zero order allocations.

  The real fix is to measure fragmentation and the progress of kswapd, but
that is too drastic for 2.4.x.

Mark


diff -ur linux-2.4.5-pre1/mm/page_alloc.c markhe-2.4.5-pre1/mm/page_alloc.c
--- linux-2.4.5-pre1/mm/page_alloc.c	Fri Apr 27 22:18:08 2001
+++ markhe-2.4.5-pre1/mm/page_alloc.c   Tue May  8 13:42:12 2001
@@ -275,6 +275,7 @@
 {
zone_t **zone;
int direct_reclaim = 0;
+   int loop;
unsigned int gfp_mask = zonelist->gfp_mask;
struct page * page;
 
@@ -313,6 +314,7 @@
&& nr_inactive_dirty_pages >= freepages.high)
wakeup_bdflush(0);
 
+   loop = 0;
 try_again:
/*
 * First, see if we have any zones with lots of free memory.
@@ -453,7 +455,8 @@
if (gfp_mask & __GFP_WAIT) {
memory_pressure++;
try_to_free_pages(gfp_mask);
-   goto try_again;
+   if (!order || loop++ < 2)
+   goto try_again;
}
}
 


kernel lock in dcache.c

2001-04-28 Thread Mark Hemment

Hi,

  d_move() in fs/dcache.c is checking the kernel lock is held
(switch_names() does the same, but is only called from d_move()).

  My question is why?
  I can't see what it is using the kernel lock to sync/protect against.

  Anyone out there know?

Thanks,
Mark


Re: [PATCH] allow PF_MEMALLOC tasks to directly reclaim pages

2001-04-27 Thread Mark Hemment

Marcelo,

  In fact, the test can be made even weaker than that.

  We only want to avoid the inactive-clean list when allocating from
within an interrupt (or from a bottom-half handler), to avoid
deadlock on taking the pagecache_lock and pagemap_lru_lock.
  Note: no allocations are done while holding either of these locks.

  Even GFP_ATOMIC doesn't matter; that simply says whether we should block
or not (it could be an allocation while holding a spinlock, but not inside
an interrupt).

  I've been running very heavy load on a 4-way server with:

if (order == 0 && !in_interrupt())
direct_reclaim = 1;

without any problems.

  Of course, the in_interrupt() is heavier than (gfp_mask & __GFP_WAIT),
but the benefits outweigh the slight fattening.

Mark



On Fri, 27 Apr 2001, Marcelo Tosatti wrote:

> 
> Linus,
> 
> Currently __alloc_pages() does not allow PF_MEMALLOC tasks to free clean
> inactive pages.
> 
> This is senseless --- if the allocation has __GFP_WAIT set, its ok to grab
> the pagemap_lru_lock/pagecache_lock/etc.
> 
> I checked all possible codepaths after reclaim_page() and they are ok.
> 
> The following patch fixes that.
> 
> 
> --- linux/mm/page_alloc.c.orig	Fri Apr 27 05:59:35 2001
> +++ linux/mm/page_alloc.c Fri Apr 27 05:59:48 2001
> @@ -295,8 +295,7 @@
>* Can we take pages directly from the inactive_clean
>* list?
>*/
> - if (order == 0 && (gfp_mask & __GFP_WAIT) &&
> - !(current->flags & PF_MEMALLOC))
> + if (order == 0 && (gfp_mask & __GFP_WAIT))
>   direct_reclaim = 1;
>  
>   /*
> 
> 



Re: nfs performance at high loads

2001-04-04 Thread Mark Hemment


  I believe David Miller's latest zero-copy patches might help here.
  In his patch, the pull-up buffer is now allocated near the top of the
stack (in the sunrpc code), so it can be a blocking allocation.
  This doesn't fix the core VM problems, but it does relieve the pressure
_slightly_ on the VM (I assume - I haven't tried David's patch yet).

  One of the core problems is that the VM keeps no measure of
page fragmentation in the free page pool.  The system reaches the state of
having plenty of free single pages (so kswapd and friends aren't kicked
- or if they are, they do little or no work), and very few buddied pages
(which you need for some of the NFS requests).
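
  For instance, a crude measure could be derived from the buddy free lists
themselves (hypothetical code, not anything in the kernel; the names follow
the 2.4 zone structures, and a real version would need zone->lock held):

	/* Percentage of a zone's free memory only available as order-0
	 * pages, i.e. unable to satisfy any non-zero order request
	 * without coalescing. */
	static unsigned int zone_fragmentation(zone_t *zone)
	{
		unsigned long free = 0, order0 = 0;
		int order;

		for (order = 0; order < MAX_ORDER; order++) {
			unsigned long blocks = 0;
			struct list_head *entry;

			list_for_each(entry, &zone->free_area[order].free_list)
				blocks++;
			free += blocks << order;
			if (!order)
				order0 = blocks;
		}
		return free ? (100 * order0) / free : 0;
	}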

  Unfortunately, even keeping a measure of fragmentation, and
ensuring work is done when the threshold is reached, doesn't solve the
next problem.

  When a large order request comes in, the inactive_clean page list is
reaped.  As reclaim_page() simply selects the "oldest" page it can, with
no regard as to whether it will buddy (now, or possibly in the near
future), this list is quickly shrunk by a large order request - far too
quickly for a well-behaved system.

  An NFS write request, with an 8K block size, needs an order-2 (16K)
pull-up buffer (we shouldn't really be pulling the header into the same
buffer as the data - perhaps we aren't any more?).  On a well-used system,
an order-2 _blocking_ allocation ends up populating the order-0 and order-1
free lists with quite a few pages from the inactive_clean list.

  This then triggers another problem. :(

  As large (non-zero) order requests are always from the NORMAL or DMA
zones, these zones tend to have a lot of free pages (put there by the
blind reclaim_page() - well, once you can do a blocking allocation they
are, or when the fragmentation kicking is working).
  New allocations of page-cache pages often skip the HIGHMEM zone (it
reaches a steady state), as it is passed over by the loop at the
head of __alloc_pages().
  However, the NORMAL and DMA zones tend to be above pages_low (due to the
reason above), and so new page-cache pages come from these zones.  On a
HIGHMEM system this leads to thrashing of the NORMAL zone, while the
HIGHMEM zone stays (relatively) quiet.
  Note: To make matters even worse under this condition, pulling pages out
of the NORMAL zone is exactly what you don't want to happen!  It would be
much better if they could be left alone for a (short) while to give them a
chance to buddy - Linux (at present) doesn't care about the buddying of
pages in the HIGHMEM zone (no non-zero order allocations come from there).

  I was working on these problems (and some others) a few months back, and
will return to them shortly.  Unfortunately, the changes started to
look too large for 2.4.
  Also, for NFS, the best solution now might be to give the nfsd threads a
receive buffer.  With David's patches, the pull-up occurs in the context
of a thread, making this possible.
  This doesn't solve the problem for other subsystems which do non-zero
order page allocations, but (perhaps) they have a low enough frequency not
to be a real issue.


Kapish,

  Note: Ensure you put a "sync" in your /etc/exports - the default
behaviour was "async" (not legal for a valid SpecFS run).

Mark


On Wed, 4 Apr 2001, Alan Cox wrote:

> > We have been seeing some problems with running nfs benchmarks
> > at very high loads and were wondering if somebody could show
> > some pointers to where the problem lies.
> > The system is a 2.4.0 kernel on a 6.2 Red Hat distribution ( so
> 
> Use 2.2.19. The 2.4 VM is currently too broken to survive high I/O benchmark
> tests without going silly
> 





[PATCH] Possible SCSI + block-layer bugs

2001-03-31 Thread Mark Hemment

Hi,

  I've never seen these trigger, but they look theoretically possible.

  When processing the completion of a SCSI request in a bottom-half,
__scsi_end_request() can find all the buffers associated with the request
haven't been completed (ie. leftovers).

  One question is: can this ever happen?
  If it can't, then the code should be removed from __scsi_end_request();
if it can, then there appear to be a few problems.

  The request is re-queued to the block layer via 
scsi_queue_next_request(), which uses the "special" pointer in the request
structure to remember the Scsi_Cmnd associated with the request.  The SCSI
request function is then called, but doesn't guarantee to immediately
process the re-queued request even though it was added at the head (say,
the queue has become plugged).  This can trigger two possible bugs.

  The first is that __scsi_end_request() doesn't decrement the
hard_nr_sectors count in the request.  As the request is back on the
queue, it is possible for newly arriving buffer-heads to merge with the
heads already hanging off the request.  This merging uses the
hard_nr_sectors when calculating both the merged hard_nr_sectors and
nr_sectors counts.
  As the request is at the head, only back-merging can occur, but if
__scsi_end_request() triggers another uncompleted request to be re-queued,
it is possible to get front merging as well.

  The merging of a re-queued request looks safe, except for the
hard_nr_sectors.  This patch corrects the hard_nr_sectors accounting.


  The second bug is from request merging in attempt_merge().

  For a re-queued request, the request structure is the one embedded in
the Scsi_Cmnd (which is a copy of the request taken in the 
scsi_request_fn).
  In attempt_merge(), q->merge_requests_fn() is called to see if the
requests are allowed to merge.  __scsi_merge_requests_fn() checks the
number of segments, etc, but doesn't check if one of the requests is a
re-queued one (ie. no test against ->special).
  This can lead to attempt_merge() releasing the embedded request
structure (which, as an exact copy, has the ->q set, so to
blkdev_release_request() it looks like a request which originated from
the block layer).  This isn't too healthy.

  The fix here is to add a check in __scsi_merge_requests_fn() to check
for ->special being non-NULL.

  The attached patch is against 2.4.3.

  Jens, Eric, anyone, comments?

Mark


diff -urN -X dontdiff linux-2.4.3/drivers/scsi/scsi_lib.c 
markhe-2.4.3/drivers/scsi/scsi_lib.c
--- linux-2.4.3/drivers/scsi/scsi_lib.c Sat Mar  3 02:38:39 2001
+++ markhe-2.4.3/drivers/scsi/scsi_lib.c	Sat Mar 31 16:56:31 2001
@@ -377,12 +377,15 @@
nsect = bh->b_size >> 9;
blk_finished_io(nsect);
req->bh = bh->b_reqnext;
-   req->nr_sectors -= nsect;
-   req->sector += nsect;
bh->b_reqnext = NULL;
sectors -= nsect;
bh->b_end_io(bh, uptodate);
if ((bh = req->bh) != NULL) {
+   req->hard_sector += nsect;
+   req->hard_nr_sectors -= nsect;
+   req->sector += nsect;
+   req->nr_sectors -= nsect;
+
req->current_nr_sectors = bh->b_size >> 9;
if (req->nr_sectors < req->current_nr_sectors) {
req->nr_sectors = req->current_nr_sectors;
diff -urN -X dontdiff linux-2.4.3/drivers/scsi/scsi_merge.c 
markhe-2.4.3/drivers/scsi/scsi_merge.c
--- linux-2.4.3/drivers/scsi/scsi_merge.c   Fri Feb  9 19:30:23 2001
+++ markhe-2.4.3/drivers/scsi/scsi_merge.c  Sat Mar 31 16:38:27 2001
@@ -597,6 +597,13 @@
Scsi_Device *SDpnt;
struct Scsi_Host *SHpnt;
 
+   /*
+* First check if the either of the requests are re-queued
+* requests.  Can't merge them if they are.
+*/
+   if (req->special || next->special)
+   return 0;
+
SDpnt = (Scsi_Device *) q->queuedata;
SHpnt = SDpnt->host;



[PATCH] kmap performance

2001-03-31 Thread Mark Hemment

Hi,

  Two performance changes against 2.4.3.

  flush_all_zero_pkmaps() is guarding against a race which cannot happen,
and thus hurting performance.

  It uses the atomic fetch-and-clear "ptep_get_and_clear()" operation,
which is much stronger than needed.  No-one has the page mapped, and
cannot get at the page's virtual address without taking the kmap_lock,
which flush_all_zero_pkmaps() holds.  Even with speculative execution,
there are no races which need to be closed via an atomic op.

  On a two-way, Pentium-III system, flush_all_zero_pkmaps() was taking
over 200K CPU cycles (with the flush_tlb_all() only accounting for ~9K of
those cycles).

  This patch replaces ptep_get_and_clear() with a pte_page(), and
pte_clear().  This reduces flush_all_zero_pkmaps() to around 75K cycles.


  The second part of this patch adds a conditional guard to the 
wake_up() call in kunmap_high().

  With most usage patterns, a page will not be simultaneously mapped more
than once, hence the most common case (by far) is for pkmap_count[] to
decrement to 1.  This was causing an unconditional call to wake_up(), when
(again) the common case is to have no tasks in the wait-queue.

  This patch adds a guard to the wake_up() using an inlined
waitqueue_active(), and so avoids unnecessary function calls.
  It also moves the actual wake_up() to outside of the spinlock.  This
is safe, as any waiters will have placed themselves onto the queue under
the kmap_lock, and kunmap_high() tests the queue under this lock.

Mark



diff -urN -X dontdiff linux-2.4.3/mm/highmem.c markhe-2.4.3/mm/highmem.c
--- linux-2.4.3/mm/highmem.c	Tue Nov 28 20:31:02 2000
+++ markhe-2.4.3/mm/highmem.c   Sat Mar 31 15:03:43 2001
@@ -46,7 +46,7 @@
 
for (i = 0; i < LAST_PKMAP; i++) {
struct page *page;
-   pte_t pte;
+
/*
 * zero means we don't have anything to do,
 * >1 means that it is still in use. Only
@@ -56,10 +56,21 @@
if (pkmap_count[i] != 1)
continue;
pkmap_count[i] = 0;
-   pte = ptep_get_and_clear(pkmap_page_table+i);
-   if (pte_none(pte))
+
+   /* sanity check */
+   if (pte_none(pkmap_page_table[i]))
BUG();
-   page = pte_page(pte);
+
+   /*
+* Don't need an atomic fetch-and-clear op here;
+* no-one has the page mapped, and cannot get at
+* its virtual address (and hence PTE) without first
+* getting the kmap_lock (which is held here).
+* So no dangers, even with speculative execution.
+*/
+   page = pte_page(pkmap_page_table[i]);
+   pte_clear(&pkmap_page_table[i]);
+
page->virtual = NULL;
}
flush_tlb_all();
@@ -139,6 +150,7 @@
 {
unsigned long vaddr;
unsigned long nr;
+   int need_wakeup;
 
spin_lock(&kmap_lock);
vaddr = (unsigned long) page->virtual;
@@ -150,13 +162,31 @@
 * A count must never go down to zero
 * without a TLB flush!
 */
+   need_wakeup = 0;
switch (--pkmap_count[nr]) {
case 0:
BUG();
case 1:
-   wake_up(&pkmap_map_wait);
+   /*
+* Avoid an unnecessary wake_up() function call.
+* The common case is pkmap_count[] == 1, but
+* no waiters.
+* The tasks queued in the wait-queue are guarded
+* by both the lock in the wait-queue-head and by
+* the kmap_lock.  As the kmap_lock is held here,
+* no need for the wait-queue-head's lock.  Simply
+* test if the queue is empty.
+*/
+   need_wakeup = waitqueue_active(&pkmap_map_wait);
}
spin_unlock(&kmap_lock);
+
+   /*
+* Can do wake-up, if needed, race-free outside of
+* the spinlock.
+*/
+   if (need_wakeup)
+   wake_up(&pkmap_map_wait);
 }
 
 /*


Re: Q: explicit alignment control for the slab allocator

2001-03-02 Thread Mark Hemment

On Fri, 2 Mar 2001, Manfred Spraul wrote:
> Quoting Mark Hemment <[EMAIL PROTECTED]>:
> >   Could be a win on archs with small L1 cache line sizes (16bytes on a
> > 486) - but most modern processors have larger lines.
> 
> IIRC cache colouring was introduced for some sun hardware with 2 memory busses:
> one handles (addr%256<128), the other one (addr%256>=128)
> 
> If everything is perfectly aligned, the load one one bus was far higher than the
> load on the other bus.

  Yes.
  High-end Intel PCs have also had interleaved buses for a few years
now.  So it is not just for Sun h/w.

 
> >   Hmm, on that note, have you seen the L1 line size defined for a Pentium 4?
> > 128 bytes!! (CONFIG_X86_L1_CACHE_SHIFT of 7).  That is probably going to
> > waste a lot of space for small objects.
> >
> No, it doesn't:
> HWCACHE_ALIGN means "do not cross a cache line boundary".

  Ah, I broke my code! :(

  In my original slab, the code to do "packing" of objects into a single
cache line was #if-def'ed out for SMP to avoid the possibility of
false-sharing between objects.  Not a large possibility, but it exists.
 
> The question is who should decide about the cache colour offset?
> 
> a) the slab allocator always chooses the smallest sensible offset (i.e. the
> alignment)
> b) the caller can specify the offset, e.g. if the caller knows that the hot zone
> is large he would use a larger colour offset.

  Only the caller knows about the attributes and usage of an object, so
they should be able to request (not demand) the offset/alignment from the
allocator.  (OK, they can demand the alignment.)

> Even if the hot zone is larger than the default offset, is there any advantage
> of increasing the colour offset beyond the alignment?
> 
> I don't see an advantage.

  I do, but like you, I don't have any data to prove my point.
  Time to get profiling?

Mark





Re: Q: explicit alignment control for the slab allocator

2001-03-02 Thread Mark Hemment


On Thu, 1 Mar 2001, Manfred Spraul wrote:
> Yes, I see the difference, but I'm not sure that it will work as
> intended.
> offset must be a multiple of the alignment, everything else won't work.

  The code does force the offset to be a multiple of the alignment -
rounding the offset up.  The idea was that a caller could do something like:
kmem_cache_create("foo", sizeof(foo_s),
  offsetof(foo_s, member), ...);

where the structure members in foo_s are "hot" up until the 'member'
field.

> In which cases an offset > alignment is really a win?

  You've got me. :)  I don't know.
  In the Bonwick paper, such a facility was described, so I thought "hey,
sounds like that might be useful".
  Could be a win on archs with small L1 cache line sizes (16 bytes on a
486) - but most modern processors have larger lines.
  Hmm, on that note, have you seen the L1 line size defined for a Pentium 4?
128 bytes!! (CONFIG_X86_L1_CACHE_SHIFT of 7).  That is probably going to
waste a lot of space for small objects.


> Obviously using offset 32 bytes for a structure with a 64 byte hot zone
> means that 2 slabs with a different "color" compete for the same cache
> lines [just assuming 32 byte cache lines for simplicity] in 50% of the
> cases, but otoh offset==64 halfs the number of possible colors.

  Yes.
  It is possible to improve on the current slab allocator, to get an
extra colour or two out of it for some object sizes (eg. when the slab
management is on slab, it is only ever at the front of a slab - it could
also wrap round to the rear).

Mark

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Q: explicit alignment control for the slab allocator

2001-03-01 Thread Mark Hemment

On Thu, 1 Mar 2001, Manfred Spraul wrote:

> Mark Hemment wrote:
> > 
> >   The original idea behind offset was for objects with a "hot" area
> > greater than a single L1 cache line.  By using offset correctly (and to my
> > knowledge it has never been used anywhere in the Linux kernel), a SLAB
> > cache creator (caller of kmem_cache_create()) could ask the SLAB for more
> > than one colour (space/L1 cache lines) offset between objects.
> >
> 
> What's the difference between this definition of 'offset' and alignment?

  The positioning of the first object within a slab (at least that is how
it is supposed to work).

  The distance between all objects within a slab is constant, so the
colouring of objects depends upon the cache line (offset) the first object
is placed on.
  The alignment is the boundary objects fall upon within a slab.  This may
require 'padding' between the objects so they fall on the correct
boundaries (ie. they aren't a 'natural' size).
  For kmem_cache_create(), a zero offset means the offset is the same as
the alignment.

  Take the case of offset being 64, and alignment being 32.
  Here, the allocator attempts to place the first object on a 64byte
boundary (say, at offset 0), and all subsequent objects (within the same
cache) on a 32byte boundary.
  Now, when it comes to construct the next slab, it tries to place the
first object 64bytes offset from the first object in the previous
slab (say, at offset 64).  The distance between the objects is still the
same - ie. they fall on 32byte boundaries.

  See the difference?
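
(A toy sketch of that placement rule - not the real mm/slab.c code, and the
wrap-around behaviour is assumed, just to make the arithmetic concrete:)

/* Where the first object of the n'th slab lands, for a colour step
 * ("offset") of 64 and whatever space is left over at the end of the
 * slab after packing the 32-byte-aligned objects. */
static unsigned long first_object(unsigned long slab_nr,
				  unsigned long colour_step,	/* 64 here */
				  unsigned long left_over)	/* wastage */
{
	unsigned long ncolours = left_over / colour_step + 1;

	return (slab_nr % ncolours) * colour_step;
}
/* slab 0 -> byte 0, slab 1 -> byte 64, slab 2 -> byte 128, ... and the
 * objects inside each slab then sit every (object size padded to the
 * 32-byte alignment) apart. */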

> alignment means that (addr%alignment==0)
> offset means that (addr1-addr2 == n*offset)
> 
> Isn't the only difference the alignment of the first object in a slab?

  Yes (as explained above).  It is important.
 
> Some hardware drivers use HW_CACHEALIGN and assume certain byte
> alignments, and arm needs 1024 byte aligned blocks.

  I should have put a big comment in the allocator, saying alignment/offset
are only hints to the allocator and not guarantees.
  Unfortunately, the allocator was always returning L1 aligned objects
with HW_CACHEALIGN, so folks started to depend on it.  Too late to break
that now.
  It sounds as if HW_CACHEALIGN has been broken by a config option, and
this needs to be fixed.
  But leave 'offset' alone?!  If it isn't working as described above, then
it needs fixing, but don't change its definition.

Mark

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Q: explicit alignment control for the slab allocator

2001-03-01 Thread Mark Hemment

On Thu, 1 Mar 2001, Manfred Spraul wrote:

> Alan added a CONFIG options for FORCED_DEBUG slab debugging, but there
> is one minor problem with FORCED_DEBUG: FORCED_DEBUG disables
> HW_CACHEALIGN, and several drivers assume that HW_CACHEALIGN implies a
> certain alignment (iirc usb/uhci.c assumes 16-byte alignment)
> 
> I've attached a patch that fixes the explicit alignment control in
> kmem_cache_create().
> 
> The parameter 'offset' [the minimum offset to be used for cache
> coloring] actually is the guaranteed alignment, except that the
> implementation was broken. I've fixed the implementation and renamed
> 'offset' to 'align'.

  As the original author of the slab allocator, I can assure you there is
nothing guaranteed about "offset".  Neither is it to do with any minimum.

  The original idea behind offset was for objects with a "hot" area
greater than a single L1 cache line.  By using offset correctly (and to my
knowledge it has never been used anywhere in the Linux kernel), a SLAB
cache creator (caller of kmem_cache_create()) could ask the SLAB for more
than one colour (space/L1 cache lines) offset between objects.

  As no one uses the feature it could well be broken, but is that a reason
to change its meaning?

Mark

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] nfsd + scalability

2001-02-22 Thread Mark Hemment

On Thu, 22 Feb 2001, Neil Brown wrote:

> On Sunday February 18, [EMAIL PROTECTED] wrote:
> > Hi Neil, all,
> > 
> >   The nfs daemons run holding the global kernel lock.  They still hold
> > this lock over calls to file_op's read and write.
> > [snip]
> >   Dropping the kernel lock around read and write in fs/nfsd/vfs.c is a
> > _big_ SMP scalability win!
> 
> Certainly I would like to not hold the BKL so much, but I'm curious
> how much effect it will really have.  Do you have any data on the
> effect of this change?

  Depends very much on the hardware configuration, and underlying
filesystem.
  On an I/O bound system, obviously this has little effect.

  Using a filesystem which has quite a deep stack (CPU cycle heavy),
this is a big win.  I've been running with this for so long that I can't
find my original data files at the moment, but it was around +8%
improvment in throughput for a 4-way box under SpecFS with vxfs as the
underlying filesystem.  Less benefit for ext2 (all filesystems NFS
exported "sync" and "no_subtree_check").  Some of the benefit came from
the fact that there is also a volume manager sitting under the filesystem
(more CPU cycles with the kernel lock held!).

  Holding the kernel lock for less cycles has an important side benefit!
  If it is held for less cycles, then the probability of it being held
when an interrupt is processed is reduced.  This benefit is further
enhanced if there are bottom-halfs running off the back of the interrupt
(such as networking code).
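
  (Rough arithmetic, purely illustrative: if a CPU spends, say, 20% of its
cycles holding the kernel lock, then roughly 1 interrupt in 5 arrives while
it is held - and any bottom-half handlers run off the back of those
interrupts then execute under the lock as well.  Halving the hold time
roughly halves that exposure.)
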
  I need to get actual figures for how many cycles we are spinning at the
reacquire_kernel_lock() (in schedule()), but my gut feeling is that we
aren't doing too well when running as a file server.

> Also, I would much rather drop the lock in nfsd_dispatch() and then go
> around reclaiming it whereever it was needed.
> Subsequently we would drop the lock in nfsd() and make sure sunrpc is
> safe.

  sunrpc definitely tries to be SMP safe, I haven't convinced myself that
it actually is.
  In sunrpc, dropping the kernel lock when checksumming the UDP packet, and
when sending, is another win-win case (again, need to find my data files,
but this was approx +3% improvement in throughput).

> This would be more work (any volunteers?:-) but I feel that dropping
> it in little places like this is a bit unsatisfying.

  True, not 100% satisfying, but even little places give an overall
improvement in  scalability.  If they can be proved to be correct, then
why not do it?  It moves the code in the right direction.

  I am planning on slowly expanding the dropping of the kernel lock within
NFS, rather than do it in one single bang.  It looks doable, but there
_might_ be an issue with the interaction with the dcache.

Mark

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] nfsd + scalability

2001-02-18 Thread Mark Hemment

Hi Neil, all,

  The nfs daemons run holding the global kernel lock.  They still hold
this lock over calls to file_op's read and write.

  The file system kernel interface (FSKI) doesn't require the kernel lock
to be held over these read/write calls.  The nfs daemons do not require 
that the reads or writes do not block (would be v silly if they did), so
they have no guarantee the lock isn't dropped and retaken during
blocking.  ie. they aren't using it as a guard across the calls.

  Dropping the kernel lock around read and write in fs/nfsd/vfs.c is a
_big_ SMP scalability win!

  Attached patch is against 2.4.1-ac18, but should apply to most recent
kernel versions.

Mark


--- vanilla-2.4.1-ac18/fs/nfsd/vfs.cSun Feb 18 15:06:27 2001
+++ markhe-2.4.1-ac18/fs/nfsd/vfs.c Sun Feb 18 19:32:18 2001
@@ -30,6 +30,7 @@
 #include <linux/net.h>
 #include <linux/unistd.h>
 #include <linux/slab.h>
+#include <linux/smp_lock.h>
 #include <linux/in.h>
 #define __NO_VERSION__
 #include <linux/module.h>
@@ -602,12 +603,28 @@
file.f_ralen = ra->p_ralen;
file.f_rawin = ra->p_rawin;
}
+
+   /*
+* The nfs daemons run holding the global kernel lock, but
+* f_op->read() doesn't need the lock to be held.
+* Drop it here to help scalability.
+*
+* The "kernel_locked()" test isn't perfect (someone else could be
+* holding the lock when we're not), but it will eventually catch
+* any cases of entering here without the lock held.
+*/
+   if (!kernel_locked())
+   BUG();
+   unlock_kernel();
+
file.f_pos = offset;
 
oldfs = get_fs(); set_fs(KERNEL_DS);
err = file.f_op->read(&file, buf, *count, &file.f_pos);
set_fs(oldfs);
 
+   lock_kernel();
+
/* Write back readahead params */
if (ra != NULL) {
dprintk("nfsd: raparms %ld %ld %ld %ld %ld\n",
@@ -664,6 +681,22 @@
goto out_close;
 #endif
 
+   /*
+* The nfs daemons run holding the global kernel lock, but
+* f_op->write() doesn't need the lock to be held.
+* Also, as the struct file is private, the export is read-locked,
+* and the inode attached to the dentry cannot change under us, the
+* lock can be dropped ahead of the call to write() for even better
+* scalability.
+*
+* The "kernel_locked()" test isn't perfect (someone else could be
+* holding the lock when we're not), but it will eventually catch
+* any cases of entering here without the lock held.
+*/
+   if (!kernel_locked())
+   BUG();
+   unlock_kernel();
+
dentry = file.f_dentry;
inode = dentry->d_inode;
exp   = fhp->fh_export;
@@ -692,9 +725,12 @@
/* Write the data. */
oldfs = get_fs(); set_fs(KERNEL_DS);
err = file.f_op->write(&file, buf, cnt, &file.f_pos);
+   set_fs(oldfs);
+
+   lock_kernel();
+
if (err >= 0)
nfsdstats.io_write += cnt;
-   set_fs(oldfs);
 
/* clear setuid/setgid flag after write */
if (err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID))) {



[PATCH] HIGHMEM+buffers

2001-02-18 Thread Mark Hemment

Hi,

  On a 4GB SMP box, configured with HIGHMEM support, making a 670G
(obviously using a volume manager) ext2 file system takes 12minutes (over
10minutes of sys time).

  One problem is that buffer allocations do not use HIGHMEM, but
nr_free_buffer_pages() doesn't take this into account, causing
balance_dirty_state() to return the wrong state.
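
(Purely illustrative numbers - the 40% fraction and the page counts below
are invented, not taken from bdflush or the real balance_dirty_state():)

/* Illustration only: why counting HIGHMEM pages in the "free for buffers"
 * total defeats the dirty-buffer threshold on a 4GB box. */
static int dirty_balance_example(void)
{
	unsigned long free_low  = 200000;  /* free/clean DMA+NORMAL pages  */
	unsigned long free_high = 800000;  /* free/clean HIGHMEM pages     */
	unsigned long dirty     = 150000;  /* dirty buffer pages (low mem) */

	unsigned long wrong_thresh = (free_low + free_high) * 40 / 100; /* 400000 */
	unsigned long right_thresh =  free_low              * 40 / 100; /*  80000 */

	/* dirty < wrong_thresh, so nothing forces a flush - even though
	 * the dirty buffers already swamp what the low zones can hold.  */
	return dirty > wrong_thresh || dirty > right_thresh;
}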

  The attached patch fixes the worse part of nr_free_buffer_pages() - with
HIGHMEM it now only counts free+inactive pages in the NORMAL and DMA
zone.  It doesn't fix the inactive_dirty calculation.
  Also, in buffer.c:flush_dirty_buffers(), it is worthwhile to kick the
disk queues if it decides to reschedule.

  With the patch, that 670G filesystem can be created in 5m36secs (under
half the time).

  Patch is against 2.4.1-ac18.

Mark


diff -urN -X dontdiff vanilla-2.4.1-ac18/fs/buffer.c markhe-2.4.1-ac18/fs/buffer.c
--- vanilla-2.4.1-ac18/fs/buffer.c  Sun Feb 18 15:06:26 2001
+++ markhe-2.4.1-ac18/fs/buffer.c   Sun Feb 18 19:03:19 2001
@@ -2638,8 +2638,11 @@
ll_rw_block(WRITE, 1, &bh);
atomic_dec(&bh->b_count);
 
-   if (current->need_resched)
+   if (current->need_resched) {
+   /* kick what we've already pushed down */
+   run_task_queue(&tq_disk);
schedule();
+   }
goto restart;
}
  out_unlock:
diff -urN -X dontdiff vanilla-2.4.1-ac18/include/linux/swap.h 
markhe-2.4.1-ac18/include/linux/swap.h
--- vanilla-2.4.1-ac18/include/linux/swap.h Sun Feb 18 15:06:29 2001
+++ markhe-2.4.1-ac18/include/linux/swap.h  Sun Feb 18 18:11:03 2001
@@ -65,7 +65,9 @@
 
 extern int nr_swap_pages;
 FASTCALL(unsigned int nr_free_pages(void));
+FASTCALL(unsigned int nr_free_pages_zone(int));
 FASTCALL(unsigned int nr_inactive_clean_pages(void));
+FASTCALL(unsigned int nr_inactive_clean_pages_zone(int));
 FASTCALL(unsigned int nr_free_buffer_pages(void));
 extern int nr_active_pages;
 extern int nr_inactive_dirty_pages;
diff -urN -X dontdiff vanilla-2.4.1-ac18/mm/page_alloc.c 
markhe-2.4.1-ac18/mm/page_alloc.c
--- vanilla-2.4.1-ac18/mm/page_alloc.c  Sun Feb 18 15:06:29 2001
+++ markhe-2.4.1-ac18/mm/page_alloc.c   Sun Feb 18 19:04:36 2001
@@ -547,6 +547,23 @@
 }
 
 /*
+ * Total amount of free (allocatable) RAM in a given zone.
+ */
+unsigned int nr_free_pages_zone (int zone_type)
+{
+   pg_data_t   *pgdat;
+   unsigned int sum;
+
+   sum = 0;
+   pgdat = pgdat_list;
+   while (pgdat) {
+   sum += (pgdat->node_zones+zone_type)->free_pages;
+   pgdat = pgdat->node_next;
+   }
+   return sum;
+}
+
+/*
  * Total amount of inactive_clean (allocatable) RAM:
  */
 unsigned int nr_inactive_clean_pages (void)
@@ -565,14 +582,43 @@
 }
 
 /*
+ * Total amount of inactive_clean (allocatable) RAM in a given zone.
+ */
+unsigned int nr_inactive_clean_pages_zone (int zone_type)
+{
+   pg_data_t   *pgdat;
+   unsigned int sum;
+
+   sum = 0;
+   pgdat = pgdat_list;
+   while (pgdat) {
+   sum += (pgdat->node_zones+zone_type)->inactive_clean_pages;
+   pgdat = pgdat->node_next;
+   }
+   return sum;
+}
+
+
+/*
  * Amount of free RAM allocatable as buffer memory:
+ *
+ * For HIGHMEM systems don't count HIGHMEM pages.
+ * This function is still far from perfect for HIGHMEM systems, but
+ * it is close enough for the time being.
  */
 unsigned int nr_free_buffer_pages (void)
 {
unsigned int sum;
 
-   sum = nr_free_pages();
-   sum += nr_inactive_clean_pages();
+#if CONFIG_HIGHMEM
+   sum = nr_free_pages_zone(ZONE_NORMAL) +
+ nr_free_pages_zone(ZONE_DMA) +
+ nr_inactive_clean_pages_zone(ZONE_NORMAL) +
+ nr_inactive_clean_pages_zone(ZONE_DMA);
+#else
+   sum = nr_free_pages() +
+ nr_inactive_clean_pages();
+#endif
sum += nr_inactive_dirty_pages;
 
/*



[PATCH] vmalloc fault race

2001-02-14 Thread Mark Hemment

Hi,

  If two processes, sharing the same page tables, hit an unloaded vmalloc
address in the kernel at the same time, one of the processes is killed
(with the message "Unable to handle kernel paging request").

  This occurs because the test on a vmalloc fault is too tight.  On x86,
it contains;
if (pmd_present(*pmd) || !pmd_present(*pmd_k))
goto bad_area_nosemaphore;

  If two processes are racing, chances are that "pmd_present(*pmd)" will
be true for one of them.
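
(Spelling the interleaving out - both tasks share one mm, so they share the
pgd/pmd entry being filled in:)

	CPU A (task 1)                     CPU B (task 2)
	touches unloaded vmalloc addr      touches the same addr
	takes the fault                    takes the fault
	sees pmd_present(*pmd) == 0,
	copies *pmd_k into *pmd, returns
	                                   sees pmd_present(*pmd) == 1 now,
	                                   so the old test jumps to
	                                   bad_area_nosemaphore and the task
	                                   is killed on a valid access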

  It appears that; alpha, x86, arm, cris and sparc all have the
same/similar test.
  I've included a patch for all the above archs, but have only tested on
the x86 (so no guarantee it is correct for other archs).

  Note: Even though this race can only occur on SMP (assuming we cannot
context switch on entering a page-fault), I've unconditionally removed the
test against pmd.  It could be argued that it should be left in for UP...

  Patch is against 2.4.2pre3.

Mark


diff -urN -X dontdiff vxfs-2.4.2pre3/arch/alpha/mm/fault.c 
markhe-2.4.2pre3/arch/alpha/mm/fault.c
--- vxfs-2.4.2pre3/arch/alpha/mm/fault.cFri Dec 29 22:07:19 2000
+++ markhe-2.4.2pre3/arch/alpha/mm/fault.c  Wed Feb 14 12:10:21 2001
@@ -223,7 +223,7 @@
 
pgd = current->active_mm->pgd + offset;
pgd_k = swapper_pg_dir + offset;
-   if (!pgd_present(*pgd) && pgd_present(*pgd_k)) {
+   if (pgd_present(*pgd_k)) {
pgd_val(*pgd) = pgd_val(*pgd_k);
return;
}
diff -urN -X dontdiff vxfs-2.4.2pre3/arch/arm/mm/fault-common.c 
markhe-2.4.2pre3/arch/arm/mm/fault-common.c
--- vxfs-2.4.2pre3/arch/arm/mm/fault-common.c   Mon Feb 12 10:10:27 2001
+++ markhe-2.4.2pre3/arch/arm/mm/fault-common.c Wed Feb 14 12:12:13 2001
@@ -185,8 +185,6 @@
goto bad_area;
 
pmd = pmd_offset(pgd, addr);
-   if (!pmd_none(*pmd))
-   goto bad_area;
set_pmd(pmd, *pmd_k);
return 1;
 
diff -urN -X dontdiff vxfs-2.4.2pre3/arch/cris/mm/fault.c 
markhe-2.4.2pre3/arch/cris/mm/fault.c
--- vxfs-2.4.2pre3/arch/cris/mm/fault.c Mon Feb 12 10:10:27 2001
+++ markhe-2.4.2pre3/arch/cris/mm/fault.c   Wed Feb 14 12:13:10 2001
@@ -381,7 +381,7 @@
 pmd = pmd_offset(pgd, address);
 pmd_k = pmd_offset(pgd_k, address);
 
-if (pmd_present(*pmd) || !pmd_present(*pmd_k))
+if (!pmd_present(*pmd_k))
 goto bad_area_nosemaphore;
 set_pmd(pmd, *pmd_k);
 return;
diff -urN -X dontdiff vxfs-2.4.2pre3/arch/i386/mm/fault.c 
markhe-2.4.2pre3/arch/i386/mm/fault.c
--- vxfs-2.4.2pre3/arch/i386/mm/fault.c Sun Nov 12 03:01:11 2000
+++ markhe-2.4.2pre3/arch/i386/mm/fault.c   Wed Feb 14 12:14:13 2001
@@ -340,7 +340,7 @@
pmd = pmd_offset(pgd, address);
pmd_k = pmd_offset(pgd_k, address);
 
-   if (pmd_present(*pmd) || !pmd_present(*pmd_k))
+   if (!pmd_present(*pmd_k))
goto bad_area_nosemaphore;
set_pmd(pmd, *pmd_k);
return;
diff -urN -X dontdiff vxfs-2.4.2pre3/arch/sparc/mm/fault.c 
markhe-2.4.2pre3/arch/sparc/mm/fault.c
--- vxfs-2.4.2pre3/arch/sparc/mm/fault.cMon Jan  1 18:37:41 2001
+++ markhe-2.4.2pre3/arch/sparc/mm/fault.c  Wed Feb 14 12:17:36 2001
@@ -378,7 +378,7 @@
pmd = pmd_offset(pgd, address);
pmd_k = pmd_offset(pgd_k, address);
 
-   if (pmd_present(*pmd) || !pmd_present(*pmd_k))
+   if (!pmd_present(*pmd_k))
goto bad_area_nosemaphore;
pmd_val(*pmd) = pmd_val(*pmd_k);
return;

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[Patch] Pushing the global kernel lock (aka BKL)

2001-01-25 Thread Mark Hemment

Hi,

  Several places in the kernel run holding the global kernel lock when it
isn't needed.  This usually occurs when there is a function which can be
called via many different paths; some holding the lock, others not.

  Now, if a function can block (and hence drop the kernel lock) the caller
cannot rely on the kernel lock for its own integrity.  Such a function
_may_ be a candidate for dropping the lock (if it is held) on entry, and
retaken on exit.  This improves SMP scalability.

  A good example of this is generic_make_request().  This function can be
arrived at by several different paths, some of which will be holding the
lock, and some not.  It (or, rather the functions it can call) may block
and a caller has no control over this.  generic_make_request() does not
need the kernel lock itself, and any functions it calls which do require
the lock should be taking it (as they cannot guarantee it is already
held).  This makes it a good candidate for dropping the lock early (rather
than only dropping when blocking).

  Dropping the kernel lock, even for a short period, helps
scalability.  Note, if the lock is held when an interrupt arrives, the
interrupt handler runs holding the lock and so do any bottom-half handlers
run on the back of the interrupt.  So, the less time it is held, the smaller
the chance of being interrupted while holding it, the better the
scalability.
  As the current nfsd code isn't threaded, it runs holding the kernel
lock.  Any reduction in holding the lock helps nfs server scalability.

  The current macros used to drop and retake the kernel lock in
schedule() cannot be used in many cases as they do not nest.

  The attached patch (against 2.4.1-pre10) defines two new macros for x86
(save_and_drop_kernel_lock(x) and restore_kernel_lock(x)) and a new
declaration macro (DECLARE_KERNEL_LOCK_COUNTER(x)).  These allow "pushing"
and "poping" of the kernel lock.
  The patch also contains some examples of usage (although the one in
nfsd/vfs.c could be done with an unlock_kernel()/lock_kernel() pair).
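
(The macro bodies themselves fall outside the quoted part of the patch, so
purely as a sketch of the idea - written against 2.4's kernel_flag spinlock
and current->lock_depth, with no claim that it matches the actual patch:)

#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/smp_lock.h>

/* Sketch only: a nestable "push/pop" of the BKL.  Record how deep the
 * current task holds the lock, drop it completely, and restore that
 * depth later.  The names match the usage below; the bodies are guesses. */
#define DECLARE_KERNEL_LOCK_COUNTER(x)	int x;

#define save_and_drop_kernel_lock(x)			\
	do {						\
		(x) = current->lock_depth;		\
		if ((x) >= 0) {				\
			current->lock_depth = -1;	\
			spin_unlock(&kernel_flag);	\
		}					\
	} while (0)

#define restore_kernel_lock(x)				\
	do {						\
		if ((x) >= 0) {				\
			spin_lock(&kernel_flag);	\
			current->lock_depth = (x);	\
		}					\
	} while (0)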

  If the idea is acceptable, I'll tidy up the patch and add the necessary
macros for other archs.

  My questions are;
a) Does this make sense?
b) Is it acceptable in the kernel?
c) Any better suggestions for the macro names?

  I'd be interested in any results from those applying this patch
and running benchmarks on SMP boxes - mainly filesystem benchmarks.

  I admit this is not the cleanest of ideas.

Mark




diff -urN -X dontdiff vxfs-2.4.1-pre10/drivers/block/ll_rw_blk.c 
markhe-2.4.1-pre10/drivers/block/ll_rw_blk.c
--- vxfs-2.4.1-pre10/drivers/block/ll_rw_blk.c  Wed Jan 24 10:56:23 2001
+++ markhe-2.4.1-pre10/drivers/block/ll_rw_blk.cThu Jan 25 09:47:09 2001
@@ -907,10 +907,24 @@
 {
int major = MAJOR(bh->b_rdev);
request_queue_t *q;
+   DECLARE_KERNEL_LOCK_COUNTER(lock_depth)
 
if (!bh->b_end_io)
BUG();
 
+   /*
+* The caller cannot make any assumptions as to whether this
+* function (or, rather, the funcs called from here) will block
+* or not.
+* Also, we can be called with or without the global kernel lock.
+* As the kernel lock isn't need here, and should be taken by
+* any functions called from here which need it, it is safe to
+* drop the lock.  This helps scalability (and _really_ helps
+* scalability when there is a threaded volume manager sitting
+* below).
+*/
+   save_and_drop_kernel_lock(lock_depth);
+
if (blk_size[major]) {
unsigned long maxsector = (blk_size[major][MINOR(bh->b_rdev)] << 1) + 1;
unsigned long sector = bh->b_rsector;
@@ -931,6 +945,7 @@
   blk_size[major][MINOR(bh->b_rdev)]);
}
bh->b_end_io(bh, 0);
+   restore_kernel_lock(lock_depth);
return;
}
}
@@ -953,6 +968,8 @@
break;
}
} while (q->make_request_fn(q, rw, bh));
+
+   restore_kernel_lock(lock_depth);
 }
 
 
diff -urN -X dontdiff vxfs-2.4.1-pre10/fs/nfsd/vfs.c markhe-2.4.1-pre10/fs/nfsd/vfs.c
--- vxfs-2.4.1-pre10/fs/nfsd/vfs.c  Fri Dec 29 22:07:23 2000
+++ markhe-2.4.1-pre10/fs/nfsd/vfs.cThu Jan 25 10:01:51 2001
@@ -30,6 +30,7 @@
 #include <linux/net.h>
 #include <linux/unistd.h>
 #include <linux/malloc.h>
+#include <linux/smp_lock.h>
 #include <linux/in.h>
 #define __NO_VERSION__
 #include <linux/module.h>
@@ -596,6 +597,7 @@
mm_segment_t    oldfs;
int err;
struct file file;
+   DECLARE_KERNEL_LOCK_COUNTER(lock_depth)
 
err = nfsd_open(rqstp, fhp, S_IFREG, MAY_READ, &file);
if (err)
@@ -618,12 +620,25 @@
file.f_ralen = ra->p_ralen;
file.f_rawin = ra->p_rawin;
}
+
+   /*
+* The nfs daemons run holding the global 

Re: 4G SGI quad Xeon - memory-related slowdowns

2001-01-16 Thread Mark Hemment

Hi Paul,
 
> 2) Other block I/O output (eg dd if=/dev/zero of=/dev/sdi bs=4M) also
> run very slowly

What do you notice when running "top" and doing the above?
Does the "buff" value grow high (+700MB), with high CPU usage?

  If so, I think this might be down to nr_free_buffer_pages().

  This function includes the pages in all zones (including the HIGHMEM
zone) in its calculations, while only DMA and NORMAL zone pages are used
for buffers.  This upsets the result from 
balance_dirty_state() (fs/buffer.c), so the required flushing of buffers is
only done once the DMA and NORMAL zones are running very low on pages.

  I've attached a "quick hack" I did for 2.4.0.  It doesn't completely
solve the problem, but moves it in the right direction.

  Please let me know if this helps.

Mark


diff -urN -X dontdiff linux-2.4.0/mm/page_alloc.c markhe-2.4.0/mm/page_alloc.c
--- linux-2.4.0/mm/page_alloc.c Wed Jan  3 17:59:06 2001
+++ markhe-2.4.0/mm/page_alloc.cMon Jan 15 15:35:14 2001
@@ -583,6 +583,27 @@
 }
 
 /*
+ * Free pages in zone "type", and the zones below it.
+ */
+unsigned int nr_free_pages_zone (int type)
+{
+   unsigned int sum;
+   zone_t *zone;
+   pg_data_t *pgdat = pgdat_list;
+
+   if (type >= MAX_NR_ZONES)
+   BUG();
+
+   sum = 0;
+   while (pgdat) {
+   for (zone = pgdat->node_zones; zone < pgdat->node_zones + type; zone++)
+   sum += zone->free_pages;
+   pgdat = pgdat->node_next;
+   }
+   return sum;
+}
+
+/*
  * Total amount of inactive_clean (allocatable) RAM:
  */
 unsigned int nr_inactive_clean_pages (void)
@@ -600,6 +621,25 @@
return sum;
 }
 
+unsigned int nr_inactive_clean_pages_zone(int type)
+{
+   unsigned int sum;
+   zone_t *zone;
+   pg_data_t *pgdat = pgdat_list;
+
+   if (type >= MAX_NR_ZONES)
+   BUG();
+   type++;
+
+   sum = 0;
+   while (pgdat) {
+   for (zone = pgdat->node_zones; zone < pgdat->node_zones + type; zone++)
+   sum += zone->inactive_clean_pages;
+   pgdat = pgdat->node_next;
+   }
+   return sum;
+}
+
 /*
  * Amount of free RAM allocatable as buffer memory:
  */
@@ -607,9 +647,9 @@
 {
unsigned int sum;
 
-   sum = nr_free_pages();
-   sum += nr_inactive_clean_pages();
-   sum += nr_inactive_dirty_pages;
+   sum = nr_free_pages_zone(ZONE_NORMAL);
+   sum += nr_inactive_clean_pages_zone(ZONE_NORMAL);
+   sum += nr_inactive_dirty_pages; /* XXX */
 
/*
 * Keep our write behind queue filled, even if



Re: test13-pre5

2000-12-29 Thread Mark Hemment

On Fri, 29 Dec 2000, Tim Wright wrote:
> Yes, this is a very important point if we ever want to make serious use
> of large memory machines on ia32. We ran into this with DYNIX/ptx when the
> P6 added 36-bit physical addressing. Conserving KVA (kernel virtual address
> space), became a very high priority. Eventually, we had to add code to play
> silly segment games and "magically" materialize and dematerialize a 4GB
> kernel virtual address space instead of the 1GB. This only comes into play
> with really large amounts of memory, and is almost certainly not worth the
> agony of implementation on Linux, but we'll need to be careful elsewhere to
> conserve it as much as possible.

  Indeed.  I'm compiling my kernels with 2GB virtual.  Not because I want
more NORMAL pages in the page cache (HIGH memory is fine), but because I
need NORMAL pages for kernel data/structures (memory allocated from
slab-caches) which need to be constantly mapped in.

Mark


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: test13-pre5

2000-12-29 Thread Mark Hemment

Hi,

On Thu, 28 Dec 2000, David S. Miller wrote:
>Date:  Thu, 28 Dec 2000 23:17:22 +0100
>From: Andi Kleen <[EMAIL PROTECTED]>
> 
>Would you consider patches for any of these points? 
> 
> To me it seems just as important to make sure struct page is
> a power of 2 in size, with the waitq debugging turned off this
> is true for both 32-bit and 64-bit hosts last time I checked.

  Checking test11 (which I'm running here), even with waitq debugging
turned off, on 32-bit (IA32) the struct page is 68bytes (since
the "age" member was re-introduced a while back).

  For my development testing, I'm running a _heavily_ hacked kernel.  One
of these hacks is to pull the wait_queue_head out of struct page; the
waitq-heads are in a separate allocated area of memory, with a waitq-head
pointer embedded in the page structure (allocated/initialised in
free_area_init_core()).  This gives a page structure of 60bytes, giving me
one free double-word to play with (which I'm using as a pointer to a
release function).

  In fact, there doesn't need to be a waitq-head allocated for each page
structure - they can share; with a performance overhead on a false
wakeup in __wait_on_page().
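
(Again only a sketch, not the hacked kernel itself: one way to let pages
share wait-queue heads is a small hash keyed on the page, at the price of
the occasional false wakeup mentioned above.  Each head would still need an
init_waitqueue_head() at boot:)

#include <linux/mm.h>
#include <linux/wait.h>

#define PAGE_WAITQ_SHIFT	8
#define PAGE_WAITQ_SIZE		(1 << PAGE_WAITQ_SHIFT)

static wait_queue_head_t page_waitq[PAGE_WAITQ_SIZE];

/* Unrelated pages that hash to the same head just cause a spurious
 * wakeup and recheck in __wait_on_page(). */
static inline wait_queue_head_t *page_waitqueue(struct page *page)
{
	unsigned long hash = (unsigned long)page / sizeof(struct page);

	return &page_waitq[hash & (PAGE_WAITQ_SIZE - 1)];
}
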
  Note, for those of us running on 32bit with lots of physical memory, the
available virtual address-space is of major consideration.  Reducing the
size of the page structure is more than just reducing cache misses - it
gives us more virtual to play with...
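
  (Rough numbers: 4GB of RAM is ~1M struct pages, so at 68 bytes each the
mem_map alone pins about 68MB of the kernel's virtual window; trimming the
structure to 60 bytes hands back roughly 8MB of NORMAL-zone virtual space
before any cache effects are even counted.)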

Mark

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[PATCH] Linus elevator

2000-12-18 Thread Mark Hemment

Hi,

  Looking at the second loop in elevator_linus_merge(), it is possible for
requests to have their elevator_sequence go negative.  This can cause a
v long latency before the request is finally serviced.

  Say, for example, a request (in the queue) is jumped in the first loop
in elevator_linus_merge() as "cmd != rw", even though its 
elevator_sequence is zero.  If it is found that the new request will
merge, the walking back over requests which were jumped makes no test for
an already zeroed elevator_sequence.  Hence negative values can occur.
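
  (Spelling that out, on the assumption that the latency barrier only fires
when elevator_sequence is exactly zero:

	write W sits in the queue with W->elevator_sequence already 0
	read R1 arrives: loop one jumps W ("cmd != rw"), a merge is found,
	    loop two walks back and does W->elevator_sequence--   -> -1
	read R2 arrives: the same again                           -> -2
	...
	W's sequence can now never hit the value the barrier test looks
	for, so W keeps being passed over - the long latency above.)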

  With high default values for read/write_latency, this hardly ever occurs.

  A simple fix for this is to test for zero before decrementing (patch
below) in the second loop.
  Alternatively, should testing in the first loop be modified?

Mark


diff -u --recursive --new-file -X dontdiff linux-2.4.0-test12/drivers/block/elevator.c 
markhe-2.4.0-test12/drivers/block/elevator.c
--- linux-2.4.0-test12/drivers/block/elevator.c Tue Dec  5 23:05:26 2000
+++ markhe-2.4.0-test12/drivers/block/elevator.cMon Dec 18 17:50:19 2000
@@ -90,6 +90,9 @@
if (ret != ELEVATOR_NO_MERGE && *req) {
while ((entry = entry->next) != &q->queue_head) {
struct request *tmp = blkdev_entry_to_request(entry);
+
+   if (!tmp->elevator_sequence)
+   continue;
tmp->elevator_sequence--;
}
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Mark Hemment

Hi Tigran,

On Wed, 11 Oct 2000, Tigran Aivazian wrote:
 
> a) one of the eepro100 interfaces (the onboard one on the S2QR6 mb) is
> malfunctioning, interrupts are generated but no traffic gets through (YES,
> I did plug it in correctly, this time, and I repeat 2.2.16 works!)

  I saw this the other week on our two-way Dell under a reasonably heavy
load - but with the 3c59x.c driver; the eepro100s survived!
  Either NIC (we had two Tornados) could go this way after anything from 1
to 36 hours of load.  They would end up running in "poll" mode off the
transmit watchdog timer.
  Swapped them for a dual-port eepro100 and no more problems.

Mark

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: the new VMt

2000-09-26 Thread Mark Hemment

Hi,

On Mon, 25 Sep 2000, Stephen C. Tweedie wrote: 
> So you have run out of physical memory --- what do you do about it?

  Why let the system get into the state where it is necessary to kill a
process?
  Per-user/task resource counters should prevent unprivileged users from
soaking up too many resources.  That is the DoS protection.
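
  (For instance, the existing rlimit mechanism already gives a crude,
per-task version of this from userspace - the 256MB figure below is just
an arbitrary illustration:)

#include <stdio.h>
#include <sys/resource.h>

/* Cap this task's address space so a runaway process cannot soak up
 * all of memory + swap.  256MB is an example value only. */
static int cap_address_space(void)
{
	struct rlimit rl = { 256UL << 20, 256UL << 20 };

	if (setrlimit(RLIMIT_AS, &rl) != 0) {
		perror("setrlimit(RLIMIT_AS)");
		return -1;
	}
	return 0;
}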

  So an OOM is possibly;
1) A privileged, legally resource hungry, app(s) has taken all
   the memory.  Could be too important to simply kill (it
   should exit gracefully).
2) Simply too many tasks*(memory-requirements-of-each-task).

  Ignoring allocations done by the kernel, the situation comes down to the
fact that the system has over committed its memory resources.  ie. it has
sold too many tickets for the number of seats in the plane, and all the
passengers have turned up.
 (note, I use the term "memory" and not "physical memory", I'm including
swap space).

  Why not protect the system from over committing its memory resources?
  
  It is possible to do true, system wide, resource counting of physical
memory and swap space, and to deny a fork() or mmap() which would cause
over committing of memory resources if everyone cashed in their
requirements.

  Named pages (those which came from a file) are the simplest to
handle.  If dirty, they already have allocated backing store, so we know
there is somewhere to put them when memory is low.
  How many named pages need to be held in physical memory at any one
instance for the system to function?  Only a few, although if you reach
that state, the system will be thrashing itself to death.

  Anonymous and copied (those faulted from a write to a
MAP_PRIVATE|MAP_WRITE mapping) pages can be stored in either physical
memory or on swap.  To avoid getting into the OOM situation, when these
mappings are created the system needs to check that it has (and will have,
in the future) space for every page that _could_ be allocated for the
mapping - ie. work out the worst case (including page-tables).
  This space could be on swap or in physical memory.  It is the accounting
which needs to be done, not the actual allocation (and not even the
decision of where to store the page when allocated - that is made much
later, when it needs to be).  If a machine has 2GB of RAM, a 1MB
swap, and 1GB of dirty anon or copied pages, that is fine.
  I'm stressing this point, as the scheme of reserving space for an (as
yet) unallocated page is sometimes referred to as "eager swap
allocation" (or some similar term).  This is confusing.  People then
start to believe they need backing store for each anon/copied page.  You
don't.  You simply need somewhere to store it, and that could be a
physical page.  It is all in the accounting. :)
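
  (A very rough sketch of that book-keeping - the counters and names here
are invented purely to illustrate, and locking is ignored:)

#include <asm/pgtable.h>	/* PTRS_PER_PTE */
#include <linux/errno.h>

extern unsigned long vm_total_pages;	/* physical RAM pages + swap pages */
extern unsigned long vm_reserved_pages;	/* worst cases already promised */

/* Called when an anon or private-writable mapping is created.  Pure
 * accounting - no backing store is chosen or allocated here. */
static int vm_reserve(unsigned long pages)
{
	/* the worst case includes the page-tables needed to map it */
	pages += (pages + PTRS_PER_PTE - 1) / PTRS_PER_PTE;

	if (vm_reserved_pages + pages > vm_total_pages)
		return -ENOMEM;		/* would over-commit: refuse */

	vm_reserved_pages += pages;
	return 0;
}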

  Allocations made by the kernel, for the kernel, are (obviously) pinned
memory.  To ensure kernel allocations do not completely exhaust physical
memory (or cause physical memory to be over committed if the worst case
occurs), they need to be limited.
  How to limit?
  As a first guess (and this is only a guess);
1) don't let kernel allocations exceed 25% of physical memory
   (tunable)
2) don't let kernel allocations succeed if they would cause
   over commitment.
  Both conditions would need to pass before an allocation could succeed.
  This does need much more thought.  Should some tuning be per subsystem?
I don't know.

  Perhaps 1) isn't needed.  I'm not sure.
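
  (Again, just a guess at how those two conditions might be expressed -
none of these counters exist as such:)

extern unsigned long num_physpages;		/* total physical pages */
extern unsigned long vm_total_pages;		/* RAM + swap, in pages */
extern unsigned long vm_reserved_pages;		/* promised to user mappings */
extern unsigned long kernel_pinned_pages;	/* pinned by kernel allocations */

/* Both checks must pass before a kernel allocation may proceed. */
static int kernel_alloc_allowed(unsigned long pages)
{
	/* 1) tunable cap: pinned kernel memory stays below 25% of RAM */
	if (kernel_pinned_pages + pages > num_physpages / 4)
		return 0;
	/* 2) don't push total commitments past RAM + swap */
	if (kernel_pinned_pages + vm_reserved_pages + pages > vm_total_pages)
		return 0;
	return 1;
}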

  Because of 2), the total physical memory accounted for anon/copied
pages needs to have a high watermark.  Otherwise, in the accounting, the
system could allow too much physical memory to be reserved for these
types of pages (there doesn't need to be space on swap for each
anon/copied page, just space somewhere - a watermark would prevent too
much of this being physical memory).  Note, this doesn't mean start
swapping earlier - remember, this is accounting of anon/copied pages to
avoid over commitment.
  For named pages, the page cache needs to have a reserved number of
physical pages (ie. how small is it allowed to get, before pruning
stops).  Again, these reserved pages are in the accounting.

 mlock()ed pages need to have accounting also to prevent over commitment of
physical memory.  All fun.

  The disadvantages;

1) Extra code to do the accounting.
This shouldn't be too heavy.

2) mmap(MAP_ANON)/mmap(MAP_PRIVATE|MAP_SHARED) can fail more readily.

Programs which expect to memory map areas (which would create
anon/copied pages when written to) will see an increased failure
rate in mmap().  This can be very annoying, especially when you
know the mapping will be used sparsely.

One solution is to add a new mmap() flag, which tells the kernel
to let this mmap() exceed the actual resources.
With such a flag, the mmap() will be allowed, but the task should
expect to be killed if memory is exhausted.  (It could be
