Re: [patch] 4GB I/O, cut three
On Wed, 30 May 2001, Andrea Arcangeli wrote:
> On Wed, May 30, 2001 at 08:57:50PM +0200, Yoann Vandoorselaere wrote:
> > I remember the 2.3.51 kernel as the most usable kernel I ever used,
> > talking about VM.
>
> I also don't remember anything strange in that kernel about the VM (I
> instead remember well the VM breakage introduced in 2.3.99-pre).
>
> Regardless of what 2.3.51 was doing, falling back into the lower
> zones before starting the balancing is fine.

The problem with 2.3.51 was that it started balancing the HIGHMEM zone before falling back. On a 1GB system this led not only to the system starting to swap as soon as the 128MB highmem zone was filled up, it also resulted in the other 900MB being essentially unused. Having your 1GB system run as if it had 128MB can definitely be classified as Not Fun.

Rik
--
Virtual memory is like a game you can't win; However, without VM
there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/
Send all your spam to [EMAIL PROTECTED] (spam digging piggy)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] 4GB I/O, cut three
On Wed, May 30, 2001 at 08:57:50PM +0200, Yoann Vandoorselaere wrote:
> I remember the 2.3.51 kernel as the most usable kernel I ever used,
> talking about VM.

I also don't remember anything strange in that kernel about the VM (I instead remember well the VM breakage introduced in 2.3.99-pre).

Regardless of what 2.3.51 was doing, falling back into the lower zones before starting the balancing is fine.

Andrea
Re: [patch] 4GB I/O, cut three
On Wed, May 30, 2001 at 03:42:51PM -0300, Rik van Riel wrote:
> On Wed, 30 May 2001 [EMAIL PROTECTED] wrote:
> > btw, I think such a heuristic is horribly broken ;), the highmem zone
> > simply needs to be balanced if it is under the pages_low mark; just
> > skipping it and falling back into the normal zone that happens to be
> > above the low mark is the wrong thing to do.
>
> 2.3.51 did this, we all know the result.

I have no idea what 2.3.51 did, but I was obviously wrong about that. Forget what I said above.

Andrea
Re: [patch] 4GB I/O, cut three
Rik van Riel <[EMAIL PROTECTED]> writes:
> On Wed, 30 May 2001 [EMAIL PROTECTED] wrote:
> > btw, I think such a heuristic is horribly broken ;), the highmem zone
> > simply needs to be balanced if it is under the pages_low mark; just
> > skipping it and falling back into the normal zone that happens to be
> > above the low mark is the wrong thing to do.
>
> 2.3.51 did this, we all know the result.

Just a note: I remember the 2.3.51 kernel as the most usable kernel I ever used, talking about VM.

--
Yoann Vandoorselaere | C makes it easy to shoot yourself in the foot. C++ makes
MandrakeSoft         | it harder, but when you do, it blows away your whole
                     | leg. - Bjarne Stroustrup
Re: [patch] 4GB I/O, cut three
On Wed, 30 May 2001, Jens Axboe wrote:
> You are right, this is definitely something that needs checking. I
> really want this to work though. Rik, Andrea? Will the balancing
> handle the extra zone?

Insofar as it handles balancing the current zones, it'll also work with one more. In the places where it's currently broken it will probably also break with one extra zone, though the fact that the DMA32 zone takes pressure off the NORMAL zone might actually help.

regards,

Rik
Re: [patch] 4GB I/O, cut three
On Wed, 30 May 2001 [EMAIL PROTECTED] wrote:
> btw, I think such a heuristic is horribly broken ;), the highmem zone
> simply needs to be balanced if it is under the pages_low mark; just
> skipping it and falling back into the normal zone that happens to be
> above the low mark is the wrong thing to do.

2.3.51 did this, we all know the result.

Rik
Re: [patch] 4GB I/O, cut three
On Wed, May 30, 2001 at 11:59:50AM +0100, Mark Hemment wrote:
> Now, when HIGHMEM allocations come in (for page cache pages), they
> skip the HIGH zone and use the NORMAL zone (as it now has plenty
> of free pages) - the code at the top of __alloc_pages(), which
> checks against ->pages_low.

btw, I think such a heuristic is horribly broken ;). The highmem zone simply needs to be balanced if it is under the pages_low mark; just skipping it and falling back into the normal zone that happens to be above the low mark is the wrong thing to do.

> Also, the problem isn't as bad as it first looks - HIGHMEM page-cache
> pages do get "recycled" (reclaimed), but there is a slight imbalance.

There will always be some imbalance unless all allocations were capable of using highmem (which will never happen). The only thing we can do is optimize the zone usage so we won't run out of normal pages unless there was a good reason. Once we run out of normal pages we'll simply return NULL, and the reserved pool of highmem bounces will be used instead (other callers will behave differently).

Andrea
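The watermark test being argued over can be sketched in a few lines of C. This is an illustrative model, not the actual __alloc_pages() code: the struct and field names (free_pages, pages_low) mirror the 2.4-era zone layout, but pick_zone() is an invented helper and everything else is simplified for demonstration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the 2.4-era zone structure; illustrative only. */
struct zone {
    unsigned long free_pages;
    unsigned long pages_low;
};

/*
 * Model of the fallback heuristic under discussion: walk the zonelist
 * (highest zone first) and hand out pages from the first zone still
 * above its pages_low watermark. A zone below the mark is skipped
 * rather than balanced -- the behaviour Andrea calls broken, since a
 * depleted HIGHMEM zone silently pushes page-cache allocations into
 * ZONE_NORMAL instead of triggering reclaim on HIGHMEM itself.
 */
static struct zone *pick_zone(struct zone **zonelist)
{
    int i;

    for (i = 0; zonelist[i] != NULL; i++) {
        if (zonelist[i]->free_pages > zonelist[i]->pages_low)
            return zonelist[i];
    }
    return NULL; /* every zone is under pressure: caller must balance */
}
```

With HIGHMEM at 10 free pages against a pages_low of 32, and NORMAL at 500 against 128, pick_zone() skips HIGHMEM and returns NORMAL, even though only HIGHMEM needed balancing.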
Re: [patch] 4GB I/O, cut three
On Wed, May 30 2001, Andrea Arcangeli wrote:
> > > I did change the patch so that bounce-pages always come from the
> > > NORMAL zone, hence the ZONE_DMA32 zone isn't needed. I avoided the
> > > new zone, as I'm not 100% sure the VM is capable of keeping the
> > > zones it already has balanced - and adding another one might break
> > > the camel's back. But as the test box has 4GB, it wasn't bouncing
> > > anyway.
> >
> > You are right, this is definitely something that needs checking. I
> > really want this to work though. Rik, Andrea? Will the balancing
> > handle the extra zone?
>
> The bounces can come from ZONE_NORMAL without problems, however the

Of course

> ZONE_DMA32 way is fine too, but yes, it probably isn't needed in real
> life unless you do a huge amount of I/O at the same time. If you want

It's not strictly needed, but it does buy us 3 extra gigabytes to do I/O from on a PAE-enabled x86.

> to reduce the amount of changes you can defer the zone_dma32 patch and
> possibly plug it in later.

Yes, I made the patches modular for this reason. Thanks!

--
Jens Axboe
Re: [patch] 4GB I/O, cut three
[ my usual email is offline at the moment, please CC to [EMAIL PROTECTED] for anything urgent until the problem is fixed ]

On Wed, May 30, 2001 at 11:55:38AM +0200, Jens Axboe wrote:
> On Wed, May 30 2001, Mark Hemment wrote:
> > Hi Jens,
> >
> > I ran this (well, cut-two) on a 4-way box with 4GB of memory and a
> > modified qlogic fibre channel driver with 32 disks hanging off it,
> > without any problems. The test used was SpecFS 2.0.
>
> Cool, could you send me the qlogic diff? It's the one-liner can_dma32
> change I'm interested in, I'm just not sure what driver you used :-)
> I'll add that to the patch then. Basically all the PCI cards should
> work, I'm just being cautious and only enabling highmem I/O for the
> ones that have been tested.
>
> > Performance is definitely up - but I can't give an exact number, as
> > the run with this patch was compiled with no-omit-frame-pointer for
> > debugging any probs.
>
> Good
>
> > I did change the patch so that bounce-pages always come from the
> > NORMAL zone, hence the ZONE_DMA32 zone isn't needed. I avoided the
> > new zone, as I'm not 100% sure the VM is capable of keeping the
> > zones it already has balanced - and adding another one might break
> > the camel's back. But as the test box has 4GB, it wasn't bouncing
> > anyway.
>
> You are right, this is definitely something that needs checking. I
> really want this to work though. Rik, Andrea? Will the balancing
> handle the extra zone?

The bounces can come from ZONE_NORMAL without problems; however, the ZONE_DMA32 way is fine too, but yes, it probably isn't needed in real life unless you do a huge amount of I/O at the same time. If you want to reduce the amount of changes you can defer the zone_dma32 patch and possibly plug it in later.
Andrea
Re: [patch] 4GB I/O, cut three
On Wed, May 30 2001, Mark Hemment wrote:
> On Wed, 30 May 2001, Jens Axboe wrote:
> > On Wed, May 30 2001, Mark Hemment wrote:
> > > This can lead to attempt_merge() releasing the embedded request
> > > structure (which, as an exact copy, has the ->q set, so to
> > > blkdev_release_request() it looks like a request which originated
> > > from the block layer). This isn't too healthy.
> > >
> > > The fix here is to add a check in __scsi_merge_requests_fn() to
> > > check for ->special being non-NULL.
> >
> > How about just adding
> >
> > 	if (req->cmd != next->cmd
> > 	    || req->rq_dev != next->rq_dev
> > 	    || req->nr_sectors + next->nr_sectors > q->max_sectors
> > 	    || next->sem || req->special)
> > 		return;
> >
> > ie check for special too, that would make sense to me. Either way
> > would work, but I'd rather make it explicit in the block layer that
> > 'not normal' requests are left alone. That includes anything with
> > the sem set, or special.
>
> Yes, that is an equivalent fix.
>
> In the original patch I wanted to keep the change local (ie. in the
> SCSI layer). Pushing the check up to the generic block layer makes
> sense.

Ok, so we agree.

> Are you going to push this change to Linus, or should I?
> I'm assuming the other scsi-layer changes in Alan's tree will
> eventually be pushed.

I'll push it, and I'll do the end_that_request_first thing too.

--
Jens Axboe
Re: [patch] 4GB I/O, cut three
On Wed, 30 May 2001, Jens Axboe wrote:
> On Wed, May 30 2001, Mark Hemment wrote:
> > This can lead to attempt_merge() releasing the embedded request
> > structure (which, as an exact copy, has the ->q set, so to
> > blkdev_release_request() it looks like a request which originated
> > from the block layer). This isn't too healthy.
> >
> > The fix here is to add a check in __scsi_merge_requests_fn() to
> > check for ->special being non-NULL.
>
> How about just adding
>
> 	if (req->cmd != next->cmd
> 	    || req->rq_dev != next->rq_dev
> 	    || req->nr_sectors + next->nr_sectors > q->max_sectors
> 	    || next->sem || req->special)
> 		return;
>
> ie check for special too, that would make sense to me. Either way
> would work, but I'd rather make it explicit in the block layer that
> 'not normal' requests are left alone. That includes anything with the
> sem set, or special.

Yes, that is an equivalent fix.

In the original patch I wanted to keep the change local (ie. in the SCSI layer). Pushing the check up to the generic block layer makes sense.

Are you going to push this change to Linus, or should I? I'm assuming the other scsi-layer changes in Alan's tree will eventually be pushed.

Mark
Re: [patch] 4GB I/O, cut three
On Wed, May 30 2001, Mark Hemment wrote:
> Hi again, :)
>
> On Tue, 29 May 2001, Jens Axboe wrote:
> > Another day, another version.
> >
> > Bugs fixed in this version: none
> > Known bugs in this version: none
> >
> > In other words, it's perfect of course.
>
> With the scsi-high patch, I'm not sure about the removal of this line
> from __scsi_end_request():
>
> 	req->buffer = bh->b_data;

Why?

> A requeued request is not always processed immediately, so new
> buffer-heads arriving at the block layer can be merged against it. A
> requeued request is placed at the head of a request list, so nothing
> can merge with it - but what about if multiple requests are requeued
> on the same queue?

You forget that SCSI is not head-active, so there can indeed be merges against a request that was re-added to the queue list.

> When processing the completion of a SCSI request in a bottom-half,
> __scsi_end_request() can find that not all the buffers associated
> with the request have been completed (ie. leftovers).
>
> One question is: can this ever happen?

Yes, it can happen.

> The request is re-queued to the block layer via
> scsi_queue_next_request(), which uses the "special" pointer in the
> request structure to remember the Scsi_Cmnd associated with the
> request. The SCSI request function is then called, but doesn't
> guarantee to immediately process the re-queued request even though it
> was added at the head (say, the queue has become plugged). This can
> trigger two possible bugs.
>
> The first is that __scsi_end_request() doesn't decrement the
> hard_nr_sectors count in the request. As the request is back on the
> queue, it is possible for newly arriving buffer-heads to merge with
> the heads already hanging off the request. This merging uses the
> hard_nr_sectors when calculating both the merged hard_nr_sectors and
> nr_sectors counts.

Right, that looks like a bug. I would actually prefer SCSI to use end_that_request_first here.

> As the request is at the head, only back-merging can occur, but if
> __scsi_end_request() triggers another uncompleted request to be
> re-queued, it is possible to get front merging as well.

There can be front merges too. If a head is active, then no merging can occur. But for SCSI, the front request must always be in a sane state, or bad things can happen, like you describe.

> The merging of a re-queued request looks safe, except for the
> hard_nr_sectors. This patch corrects the hard_nr_sectors accounting.

Right.

> The second bug is from request merging in attempt_merge().
>
> For a re-queued request, the request structure is the one embedded in
> the Scsi_Cmnd (which is a copy of the request taken in
> scsi_request_fn).
> In attempt_merge(), q->merge_requests_fn() is called to see whether
> the requests are allowed to merge. __scsi_merge_requests_fn() checks
> number of segments, etc, but doesn't check if one of the requests is
> a re-queued one (ie. no test against ->special).
> This can lead to attempt_merge() releasing the embedded request
> structure (which, as an exact copy, has the ->q set, so to
> blkdev_release_request() it looks like a request which originated
> from the block layer). This isn't too healthy.
>
> The fix here is to add a check in __scsi_merge_requests_fn() to check
> for ->special being non-NULL.

How about just adding

	if (req->cmd != next->cmd
	    || req->rq_dev != next->rq_dev
	    || req->nr_sectors + next->nr_sectors > q->max_sectors
	    || next->sem || req->special)
		return;

ie check for special too; that would make sense to me. Either way would work, but I'd rather make it explicit in the block layer that 'not normal' requests are left alone. That includes anything with the sem set, or special.

--
Jens Axboe
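The guard being proposed can be shown as a self-contained sketch. The struct below is a heavily reduced stand-in for the 2.4 struct request (the real __scsi_merge_requests_fn() also checks segment counts and queue limits); only the fields needed for the test are kept, and requests_may_merge() is a hypothetical helper name, not a kernel symbol.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for the 2.4 struct request; illustrative only. */
struct request {
    int cmd;                   /* READ or WRITE */
    int rq_dev;                /* target device */
    unsigned long nr_sectors;  /* size of the request in sectors */
    void *sem;                 /* completion semaphore, if any */
    void *special;             /* driver-private, e.g. a requeued Scsi_Cmnd */
};

/*
 * The merge test under discussion: beyond the usual command, device,
 * and size checks, refuse to merge when either request is "not
 * normal" -- next has a semaphore attached, or req carries
 * driver-private state (a requeued SCSI command whose request
 * structure is embedded in the Scsi_Cmnd). Returns nonzero when
 * merging is allowed.
 */
static int requests_may_merge(const struct request *req,
                              const struct request *next,
                              unsigned long max_sectors)
{
    if (req->cmd != next->cmd
        || req->rq_dev != next->rq_dev
        || req->nr_sectors + next->nr_sectors > max_sectors
        || next->sem || req->special)
        return 0;
    return 1;
}
```

Two 8-sector requests on the same device merge under a 128-sector limit; setting ->special on the front request (modelling a requeued command) blocks the merge, which is exactly the case that previously let attempt_merge() free the embedded request.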
Re: [patch] 4GB I/O, cut three
Hi again, :)

On Tue, 29 May 2001, Jens Axboe wrote:
> Another day, another version.
>
> Bugs fixed in this version: none
> Known bugs in this version: none
>
> In other words, it's perfect of course.

With the scsi-high patch, I'm not sure about the removal of this line from __scsi_end_request():

	req->buffer = bh->b_data;

A requeued request is not always processed immediately, so new buffer-heads arriving at the block layer can be merged against it. A requeued request is placed at the head of a request list, so nothing can merge with it - but what about if multiple requests are requeued on the same queue?

In Linus's tree, requests requeued via the SCSI layer can cause problems (corruption). I sent out a patch to cover this a few months back, which got picked up by Alan (it's in the -ac series - see the changes to scsi_lib.c and scsi_merge.c) but no one posted any feedback. I've included some of the original message below.

Mark

--

From [EMAIL PROTECTED] Sat Mar 31 16:07:14 2001 +0100
Date: Sat, 31 Mar 2001 16:07:13 +0100 (BST)
From: Mark Hemment <[EMAIL PROTECTED]>
Subject: [PATCH] Possible SCSI + block-layer bugs

Hi,

I've never seen these trigger, but they look theoretically possible.

When processing the completion of a SCSI request in a bottom-half, __scsi_end_request() can find that not all the buffers associated with the request have been completed (ie. leftovers).

One question is: can this ever happen? If it can't, then the code should be removed from __scsi_end_request(); if it can happen, then there appear to be a few problems.

The request is re-queued to the block layer via scsi_queue_next_request(), which uses the "special" pointer in the request structure to remember the Scsi_Cmnd associated with the request. The SCSI request function is then called, but doesn't guarantee to immediately process the re-queued request even though it was added at the head (say, the queue has become plugged). This can trigger two possible bugs.

The first is that __scsi_end_request() doesn't decrement the hard_nr_sectors count in the request. As the request is back on the queue, it is possible for newly arriving buffer-heads to merge with the heads already hanging off the request. This merging uses the hard_nr_sectors when calculating both the merged hard_nr_sectors and nr_sectors counts.

As the request is at the head, only back-merging can occur, but if __scsi_end_request() triggers another uncompleted request to be re-queued, it is possible to get front merging as well.

The merging of a re-queued request looks safe, except for the hard_nr_sectors. This patch corrects the hard_nr_sectors accounting.

The second bug is from request merging in attempt_merge().

For a re-queued request, the request structure is the one embedded in the Scsi_Cmnd (which is a copy of the request taken in scsi_request_fn). In attempt_merge(), q->merge_requests_fn() is called to see whether the requests are allowed to merge. __scsi_merge_requests_fn() checks number of segments, etc, but doesn't check if one of the requests is a re-queued one (ie. no test against ->special).

This can lead to attempt_merge() releasing the embedded request structure (which, as an exact copy, has the ->q set, so to blkdev_release_request() it looks like a request which originated from the block layer). This isn't too healthy.

The fix here is to add a check in __scsi_merge_requests_fn() to check for ->special being non-NULL.
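The accounting problem can be illustrated with a small sketch. complete_sectors() below is a hypothetical helper (the real work happens inside __scsi_end_request()), and the struct keeps only the two counters that matter: if only nr_sectors were decremented, a later back-merge that computes the merged size from hard_nr_sectors would over-count.

```c
#include <assert.h>

/* Minimal slice of the 2.4 struct request for this illustration. */
struct request {
    unsigned long nr_sectors;      /* sectors left in the request */
    unsigned long hard_nr_sectors; /* used when computing merged sizes */
};

/*
 * Hypothetical partial-completion helper modelling the fix: when
 * 'done' sectors complete and the remainder is requeued, BOTH
 * counters must shrink, because the merge code sizes the request by
 * hard_nr_sectors. Omitting the second decrement is the bug
 * described above. Returns the number of sectors still outstanding.
 */
static unsigned long complete_sectors(struct request *req, unsigned long done)
{
    req->nr_sectors -= done;
    req->hard_nr_sectors -= done;   /* the decrement the patch adds */
    return req->nr_sectors;
}
```

After completing 8 of 24 sectors, both counters read 16, so a back-merge of an 8-sector buffer-head yields a 24-sector request rather than a phantom 32-sector one.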
Re: [patch] 4GB I/O, cut three
On Wed, 30 May 2001, Jens Axboe wrote:
> On Wed, May 30 2001, Mark Hemment wrote:
> > Hi Jens,
> >
> > I ran this (well, cut-two) on a 4-way box with 4GB of memory and a
> > modified qlogic fibre channel driver with 32 disks hanging off it,
> > without any problems. The test used was SpecFS 2.0.
>
> Cool, could you send me the qlogic diff? It's the one-liner can_dma32
> change I'm interested in, I'm just not sure what driver you used :-)

The qlogic driver is the one from:

	http://www.feral.com/isp.html

I find this much more stable than the one already in the kernel. It did just need the one-liner change, but as the driver isn't in the kernel there isn't much point adding the change to your patch. :)

> > I did change the patch so that bounce-pages always come from the
> > NORMAL zone, hence the ZONE_DMA32 zone isn't needed. I avoided the
> > new zone, as I'm not 100% sure the VM is capable of keeping the
> > zones it already has balanced - and adding another one might break
> > the camel's back. But as the test box has 4GB, it wasn't bouncing
> > anyway.
>
> You are right, this is definitely something that needs checking. I
> really want this to work though. Rik, Andrea? Will the balancing
> handle the extra zone?

In theory it should do - ie. there isn't anything to stop it. With NFS loads, over a ported VxFS filesystem, I do see some problems between the NORMAL and HIGH zones. Thinking about it, ZONE_DMA32 shouldn't make this any worse.

Rik, Andrea, a quick description of a balancing problem:

Consider a VM which is under load (but not stressed), such that all zone free-page pools are between their MIN and LOW marks, with pages in the inactive_clean lists.

The NORMAL zone has non-zero page order allocations thrown at it. This causes __alloc_pages() to reap pages from the NORMAL inactive_clean list until the required buddy is built. The blind reaping leaves the NORMAL zone with a large number of free pages (greater than ->pages_low).

Now, when HIGHMEM allocations come in (for page cache pages), they skip the HIGH zone and use the NORMAL zone (as it now has plenty of free pages) - the code at the top of __alloc_pages(), which checks against ->pages_low. But the NORMAL zone is usually under more pressure than the HIGH zone, as many more allocations need ready-mapped memory. This causes the page-cache pages from the NORMAL zone to come under more pressure, and they are "recycled" quicker than page-cache pages in the HIGHMEM zone.

OK, we shouldn't be throwing too many non-zero page order allocations at __alloc_pages(), but it does happen. Also, the problem isn't as bad as it first looks - HIGHMEM page-cache pages do get "recycled" (reclaimed), but there is a slight imbalance.

Mark
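The scenario above can be made concrete with a toy model. All the numbers and the reclaim loop are invented for illustration; they only mirror the shape of the problem: blind reaping to build a high-order buddy leaves NORMAL far above pages_low, so the watermark check then steers HIGHMEM page-cache allocations into NORMAL.

```c
#include <assert.h>

/* Toy zone with the two fields that matter for the watermark check. */
struct zone {
    unsigned long free_pages;
    unsigned long pages_low;
};

/*
 * Model of blind reaping for a non-zero order allocation: keep
 * reclaiming batches of pages until enough free-page "budget" exists
 * to build an order-'order' buddy. Real reclaim frees specific pages
 * from the inactive_clean list; here we only bump the free count,
 * which is the net effect being described. The x8 overshoot factor
 * is invented, standing in for however many pages must be freed
 * before a contiguous buddy happens to form.
 */
static void reap_for_order(struct zone *z, int order, unsigned long per_pass)
{
    unsigned long needed = (1UL << order) * 8;

    while (z->free_pages < needed)
        z->free_pages += per_pass;
}

/* The watermark test from the top of __alloc_pages(), simplified. */
static int zone_usable(const struct zone *z)
{
    return z->free_pages > z->pages_low;
}
```

Start NORMAL between its MIN and LOW marks, throw one order-4 allocation at it, and the zone comes out above pages_low, so subsequent HIGHMEM allocations will pass the watermark test on NORMAL and land there.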
Re: [patch] 4GB I/O, cut three
On Wed, May 30 2001, Mark Hemment wrote:
> Hi Jens,
>
> I ran this (well, cut-two) on a 4-way box with 4GB of memory and a
> modified qlogic fibre channel driver with 32 disks hanging off it,
> without any problems. The test used was SpecFS 2.0.

Cool, could you send me the qlogic diff? It's the one-liner can_dma32 change I'm interested in, I'm just not sure what driver you used :-) I'll add that to the patch then. Basically all the PCI cards should work, I'm just being cautious and only enabling highmem I/O for the ones that have been tested.

> Performance is definitely up - but I can't give an exact number, as
> the run with this patch was compiled with no-omit-frame-pointer for
> debugging any probs.

Good

> I did change the patch so that bounce-pages always come from the
> NORMAL zone, hence the ZONE_DMA32 zone isn't needed. I avoided the
> new zone, as I'm not 100% sure the VM is capable of keeping the zones
> it already has balanced - and adding another one might break the
> camel's back. But as the test box has 4GB, it wasn't bouncing anyway.

You are right, this is definitely something that needs checking. I really want this to work though. Rik, Andrea? Will the balancing handle the extra zone?

--
Jens Axboe
Re: [patch] 4GB I/O, cut three
Hi Jens,

I ran this (well, cut-two) on a 4-way box with 4GB of memory and a modified qlogic fibre channel driver with 32 disks hanging off it, without any problems. The test used was SpecFS 2.0.

Performance is definitely up - but I can't give an exact number, as the run with this patch was compiled with no-omit-frame-pointer for debugging any probs.

I did change the patch so that bounce-pages always come from the NORMAL zone, hence the ZONE_DMA32 zone isn't needed. I avoided the new zone, as I'm not 100% sure the VM is capable of keeping the zones it already has balanced - and adding another one might break the camel's back. But as the test box has 4GB, it wasn't bouncing anyway.

Mark

On Tue, 29 May 2001, Jens Axboe wrote:
> Another day, another version.
>
> Bugs fixed in this version: none
> Known bugs in this version: none
>
> In other words, it's perfect of course.
>
> Changes:
>
> - Added ide-dma segment coalescing
> - Only print highmem I/O enable info when HIGHMEM is actually set
>
> Please give it a test spin, especially if you have 1GB of RAM or more.
> You should see something like this when booting:
>
> hda: enabling highmem I/O
> ...
> SCSI: channel 0, id 0: enabling highmem I/O
>
> depending on drive configuration etc.
>
> Plea to maintainers of the different architectures: could you please
> add the arch parts to support this? This includes:
>
> - memory zoning at init time
> - page_to_bus
> - pci_map_page / pci_unmap_page
> - set_bh_sg
> - KM_BH_IRQ (for HIGHMEM archs)
>
> I think that's it, feel free to send me questions and (even better)
> patches.
Re: [patch] 4GB I/O, cut three
Hi Jens, I ran this (well, cut-two) on a 4-way box with 4GB of memory and a modified qlogic fibre channel driver with 32disks hanging off it, without any problems. The test used was SpecFS 2.0 Peformance is definitely up - but I can't give an exact number, as the run with this patch was compiled with no-omit-frame-pointer for debugging any probs. I did change the patch so that bounce-pages always come from the NORMAL zone, hence the ZONE_DMA32 zone isn't needed. I avoided the new zone, as I'm not 100% sure the VM is capable of keeping the zones it already has balanced - and adding another one might break the camels back. But as the test box has 4GB, it wasn't bouncing anyway. Mark On Tue, 29 May 2001, Jens Axboe wrote: Another day, another version. Bugs fixed in this version: none Known bugs in this version: none In other words, it's perfect of course. Changes: - Added ide-dma segment coalescing - Only print highmem I/O enable info when HIGHMEM is actually set Please give it a test spin, especially if you have 1GB of RAM or more. You should see something like this when booting: hda: enabling highmem I/O ... SCSI: channel 0, id 0: enabling highmem I/O depending on drive configuration etc. Plea to maintainers of the different architectures: could you please add the arch parts to support this? This includes: - memory zoning at init time - page_to_bus - pci_map_page / pci_unmap_page - set_bh_sg - KM_BH_IRQ (for HIGHMEM archs) I think that's it, feel free to send me questions and (even better) patches. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] 4GB I/O, cut three
On Wed, May 30 2001, Mark Hemment wrote: Hi Jens, I ran this (well, cut-two) on a 4-way box with 4GB of memory and a modified qlogic fibre channel driver with 32disks hanging off it, without any problems. The test used was SpecFS 2.0 Cool, could you send me the qlogic diff? It's the one-liner can_dma32 chance I'm interested in, I'm just not sure what driver you used :-) I'll add that to the patch then. Basically all the PCI cards should work, I'm just being cautious and only enabling highmem I/O to the ones that have been tested. Peformance is definitely up - but I can't give an exact number, as the run with this patch was compiled with no-omit-frame-pointer for debugging any probs. Good I did change the patch so that bounce-pages always come from the NORMAL zone, hence the ZONE_DMA32 zone isn't needed. I avoided the new zone, as I'm not 100% sure the VM is capable of keeping the zones it already has balanced - and adding another one might break the camels back. But as the test box has 4GB, it wasn't bouncing anyway. You are right, this is definitely something that needs checking. I really want this to work though. Rik, Andrea? Will the balancing handle the extra zone? -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] 4GB I/O, cut three
On Wed, 30 May 2001, Jens Axboe wrote:
> On Wed, May 30 2001, Mark Hemment wrote:
> > Hi Jens,
> >
> > I ran this (well, cut-two) on a 4-way box with 4GB of memory and a
> > modified qlogic fibre channel driver with 32 disks hanging off it,
> > without any problems. The test used was SpecFS 2.0.
>
> Cool, could you send me the qlogic diff? It's the one-liner can_dma32
> change I'm interested in, I'm just not sure what driver you used :-)

The qlogic driver is the one from;
	http://www.feral.com/isp.html
I find this much more stable than the one already in the kernel. It did
just need the one-liner change, but as the driver isn't in the kernel
there isn't much point adding the change to your patch. :)

> > I did change the patch so that bounce-pages always come from the
> > NORMAL zone, hence the ZONE_DMA32 zone isn't needed. I avoided the
> > new zone, as I'm not 100% sure the VM is capable of keeping the
> > zones it already has balanced - and adding another one might break
> > the camel's back. But as the test box has 4GB, it wasn't bouncing
> > anyway.
>
> You are right, this is definitely something that needs checking. I
> really want this to work though. Rik, Andrea? Will the balancing
> handle the extra zone?

In theory it should do - ie. there isn't anything to stop it.

With NFS loads, over a ported VxFS filesystem, I do see some problems
between the NORMAL and HIGH zones. Thinking about it, ZONE_DMA32
shouldn't make this any worse.

Rik, Andrea, quick description of a balancing problem;

Consider a VM which is under load (but not stressed), such that all
zone free-page pools are between their MIN and LOW marks, with pages in
the inactive_clean lists.

The NORMAL zone has non-zero page order allocations thrown at it. This
causes __alloc_pages() to reap pages from the NORMAL inactive_clean
list until the required buddy is built. The blind reaping causes the
NORMAL zone to have a large number of free pages (greater than
->pages_low).

Now, when HIGHMEM allocations come in (for page cache pages), they skip
the HIGH zone and use the NORMAL zone (as it now has plenty of free
pages) - the code at the top of __alloc_pages(), which checks against
->pages_low. But the NORMAL zone is usually under more pressure than
the HIGH zone - as many more allocations need ready-mapped memory. This
causes the page-cache pages from the NORMAL zone to come under more
pressure, and they are re-cycled quicker than page-cache pages in the
HIGHMEM zone.

OK, we shouldn't be throwing too many non-zero page allocations at
__alloc_pages(), but it does happen. Also, the problem isn't as bad as
it first looks - HIGHMEM page-cache pages do get recycled (reclaimed),
but there is a slight imbalance.

Mark
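[Editorial aside: Mark's scenario can be sketched as a toy model of the
first-pass zonelist scan he describes. This is a hypothetical
simplification with made-up numbers, not the kernel's actual
__alloc_pages() code.]

```python
def pick_zone(zonelist):
    """First pass of the allocator: take the first zone whose free page
    count is above its pages_low watermark; if no zone qualifies, return
    None (in the real kernel this is where balancing would start)."""
    for zone in zonelist:
        if zone["free"] > zone["pages_low"]:
            return zone["name"]
    return None

# A HIGHMEM allocation scans HIGHMEM first, then falls back to NORMAL.
# Blind reaping for a non-zero order allocation has left NORMAL with far
# more free pages than its watermark requires.
highmem = {"name": "HIGHMEM", "free": 100, "pages_low": 128}
normal  = {"name": "NORMAL",  "free": 900, "pages_low": 128}

# HIGHMEM sits below pages_low, so the page-cache page lands in NORMAL,
# where it will see more reclaim pressure and be recycled sooner.
print(pick_zone([highmem, normal]))  # -> NORMAL
```

The imbalance falls out of the model: pages that could have lived in the
lightly used HIGH zone end up competing for the NORMAL zone instead.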
Re: [patch] 4GB I/O, cut three
[ my usual email is offline at the moment, please CC to
  [EMAIL PROTECTED] for anything urgent until the problem is fixed ]

On Wed, May 30, 2001 at 11:55:38AM +0200, Jens Axboe wrote:
> On Wed, May 30 2001, Mark Hemment wrote:
> > Hi Jens,
> >
> > I ran this (well, cut-two) on a 4-way box with 4GB of memory and a
> > modified qlogic fibre channel driver with 32 disks hanging off it,
> > without any problems. The test used was SpecFS 2.0.
>
> Cool, could you send me the qlogic diff? It's the one-liner can_dma32
> change I'm interested in, I'm just not sure what driver you used :-).
> I'll add that to the patch then. Basically all the PCI cards should
> work, I'm just being cautious and only enabling highmem I/O on the
> ones that have been tested.
>
> > Performance is definitely up - but I can't give an exact number, as
> > the run with this patch was compiled with no-omit-frame-pointer for
> > debugging any probs.
>
> Good.
>
> > I did change the patch so that bounce-pages always come from the
> > NORMAL zone, hence the ZONE_DMA32 zone isn't needed. I avoided the
> > new zone, as I'm not 100% sure the VM is capable of keeping the
> > zones it already has balanced - and adding another one might break
> > the camel's back. But as the test box has 4GB, it wasn't bouncing
> > anyway.
>
> You are right, this is definitely something that needs checking. I
> really want this to work though. Rik, Andrea? Will the balancing
> handle the extra zone?

The bounces can come from the ZONE_NORMAL without problems, however the
ZONE_DMA32 way is fine too, but yes probably it isn't needed in real
life unless you do a huge amount of I/O at the same time. If you want
to reduce the amount of changes you can defer the zone_dma32 patch and
possibly plug it in later.

Andrea
Re: [patch] 4GB I/O, cut three
On Wed, May 30 2001, Andrea Arcangeli wrote:
> > > I did change the patch so that bounce-pages always come from the
> > > NORMAL zone, hence the ZONE_DMA32 zone isn't needed. I avoided the
> > > new zone, as I'm not 100% sure the VM is capable of keeping the
> > > zones it already has balanced - and adding another one might break
> > > the camel's back. But as the test box has 4GB, it wasn't bouncing
> > > anyway.
> >
> > You are right, this is definitely something that needs checking. I
> > really want this to work though. Rik, Andrea? Will the balancing
> > handle the extra zone?
>
> The bounces can come from the ZONE_NORMAL without problems, however the

Of course.

> ZONE_DMA32 way is fine too, but yes probably it isn't needed in real
> life unless you do a huge amount of I/O at the same time. If you want

It's not strictly needed, but it does buy us 3 extra gig to do I/O
from, on a PAE-enabled x86.

> to reduce the amount of changes you can defer the zone_dma32 patch and
> possibly plug it in later.

Yes, I did modular patches for this reason. Thanks!

-- 
Jens Axboe
Re: [patch] 4GB I/O, cut three
On Wed, May 30, 2001 at 11:59:50AM +0100, Mark Hemment wrote:
> Now, when HIGHMEM allocations come in (for page cache pages), they
> skip the HIGH zone and use the NORMAL zone (as it now has plenty of
> free pages) - the code at the top of __alloc_pages(), which checks
> against ->pages_low.

btw, I think such a heuristic is horribly broken ;), the highmem zone
simply needs to be balanced if it is under the pages_low mark, just
skipping it and falling back into the normal zone that happens to be
above the low mark is the wrong thing to do.

> Also, the problem isn't as bad as it first looks - HIGHMEM page-cache
> pages do get recycled (reclaimed), but there is a slight imbalance.

there will always be some imbalance unless all allocations would be
capable of highmem (which will never happen). The only thing we can do
is to optimize the zone usage so we won't run out of normal pages
unless there was a good reason. Once we run out of normal pages we'll
simply return NULL and the reserved pool of highmem bounces will be
used instead (other callers will behave differently).

Andrea
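[Editorial aside: the fallback Andrea sketches - bounce pages come from
the NORMAL zone, and only when that allocation fails does a reserved
pool get tapped - can be modeled roughly as below. Names such as
`alloc_bounce` and `ReservedPool` are illustrative, not the kernel's.]

```python
class ReservedPool:
    """Toy stand-in for a pre-allocated emergency pool of bounce pages."""
    def __init__(self, n):
        self.pages = list(range(n))

    def take(self):
        return self.pages.pop() if self.pages else None

def alloc_bounce(normal_free, pool):
    """Try ZONE_NORMAL first; fall back to the reserved pool when the
    normal-zone allocation would return NULL."""
    if normal_free > 0:
        return "normal-page"
    page = pool.take()
    return "reserved-page" if page is not None else None

pool = ReservedPool(2)
print(alloc_bounce(1, pool))  # NORMAL has pages -> use it
print(alloc_bounce(0, pool))  # NORMAL exhausted -> reserved pool
```

The reserved pool bounds the damage of a NORMAL-zone shortage without
requiring a whole extra zone to stay balanced.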
Re: [patch] 4GB I/O, cut three
On Wed, 30 May 2001, [EMAIL PROTECTED] wrote:
> btw, I think such a heuristic is horribly broken ;), the highmem zone
> simply needs to be balanced if it is under the pages_low mark, just
> skipping it and falling back into the normal zone that happens to be
> above the low mark is the wrong thing to do.

2.3.51 did this, we all know the result.

Rik
-- 
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to [EMAIL PROTECTED] (spam digging piggy)
Re: [patch] 4GB I/O, cut three
On Wed, 30 May 2001, Jens Axboe wrote:
> You are right, this is definitely something that needs checking. I
> really want this to work though. Rik, Andrea? Will the balancing
> handle the extra zone?

In as far as it handles balancing the current zones, it'll also work
with one more. In places where it's currently broken it will probably
also break with one extra zone, though the fact that the DMA32 zone
takes the pressure off the NORMAL zone might actually help.

regards,

Rik
-- 
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to [EMAIL PROTECTED] (spam digging piggy)
Re: [patch] 4GB I/O, cut three
On Wed, May 30, 2001 at 03:42:51PM -0300, Rik van Riel wrote:
> On Wed, 30 May 2001, [EMAIL PROTECTED] wrote:
> > btw, I think such a heuristic is horribly broken ;), the highmem
> > zone simply needs to be balanced if it is under the pages_low mark,
> > just skipping it and falling back into the normal zone that happens
> > to be above the low mark is the wrong thing to do.
>
> 2.3.51 did this, we all know the result.

I've no idea about what 2.3.51 does, but I was obviously wrong about
that. Forget what I said above.

Andrea
Re: [patch] 4GB I/O, cut three
Rik van Riel <[EMAIL PROTECTED]> writes:
> On Wed, 30 May 2001, [EMAIL PROTECTED] wrote:
> > btw, I think such a heuristic is horribly broken ;), the highmem
> > zone simply needs to be balanced if it is under the pages_low mark,
> > just skipping it and falling back into the normal zone that happens
> > to be above the low mark is the wrong thing to do.
>
> 2.3.51 did this, we all know the result.

Just a note, I remember the 2.3.51 kernel as the most usable kernel I
ever used talking about VM.

-- 
Yoann Vandoorselaere | "C makes it easy to shoot yourself in the foot.
MandrakeSoft         | C++ makes it harder, but when you do, it blows
                     | away your whole leg." - Bjarne Stroustrup
Re: [patch] 4GB I/O, cut three
On Wed, May 30, 2001 at 08:57:50PM +0200, Yoann Vandoorselaere wrote:
> I remember the 2.3.51 kernel as the most usable kernel I ever used
> talking about VM.

I also don't remember anything strange in that kernel about the VM (I
instead remember well the VM breakage introduced in 2.3.99-pre).

Regardless of what 2.3.51 was doing, the falling back into the lower
zones before starting the balancing is fine.

Andrea
Re: [patch] 4GB I/O, cut three
On Wed, 30 May 2001, Andrea Arcangeli wrote:
> On Wed, May 30, 2001 at 08:57:50PM +0200, Yoann Vandoorselaere wrote:
> > I remember the 2.3.51 kernel as the most usable kernel I ever used
> > talking about VM.
>
> I also don't remember anything strange in that kernel about the VM (I
> instead remember well the VM breakage introduced in 2.3.99-pre).
>
> Regardless of what 2.3.51 was doing, the falling back into the lower
> zones before starting the balancing is fine.

The problem with 2.3.51 was that it started balancing the HIGHMEM zone
before falling back. On a 1GB system this not only led to the system
starting to swap as soon as the 128MB highmem zone was filled up, it
also resulted in the other 900MB being essentially unused.

Having your 1GB system running as if it had 128MB definitely can be
classified as Not Fun.

Rik
-- 
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to [EMAIL PROTECTED] (spam digging piggy)
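[Editorial aside: the two policies Rik contrasts can be illustrated with
a toy model - made-up sizes and deliberately simplified logic, not the
actual 2.3.51 code. A 1GB box is split into a 128MB HIGHMEM zone and
roughly 900MB of lower zones.]

```python
def alloc_2_3_51(zones):
    """2.3.51-style: if the first (HIGHMEM) zone is below pages_low,
    balance it - i.e. start swapping - instead of falling back."""
    first = zones[0]
    if first["free"] <= first["pages_low"]:
        return "swap"  # the box behaves as if it only had 128MB
    return first["name"]

def alloc_fallback_first(zones):
    """Fallback-first: use up the lower zones before any balancing."""
    for zone in zones:
        if zone["free"] > zone["pages_low"]:
            return zone["name"]
    return "swap"

zones = [{"name": "HIGHMEM", "free": 0,    "pages_low": 32},
         {"name": "NORMAL",  "free": 9000, "pages_low": 32}]

print(alloc_2_3_51(zones))         # -> swap (900MB left unused)
print(alloc_fallback_first(zones)) # -> NORMAL
```

Same state, opposite outcomes: balancing before falling back starts
swapping while most of RAM sits idle; falling back first uses it.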
[patch] 4GB I/O, cut three
Hi,

Another day, another version.

Bugs fixed in this version: none
Known bugs in this version: none

In other words, it's perfect of course.

Changes:
- Added ide-dma segment coalescing
- Only print highmem I/O enable info when HIGHMEM is actually set

Please give it a test spin, especially if you have 1GB of RAM or more.
You should see something like this when booting:

hda: enabling highmem I/O
...
SCSI: channel 0, id 0: enabling highmem I/O

depending on drive configuration etc.

Plea to maintainers of the different architectures: could you please
add the arch parts to support this? This includes:

- memory zoning at init time
- page_to_bus
- pci_map_page / pci_unmap_page
- set_bh_sg
- KM_BH_IRQ (for HIGHMEM archs)

I think that's it, feel free to send me questions and (even better)
patches.

-- 
Jens Axboe