Re: [PATCH] Avoiding fragmentation through different allocator
On Tue, Jan 25, 2005 at 09:02:34AM -0500, Mukker, Atul wrote: > The megaraid driver is open source, do you see anything that driver can do > to improve performance. We would greatly appreciate any feedback in this > regard and definitely incorporate in the driver. The FW under Linux and > windows is same, so I do not see how the megaraid stack should perform > differently under Linux and windows? Just to second what Andy already stated: it's more likely the Megaraid firmware could be better at fetching the SG lists. This is a difficult problem since the firmware needs to work well on so many different platforms/chipsets. If LSI has time to turn more stones, get a PCI bus analyzer and filter it to only capture CPU MMIO traffic and DMA traffic to/from some "well known" SG lists (ie instrument the driver to print those to the console). Then run AIM7 or similar multithreaded workload. A perfect PCI trace will show the device pulling the SG list in cacheline at time after the CPU MMIO reads/writes from the card to indicate a new transaction is ready to go. Another stone LSI could turn is to verify the megaraid controller is NOT contending with the CPU for cachelines used to build SG lists. This something the driver controls but I only know how to measure this on ia64 machines (with pfmon or caliper or similar tool). If you want examples, see http://iou.parisc-linux.org/ols2004/pfmon_for_iodorks.pdf In case it's not clear from above, optimal IO flow means the device is moving control data and streaming data in cacheline or bigger units. If Megaraid is already doing that, then the PCI trace timing info should point at where the latencies are. hth, grant - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Tue, 25 Jan 2005, Andi Kleen wrote: > On Tue, Jan 25, 2005 at 09:02:34AM -0500, Mukker, Atul wrote: > > > > > e.g. performance on megaraid controllers (very popular > > > because a big PC vendor ships them) was always quite bad on > > > Linux. Up to the point that specific IO workloads run half as > > > fast on a megaraid compared to other controllers. I heard > > > they do work better on Windows. > > > > > > > > Ideally the Linux IO patterns would look similar to the > > > Windows IO patterns, then we could reuse all the > > > optimizations the controller vendors did for Windows :) > > > > LSI would leave no stone unturned to make the performance better for > > megaraid controllers under Linux. If you have some hard data in relation to > > comparison of performance for adapters from other vendors, please share with > > us. We would definitely strive to better it. > > Sorry for being vague on this. I don't have much hard data on this, > just telling an annecdote. The issue we saw was over a year ago > and on a machine running an IO intensive multi process stress test > (I believe it was an AIM7 variant with some tweaked workfile). When the test > was moved to a machine with megaraid controller it ran significantly > lower, compared to the old setup with a non RAID SCSI controller from > a different vendor. I unfortunately don't know anymore the exact > type/firmware revision etc. of the megaraid that showed the problem. > Ok, for me here, the bottom line is that decent hardware will not benefit from help from the allocator. Worse, if the work required to provide adjacent pages is high, it will even adversly affect throughput. I know as well that to have physically contiguous pages in userspace would involve a fair amount of overhead so even if we devise a system for providing them, it would need to be a configurable option. I will keep an eye out for a means of granting physically contiguous pages for userspace in a lightweight manner but I'm going to focus on general availability of large pages for TLBs, extend the system for a pool of zero'd pages and how it can be adapted to help out the hotplug folks. The system I have in mind for contiguous pages for userspace right now is to extend the allocator API so that prefaulting and readahead will request blocks of pages for userspace rather than a series of order-0 pages. So, if we prefault 32 pages ahead, the allocator would have a new API that would return 32 pages that are physically contiguous. That, in combination with forced IOMMU may show if Contiguous Pages For IO is worth it or not. This will take a while as I'll have to develop some mechanism for measuring it while I'm at it and I only do this 2 days a week so it'll take a while. -- Mel Gorman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Tue, Jan 25, 2005 at 09:02:34AM -0500, Mukker, Atul wrote: > > > e.g. performance on megaraid controllers (very popular > > because a big PC vendor ships them) was always quite bad on > > Linux. Up to the point that specific IO workloads run half as > > fast on a megaraid compared to other controllers. I heard > > they do work better on Windows. > > > > > Ideally the Linux IO patterns would look similar to the > > Windows IO patterns, then we could reuse all the > > optimizations the controller vendors did for Windows :) > > LSI would leave no stone unturned to make the performance better for > megaraid controllers under Linux. If you have some hard data in relation to > comparison of performance for adapters from other vendors, please share with > us. We would definitely strive to better it. Sorry for being vague on this. I don't have much hard data on this, just telling an annecdote. The issue we saw was over a year ago and on a machine running an IO intensive multi process stress test (I believe it was an AIM7 variant with some tweaked workfile). When the test was moved to a machine with megaraid controller it ran significantly lower, compared to the old setup with a non RAID SCSI controller from a different vendor. I unfortunately don't know anymore the exact type/firmware revision etc. of the megaraid that showed the problem. If you have already fixed the issues then please accept my apologies. > The megaraid driver is open source, do you see anything that driver can do > to improve performance. We would greatly appreciate any feedback in this > regard and definitely incorporate in the driver. The FW under Linux and > windows is same, so I do not see how the megaraid stack should perform > differently under Linux and windows? My understanding (may be incomplete) of the issue is basically what Steve said: something in the stack doesn't like the Linux IO patterns with often relatively long SG lists, which are longer than in some other popular OS. This is unlikely to be the Linux driver (drivers tend to just pass the SG lists through without too much processing), more likely it was the firmware or something below. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Tue, Jan 25, 2005 at 02:27:57PM +, Christoph Hellwig wrote: > > It is not the driver per se, but the way the memory which is the I/O > > source/target is presented to the driver. In linux there is a good > > chance it will have to use more scatter gather elements to represent > > the same amount of data. > > Note that a change made a few month ago after seeing issues with > aacraid means it's much more likely to see contingous memory, > there were some numbers on linux-scsi and/or linux-kernel. But only at the beginning. iirc after a few days of uptime and memory fragmentation it degenerates back to the old numbers. Perhaps the recent anti defragmentation work will help more. -Andi P.S.: on a AMD x86-64 box the theory can be relatively easily tested: just run with iommu=force,biomerge that will use the IOMMU to merge SG elements. I just don't recommend it for production because some errors are not well handled. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
> It is not the driver per se, but the way the memory which is the I/O > source/target is presented to the driver. In linux there is a good > chance it will have to use more scatter gather elements to represent > the same amount of data. Note that a change made a few month ago after seeing issues with aacraid means it's much more likely to see contingous memory, there were some numbers on linux-scsi and/or linux-kernel. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
Mukker, Atul wrote: LSI would leave no stone unturned to make the performance better for megaraid controllers under Linux. If you have some hard data in relation to comparison of performance for adapters from other vendors, please share with us. We would definitely strive to better it. The megaraid driver is open source, do you see anything that driver can do to improve performance. We would greatly appreciate any feedback in this regard and definitely incorporate in the driver. The FW under Linux and windows is same, so I do not see how the megaraid stack should perform differently under Linux and windows? It is not the driver per se, but the way the memory which is the I/O source/target is presented to the driver. In linux there is a good chance it will have to use more scatter gather elements to represent the same amount of data. Steve - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] Avoiding fragmentation through different allocator
> e.g. performance on megaraid controllers (very popular > because a big PC vendor ships them) was always quite bad on > Linux. Up to the point that specific IO workloads run half as > fast on a megaraid compared to other controllers. I heard > they do work better on Windows. > > Ideally the Linux IO patterns would look similar to the > Windows IO patterns, then we could reuse all the > optimizations the controller vendors did for Windows :) LSI would leave no stone unturned to make the performance better for megaraid controllers under Linux. If you have some hard data in relation to comparison of performance for adapters from other vendors, please share with us. We would definitely strive to better it. The megaraid driver is open source, do you see anything that driver can do to improve performance. We would greatly appreciate any feedback in this regard and definitely incorporate in the driver. The FW under Linux and windows is same, so I do not see how the megaraid stack should perform differently under Linux and windows? Thanks Atul Mukker Architect, Drivers and BIOS LSI Logic Corporation - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] Avoiding fragmentation through different allocator
e.g. performance on megaraid controllers (very popular because a big PC vendor ships them) was always quite bad on Linux. Up to the point that specific IO workloads run half as fast on a megaraid compared to other controllers. I heard they do work better on Windows. snip Ideally the Linux IO patterns would look similar to the Windows IO patterns, then we could reuse all the optimizations the controller vendors did for Windows :) LSI would leave no stone unturned to make the performance better for megaraid controllers under Linux. If you have some hard data in relation to comparison of performance for adapters from other vendors, please share with us. We would definitely strive to better it. The megaraid driver is open source, do you see anything that driver can do to improve performance. We would greatly appreciate any feedback in this regard and definitely incorporate in the driver. The FW under Linux and windows is same, so I do not see how the megaraid stack should perform differently under Linux and windows? Thanks Atul Mukker Architect, Drivers and BIOS LSI Logic Corporation - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
Mukker, Atul wrote: LSI would leave no stone unturned to make the performance better for megaraid controllers under Linux. If you have some hard data in relation to comparison of performance for adapters from other vendors, please share with us. We would definitely strive to better it. The megaraid driver is open source, do you see anything that driver can do to improve performance. We would greatly appreciate any feedback in this regard and definitely incorporate in the driver. The FW under Linux and windows is same, so I do not see how the megaraid stack should perform differently under Linux and windows? It is not the driver per se, but the way the memory which is the I/O source/target is presented to the driver. In linux there is a good chance it will have to use more scatter gather elements to represent the same amount of data. Steve - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
It is not the driver per se, but the way the memory which is the I/O source/target is presented to the driver. In linux there is a good chance it will have to use more scatter gather elements to represent the same amount of data. Note that a change made a few month ago after seeing issues with aacraid means it's much more likely to see contingous memory, there were some numbers on linux-scsi and/or linux-kernel. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Tue, Jan 25, 2005 at 02:27:57PM +, Christoph Hellwig wrote: It is not the driver per se, but the way the memory which is the I/O source/target is presented to the driver. In linux there is a good chance it will have to use more scatter gather elements to represent the same amount of data. Note that a change made a few month ago after seeing issues with aacraid means it's much more likely to see contingous memory, there were some numbers on linux-scsi and/or linux-kernel. But only at the beginning. iirc after a few days of uptime and memory fragmentation it degenerates back to the old numbers. Perhaps the recent anti defragmentation work will help more. -Andi P.S.: on a AMD x86-64 box the theory can be relatively easily tested: just run with iommu=force,biomerge that will use the IOMMU to merge SG elements. I just don't recommend it for production because some errors are not well handled. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Tue, Jan 25, 2005 at 09:02:34AM -0500, Mukker, Atul wrote: e.g. performance on megaraid controllers (very popular because a big PC vendor ships them) was always quite bad on Linux. Up to the point that specific IO workloads run half as fast on a megaraid compared to other controllers. I heard they do work better on Windows. snip Ideally the Linux IO patterns would look similar to the Windows IO patterns, then we could reuse all the optimizations the controller vendors did for Windows :) LSI would leave no stone unturned to make the performance better for megaraid controllers under Linux. If you have some hard data in relation to comparison of performance for adapters from other vendors, please share with us. We would definitely strive to better it. Sorry for being vague on this. I don't have much hard data on this, just telling an annecdote. The issue we saw was over a year ago and on a machine running an IO intensive multi process stress test (I believe it was an AIM7 variant with some tweaked workfile). When the test was moved to a machine with megaraid controller it ran significantly lower, compared to the old setup with a non RAID SCSI controller from a different vendor. I unfortunately don't know anymore the exact type/firmware revision etc. of the megaraid that showed the problem. If you have already fixed the issues then please accept my apologies. The megaraid driver is open source, do you see anything that driver can do to improve performance. We would greatly appreciate any feedback in this regard and definitely incorporate in the driver. The FW under Linux and windows is same, so I do not see how the megaraid stack should perform differently under Linux and windows? My understanding (may be incomplete) of the issue is basically what Steve said: something in the stack doesn't like the Linux IO patterns with often relatively long SG lists, which are longer than in some other popular OS. This is unlikely to be the Linux driver (drivers tend to just pass the SG lists through without too much processing), more likely it was the firmware or something below. -Andi - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Tue, 25 Jan 2005, Andi Kleen wrote: On Tue, Jan 25, 2005 at 09:02:34AM -0500, Mukker, Atul wrote: e.g. performance on megaraid controllers (very popular because a big PC vendor ships them) was always quite bad on Linux. Up to the point that specific IO workloads run half as fast on a megaraid compared to other controllers. I heard they do work better on Windows. snip Ideally the Linux IO patterns would look similar to the Windows IO patterns, then we could reuse all the optimizations the controller vendors did for Windows :) LSI would leave no stone unturned to make the performance better for megaraid controllers under Linux. If you have some hard data in relation to comparison of performance for adapters from other vendors, please share with us. We would definitely strive to better it. Sorry for being vague on this. I don't have much hard data on this, just telling an annecdote. The issue we saw was over a year ago and on a machine running an IO intensive multi process stress test (I believe it was an AIM7 variant with some tweaked workfile). When the test was moved to a machine with megaraid controller it ran significantly lower, compared to the old setup with a non RAID SCSI controller from a different vendor. I unfortunately don't know anymore the exact type/firmware revision etc. of the megaraid that showed the problem. Ok, for me here, the bottom line is that decent hardware will not benefit from help from the allocator. Worse, if the work required to provide adjacent pages is high, it will even adversly affect throughput. I know as well that to have physically contiguous pages in userspace would involve a fair amount of overhead so even if we devise a system for providing them, it would need to be a configurable option. I will keep an eye out for a means of granting physically contiguous pages for userspace in a lightweight manner but I'm going to focus on general availability of large pages for TLBs, extend the system for a pool of zero'd pages and how it can be adapted to help out the hotplug folks. The system I have in mind for contiguous pages for userspace right now is to extend the allocator API so that prefaulting and readahead will request blocks of pages for userspace rather than a series of order-0 pages. So, if we prefault 32 pages ahead, the allocator would have a new API that would return 32 pages that are physically contiguous. That, in combination with forced IOMMU may show if Contiguous Pages For IO is worth it or not. This will take a while as I'll have to develop some mechanism for measuring it while I'm at it and I only do this 2 days a week so it'll take a while. -- Mel Gorman - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Tue, Jan 25, 2005 at 09:02:34AM -0500, Mukker, Atul wrote: The megaraid driver is open source, do you see anything that driver can do to improve performance. We would greatly appreciate any feedback in this regard and definitely incorporate in the driver. The FW under Linux and windows is same, so I do not see how the megaraid stack should perform differently under Linux and windows? Just to second what Andy already stated: it's more likely the Megaraid firmware could be better at fetching the SG lists. This is a difficult problem since the firmware needs to work well on so many different platforms/chipsets. If LSI has time to turn more stones, get a PCI bus analyzer and filter it to only capture CPU MMIO traffic and DMA traffic to/from some well known SG lists (ie instrument the driver to print those to the console). Then run AIM7 or similar multithreaded workload. A perfect PCI trace will show the device pulling the SG list in cacheline at time after the CPU MMIO reads/writes from the card to indicate a new transaction is ready to go. Another stone LSI could turn is to verify the megaraid controller is NOT contending with the CPU for cachelines used to build SG lists. This something the driver controls but I only know how to measure this on ia64 machines (with pfmon or caliper or similar tool). If you want examples, see http://iou.parisc-linux.org/ols2004/pfmon_for_iodorks.pdf In case it's not clear from above, optimal IO flow means the device is moving control data and streaming data in cacheline or bigger units. If Megaraid is already doing that, then the PCI trace timing info should point at where the latencies are. hth, grant - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
Steve Lord <[EMAIL PROTECTED]> writes: > > I realize this is one data point on one end of the scale, but I > just wanted to make the point that there are cases where it > does matter. Hopefully William's little change from last > year has helped out a lot. There are more datapoints: e.g. performance on megaraid controllers (very popular because a big PC vendor ships them) was always quite bad on Linux. Up to the point that specific IO workloads run half as fast on a megaraid compared to other controllers. I heard they do work better on Windows. Also I did some experiments with coalescing SG lists in the Opteron IOMMU some time ago. With a MPT fusion controller and forcing all SG lists through the IOMMU so that the SCSI controller always only contiguous mappings I saw ~5% improvement on some IO tests. Unfortunately there are some problems that doesn't allow to enable this unconditionally. But it gives strong evidence that MPT Fusion prefers shorter SG lists too. So it seems to be worthwhile to optimize for shorter SG lists. Ideally the Linux IO patterns would look similar to the Windows IO patterns, then we could reuse all the optimizations the controller vendors did for Windows :) -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
James Bottomley wrote: Well, the basic advice would be not to worry too much about fragmentation from the point of view of I/O devices. They mostly all do scatter gather (SG) onboard as an intelligent processing operation and they're very good at it. No one has ever really measured an effect we can say "This is due to the card's SG engine". So, the rule we tend to follow is that if SG element reduction comes for free, we take it. The issue that actually causes problems isn't the reduction in processing overhead, it's that the device's SG list is usually finite in size and so it's worth conserving if we can; however it's mostly not worth conserving at the expense of processor cycles. Depends on the device at the other end of the scsi/fiber channel. We have seen the processor in raid devices get maxed out by linux when it is not maxed out by windows. Windows tends to be more device friendly (I hate to say it), by sending larger and fewer scatter gather elements than linux does. Running an LSI raid over fiberchannel with 4 ports, windows was able to sustain ~830 Mbytes/sec, basically channel speed using only 1500 commands a second. Linux peaked at 550 Mbytes/sec using over 4000 scsi commands to do it - the sustained rate was more like 350 Mbytes/sec, I think at the end of the day linux was sending 128K per scsi request. These numbers predate the current linux scsi and io code, and I do not have the hardware to rerun them right now. I realize this is one data point on one end of the scale, but I just wanted to make the point that there are cases where it does matter. Hopefully William's little change from last year has helped out a lot. Steve - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Mon, 2005-01-24 at 13:49 -0200, Marcelo Tosatti wrote: > So is it valid to affirm that on average an operation with one SG element > pointing to a 1MB > region is similar in speed to an operation with 16 SG elements each pointing > to a 64K > region due to the efficient onboard SG processing? it's within a few percent, yes. And the figures depend on how good the I/O card is at it. I can imagine there are some wildly varying I/O cards out there. However, also remember that 1MB of I/O is getting beyond what's sensible for a disc device anyway. The cable speed is much faster than the platter speed, so the device takes the I/O into its cache as it services it. If you overrun the cache it will burp (disconnect) and force a reconnection to get the rest (effectively splitting the I/O up anyway). This doesn't apply to arrays with huge caches, but it does to pretty much everything else. The average disc cache size is only a megabyte or so. James - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Mon, Jan 24, 2005 at 10:29:52AM -0200, Marcelo Tosatti wrote: > Grant Grundler and James Bottomley have been working on this area, > they might want to add some comments to this discussion. > > It seems HP (Grant et all) has pursued using big pages on IA64 (64K) > for this purpose. Marcello, That might have been Alex Williamson...but the reasons for 64K pages is to reduce TLB thrashing, not faster IO. On HP ZX1 boxes, SG performance is slightly better (max +5%) when going through the IOMMU than when bypassing it. The IOMMU can perfectly coalesce DMA pages but has a small CPU and DMA cost to do so as well. Otherwise, I totally agree with James. IO devices do scatter-gather pretty well and IO subsystems are tuned for page-size chunk or smaller anyway. ... > > I could keep digging, but I think the bottom line is that having large > > pages generally available rather than a fixed setting is desirable. > > Definately, yes. Thanks for the pointers. Big pages are good for CPU TLB and that's where most of the research has been done. I think IO devices have learned to cope with the fact the alot less has been (or can be for many workloads) done to coalesce IO pages. grant - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Mon, Jan 24, 2005 at 10:44:12AM -0600, James Bottomley wrote: > On Mon, 2005-01-24 at 10:29 -0200, Marcelo Tosatti wrote: > > Since the pages which compose IO operations are most likely sparse (not > > physically contiguous), > > the driver+device has to perform scatter-gather IO on the pages. > > > > The idea is that if we can have larger memory blocks scatter-gather IO can > > use less SG list > > elements (decreased CPU overhead, decreased device overhead, faster). > > > > Best scenario is where only one sg element is required (ie one huge > > physically contiguous block). > > > > Old devices/unprepared drivers which are not able to perform SG/IO > > suffer with sequential small sized operations. > > > > I'm far away from being a SCSI/ATA knowledgeable person, the storage people > > can > > help with expertise here. > > > > Grant Grundler and James Bottomley have been working on this area, they > > might want to > > add some comments to this discussion. > > > > It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for > > this purpose. > > Well, the basic advice would be not to worry too much about > fragmentation from the point of view of I/O devices. They mostly all do > scatter gather (SG) onboard as an intelligent processing operation and > they're very good at it. So is it valid to affirm that on average an operation with one SG element pointing to a 1MB region is similar in speed to an operation with 16 SG elements each pointing to a 64K region due to the efficient onboard SG processing? > No one has ever really measured an effect we can say "This is due to the > card's SG engine". So, the rule we tend to follow is that if SG element > reduction comes for free, we take it. The issue that actually causes > problems isn't the reduction in processing overhead, it's that the > device's SG list is usually finite in size and so it's worth conserving > if we can; however it's mostly not worth conserving at the expense of > processor cycles. > > The bottom line is that the I/O (block) subsystem is very efficient at > coalescing (both in block space and in physical memory space) and we've > got it to the point where it's about as efficient as it can be. If > you're going to give us better physical contiguity properties, we'll > take them, but if you spend extra cycles doing it, the chances are > you'll slow down the I/O throughput path. OK! thanks. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Mon, 2005-01-24 at 10:29 -0200, Marcelo Tosatti wrote: > Since the pages which compose IO operations are most likely sparse (not > physically contiguous), > the driver+device has to perform scatter-gather IO on the pages. > > The idea is that if we can have larger memory blocks scatter-gather IO can > use less SG list > elements (decreased CPU overhead, decreased device overhead, faster). > > Best scenario is where only one sg element is required (ie one huge > physically contiguous block). > > Old devices/unprepared drivers which are not able to perform SG/IO > suffer with sequential small sized operations. > > I'm far away from being a SCSI/ATA knowledgeable person, the storage people > can > help with expertise here. > > Grant Grundler and James Bottomley have been working on this area, they might > want to > add some comments to this discussion. > > It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for this > purpose. Well, the basic advice would be not to worry too much about fragmentation from the point of view of I/O devices. They mostly all do scatter gather (SG) onboard as an intelligent processing operation and they're very good at it. No one has ever really measured an effect we can say "This is due to the card's SG engine". So, the rule we tend to follow is that if SG element reduction comes for free, we take it. The issue that actually causes problems isn't the reduction in processing overhead, it's that the device's SG list is usually finite in size and so it's worth conserving if we can; however it's mostly not worth conserving at the expense of processor cycles. The bottom line is that the I/O (block) subsystem is very efficient at coalescing (both in block space and in physical memory space) and we've got it to the point where it's about as efficient as it can be. If you're going to give us better physical contiguity properties, we'll take them, but if you spend extra cycles doing it, the chances are you'll slow down the I/O throughput path. James - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
James and Grant added to CC. On Mon, Jan 24, 2005 at 01:28:47PM +, Mel Gorman wrote: > On Sat, 22 Jan 2005, Marcelo Tosatti wrote: > > > > > I was thinking that it would be nice to have a set of high-order > > > > intensive workloads, and I wonder what are the most common high-order > > > > allocation paths which fail. > > > > > > > > > > Agreed. As I am not fully sure what workloads require high-order > > > allocations, I updated VMRegress to keep track of the count of > > > allocations and released 0.11 > > > (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To > > > use it to track allocations, do the following > > > > > > < VMRegress instructions snipped> > > > > Great, excellent! Thanks. > > > > I plan to spend some time testing and trying to understand the vmregress > > package > > this week. > > > > The documentation is not in sync with the code as the package is fairly > large to maintain as a side-project. For the recent data I posted, The > interesting parts of the tools are; > > 1. bin/extfrag_stat.pl will display external fragmentation as a percentage > of each order. I can go more into the calculation of this if anyone is > interested. It does not require any special patches or modules > > 2. bin/intfrag_stat.pl will display internal fragmentation in the system. > Use the --man switch to get a list of all options. Linux occasionally > suffers badly from internal fragmentation but it's a problem for another > time > > 3. mapfrag_stat.pl is what I used to map where allocations are in the > address space. It requires the kernel patch in > kernel_patches/v2.6/trace_pagealloc-map-formbuddy.diff (there is a > non-mbuddy version in there) before the vmregress kernel modules can be > loaded > > 4. extfrag_stat_overtime.pl tracks external fragmentation over time > although the figures are not very useful. It can also graph what > fragmentation for some orders are over time. The figures are not useful > because the fragmentation figures are based on free pages and does not > take into account the layout of the currently allocated pages. > > 5. The module in src/test/highalloc.ko is what I used to test high-order > allocations. It creates a proc entry /proc/vmregress/test_highalloc that > can be read or written. "echo Order Pages > > /proc/vmregress/test_highalloc" will attempt to allocate 2^Order pages > "Pages" times. > > The perl scripts are smart enough to load the modules they need at runtime > if the modules have been installed with "make install". OK, thanks very much for the information - you might want to write this down into a text file and add it to the tarball :) > > > > It mostly depends on hardware because most high-order allocations happen > > > > inside device drivers? What are the kernel codepaths which try to do > > > > high-order allocations and fallback if failed? > > > > > > > > > > I'm not sure. I think that the paths we exercise right now will be largely > > > artifical. For example, you can force order-2 allocations by scping a > > > large file through localhost (because of the large MTU in that interface). > > > I have not come up with another meaningful workload that guarentees > > > high-order allocations yet. > > > > Thoughts and criticism of the following ideas are very much appreciated: > > > > In private conversation with wli (who helped me providing this information) > > we can > > conjecture the following: > > > > Modern IO devices are capable of doing scatter/gather IO. > > > > There is overhead associated with setting up and managing the > > scatter/gather tables. > > > > The benefit of large physically contiguous blocks is the ability to > > avoid the SG management overhead. > > > > Do we get this benefit right now? Since the pages which compose IO operations are most likely sparse (not physically contiguous), the driver+device has to perform scatter-gather IO on the pages. The idea is that if we can have larger memory blocks scatter-gather IO can use less SG list elements (decreased CPU overhead, decreased device overhead, faster). Best scenario is where only one sg element is required (ie one huge physically contiguous block). Old devices/unprepared drivers which are not able to perform SG/IO suffer with sequential small sized operations. I'm far away from being a SCSI/ATA knowledgeable person, the storage people can help with expertise here. Grant Grundler and James Bottomley have been working on this area, they might want to add some comments to this discussion. It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for this purpose. > I read through the path of > generic_file_readv(). If I am reading this correctly (first reading, so > may not be right), scatter/gather IO will always be using order-0 pages. > Is this really true? Yes, it is. I was referring to scatter/gather IO at the device driver level, not SG IO at application level (readv/writev). Thing is that virtually contiguous
Re: [PATCH] Avoiding fragmentation through different allocator
On Sat, 22 Jan 2005, Marcelo Tosatti wrote: > > > I was thinking that it would be nice to have a set of high-order > > > intensive workloads, and I wonder what are the most common high-order > > > allocation paths which fail. > > > > > > > Agreed. As I am not fully sure what workloads require high-order > > allocations, I updated VMRegress to keep track of the count of > > allocations and released 0.11 > > (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To > > use it to track allocations, do the following > > > > < VMRegress instructions snipped> > > Great, excellent! Thanks. > > I plan to spend some time testing and trying to understand the vmregress > package > this week. > The documentation is not in sync with the code as the package is fairly large to maintain as a side-project. For the recent data I posted, The interesting parts of the tools are; 1. bin/extfrag_stat.pl will display external fragmentation as a percentage of each order. I can go more into the calculation of this if anyone is interested. It does not require any special patches or modules 2. bin/intfrag_stat.pl will display internal fragmentation in the system. Use the --man switch to get a list of all options. Linux occasionally suffers badly from internal fragmentation but it's a problem for another time 3. mapfrag_stat.pl is what I used to map where allocations are in the address space. It requires the kernel patch in kernel_patches/v2.6/trace_pagealloc-map-formbuddy.diff (there is a non-mbuddy version in there) before the vmregress kernel modules can be loaded 4. extfrag_stat_overtime.pl tracks external fragmentation over time although the figures are not very useful. It can also graph what fragmentation for some orders are over time. The figures are not useful because the fragmentation figures are based on free pages and does not take into account the layout of the currently allocated pages. 5. The module in src/test/highalloc.ko is what I used to test high-order allocations. It creates a proc entry /proc/vmregress/test_highalloc that can be read or written. "echo Order Pages > /proc/vmregress/test_highalloc" will attempt to allocate 2^Order pages "Pages" times. The perl scripts are smart enough to load the modules they need at runtime if the modules have been installed with "make install". > > > It mostly depends on hardware because most high-order allocations happen > > > inside device drivers? What are the kernel codepaths which try to do > > > high-order allocations and fallback if failed? > > > > > > > I'm not sure. I think that the paths we exercise right now will be largely > > artifical. For example, you can force order-2 allocations by scping a > > large file through localhost (because of the large MTU in that interface). > > I have not come up with another meaningful workload that guarentees > > high-order allocations yet. > > Thoughts and criticism of the following ideas are very much appreciated: > > In private conversation with wli (who helped me providing this information) > we can > conjecture the following: > > Modern IO devices are capable of doing scatter/gather IO. > > There is overhead associated with setting up and managing the > scatter/gather tables. > > The benefit of large physically contiguous blocks is the ability to > avoid the SG management overhead. > Do we get this benefit right now? I read through the path of generic_file_readv(). If I am reading this correctly (first reading, so may not be right), scatter/gather IO will always be using order-0 pages. Is this really true? >From what I can see, the buffers being written to for readv() are all in userspace so are going to be order-0 (unless hugetlb is in use, is that the really interesting case?). For reading from the disk, the blocksize is what will be important and we can't create a filesystem with blocksizes greater than pagesize right now. So, for scatter/gather to take advantage of contiguous blocks, is more work required? If not, what am I missing? > Also filesystems benefit from big physically contiguous blocks. Quoting > wli "they want bigger blocks and contiguous memory to match bigger > blocks..." > This I don't get... What filesystems support really large blocks? ext2/3 only support pagesize and reiser will create a filesystem with a blocksize of 8192, but not mount it. > I completly agree that your simplified allocator decreases fragmentation > which in turn benefits the system overall. > > This is an area which can be further improved - ie efficiency in > reducing fragmentation is excellent. I sincerely appreciate the work > you are doing! > Thanks. > > > > > > Right now, I believe that the pool of huge pages is of a fixed size > > because of fragmentation difficulties. If we knew we could allocate huge > > pages, this pool would not have to be fixed. Some applications will > > heavily benefit from this. While databases are the obvious one, > > applications with large heaps will also benefit like
Re: [PATCH] Avoiding fragmentation through different allocator
James and Grant added to CC. On Mon, Jan 24, 2005 at 01:28:47PM +, Mel Gorman wrote: On Sat, 22 Jan 2005, Marcelo Tosatti wrote: I was thinking that it would be nice to have a set of high-order intensive workloads, and I wonder what are the most common high-order allocation paths which fail. Agreed. As I am not fully sure what workloads require high-order allocations, I updated VMRegress to keep track of the count of allocations and released 0.11 (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To use it to track allocations, do the following VMRegress instructions snipped Great, excellent! Thanks. I plan to spend some time testing and trying to understand the vmregress package this week. The documentation is not in sync with the code as the package is fairly large to maintain as a side-project. For the recent data I posted, The interesting parts of the tools are; 1. bin/extfrag_stat.pl will display external fragmentation as a percentage of each order. I can go more into the calculation of this if anyone is interested. It does not require any special patches or modules 2. bin/intfrag_stat.pl will display internal fragmentation in the system. Use the --man switch to get a list of all options. Linux occasionally suffers badly from internal fragmentation but it's a problem for another time 3. mapfrag_stat.pl is what I used to map where allocations are in the address space. It requires the kernel patch in kernel_patches/v2.6/trace_pagealloc-map-formbuddy.diff (there is a non-mbuddy version in there) before the vmregress kernel modules can be loaded 4. extfrag_stat_overtime.pl tracks external fragmentation over time although the figures are not very useful. It can also graph what fragmentation for some orders are over time. The figures are not useful because the fragmentation figures are based on free pages and does not take into account the layout of the currently allocated pages. 5. The module in src/test/highalloc.ko is what I used to test high-order allocations. It creates a proc entry /proc/vmregress/test_highalloc that can be read or written. echo Order Pages /proc/vmregress/test_highalloc will attempt to allocate 2^Order pages Pages times. The perl scripts are smart enough to load the modules they need at runtime if the modules have been installed with make install. OK, thanks very much for the information - you might want to write this down into a text file and add it to the tarball :) It mostly depends on hardware because most high-order allocations happen inside device drivers? What are the kernel codepaths which try to do high-order allocations and fallback if failed? I'm not sure. I think that the paths we exercise right now will be largely artifical. For example, you can force order-2 allocations by scping a large file through localhost (because of the large MTU in that interface). I have not come up with another meaningful workload that guarentees high-order allocations yet. Thoughts and criticism of the following ideas are very much appreciated: In private conversation with wli (who helped me providing this information) we can conjecture the following: Modern IO devices are capable of doing scatter/gather IO. There is overhead associated with setting up and managing the scatter/gather tables. The benefit of large physically contiguous blocks is the ability to avoid the SG management overhead. Do we get this benefit right now? Since the pages which compose IO operations are most likely sparse (not physically contiguous), the driver+device has to perform scatter-gather IO on the pages. The idea is that if we can have larger memory blocks scatter-gather IO can use less SG list elements (decreased CPU overhead, decreased device overhead, faster). Best scenario is where only one sg element is required (ie one huge physically contiguous block). Old devices/unprepared drivers which are not able to perform SG/IO suffer with sequential small sized operations. I'm far away from being a SCSI/ATA knowledgeable person, the storage people can help with expertise here. Grant Grundler and James Bottomley have been working on this area, they might want to add some comments to this discussion. It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for this purpose. I read through the path of generic_file_readv(). If I am reading this correctly (first reading, so may not be right), scatter/gather IO will always be using order-0 pages. Is this really true? Yes, it is. I was referring to scatter/gather IO at the device driver level, not SG IO at application level (readv/writev). Thing is that virtually contiguous data buffers which are operated on with read/write, aio_read/aio_write, etc. become in fact scatter-gather operations at the device level if they are not physically
Re: [PATCH] Avoiding fragmentation through different allocator
On Mon, 2005-01-24 at 10:29 -0200, Marcelo Tosatti wrote: Since the pages which compose IO operations are most likely sparse (not physically contiguous), the driver+device has to perform scatter-gather IO on the pages. The idea is that if we can have larger memory blocks scatter-gather IO can use less SG list elements (decreased CPU overhead, decreased device overhead, faster). Best scenario is where only one sg element is required (ie one huge physically contiguous block). Old devices/unprepared drivers which are not able to perform SG/IO suffer with sequential small sized operations. I'm far away from being a SCSI/ATA knowledgeable person, the storage people can help with expertise here. Grant Grundler and James Bottomley have been working on this area, they might want to add some comments to this discussion. It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for this purpose. Well, the basic advice would be not to worry too much about fragmentation from the point of view of I/O devices. They mostly all do scatter gather (SG) onboard as an intelligent processing operation and they're very good at it. No one has ever really measured an effect we can say This is due to the card's SG engine. So, the rule we tend to follow is that if SG element reduction comes for free, we take it. The issue that actually causes problems isn't the reduction in processing overhead, it's that the device's SG list is usually finite in size and so it's worth conserving if we can; however it's mostly not worth conserving at the expense of processor cycles. The bottom line is that the I/O (block) subsystem is very efficient at coalescing (both in block space and in physical memory space) and we've got it to the point where it's about as efficient as it can be. If you're going to give us better physical contiguity properties, we'll take them, but if you spend extra cycles doing it, the chances are you'll slow down the I/O throughput path. James - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Mon, Jan 24, 2005 at 10:44:12AM -0600, James Bottomley wrote: On Mon, 2005-01-24 at 10:29 -0200, Marcelo Tosatti wrote: Since the pages which compose IO operations are most likely sparse (not physically contiguous), the driver+device has to perform scatter-gather IO on the pages. The idea is that if we can have larger memory blocks scatter-gather IO can use less SG list elements (decreased CPU overhead, decreased device overhead, faster). Best scenario is where only one sg element is required (ie one huge physically contiguous block). Old devices/unprepared drivers which are not able to perform SG/IO suffer with sequential small sized operations. I'm far away from being a SCSI/ATA knowledgeable person, the storage people can help with expertise here. Grant Grundler and James Bottomley have been working on this area, they might want to add some comments to this discussion. It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for this purpose. Well, the basic advice would be not to worry too much about fragmentation from the point of view of I/O devices. They mostly all do scatter gather (SG) onboard as an intelligent processing operation and they're very good at it. So is it valid to affirm that on average an operation with one SG element pointing to a 1MB region is similar in speed to an operation with 16 SG elements each pointing to a 64K region due to the efficient onboard SG processing? No one has ever really measured an effect we can say This is due to the card's SG engine. So, the rule we tend to follow is that if SG element reduction comes for free, we take it. The issue that actually causes problems isn't the reduction in processing overhead, it's that the device's SG list is usually finite in size and so it's worth conserving if we can; however it's mostly not worth conserving at the expense of processor cycles. The bottom line is that the I/O (block) subsystem is very efficient at coalescing (both in block space and in physical memory space) and we've got it to the point where it's about as efficient as it can be. If you're going to give us better physical contiguity properties, we'll take them, but if you spend extra cycles doing it, the chances are you'll slow down the I/O throughput path. OK! thanks. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Mon, Jan 24, 2005 at 10:29:52AM -0200, Marcelo Tosatti wrote: Grant Grundler and James Bottomley have been working on this area, they might want to add some comments to this discussion. It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for this purpose. Marcello, That might have been Alex Williamson...but the reasons for 64K pages is to reduce TLB thrashing, not faster IO. On HP ZX1 boxes, SG performance is slightly better (max +5%) when going through the IOMMU than when bypassing it. The IOMMU can perfectly coalesce DMA pages but has a small CPU and DMA cost to do so as well. Otherwise, I totally agree with James. IO devices do scatter-gather pretty well and IO subsystems are tuned for page-size chunk or smaller anyway. ... I could keep digging, but I think the bottom line is that having large pages generally available rather than a fixed setting is desirable. Definately, yes. Thanks for the pointers. Big pages are good for CPU TLB and that's where most of the research has been done. I think IO devices have learned to cope with the fact the alot less has been (or can be for many workloads) done to coalesce IO pages. grant - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Mon, 2005-01-24 at 13:49 -0200, Marcelo Tosatti wrote: So is it valid to affirm that on average an operation with one SG element pointing to a 1MB region is similar in speed to an operation with 16 SG elements each pointing to a 64K region due to the efficient onboard SG processing? it's within a few percent, yes. And the figures depend on how good the I/O card is at it. I can imagine there are some wildly varying I/O cards out there. However, also remember that 1MB of I/O is getting beyond what's sensible for a disc device anyway. The cable speed is much faster than the platter speed, so the device takes the I/O into its cache as it services it. If you overrun the cache it will burp (disconnect) and force a reconnection to get the rest (effectively splitting the I/O up anyway). This doesn't apply to arrays with huge caches, but it does to pretty much everything else. The average disc cache size is only a megabyte or so. James - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
James Bottomley wrote: Well, the basic advice would be not to worry too much about fragmentation from the point of view of I/O devices. They mostly all do scatter gather (SG) onboard as an intelligent processing operation and they're very good at it. No one has ever really measured an effect we can say This is due to the card's SG engine. So, the rule we tend to follow is that if SG element reduction comes for free, we take it. The issue that actually causes problems isn't the reduction in processing overhead, it's that the device's SG list is usually finite in size and so it's worth conserving if we can; however it's mostly not worth conserving at the expense of processor cycles. Depends on the device at the other end of the scsi/fiber channel. We have seen the processor in raid devices get maxed out by linux when it is not maxed out by windows. Windows tends to be more device friendly (I hate to say it), by sending larger and fewer scatter gather elements than linux does. Running an LSI raid over fiberchannel with 4 ports, windows was able to sustain ~830 Mbytes/sec, basically channel speed using only 1500 commands a second. Linux peaked at 550 Mbytes/sec using over 4000 scsi commands to do it - the sustained rate was more like 350 Mbytes/sec, I think at the end of the day linux was sending 128K per scsi request. These numbers predate the current linux scsi and io code, and I do not have the hardware to rerun them right now. I realize this is one data point on one end of the scale, but I just wanted to make the point that there are cases where it does matter. Hopefully William's little change from last year has helped out a lot. Steve - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
Steve Lord [EMAIL PROTECTED] writes: I realize this is one data point on one end of the scale, but I just wanted to make the point that there are cases where it does matter. Hopefully William's little change from last year has helped out a lot. There are more datapoints: e.g. performance on megaraid controllers (very popular because a big PC vendor ships them) was always quite bad on Linux. Up to the point that specific IO workloads run half as fast on a megaraid compared to other controllers. I heard they do work better on Windows. Also I did some experiments with coalescing SG lists in the Opteron IOMMU some time ago. With a MPT fusion controller and forcing all SG lists through the IOMMU so that the SCSI controller always only contiguous mappings I saw ~5% improvement on some IO tests. Unfortunately there are some problems that doesn't allow to enable this unconditionally. But it gives strong evidence that MPT Fusion prefers shorter SG lists too. So it seems to be worthwhile to optimize for shorter SG lists. Ideally the Linux IO patterns would look similar to the Windows IO patterns, then we could reuse all the optimizations the controller vendors did for Windows :) -Andi - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Sat, 22 Jan 2005, Marcelo Tosatti wrote: I was thinking that it would be nice to have a set of high-order intensive workloads, and I wonder what are the most common high-order allocation paths which fail. Agreed. As I am not fully sure what workloads require high-order allocations, I updated VMRegress to keep track of the count of allocations and released 0.11 (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To use it to track allocations, do the following VMRegress instructions snipped Great, excellent! Thanks. I plan to spend some time testing and trying to understand the vmregress package this week. The documentation is not in sync with the code as the package is fairly large to maintain as a side-project. For the recent data I posted, The interesting parts of the tools are; 1. bin/extfrag_stat.pl will display external fragmentation as a percentage of each order. I can go more into the calculation of this if anyone is interested. It does not require any special patches or modules 2. bin/intfrag_stat.pl will display internal fragmentation in the system. Use the --man switch to get a list of all options. Linux occasionally suffers badly from internal fragmentation but it's a problem for another time 3. mapfrag_stat.pl is what I used to map where allocations are in the address space. It requires the kernel patch in kernel_patches/v2.6/trace_pagealloc-map-formbuddy.diff (there is a non-mbuddy version in there) before the vmregress kernel modules can be loaded 4. extfrag_stat_overtime.pl tracks external fragmentation over time although the figures are not very useful. It can also graph what fragmentation for some orders are over time. The figures are not useful because the fragmentation figures are based on free pages and does not take into account the layout of the currently allocated pages. 5. The module in src/test/highalloc.ko is what I used to test high-order allocations. It creates a proc entry /proc/vmregress/test_highalloc that can be read or written. echo Order Pages /proc/vmregress/test_highalloc will attempt to allocate 2^Order pages Pages times. The perl scripts are smart enough to load the modules they need at runtime if the modules have been installed with make install. It mostly depends on hardware because most high-order allocations happen inside device drivers? What are the kernel codepaths which try to do high-order allocations and fallback if failed? I'm not sure. I think that the paths we exercise right now will be largely artifical. For example, you can force order-2 allocations by scping a large file through localhost (because of the large MTU in that interface). I have not come up with another meaningful workload that guarentees high-order allocations yet. Thoughts and criticism of the following ideas are very much appreciated: In private conversation with wli (who helped me providing this information) we can conjecture the following: Modern IO devices are capable of doing scatter/gather IO. There is overhead associated with setting up and managing the scatter/gather tables. The benefit of large physically contiguous blocks is the ability to avoid the SG management overhead. Do we get this benefit right now? I read through the path of generic_file_readv(). If I am reading this correctly (first reading, so may not be right), scatter/gather IO will always be using order-0 pages. Is this really true? From what I can see, the buffers being written to for readv() are all in userspace so are going to be order-0 (unless hugetlb is in use, is that the really interesting case?). For reading from the disk, the blocksize is what will be important and we can't create a filesystem with blocksizes greater than pagesize right now. So, for scatter/gather to take advantage of contiguous blocks, is more work required? If not, what am I missing? Also filesystems benefit from big physically contiguous blocks. Quoting wli they want bigger blocks and contiguous memory to match bigger blocks... This I don't get... What filesystems support really large blocks? ext2/3 only support pagesize and reiser will create a filesystem with a blocksize of 8192, but not mount it. I completly agree that your simplified allocator decreases fragmentation which in turn benefits the system overall. This is an area which can be further improved - ie efficiency in reducing fragmentation is excellent. I sincerely appreciate the work you are doing! Thanks. Snip Right now, I believe that the pool of huge pages is of a fixed size because of fragmentation difficulties. If we knew we could allocate huge pages, this pool would not have to be fixed. Some applications will heavily benefit from this. While databases are the obvious one, applications with large heaps will also benefit like Java Virtual Machines. I can dig up papers that measured this on Solaris although I don't have them at
Re: [PATCH] Avoiding fragmentation through different allocator
On Sat, Jan 22, 2005 at 07:59:49PM -0200, Marcelo Tosatti wrote: > On Sat, Jan 22, 2005 at 09:48:20PM +, Mel Gorman wrote: > > On Fri, 21 Jan 2005, Marcelo Tosatti wrote: > > > > > On Thu, Jan 20, 2005 at 10:13:00AM +, Mel Gorman wrote: > > > > > > > > > > Hi Mel, > > > > > > I was thinking that it would be nice to have a set of high-order > > > intensive workloads, and I wonder what are the most common high-order > > > allocation paths which fail. > > > > > > > Agreed. As I am not fully sure what workloads require high-order > > allocations, I updated VMRegress to keep track of the count of > > allocations and released 0.11 > > (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To > > use it to track allocations, do the following > > > > 1. Download and unpack vmregress > > 2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff . > > The patch currently requires the modified allocator but I can fix that up > > if people want it. Build and deploy the kernel > > 3. Build vmregress by > > ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy > > (or whatever path is appropriate) > > make > > 4. Load the modules with; > > insmod src/code/vmregress_core.ko > > insmod src/sense/trace_alloccount.ko > > > > This will create a proc entry /proc/vmregress/trace_alloccount that looks > > something like; > > > > Allocations (V1) > > --- > > KernNoRclm 997453 370 500000 > > 0000 > > KernRclm 35279000000 > > 0000 > > UserRclm9870808000000 > > 0000 > > Total 10903540 370 500000 > > 0000 > > > > Frees > > - > > KernNoRclm 590965 244 280000 > > 0000 > > KernRclm 227100 6050000 > > 0000 > > UserRclm7974200 73 170000 > > 0000 > > Total 19695805 747 1000000 > > 0000 > > > > To blank the counters, use > > > > echo 0 > /proc/vmregress/trace_alloccount > > > > Whatever workload we come up with, this proc entry will tell us if it is > > exercising high-order allocations right now. > > Great, excellent! Thanks. > > I plan to spend some time testing and trying to understand the vmregress > package > this week. > > > > It mostly depends on hardware because most high-order allocations happen > > > inside device drivers? What are the kernel codepaths which try to do > > > high-order allocations and fallback if failed? > > > > > > > I'm not sure. I think that the paths we exercise right now will be largely > > artifical. For example, you can force order-2 allocations by scping a > > large file through localhost (because of the large MTU in that interface). > > I have not come up with another meaningful workload that guarentees > > high-order allocations yet. > > Thoughts and criticism of the following ideas are very much appreciated: > > In private conversation with wli (who helped me providing this information) > we can > conjecture the following: > > Modern IO devices are capable of doing scatter/gather IO. > > There is overhead associated with setting up and managing the scatter/gather > tables. > > The benefit of large physically contiguous blocks is the ability to avoid the > SG > management overhead. > > Now the question is: The added overhead of allocating high order blocks > through migration > offsets the overhead of SG IO ? Quantifying that is interesting. What is the overhead of the SG IO management and how is the improvement without them? Are block IO drivers trying to allocate big physical segments? I bet they are not, because the "pool of huge pages" (as you say) is limited. > > This depends on the driver implementation (how efficiently its able to manage > the SG IO tables) and > device/IO subsystem characteristics. > > Also filesystems benefit from big physically contiguous blocks. Quoting wli > "they want bigger blocks and contiguous memory to match bigger blocks..." > > I completly agree that your simplified allocator decreases fragmentation > which in turn > benefits the system overall. > > This is an area which can be further improved - ie efficiency in reducing > fragmentation > is excellent. > I sincerely appreciate the work you are doing! > > > > To measure whether the cost of page migration offsets the ability to be > > > able to deliver high-order allocations we want a set of meaningful > > > performance tests? > > > > > > > Bear in mind, there are more
Re: [PATCH] Avoiding fragmentation through different allocator
On Sat, Jan 22, 2005 at 07:59:49PM -0200, Marcelo Tosatti wrote: On Sat, Jan 22, 2005 at 09:48:20PM +, Mel Gorman wrote: On Fri, 21 Jan 2005, Marcelo Tosatti wrote: On Thu, Jan 20, 2005 at 10:13:00AM +, Mel Gorman wrote: Changelog snipped Hi Mel, I was thinking that it would be nice to have a set of high-order intensive workloads, and I wonder what are the most common high-order allocation paths which fail. Agreed. As I am not fully sure what workloads require high-order allocations, I updated VMRegress to keep track of the count of allocations and released 0.11 (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To use it to track allocations, do the following 1. Download and unpack vmregress 2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff . The patch currently requires the modified allocator but I can fix that up if people want it. Build and deploy the kernel 3. Build vmregress by ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy (or whatever path is appropriate) make 4. Load the modules with; insmod src/code/vmregress_core.ko insmod src/sense/trace_alloccount.ko This will create a proc entry /proc/vmregress/trace_alloccount that looks something like; Allocations (V1) --- KernNoRclm 997453 370 500000 0000 KernRclm 35279000000 0000 UserRclm9870808000000 0000 Total 10903540 370 500000 0000 Frees - KernNoRclm 590965 244 280000 0000 KernRclm 227100 6050000 0000 UserRclm7974200 73 170000 0000 Total 19695805 747 1000000 0000 To blank the counters, use echo 0 /proc/vmregress/trace_alloccount Whatever workload we come up with, this proc entry will tell us if it is exercising high-order allocations right now. Great, excellent! Thanks. I plan to spend some time testing and trying to understand the vmregress package this week. It mostly depends on hardware because most high-order allocations happen inside device drivers? What are the kernel codepaths which try to do high-order allocations and fallback if failed? I'm not sure. I think that the paths we exercise right now will be largely artifical. For example, you can force order-2 allocations by scping a large file through localhost (because of the large MTU in that interface). I have not come up with another meaningful workload that guarentees high-order allocations yet. Thoughts and criticism of the following ideas are very much appreciated: In private conversation with wli (who helped me providing this information) we can conjecture the following: Modern IO devices are capable of doing scatter/gather IO. There is overhead associated with setting up and managing the scatter/gather tables. The benefit of large physically contiguous blocks is the ability to avoid the SG management overhead. Now the question is: The added overhead of allocating high order blocks through migration offsets the overhead of SG IO ? Quantifying that is interesting. What is the overhead of the SG IO management and how is the improvement without them? Are block IO drivers trying to allocate big physical segments? I bet they are not, because the pool of huge pages (as you say) is limited. This depends on the driver implementation (how efficiently its able to manage the SG IO tables) and device/IO subsystem characteristics. Also filesystems benefit from big physically contiguous blocks. Quoting wli they want bigger blocks and contiguous memory to match bigger blocks... I completly agree that your simplified allocator decreases fragmentation which in turn benefits the system overall. This is an area which can be further improved - ie efficiency in reducing fragmentation is excellent. I sincerely appreciate the work you are doing! To measure whether the cost of page migration offsets the ability to be able to deliver high-order allocations we want a set of meaningful performance tests? Bear in mind, there are more considerations. The allocator potentially makes hotplug problems easier and could be easily tied into any page-zeroing system. Some of your own benchmarks also implied that the modified allocator helped some
Re: [PATCH] Avoiding fragmentation through different allocator
On Sat, Jan 22, 2005 at 09:48:20PM +, Mel Gorman wrote: > On Fri, 21 Jan 2005, Marcelo Tosatti wrote: > > > On Thu, Jan 20, 2005 at 10:13:00AM +, Mel Gorman wrote: > > > > > > > Hi Mel, > > > > I was thinking that it would be nice to have a set of high-order > > intensive workloads, and I wonder what are the most common high-order > > allocation paths which fail. > > > > Agreed. As I am not fully sure what workloads require high-order > allocations, I updated VMRegress to keep track of the count of > allocations and released 0.11 > (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To > use it to track allocations, do the following > > 1. Download and unpack vmregress > 2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff . > The patch currently requires the modified allocator but I can fix that up > if people want it. Build and deploy the kernel > 3. Build vmregress by > ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy > (or whatever path is appropriate) > make > 4. Load the modules with; > insmod src/code/vmregress_core.ko > insmod src/sense/trace_alloccount.ko > > This will create a proc entry /proc/vmregress/trace_alloccount that looks > something like; > > Allocations (V1) > --- > KernNoRclm 997453 370 500000 >0000 > KernRclm 35279000000 >0000 > UserRclm9870808000000 >0000 > Total 10903540 370 500000 >0000 > > Frees > - > KernNoRclm 590965 244 280000 >0000 > KernRclm 227100 6050000 >0000 > UserRclm7974200 73 170000 >0000 > Total 19695805 747 1000000 >0000 > > To blank the counters, use > > echo 0 > /proc/vmregress/trace_alloccount > > Whatever workload we come up with, this proc entry will tell us if it is > exercising high-order allocations right now. Great, excellent! Thanks. I plan to spend some time testing and trying to understand the vmregress package this week. > > It mostly depends on hardware because most high-order allocations happen > > inside device drivers? What are the kernel codepaths which try to do > > high-order allocations and fallback if failed? > > > > I'm not sure. I think that the paths we exercise right now will be largely > artifical. For example, you can force order-2 allocations by scping a > large file through localhost (because of the large MTU in that interface). > I have not come up with another meaningful workload that guarentees > high-order allocations yet. Thoughts and criticism of the following ideas are very much appreciated: In private conversation with wli (who helped me providing this information) we can conjecture the following: Modern IO devices are capable of doing scatter/gather IO. There is overhead associated with setting up and managing the scatter/gather tables. The benefit of large physically contiguous blocks is the ability to avoid the SG management overhead. Now the question is: The added overhead of allocating high order blocks through migration offsets the overhead of SG IO ? Quantifying that is interesting. This depends on the driver implementation (how efficiently its able to manage the SG IO tables) and device/IO subsystem characteristics. Also filesystems benefit from big physically contiguous blocks. Quoting wli "they want bigger blocks and contiguous memory to match bigger blocks..." I completly agree that your simplified allocator decreases fragmentation which in turn benefits the system overall. This is an area which can be further improved - ie efficiency in reducing fragmentation is excellent. I sincerely appreciate the work you are doing! > > To measure whether the cost of page migration offsets the ability to be > > able to deliver high-order allocations we want a set of meaningful > > performance tests? > > > > Bear in mind, there are more considerations. The allocator potentially > makes hotplug problems easier and could be easily tied into any > page-zeroing system. Some of your own benchmarks also implied that the > modified allocator helped some types of workloads which is beneficial in > itself.The last consideration is HugeTLB pages, which I am hoping William > will weigh in. > > Right now, I believe that the pool of huge pages is of a fixed size > because of fragmentation difficulties. If we knew we could allocate huge > pages, this pool would not have to be fixed. Some
Re: [PATCH] Avoiding fragmentation through different allocator
On Fri, 21 Jan 2005, Marcelo Tosatti wrote: > On Thu, Jan 20, 2005 at 10:13:00AM +, Mel Gorman wrote: > > > > Hi Mel, > > I was thinking that it would be nice to have a set of high-order > intensive workloads, and I wonder what are the most common high-order > allocation paths which fail. > Agreed. As I am not fully sure what workloads require high-order allocations, I updated VMRegress to keep track of the count of allocations and released 0.11 (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To use it to track allocations, do the following 1. Download and unpack vmregress 2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff . The patch currently requires the modified allocator but I can fix that up if people want it. Build and deploy the kernel 3. Build vmregress by ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy (or whatever path is appropriate) make 4. Load the modules with; insmod src/code/vmregress_core.ko insmod src/sense/trace_alloccount.ko This will create a proc entry /proc/vmregress/trace_alloccount that looks something like; Allocations (V1) --- KernNoRclm 997453 370 500000 0000 KernRclm 35279000000 0000 UserRclm9870808000000 0000 Total 10903540 370 500000 0000 Frees - KernNoRclm 590965 244 280000 0000 KernRclm 227100 6050000 0000 UserRclm7974200 73 170000 0000 Total 19695805 747 1000000 0000 To blank the counters, use echo 0 > /proc/vmregress/trace_alloccount Whatever workload we come up with, this proc entry will tell us if it is exercising high-order allocations right now. > It mostly depends on hardware because most high-order allocations happen > inside device drivers? What are the kernel codepaths which try to do > high-order allocations and fallback if failed? > I'm not sure. I think that the paths we exercise right now will be largely artifical. For example, you can force order-2 allocations by scping a large file through localhost (because of the large MTU in that interface). I have not come up with another meaningful workload that guarentees high-order allocations yet. > To measure whether the cost of page migration offsets the ability to be > able to deliver high-order allocations we want a set of meaningful > performance tests? > Bear in mind, there are more considerations. The allocator potentially makes hotplug problems easier and could be easily tied into any page-zeroing system. Some of your own benchmarks also implied that the modified allocator helped some types of workloads which is beneficial in itself.The last consideration is HugeTLB pages, which I am hoping William will weigh in. Right now, I believe that the pool of huge pages is of a fixed size because of fragmentation difficulties. If we knew we could allocate huge pages, this pool would not have to be fixed. Some applications will heavily benefit from this. While databases are the obvious one, applications with large heaps will also benefit like Java Virtual Machines. I can dig up papers that measured this on Solaris although I don't have them at hand right now. We know right now that the overhead of this allocator is fairly low (anyone got benchmarks to disagree) but I understand that page migration is relatively expensive. The allocator also does not have adverse CPU+cache affects like migration and the concept is fairly simple. > Its quite possible that not all unsatisfiable high-order allocations > want to force page migration (which is quite expensive in terms of > CPU/cache). Only migrate on __GFP_NOFAIL ? > I still believe with the allocator, we will only have to migrate in exceptional circumstances. > William, that same tradeoff exists for the zone balancing through > migration idea you propose... > -- Mel Gorman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Fri, 21 Jan 2005, Marcelo Tosatti wrote: On Thu, Jan 20, 2005 at 10:13:00AM +, Mel Gorman wrote: Changelog snipped Hi Mel, I was thinking that it would be nice to have a set of high-order intensive workloads, and I wonder what are the most common high-order allocation paths which fail. Agreed. As I am not fully sure what workloads require high-order allocations, I updated VMRegress to keep track of the count of allocations and released 0.11 (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To use it to track allocations, do the following 1. Download and unpack vmregress 2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff . The patch currently requires the modified allocator but I can fix that up if people want it. Build and deploy the kernel 3. Build vmregress by ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy (or whatever path is appropriate) make 4. Load the modules with; insmod src/code/vmregress_core.ko insmod src/sense/trace_alloccount.ko This will create a proc entry /proc/vmregress/trace_alloccount that looks something like; Allocations (V1) --- KernNoRclm 997453 370 500000 0000 KernRclm 35279000000 0000 UserRclm9870808000000 0000 Total 10903540 370 500000 0000 Frees - KernNoRclm 590965 244 280000 0000 KernRclm 227100 6050000 0000 UserRclm7974200 73 170000 0000 Total 19695805 747 1000000 0000 To blank the counters, use echo 0 /proc/vmregress/trace_alloccount Whatever workload we come up with, this proc entry will tell us if it is exercising high-order allocations right now. It mostly depends on hardware because most high-order allocations happen inside device drivers? What are the kernel codepaths which try to do high-order allocations and fallback if failed? I'm not sure. I think that the paths we exercise right now will be largely artifical. For example, you can force order-2 allocations by scping a large file through localhost (because of the large MTU in that interface). I have not come up with another meaningful workload that guarentees high-order allocations yet. To measure whether the cost of page migration offsets the ability to be able to deliver high-order allocations we want a set of meaningful performance tests? Bear in mind, there are more considerations. The allocator potentially makes hotplug problems easier and could be easily tied into any page-zeroing system. Some of your own benchmarks also implied that the modified allocator helped some types of workloads which is beneficial in itself.The last consideration is HugeTLB pages, which I am hoping William will weigh in. Right now, I believe that the pool of huge pages is of a fixed size because of fragmentation difficulties. If we knew we could allocate huge pages, this pool would not have to be fixed. Some applications will heavily benefit from this. While databases are the obvious one, applications with large heaps will also benefit like Java Virtual Machines. I can dig up papers that measured this on Solaris although I don't have them at hand right now. We know right now that the overhead of this allocator is fairly low (anyone got benchmarks to disagree) but I understand that page migration is relatively expensive. The allocator also does not have adverse CPU+cache affects like migration and the concept is fairly simple. Its quite possible that not all unsatisfiable high-order allocations want to force page migration (which is quite expensive in terms of CPU/cache). Only migrate on __GFP_NOFAIL ? I still believe with the allocator, we will only have to migrate in exceptional circumstances. William, that same tradeoff exists for the zone balancing through migration idea you propose... -- Mel Gorman - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Sat, Jan 22, 2005 at 09:48:20PM +, Mel Gorman wrote: On Fri, 21 Jan 2005, Marcelo Tosatti wrote: On Thu, Jan 20, 2005 at 10:13:00AM +, Mel Gorman wrote: Changelog snipped Hi Mel, I was thinking that it would be nice to have a set of high-order intensive workloads, and I wonder what are the most common high-order allocation paths which fail. Agreed. As I am not fully sure what workloads require high-order allocations, I updated VMRegress to keep track of the count of allocations and released 0.11 (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To use it to track allocations, do the following 1. Download and unpack vmregress 2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff . The patch currently requires the modified allocator but I can fix that up if people want it. Build and deploy the kernel 3. Build vmregress by ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy (or whatever path is appropriate) make 4. Load the modules with; insmod src/code/vmregress_core.ko insmod src/sense/trace_alloccount.ko This will create a proc entry /proc/vmregress/trace_alloccount that looks something like; Allocations (V1) --- KernNoRclm 997453 370 500000 0000 KernRclm 35279000000 0000 UserRclm9870808000000 0000 Total 10903540 370 500000 0000 Frees - KernNoRclm 590965 244 280000 0000 KernRclm 227100 6050000 0000 UserRclm7974200 73 170000 0000 Total 19695805 747 1000000 0000 To blank the counters, use echo 0 /proc/vmregress/trace_alloccount Whatever workload we come up with, this proc entry will tell us if it is exercising high-order allocations right now. Great, excellent! Thanks. I plan to spend some time testing and trying to understand the vmregress package this week. It mostly depends on hardware because most high-order allocations happen inside device drivers? What are the kernel codepaths which try to do high-order allocations and fallback if failed? I'm not sure. I think that the paths we exercise right now will be largely artifical. For example, you can force order-2 allocations by scping a large file through localhost (because of the large MTU in that interface). I have not come up with another meaningful workload that guarentees high-order allocations yet. Thoughts and criticism of the following ideas are very much appreciated: In private conversation with wli (who helped me providing this information) we can conjecture the following: Modern IO devices are capable of doing scatter/gather IO. There is overhead associated with setting up and managing the scatter/gather tables. The benefit of large physically contiguous blocks is the ability to avoid the SG management overhead. Now the question is: The added overhead of allocating high order blocks through migration offsets the overhead of SG IO ? Quantifying that is interesting. This depends on the driver implementation (how efficiently its able to manage the SG IO tables) and device/IO subsystem characteristics. Also filesystems benefit from big physically contiguous blocks. Quoting wli they want bigger blocks and contiguous memory to match bigger blocks... I completly agree that your simplified allocator decreases fragmentation which in turn benefits the system overall. This is an area which can be further improved - ie efficiency in reducing fragmentation is excellent. I sincerely appreciate the work you are doing! To measure whether the cost of page migration offsets the ability to be able to deliver high-order allocations we want a set of meaningful performance tests? Bear in mind, there are more considerations. The allocator potentially makes hotplug problems easier and could be easily tied into any page-zeroing system. Some of your own benchmarks also implied that the modified allocator helped some types of workloads which is beneficial in itself.The last consideration is HugeTLB pages, which I am hoping William will weigh in. Right now, I believe that the pool of huge pages is of a fixed size because of fragmentation difficulties. If we knew we could allocate huge pages, this pool would not have to be fixed. Some applications will heavily benefit from this. While databases are the obvious one,
Re: [PATCH] Avoiding fragmentation through different allocator
On Thu, Jan 20, 2005 at 10:13:00AM +, Mel Gorman wrote: > Changelog since V5 > o Fixed up gcc-2.95 errors > o Fixed up whitespace damage > > Changelog since V4 > o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies > with offsets to 2.6.11-rc1-mm1 > > Changelog since V3 > o inlined get_pageblock_type() and set_pageblock_type() > o set_pageblock_type() now takes a zone parameter to avoid a call to > page_zone() > o When taking from the global pool, do not scan all the low-order lists > > Changelog since V2 > o Do not to interfere with the "min" decay > o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably > anything to do with asynchronous IO > > Changelog since V1 > o Update patch to 2.6.11-rc1 > o Cleaned up bug where memory was wasted on a large bitmap > o Remove code that needed the binary buddy bitmaps > o Update flags to avoid colliding with __GFP_ZERO changes > o Extended fallback_count bean counters to show the fallback count for each > allocation type > o In-code documentation Hi Mel, I was thinking that it would be nice to have a set of high-order intensive workloads, and I wonder what are the most common high-order allocation paths which fail. It mostly depends on hardware because most high-order allocations happen inside device drivers? What are the kernel codepaths which try to do high-order allocations and fallback if failed? To measure whether the cost of page migration offsets the ability to be able to deliver high-order allocations we want a set of meaningful performance tests? Its quite possible that not all unsatisfiable high-order allocations want to force page migration (which is quite expensive in terms of CPU/cache). Only migrate on __GFP_NOFAIL ? William, that same tradeoff exists for the zone balancing through migration idea you propose... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator
On Thu, Jan 20, 2005 at 10:13:00AM +, Mel Gorman wrote: Changelog since V5 o Fixed up gcc-2.95 errors o Fixed up whitespace damage Changelog since V4 o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies with offsets to 2.6.11-rc1-mm1 Changelog since V3 o inlined get_pageblock_type() and set_pageblock_type() o set_pageblock_type() now takes a zone parameter to avoid a call to page_zone() o When taking from the global pool, do not scan all the low-order lists Changelog since V2 o Do not to interfere with the min decay o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably anything to do with asynchronous IO Changelog since V1 o Update patch to 2.6.11-rc1 o Cleaned up bug where memory was wasted on a large bitmap o Remove code that needed the binary buddy bitmaps o Update flags to avoid colliding with __GFP_ZERO changes o Extended fallback_count bean counters to show the fallback count for each allocation type o In-code documentation Hi Mel, I was thinking that it would be nice to have a set of high-order intensive workloads, and I wonder what are the most common high-order allocation paths which fail. It mostly depends on hardware because most high-order allocations happen inside device drivers? What are the kernel codepaths which try to do high-order allocations and fallback if failed? To measure whether the cost of page migration offsets the ability to be able to deliver high-order allocations we want a set of meaningful performance tests? Its quite possible that not all unsatisfiable high-order allocations want to force page migration (which is quite expensive in terms of CPU/cache). Only migrate on __GFP_NOFAIL ? William, that same tradeoff exists for the zone balancing through migration idea you propose... - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Avoiding fragmentation through different allocator
Changelog since V5 o Fixed up gcc-2.95 errors o Fixed up whitespace damage Changelog since V4 o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies with offsets to 2.6.11-rc1-mm1 Changelog since V3 o inlined get_pageblock_type() and set_pageblock_type() o set_pageblock_type() now takes a zone parameter to avoid a call to page_zone() o When taking from the global pool, do not scan all the low-order lists Changelog since V2 o Do not to interfere with the "min" decay o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably anything to do with asynchronous IO Changelog since V1 o Update patch to 2.6.11-rc1 o Cleaned up bug where memory was wasted on a large bitmap o Remove code that needed the binary buddy bitmaps o Update flags to avoid colliding with __GFP_ZERO changes o Extended fallback_count bean counters to show the fallback count for each allocation type o In-code documentation Version 1 o Initial release against 2.6.9 This patch divides allocations into three different types of allocations; UserReclaimable - These are userspace pages that are easily reclaimable. Right now, all allocations of GFP_USER, GFP_HIGHUSER and disk buffers are in this category. These pages are trivially reclaimed by writing the page out to swap or syncing with backing storage KernelReclaimable - These are pages allocated by the kernel that are easily reclaimed. This is stuff like inode caches, dcache, buffer_heads etc. These type of pages potentially could be reclaimed by dumping the caches and reaping the slabs KernelNonReclaimable - These are pages that are allocated by the kernel that are not trivially reclaimed. For example, the memory allocated for a loaded module would be in this category. By default, allocations are considered to be of this type Instead of having one global MAX_ORDER-sized array of free lists, there are three, one for each type of allocation. Finally, there is a list of pages of size 2^MAX_ORDER which is a global pool of the largest pages the kernel deals with. Once a 2^MAX_ORDER block of pages it split for a type of allocation, it is added to the free-lists for that type, in effect reserving it. Hence, over time, pages of the different types can be clustered together. This means that if we wanted 2^MAX_ORDER number of pages, we could linearly scan a block of pages allocated for UserReclaimable and page each of them out. Fallback is used when there are no 2^MAX_ORDER pages available and there are no free pages of the desired type. The fallback lists were chosen in a way that keeps the most easily reclaimable pages together. Three benchmark results are included. The first is the output of portions of AIM9 for the vanilla allocator and the modified one; [EMAIL PROTECTED]:~# grep _test aim9-vanilla-120.txt 7 page_test 120.00 9508 79.2 134696.67 System Allocations & Pages/second 8 brk_test 120.01 3401 28.33931 481768.19 System Memory Allocations/second 9 jmp_test 120.00 498718 4155.98333 4155983.33 Non-local gotos/second 10 signal_test120.01 11768 98.0585098058.50 Signal Traps/second 11 exec_test 120.04 1585 13.20393 66.02 Program Loads/second 12 fork_test 120.04 1979 16.48617 1648.62 Task Creations/second 13 link_test 120.01 11174 93.10891 5865.86 Link/Unlink Pairs/second [EMAIL PROTECTED]:~# grep _test aim9-mbuddyV3-120.txt 7 page_test 120.01 9660 80.49329 136838.60 System Allocations & Pages/second 8 brk_test 120.01 3409 28.40597 482901.42 System Memory Allocations/second 9 jmp_test 120.00 501533 4179.44167 4179441.67 Non-local gotos/second 10 signal_test120.00 11677 97.3083397308.33 Signal Traps/second 11 exec_test 120.05 1585 13.20283 66.01 Program Loads/second 12 fork_test 120.05 1889 15.73511 1573.51 Task Creations/second 13 link_test 120.01 11089 92.40063 5821.24 Link/Unlink Pairs/second They show that the allocator performs roughly similar to the standard allocator so there is negligible slowdown with the extra complexity. The second benchmark tested the CPU cache usage to make sure it was not getting clobbered. The test was to repeatadly render a large postcript file 10 times and get the average. The result is; ==> gsbench-2.6.11-rc1Standard.txt <== Average: 115.468 real, 115.092 user, 0.337 sys ==> gsbench-2.6.11-rc1MBuddy.txt <== Average: 115.47 real, 115.136 user, 0.338 sys So there are no adverse cache effects. The last test is to show that the allocator can satisfy more high-order allocations, especially under load, than the standard allocator. The
[PATCH] Avoiding fragmentation through different allocator
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Avoiding fragmentation through different allocator
- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Avoiding fragmentation through different allocator
Changelog since V5 o Fixed up gcc-2.95 errors o Fixed up whitespace damage Changelog since V4 o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies with offsets to 2.6.11-rc1-mm1 Changelog since V3 o inlined get_pageblock_type() and set_pageblock_type() o set_pageblock_type() now takes a zone parameter to avoid a call to page_zone() o When taking from the global pool, do not scan all the low-order lists Changelog since V2 o Do not to interfere with the min decay o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably anything to do with asynchronous IO Changelog since V1 o Update patch to 2.6.11-rc1 o Cleaned up bug where memory was wasted on a large bitmap o Remove code that needed the binary buddy bitmaps o Update flags to avoid colliding with __GFP_ZERO changes o Extended fallback_count bean counters to show the fallback count for each allocation type o In-code documentation Version 1 o Initial release against 2.6.9 This patch divides allocations into three different types of allocations; UserReclaimable - These are userspace pages that are easily reclaimable. Right now, all allocations of GFP_USER, GFP_HIGHUSER and disk buffers are in this category. These pages are trivially reclaimed by writing the page out to swap or syncing with backing storage KernelReclaimable - These are pages allocated by the kernel that are easily reclaimed. This is stuff like inode caches, dcache, buffer_heads etc. These type of pages potentially could be reclaimed by dumping the caches and reaping the slabs KernelNonReclaimable - These are pages that are allocated by the kernel that are not trivially reclaimed. For example, the memory allocated for a loaded module would be in this category. By default, allocations are considered to be of this type Instead of having one global MAX_ORDER-sized array of free lists, there are three, one for each type of allocation. Finally, there is a list of pages of size 2^MAX_ORDER which is a global pool of the largest pages the kernel deals with. Once a 2^MAX_ORDER block of pages it split for a type of allocation, it is added to the free-lists for that type, in effect reserving it. Hence, over time, pages of the different types can be clustered together. This means that if we wanted 2^MAX_ORDER number of pages, we could linearly scan a block of pages allocated for UserReclaimable and page each of them out. Fallback is used when there are no 2^MAX_ORDER pages available and there are no free pages of the desired type. The fallback lists were chosen in a way that keeps the most easily reclaimable pages together. Three benchmark results are included. The first is the output of portions of AIM9 for the vanilla allocator and the modified one; [EMAIL PROTECTED]:~# grep _test aim9-vanilla-120.txt 7 page_test 120.00 9508 79.2 134696.67 System Allocations Pages/second 8 brk_test 120.01 3401 28.33931 481768.19 System Memory Allocations/second 9 jmp_test 120.00 498718 4155.98333 4155983.33 Non-local gotos/second 10 signal_test120.01 11768 98.0585098058.50 Signal Traps/second 11 exec_test 120.04 1585 13.20393 66.02 Program Loads/second 12 fork_test 120.04 1979 16.48617 1648.62 Task Creations/second 13 link_test 120.01 11174 93.10891 5865.86 Link/Unlink Pairs/second [EMAIL PROTECTED]:~# grep _test aim9-mbuddyV3-120.txt 7 page_test 120.01 9660 80.49329 136838.60 System Allocations Pages/second 8 brk_test 120.01 3409 28.40597 482901.42 System Memory Allocations/second 9 jmp_test 120.00 501533 4179.44167 4179441.67 Non-local gotos/second 10 signal_test120.00 11677 97.3083397308.33 Signal Traps/second 11 exec_test 120.05 1585 13.20283 66.01 Program Loads/second 12 fork_test 120.05 1889 15.73511 1573.51 Task Creations/second 13 link_test 120.01 11089 92.40063 5821.24 Link/Unlink Pairs/second They show that the allocator performs roughly similar to the standard allocator so there is negligible slowdown with the extra complexity. The second benchmark tested the CPU cache usage to make sure it was not getting clobbered. The test was to repeatadly render a large postcript file 10 times and get the average. The result is; == gsbench-2.6.11-rc1Standard.txt == Average: 115.468 real, 115.092 user, 0.337 sys == gsbench-2.6.11-rc1MBuddy.txt == Average: 115.47 real, 115.136 user, 0.338 sys So there are no adverse cache effects. The last test is to show that the allocator can satisfy more high-order allocations, especially under load, than the standard allocator. The test
Re: [PATCH] Avoiding fragmentation through different allocator V2
On Sun, 16 Jan 2005, Marcelo Tosatti wrote: > > No unfortunately. Do you know of a test I can use? > > Some STP reaim results have significant performance increase in general, a few > small regressions. > > I think that depending on the type of access pattern of the application(s) > there > will be either performance gain or loss, but the result is interesting > anyway. :) > That is quite exciting and I'm pleased it was able to show gains in some tests. Based on the aim9 tests, I took a look at the paths I affected to see what improvements I could make. There were three significant ones 1. I inlined get_pageblock_type and set_pageblock_type 2. set_pageblock_type was calling page_zone() even though the only caller knew the zone so I added the parameter 3. When taking fom the global pool, I was recanning all the order lists which is does not any more I am hoping that these three changes will clear up the worst of the minor regressions. With the changess, aim9 reported that the modified allocator performs as well as the standard allocator. This means that the allocator is as fast, we are reasonably sure there is no adverse cache effects (if anything cache usage is improved) and we are far more likely to be able to service high-order requests [EMAIL PROTECTED]:~# grep _test aim9-vanilla-120.txt 7 page_test 120.00 9508 79.2 134696.67 System Allocations & Pages/second 8 brk_test 120.01 3401 28.33931 481768.19 System Memory Allocations/second 9 jmp_test 120.00 498718 4155.98333 4155983.33 Non-local gotos/second 10 signal_test120.01 11768 98.0585098058.50 Signal Traps/second 11 exec_test 120.04 1585 13.20393 66.02 Program Loads/second 12 fork_test 120.04 1979 16.48617 1648.62 Task Creations/second 13 link_test 120.01 11174 93.10891 5865.86 Link/Unlink Pairs/second [EMAIL PROTECTED]:~# grep _test aim9-mbuddyV3-120.txt 7 page_test 120.01 9660 80.49329 136838.60 System Allocations & Pages/second 8 brk_test 120.01 3409 28.40597 482901.42 System Memory Allocations/second 9 jmp_test 120.00 501533 4179.44167 4179441.67 Non-local gotos/second 10 signal_test120.00 11677 97.3083397308.33 Signal Traps/second 11 exec_test 120.05 1585 13.20283 66.01 Program Loads/second 12 fork_test 120.05 1889 15.73511 1573.51 Task Creations/second 13 link_test 120.01 11089 92.40063 5821.24 Link/Unlink Pairs/second Patch with minor optimisations as follows; diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/fs/buffer.c linux-2.6.11-rc1-mbuddy/fs/buffer.c --- linux-2.6.11-rc1-clean/fs/buffer.c 2005-01-12 04:01:23.0 + +++ linux-2.6.11-rc1-mbuddy/fs/buffer.c 2005-01-13 10:56:30.0 + @@ -1134,7 +1134,8 @@ grow_dev_page(struct block_device *bdev, struct page *page; struct buffer_head *bh; - page = find_or_create_page(inode->i_mapping, index, GFP_NOFS); + page = find_or_create_page(inode->i_mapping, index, + GFP_NOFS | __GFP_USERRCLM); if (!page) return NULL; @@ -2997,7 +2998,8 @@ static void recalc_bh_state(void) struct buffer_head *alloc_buffer_head(int gfp_flags) { - struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags); + struct buffer_head *ret = kmem_cache_alloc(bh_cachep, + gfp_flags|__GFP_KERNRCLM); if (ret) { preempt_disable(); __get_cpu_var(bh_accounting).nr++; diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/fs/dcache.c linux-2.6.11-rc1-mbuddy/fs/dcache.c --- linux-2.6.11-rc1-clean/fs/dcache.c 2005-01-12 04:00:09.0 + +++ linux-2.6.11-rc1-mbuddy/fs/dcache.c 2005-01-13 10:56:30.0 + @@ -715,7 +715,8 @@ struct dentry *d_alloc(struct dentry * p struct dentry *dentry; char *dname; - dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); + dentry = kmem_cache_alloc(dentry_cache, + GFP_KERNEL|__GFP_KERNRCLM); if (!dentry) return NULL; diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/fs/ext2/super.c linux-2.6.11-rc1-mbuddy/fs/ext2/super.c --- linux-2.6.11-rc1-clean/fs/ext2/super.c 2005-01-12 04:01:24.0 + +++ linux-2.6.11-rc1-mbuddy/fs/ext2/super.c 2005-01-13 10:56:30.0 + @@ -137,7 +137,7 @@ static kmem_cache_t * ext2_inode_cachep; static struct inode *ext2_alloc_inode(struct super_block *sb) { struct ext2_inode_info *ei; - ei = (struct ext2_inode_info
Re: [PATCH] Avoiding fragmentation through different allocator V2
On Sat, Jan 15, 2005 at 07:18:42PM +, Mel Gorman wrote: > On Fri, 14 Jan 2005, Marcelo Tosatti wrote: > > > On Thu, Jan 13, 2005 at 03:56:46PM +, Mel Gorman wrote: > > > The patch is against 2.6.11-rc1 and I'm willing to stand by it's > > > stability. I'm also confident it does it's job pretty well so I'd like it > > > to be considered for inclusion. > > > > This is very interesting! > > > > Thanks > > > Other than the advantage of decreased fragmentation which you aim, by > > providing clustering of different types of allocations you might have a > > performance gain (or loss :)) due to changes in cache colouring > > effects. > > > > That is possible but it I haven't thought of a way of measuring the cache > colouring effects (if any). There is also the problem that the additional > complexity of the allocator will offset this benefit. The two main loss > points of the allocator are increased complexity and the increased size of > the zone struct. > > > It depends on the workload/application mix and type of cache of course, > > but I think there will be a significant measurable difference on most > > common workloads. > > > > If I could only measure it :/ > > > Have you done any investigation with that respect? IMHO such > > verification is really important before attempting to merge it. > > > > No unfortunately. Do you know of a test I can use? Some STP reaim results have significant performance increase in general, a few small regressions. I think that depending on the type of access pattern of the application(s) there will be either performance gain or loss, but the result is interesting anyway. :) I'll different more tests later on. AIM OVERVIEW The AIM Multiuser Benchmark - Suite VII tests and measures the performance of Open System multiuser computers. Multiuser computer environments typically have the following general characteristics in common: - A large number of tasks are run concurrently - Disk storage increases dramatically as the number of users increase. - Complex numerically intense applications are performed infrequently - An important amount of time is spent sorting and searching through large amounts of data. - After data is used it is placed back on disk because it is a shared resource. - A large amount of time is spent in common runtime libraries. NORMAL LOAD 4-way-SMP: kernel: patch-2.6.11-rc1 plmid: 4066 Host: stp4-000 Reaim test http://khack.osdl.org/stp/300031 kernel: 4066 Filesystem: ext3 Peak load Test: Maximum Jobs per Minute 4881.87 (average of 3 runs) Quick Convergence Test: Maximum Jobs per Minute 4961.19 (average of 3 runs) kernel: mel-v3-fixed plmid: 4077 Host: stp4-001 Reaim test http://khack.osdl.org/stp/300056 kernel: 4077 Filesystem: ext3 Peak load Test: Maximum Jobs per Minute 5065.93 (average of 3 runs) Quick Convergence Test: Maximum Jobs per Minute 5294.48 (average of 3 runs) NORMAL LOAD 1-WAY: kernel: patch-2.6.11-rc1 plmid: 4066 Host: stp1-003 Reaim test http://khack.osdl.org/stp/300029 kernel: 4066 Filesystem: ext3 Peak load Test: Maximum Jobs per Minute 993.13 (average of 3 runs) Quick Convergence Test: Maximum Jobs per Minute 983.11 (average of 3 runs) If some fields are empty or look unusual you may have an old version. Compare to the current minimal requirements in Documentation/Changes. kernel: mel-v3-fixed plmid: 4077 Host: stp1-002 Reaim test http://khack.osdl.org/stp/300055 kernel: 4077 Filesystem: ext3 Peak load Test: Maximum Jobs per Minute 982.69 (average of 3 runs) Quick Convergence Test: Maximum Jobs per Minute 1008.06 (average of 3 runs) COMPUTE LOAD 2way (this is more CPU intensive than NORMAL reaim load): kernel: patch-2.6.11-rc1 plmid: 4066 Host: stp2-001 Reaim test http://khack.osdl.org/stp/300060 kernel: 4066 Filesystem: ext3 Peak load Test: Maximum Jobs per Minute 1482.45 (average of 3 runs) Quick Convergence Test: Maximum Jobs per Minute 1487.20 (average of 3 runs) If some fields are empty or look unusual you may have an old version. Compare to the current minimal requirements in Documentation/Changes. kernel: mel-v3-fixed plmid: 4077 Host: stp2-000 Reaim test http://khack.osdl.org/stp/300058 kernel: 4077 Filesystem: ext3 Peak load Test: Maximum Jobs per Minute 1501.47 (average of 3 runs) Quick Convergence Test: Maximum Jobs per Minute 1462.11 (average of 3 runs) If some fields are empty or look unusual you may have an old version. Compare to the current minimal requirements in Documentation/Changes. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator V2
> > That is possible but it I haven't thought of a way of measuring the cache > > colouring effects (if any). There is also the problem that the additional > > complexity of the allocator will offset this benefit. The two main loss > > points of the allocator are increased complexity and the increased size of > > the zone struct. > > We should be able to measure that too... > > If you look at the performance numbers of applications which do data > crunching, reading/writing data to disk (scientific applications). Or > even databases, plus standard set of IO benchmarks... > I used two benchmarks to test this. The first was a test that ran gs against a large postscript file 10 times and measured the average. The hypothesis was that if I was trashing the CPU cache with the allocator, there would be a marked difference between the results. The results are; ==> gsbench-2.6.11-rc1MBuddy.txt <== Average: 115.47 real, 115.136 user, 0.338 sys ==> gsbench-2.6.11-rc1Standard.txt <== Average: 115.468 real, 115.092 user, 0.337 sys So, there is no significance there. I think we are safe for the CPU cache as neither allocator is particularly cache aware. The second test was a portion of the tests from aim9. The results are MBuddy 7 page_test 120.01 9452 78.76010 133892.18 System Allocations & Pages/second 8 brk_test 120.03 3386 28.20961 479563.44 System Memory Allocations/second 9 jmp_test 120.00 501496 4179.1 4179133.33 Non-local gotos/second 10 signal_test120.01 11632 96.9252696925.26 Signal Traps/second 11 exec_test 120.07 1587 13.21729 66.09 Program Loads/second 12 fork_test 120.03 1890 15.74606 1574.61 Task Creations/second 13 link_test 120.00 11152 92.9 5854.80 Link/Unlink Pairs/second 56 fifo_test 120.00 173450 1445.41667 144541.67 FIFO Messages/second Vanilla 7 page_test 120.01 9536 79.46004 135082.08 System Allocations & Pages/second 8 brk_test 120.01 3394 28.28098 480776.60 System Memory Allocations/second 9 jmp_test 120.00 498770 4156.41667 4156416.67 Non-local gotos/second 10 signal_test120.00 11773 98.1083398108.33 Signal Traps/second 11 exec_test 120.01 1591 13.25723 66.29 Program Loads/second 12 fork_test 120.00 1941 16.17500 1617.50 Task Creations/second 13 link_test 120.00 11188 93.2 5873.70 Link/Unlink Pairs/second 56 fifo_test 120.00 179156 1492.96667 149296.67 FIFO Messages/second Here, there are worrying differences all right. The modified allocator for example is getting 1000 faults a second less than the standard allocator but that is still less than 1%. This is something I need to work on although I think it's optimisation work rather than a fundamental problem with the approach. I'm looking into using bonnie++ as another IO benchmark. > We should be able to use the CPU performance counters to get exact > miss/hit numbers, but it seems its not yet possible to use Mikael's > Pettersson pmc inside the kernel, I asked him sometime ago but never got > along to trying anything: > > This is stuff I was not aware of before and will need to follow up on. > I think some CPU/memory intensive benchmarks should give us a hint of the > total > impact ? > The ghostscript test was the one I choose. Script is below > > However, I also know the linear scanner trashed the LRU lists and probably > > comes with all sorts of performance regressions just to make the > > high-order allocations. > > Migrating pages instead of freeing them can greatly reduce the overhead I > believe > and might be a low impact way of defragmenting memory. > Very likely. As it is, the scanner I used is really stupid but I wanted to show that using a mechanism like it, we should be able to almost guarentee the allocation of a high-order block, something we cannot currently do. > I've added your patch to STP but: > > [STP 300030]Kernel Patch Error Kernel: mel-three-type-allocator-v2 PLM # 4073 > I posted a new version under the subject "[PATCH] 1/2 Reducing fragmentation through better allocation". It should apply cleanly to a vanilla kernel. Sorry about the mess of the other patch. > It failed to apply to 2.6.10-rc1 - I'll work the rejects and rerun the tests. > The patch is against 2.6.11-rc1, but I'm guessing you typos 2.6.10-rc1. -- Mel Gorman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator V2
That is possible but it I haven't thought of a way of measuring the cache colouring effects (if any). There is also the problem that the additional complexity of the allocator will offset this benefit. The two main loss points of the allocator are increased complexity and the increased size of the zone struct. We should be able to measure that too... If you look at the performance numbers of applications which do data crunching, reading/writing data to disk (scientific applications). Or even databases, plus standard set of IO benchmarks... I used two benchmarks to test this. The first was a test that ran gs against a large postscript file 10 times and measured the average. The hypothesis was that if I was trashing the CPU cache with the allocator, there would be a marked difference between the results. The results are; == gsbench-2.6.11-rc1MBuddy.txt == Average: 115.47 real, 115.136 user, 0.338 sys == gsbench-2.6.11-rc1Standard.txt == Average: 115.468 real, 115.092 user, 0.337 sys So, there is no significance there. I think we are safe for the CPU cache as neither allocator is particularly cache aware. The second test was a portion of the tests from aim9. The results are MBuddy 7 page_test 120.01 9452 78.76010 133892.18 System Allocations Pages/second 8 brk_test 120.03 3386 28.20961 479563.44 System Memory Allocations/second 9 jmp_test 120.00 501496 4179.1 4179133.33 Non-local gotos/second 10 signal_test120.01 11632 96.9252696925.26 Signal Traps/second 11 exec_test 120.07 1587 13.21729 66.09 Program Loads/second 12 fork_test 120.03 1890 15.74606 1574.61 Task Creations/second 13 link_test 120.00 11152 92.9 5854.80 Link/Unlink Pairs/second 56 fifo_test 120.00 173450 1445.41667 144541.67 FIFO Messages/second Vanilla 7 page_test 120.01 9536 79.46004 135082.08 System Allocations Pages/second 8 brk_test 120.01 3394 28.28098 480776.60 System Memory Allocations/second 9 jmp_test 120.00 498770 4156.41667 4156416.67 Non-local gotos/second 10 signal_test120.00 11773 98.1083398108.33 Signal Traps/second 11 exec_test 120.01 1591 13.25723 66.29 Program Loads/second 12 fork_test 120.00 1941 16.17500 1617.50 Task Creations/second 13 link_test 120.00 11188 93.2 5873.70 Link/Unlink Pairs/second 56 fifo_test 120.00 179156 1492.96667 149296.67 FIFO Messages/second Here, there are worrying differences all right. The modified allocator for example is getting 1000 faults a second less than the standard allocator but that is still less than 1%. This is something I need to work on although I think it's optimisation work rather than a fundamental problem with the approach. I'm looking into using bonnie++ as another IO benchmark. We should be able to use the CPU performance counters to get exact miss/hit numbers, but it seems its not yet possible to use Mikael's Pettersson pmc inside the kernel, I asked him sometime ago but never got along to trying anything: SNIP This is stuff I was not aware of before and will need to follow up on. I think some CPU/memory intensive benchmarks should give us a hint of the total impact ? The ghostscript test was the one I choose. Script is below However, I also know the linear scanner trashed the LRU lists and probably comes with all sorts of performance regressions just to make the high-order allocations. Migrating pages instead of freeing them can greatly reduce the overhead I believe and might be a low impact way of defragmenting memory. Very likely. As it is, the scanner I used is really stupid but I wanted to show that using a mechanism like it, we should be able to almost guarentee the allocation of a high-order block, something we cannot currently do. I've added your patch to STP but: [STP 300030]Kernel Patch Error Kernel: mel-three-type-allocator-v2 PLM # 4073 I posted a new version under the subject [PATCH] 1/2 Reducing fragmentation through better allocation. It should apply cleanly to a vanilla kernel. Sorry about the mess of the other patch. It failed to apply to 2.6.10-rc1 - I'll work the rejects and rerun the tests. The patch is against 2.6.11-rc1, but I'm guessing you typos 2.6.10-rc1. -- Mel Gorman - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator V2
On Sat, Jan 15, 2005 at 07:18:42PM +, Mel Gorman wrote: On Fri, 14 Jan 2005, Marcelo Tosatti wrote: On Thu, Jan 13, 2005 at 03:56:46PM +, Mel Gorman wrote: The patch is against 2.6.11-rc1 and I'm willing to stand by it's stability. I'm also confident it does it's job pretty well so I'd like it to be considered for inclusion. This is very interesting! Thanks Other than the advantage of decreased fragmentation which you aim, by providing clustering of different types of allocations you might have a performance gain (or loss :)) due to changes in cache colouring effects. That is possible but it I haven't thought of a way of measuring the cache colouring effects (if any). There is also the problem that the additional complexity of the allocator will offset this benefit. The two main loss points of the allocator are increased complexity and the increased size of the zone struct. It depends on the workload/application mix and type of cache of course, but I think there will be a significant measurable difference on most common workloads. If I could only measure it :/ Have you done any investigation with that respect? IMHO such verification is really important before attempting to merge it. No unfortunately. Do you know of a test I can use? Some STP reaim results have significant performance increase in general, a few small regressions. I think that depending on the type of access pattern of the application(s) there will be either performance gain or loss, but the result is interesting anyway. :) I'll different more tests later on. AIM OVERVIEW The AIM Multiuser Benchmark - Suite VII tests and measures the performance of Open System multiuser computers. Multiuser computer environments typically have the following general characteristics in common: - A large number of tasks are run concurrently - Disk storage increases dramatically as the number of users increase. - Complex numerically intense applications are performed infrequently - An important amount of time is spent sorting and searching through large amounts of data. - After data is used it is placed back on disk because it is a shared resource. - A large amount of time is spent in common runtime libraries. NORMAL LOAD 4-way-SMP: kernel: patch-2.6.11-rc1 plmid: 4066 Host: stp4-000 Reaim test http://khack.osdl.org/stp/300031 kernel: 4066 Filesystem: ext3 Peak load Test: Maximum Jobs per Minute 4881.87 (average of 3 runs) Quick Convergence Test: Maximum Jobs per Minute 4961.19 (average of 3 runs) kernel: mel-v3-fixed plmid: 4077 Host: stp4-001 Reaim test http://khack.osdl.org/stp/300056 kernel: 4077 Filesystem: ext3 Peak load Test: Maximum Jobs per Minute 5065.93 (average of 3 runs) Quick Convergence Test: Maximum Jobs per Minute 5294.48 (average of 3 runs) NORMAL LOAD 1-WAY: kernel: patch-2.6.11-rc1 plmid: 4066 Host: stp1-003 Reaim test http://khack.osdl.org/stp/300029 kernel: 4066 Filesystem: ext3 Peak load Test: Maximum Jobs per Minute 993.13 (average of 3 runs) Quick Convergence Test: Maximum Jobs per Minute 983.11 (average of 3 runs) If some fields are empty or look unusual you may have an old version. Compare to the current minimal requirements in Documentation/Changes. kernel: mel-v3-fixed plmid: 4077 Host: stp1-002 Reaim test http://khack.osdl.org/stp/300055 kernel: 4077 Filesystem: ext3 Peak load Test: Maximum Jobs per Minute 982.69 (average of 3 runs) Quick Convergence Test: Maximum Jobs per Minute 1008.06 (average of 3 runs) COMPUTE LOAD 2way (this is more CPU intensive than NORMAL reaim load): kernel: patch-2.6.11-rc1 plmid: 4066 Host: stp2-001 Reaim test http://khack.osdl.org/stp/300060 kernel: 4066 Filesystem: ext3 Peak load Test: Maximum Jobs per Minute 1482.45 (average of 3 runs) Quick Convergence Test: Maximum Jobs per Minute 1487.20 (average of 3 runs) If some fields are empty or look unusual you may have an old version. Compare to the current minimal requirements in Documentation/Changes. kernel: mel-v3-fixed plmid: 4077 Host: stp2-000 Reaim test http://khack.osdl.org/stp/300058 kernel: 4077 Filesystem: ext3 Peak load Test: Maximum Jobs per Minute 1501.47 (average of 3 runs) Quick Convergence Test: Maximum Jobs per Minute 1462.11 (average of 3 runs) If some fields are empty or look unusual you may have an old version. Compare to the current minimal requirements in Documentation/Changes. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoiding fragmentation through different allocator V2
On Sun, 16 Jan 2005, Marcelo Tosatti wrote: No unfortunately. Do you know of a test I can use? Some STP reaim results have significant performance increase in general, a few small regressions. I think that depending on the type of access pattern of the application(s) there will be either performance gain or loss, but the result is interesting anyway. :) That is quite exciting and I'm pleased it was able to show gains in some tests. Based on the aim9 tests, I took a look at the paths I affected to see what improvements I could make. There were three significant ones 1. I inlined get_pageblock_type and set_pageblock_type 2. set_pageblock_type was calling page_zone() even though the only caller knew the zone so I added the parameter 3. When taking fom the global pool, I was recanning all the order lists which is does not any more I am hoping that these three changes will clear up the worst of the minor regressions. With the changess, aim9 reported that the modified allocator performs as well as the standard allocator. This means that the allocator is as fast, we are reasonably sure there is no adverse cache effects (if anything cache usage is improved) and we are far more likely to be able to service high-order requests [EMAIL PROTECTED]:~# grep _test aim9-vanilla-120.txt 7 page_test 120.00 9508 79.2 134696.67 System Allocations Pages/second 8 brk_test 120.01 3401 28.33931 481768.19 System Memory Allocations/second 9 jmp_test 120.00 498718 4155.98333 4155983.33 Non-local gotos/second 10 signal_test120.01 11768 98.0585098058.50 Signal Traps/second 11 exec_test 120.04 1585 13.20393 66.02 Program Loads/second 12 fork_test 120.04 1979 16.48617 1648.62 Task Creations/second 13 link_test 120.01 11174 93.10891 5865.86 Link/Unlink Pairs/second [EMAIL PROTECTED]:~# grep _test aim9-mbuddyV3-120.txt 7 page_test 120.01 9660 80.49329 136838.60 System Allocations Pages/second 8 brk_test 120.01 3409 28.40597 482901.42 System Memory Allocations/second 9 jmp_test 120.00 501533 4179.44167 4179441.67 Non-local gotos/second 10 signal_test120.00 11677 97.3083397308.33 Signal Traps/second 11 exec_test 120.05 1585 13.20283 66.01 Program Loads/second 12 fork_test 120.05 1889 15.73511 1573.51 Task Creations/second 13 link_test 120.01 11089 92.40063 5821.24 Link/Unlink Pairs/second Patch with minor optimisations as follows; diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/fs/buffer.c linux-2.6.11-rc1-mbuddy/fs/buffer.c --- linux-2.6.11-rc1-clean/fs/buffer.c 2005-01-12 04:01:23.0 + +++ linux-2.6.11-rc1-mbuddy/fs/buffer.c 2005-01-13 10:56:30.0 + @@ -1134,7 +1134,8 @@ grow_dev_page(struct block_device *bdev, struct page *page; struct buffer_head *bh; - page = find_or_create_page(inode-i_mapping, index, GFP_NOFS); + page = find_or_create_page(inode-i_mapping, index, + GFP_NOFS | __GFP_USERRCLM); if (!page) return NULL; @@ -2997,7 +2998,8 @@ static void recalc_bh_state(void) struct buffer_head *alloc_buffer_head(int gfp_flags) { - struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags); + struct buffer_head *ret = kmem_cache_alloc(bh_cachep, + gfp_flags|__GFP_KERNRCLM); if (ret) { preempt_disable(); __get_cpu_var(bh_accounting).nr++; diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/fs/dcache.c linux-2.6.11-rc1-mbuddy/fs/dcache.c --- linux-2.6.11-rc1-clean/fs/dcache.c 2005-01-12 04:00:09.0 + +++ linux-2.6.11-rc1-mbuddy/fs/dcache.c 2005-01-13 10:56:30.0 + @@ -715,7 +715,8 @@ struct dentry *d_alloc(struct dentry * p struct dentry *dentry; char *dname; - dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); + dentry = kmem_cache_alloc(dentry_cache, + GFP_KERNEL|__GFP_KERNRCLM); if (!dentry) return NULL; diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.11-rc1-clean/fs/ext2/super.c linux-2.6.11-rc1-mbuddy/fs/ext2/super.c --- linux-2.6.11-rc1-clean/fs/ext2/super.c 2005-01-12 04:01:24.0 + +++ linux-2.6.11-rc1-mbuddy/fs/ext2/super.c 2005-01-13 10:56:30.0 + @@ -137,7 +137,7 @@ static kmem_cache_t * ext2_inode_cachep; static struct inode *ext2_alloc_inode(struct super_block *sb) { struct ext2_inode_info *ei; - ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep,
Re: [PATCH] Avoiding fragmentation through different allocator V2
On Sat, Jan 15, 2005 at 07:18:42PM +, Mel Gorman wrote: > On Fri, 14 Jan 2005, Marcelo Tosatti wrote: > > > On Thu, Jan 13, 2005 at 03:56:46PM +, Mel Gorman wrote: > > > The patch is against 2.6.11-rc1 and I'm willing to stand by it's > > > stability. I'm also confident it does it's job pretty well so I'd like it > > > to be considered for inclusion. > > > > This is very interesting! > > > > Thanks > > > Other than the advantage of decreased fragmentation which you aim, by > > providing clustering of different types of allocations you might have a > > performance gain (or loss :)) due to changes in cache colouring > > effects. > > > > That is possible but it I haven't thought of a way of measuring the cache > colouring effects (if any). There is also the problem that the additional > complexity of the allocator will offset this benefit. The two main loss > points of the allocator are increased complexity and the increased size of > the zone struct. We should be able to measure that too... If you look at the performance numbers of applications which do data crunching, reading/writing data to disk (scientific applications). Or even databases, plus standard set of IO benchmarks... Of course you're not able to measure the change in cache hits/misses (which would be nice), but you can get an idea how measurable is the final performance impact, including the page allocator overhead and the increase zone struct size (I dont think the struct zone size increase makes much difference). We should be able to use the CPU performance counters to get exact miss/hit numbers, but it seems its not yet possible to use Mikael's Pettersson pmc inside the kernel, I asked him sometime ago but never got along to trying anything: Subject: Re: Measuring kernel-level code cache hits/misses with perfctr > Hi Mikael, > > > > > > I've been wondering if its possible to use PMC's > > > to monitor L1 and/or L2 cache hits from kernel code? You can count them by using the global-mode counters interface (present in the perfctr-2.6 package but not in the 2.6-mm kernel unfortunately) and restricting the counters to CPL 0. However, for profiling purposes you probably want to catch overflow interrupts, and that's not supported for global-mode counters. I simply haven't had time to implement that feature. > > It depends on the workload/application mix and type of cache of course, > > but I think there will be a significant measurable difference on most > > common workloads. > > > > If I could only measure it :/ > > > Have you done any investigation with that respect? IMHO such > > verification is really important before attempting to merge it. > > > > No unfortunately. Do you know of a test I can use? I think some CPU/memory intensive benchmarks should give us a hint of the total impact ? > > BTW talking about cache colouring, I this is an area which has a HUGE > > space for improvement. The allocator is completly unaware of colouring > > (except the SLAB) - we should try to come up with a light per-process > > allocation colouring optimizer. But thats another history. > > > > This also was tried and dropped. The allocator was a lot more complex and > the implementor was unable to measure it. IIRC, the patch was not accepted > with a comment along the lines of "If you can't measure it, it doesn't > exist". Before I walk down the page coloring path again, I'll need some > scheme that measures the cache-effect. Someone needs to write the helper functions to use the PMC's and test that. > Totally aside, I'm doing this work because I've started a PhD on > developing solid metrics for measuring VM performance and then devising > new or modified algorithms using the metrics to see if the changes are any > good. Nice! Make your work public! I'm personally very interested in this area. > > > For me, the next stage is to write a linear scanner that goes through the > > > address space to free up a high-order block of pages on demand. This will > > > be a tricky job so it'll take me quite a while. > > > > We're paving the road to implement a generic "weak" migration function on > > top > > of the current page migration infrastructure. With "weak" I mean that it > > bails > > out easily if the page cannot be migrated, unlike the "strong" version which > > _has_ to migrate the page(s) (for memory hotplug purpose). > > > > With such function in place its easier to have different implementations of > > defragmentation > > logic - we might want to coolaborate on that. > > > > I've
Re: [PATCH] Avoiding fragmentation through different allocator V2
On Fri, 14 Jan 2005, Marcelo Tosatti wrote: > On Thu, Jan 13, 2005 at 03:56:46PM +, Mel Gorman wrote: > > The patch is against 2.6.11-rc1 and I'm willing to stand by it's > > stability. I'm also confident it does it's job pretty well so I'd like it > > to be considered for inclusion. > > This is very interesting! > Thanks > Other than the advantage of decreased fragmentation which you aim, by > providing clustering of different types of allocations you might have a > performance gain (or loss :)) due to changes in cache colouring > effects. > That is possible but it I haven't thought of a way of measuring the cache colouring effects (if any). There is also the problem that the additional complexity of the allocator will offset this benefit. The two main loss points of the allocator are increased complexity and the increased size of the zone struct. > It depends on the workload/application mix and type of cache of course, > but I think there will be a significant measurable difference on most > common workloads. > If I could only measure it :/ > Have you done any investigation with that respect? IMHO such > verification is really important before attempting to merge it. > No unfortunately. Do you know of a test I can use? > BTW talking about cache colouring, I this is an area which has a HUGE > space for improvement. The allocator is completly unaware of colouring > (except the SLAB) - we should try to come up with a light per-process > allocation colouring optimizer. But thats another history. > This also was tried and dropped. The allocator was a lot more complex and the implementor was unable to measure it. IIRC, the patch was not accepted with a comment along the lines of "If you can't measure it, it doesn't exist". Before I walk down the page coloring path again, I'll need some scheme that measures the cache-effect. Totally aside, I'm doing this work because I've started a PhD on developing solid metrics for measuring VM performance and then devising new or modified algorithms using the metrics to see if the changes are any good. > > For me, the next stage is to write a linear scanner that goes through the > > address space to free up a high-order block of pages on demand. This will > > be a tricky job so it'll take me quite a while. > > We're paving the road to implement a generic "weak" migration function on top > of the current page migration infrastructure. With "weak" I mean that it bails > out easily if the page cannot be migrated, unlike the "strong" version which > _has_ to migrate the page(s) (for memory hotplug purpose). > > With such function in place its easier to have different implementations of > defragmentation > logic - we might want to coolaborate on that. > I've also started something like this although I think you'll find my first approach childishly simple. I implemented a linear scanner that finds the KernRclm and UserRclm areas. It then makes a list of the PageLRU pages and sends them to shrink_list(). I ran a test which put the machine under heavy stress and then tried to allocate 75% of ZONE_NORMAL with 2^_MAX_ORDER pages (allocations done via a kernel module). I found that the standard allocator was only able to successfully allocate 1% of the allocations (3 blocks), my modified allocator managed 50% (81 blocks) and with linear scanning in place, it was 76% (122 blocks). I figure I could get the linear scanning figures even higher if I taught the allocator to reserve the pages it frees for the process performing the linear scanning. However, I also know the linear scanner trashed the LRU lists and probably comes with all sorts of performance regressions just to make the high-order allocations. The new patches for the allocator (last patch I posted has a serious bug in it), the linear scanner and the results will be posted as another mail. > Your bitmap also allows a hint for the "defragmentator" to know the type > of pages, and possibly size of the block, so it can avoid earlier trying > to migrate non reclaimable memory. It possibly makes the scanning > procedure much lightweight. > Potentially. I need to catch up more on the existing schemes. I've been out of the VM loop for a long time now so I'm still playing the Catch-Up game. > > > > You want to do > free_pages -= (z->free_area_lists[0][o].nr_free + > z->free_area_lists[2][o].nr_free + > z->free_area_lists[2][o].nr_free) << o; > > So not to interfere with the "min" decay (and remove the allocation type > loop). > Agreed. New patch has this in place > > > > - /* Require fewer higher order pages to be free */ > > - min >>= 1; > > + /* Require fewer higher order pages to be free */ > > + min >>= 1; > > > > - if (free_pages <= min) > > - return 0; > > + if (free_pages <= min) > > + return 0; > > + } > > I'll play
Re: [PATCH] Avoiding fragmentation through different allocator V2
On Fri, 14 Jan 2005, Marcelo Tosatti wrote: On Thu, Jan 13, 2005 at 03:56:46PM +, Mel Gorman wrote: The patch is against 2.6.11-rc1 and I'm willing to stand by it's stability. I'm also confident it does it's job pretty well so I'd like it to be considered for inclusion. This is very interesting! Thanks Other than the advantage of decreased fragmentation which you aim, by providing clustering of different types of allocations you might have a performance gain (or loss :)) due to changes in cache colouring effects. That is possible but it I haven't thought of a way of measuring the cache colouring effects (if any). There is also the problem that the additional complexity of the allocator will offset this benefit. The two main loss points of the allocator are increased complexity and the increased size of the zone struct. It depends on the workload/application mix and type of cache of course, but I think there will be a significant measurable difference on most common workloads. If I could only measure it :/ Have you done any investigation with that respect? IMHO such verification is really important before attempting to merge it. No unfortunately. Do you know of a test I can use? BTW talking about cache colouring, I this is an area which has a HUGE space for improvement. The allocator is completly unaware of colouring (except the SLAB) - we should try to come up with a light per-process allocation colouring optimizer. But thats another history. This also was tried and dropped. The allocator was a lot more complex and the implementor was unable to measure it. IIRC, the patch was not accepted with a comment along the lines of If you can't measure it, it doesn't exist. Before I walk down the page coloring path again, I'll need some scheme that measures the cache-effect. Totally aside, I'm doing this work because I've started a PhD on developing solid metrics for measuring VM performance and then devising new or modified algorithms using the metrics to see if the changes are any good. For me, the next stage is to write a linear scanner that goes through the address space to free up a high-order block of pages on demand. This will be a tricky job so it'll take me quite a while. We're paving the road to implement a generic weak migration function on top of the current page migration infrastructure. With weak I mean that it bails out easily if the page cannot be migrated, unlike the strong version which _has_ to migrate the page(s) (for memory hotplug purpose). With such function in place its easier to have different implementations of defragmentation logic - we might want to coolaborate on that. I've also started something like this although I think you'll find my first approach childishly simple. I implemented a linear scanner that finds the KernRclm and UserRclm areas. It then makes a list of the PageLRU pages and sends them to shrink_list(). I ran a test which put the machine under heavy stress and then tried to allocate 75% of ZONE_NORMAL with 2^_MAX_ORDER pages (allocations done via a kernel module). I found that the standard allocator was only able to successfully allocate 1% of the allocations (3 blocks), my modified allocator managed 50% (81 blocks) and with linear scanning in place, it was 76% (122 blocks). I figure I could get the linear scanning figures even higher if I taught the allocator to reserve the pages it frees for the process performing the linear scanning. However, I also know the linear scanner trashed the LRU lists and probably comes with all sorts of performance regressions just to make the high-order allocations. The new patches for the allocator (last patch I posted has a serious bug in it), the linear scanner and the results will be posted as another mail. Your bitmap also allows a hint for the defragmentator to know the type of pages, and possibly size of the block, so it can avoid earlier trying to migrate non reclaimable memory. It possibly makes the scanning procedure much lightweight. Potentially. I need to catch up more on the existing schemes. I've been out of the VM loop for a long time now so I'm still playing the Catch-Up game. SNIP You want to do free_pages -= (z-free_area_lists[0][o].nr_free + z-free_area_lists[2][o].nr_free + z-free_area_lists[2][o].nr_free) o; So not to interfere with the min decay (and remove the allocation type loop). Agreed. New patch has this in place - /* Require fewer higher order pages to be free */ - min = 1; + /* Require fewer higher order pages to be free */ + min = 1; - if (free_pages = min) - return 0; + if (free_pages = min) + return 0; + } I'll play with your patch during the weekend, run some benchmarks (STP is our friend), try to measure the
Re: [PATCH] Avoiding fragmentation through different allocator V2
On Sat, Jan 15, 2005 at 07:18:42PM +, Mel Gorman wrote: On Fri, 14 Jan 2005, Marcelo Tosatti wrote: On Thu, Jan 13, 2005 at 03:56:46PM +, Mel Gorman wrote: The patch is against 2.6.11-rc1 and I'm willing to stand by it's stability. I'm also confident it does it's job pretty well so I'd like it to be considered for inclusion. This is very interesting! Thanks Other than the advantage of decreased fragmentation which you aim, by providing clustering of different types of allocations you might have a performance gain (or loss :)) due to changes in cache colouring effects. That is possible but it I haven't thought of a way of measuring the cache colouring effects (if any). There is also the problem that the additional complexity of the allocator will offset this benefit. The two main loss points of the allocator are increased complexity and the increased size of the zone struct. We should be able to measure that too... If you look at the performance numbers of applications which do data crunching, reading/writing data to disk (scientific applications). Or even databases, plus standard set of IO benchmarks... Of course you're not able to measure the change in cache hits/misses (which would be nice), but you can get an idea how measurable is the final performance impact, including the page allocator overhead and the increase zone struct size (I dont think the struct zone size increase makes much difference). We should be able to use the CPU performance counters to get exact miss/hit numbers, but it seems its not yet possible to use Mikael's Pettersson pmc inside the kernel, I asked him sometime ago but never got along to trying anything: Subject: Re: Measuring kernel-level code cache hits/misses with perfctr Hi Mikael, I've been wondering if its possible to use PMC's to monitor L1 and/or L2 cache hits from kernel code? You can count them by using the global-mode counters interface (present in the perfctr-2.6 package but not in the 2.6-mm kernel unfortunately) and restricting the counters to CPL 0. However, for profiling purposes you probably want to catch overflow interrupts, and that's not supported for global-mode counters. I simply haven't had time to implement that feature. It depends on the workload/application mix and type of cache of course, but I think there will be a significant measurable difference on most common workloads. If I could only measure it :/ Have you done any investigation with that respect? IMHO such verification is really important before attempting to merge it. No unfortunately. Do you know of a test I can use? I think some CPU/memory intensive benchmarks should give us a hint of the total impact ? BTW talking about cache colouring, I this is an area which has a HUGE space for improvement. The allocator is completly unaware of colouring (except the SLAB) - we should try to come up with a light per-process allocation colouring optimizer. But thats another history. This also was tried and dropped. The allocator was a lot more complex and the implementor was unable to measure it. IIRC, the patch was not accepted with a comment along the lines of If you can't measure it, it doesn't exist. Before I walk down the page coloring path again, I'll need some scheme that measures the cache-effect. Someone needs to write the helper functions to use the PMC's and test that. Totally aside, I'm doing this work because I've started a PhD on developing solid metrics for measuring VM performance and then devising new or modified algorithms using the metrics to see if the changes are any good. Nice! Make your work public! I'm personally very interested in this area. For me, the next stage is to write a linear scanner that goes through the address space to free up a high-order block of pages on demand. This will be a tricky job so it'll take me quite a while. We're paving the road to implement a generic weak migration function on top of the current page migration infrastructure. With weak I mean that it bails out easily if the page cannot be migrated, unlike the strong version which _has_ to migrate the page(s) (for memory hotplug purpose). With such function in place its easier to have different implementations of defragmentation logic - we might want to coolaborate on that. I've also started something like this although I think you'll find my first approach childishly simple. I implemented a linear scanner