Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-25 Thread Grant Grundler
On Tue, Jan 25, 2005 at 09:02:34AM -0500, Mukker, Atul wrote:
> The megaraid driver is open source, do you see anything that driver can do
> to improve performance. We would greatly appreciate any feedback in this
> regard and definitely incorporate in the driver. The FW under Linux and
> windows is same, so I do not see how the megaraid stack should perform
> differently under Linux and windows?

Just to second what Andi already stated: it's more likely the
Megaraid firmware could be better at fetching the SG lists.
This is a difficult problem since the firmware needs to work
well on so many different platforms/chipsets.

If LSI has time to turn more stones, get a PCI bus analyzer and filter
it to only capture CPU MMIO traffic and DMA traffic to/from some
"well known" SG lists (ie instrument the driver to print those to
the console). Then run AIM7 or a similar multithreaded workload.
A perfect PCI trace will show the device pulling the SG list in a
cacheline at a time after the CPU MMIO reads/writes to the card that
indicate a new transaction is ready to go.
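
Something like the (hypothetical) helper below, called right before the
command is kicked off, would give you SG addresses to match against the
analyzer capture -- generic scatterlist code, not the actual megaraid
structures, and dump_sglist() is a name I'm making up:

#include <linux/kernel.h>
#include <linux/scatterlist.h>

/* Debug-only sketch: dump the DMA address/length of each SG entry a
 * command is about to hand to the firmware, so the entries can be
 * matched against the PCI bus trace. */
static void dump_sglist(struct scatterlist *sg, int nents)
{
        int i;

        for (i = 0; i < nents; i++, sg++)
                printk(KERN_DEBUG "sg[%d]: dma 0x%llx len %u\n",
                       i, (unsigned long long) sg_dma_address(sg),
                       sg_dma_len(sg));
}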

Another stone LSI could turn is to verify the megaraid controller is
NOT contending with the CPU for cachelines used to build SG lists.
This is something the driver controls, but I only know how to measure
this on ia64 machines (with pfmon or caliper or a similar tool).
If you want examples, see
http://iou.parisc-linux.org/ols2004/pfmon_for_iodorks.pdf
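
By "something the driver controls" I mean things like how the command/SG
area is laid out; a made-up sketch of the kind of layout to check (the
struct and field names are not the real megaraid ones):

#include <linux/types.h>
#include <linux/cache.h>

/* Sketch: keep the SG list the device fetches by DMA on its own
 * cachelines, so the CPU building the next command never shares a
 * cacheline with what the device is currently reading. */
struct cmd_slot {
        u32     status;                 /* CPU-polled completion status */
        struct {
                u64 addr;
                u32 len;
        } sg[64] ____cacheline_aligned; /* device-fetched SG list */
} ____cacheline_aligned;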

In case it's not clear from above, optimal IO flow means the device
is moving control data and streaming data in cacheline or bigger units.
If Megaraid is already doing that, then the PCI trace timing info
should point at where the latencies are.

hth,
grant


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-25 Thread Mel Gorman
On Tue, 25 Jan 2005, Andi Kleen wrote:

> On Tue, Jan 25, 2005 at 09:02:34AM -0500, Mukker, Atul wrote:
> >
> > > e.g. performance on megaraid controllers (very popular
> > > because a big PC vendor ships them) was always quite bad on
> > > Linux. Up to the point that specific IO workloads run half as
> > > fast on a megaraid compared to other controllers. I heard
> > > they do work better on Windows.
> > >
> > 
> > > Ideally the Linux IO patterns would look similar to the
> > > Windows IO patterns, then we could reuse all the
> > > optimizations the controller vendors did for Windows :)
> >
> > LSI would leave no stone unturned to make the performance better for
> > megaraid controllers under Linux. If you have some hard data in relation to
> > comparison of performance for adapters from other vendors, please share with
> > us. We would definitely strive to better it.
>
> Sorry for being vague on this. I don't have much hard data on this,
> just telling an annecdote. The issue we saw was over a year ago
> and on a machine running an IO intensive multi process stress test
> (I believe it was an AIM7 variant with some tweaked workfile). When the test
> was moved to a machine with megaraid controller it ran significantly
> lower, compared to the old setup with a non RAID SCSI controller from
> a different vendor. I unfortunately don't know anymore the exact
> type/firmware revision etc. of the megaraid that showed the problem.
>

Ok, for me here, the bottom line is that decent hardware will not benefit
from help from the allocator. Worse, if the work required to provide
adjacent pages is high, it will even adversely affect throughput. I also
know that providing physically contiguous pages to userspace would involve
a fair amount of overhead, so even if we devise a system for providing
them, it would need to be a configurable option.

I will keep an eye out for a means of granting physically contiguous pages
to userspace in a lightweight manner, but I'm going to focus on general
availability of large pages for TLBs, extending the system for a pool of
zeroed pages, and how it can be adapted to help out the hotplug folks.

The system I have in mind for contiguous pages for userspace right now is
to extend the allocator API so that prefaulting and readahead will request
blocks of pages for userspace rather than a series of order-0 pages. So,
if we prefault 32 pages ahead, the allocator would have a new API that
would return 32 pages that are physically contiguous. That, in combination
with a forced IOMMU, may show whether Contiguous Pages For IO is worth it
or not.
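
To be concrete about the sort of API extension I mean, a rough sketch
(alloc_pages_contig() is a name I'm inventing here for illustration, and
it glosses over how the caller would free the block piecemeal):

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Sketch only: return nr_pages physically contiguous pages if a block
 * of that size is available, otherwise fall back to order-0 pages.
 * Assumes nr_pages is a power of two.
 */
static int alloc_pages_contig(unsigned int gfp_mask, int nr_pages,
                              struct page **pages)
{
        struct page *block;
        int i;

        block = alloc_pages(gfp_mask, get_order(nr_pages * PAGE_SIZE));
        if (block) {
                /* struct pages of one block are adjacent in mem_map */
                for (i = 0; i < nr_pages; i++)
                        pages[i] = block + i;
                return nr_pages;
        }

        /* No contiguous block available: order-0 allocations as today. */
        for (i = 0; i < nr_pages; i++) {
                pages[i] = alloc_page(gfp_mask);
                if (!pages[i])
                        break;
        }
        return i;
}

The interesting part is less the helper itself than whether the allocator
can satisfy the high-order request cheaply, which is what the
fragmentation avoidance work is about.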

This will take a while: I'll have to develop some mechanism for measuring
it along the way, and I only work on this two days a week.

-- 
Mel Gorman


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-25 Thread Andi Kleen
On Tue, Jan 25, 2005 at 09:02:34AM -0500, Mukker, Atul wrote:
>  
> > e.g. performance on megaraid controllers (very popular 
> > because a big PC vendor ships them) was always quite bad on 
> > Linux. Up to the point that specific IO workloads run half as 
> > fast on a megaraid compared to other controllers. I heard 
> > they do work better on Windows.
> > 
> 
> > Ideally the Linux IO patterns would look similar to the 
> > Windows IO patterns, then we could reuse all the 
> > optimizations the controller vendors did for Windows :)
> 
> LSI would leave no stone unturned to make the performance better for
> megaraid controllers under Linux. If you have some hard data in relation to
> comparison of performance for adapters from other vendors, please share with
> us. We would definitely strive to better it.

Sorry for being vague on this. I don't have much hard data on this,
just telling an anecdote. The issue we saw was over a year ago
and on a machine running an IO intensive multi process stress test
(I believe it was an AIM7 variant with some tweaked workfile). When the test
was moved to a machine with a megaraid controller it ran significantly
slower, compared to the old setup with a non-RAID SCSI controller from
a different vendor. Unfortunately I no longer know the exact
type/firmware revision etc. of the megaraid that showed the problem.

If you have already fixed the issues then please accept my apologies.

> The megaraid driver is open source, do you see anything that driver can do
> to improve performance. We would greatly appreciate any feedback in this
> regard and definitely incorporate in the driver. The FW under Linux and
> windows is same, so I do not see how the megaraid stack should perform
> differently under Linux and windows?

My understanding (may be incomplete) of the issue is basically what
Steve said: something in the stack doesn't like the Linux IO patterns
with often relatively long SG lists, which are longer than in some
other popular OS. This is unlikely to be the Linux driver
(drivers tend to just pass the SG lists through without too much
processing); more likely it's the firmware or something below it.

-Andi


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-25 Thread Andi Kleen
On Tue, Jan 25, 2005 at 02:27:57PM +, Christoph Hellwig wrote:
> > It is not the driver per se, but the way the memory which is the I/O
> > source/target is presented to the driver. In linux there is a good
> > chance it will have to use more scatter gather elements to represent
> > the same amount of data.
> 
> Note that a change made a few month ago after seeing issues with
> aacraid means it's much more likely to see contingous memory,
> there were some numbers on linux-scsi and/or linux-kernel.

But only at the beginning. IIRC after a few days of uptime
and memory fragmentation it degenerates back to the old numbers.

Perhaps the recent anti-fragmentation work will help more.

-Andi

P.S.: on an AMD x86-64 box the theory can be relatively easily tested:
just run with iommu=force,biomerge, which will use the IOMMU to merge
SG elements.  I just don't recommend it for production because some errors
are not well handled.
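
For reference, that is just a kernel command line option; with grub it
would look something like this (kernel image and root device here are
placeholders):

    kernel /boot/vmlinuz-2.6.10 ro root=/dev/sda1 iommu=force,biomerge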


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-25 Thread Christoph Hellwig
> It is not the driver per se, but the way the memory which is the I/O
> source/target is presented to the driver. In linux there is a good
> chance it will have to use more scatter gather elements to represent
> the same amount of data.

Note that a change made a few months ago after seeing issues with
aacraid means it's much more likely to see contiguous memory;
there were some numbers on linux-scsi and/or linux-kernel.



Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-25 Thread Steve Lord
Mukker, Atul wrote:
> LSI would leave no stone unturned to make the performance better for
> megaraid controllers under Linux. If you have some hard data in relation to
> comparison of performance for adapters from other vendors, please share with
> us. We would definitely strive to better it.
>
> The megaraid driver is open source, do you see anything that driver can do
> to improve performance. We would greatly appreciate any feedback in this
> regard and definitely incorporate in the driver. The FW under Linux and
> windows is same, so I do not see how the megaraid stack should perform
> differently under Linux and windows?

It is not the driver per se, but the way the memory which is the I/O
source/target is presented to the driver. In linux there is a good
chance it will have to use more scatter gather elements to represent
the same amount of data.

Steve


RE: [PATCH] Avoiding fragmentation through different allocator

2005-01-25 Thread Mukker, Atul
 
> e.g. performance on megaraid controllers (very popular 
> because a big PC vendor ships them) was always quite bad on 
> Linux. Up to the point that specific IO workloads run half as 
> fast on a megaraid compared to other controllers. I heard 
> they do work better on Windows.
> 

> Ideally the Linux IO patterns would look similar to the 
> Windows IO patterns, then we could reuse all the 
> optimizations the controller vendors did for Windows :)

LSI would leave no stone unturned to make the performance better for
megaraid controllers under Linux. If you have some hard data in relation to
comparison of performance for adapters from other vendors, please share with
us. We would definitely strive to better it.

The megaraid driver is open source; do you see anything the driver can do
to improve performance? We would greatly appreciate any feedback in this
regard and would definitely incorporate it in the driver. The FW under
Linux and windows is the same, so I do not see how the megaraid stack
should perform differently under Linux and windows.

Thanks

Atul Mukker
Architect, Drivers and BIOS
LSI Logic Corporation


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Andi Kleen
Steve Lord <[EMAIL PROTECTED]> writes:
>
> I realize this is one data point on one end of the scale, but I
> just wanted to make the point that there are cases where it
> does matter. Hopefully William's little change from last
> year has helped out a lot.

There are more datapoints: 

e.g. performance on megaraid controllers (very popular because a big
PC vendor ships them) was always quite bad on Linux. Up to the point
that specific IO workloads run half as fast on a megaraid compared to
other controllers. I heard they do work better on Windows.

Also I did some experiments with coalescing SG lists in the Opteron IOMMU
some time ago. With an MPT Fusion controller, and forcing all SG lists
through the IOMMU so that the SCSI controller only ever saw contiguous
mappings, I saw ~5% improvement on some IO tests.

Unfortunately there are some problems that don't allow enabling this
unconditionally. But it gives strong evidence that MPT Fusion prefers
shorter SG lists too.

So it seems to be worthwhile to optimize for shorter SG lists.

Ideally the Linux IO patterns would look similar to the Windows IO patterns,
then we could reuse all the optimizations the controller vendors
did for Windows :)
 
-Andi


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Steve Lord
James Bottomley wrote:
> Well, the basic advice would be not to worry too much about
> fragmentation from the point of view of I/O devices.  They mostly all do
> scatter gather (SG) onboard as an intelligent processing operation and
> they're very good at it.
>
> No one has ever really measured an effect we can say "This is due to the
> card's SG engine".  So, the rule we tend to follow is that if SG element
> reduction comes for free, we take it.  The issue that actually causes
> problems isn't the reduction in processing overhead, it's that the
> device's SG list is usually finite in size and so it's worth conserving
> if we can; however it's mostly not worth conserving at the expense of
> processor cycles.

Depends on the device at the other end of the scsi/fiber channel.
We have seen the processor in raid devices get maxed out by linux
when it is not maxed out by windows. Windows tends to be more device
friendly (I hate to say it), by sending larger and fewer scatter gather
elements than linux does.
Running an LSI raid over fiberchannel with 4 ports, windows was
able to sustain ~830 Mbytes/sec, basically channel speed using
only 1500 commands a second. Linux peaked at 550 Mbytes/sec using
over 4000 scsi commands to do it - the sustained rate was more
like 350 Mbytes/sec, I think at the end of the day linux was
sending 128K per scsi request. These numbers predate the current
linux scsi and io code, and I do not have the hardware to rerun
them right now.
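
Back of the envelope from those numbers, per scsi command that is roughly:

    windows:  ~830 MB/s / ~1500 commands/s  ~=  0.55 MB (~550K) per command
    linux:    ~550 MB/s / ~4000 commands/s  ~=  0.14 MB (~140K) per command

which lines up with the 128K-per-request figure.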
I realize this is one data point on one end of the scale, but I
just wanted to make the point that there are cases where it
does matter. Hopefully William's little change from last
year has helped out a lot.
Steve


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread James Bottomley
On Mon, 2005-01-24 at 13:49 -0200, Marcelo Tosatti wrote:
> So is it valid to affirm that on average an operation with one SG element 
> pointing to a 1MB 
> region is similar in speed to an operation with 16 SG elements each pointing 
> to a 64K 
> region due to the efficient onboard SG processing? 

it's within a few percent, yes.  And the figures depend on how good the
I/O card is at it.  I can imagine there are some wildly varying I/O
cards out there.

However, also remember that 1MB of I/O is getting beyond what's sensible
for a disc device anyway.  The cable speed is much faster than the
platter speed, so the device takes the I/O into its cache as it services
it.  If you overrun the cache it will burp (disconnect) and force a
reconnection to get the rest (effectively splitting the I/O up anyway).
This doesn't apply to arrays with huge caches, but it does to pretty
much everything else.  The average disc cache size is only a megabyte or
so.

James




Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Grant Grundler
On Mon, Jan 24, 2005 at 10:29:52AM -0200, Marcelo Tosatti wrote:
> Grant Grundler and James Bottomley have been working on this area,
> they might want to add some comments to this discussion.
> 
> It seems HP (Grant et all) has pursued using big pages on IA64 (64K)
> for this purpose.

Marcelo,
That might have been Alex Williamson... but the reason for 64K pages
is to reduce TLB thrashing, not to speed up IO.

On HP ZX1 boxes, SG performance is slightly better (max +5%) when going
through the IOMMU than when bypassing it. The IOMMU can perfectly
coalesce DMA pages but has a small CPU and DMA cost to do so as well.

Otherwise, I totally agree with James. IO devices do scatter-gather
pretty well and IO subsystems are tuned for page-size chunk or
smaller anyway.

...
> > I could keep digging, but I think the bottom line is that having large
> > pages generally available rather than a fixed setting is desirable. 
> 
> Definately, yes. Thanks for the pointers. 

Big pages are good for the CPU TLB and that's where most of the
research has been done. I think IO devices have learned to cope
with the fact that a lot less has been (or can be, for many
workloads) done to coalesce IO pages.

grant


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Marcelo Tosatti
On Mon, Jan 24, 2005 at 10:44:12AM -0600, James Bottomley wrote:
> On Mon, 2005-01-24 at 10:29 -0200, Marcelo Tosatti wrote:
> > Since the pages which compose IO operations are most likely sparse (not 
> > physically contiguous),
> > the driver+device has to perform scatter-gather IO on the pages. 
> > 
> > The idea is that if we can have larger memory blocks scatter-gather IO can 
> > use less SG list 
> > elements (decreased CPU overhead, decreased device overhead, faster). 
> > 
> > Best scenario is where only one sg element is required (ie one huge 
> > physically contiguous block).
> > 
> > Old devices/unprepared drivers which are not able to perform SG/IO
> > suffer with sequential small sized operations.
> > 
> > I'm far away from being a SCSI/ATA knowledgeable person, the storage people 
> > can 
> > help with expertise here.
> > 
> > Grant Grundler and James Bottomley have been working on this area, they 
> > might want to 
> > add some comments to this discussion.
> > 
> > It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for 
> > this purpose.
> 
> Well, the basic advice would be not to worry too much about
> fragmentation from the point of view of I/O devices.  They mostly all do
> scatter gather (SG) onboard as an intelligent processing operation and
> they're very good at it.

So is it valid to affirm that on average an operation with one SG element
pointing to a 1MB region is similar in speed to an operation with 16 SG
elements each pointing to a 64K region, due to the efficient onboard SG
processing?

> No one has ever really measured an effect we can say "This is due to the
> card's SG engine".  So, the rule we tend to follow is that if SG element
> reduction comes for free, we take it.  The issue that actually causes
> problems isn't the reduction in processing overhead, it's that the
> device's SG list is usually finite in size and so it's worth conserving
> if we can; however it's mostly not worth conserving at the expense of
> processor cycles.
>
> The bottom line is that the I/O (block) subsystem is very efficient at
> coalescing (both in block space and in physical memory space) and we've
> got it to the point where it's about as efficient as it can be.  If
> you're going to give us better physical contiguity properties, we'll
> take them, but if you spend extra cycles doing it, the chances are
> you'll slow down the I/O throughput path.

OK! thanks.


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread James Bottomley
On Mon, 2005-01-24 at 10:29 -0200, Marcelo Tosatti wrote:
> Since the pages which compose IO operations are most likely sparse (not 
> physically contiguous),
> the driver+device has to perform scatter-gather IO on the pages. 
> 
> The idea is that if we can have larger memory blocks scatter-gather IO can 
> use less SG list 
> elements (decreased CPU overhead, decreased device overhead, faster). 
> 
> Best scenario is where only one sg element is required (ie one huge 
> physically contiguous block).
> 
> Old devices/unprepared drivers which are not able to perform SG/IO
> suffer with sequential small sized operations.
> 
> I'm far away from being a SCSI/ATA knowledgeable person, the storage people 
> can 
> help with expertise here.
> 
> Grant Grundler and James Bottomley have been working on this area, they might 
> want to 
> add some comments to this discussion.
> 
> It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for this 
> purpose.

Well, the basic advice would be not to worry too much about
fragmentation from the point of view of I/O devices.  They mostly all do
scatter gather (SG) onboard as an intelligent processing operation and
they're very good at it.

No one has ever really measured an effect we can say "This is due to the
card's SG engine".  So, the rule we tend to follow is that if SG element
reduction comes for free, we take it.  The issue that actually causes
problems isn't the reduction in processing overhead, it's that the
device's SG list is usually finite in size and so it's worth conserving
if we can; however it's mostly not worth conserving at the expense of
processor cycles.

The bottom line is that the I/O (block) subsystem is very efficient at
coalescing (both in block space and in physical memory space) and we've
got it to the point where it's about as efficient as it can be.  If
you're going to give us better physical contiguity properties, we'll
take them, but if you spend extra cycles doing it, the chances are
you'll slow down the I/O throughput path.

James




Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Marcelo Tosatti

James and Grant added to CC.

On Mon, Jan 24, 2005 at 01:28:47PM +, Mel Gorman wrote:
> On Sat, 22 Jan 2005, Marcelo Tosatti wrote:
> 
> > > > I was thinking that it would be nice to have a set of high-order
> > > > intensive workloads, and I wonder what are the most common high-order
> > > > allocation paths which fail.
> > > >
> > >
> > > Agreed. As I am not fully sure what workloads require high-order
> > > allocations, I updated VMRegress to keep track of the count of
> > > allocations and released 0.11
> > > (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
> > > use it to track allocations, do the following
> > >
> > > < VMRegress instructions snipped>
> >
> > Great, excellent! Thanks.
> >
> > I plan to spend some time testing and trying to understand the vmregress 
> > package
> > this week.
> >
> 
> The documentation is not in sync with the code as the package is fairly
> large to maintain as a side-project. For the recent data I posted, The
> interesting parts of the tools are;
> 
> 1. bin/extfrag_stat.pl will display external fragmentation as a percentage
> of each order. I can go more into the calculation of this if anyone is
> interested. It does not require any special patches or modules
> 
> 2. bin/intfrag_stat.pl will display internal fragmentation in the system.
> Use the --man switch to get a list of all options. Linux occasionally
> suffers badly from internal fragmentation but it's a problem for another
> time
> 
> 3. mapfrag_stat.pl is what I used to map where allocations are in the
> address space. It requires the kernel patch in
> kernel_patches/v2.6/trace_pagealloc-map-formbuddy.diff (there is a
> non-mbuddy version in there) before the vmregress kernel modules can be
> loaded
> 
> 4. extfrag_stat_overtime.pl tracks external fragmentation over time
> although the figures are not very useful. It can also graph what
> fragmentation for some orders are over time. The figures are not useful
> because the fragmentation figures are based on free pages and does not
> take into account the layout of the currently allocated pages.
> 
> 5. The module in src/test/highalloc.ko is what I used to test high-order
> allocations. It creates a proc entry /proc/vmregress/test_highalloc that
> can be read or written. "echo Order Pages >
> /proc/vmregress/test_highalloc" will attempt to allocate 2^Order pages
> "Pages" times.
> 
> The perl scripts are smart enough to load the modules they need at runtime
> if the modules have been installed with "make install".

OK, thanks very much for the information - you might want to write this down 
into a text file and add it to the tarball :)

> > > > It mostly depends on hardware because most high-order allocations happen
> > > > inside device drivers? What are the kernel codepaths which try to do
> > > > high-order allocations and fallback if failed?
> > > >
> > >
> > > I'm not sure. I think that the paths we exercise right now will be largely
> > > artifical. For example, you can force order-2 allocations by scping a
> > > large file through localhost (because of the large MTU in that interface).
> > > I have not come up with another meaningful workload that guarentees
> > > high-order allocations yet.
> >
> > Thoughts and criticism of the following ideas are very much appreciated:
> >
> > In private conversation with wli (who helped me providing this information) 
> > we can
> > conjecture the following:
> >
> > Modern IO devices are capable of doing scatter/gather IO.
> >
> > There is overhead associated with setting up and managing the
> > scatter/gather tables.
> >
> > The benefit of large physically contiguous blocks is the ability to
> > avoid the SG management overhead.
> >
> 
> Do we get this benefit right now? 

Since the pages which compose IO operations are most likely sparse (not
physically contiguous), the driver+device has to perform scatter-gather
IO on the pages.

The idea is that if we can have larger memory blocks, scatter-gather IO
can use fewer SG list elements (decreased CPU overhead, decreased device
overhead, faster).

Best scenario is where only one sg element is required (ie one huge
physically contiguous block).

Old devices/unprepared drivers which are not able to perform SG/IO
suffer from sequential small sized operations.
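
To make the driver-level picture concrete, a rough sketch of the usual
2.6 pattern (my_dev, MAX_SEGS and issue_cmd() are made-up names; locking
and error handling left out):

#include <linux/blkdev.h>
#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

#define MAX_SEGS 32                     /* made up: device's SG table size */

struct my_dev {                         /* made up: per-adapter state */
        struct pci_dev *pdev;
        struct scatterlist sg[MAX_SEGS];
};

int issue_cmd(struct my_dev *dev, struct scatterlist *sg, int nents);

static int queue_rq(struct my_dev *dev, struct request *rq)
{
        int nents, mapped;

        /* The block layer merges physically adjacent pages, so one SG
         * element can already cover several pages if they happen to be
         * contiguous. */
        nents = blk_rq_map_sg(rq->q, rq, dev->sg);

        /* Mapping may merge further (e.g. through an IOMMU), so "mapped"
         * can come back smaller than nents. */
        mapped = dma_map_sg(&dev->pdev->dev, dev->sg, nents,
                            rq_data_dir(rq) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);

        /* The more fragmented the pages backing the request, the longer
         * these lists get and the more descriptor fetching the device
         * has to do per command. */
        return issue_cmd(dev, dev->sg, mapped);
}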

I'm far away from being a SCSI/ATA knowledgeable person; the storage
people can help with expertise here.

Grant Grundler and James Bottomley have been working on this area, they
might want to add some comments to this discussion.

It seems HP (Grant et al.) has pursued using big pages on IA64 (64K) for
this purpose.

> I read through the path of
> generic_file_readv(). If I am reading this correctly (first reading, so
> may not be right), scatter/gather IO will always be using order-0 pages.
> Is this really true?

Yes, it is. 

I was referring to scatter/gather IO at the device driver level, not SG
IO at application level (readv/writev).

Thing is that virtually contiguous data buffers which are operated on
with read/write, aio_read/aio_write, etc. become in fact scatter-gather
operations at the device level if they are not physically contiguous.

Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Mel Gorman
On Sat, 22 Jan 2005, Marcelo Tosatti wrote:

> > > I was thinking that it would be nice to have a set of high-order
> > > intensive workloads, and I wonder what are the most common high-order
> > > allocation paths which fail.
> > >
> >
> > Agreed. As I am not fully sure what workloads require high-order
> > allocations, I updated VMRegress to keep track of the count of
> > allocations and released 0.11
> > (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
> > use it to track allocations, do the following
> >
> > < VMRegress instructions snipped>
>
> Great, excellent! Thanks.
>
> I plan to spend some time testing and trying to understand the vmregress 
> package
> this week.
>

The documentation is not in sync with the code as the package is fairly
large to maintain as a side-project. For the recent data I posted, the
interesting parts of the tools are:

1. bin/extfrag_stat.pl will display external fragmentation as a percentage
of each order. I can go more into the calculation of this if anyone is
interested. It does not require any special patches or modules

2. bin/intfrag_stat.pl will display internal fragmentation in the system.
Use the --man switch to get a list of all options. Linux occasionally
suffers badly from internal fragmentation but it's a problem for another
time

3. mapfrag_stat.pl is what I used to map where allocations are in the
address space. It requires the kernel patch in
kernel_patches/v2.6/trace_pagealloc-map-formbuddy.diff (there is a
non-mbuddy version in there) before the vmregress kernel modules can be
loaded

4. extfrag_stat_overtime.pl tracks external fragmentation over time,
although the figures are not very useful. It can also graph what the
fragmentation for some orders looks like over time. The figures are not
useful because they are based on free pages and do not take into account
the layout of the currently allocated pages.

5. The module in src/test/highalloc.ko is what I used to test high-order
allocations. It creates a proc entry /proc/vmregress/test_highalloc that
can be read or written. "echo Order Pages >
/proc/vmregress/test_highalloc" will attempt to allocate 2^Order pages
"Pages" times.

The perl scripts are smart enough to load the modules they need at runtime
if the modules have been installed with "make install".

> > > It mostly depends on hardware because most high-order allocations happen
> > > inside device drivers? What are the kernel codepaths which try to do
> > > high-order allocations and fallback if failed?
> > >
> >
> > I'm not sure. I think that the paths we exercise right now will be largely
> > artifical. For example, you can force order-2 allocations by scping a
> > large file through localhost (because of the large MTU in that interface).
> > I have not come up with another meaningful workload that guarentees
> > high-order allocations yet.
>
> Thoughts and criticism of the following ideas are very much appreciated:
>
> In private conversation with wli (who helped me providing this information) 
> we can
> conjecture the following:
>
> Modern IO devices are capable of doing scatter/gather IO.
>
> There is overhead associated with setting up and managing the
> scatter/gather tables.
>
> The benefit of large physically contiguous blocks is the ability to
> avoid the SG management overhead.
>

Do we get this benefit right now? I read through the path of
generic_file_readv(). If I am reading this correctly (first reading, so
may not be right), scatter/gather IO will always be using order-0 pages.
Is this really true?

From what I can see, the buffers being written to for readv() are all in
userspace so are going to be order-0 (unless hugetlb is in use, is that
the really interesting case?). For reading from the disk, the blocksize is
what will be important and we can't create a filesystem with blocksizes
greater than pagesize right now.

So, for scatter/gather to take advantage of contiguous blocks, is more
work required? If not, what am I missing?

> Also filesystems benefit from big physically contiguous blocks. Quoting
> wli "they want bigger blocks and contiguous memory to match bigger
> blocks..."
>

This I don't get... What filesystems support really large blocks? ext2/3
only support pagesize and reiser will create a filesystem with a blocksize
of 8192, but not mount it.

> I completly agree that your simplified allocator decreases fragmentation
> which in turn benefits the system overall.
>
> This is an area which can be further improved - ie efficiency in
> reducing fragmentation is excellent.  I sincerely appreciate the work
> you are doing!
>

Thanks.

> > 
> >
> > Right now, I believe that the pool of huge pages is of a fixed size
> > because of fragmentation difficulties. If we knew we could allocate huge
> > pages, this pool would not have to be fixed. Some applications will
> > heavily benefit from this. While databases are the obvious one,
> > applications with large heaps will also benefit like 

Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Marcelo Tosatti

James and Grant added to CC.

On Mon, Jan 24, 2005 at 01:28:47PM +, Mel Gorman wrote:
 On Sat, 22 Jan 2005, Marcelo Tosatti wrote:
 
I was thinking that it would be nice to have a set of high-order
intensive workloads, and I wonder what are the most common high-order
allocation paths which fail.
   
  
   Agreed. As I am not fully sure what workloads require high-order
   allocations, I updated VMRegress to keep track of the count of
   allocations and released 0.11
   (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
   use it to track allocations, do the following
  
VMRegress instructions snipped
 
  Great, excellent! Thanks.
 
  I plan to spend some time testing and trying to understand the vmregress 
  package
  this week.
 
 
 The documentation is not in sync with the code as the package is fairly
 large to maintain as a side-project. For the recent data I posted, The
 interesting parts of the tools are;
 
 1. bin/extfrag_stat.pl will display external fragmentation as a percentage
 of each order. I can go more into the calculation of this if anyone is
 interested. It does not require any special patches or modules
 
 2. bin/intfrag_stat.pl will display internal fragmentation in the system.
 Use the --man switch to get a list of all options. Linux occasionally
 suffers badly from internal fragmentation but it's a problem for another
 time
 
 3. mapfrag_stat.pl is what I used to map where allocations are in the
 address space. It requires the kernel patch in
 kernel_patches/v2.6/trace_pagealloc-map-formbuddy.diff (there is a
 non-mbuddy version in there) before the vmregress kernel modules can be
 loaded
 
 4. extfrag_stat_overtime.pl tracks external fragmentation over time
 although the figures are not very useful. It can also graph what
 fragmentation for some orders are over time. The figures are not useful
 because the fragmentation figures are based on free pages and does not
 take into account the layout of the currently allocated pages.
 
 5. The module in src/test/highalloc.ko is what I used to test high-order
 allocations. It creates a proc entry /proc/vmregress/test_highalloc that
 can be read or written. echo Order Pages 
 /proc/vmregress/test_highalloc will attempt to allocate 2^Order pages
 Pages times.
 
 The perl scripts are smart enough to load the modules they need at runtime
 if the modules have been installed with make install.

OK, thanks very much for the information - you might want to write this down 
into a text file and add it to the tarball :)

It mostly depends on hardware because most high-order allocations happen
inside device drivers? What are the kernel codepaths which try to do
high-order allocations and fallback if failed?
   
  
   I'm not sure. I think that the paths we exercise right now will be largely
   artifical. For example, you can force order-2 allocations by scping a
   large file through localhost (because of the large MTU in that interface).
   I have not come up with another meaningful workload that guarentees
   high-order allocations yet.
 
  Thoughts and criticism of the following ideas are very much appreciated:
 
  In private conversation with wli (who helped me providing this information) 
  we can
  conjecture the following:
 
  Modern IO devices are capable of doing scatter/gather IO.
 
  There is overhead associated with setting up and managing the
  scatter/gather tables.
 
  The benefit of large physically contiguous blocks is the ability to
  avoid the SG management overhead.
 
 
 Do we get this benefit right now? 

Since the pages which compose IO operations are most likely sparse (not 
physically contiguous),
the driver+device has to perform scatter-gather IO on the pages. 

The idea is that if we can have larger memory blocks scatter-gather IO can use 
less SG list 
elements (decreased CPU overhead, decreased device overhead, faster). 

Best scenario is where only one sg element is required (ie one huge physically 
contiguous block).

Old devices/unprepared drivers which are not able to perform SG/IO
suffer with sequential small sized operations.

I'm far away from being a SCSI/ATA knowledgeable person, the storage people can 
help with expertise here.

Grant Grundler and James Bottomley have been working on this area, they might 
want to 
add some comments to this discussion.

It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for this 
purpose.

 I read through the path of
 generic_file_readv(). If I am reading this correctly (first reading, so
 may not be right), scatter/gather IO will always be using order-0 pages.
 Is this really true?

Yes, it is. 

I was referring to scatter/gather IO at the device driver level, not SG IO at 
application level (readv/writev). 

Thing is that virtually contiguous data buffers which are operated on with 
read/write, 
aio_read/aio_write, etc. become in fact scatter-gather operations at the device 
level if they are not physically 

Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread James Bottomley
On Mon, 2005-01-24 at 10:29 -0200, Marcelo Tosatti wrote:
 Since the pages which compose IO operations are most likely sparse (not 
 physically contiguous),
 the driver+device has to perform scatter-gather IO on the pages. 
 
 The idea is that if we can have larger memory blocks scatter-gather IO can 
 use less SG list 
 elements (decreased CPU overhead, decreased device overhead, faster). 
 
 Best scenario is where only one sg element is required (ie one huge 
 physically contiguous block).
 
 Old devices/unprepared drivers which are not able to perform SG/IO
 suffer with sequential small sized operations.
 
 I'm far away from being a SCSI/ATA knowledgeable person, the storage people 
 can 
 help with expertise here.
 
 Grant Grundler and James Bottomley have been working on this area, they might 
 want to 
 add some comments to this discussion.
 
 It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for this 
 purpose.

Well, the basic advice would be not to worry too much about
fragmentation from the point of view of I/O devices.  They mostly all do
scatter gather (SG) onboard as an intelligent processing operation and
they're very good at it.

No one has ever really measured an effect we can say This is due to the
card's SG engine.  So, the rule we tend to follow is that if SG element
reduction comes for free, we take it.  The issue that actually causes
problems isn't the reduction in processing overhead, it's that the
device's SG list is usually finite in size and so it's worth conserving
if we can; however it's mostly not worth conserving at the expense of
processor cycles.

The bottom line is that the I/O (block) subsystem is very efficient at
coalescing (both in block space and in physical memory space) and we've
got it to the point where it's about as efficient as it can be.  If
you're going to give us better physical contiguity properties, we'll
take them, but if you spend extra cycles doing it, the chances are
you'll slow down the I/O throughput path.

James


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Marcelo Tosatti
On Mon, Jan 24, 2005 at 10:44:12AM -0600, James Bottomley wrote:
 On Mon, 2005-01-24 at 10:29 -0200, Marcelo Tosatti wrote:
  Since the pages which compose IO operations are most likely sparse (not 
  physically contiguous),
  the driver+device has to perform scatter-gather IO on the pages. 
  
  The idea is that if we can have larger memory blocks scatter-gather IO can 
  use less SG list 
  elements (decreased CPU overhead, decreased device overhead, faster). 
  
  Best scenario is where only one sg element is required (ie one huge 
  physically contiguous block).
  
  Old devices/unprepared drivers which are not able to perform SG/IO
  suffer with sequential small sized operations.
  
  I'm far away from being a SCSI/ATA knowledgeable person, the storage people 
  can 
  help with expertise here.
  
  Grant Grundler and James Bottomley have been working on this area, they 
  might want to 
  add some comments to this discussion.
  
  It seems HP (Grant et all) has pursued using big pages on IA64 (64K) for 
  this purpose.
 
 Well, the basic advice would be not to worry too much about
 fragmentation from the point of view of I/O devices.  They mostly all do
 scatter gather (SG) onboard as an intelligent processing operation and
 they're very good at it.

So is it valid to affirm that on average an operation with one SG element 
pointing to a 1MB 
region is similar in speed to an operation with 16 SG elements each pointing to 
a 64K 
region due to the efficient onboard SG processing? 

 No one has ever really measured an effect we can say This is due to the
 card's SG engine.  So, the rule we tend to follow is that if SG element
 reduction comes for free, we take it.  The issue that actually causes
 problems isn't the reduction in processing overhead, it's that the
 device's SG list is usually finite in size and so it's worth conserving
 if we can; however it's mostly not worth conserving at the expense of
 processor cycles.

 The bottom line is that the I/O (block) subsystem is very efficient at
 coalescing (both in block space and in physical memory space) and we've
 got it to the point where it's about as efficient as it can be.  If
 you're going to give us better physical contiguity properties, we'll
 take them, but if you spend extra cycles doing it, the chances are
 you'll slow down the I/O throughput path.

OK! thanks.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Grant Grundler
On Mon, Jan 24, 2005 at 10:29:52AM -0200, Marcelo Tosatti wrote:
 Grant Grundler and James Bottomley have been working on this area,
 they might want to add some comments to this discussion.
 
 It seems HP (Grant et all) has pursued using big pages on IA64 (64K)
 for this purpose.

Marcello,
That might have been Alex Williamson...but the reasons for 64K pages
is to reduce TLB thrashing, not faster IO.

On HP ZX1 boxes, SG performance is slightly better (max +5%) when going
through the IOMMU than when bypassing it. The IOMMU can perfectly
coalesce DMA pages but has a small CPU and DMA cost to do so as well.

Otherwise, I totally agree with James. IO devices do scatter-gather
pretty well and IO subsystems are tuned for page-size chunk or
smaller anyway.

...
  I could keep digging, but I think the bottom line is that having large
  pages generally available rather than a fixed setting is desirable. 
 
 Definitely, yes. Thanks for the pointers.

Big pages are good for the CPU TLB and that's where most of the
research has been done. I think IO devices have learned to cope
with the fact that a lot less has been (or, for many workloads,
can be) done to coalesce IO pages.

grant
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread James Bottomley
On Mon, 2005-01-24 at 13:49 -0200, Marcelo Tosatti wrote:
 So is it valid to affirm that, on average, an operation with one SG element
 pointing to a 1MB region is similar in speed to an operation with 16 SG
 elements each pointing to a 64K region, due to the efficient onboard SG
 processing?

it's within a few percent, yes.  And the figures depend on how good the
I/O card is at it.  I can imagine there are some wildly varying I/O
cards out there.

However, also remember that 1MB of I/O is getting beyond what's sensible
for a disc device anyway.  The cable speed is much faster than the
platter speed, so the device takes the I/O into its cache as it services
it.  If you overrun the cache it will burp (disconnect) and force a
reconnection to get the rest (effectively splitting the I/O up anyway).
This doesn't apply to arrays with huge caches, but it does to pretty
much everything else.  The average disc cache size is only a megabyte or
so.

James


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Steve Lord
James Bottomley wrote:
Well, the basic advice would be not to worry too much about
fragmentation from the point of view of I/O devices.  They mostly all do
scatter gather (SG) onboard as an intelligent processing operation and
they're very good at it.
No one has ever really measured an effect we can say "This is due to the
card's SG engine."  So, the rule we tend to follow is that if SG element
reduction comes for free, we take it.  The issue that actually causes
problems isn't the reduction in processing overhead, it's that the
device's SG list is usually finite in size and so it's worth conserving
if we can; however it's mostly not worth conserving at the expense of
processor cycles.
Depends on the device at the other end of the SCSI/fibre channel.
We have seen the processor in RAID devices get maxed out by Linux
when it is not maxed out by Windows. Windows tends to be more device
friendly (I hate to say it), by sending larger and fewer scatter-gather
elements than Linux does.
Running an LSI RAID over fibre channel with 4 ports, Windows was
able to sustain ~830 Mbytes/sec, basically channel speed, using
only 1500 commands a second. Linux peaked at 550 Mbytes/sec using
over 4000 SCSI commands to do it - the sustained rate was more
like 350 Mbytes/sec. I think at the end of the day Linux was
sending 128K per SCSI request. These numbers predate the current
Linux SCSI and IO code, and I do not have the hardware to rerun
them right now.
I realize this is one data point on one end of the scale, but I
just wanted to make the point that there are cases where it
does matter. Hopefully William's little change from last
year has helped out a lot.
Steve
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Andi Kleen
Steve Lord [EMAIL PROTECTED] writes:

 I realize this is one data point on one end of the scale, but I
 just wanted to make the point that there are cases where it
 does matter. Hopefully William's little change from last
 year has helped out a lot.

There are more datapoints: 

e.g. performance on megaraid controllers (very popular because a big
PC vendor ships them) was always quite bad on Linux. Up to the point
that specific IO workloads run half as fast on a megaraid compared to
other controllers. I heard they do work better on Windows.

Also I did some experiments with coalescing SG lists in the Opteron IOMMU
some time ago. With an MPT Fusion controller, forcing all SG lists
through the IOMMU so that the SCSI controller only ever saw contiguous
mappings, I saw ~5% improvement on some IO tests.

Unfortunately there are some problems that prevent enabling this
unconditionally. But it gives strong evidence that the MPT Fusion prefers
shorter SG lists too.

So it seems to be worthwhile to optimize for shorter SG lists.
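
A hypothetical sketch of the kind of instrumentation this suggests: after
mapping a request's pages, count how many entries the IOMMU (or the block
layer) was able to merge away. The helper name and the debug message are
invented for illustration; only the 2.6-era pci_map_sg()/sg_dma_len() calls
are real.

#include <linux/pci.h>
#include <linux/kernel.h>

/* Sketch only: report how many mapped SG entries the controller will
 * actually see versus how many page-sized pieces went in.  A ratio close
 * to 1:1 means little merging happened. */
static void account_sg_merging(struct pci_dev *pdev,
                               struct scatterlist *sg, int nents_in)
{
        int i, nents_out;
        unsigned long bytes = 0;

        nents_out = pci_map_sg(pdev, sg, nents_in, PCI_DMA_TODEVICE);

        for (i = 0; i < nents_out; i++)
                bytes += sg_dma_len(&sg[i]);

        printk(KERN_DEBUG "sg: %d entries in, %d out, %lu bytes\n",
               nents_in, nents_out, bytes);

        /* unmap takes the original entry count, not the merged one */
        pci_unmap_sg(pdev, sg, nents_in, PCI_DMA_TODEVICE);
}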

Ideally the Linux IO patterns would look similar to the Windows IO patterns,
then we could reuse all the optimizations the controller vendors
did for Windows :)
 
-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-24 Thread Mel Gorman
On Sat, 22 Jan 2005, Marcelo Tosatti wrote:

   I was thinking that it would be nice to have a set of high-order
   intensive workloads, and I wonder what are the most common high-order
   allocation paths which fail.
  
 
  Agreed. As I am not fully sure what workloads require high-order
  allocations, I updated VMRegress to keep track of the count of
  allocations and released 0.11
  (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
  use it to track allocations, do the following
 
   VMRegress instructions snipped

 Great, excellent! Thanks.

 I plan to spend some time testing and trying to understand the vmregress
 package this week.


The documentation is not in sync with the code as the package is fairly
large to maintain as a side-project. For the recent data I posted, the
interesting parts of the tools are;

1. bin/extfrag_stat.pl will display external fragmentation as a percentage
of each order. I can go more into the calculation of this if anyone is
interested. It does not require any special patches or modules

2. bin/intfrag_stat.pl will display internal fragmentation in the system.
Use the --man switch to get a list of all options. Linux occasionally
suffers badly from internal fragmentation but it's a problem for another
time

3. mapfrag_stat.pl is what I used to map where allocations are in the
address space. It requires the kernel patch in
kernel_patches/v2.6/trace_pagealloc-map-formbuddy.diff (there is a
non-mbuddy version in there) before the vmregress kernel modules can be
loaded

4. extfrag_stat_overtime.pl tracks external fragmentation over time
although the figures are not very useful. It can also graph what the
fragmentation for some orders is over time. The figures are not useful
because they are based on free pages and do not
take into account the layout of the currently allocated pages.

5. The module in src/test/highalloc.ko is what I used to test high-order
allocations. It creates a proc entry /proc/vmregress/test_highalloc that
can be read or written. echo Order Pages >
/proc/vmregress/test_highalloc will attempt to allocate 2^Order pages,
Pages times (a sketch of what such a test boils down to follows below).

The perl scripts are smart enough to load the modules they need at runtime
if the modules have been installed with make install.
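
For readers who do not want to dig into the module, a minimal sketch of what
such a high-order allocation test boils down to; this is an illustration, not
the actual vmregress code, and the function name and counter are invented.

#include <linux/gfp.h>
#include <linux/mm.h>

/* Sketch only: try "attempts" allocations of 2^order pages and report how
 * many the buddy allocator could satisfy.  The real module exposes this
 * through /proc/vmregress/test_highalloc. */
static unsigned int test_highalloc(unsigned int order, unsigned int attempts)
{
        unsigned int i, success = 0;
        unsigned long addr;

        for (i = 0; i < attempts; i++) {
                addr = __get_free_pages(GFP_KERNEL, order);
                if (addr) {
                        success++;
                        free_pages(addr, order);
                }
        }
        return success;
}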

   It mostly depends on hardware because most high-order allocations happen
   inside device drivers? What are the kernel codepaths which try to do
   high-order allocations and fallback if failed?
  
 
  I'm not sure. I think that the paths we exercise right now will be largely
  artificial. For example, you can force order-2 allocations by scping a
  large file through localhost (because of the large MTU in that interface).
  I have not come up with another meaningful workload that guarantees
  high-order allocations yet.

 Thoughts and criticism of the following ideas are very much appreciated:

 In private conversation with wli (who helped me by providing this
 information) we can conjecture the following:

 Modern IO devices are capable of doing scatter/gather IO.

 There is overhead associated with setting up and managing the
 scatter/gather tables.

 The benefit of large physically contiguous blocks is the ability to
 avoid the SG management overhead.


Do we get this benefit right now? I read through the path of
generic_file_readv(). If I am reading this correctly (first reading, so
may not be right), scatter/gather IO will always be using order-0 pages.
Is this really true?

From what I can see, the buffers being written to for readv()  are all in
userspace so are going to be order-0 (unless hugetlb is in use, is that
the really interesting case?). For reading from the disk, the blocksize is
what will be important and we can't create a filesystem with blocksizes
greater than pagesize right now.

So, for scatter/gather to take advantage of contiguous blocks, is more
work required? If not, what am I missing?

 Also filesystems benefit from big physically contiguous blocks. Quoting
 wli: "they want bigger blocks and contiguous memory to match bigger
 blocks..."


This I don't get... What filesystems support really large blocks? ext2/3
only support pagesize and reiser will create a filesystem with a blocksize
of 8192, but not mount it.

 I completely agree that your simplified allocator decreases fragmentation
 which in turn benefits the system overall.

 This is an area which can be further improved - ie efficiency in
 reducing fragmentation is excellent.  I sincerely appreciate the work
 you are doing!


Thanks.

  Snip
 
  Right now, I believe that the pool of huge pages is of a fixed size
  because of fragmentation difficulties. If we knew we could allocate huge
  pages, this pool would not have to be fixed. Some applications will
  heavily benefit from this. While databases are the obvious one,
  applications with large heaps will also benefit like Java Virtual
  Machines. I can dig up papers that measured this on Solaris although I
  don't have them at 

Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-23 Thread Marcelo Tosatti
On Sat, Jan 22, 2005 at 07:59:49PM -0200, Marcelo Tosatti wrote:
> On Sat, Jan 22, 2005 at 09:48:20PM +, Mel Gorman wrote:
> > On Fri, 21 Jan 2005, Marcelo Tosatti wrote:
> > 
> > > On Thu, Jan 20, 2005 at 10:13:00AM +, Mel Gorman wrote:
> > > > 
> > >
> > > Hi Mel,
> > >
> > > I was thinking that it would be nice to have a set of high-order
> > > intensive workloads, and I wonder what are the most common high-order
> > > allocation paths which fail.
> > >
> > 
> > Agreed. As I am not fully sure what workloads require high-order
> > allocations, I updated VMRegress to keep track of the count of
> > allocations and released 0.11
> > (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
> > use it to track allocations, do the following
> > 
> > 1. Download and unpack vmregress
> > 2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff .
> > The patch currently requires the modified allocator but I can fix that up
> > if people want it. Build and deploy the kernel
> > 3. Build vmregress by
> >   ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy
> >   (or whatever path is appropriate)
> >   make
> > 4. Load the modules with;
> >   insmod src/code/vmregress_core.ko
> >   insmod src/sense/trace_alloccount.ko
> > 
> > This will create a proc entry /proc/vmregress/trace_alloccount that looks
> > something like;
> > 
> > Allocations (V1)
> > ---
> > KernNoRclm   997453  370   500000   
> >  0000
> > KernRclm  35279000000   
> >  0000
> > UserRclm9870808000000   
> >  0000
> > Total  10903540  370   500000   
> >  0000
> > 
> > Frees
> > -
> > KernNoRclm   590965  244   280000   
> >  0000
> > KernRclm 227100   6050000   
> >  0000
> > UserRclm7974200   73   170000   
> >  0000
> > Total  19695805  747  1000000   
> >  0000
> > 
> > To blank the counters, use
> > 
> > echo 0 > /proc/vmregress/trace_alloccount
> > 
> > Whatever workload we come up with, this proc entry will tell us if it is
> > exercising high-order allocations right now.
> 
> Great, excellent! Thanks.
> 
> I plan to spend some time testing and trying to understand the vmregress 
> package 
> this week.
>  
> > > It mostly depends on hardware because most high-order allocations happen
> > > inside device drivers? What are the kernel codepaths which try to do
> > > high-order allocations and fallback if failed?
> > >
> > 
> > I'm not sure. I think that the paths we exercise right now will be largely
> > artifical. For example, you can force order-2 allocations by scping a
> > large file through localhost (because of the large MTU in that interface).
> > I have not come up with another meaningful workload that guarentees
> > high-order allocations yet.
> 
> Thoughts and criticism of the following ideas are very much appreciated:
> 
> In private conversation with wli (who helped me providing this information) 
> we can 
> conjecture the following:
> 
> Modern IO devices are capable of doing scatter/gather IO.
> 
> There is overhead associated with setting up and managing the scatter/gather 
> tables. 
> 
> The benefit of large physically contiguous blocks is the ability to avoid the 
> SG 
> management overhead. 
> 
> Now the question is: The added overhead of allocating high order blocks 
> through migration 
> offsets the overhead of SG IO ? Quantifying that is interesting.

What is the overhead of the SG IO management, and how large is the improvement
without it?

Are block IO drivers trying to allocate big physical segments? I bet they are
not, because the "pool of huge pages" (as you say) is limited.

> 
> This depends on the driver implementation (how efficiently its able to manage 
> the SG IO tables) and 
> device/IO subsystem characteristics.
> 
> Also filesystems benefit from big physically contiguous blocks. Quoting wli
> "they want bigger blocks and contiguous memory to match bigger blocks..."
> 
> I completly agree that your simplified allocator decreases fragmentation 
> which in turn
> benefits the system overall. 
> 
> This is an area which can be further improved - ie efficiency in reducing 
> fragmentation 
> is excellent. 
> I sincerely appreciate the work you are doing!
> 
> > > To measure whether the cost of page migration offsets the ability to be
> > > able to deliver high-order allocations we want a set of meaningful
> > > performance tests?
> > >
> > 
> > Bear in mind, there are more 

Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-22 Thread Marcelo Tosatti
On Sat, Jan 22, 2005 at 09:48:20PM +, Mel Gorman wrote:
> On Fri, 21 Jan 2005, Marcelo Tosatti wrote:
> 
> > On Thu, Jan 20, 2005 at 10:13:00AM +, Mel Gorman wrote:
> > > 
> >
> > Hi Mel,
> >
> > I was thinking that it would be nice to have a set of high-order
> > intensive workloads, and I wonder what are the most common high-order
> > allocation paths which fail.
> >
> 
> Agreed. As I am not fully sure what workloads require high-order
> allocations, I updated VMRegress to keep track of the count of
> allocations and released 0.11
> (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
> use it to track allocations, do the following
> 
> 1. Download and unpack vmregress
> 2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff .
> The patch currently requires the modified allocator but I can fix that up
> if people want it. Build and deploy the kernel
> 3. Build vmregress by
>   ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy
>   (or whatever path is appropriate)
>   make
> 4. Load the modules with;
>   insmod src/code/vmregress_core.ko
>   insmod src/sense/trace_alloccount.ko
> 
> This will create a proc entry /proc/vmregress/trace_alloccount that looks
> something like;
> 
> Allocations (V1)
> ---
> KernNoRclm   997453  370   500000 
>0000
> KernRclm  35279000000 
>0000
> UserRclm9870808000000 
>0000
> Total  10903540  370   500000 
>0000
> 
> Frees
> -
> KernNoRclm   590965  244   280000 
>0000
> KernRclm 227100   6050000 
>0000
> UserRclm7974200   73   170000 
>0000
> Total  19695805  747  1000000 
>0000
> 
> To blank the counters, use
> 
> echo 0 > /proc/vmregress/trace_alloccount
> 
> Whatever workload we come up with, this proc entry will tell us if it is
> exercising high-order allocations right now.

Great, excellent! Thanks.

I plan to spend some time testing and trying to understand the vmregress
package this week.
 
> > It mostly depends on hardware because most high-order allocations happen
> > inside device drivers? What are the kernel codepaths which try to do
> > high-order allocations and fallback if failed?
> >
> 
> I'm not sure. I think that the paths we exercise right now will be largely
> artifical. For example, you can force order-2 allocations by scping a
> large file through localhost (because of the large MTU in that interface).
> I have not come up with another meaningful workload that guarentees
> high-order allocations yet.

Thoughts and criticism of the following ideas are very much appreciated:

In private conversation with wli (who helped me by providing this information)
we can conjecture the following:

Modern IO devices are capable of doing scatter/gather IO.

There is overhead associated with setting up and managing the scatter/gather 
tables. 

The benefit of large physically contiguous blocks is the ability to avoid the
SG management overhead.

Now the question is: does the added overhead of allocating high-order blocks
through migration offset the savings in SG IO overhead? Quantifying that is
interesting.

This depends on the driver implementation (how efficiently it is able to
manage the SG IO tables) and device/IO subsystem characteristics.

Also filesystems benefit from big physically contiguous blocks. Quoting wli
"they want bigger blocks and contiguous memory to match bigger blocks..."

I completely agree that your simplified allocator decreases fragmentation
which in turn benefits the system overall.

This is an area which can be further improved - ie efficiency in reducing
fragmentation is excellent.
I sincerely appreciate the work you are doing!

> > To measure whether the cost of page migration offsets the ability to be
> > able to deliver high-order allocations we want a set of meaningful
> > performance tests?
> >
> 
> Bear in mind, there are more considerations. The allocator potentially
> makes hotplug problems easier and could be easily tied into any
> page-zeroing system. Some of your own benchmarks also implied that the
> modified allocator helped some types of workloads which is beneficial in
> itself.The last consideration is HugeTLB pages, which I am hoping William
> will weigh in.
> 
> Right now, I believe that the pool of huge pages is of a fixed size
> because of fragmentation difficulties. If we knew we could allocate huge
> pages, this pool would not have to be fixed. Some 

Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-22 Thread Mel Gorman
On Fri, 21 Jan 2005, Marcelo Tosatti wrote:

> On Thu, Jan 20, 2005 at 10:13:00AM +, Mel Gorman wrote:
> > 
>
> Hi Mel,
>
> I was thinking that it would be nice to have a set of high-order
> intensive workloads, and I wonder what are the most common high-order
> allocation paths which fail.
>

Agreed. As I am not fully sure what workloads require high-order
allocations, I updated VMRegress to keep track of the count of
allocations and released 0.11
(http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
use it to track allocations, do the following

1. Download and unpack vmregress
2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff .
The patch currently requires the modified allocator but I can fix that up
if people want it. Build and deploy the kernel
3. Build vmregress by
  ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy
  (or whatever path is appropriate)
  make
4. Load the modules with;
  insmod src/code/vmregress_core.ko
  insmod src/sense/trace_alloccount.ko

This will create a proc entry /proc/vmregress/trace_alloccount that looks
something like;

Allocations (V1)
---
KernNoRclm   997453  370   500000   
 0000
KernRclm  35279000000   
 0000
UserRclm9870808000000   
 0000
Total  10903540  370   500000   
 0000

Frees
-
KernNoRclm   590965  244   280000   
 0000
KernRclm 227100   6050000   
 0000
UserRclm7974200   73   170000   
 0000
Total  19695805  747  1000000   
 0000

To blank the counters, use

echo 0 > /proc/vmregress/trace_alloccount

Whatever workload we come up with, this proc entry will tell us if it is
exercising high-order allocations right now.

> It mostly depends on hardware because most high-order allocations happen
> inside device drivers? What are the kernel codepaths which try to do
> high-order allocations and fallback if failed?
>

I'm not sure. I think that the paths we exercise right now will be largely
artificial. For example, you can force order-2 allocations by scping a
large file through localhost (because of the large MTU in that interface).
I have not come up with another meaningful workload that guarantees
high-order allocations yet.

> To measure whether the cost of page migration offsets the ability to be
> able to deliver high-order allocations we want a set of meaningful
> performance tests?
>

Bear in mind, there are more considerations. The allocator potentially
makes hotplug problems easier and could be easily tied into any
page-zeroing system. Some of your own benchmarks also implied that the
modified allocator helped some types of workloads, which is beneficial in
itself. The last consideration is HugeTLB pages, on which I am hoping William
will weigh in.

Right now, I believe that the pool of huge pages is of a fixed size
because of fragmentation difficulties. If we knew we could allocate huge
pages, this pool would not have to be fixed. Some applications will
heavily benefit from this. While databases are the obvious one,
applications with large heaps will also benefit like Java Virtual
Machines. I can dig up papers that measured this on Solaris although I
don't have them at hand right now.

We know right now that the overhead of this allocator is fairly low
(anyone got benchmarks to disagree?) but I understand that page migration
is relatively expensive. The allocator also does not have adverse
CPU+cache effects like migration, and the concept is fairly simple.

> Its quite possible that not all unsatisfiable high-order allocations
> want to force page migration (which is quite expensive in terms of
> CPU/cache). Only migrate on __GFP_NOFAIL ?
>

I still believe that with the allocator, we will only have to migrate in
exceptional circumstances.

> William, that same tradeoff exists for the zone balancing through
> migration idea you propose...
>

-- 
Mel Gorman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoiding fragmentation through different allocator

2005-01-21 Thread Marcelo Tosatti
On Thu, Jan 20, 2005 at 10:13:00AM +, Mel Gorman wrote:
> Changelog since V5
> o Fixed up gcc-2.95 errors
> o Fixed up whitespace damage
> 
> Changelog since V4
> o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies
>   with offsets to 2.6.11-rc1-mm1
> 
> Changelog since V3
> o inlined get_pageblock_type() and set_pageblock_type()
> o set_pageblock_type() now takes a zone parameter to avoid a call to 
> page_zone()
> o When taking from the global pool, do not scan all the low-order lists
> 
> Changelog since V2
> o Do not to interfere with the "min" decay
> o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably
>   anything to do with asynchronous IO
>   
> Changelog since V1
> o Update patch to 2.6.11-rc1
> o Cleaned up bug where memory was wasted on a large bitmap
> o Remove code that needed the binary buddy bitmaps
> o Update flags to avoid colliding with __GFP_ZERO changes
> o Extended fallback_count bean counters to show the fallback count for each
>   allocation type
> o In-code documentation

Hi Mel,

I was thinking that it would be nice to have a set of high-order intensive
workloads, and I wonder what are the most common high-order allocation paths
which fail.

It mostly depends on hardware because most high-order allocations happen inside
device drivers? What are the kernel codepaths which try to do high-order
allocations and fall back if they fail?

To measure whether the cost of page migration offsets the ability to be able to 
deliver
high-order allocations we want a set of meaningful performance tests?

It's quite possible that not all unsatisfiable high-order allocations want to
force page migration (which is quite expensive in terms of CPU/cache). Only
migrate on __GFP_NOFAIL?

William, that same tradeoff exists for the zone balancing through migration idea
you propose...


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Avoiding fragmentation through different allocator

2005-01-20 Thread Mel Gorman
Changelog since V5
o Fixed up gcc-2.95 errors
o Fixed up whitespace damage

Changelog since V4
o No changes. Applies cleanly against 2.6.11-rc1 and 2.6.11-rc1-bk6. Applies
  with offsets to 2.6.11-rc1-mm1

Changelog since V3
o inlined get_pageblock_type() and set_pageblock_type()
o set_pageblock_type() now takes a zone parameter to avoid a call to page_zone()
o When taking from the global pool, do not scan all the low-order lists

Changelog since V2
o Do not interfere with the "min" decay
o Update the __GFP_BITS_SHIFT properly. Old value broke fsync and probably
  anything to do with asynchronous IO
  
Changelog since V1
o Update patch to 2.6.11-rc1
o Cleaned up bug where memory was wasted on a large bitmap
o Remove code that needed the binary buddy bitmaps
o Update flags to avoid colliding with __GFP_ZERO changes
o Extended fallback_count bean counters to show the fallback count for each
  allocation type
o In-code documentation

Version 1
o Initial release against 2.6.9

This patch divides allocations into three different types of allocations;

UserReclaimable - These are userspace pages that are easily reclaimable. Right
now, all allocations of GFP_USER, GFP_HIGHUSER and disk buffers are
in this category. These pages are trivially reclaimed by writing
the page out to swap or syncing with backing storage

KernelReclaimable - These are pages allocated by the kernel that are easily
reclaimed. This is stuff like inode caches, dcache, buffer_heads etc.
These types of pages could potentially be reclaimed by dumping the
caches and reaping the slabs

KernelNonReclaimable - These are pages that are allocated by the kernel that
are not trivially reclaimed. For example, the memory allocated for a
loaded module would be in this category. By default, allocations are
considered to be of this type

Instead of having one global MAX_ORDER-sized array of free lists, there are
three, one for each type of allocation. Finally, there is a list of pages of
size 2^MAX_ORDER which is a global pool of the largest pages the kernel deals
with. 

Once a 2^MAX_ORDER block of pages is split for a type of allocation, it is
added to the free-lists for that type, in effect reserving it. Hence, over
time, pages of the different types can be clustered together. This means that
if we wanted 2^MAX_ORDER number of pages, we could linearly scan a block of
pages allocated for UserReclaimable and page each of them out.

Fallback is used when there are no 2^MAX_ORDER pages available and there
are no free pages of the desired type. The fallback lists were chosen in a
way that keeps the most easily reclaimable pages together.
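
As a rough illustration of the idea (this is not the patch's code; the helper
name and the RCLM_* constants are invented here, though the __GFP_USERRCLM and
__GFP_KERNRCLM flags do appear in the patch later in this thread), the
allocator effectively selects one of three per-type free-list arrays from the
gfp flags:

/* Sketch only: anything not explicitly marked reclaimable falls into the
 * kernel non-reclaimable category, the default described above. */
#define RCLM_NORCLM  0   /* KernelNonReclaimable */
#define RCLM_KERN    1   /* KernelReclaimable    */
#define RCLM_USER    2   /* UserReclaimable      */
#define RCLM_TYPES   3

static inline int alloc_type(unsigned int gfp_mask)
{
        if (gfp_mask & __GFP_USERRCLM)
                return RCLM_USER;
        if (gfp_mask & __GFP_KERNRCLM)
                return RCLM_KERN;
        return RCLM_NORCLM;
}

/* Each zone would then carry free areas per type, e.g.
 * free_area_lists[RCLM_TYPES][MAX_ORDER], plus the global pool of
 * 2^MAX_ORDER blocks that a type splits when its own lists are empty
 * (the fallback case described above). */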

Three benchmark results are included. The first is the output of portions
of AIM9 for the vanilla allocator and the modified one;

[EMAIL PROTECTED]:~# grep _test aim9-vanilla-120.txt
 7 page_test    120.00    9508    79.2       134696.67  System Allocations & Pages/second
 8 brk_test     120.01    3401    28.33931   481768.19  System Memory Allocations/second
 9 jmp_test     120.00  498718  4155.98333  4155983.33  Non-local gotos/second
10 signal_test  120.01   11768    98.05850    98058.50  Signal Traps/second
11 exec_test    120.04    1585    13.20393       66.02  Program Loads/second
12 fork_test    120.04    1979    16.48617     1648.62  Task Creations/second
13 link_test    120.01   11174    93.10891     5865.86  Link/Unlink Pairs/second
[EMAIL PROTECTED]:~# grep _test aim9-mbuddyV3-120.txt
 7 page_test    120.01    9660    80.49329   136838.60  System Allocations & Pages/second
 8 brk_test     120.01    3409    28.40597   482901.42  System Memory Allocations/second
 9 jmp_test     120.00  501533  4179.44167  4179441.67  Non-local gotos/second
10 signal_test  120.00   11677    97.30833    97308.33  Signal Traps/second
11 exec_test    120.05    1585    13.20283       66.01  Program Loads/second
12 fork_test    120.05    1889    15.73511     1573.51  Task Creations/second
13 link_test    120.01   11089    92.40063     5821.24  Link/Unlink Pairs/second

They show that the allocator performs roughly similarly to the standard
allocator, so there is negligible slowdown from the extra complexity. The
second benchmark tested the CPU cache usage to make sure it was not getting
clobbered. The test was to repeatedly render a large PostScript file 10 times
and take the average. The result is;

==> gsbench-2.6.11-rc1Standard.txt <==
Average: 115.468 real, 115.092 user, 0.337 sys

==> gsbench-2.6.11-rc1MBuddy.txt <==
Average: 115.47 real, 115.136 user, 0.338 sys


So there are no adverse cache effects. The last test is to show that the
allocator can satisfy more high-order allocations, especially under load,
than the standard allocator. The 

Re: [PATCH] Avoiding fragmentation through different allocator V2

2005-01-16 Thread Mel Gorman
On Sun, 16 Jan 2005, Marcelo Tosatti wrote:

> > No unfortunately. Do you know of a test I can use?
>
> Some STP reaim results have significant performance increase in general, a few
> small regressions.
>
> I think that depending on the type of access pattern of the application(s) 
> there
> will be either performance gain or loss, but the result is interesting 
> anyway. :)
>

That is quite exciting and I'm pleased it was able to show gains in some
tests. Based on the aim9 tests, I took a look at the paths I affected to
see what improvements I could make. There were three significant ones;

1. I inlined get_pageblock_type and set_pageblock_type
2. set_pageblock_type was calling page_zone() even though the only caller
knew the zone, so I added the parameter (sketched below)
3. When taking from the global pool, I was rescanning all the order lists,
which it no longer does

I am hoping that these three changes will clear up the worst of the minor
regressions.
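
As a hypothetical illustration of change 2 (the real function bodies are in
the patch; the signature here is guessed for the example), passing the zone
down avoids recomputing it from the page on every call:

/* Sketch only: the caller in the allocator already holds the zone, so it
 * passes it in rather than paying for an extra page_zone() lookup. */
static inline void set_pageblock_type(struct zone *zone, struct page *page,
                                      int type)
{
        /* ... record "type" in the zone's per-pageblock information ... */
}

/* old form:  set_pageblock_type(page, type);        (needs page_zone(page)) */
/* new form:  set_pageblock_type(zone, page, type);  (zone already known)    */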

With the changes, aim9 reported that the modified allocator performs as
well as the standard allocator. This means that the allocator is as fast,
we are reasonably sure there are no adverse cache effects (if anything,
cache usage is improved) and we are far more likely to be able to service
high-order requests.

[EMAIL PROTECTED]:~# grep _test aim9-vanilla-120.txt
 7 page_test    120.00    9508    79.2       134696.67  System Allocations & Pages/second
 8 brk_test     120.01    3401    28.33931   481768.19  System Memory Allocations/second
 9 jmp_test     120.00  498718  4155.98333  4155983.33  Non-local gotos/second
10 signal_test  120.01   11768    98.05850    98058.50  Signal Traps/second
11 exec_test    120.04    1585    13.20393       66.02  Program Loads/second
12 fork_test    120.04    1979    16.48617     1648.62  Task Creations/second
13 link_test    120.01   11174    93.10891     5865.86  Link/Unlink Pairs/second
[EMAIL PROTECTED]:~# grep _test aim9-mbuddyV3-120.txt
 7 page_test    120.01    9660    80.49329   136838.60  System Allocations & Pages/second
 8 brk_test     120.01    3409    28.40597   482901.42  System Memory Allocations/second
 9 jmp_test     120.00  501533  4179.44167  4179441.67  Non-local gotos/second
10 signal_test  120.00   11677    97.30833    97308.33  Signal Traps/second
11 exec_test    120.05    1585    13.20283       66.01  Program Loads/second
12 fork_test    120.05    1889    15.73511     1573.51  Task Creations/second
13 link_test    120.01   11089    92.40063     5821.24  Link/Unlink Pairs/second


Patch with minor optimisations as follows;

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff 
linux-2.6.11-rc1-clean/fs/buffer.c linux-2.6.11-rc1-mbuddy/fs/buffer.c
--- linux-2.6.11-rc1-clean/fs/buffer.c  2005-01-12 04:01:23.0 +
+++ linux-2.6.11-rc1-mbuddy/fs/buffer.c 2005-01-13 10:56:30.0 +
@@ -1134,7 +1134,8 @@ grow_dev_page(struct block_device *bdev,
struct page *page;
struct buffer_head *bh;

-   page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+   page = find_or_create_page(inode->i_mapping, index,
+   GFP_NOFS | __GFP_USERRCLM);
if (!page)
return NULL;

@@ -2997,7 +2998,8 @@ static void recalc_bh_state(void)

 struct buffer_head *alloc_buffer_head(int gfp_flags)
 {
-   struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+   struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+  gfp_flags|__GFP_KERNRCLM);
if (ret) {
preempt_disable();
__get_cpu_var(bh_accounting).nr++;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff 
linux-2.6.11-rc1-clean/fs/dcache.c linux-2.6.11-rc1-mbuddy/fs/dcache.c
--- linux-2.6.11-rc1-clean/fs/dcache.c  2005-01-12 04:00:09.0 +
+++ linux-2.6.11-rc1-mbuddy/fs/dcache.c 2005-01-13 10:56:30.0 +
@@ -715,7 +715,8 @@ struct dentry *d_alloc(struct dentry * p
struct dentry *dentry;
char *dname;

-   dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+   dentry = kmem_cache_alloc(dentry_cache,
+ GFP_KERNEL|__GFP_KERNRCLM);
if (!dentry)
return NULL;

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff 
linux-2.6.11-rc1-clean/fs/ext2/super.c linux-2.6.11-rc1-mbuddy/fs/ext2/super.c
--- linux-2.6.11-rc1-clean/fs/ext2/super.c  2005-01-12 04:01:24.0 +
+++ linux-2.6.11-rc1-mbuddy/fs/ext2/super.c 2005-01-13 10:56:30.0 +
@@ -137,7 +137,7 @@ static kmem_cache_t * ext2_inode_cachep;
 static struct inode *ext2_alloc_inode(struct super_block *sb)
 {
struct ext2_inode_info *ei;
-   ei = (struct ext2_inode_info 

Re: [PATCH] Avoiding fragmentation through different allocator V2

2005-01-16 Thread Marcelo Tosatti
On Sat, Jan 15, 2005 at 07:18:42PM +, Mel Gorman wrote:
> On Fri, 14 Jan 2005, Marcelo Tosatti wrote:
> 
> > On Thu, Jan 13, 2005 at 03:56:46PM +, Mel Gorman wrote:
> > > The patch is against 2.6.11-rc1 and I'm willing to stand by its
> > > stability. I'm also confident it does its job pretty well so I'd like it
> > > to be considered for inclusion.
> >
> > This is very interesting!
> >
> 
> Thanks
> 
> > Other than the advantage of decreased fragmentation which you aim for, by
> > providing clustering of different types of allocations you might have a
> > performance gain (or loss :))  due to changes in cache colouring
> > effects.
> >
> 
> That is possible but I haven't thought of a way of measuring the cache
> colouring effects (if any). There is also the problem that the additional
> complexity of the allocator will offset this benefit. The two main loss
> points of the allocator are increased complexity and the increased size of
> the zone struct.
> 
> > It depends on the workload/application mix and type of cache of course,
> > but I think there will be a significant measurable difference on most
> > common workloads.
> >
> 
> If I could only measure it :/
> 
> > Have you done any investigation with that respect? IMHO such
> > verification is really important before attempting to merge it.
> >
> 
> No unfortunately. Do you know of a test I can use?

The STP reaim results show a significant performance increase in general, with
a few small regressions.

I think that, depending on the access patterns of the application(s), there
will be either a performance gain or a loss, but the result is interesting
anyway. :)

I'll run more tests later on.

AIM OVERVIEW
The AIM Multiuser Benchmark - Suite VII tests and measures the performance of
Open System multiuser computers. Multiuser computer environments typically have
the following general characteristics in common:

- A large number of tasks are run concurrently.
- Disk storage increases dramatically as the number of users increases.
- Complex, numerically intensive applications are performed infrequently.
- A significant amount of time is spent sorting and searching through large
  amounts of data.
- After data is used it is placed back on disk because it is a shared resource.
- A large amount of time is spent in common runtime libraries.




NORMAL LOAD 4-way-SMP:


kernel: patch-2.6.11-rc1
plmid: 4066
Host: stp4-000
Reaim test
http://khack.osdl.org/stp/300031
kernel: 4066
Filesystem: ext3
Peak load Test: Maximum Jobs per Minute 4881.87 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 4961.19 (average of 3 runs)

kernel: mel-v3-fixed
plmid: 4077
Host: stp4-001
Reaim test
http://khack.osdl.org/stp/300056
kernel: 4077
Filesystem: ext3
Peak load Test: Maximum Jobs per Minute 5065.93 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 5294.48 (average of 3 runs)


NORMAL LOAD 1-WAY:

kernel: patch-2.6.11-rc1
plmid: 4066
Host: stp1-003
Reaim test
http://khack.osdl.org/stp/300029
kernel: 4066
Filesystem: ext3
Peak load Test: Maximum Jobs per Minute 993.13 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 983.11 (average of 3 runs)


kernel: mel-v3-fixed
plmid: 4077
Host: stp1-002
Reaim test
http://khack.osdl.org/stp/300055
kernel: 4077
Filesystem: ext3
Peak load Test: Maximum Jobs per Minute 982.69 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 1008.06 (average of 3 runs)


COMPUTE LOAD 2way (this is more CPU intensive than NORMAL reaim load):

kernel: patch-2.6.11-rc1
plmid: 4066
Host: stp2-001
Reaim test
http://khack.osdl.org/stp/300060
kernel: 4066
Filesystem: ext3
Peak load Test: Maximum Jobs per Minute 1482.45 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 1487.20 (average of 3 runs)

kernel: mel-v3-fixed
plmid: 4077
Host: stp2-000
Reaim test
http://khack.osdl.org/stp/300058
kernel: 4077
Filesystem: ext3
Peak load Test: Maximum Jobs per Minute 1501.47 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 1462.11 (average of 3 runs)








-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoiding fragmentation through different allocator V2

2005-01-16 Thread Mel Gorman

> > That is possible but I haven't thought of a way of measuring the cache
> > colouring effects (if any). There is also the problem that the additional
> > complexity of the allocator will offset this benefit. The two main loss
> > points of the allocator are increased complexity and the increased size of
> > the zone struct.
>
> We should be able to measure that too...
>
> If you look at the performance numbers of applications which do data
> crunching and read/write data to disk (scientific applications), or
> even databases, plus a standard set of IO benchmarks...
>

I used two benchmarks to test this. The first was a test that ran gs
against a large postscript file 10 times and measured the average. The
hypothesis was that if I was trashing the CPU cache with the allocator,
there would be a marked difference between the results. The results are:

==> gsbench-2.6.11-rc1MBuddy.txt <==
Average: 115.47 real, 115.136 user, 0.338 sys

==> gsbench-2.6.11-rc1Standard.txt <==
Average: 115.468 real, 115.092 user, 0.337 sys

So, there is no significant difference there. I think we are safe for the CPU cache
as neither allocator is particularly cache aware.
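
For context, the gs test amounts to timing repeated runs of the same command
and averaging. A minimal stand-alone harness in that spirit might look like the
following; the gs options and the input file name are placeholders rather than
Mel's actual script, and only wall-clock time is reported instead of the
real/user/sys split shown above.

/* Times RUNS executions of a command and reports the mean wall-clock time.
 * The gs invocation is a placeholder; substitute the real command line and
 * PostScript file used for the benchmark.
 */
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RUNS 10

int main(void)
{
        char * const argv_gs[] = { "gs", "-dBATCH", "-dNOPAUSE",
                                   "-sDEVICE=nullpage", "large-input.ps", NULL };
        double total = 0.0;
        int i;

        for (i = 0; i < RUNS; i++) {
                struct timeval start, end;
                pid_t pid;

                gettimeofday(&start, NULL);

                pid = fork();
                if (pid == 0) {
                        execvp(argv_gs[0], argv_gs);
                        _exit(127);             /* exec failed */
                }
                waitpid(pid, NULL, 0);

                gettimeofday(&end, NULL);
                total += (end.tv_sec - start.tv_sec) +
                         (end.tv_usec - start.tv_usec) / 1e6;
        }

        printf("Average: %.3f real seconds over %d runs\n", total / RUNS, RUNS);
        return 0;
}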

The second test was a portion of the tests from aim9. The results are

MBuddy
 7 page_test    120.01    9452    78.76010   133892.18  System Allocations & Pages/second
 8 brk_test     120.03    3386    28.20961   479563.44  System Memory Allocations/second
 9 jmp_test     120.00  501496  4179.13333  4179133.33  Non-local gotos/second
10 signal_test  120.01   11632    96.92526    96925.26  Signal Traps/second
11 exec_test    120.07    1587    13.21729       66.09  Program Loads/second
12 fork_test    120.03    1890    15.74606     1574.61  Task Creations/second
13 link_test    120.00   11152    92.93333     5854.80  Link/Unlink Pairs/second
56 fifo_test    120.00  173450  1445.41667   144541.67  FIFO Messages/second

Vanilla
 7 page_test    120.01    9536    79.46004   135082.08  System Allocations & Pages/second
 8 brk_test     120.01    3394    28.28098   480776.60  System Memory Allocations/second
 9 jmp_test     120.00  498770  4156.41667  4156416.67  Non-local gotos/second
10 signal_test  120.00   11773    98.10833    98108.33  Signal Traps/second
11 exec_test    120.01    1591    13.25723       66.29  Program Loads/second
12 fork_test    120.00    1941    16.17500     1617.50  Task Creations/second
13 link_test    120.00   11188    93.23333     5873.70  Link/Unlink Pairs/second
56 fifo_test    120.00  179156  1492.96667   149296.67  FIFO Messages/second

Here, there are worrying differences all right. The modified allocator, for
example, is getting roughly 1000 fewer faults a second than the standard
allocator, but that is still less than 1%. This is something I need to work on,
although I think it's optimisation work rather than a fundamental problem
with the approach.

I'm looking into using bonnie++ as another IO benchmark.

> We should be able to use the CPU performance counters to get exact
> miss/hit numbers, but it seems it's not yet possible to use Mikael
> Pettersson's pmc inside the kernel. I asked him some time ago but never got
> around to trying anything:
>
> 

This is stuff I was not aware of before and will need to follow up on.

> I think some CPU/memory intensive benchmarks should give us a hint of the 
> total
> impact ?
>

The ghostscript test was the one I chose. The script is below

> > However, I also know the linear scanner trashed the LRU lists and probably
> > comes with all sorts of performance regressions just to make the
> > high-order allocations.
>
> Migrating pages instead of freeing them can greatly reduce the overhead I 
> believe
> and might be a low impact way of defragmenting memory.
>

Very likely. As it is, the scanner I used is really stupid, but I wanted to
show that, using a mechanism like it, we should be able to almost guarantee
the allocation of a high-order block, something we cannot currently do.

> I've added your patch to STP but:
>
> [STP 300030]Kernel Patch Error  Kernel: mel-three-type-allocator-v2 PLM # 4073
>

I posted a new version under the subject "[PATCH] 1/2 Reducing
fragmentation through better allocation". It should apply cleanly to a
vanilla kernel. Sorry about the mess of the other patch.

> It failed to apply to 2.6.10-rc1 - I'll work the rejects and rerun the tests.
>

The patch is against 2.6.11-rc1, but I'm guessing 2.6.10-rc1 was a typo.

-- 
Mel Gorman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoiding fragmentation through different allocator V2

2005-01-15 Thread Marcelo Tosatti
On Sat, Jan 15, 2005 at 07:18:42PM +, Mel Gorman wrote:
> On Fri, 14 Jan 2005, Marcelo Tosatti wrote:
> 
> > On Thu, Jan 13, 2005 at 03:56:46PM +, Mel Gorman wrote:
> > > The patch is against 2.6.11-rc1 and I'm willing to stand by its
> > > stability. I'm also confident it does its job pretty well so I'd like it
> > > to be considered for inclusion.
> >
> > This is very interesting!
> >
> 
> Thanks
> 
> > Other than the advantage of decreased fragmentation which you aim, by
> > providing clustering of different types of allocations you might have a
> > performance gain (or loss :))  due to changes in cache colouring
> > effects.
> >
> 
> That is possible but I haven't thought of a way of measuring the cache
> colouring effects (if any). There is also the problem that the additional
> complexity of the allocator will offset this benefit. The two main loss
> points of the allocator are increased complexity and the increased size of
> the zone struct.

We should be able to measure that too...

If you look at the performance numbers of applications which do data crunching
and read/write data to disk (scientific applications), or even databases, plus
a standard set of IO benchmarks...

Of course you're not able to measure the change in cache hits/misses (which
would be nice), but you can get an idea of how measurable the final performance
impact is, including the page allocator overhead and the increased zone struct
size (I don't think the struct zone size increase makes much difference).

We should be able to use the CPU performance counters to get exact miss/hit
numbers, but it seems it's not yet possible to use Mikael Pettersson's pmc
inside the kernel. I asked him some time ago but never got around to trying
anything:

Subject: Re: Measuring kernel-level code cache hits/misses with perfctr 

 > Hi Mikael,
 >
 > I've been wondering if it's possible to use PMC's
 > to monitor L1 and/or L2 cache hits from kernel code?

You can count them by using the global-mode counters interface
(present in the perfctr-2.6 package but not in the 2.6-mm kernel
unfortunately) and restricting the counters to CPL 0.

However, for profiling purposes you probably want to catch overflow
interrupts, and that's not supported for global-mode counters.
I simply haven't had time to implement that feature.
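
To make that concrete: on P6-family x86 the event-select MSR has separate USR
and OS enable bits, so a raw counter can be restricted to CPL 0 and read with
rdpmc even without the perfctr driver. A bare-bones kernel-side sketch follows;
the event number is left as a placeholder because it is CPU specific and has to
come from the processor manuals, and SMP, NMI and error handling are ignored.

/* Sketch only: program performance counter 0 on a P6-family CPU to count a
 * cache event in kernel mode (CPL 0) only, then read it with rdpmc.
 * EVENT_CODE is a placeholder -- the real event number is CPU specific.
 */
#define MSR_P6_PERFCTR0    0xc1
#define MSR_P6_EVNTSEL0    0x186

#define EVTSEL_OS          (1u << 17)   /* count at CPL 0 only   */
#define EVTSEL_ENABLE      (1u << 22)   /* enable the counter    */
#define EVENT_CODE         0x00         /* placeholder event no. */

static inline void wrmsr_local(unsigned int msr, unsigned int lo, unsigned int hi)
{
        asm volatile("wrmsr" : : "c" (msr), "a" (lo), "d" (hi));
}

static inline unsigned long long rdpmc_local(unsigned int counter)
{
        unsigned int lo, hi;

        asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
        return ((unsigned long long)hi << 32) | lo;
}

static void setup_kernel_only_counter(void)
{
        /* zero the counter, then enable it for ring-0 events only */
        wrmsr_local(MSR_P6_PERFCTR0, 0, 0);
        wrmsr_local(MSR_P6_EVNTSEL0, EVENT_CODE | EVTSEL_OS | EVTSEL_ENABLE, 0);
}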


> > It depends on the workload/application mix and type of cache of course,
> > but I think there will be a significant measurable difference on most
> > common workloads.
> >
> 
> If I could only measure it :/
> 
> > Have you done any investigation with that respect? IMHO such
> > verification is really important before attempting to merge it.
> >
> 
> No unfortunately. Do you know of a test I can use?

I think some CPU/memory intensive benchmarks should give us a hint of the total
impact ?

> > BTW talking about cache colouring, I think this is an area which has a HUGE
> > space for improvement. The allocator is completely unaware of colouring
> > (except the SLAB) - we should try to come up with a light per-process
> > allocation colouring optimizer. But that's another story.
> >
> 
> This also was tried and dropped. The allocator was a lot more complex and
> the implementor was unable to measure it. IIRC, the patch was not accepted
> with a comment along the lines of "If you can't measure it, it doesn't
> exist". Before I walk down the page coloring path again, I'll need some
> scheme that measures the cache-effect.

Someone needs to write the helper functions to use the PMC's and test that.

> Totally aside, I'm doing this work because I've started a PhD on
> developing solid metrics for measuring VM performance and then devising
> new or modified algorithms using the metrics to see if the changes are any
> good.

Nice! Make your work public! I'm personally very interested in this area.

> > > For me, the next stage is to write a linear scanner that goes through the
> > > address space to free up a high-order block of pages on demand. This will
> > > be a tricky job so it'll take me quite a while.
> >
> > We're paving the road to implement a generic "weak" migration function on
> > top of the current page migration infrastructure. With "weak" I mean that
> > it bails out easily if the page cannot be migrated, unlike the "strong"
> > version which _has_ to migrate the page(s) (for memory hotplug purposes).
> >
> > With such a function in place it's easier to have different implementations
> > of defragmentation logic - we might want to collaborate on that.
> >
> 
> I've 

Re: [PATCH] Avoiding fragmentation through different allocator V2

2005-01-15 Thread Mel Gorman
On Fri, 14 Jan 2005, Marcelo Tosatti wrote:

> On Thu, Jan 13, 2005 at 03:56:46PM +, Mel Gorman wrote:
> > The patch is against 2.6.11-rc1 and I'm willing to stand by its
> > stability. I'm also confident it does its job pretty well so I'd like it
> > to be considered for inclusion.
>
> This is very interesting!
>

Thanks

> Other than the advantage of decreased fragmentation which you aim, by
> providing clustering of different types of allocations you might have a
> performance gain (or loss :))  due to changes in cache colouring
> effects.
>

That is possible but I haven't thought of a way of measuring the cache
colouring effects (if any). There is also the problem that the additional
complexity of the allocator will offset this benefit. The two main loss
points of the allocator are increased complexity and the increased size of
the zone struct.

> It depends on the workload/application mix and type of cache of course,
> but I think there will be a significant measurable difference on most
> common workloads.
>

If I could only measure it :/

> Have you done any investigation with that respect? IMHO such
> verification is really important before attempting to merge it.
>

No unfortunately. Do you know of a test I can use?

> BTW talking about cache colouring, I think this is an area which has a HUGE
> space for improvement. The allocator is completely unaware of colouring
> (except the SLAB) - we should try to come up with a light per-process
> allocation colouring optimizer. But that's another story.
>

This also was tried and dropped. The allocator was a lot more complex and
the implementor was unable to measure it. IIRC, the patch was not accepted
with a comment along the lines of "If you can't measure it, it doesn't
exist". Before I walk down the page coloring path again, I'll need some
scheme that measures the cache-effect.

Totally aside, I'm doing this work because I've started a PhD on
developing solid metrics for measuring VM performance and then devising
new or modified algorithms using the metrics to see if the changes are any
good.

> > For me, the next stage is to write a linear scanner that goes through the
> > address space to free up a high-order block of pages on demand. This will
> > be a tricky job so it'll take me quite a while.
>
> We're paving the road to implement a generic "weak" migration function on top
> of the current page migration infrastructure. With "weak" I mean that it bails
> out easily if the page cannot be migrated, unlike the "strong" version which
> _has_ to migrate the page(s) (for memory hotplug purposes).
>
> With such a function in place it's easier to have different implementations of
> defragmentation logic - we might want to collaborate on that.
>

I've also started something like this although I think you'll find my
first approach childishly simple. I implemented a linear scanner that
finds the KernRclm and UserRclm areas. It then makes a list of the PageLRU
pages and sends them to shrink_list(). I ran a test which put the machine
under heavy stress and then tried to allocate 75% of ZONE_NORMAL with
2^_MAX_ORDER pages (allocations done via a kernel module). I found that
the standard allocator was only able to successfully allocate 1% of the
allocations (3 blocks), my modified allocator managed 50% (81 blocks) and
with linear scanning in place, it was 76% (122 blocks). I figure I could
get the linear scanning figures even higher if I taught the allocator to
reserve the pages it frees for the process performing the linear scanning.
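
In case the shape of that scanner is not obvious, it boils down to something
like the sketch below. isolate_lru_page_stub() and reclaim_page_list() are
stand-ins (shrink_list() itself is static to mm/vmscan.c), and pfn_valid()
checks, locking and failure handling are all omitted.

/* Sketch of a naive linear scanner: walk a physical page range, collect the
 * LRU pages found there and hand them to a reclaim helper so the underlying
 * high-order block can be freed.  The two helpers are stand-ins for real
 * isolation and reclaim entry points.
 */
static int free_block_linear(unsigned long start_pfn, unsigned long nr_pages)
{
        LIST_HEAD(reclaim_list);
        unsigned long pfn;
        int collected = 0;

        for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
                struct page *page = pfn_to_page(pfn);

                /* Only pages on the LRU have a chance of being reclaimed */
                if (!PageLRU(page))
                        continue;

                if (isolate_lru_page_stub(page)) {      /* stand-in helper */
                        list_add(&page->lru, &reclaim_list);
                        collected++;
                }
        }

        /* Push the collected pages through reclaim (stand-in for shrink_list) */
        return reclaim_page_list(&reclaim_list, collected);
}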

However, I also know the linear scanner trashed the LRU lists and probably
comes with all sorts of performance regressions just to make the
high-order allocations.

The new patches for the allocator (last patch I posted has a serious bug
in it), the linear scanner and the results will be posted as another mail.

> Your bitmap also allows a hint for the "defragmentator" to know the type
> of pages, and possibly the size of the block, so it can avoid trying
> to migrate non-reclaimable memory early on. It possibly makes the scanning
> procedure much more lightweight.
>

Potentially. I need to catch up more on the existing schemes. I've been
out of the VM loop for a long time now so I'm still playing the Catch-Up
game.

> > 
>
> You want to do
>   free_pages -= (z->free_area_lists[0][o].nr_free + 
> z->free_area_lists[2][o].nr_free +
>   z->free_area_lists[2][o].nr_free) << o;
>
> So not to interfere with the "min" decay (and remove the allocation type 
> loop).
>

Agreed. New patch has this in place
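
Spelled out, the suggestion is to sum the free counts of all three allocation
types at each order before the usual decay of min. A sketch follows; the
repeated [2] index in the quoted snippet is presumably meant to be [1], and the
free_area_lists layout is assumed from the discussion rather than copied from
the patch.

/* Sketch of the watermark check with per-type free areas.  The layout of
 * free_area_lists[type][order] is assumed; the point is that the free pages
 * of all allocation types are summed at each order before "min" is decayed,
 * instead of looping once per type.
 */
static int zone_watermark_ok_sketch(struct zone *z, int order,
                                    unsigned long min, unsigned long free_pages)
{
        int o;

        for (o = 0; o < order; o++) {
                /* Discount blocks below the requested order, all types at once */
                free_pages -= (z->free_area_lists[0][o].nr_free +
                               z->free_area_lists[1][o].nr_free +
                               z->free_area_lists[2][o].nr_free) << o;

                /* Require fewer higher order pages to be free */
                min >>= 1;

                if (free_pages <= min)
                        return 0;
        }

        return 1;
}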

> >
> > -   /* Require fewer higher order pages to be free */
> > -   min >>= 1;
> > +   /* Require fewer higher order pages to be free */
> > +   min >>= 1;
> >
> > -   if (free_pages <= min)
> > -   return 0;
> > +   if (free_pages <= min)
> > +   return 0;
> > +   }
>
> I'll play 


Re: [PATCH] Avoiding fragmentation through different allocator V2

2005-01-15 Thread Marcelo Tosatti
On Sat, Jan 15, 2005 at 07:18:42PM +, Mel Gorman wrote:
 On Fri, 14 Jan 2005, Marcelo Tosatti wrote:
 
  On Thu, Jan 13, 2005 at 03:56:46PM +, Mel Gorman wrote:
   The patch is against 2.6.11-rc1 and I'm willing to stand by it's
   stability. I'm also confident it does it's job pretty well so I'd like it
   to be considered for inclusion.
 
  This is very interesting!
 
 
 Thanks
 
  Other than the advantage of decreased fragmentation which you aim, by
  providing clustering of different types of allocations you might have a
  performance gain (or loss :))  due to changes in cache colouring
  effects.
 
 
 That is possible but it I haven't thought of a way of measuring the cache
 colouring effects (if any). There is also the problem that the additional
 complexity of the allocator will offset this benefit. The two main loss
 points of the allocator are increased complexity and the increased size of
 the zone struct.

We should be able to measure that too...

If you look at the performance numbers of applications which do data crunching,
reading/writing data to disk (scientific applications). Or even databases,
plus standard set of IO benchmarks...

Of course you're not able to measure the change in cache hits/misses (which 
would be nice),
but you can get an idea how measurable is the final performance impact, 
including
the page allocator overhead and the increase zone struct size (I dont think the 
struct zone 
size increase makes much difference).

We should be able to use the CPU performance counters to get exact miss/hit 
numbers, 
but it seems its not yet possible to use Mikael's Pettersson pmc inside the 
kernel, I asked him
sometime ago but never got along to trying anything:

Subject: Re: Measuring kernel-level code cache hits/misses with perfctr 

  Hi Mikael,   
   
   
   
   
   
  I've been wondering if its possible to use PMC's 
   
   
  to monitor L1 and/or L2 cache hits from kernel code? 

You can count them by using the global-mode counters interface
(present in the perfctr-2.6 package but not in the 2.6-mm kernel
unfortunately) and restricting the counters to CPL 0.

However, for profiling purposes you probably want to catch overflow
interrupts, and that's not supported for global-mode counters.
I simply haven't had time to implement that feature.


  It depends on the workload/application mix and type of cache of course,
  but I think there will be a significant measurable difference on most
  common workloads.
 
 
 If I could only measure it :/
 
  Have you done any investigation with that respect? IMHO such
  verification is really important before attempting to merge it.
 
 
 No unfortunately. Do you know of a test I can use?

I think some CPU/memory intensive benchmarks should give us a hint of the total
impact ?

  BTW talking about cache colouring, I this is an area which has a HUGE
  space for improvement. The allocator is completly unaware of colouring
  (except the SLAB) - we should try to come up with a light per-process
  allocation colouring optimizer. But thats another history.
 
 
 This also was tried and dropped. The allocator was a lot more complex and
 the implementor was unable to measure it. IIRC, the patch was not accepted
 with a comment along the lines of If you can't measure it, it doesn't
 exist. Before I walk down the page coloring path again, I'll need some
 scheme that measures the cache-effect.

Someone needs to write the helper functions to use the PMC's and test that.

 Totally aside, I'm doing this work because I've started a PhD on
 developing solid metrics for measuring VM performance and then devising
 new or modified algorithms using the metrics to see if the changes are any
 good.

Nice! Make your work public! I'm personally very interested in this area.

   For me, the next stage is to write a linear scanner that goes through the
   address space to free up a high-order block of pages on demand. This will
   be a tricky job so it'll take me quite a while.
 
  We're paving the road to implement a generic weak migration function on 
  top
  of the current page migration infrastructure. With weak I mean that it 
  bails
  out easily if the page cannot be migrated, unlike the strong version which
  _has_ to migrate the page(s) (for memory hotplug purpose).
 
  With such function in place its easier to have different implementations of 
  defragmentation
  logic - we might want to coolaborate on that.
 
 
 I've also started something like this although I think you'll find my
 first approach childishly simple. I implemented a linear scanner