Re: IO queueing and complete affinity w/ threads: Some results

2008-02-14 Thread Alan D. Brunelle
Taking a step back, I went to a very simple test environment:

o  4-way IA64
o  2 disks (on separate RAID controllers, handled by separate ports on the same
FC HBA - so they generate different IRQs)
o  Using write-cached tests - keeping all IOs inside the RAID controller's
cache, so there are no perturbations due to platter accesses

Basically:

o  CPU 0 handled IRQs for /dev/sds
o  CPU 2 handled IRQs for /dev/sdaa

We placed an IO generator on CPU 1 (for /dev/sds) and CPU 3 (for /dev/sdaa). The
IO generator performed 4KiB sequential direct AIOs in a very small range (2MB -
well within the controller cache on the external storage device). We have found
that this is a simple way to maximize throughput, letting us watch the system
for effects without worrying about odd seek & other platter-induced issues.
Each test took about 6 minutes to run (each ran a specific amount of IO, so we
could compare & contrast system measurements).
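
For reference, the pinning could be reproduced with something along these
lines - the IRQ numbers are hypothetical, and fio is only a stand-in for the
IO generator actually used (the real runs did a fixed amount of IO rather than
a fixed time):

  # Pin each port's IRQ, then run a pinned 4KiB sequential direct-IO job
  # confined to the first 2MB of each device (all values illustrative).
  echo 1 > /proc/irq/58/smp_affinity    # IRQs for /dev/sds  -> CPU 0 (mask 0x1)
  echo 4 > /proc/irq/60/smp_affinity    # IRQs for /dev/sdaa -> CPU 2 (mask 0x4)

  taskset -c 1 fio --name=sds --filename=/dev/sds --rw=write --bs=4k \
      --direct=1 --ioengine=libaio --iodepth=32 --size=2m \
      --time_based --runtime=360 &
  taskset -c 3 fio --name=sdaa --filename=/dev/sdaa --rw=write --bs=4k \
      --direct=1 --ioengine=libaio --iodepth=32 --size=2m \
      --time_based --runtime=360 &
  wait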

First: overall performance

2.6.24 (no patches)              : 106.90 MB/sec

2.6.24 + original patches + rq=0 : 103.09 MB/sec
                            rq=1 :  98.81 MB/sec

2.6.24 + kthreads patches + rq=0 : 106.85 MB/sec
                            rq=1 : 107.16 MB/sec

So, the kthreads patches work much better here - on par with or better than
straight 2.6.24. I also ran Caliper (akin to Oprofile, but proprietary and
ia64-specific, sorry) and looked at the cycles used. On ia64, back-end bubbles
are deadly, and can be caused by cache misses, etc. Looking at the gross data:

Kernel                             CPU_CYCLES         BACK END BUBBLES  100.0 * (BEB/CC)
---------------------------------  -----------------  ----------------  ----------------
2.6.24 (no patches)              : 2,357,215,454,852  231,547,237,267    9.8%

2.6.24 + original patches + rq=0 : 2,444,895,579,790  242,719,920,828    9.9%
                            rq=1 : 2,551,175,203,455  148,586,145,513    5.8%

2.6.24 + kthreads patches + rq=0 : 2,359,376,156,043  255,563,975,526   10.8%
                            rq=1 : 2,350,539,631,362  208,888,961,094    8.9%

For both the original & kthreads patches we see a /significant/ drop in bubbles
when setting rq=1 over rq=0. This shows up as extra CPU cycles available (not
spent in %system) - a graph is provided at
http://free.linux.hp.com/~adb/jens/cached_mps.png, showing stats extracted from
mpstat runs made in conjunction with the IO runs.
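
Something like the following is enough to collect and summarize the mpstat
side of this (the awk column positions depend on the sysstat version - check
the log header first; run_cached_io_test is a placeholder for the actual run):

  mpstat -P ALL 10 > mpstat.log &
  MP=$!
  run_cached_io_test
  kill $MP
  # Average %sys and %soft over the "all"-CPU samples
  awk '$3 == "all" { sys += $6; soft += $8; n++ }
       END { if (n) printf "avg %%sys = %.3f  avg %%soft = %.3f\n", sys/n, soft/n }' mpstat.log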

Combining %sys & %soft IRQ, we see:

Kernel                              % user     % sys   % iowait    % idle
---------------------------------  -------   -------   --------   -------
2.6.24 (no patches)              :   0.141%   10.088%    43.949%   45.819%

2.6.24 + original patches + rq=0 :   0.123%   11.361%    43.507%   45.008%
                            rq=1 :   0.156%    6.030%    44.021%   49.794%

2.6.24 + kthreads patches + rq=0 :   0.163%   10.402%    43.744%   45.686%
                            rq=1 :   0.156%    8.160%    41.880%   49.804%

The good news (I think) is that even with rq=0 the kthreads patches are getting
on-par performance w/ 2.6.24, so the default case should be ok...

I've only done a few runs by hand with this - these results are from one 
representative run out of the bunch - but at least this (I believe) shows what 
this patch stream is intending to do: optimize placement of IO completion 
handling to minimize cache & TLB disruptions. Freeing up cycles in the kernel 
is always helpful! :-)

I'm going to try similar runs on an AMD64 w/ Oprofile and see what results I
get there... (BTW: I'll be dropping testing of the original patch sequence -
the kthreads patches look better in general, both in terms of code & results;
coincidence?)

Alan

Re: IO queueing and complete affinity w/ threads: Some results

2008-02-13 Thread Alan D. Brunelle
Comparative results between the original affinity patch and the kthreads-based 
patch on the 32-way running the kernel make sequence. 

It may be easier to compare/contrast with the graphs provided at 
http://free.linux.hp.com/~adb/jens/kernmk.png (kernmk.agr also provided, if you 
want to run xmgrace by hand). 

Tests are (a sketch of the per-device sequence follows the list):

1. Make Ext2 FS on each of 12 64GB devices in parallel, times include: mkfs, 
mount & unmount
2. Untar a full Linux source code tree onto the devices in parallel, times 
include: mount, untar, unmount
3. Make (-j4) of the full source code tree, times include: mount, make -j4, 
unmount
4. Clean full source code tree, times include: mount, make clean, unmount
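
A sketch of the per-device sequence (device names, mount points and the
location of the kernel tarball are hypothetical; the real harness also timed
the mount/umount within each step, which is compressed here):

  for dev in sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm; do
    (
      time mkfs.ext2 -q /dev/$dev
      mount /dev/$dev /mnt/$dev
      time tar xf /stage/linux-2.6.24.tar -C /mnt/$dev
      cd /mnt/$dev/linux-2.6.24
      time make -j4 > /dev/null
      time make clean > /dev/null
      cd /                      # step out of the tree so the umount succeeds
      umount /mnt/$dev
    ) > /tmp/kernmk.$dev.log 2>&1 &
  done
  wait                          # all 12 sequences run in parallel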

The results are so close amongst all the runs (given the large-ish standard
deviations) that we probably can't deduce much from this. A bit of a concern
in the top two graphs - mkfs & untar - is that the kthreads version certainly
appears to be a little slower (about a 2.9% difference across the values for
the mkfs runs, and 3.5% for the untar operations). On the make runs, however,
we saw hardly any difference between the runs at all...

We are trying to set up some AIM7 tests on a different system over the weekend
(15 February - 18 February 2008); I'll post those results on the 18th or 19th
if we can pull it off. [I'll also try to steal time on the 32-way to run a
straight 2.6.24 kernel, do these runs again, and post those results.]

For the tables below:

 q0 == queue_affinity set to -1
 q1 == queue_affinity set to the CPU managing the IRQ for each device
 c0 == completion_affinity set to -1
 c1 == completion_affinity set to CPU managing the IRQ for each device
rq0 == rq_affinity set to 0
rq1 == rq_affinity set to 1

This 4-test sequence was run 10 times (for each kernel), and the results were
averaged. As posted yesterday, here are the original patch sequence results:

mkfs          Min     Avg     Max   Std Dev
---------  ------  ------  ------  --------
q0.c0.rq0  17.814  30.322  33.263   4.551 
q0.c0.rq1  17.540  30.058  32.885   4.321 
q0.c1.rq0  17.770  31.328  32.958   3.121 
q1.c0.rq0  17.907  31.032  32.767   3.515 
q1.c1.rq0  16.891  30.319  33.097   4.624 

untar         Min     Avg     Max   Std Dev
---------  ------  ------  ------  --------
q0.c0.rq0  19.747  21.971  26.292   1.215 
q0.c0.rq1  19.680  22.365  36.395   2.010 
q0.c1.rq0  18.823  21.390  24.455   0.976 
q1.c0.rq0  18.433  21.500  23.371   1.009 
q1.c1.rq0  19.414  21.761  34.115   1.378 

make          Min     Avg     Max   Std Dev
---------  ------  ------  ------  --------
q0.c0.rq0 527.418 543.296 552.030   5.384 
q0.c0.rq1 526.265 542.312 549.477   5.467 
q0.c1.rq0 528.935 544.940 553.823   4.746 
q1.c0.rq0 529.432 544.399 553.212   5.166 
q1.c1.rq0 527.638 543.577 551.323   5.478 

clean         Min     Avg     Max   Std Dev
---------  ------  ------  ------  --------
q0.c0.rq0  16.962  20.308  33.775   3.179 
q0.c0.rq1  17.436  20.156  29.370   3.097 
q0.c1.rq0  17.061  20.111  31.504   2.791 
q1.c0.rq0  16.745  20.247  29.327   2.953 
q1.c1.rq0  17.346  20.316  31.178   3.283 

And for the kthreads-based kernel:

mkfs          Min     Avg     Max   Std Dev
---------  ------  ------  ------  --------
q0.c0.rq0  16.686  31.069  33.361   3.452 
q0.c0.rq1  16.976  31.719  32.869   2.395 
q0.c1.rq0  16.857  31.345  33.410   3.209 
q1.c0.rq0  17.317  31.997  34.444   3.099 
q1.c1.rq0  16.791  32.266  33.378   2.035 

untar         Min     Avg     Max   Std Dev
---------  ------  ------  ------  --------
q0.c0.rq0  19.769  22.398  25.196   1.076 
q0.c0.rq1  19.742  22.517  38.498   1.733 
q0.c1.rq0  20.071  22.698  36.160   2.259 
q1.c0.rq0  19.910  22.377  35.640   1.528 
q1.c1.rq0  19.448  22.339  24.887   0.926 

make          Min     Avg     Max   Std Dev
---------  ------  ------  ------  --------
q0.c0.rq0 526.971 542.820 550.591   4.607 
q0.c0.rq1 527.320 544.422 550.504   3.798 
q0.c1.rq0 527.367 543.856 550.331   4.152 
q1.c0.rq0 527.406 543.636 552.947   4.315 
q1.c1.rq0 528.921 544.594 550.832   3.786 

clean         Min     Avg     Max   Std Dev
---------  ------  ------  ------  --------
q0.c0.rq0  16.644  20.242  29.524   2.991 
q0.c0.rq1  16.942  20.008  29.729   2.845 
q0.c1.rq0  17.205  20.117  29.851   2.661 
q1.c0.rq0  17.400  20.147  32.581   2.862 
q1.c1.rq0  16.799  20.072  31.883   2.872 


Re: IO queueing and complete affinity w/ threads: Some results

2008-02-12 Thread Alan D. Brunelle
Alan D. Brunelle wrote:

> 
> Hopefully, the first column is self-explanatory - these are the settings 
> applied to the queue_affinity, completion_affinity and rq_affinity tunables. 
> Due to the fact that the standard deviations are so large coupled with the 
> very close average results, I'm not seeing anything in this set of tests to 
> favor any of the combinations...
> 

Not quite:

Q or C = 0 really means Q or C set to -1 (default), Q or C = 1 means placing 
that thread on the CPU managing the IRQ. Sorry... 


Alan


Re: IO queueing and complete affinity w/ threads: Some results

2008-02-12 Thread Alan D. Brunelle
Back on the 32-way: in this set of tests we're running 12 disks spread out
across the 8 cells of the 32-way. Each disk will have an Ext2 FS placed on it,
a clean Linux kernel source tree untarred onto it, then a full make (-j4) and
then a make clean performed. The 12 series are done in parallel - so each disk
will have:

mkfs
tar x
make
make clean

performed. The sequence was run ten times, and the overall averages are
presented below - note this is Jens' original patch sequence, NOT the kthreads
one (those results should be available tomorrow, hopefully).

mkfs          Min     Avg     Max   Std Dev
---------  ------  ------  ------  --------
q0.c0.rq0  17.814  30.322  33.263   4.551 
q0.c0.rq1  17.540  30.058  32.885   4.321 
q0.c1.rq0  17.770  31.328  32.958   3.121 
q1.c0.rq0  17.907  31.032  32.767   3.515 
q1.c1.rq0  16.891  30.319  33.097   4.624 

untar         Min     Avg     Max   Std Dev
---------  ------  ------  ------  --------
q0.c0.rq0  19.747  21.971  26.292   1.215 
q0.c0.rq1  19.680  22.365  36.395   2.010 
q0.c1.rq0  18.823  21.390  24.455   0.976 
q1.c0.rq0  18.433  21.500  23.371   1.009 
q1.c1.rq0  19.414  21.761  34.115   1.378 

make          Min     Avg     Max   Std Dev
---------  ------  ------  ------  --------
q0.c0.rq0 527.418 543.296 552.030   5.384 
q0.c0.rq1 526.265 542.312 549.477   5.467 
q0.c1.rq0 528.935 544.940 553.823   4.746 
q1.c0.rq0 529.432 544.399 553.212   5.166 
q1.c1.rq0 527.638 543.577 551.323   5.478 

clean         Min     Avg     Max   Std Dev
---------  ------  ------  ------  --------
q0.c0.rq0  16.962  20.308  33.775   3.179 
q0.c0.rq1  17.436  20.156  29.370   3.097 
q0.c1.rq0  17.061  20.111  31.504   2.791 
q1.c0.rq0  16.745  20.247  29.327   2.953 
q1.c1.rq0  17.346  20.316  31.178   3.283 

Hopefully, the first column is self-explanatory - these are the settings
applied to the queue_affinity, completion_affinity and rq_affinity tunables.
Because the standard deviations are so large and the average results so close,
I'm not seeing anything in this set of tests to favor any of the
combinations...
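
For reference, selecting a given combination is just three echoes per device -
e.g. q1.c1.rq0, with hypothetical device names and with each device's IRQ CPU
taken from wherever its IRQ was bound (see /proc/interrupts):

  set_combo() {   # set_combo <dev> <irq_cpu>
      echo $2 > /sys/block/$1/queue/queue_affinity       # q1: queue on the IRQ CPU
      echo $2 > /sys/block/$1/queue/completion_affinity  # c1: complete on the IRQ CPU
      echo 0  > /sys/block/$1/queue/rq_affinity          # rq0
  }
  set_combo sdb 2
  set_combo sdc 6
  # ...and so on for the remaining devices; the q0/c0 cases echo -1 instead.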

As noted, I will be having the machine run the kthreads-variant of the patch 
stream tonight, and then I have to go back and run a non-patched kernel to see 
if there are any /regressions/. 

Alan



Re: IO queueing and complete affinity w/ threads: Some results

2008-02-12 Thread Alan D. Brunelle
Whilst running a series of file system related loads on our 32-way*, I dropped 
down to a 16-way w/ only 24 disks, and ran two kernels: the original set of 
Jens' patches and then his subsequent kthreads-based set. Here are the results:

Original:
A Q C |  MBPS   Avg Lat StdDev |  Q-local Q-remote | C-local C-remote
- | --  -- |   | --- 
X X X | 1850.4 0.413880 0.0109 |  0.0  55860.8 | 0.0  27946.9
X X A | 1850.6 0.413848 0.0106 |  0.0  55859.2 | 0.0  27946.1
X X I | 1850.6 0.413830 0.0107 |  0.0  55858.5 | 27945.8  0.0
- | --  -- |   | --- 
X A X | 1850.0 0.413949 0.0106 |  55843.7  0.0 | 0.0  27938.3
X A A | 1850.2 0.413931 0.0107 |  55844.2  0.0 | 0.0  27938.6
X A I | 1850.4 0.413862 0.0107 |  55854.3  0.0 | 27943.7  0.0
- | --  -- |   | --- 
X I X | 1850.9 0.413764 0.0107 |  0.0  55866.2 | 0.0  27949.6
X I A | 1850.5 0.413854 0.0108 |  0.0  55855.0 | 0.0  27944.0
X I I | 1850.4 0.413848 0.0105 |  0.0  55854.6 | 27943.8  0.0
= | ==  == |   | === 
I X X | 1570.7 0.487686 0.0142 |  0.0  47406.1 | 0.0  23719.5
I X A | 1570.8 0.487666 0.0143 |  0.0  47409.3 | 23721.2  0.0
I X I | 1570.8 0.487664 0.0142 |  0.0  47410.7 | 23721.8  0.0
- | --  -- |   | --- 
I A X | 1570.9 0.487642 0.0144 |  47412.2  0.0 | 0.0  23722.6
I A A | 1570.8 0.487647 0.0141 |  47411.2  0.0 | 23722.1  0.0
I A I | 1570.8 0.487651 0.0143 |  47410.8  0.0 | 23721.9  0.0
- | --  -- |   | --- 
I I X | 1570.8 0.487683 0.0142 |  47410.2  0.0 | 0.0  23721.6
I I A | 1571.1 0.487591 0.0146 |  47415.0  0.0 | 23724.0  0.0
I I I | 1571.0 0.487623 0.0143 |  47412.5  0.0 | 23722.8  0.0
= | ==  == |   | === 
rq=0  | 1726.7 0.443562 0.0120 |  52118.6  0.0 |  2138.6  23937.2
rq=1  | 1820.5 0.420729 0.0110 |  54938.2  0.0 | 0.0  27485.6
- | --  -- |   | --- 


kthreads-based:
A Q C |  MBPS   Avg Lat StdDev |  Q-local Q-remote | C-local C-remote
- | --  -- |   | --- 
X X X | 1850.5 0.413867 0.0107 |  0.0  55854.7 | 0.0  27943.8
X X A | 1850.9 0.413763 0.0107 |  0.0  55867.0 | 0.0  27950.0
X X I | 1850.3 0.413911 0.0109 |  0.0  55849.0 | 27941.0  0.0
- | --  -- |   | --- 
X A X | 1851.0 0.413730 0.0107 |  55871.4  0.0 | 0.0  27952.2
X A A | 1850.1 0.413919 0.0107 |  55845.5  0.0 | 0.0  27939.2
X A I | 1850.8 0.413789 0.0108 |  55864.8  0.0 | 27948.9  0.0
- | --  -- |   | --- 
X I X | 1850.5 0.413849 0.0107 |  0.0  55856.5 | 0.0  27944.8
X I A | 1850.6 0.413818 0.0108 |  0.0  55860.2 | 0.0  27946.6
X I I | 1850.8 0.413764 0.0108 |  0.0  55866.7 | 27949.8  0.0
= | ==  == |   | === 
I X X | 1570.9 0.487662 0.0145 |  0.0  47410.1 | 0.0  23721.6
I X A | 1570.7 0.487691 0.0142 |  0.0  47406.9 | 23720.0  0.0
I X I | 1570.7 0.487688 0.0141 |  0.0  47406.5 | 23719.8  0.0
- | --  -- |   | --- 
I A X | 1570.9 0.487661 0.0144 |  47415.4  0.0 | 0.0  23724.2
I A A | 1570.8 0.487648 0.0141 |  47409.1  0.0 | 23721.0  0.0
I A I | 1570.7 0.487667 0.0141 |  47406.1  0.0 | 23719.5  0.0
- | --  -- |   | --- 
I I X | 1570.8 0.487691 0.0142 |  47409.3  0.0 | 0.0  23721.2
I I A | 1570.9 0.487644 0.0142 |  47408.8  0.0 | 23720.9  0.0
I I I | 1570.6 0.487671 0.0141 |  47412.5  0.0 | 23722.8  0.0
= | ==  == |   | === 
rq=0  | 1742.1 0.439676 0.0118 |  52578.1  0.0 |  3602.6  22703.0
rq=1  | 1745.0 0.438918 0.0115 |  52666.3  0.0 |  3473.0  22876.6
- | --  -- |   | --- 

For the first 18 sets on both kernels the results are very similar; the last
two rq=0/1 sets are perturbed too much by application placement (I would
guess). I'll have to think about that some more.

Alan
* What I'm doing on the 32-way is to compare and contrast mkfs, untar, kernel
make & kernel clean times with different combinations of Q, C and RQ. [[This is
currently with the "Jens original" patch; if things go well, I can do an
overnight run with the kthreads-based patch.]]

IO queueing and complete affinity w/ threads: Some results

2008-02-11 Thread Alan D. Brunelle
The test case chosen may not be a very good start, but anyway, here are some
initial test results with the "nasty arch bits". This was performed on a 32-way
ia64 box with 1 terabyte of RAM and 144 FC disks (contained in 24 HP MSA1000
RAID controllers attached to 12 dual-port adapters). Each test case was run for
3 minutes. I had one application per device performing a large amount of
direct/asynchronous large reads. Here's the table of results, with explanation
below (results are for all 144 devices, either accumulated (MBPS) or averaged
(other columns)):

A Q C |  MBPS   Avg Lat StdDev |  Q-local Q-remote | C-local C-remote
- | --  -- |   | --- 
X X X | 3859.9 1.190067 0.0502 |  0.0  19484.7 | 0.0   9758.8
X X A | 3856.3 1.191220 0.0490 |  0.0  19467.2 | 0.0   9750.1
X X I | 3850.3 1.192992 0.0508 |  0.0  19437.3 |  9735.1  0.0
- | --  -- |   | --- 
X A X | 3853.9 1.191891 0.0503 |  19455.4  0.0 | 0.0   9744.2
X A A | 3853.5 1.191935 0.0507 |  19453.2  0.0 | 0.0   9743.1
X A I | 3856.6 1.191043 0.0512 |  19468.7  0.0 |  9750.8  0.0
- | --  -- |   | --- 
X I X | 3854.7 1.191674 0.0491 |  0.0  19459.8 | 0.0   9746.4
X I A | 3855.3 1.191434 0.0501 |  0.0  19461.9 | 0.0   9747.4
X I I | 3856.2 1.191128 0.0506 |  0.0  19466.6 |  9749.8  0.0
= | ==  == |   | === 
I X X | 3857.0 1.190987 0.0500 |  0.0  19471.9 | 0.0   9752.5
I X A | 3856.5 1.191082 0.0496 |  0.0  19469.4 |  9751.2  0.0
I X I | 3853.7 1.191938 0.0500 |  0.0  19456.2 |  9744.6  0.0
- | --  -- |   | --- 
I A X | 3854.8 1.191675 0.0502 |  19461.5  0.0 | 0.0   9747.2
I A A | 3855.1 1.191464 0.0503 |  19464.0  0.0 |  9748.5  0.0
I A I | 3854.9 1.191627 0.0483 |  19461.7  0.0 |  9747.4  0.0
- | --  -- |   | --- 
I I X | 3853.4 1.192070 0.0484 |  19454.8  0.0 | 0.0   9743.9
I I A | 3852.2 1.192403 0.0502 |  19448.5  0.0 |  9740.8  0.0
I I I | 3854.0 1.191822 0.0499 |  19457.9  0.0 |  9745.5  0.0
= | ==  == |   | === 
rq=0  | 3854.8 1.191680 0.0480 |  19459.7  0.0 |   202.9   9543.5
rq=1  | 3854.0 1.191965 0.0483 |  19457.0  0.0 |   403.1   9341.9
- | --  -- |   | --- 

The variables being played with:

'A' - When set to 'X' the application was placed on a CPU other than the one
handling IRQs for the device (in another cell); when set to 'I' it was placed
on the CPU handling the device's IRQ.

'Q' - When set to 'X', queue affinity was placed in another cell from the
application OR completion OR IRQ; when set to 'A' it was pegged onto the same
CPU as the application; when set to 'I' it was set to the CPU that was managing
the IRQ for its device.

'C' - Likewise for the completion affinity: 'X' means on another cell besides
the one containing the application, the queueing or the IRQ handling CPU; 'A'
means put on the same CPU as the application; and 'I' means put on the same
CPU as the IRQ handler.

o  For the last two rows, we set Q == C == -1, and let the application go to 
any CPU (as dictated by the scheduler). Then we had 'rq_affinity' set to 0 or 1.

The resulting columns include:

MBPS - Total megabytes per second (so we're seeing about 3.8 gigabytes per 
second for the system)
Avg lat - Average per IO measured latency in seconds (note: I had upwards of 
128 X 256K IOs going on per device across the system)
StdDev - Average standard deviation across the devices

Q-local & Q-remote refer to the average number of queue operations handled 
locally and remotely, respectively. (Average per device)
C-local & C-remote refer to the average number of completion operations handled 
locally and remotely, respectively. (Average per device)
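
The IO generator itself isn't named here; as a stand-in, a per-device job of
the same shape (direct, asynchronous 256K reads, ~128 outstanding, for 3
minutes) might look like the following:

  # $DEV and $APP_CPU are placeholders chosen per the A/Q/C combination above.
  taskset -c $APP_CPU fio --name=rd-$DEV --filename=/dev/$DEV --rw=read \
      --bs=256k --direct=1 --ioengine=libaio --iodepth=128 \
      --time_based --runtime=180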

As noted above, I'm not so sure this is the best test case - it's rather
artificial. I was hoping to see some differences based upon affinitization,
but whilst there appear to be some trends, the results are so close (0.2%
difference from best to worst case MBPS, and the standard deviations on the
latencies overlap within the groups) that I doubt there is anything definitive.
Unfortunately, most of the disks are being used for real data right now, so I
can't perform significant write tests (with file systems in place, say), which
would be more real-world. I do have access to about 24 of the disks, so I will
try to place file systems on those and do some tests. [I won't be able to use
XFS without going through some hoops - it's a Red Hat installation right now,
and they don't support XFS out of the box...]

BTW: The Q/C local/remote columns were put in place to make sure that I had 
things set up right, and for the first 18 cases I think they look 


Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

2008-02-07 Thread Alan D. Brunelle
Jens Axboe wrote:
> Hi,
> 
> Here's a variant using kernel threads only, the nasty arch bits are then
> not needed. Works for me, no performance testing (that's a hint for Alan
> to try and queue up some testing for this variant as well :-)
> 
> 
I'll get to that - I'm working my way through the first batch of testing on a
NUMA platform. [[If anybody has ideas on specific testing to do, that would be
helpful.]] I do plan on running some AIM7 tests, as those have shown
improvement with other types of affinity changes in the kernel, and some of
them have "interesting" IO load characteristics.

Alan



Re: [PATCH 0/8] IO queuing and complete affinity

2008-02-07 Thread Alan D. Brunelle
Jens Axboe wrote:
> Hi,
> 
> Since I'll be on vacation next week, I thought I'd send this out in
> case people wanted to play with it. It works here, but I haven't done
> any performance numbers at all.
> 
> Patches 1-7 are all preparation patches for #8, which contains the
> real changes. I'm not particularly happy with the arch implementation
> for raising a softirq on another CPU, but it should be fast enough
> so suffice for testing.
> 
> Anyway, this patchset is mainly meant as a playground for testing IO
> affinity. It allows you to set three values per queue, see the files
> in the /sys/block/<dev>/queue directory:
> 
> completion_affinity
>   Only allow completions to happen on the defined CPU mask.
> queue_affinity
>   Only allow queuing to happen on the defined CPU mask.
> rq_affinity
>   Always complete a request on the same CPU that queued it.
> 
> As you can tell, there's some overlap to allow for experimentation.
> rq_affinity will override completion_affinity, so it's possible to
> have completions on a CPU that isn't set in that mask. The interface
> is currently limited to all CPUs or a specific CPU, but the implementation
> is supports (and works with) cpu masks. The logic is in
> blk_queue_set_cpumask(), it should be easy enough to change this to
> echo a full mask, or allow OR'ing of CPU masks when a new CPU is passed in.
> For now, echo a CPU number to set that CPU, or use -1 to set all CPUs.
> The default is all CPUs for no change in behaviour.
> 
> Patch set is against current git as of this morning. The code is also in
> the block git repo, branch is io-cpu-affinity.
> 
> git://git.kernel.dk/linux-2.6-block.git io-cpu-affinity
> 

FYI: on a kernel with this patch set, running on a 4-way ia64 (non-NUMA) w/ a
FC disk, I crafted a test with 135 combinations (a sketch of the sweep follows
the lists):

o  Having the issuing application pegged on each CPU - or - left alone (run on 
any CPU), yields 5 possibilities
o  Having the queue affinity on each CPU, or any (-1), yields 5 possibilities
o  Having the completion affinity on each CPU, or any (-1), yields 5 
possibilities

and

o  Having the issuing application pegged on each CPU - or - left alone (run on
any CPU), yields 5 possibilities
o  Having rq_affinity set to 0 or 1, yields 2 possibilities.
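
A sketch of how such a sweep might be scripted (the device name and the IO
generator invocation are placeholders):

  DEV=sdc
  echo 0 > /sys/block/$DEV/queue/rq_affinity
  for app in - 0 1 2 3; do                  # '-' == don't pin the application
    for q in -1 0 1 2 3; do
      for c in -1 0 1 2 3; do
        echo $q > /sys/block/$DEV/queue/queue_affinity
        echo $c > /sys/block/$DEV/queue/completion_affinity
        if [ "$app" = "-" ]; then
          run_io_test /dev/$DEV             # placeholder 10-minute generator run
        else
          taskset -c $app run_io_test /dev/$DEV
        fi
      done
    done
  done
  # ...plus the 5 x 2 rq_affinity cases: the same application placements with
  # queue/completion affinity left at -1 and rq_affinity set to 0 or 1.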

Each test was for 10 minutes, and ran overnight just fine. The difference 
amongst the 135 resulting values (based upon latency per-IO seen at the 
application layer) was <<1% (0.32% to be exact). This would seem to indicate 
that there isn't a penalty for running with this code, and it seems relatively 
stable given this.

The application used was doing 64KiB asynchronous direct reads, and had a
minimum average per-IO latency of 42.426310 milliseconds, an average of
42.486557 milliseconds (std dev of 0.0041561), and a max of 42.561360
milliseconds.

I'm going to do some runs on a 16-way NUMA box, w/ a lot of disks today, to see 
if we see gains in that environment.

Alan D. Brunelle
HP OSLO S


Re: 2.6.24 regression w/ QLA2300

2008-02-05 Thread Alan D. Brunelle
Andrew Vasquez wrote:
> On Tue, 05 Feb 2008, Andrew Vasquez wrote:
> 
>> On Tue, 05 Feb 2008, Alan D. Brunelle wrote:
>>
>>> commit 9b73e76f3cf63379dcf45fcd4f112f5812418d0a
>>> Merge: 50d9a12... 23c3e29...
>>> Author: Linus Torvalds <[EMAIL PROTECTED]>
>>> Date:   Fri Jan 25 17:19:08 2008 -0800
>>>
>>> Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
>>>
>>> * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: 
>>> (200 commits)
>>>
>>> I believe a regression was introduced. I'm running on a 4-way IA64,
>>> with straight 2.6.24 and 2 dual-port cards:
>>>
>>> 40:01.0 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03)
>>> 40:01.1 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03)
>>> c0:01.0 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03)
>>> c0:01.1 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03)
>>>
>>> the adapters failed initialization. In particular, I narrowed it down
>>> to failing the qla2x00_mbox_command call within qla2x00_init_firmware
>>> function. I went and removed the qla2x00-related parts of this (large-ish)
>>> merge, and the 4 ports initialized just fine.
>> Could you load the (default 2.6.24) driver with
>> ql2xextended_error_logging modules parameter set:
>>
>>  # insmod qla2xxx ql2xextended_error_logging=1
>>
>> and send the resultant kernel logs?
> 
> Could you tray the patch referenced here:
> 
> qla2xxx: Correct issue where incorrect init-fw mailbox command was used on 
> non-NPIV capable ISPs.
> http://article.gmane.org/gmane.linux.scsi/38240
> 
> Thanks, av

The referenced patch worked fine, Andrew - thanks much!

Alan


Re: 2.6.24 regression w/ QLA2300

2008-02-05 Thread Alan D. Brunelle
Andrew Vasquez wrote:
> On Tue, 05 Feb 2008, Alan D. Brunelle wrote:
> 
>> commit 9b73e76f3cf63379dcf45fcd4f112f5812418d0a
>> Merge: 50d9a12... 23c3e29...
>> Author: Linus Torvalds <[EMAIL PROTECTED]>
>> Date:   Fri Jan 25 17:19:08 2008 -0800
>>
>> Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
>>
>> * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (200 
>> commits)
>>
>> I believe a regression was introduced. I'm running on a 4-way IA64,
>> with straight 2.6.24 and 2 dual-port cards:
>>
>> 40:01.0 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03)
>> 40:01.1 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03)
>> c0:01.0 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03)
>> c0:01.1 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03)
>>
>> the adapters failed initialization. In particular, I narrowed it down
>> to failing the qla2x00_mbox_command call within qla2x00_init_firmware
>> function. I went and removed the qla2x00-related parts of this (large-ish)
>> merge, and the 4 ports initialized just fine.
> 
> Could you load the (default 2.6.24) driver with
> ql2xextended_error_logging modules parameter set:
> 
>   # insmod qla2xxx ql2xextended_error_logging=1
> 
> and send the resultant kernel logs?

Here's the output to the console (if there are other logs you need, let me 
know). I'll try the patch next, and sorry, hadn't realized merges were still 
coming in under 2.6.24 in Linus' tree... 

QLogic Fibre Channel HBA Driver
ACPI: PCI Interrupt :40:01.0[A] -> GSI 38 (level, low) -> IRQ 58
qla2xxx :40:01.0: Found an ISP2312, irq 58, iobase 0xc000a0041000
qla2xxx :40:01.0: Configuring PCI space...
qla2x00_get_flash_version(): Unrecognized code type ff at pcids da1c.
qla2x00_get_flash_version(): Unrecognized code type ff at pcids 1f61c.
qla2xxx :40:01.0: Configure NVRAM parameters...
qla2xxx :40:01.0: Verifying loaded RISC code...
scsi(14):  Load RISC code 
scsi(14): Verifying Checksum of loaded RISC code.
scsi(14): Checksum OK, start firmware.
qla2xxx :40:01.0: Allocated (412 KB) for firmware dump...
scsi(14): Issue init firmware.
qla2x00_mailbox_command(14):  FAILED. mbx0=4001, mbx1=0, mbx2=ba8a, cmd=48 

qla2x00_init_firmware(14): failed=102 mb0=4001.
scsi(14): Init firmware  FAILED .
qla2xxx :40:01.0: Failed to initialize adapter
scsi(14): Failed to initialize adapter - Adapter flags 10.
ACPI: PCI Interrupt :40:01.1[B] -> GSI 39 (level, low) -> IRQ 59
qla2xxx :40:01.1: Found an ISP2312, irq 59, iobase 0xc000a004
qla2xxx :40:01.1: Configuring PCI space...
qla2x00_get_flash_version(): Unrecognized code type ff at pcids da1c.
qla2x00_get_flash_version(): Unrecognized code type ff at pcids 1f61c.
qla2xxx :40:01.1: Configure NVRAM parameters...
qla2xxx :40:01.1: Verifying loaded RISC code...
scsi(15):  Load RISC code 
scsi(15): Verifying Checksum of loaded RISC code.
scsi(15): Checksum OK, start firmware.
qla2xxx :40:01.1: Allocated (412 KB) for firmware dump...
scsi(15): Issue init firmware.
qla2x00_mailbox_command(15):  FAILED. mbx0=4001, mbx1=0, mbx2=bac6, cmd=48 

qla2x00_init_firmware(15): failed=102 mb0=4001.
scsi(15): Init firmware  FAILED .
qla2xxx :40:01.1: Failed to initialize adapter
scsi(15): Failed to initialize adapter - Adapter flags 10.
ACPI: PCI Interrupt :c0:01.0[A] -> GSI 71 (level, low) -> IRQ 60
qla2xxx :c0:01.0: Found an ISP2312, irq 60, iobase 0xc000e0041000
qla2xxx :c0:01.0: Configuring PCI space...
qla2x00_get_flash_version(): Unrecognized code type ff at pcids c61c.
qla2x00_get_flash_version(): Unrecognized code type ff at pcids 1da1c.
qla2xxx :c0:01.0: Configure NVRAM parameters...
qla2xxx :c0:01.0: Verifying loaded RISC code...
scsi(16):  Load RISC code 
scsi(16): Verifying Checksum of loaded RISC code.
scsi(16): Checksum OK, start firmware.
qla2xxx :c0:01.0: Allocated (412 KB) for firmware dump...
scsi(16): Issue init firmware.
qla2x00_mailbox_command(16):  FAILED. mbx0=4001, mbx1=0, mbx2=bae3, cmd=48 

qla2x00_init_firmware(16): failed=102 mb0=4001.
scsi(16): Init firmware  FAILED .
qla2xxx :c0:01.0: Failed to initialize adapter
scsi(16): Failed to initialize adapter - Adapter flags 10.
ACPI: PCI Interrupt :c0:01.1[B] -> GSI 72 (level, low) -> IRQ 61
qla2xxx :c0:01.1: Found an ISP2312, irq 61, iobase 0xc000e004
qla2xxx :c0:01.1: Configuring PCI space...
qla2x00_get_flash_version(): Unrecognized code type ff at pcids c61c.
qla2x00_get_flash_version(): Unrecognized code type ff at pcids 1da1c.
qla2xxx :c0:01.1: Configure NVRAM parameters...
qla2xxx :c0:01.1: Verifying loaded RISC c


Re: 2.6.24 regression w/ QLA2300

2008-02-05 Thread Alan D. Brunelle
Andrew Vasquez wrote:
 On Tue, 05 Feb 2008, Andrew Vasquez wrote:
 
 On Tue, 05 Feb 2008, Alan D. Brunelle wrote:

 commit 9b73e76f3cf63379dcf45fcd4f112f5812418d0a
 Merge: 50d9a12... 23c3e29...
 Author: Linus Torvalds [EMAIL PROTECTED]
 Date:   Fri Jan 25 17:19:08 2008 -0800

 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6

 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: 
 (200 commits)

 I believe a regression was introduced. I'm running on a 4-way IA64,
 with straight 2.6.24 and 2 dual-port cards:

 40:01.0 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03)
 40:01.1 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03)
 c0:01.0 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03)
 c0:01.1 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03)

 the adapters failed initialization. In particular, I narrowed it down
 to failing the qla2x00_mbox_command call within qla2x00_init_firmware
 function. I went and removed the qla2x00-related parts of this (large-ish)
 merge, and the 4 ports initialized just fine.
 Could you load the (default 2.6.24) driver with
 ql2xextended_error_logging module parameter set:

  # insmod qla2xxx ql2xextended_error_logging=1

 and send the resultant kernel logs?
 
 Could you try the patch referenced here:
 
 qla2xxx: Correct issue where incorrect init-fw mailbox command was used on 
 non-NPIV capable ISPs.
 http://article.gmane.org/gmane.linux.scsi/38240
 
 Thanks, av

The referenced patch worked fine Andrew, thanks much! 

Alan
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Moved UNPLUG traces to match 1-to-1 with PLUG traces

2008-02-01 Thread Alan D. Brunelle

Currently, with DM (and probably MD) we can receive streams of multiple
PLUG and/or UNPLUG traces on the lower devices:

  8,32   1   910438    25.383725302 12843  P   N [mkfs.ext2]
  8,32   1   911627    25.385613612 12843  P   N [mkfs.ext2]
  8,32   1   911819    25.385931255 12843  P   N [mkfs.ext2]
  8,32   1   913176    25.388396840 12843  P   N [mkfs.ext2]
  8,32   1   914170    25.391634524 12843  P   N [mkfs.ext2]
  8,32   1   915239    25.393325078 12843  P   N [mkfs.ext2]
  8,32   1   915473    25.397930230     0 UT   N [swapper] 18
  8,32   1   915474    25.397936145    32  U   N [kblockd/1] 18
  8,32   1   915497    25.594446953 12843  P   N [mkfs.ext2]
  8,32   1   916806    25.596543309 12843  P   N [mkfs.ext2]
  8,32   1   918351    25.599276485 12843  P   N [mkfs.ext2]
  8,32   1   918377    25.599313544 12843  U   N [mkfs.ext2] 9

The PLUG traces are "protected" by the test-and-clear functionality
in blk_plug_device; the UNPLUG traces, however, have no such protection.
And in the case where MD or DM were involved, the upper-level dev as
well as the lower-level devs would both go through blk_unplug, which
would generate extra UNPLUGs.

With the proposed change, I only see a good one-to-one mapping of PLUG
and UNPLUG traces on the underlying devices. However, we no longer see
the UNPLUG traces on the MD or DM devices, which one could argue makes
sense because (a) those devices don't have request queues managed by
the block layer, and thus (b) they never had any notion of having been
plugged. (So we saw UNPLUG traces on devices that never had PLUG traces.)

A similar stream with the new patch:

  8,32   1   882908    24.179721271  7539  P   N [mkfs.ext2]
  8,32   1   884121    24.182232467  7539  U   N [mkfs.ext2] 10
  8,32   1   884129    24.182305789  7539  P   N [mkfs.ext2]
  8,32   1   885478    24.184748842  7539  U   N [mkfs.ext2] 14
  8,32   1   885487    24.184791013  7539  P   N [mkfs.ext2]
  8,32   1   886336    24.186479185  7539  U   N [mkfs.ext2] 15
  8,32   1   886343    24.186516024  7539  P   N [mkfs.ext2]
  8,32   1   886414    24.186637771  7539  U   N [mkfs.ext2] 14
  8,32   1   886420    24.186649173  7539  P   N [mkfs.ext2]
  8,32   1   886726    24.193329121     0 UT   N [swapper] 15
  8,32   1   886727    24.193336428    32  U   N [kblockd/1] 15
  8,32   1   886748    24.354771821  7539  P   N [mkfs.ext2]
  8,32   1   888081    24.357279380  7539  U   N [mkfs.ext2] 4
  8,32   1   888090    24.357323899  7539  P   N [mkfs.ext2]
  8,32   1   888934    24.358969886  7539  U   N [mkfs.ext2] 5
  8,32   1   888942    24.359019161  7539  P   N [mkfs.ext2]
  8,32   1   890147    24.361314613  7539  U   N [mkfs.ext2] 8

The proposed patch was tested with a 2.6.22-based kernel, and
compile tested with a 2.6.24-based tree from 31 January 2008
(85004cc367abc000aa36c0d0e270ab609a68b0cb).

Signed-off-by: Alan D. Brunelle <[EMAIL PROTECTED]>
---
 block/blk-core.c |   12 ++++--------
 1 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8ff9944..1d148b5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -218,6 +218,9 @@ int blk_remove_plug(struct request_queue *q)
if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
return 0;
 
+   blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+   q->rq.count[READ] + q->rq.count[WRITE]);
+
del_timer(&q->unplug_timer);
return 1;
 }
@@ -271,9 +274,6 @@ void blk_unplug_work(struct work_struct *work)
struct request_queue *q =
container_of(work, struct request_queue, unplug_work);
 
-   blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
-   q->rq.count[READ] + q->rq.count[WRITE]);
-
q->unplug_fn(q);
 }
 
@@ -292,12 +292,8 @@ void blk_unplug(struct request_queue *q)
/*
 * devices don't necessarily have an ->unplug_fn defined
 */
-   if (q->unplug_fn) {
-   blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
-   q->rq.count[READ] + q->rq.count[WRITE]);
-
+   if (q->unplug_fn)
q->unplug_fn(q);
-   }
 }
 EXPORT_SYMBOL(blk_unplug);
 
-- 
1.5.2.5


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

2007-11-16 Thread Alan D. Brunelle

Ray Lee wrote:



Out of curiosity, what are the mount options for the freshly created
ext3 fs? In particular, are you using noatime, nodiratime?

Ray


Nope, just mount. However, the tool I'm using to read the large file & 
overwrite the large file does open with O_NOATIME for reads...


The tool used to read the files in the read-a-tree test is dd, and I 
doubt(?) it does an O_NOATIME...


Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

2007-11-16 Thread Alan D. Brunelle

Alan D. Brunelle wrote:


Read large file:

Kernel  MinAvgMax   Std Dev%user  %system  %iowait
--
base :  201.6  215.1  275.5   22.8 0.26%4.69%   33.54%
arjan:  198.0  210.3  261.5   18.5 0.33%   10.24%   54.00%

Read a tree:

Kernel  MinAvgMax   Std Dev%user  %system  %iowait
--
base : 3518.2 4631.3 5991.3  784.6 0.19%3.29%   23.56%
arjan: 5731.6 6849.8 .4  731.6 0.32%9.90%   52.70%

Overwrite large file:

Kernel  MinAvgMax   Std Dev%user  %system  %iowait
--
base :  104.2  147.7  239.5   38.4  0.02%0.05%   1.08%
arjan:  106.2  149.7  239.2   38.4  0.25%0.79%  14.97%




I'm going to try to do some cleanup work on the iostat CPU results - 
the reason %user & %system are so low is (I think) because they also 
include a lot of 0% results from the tail of the runs (while the unmount 
is going on). I'm going to try to extract results for just the 
"meat" of the runs.


Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

2007-11-16 Thread Alan D. Brunelle

Here are the results for the latest tests, some notes:

o  The machine actually has 8GiB of RAM, so the tests still may end up 
using (some) page cache. (But at least it was the same for both kernels! 
:-) )


o  Sorry the results took so long - the updated tree size caused the 
runs to take > 12 hours...


o  The longer runs seemed to bring down the standard deviation a bit, 
although they are still quite large.


o  10 runs per test (read large file, read a tree, overwrite large 
file), with averages presented.


o  The 1st 4 columns (min, avg, max, std dev) summarize the run 
lengths for the tests - real time, in seconds


o  The last 3 columns are extracted from iostat results over the course 
of the whole run.


o  The read a tree test certainly stands out - the other 2 large file 
manipulations have the two kernels within a couple of percent, but the 
read a tree test has Arjan's patch taking about 47%(!) longer on 
average. The increased %iowait & %system time in all 3 cases is interesting.



Read large file:

Kernel  MinAvgMax   Std Dev%user  %system  %iowait
--
base :  201.6  215.1  275.5   22.8 0.26%4.69%   33.54%
arjan:  198.0  210.3  261.5   18.5 0.33%   10.24%   54.00%

Read a tree:

Kernel  MinAvgMax   Std Dev%user  %system  %iowait
--
base : 3518.2 4631.3 5991.3  784.6 0.19%3.29%   23.56%
arjan: 5731.6 6849.8 .4  731.6 0.32%9.90%   52.70%

Overwrite large file:

Kernel  MinAvgMax   Std Dev%user  %system  %iowait
--
base :  104.2  147.7  239.5   38.4  0.02%0.05%   1.08%
arjan:  106.2  149.7  239.2   38.4  0.25%0.79%  14.97%

Let me know if there is anything else I can do to elaborate, or if you 
have suggestions for further testing.


Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

2007-11-14 Thread Alan D. Brunelle

Oh, and the runs were done in single-user mode...

Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

2007-11-14 Thread Alan D. Brunelle

Arjan van de Ven wrote:

On Wed, 14 Nov 2007 18:18:05 +0100
Ingo Molnar <[EMAIL PROTECTED]> wrote:

  

* Andrew Morton <[EMAIL PROTECTED]> wrote:



ooh, more performance testing.  Thanks

  

* The overwriter task (on an 8GiB file), average over 10 runs:
  o 2.6.24 - 300.88226 seconds
  o 2.6.24 + Arjan's patch - 403.85505 seconds

* The read-a-different-kernel-tree task, average over 10 runs:
  o 2.6.24 - 46.8145945549 seconds
  o 2.6.24 + Arjan's patch - 39.6430601119 seconds

* The large-linear-read task (on an 8GiB file), average over
10 runs: o 2.6.24 - 290.32522 seconds
  o 2.6.24 + Arjan's patch - 386.34860 seconds


These are *large* differences, making this a very signifcant
patch. Much care is needed now.
  
and the numbers suggest it's mostly a severe performance regression. 
That's not what i have expected - ho hum. Apologies for my earlier 
"please merge it already!" whining.



that's.. not automatic; it depends on what the right thing is :-(
What for sure changes is that who gets to do IO changes. Some of the
tests we ran internally (we didn't publish yet because we saw REALLY
large variations for most of them even without any patch) show for
example that "dbench" got slower. But.. dbench gets slower when things
get more fair, and faster when things get unfair. What conclusion you
draw out of that is a whole different matter and depends on exactly
what the test is doing, and what is the right thing for the OS to do in
terms of who gets to do the IO.

This makes the patch more tricky than the one-line change suggests, and
this is also why I haven't published a ton of data yet; it's hard to
get useful tests for this (and the variation of the 2.6.23+ kernels
makes it even harder to do anything meaningful ;-( )


  
I'd also like to point out here that the run-to-run deviation was indeed 
quite large for both the unpatched and patched kernels; I'll report on 
that information with the next set of results...


Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

2007-11-14 Thread Alan D. Brunelle

Andrew Morton wrote:

(cc lkml restored, with permission)

On Wed, 14 Nov 2007 10:48:10 -0500 "Alan D. Brunelle" <[EMAIL PROTECTED]> wrote:

  

Andrew Morton wrote:


On Mon, 15 Oct 2007 16:13:15 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:

  
  

Since you have been involved a lot with ext3 development,
which kinds of workloads do you think will show a performance
degradation with Arjan's patch?   What should I test?



Gee.  Expect the unexpected ;)

One problem might be when kjournald is doing its ordered-mode data
writeback at the start of commit.  That writeout will now be
higher-priority and might impact other tasks which are doing synchronous
file overwrites (ie: no commits) or O_DIRECT reads or writes or just plain
old reads.

If the aggregate number of seeks over the long term is the same as before
then of course the overall throughput should be the same, in which case the
impact might only be upon latency.  However if for some reason the total
number of seeks is increased then there will be throughput impacts as well.

So as a starting point I guess one could set up a
copy-a-kernel-tree-in-a-loop running in the background and then see what
impact that has upon a large-linear-read, upon a
read-a-different-kernel-tree and upon some database-style multithreaded
O_DIRECT reader/overwriter.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
  
  

Hi folks -

I noticed this thread recently (sorry, a month late and a dollar short), 
but it was very interesting to me, and as I had some cycles yesterday I 
did a quick run of what Andrew suggested here (using 2.6.24-rc2 w/out 
and then w/ Arjan's patch). After doing the runs last night I'm not 
overly happy with the test setup, but before redoing that, I thought I'd 
ask to make sure that this patch was still being considered?



I'd consider its status to be "might be a good idea, more performance
testing needed".

  

And, btw, the results from the first pass were rather mixed -



ooh, more performance testing.  Thanks

  

* The overwriter task (on an 8GiB file), average over 10 runs:
  o 2.6.24 - 300.88226 seconds
  o 2.6.24 + Arjan's patch - 403.85505 seconds

* The read-a-different-kernel-tree task, average over 10 runs:
  o 2.6.24 - 46.8145945549 seconds
  o 2.6.24 + Arjan's patch - 39.6430601119 seconds

* The large-linear-read task (on an 8GiB file), average over 10 runs:
  o 2.6.24 - 290.32522 seconds
  o 2.6.24 + Arjan's patch - 386.34860 seconds



These are *large* differences, making this a very signifcant patch.  Much
care is needed now.

Could you expand a bit on what you're testing here?  I think that in one
process you're doing a continuous copy-a-kernel-tree and in the other
process you're the above three things, yes?
  


The test works like this:

  1. I ensure that the device under test (DUT) is set to run the CFQ
 scheduler.
1. It is a Fibre Channel 72GiB disk
2. Single partition...
  2. Put an Ext3 FS on the partition (mkfs.ext3 -b 4096)
  3. Mount the device, and then:
1. Put an 8GiB file on the new FS
2. Put 3 copies of a Linux tree (w/ objs & kernel & such) onto
   the FS in separate directories
  1. Note: I'm going to do runs with 6 copies to each
 directory tree to get to about 4.2GiB per directory tree
  4. Then, for each of the tests:
1. Remount the device (purge page cache by umount & then mount)
2. Start up a copy of 1 kernel tree to another tree (you hadn't
   specified if the copy in the background should be to a new
   area or not, so I'm just re-using the same area so we don't
   have to worry about removing the old). I keep doing the copy
   as long as the tests are going
3. Perform the test (10 times)

The tests are:

   * Linear read of a large file (8GiB)
   * Tree read (foreach file in the tree, dd it to /dev/null)
   * Overwrite of that large file: was doing 256KiB random direct
 read/writes, will go down to 4KiB read/writes as that is more
 realistic I'd guess (a sketch of that access pattern follows
 this list)
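
(For illustration only - this is not the actual overwrite tool used for
these runs; the file name, file size and iteration count below are made
up - the 4KiB random direct-I/O overwrite pattern boils down to roughly:)

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLKSZ     4096ULL
#define FILE_SIZE (8ULL << 30)			/* 8 GiB */

int main(void)
{
	void *buf;
	int i, fd = open("bigfile", O_RDWR | O_DIRECT | O_NOATIME);

	if (fd < 0 || posix_memalign(&buf, BLKSZ, BLKSZ) != 0)
		return 1;

	for (i = 0; i < 100000; i++) {
		/* pick a random 4KiB-aligned offset inside the file */
		off_t off = (off_t)(random() % (FILE_SIZE / BLKSZ)) * BLKSZ;

		if (pread(fd, buf, BLKSZ, off) != (ssize_t)BLKSZ)
			break;
		if (pwrite(fd, buf, BLKSZ, off) != (ssize_t)BLKSZ)
			break;
	}
	free(buf);
	close(fd);
	return 0;
}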

I'm going to try and get the comparisons done by tomorrow, the results 
should be very different due to the changes noted above (going to 4.2GiB 
trees instead of 700MiB, going to 4K instead of 256K read/writes). This 
may cause the runs to be much longer, and then I won't get it done as 
quickly...



I guess the other things we should look at are the impact on the
continuously-copy-a-kernel-tree process and also the overall IO throughput.
 These things will of course be related.  If the overall system-wide IO
throughput increases with the patch then we probably have a no-brainer.  If
(as I suspect) the overall IO throughput is decreased



[PATCH] Add UNPLUG traces to all appropriate places

2007-11-08 Thread Alan D. Brunelle

Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG. Previously, we saw a PLUG on each of the
underlying devices, and an UNPLUG on the volume. This patch ensures that
we see the UNPLUG calls for each of the underlying devices.

Signed-off-by: Alan D. Brunelle <[EMAIL PROTECTED]>
---
block/ll_rw_blk.c  |   24 +++-
drivers/md/bitmap.c|3 +--
drivers/md/dm-table.c  |3 +--
drivers/md/linear.c|3 +--
drivers/md/md.c|4 ++--
drivers/md/multipath.c |3 +--
drivers/md/raid0.c |3 +--
drivers/md/raid1.c |3 +--
drivers/md/raid10.c|3 +--
drivers/md/raid5.c |3 +--
include/linux/blkdev.h |1 +
11 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 75c98d5..37f8e9c 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1634,15 +1634,7 @@ static void blk_backing_dev_unplug(struct 
backing_dev_info *bdi,

{
struct request_queue *q = bdi->unplug_io_data;

-/*
- * devices don't necessarily have an ->unplug_fn defined
- */
-if (q->unplug_fn) {
-blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
-q->rq.count[READ] + q->rq.count[WRITE]);
-
-q->unplug_fn(q);
-}
+blk_unplug(q);
}

static void blk_unplug_work(struct work_struct *work)
@@ -1666,6 +1658,20 @@ static void blk_unplug_timeout(unsigned long data)
kblockd_schedule_work(&q->unplug_work);
}

+void blk_unplug(struct request_queue *q)
+{
+/*
+ * devices don't necessarily have an ->unplug_fn defined
+ */
+if (q->unplug_fn) {
+blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+q->rq.count[READ] + q->rq.count[WRITE]);
+
+q->unplug_fn(q);
+}
+}
+EXPORT_SYMBOL(blk_unplug);
+
/**
 * blk_start_queue - restart a previously stopped queue
 * @q:The  request_queue in question
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 7c426d0..1b1ef31 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1207,8 +1207,7 @@ int bitmap_startwrite(struct bitmap *bitmap, 
sector_t offset, unsigned long sect

prepare_to_wait(&bitmap->overflow_wait, &__wait,
TASK_UNINTERRUPTIBLE);
spin_unlock_irq(&bitmap->lock);
-bitmap->mddev->queue
-->unplug_fn(bitmap->mddev->queue);
+blk_unplug(bitmap->mddev->queue);
schedule();
finish_wait(&bitmap->overflow_wait, &__wait);
continue;
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 5a7eb65..e298d8d 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1000,8 +1000,7 @@ void dm_table_unplug_all(struct dm_table *t)
struct dm_dev *dd = list_entry(d, struct dm_dev, list);
struct request_queue *q = bdev_get_queue(dd->bdev);

-if (q->unplug_fn)
-q->unplug_fn(q);
+blk_unplug(q);
}
}

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 56a11f6..3dac1cf 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -87,8 +87,7 @@ static void linear_unplug(struct request_queue *q)

for (i=0; i < mddev->raid_disks; i++) {
struct request_queue *r_queue = 
bdev_get_queue(conf->disks[i].rdev->bdev);

-if (r_queue->unplug_fn)
-r_queue->unplug_fn(r_queue);
+blk_unplug(r_queue);
}
}

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 808cd95..cef9ebd 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5445,7 +5445,7 @@ void md_do_sync(mddev_t *mddev)
 * about not overloading the IO subsystem. (things like an
 * e2fsck being done on the RAID array should execute fast)
 */
-mddev->queue->unplug_fn(mddev->queue);
+blk_unplug(mddev->queue);
cond_resched();

currspeed = ((unsigned long)(io_sectors-mddev->resync_mark_cnt))/2
@@ -5464,7 +5464,7 @@ void md_do_sync(mddev_t *mddev)
 * this also signals 'finished resyncing' to md_stop
 */
 out:
-mddev->queue->unplug_fn(mddev->queue);
+blk_unplug(mddev->queue);

wait_event(mddev->recovery_wait, 
!atomic_read(&mddev->recovery_active));


diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index b35731c..eb631eb 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -125,8 +125,7 @@ static void unplug_slaves(mddev_t *mddev)
atomic_inc(>nr_pending);
rcu_read_unlock();

-if (r_queue->unplug_fn)
-r_queue->unplug_fn(r_queue);
+blk_unplug(r_queue);

rdev_dec_pending(rdev, mddev);
rcu_read_lock();
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index c05..f8e5917 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -35,8 +35,7 @@ static void raid0_unplug(struct request_queue *q)
for (i=0; i < mddev->raid_disks; i++) {
struct request_queue *r_queue = bdev_get_queue(devlist[i]->bdev


Re: Linux Kernel Markers - performance characterization with large IO load on large-ish system

2007-10-02 Thread Alan D. Brunelle
Mathieu Desnoyers wrote:
>> I do wonder about that performance _increase_ with blktrace enabled. I
>> remember that we have seen and discussed something like this before,
>> it's still a puzzle to me...
>>
> Interesting question indeed.
>
> In those tests, when blktrace is running, are the relay buffers only
> written to or they are also read ?
>

blktrace (the utility) was running too - so the relay buffers /were/
being read and stored out to disk elsewhere.

> Running the tests without consuming the buffers (in overwrite mode)
> would tell us more about the nature of the disturbance causing the
> performance increase.
>   

I'd have to write a utility to enable the traces, but then not read. Let
me think about that.
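
(A rough sketch only of such a utility - not something that exists today;
the device argument, buffer sizing and sleep below are placeholders. The
ioctls and struct come from <linux/fs.h> / <linux/blktrace_api.h>:)

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/blktrace_api.h>

int main(int argc, char **argv)
{
	struct blk_user_trace_setup buts;
	int fd;

	if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;			/* argv[1]: e.g. /dev/sds */

	memset(&buts, 0, sizeof(buts));
	buts.buf_size = 512 * 1024;		/* relay sub-buffer size */
	buts.buf_nr   = 4;			/* number of sub-buffers */
	buts.act_mask = 0xffff;			/* trace all action types */

	if (ioctl(fd, BLKTRACESETUP, &buts) < 0 ||
	    ioctl(fd, BLKTRACESTART) < 0)
		return 1;

	sleep(600);	/* run the workload; never read the relay files */

	ioctl(fd, BLKTRACESTOP);
	ioctl(fd, BLKTRACETEARDOWN);
	close(fd);
	return 0;
}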

> Also, a kernel trace could help us understand more thoroughly what is
> happening there.. is it caused by the scheduler ? memory allocation ?
> data cache alignment ?
>   

Yep - when I get some time, I'll look into that. [Clearly not a gating
issue for marker support...]

> I would suggest that you try aligning the block layer data structures
> accessed by blktrace on L2 cacheline size and compare the results (when
> blktrace is disabled).
>   

Again, when I get some time! :-)
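
(As an aside, what that alignment suggestion amounts to, shown with a
purely made-up userspace struct and a guessed 128-byte line size; the
in-kernel annotation for this is ____cacheline_aligned_in_smp from
<linux/cache.h>:)

#include <stddef.h>
#include <stdio.h>

#define L2_LINE 128	/* assumed L2 cacheline size for this box */

struct io_stats {
	unsigned long issued;				/* touched on submit */
	unsigned long completed
		__attribute__((aligned(L2_LINE)));	/* touched on completion */
};

int main(void)
{
	/* completed now starts on its own cacheline, so the two hot
	 * counters no longer share (and ping-pong) a line between CPUs */
	printf("offsetof(completed) = %zu, sizeof(struct io_stats) = %zu\n",
	       offsetof(struct io_stats, completed), sizeof(struct io_stats));
	return 0;
}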

Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Some IO scheduler cleanup in Documentation/block

2007-09-27 Thread Alan D. Brunelle


[PATCH] Some IO scheduler cleanup in Documentation/block

as-iosched.txt:
  o  Changed IO scheduler selection text to a reference to the
 switching-sched.txt file.

  o  Fixed typo: 'for up time...' -> 'for up to...'

  o  Added short description of the est_time file.

deadline-iosched.txt:
  o  Changed IO scheduler selection text to a reference to the
 switching-sched.txt file.

  o  Removed references to non-existent seek-cost and stream_unit.

  o  Fixed typo: 'write_starved' -> 'writes_starved'

switching-sched.txt:
  o  Added in boot-time argument to set the default IO scheduler. (From
 as-iosched.txt)

  o  Added in sysfs mount instructions. (From deadline-iosched.txt)

Signed-off-by: Alan D. Brunelle <[EMAIL PROTECTED]>
---
 Documentation/block/as-iosched.txt   |   21 +
 Documentation/block/deadline-iosched.txt |   23 +++
 Documentation/block/switching-sched.txt  |   21 +
 3 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/Documentation/block/as-iosched.txt b/Documentation/block/as-iosched.txt
index a598fe1..738b72b 100644
--- a/Documentation/block/as-iosched.txt
+++ b/Documentation/block/as-iosched.txt
@@ -20,15 +20,10 @@ actually has a head for each physical device in the logical RAID device.
 However, setting the antic_expire (see tunable parameters below) produces
 very similar behavior to the deadline IO scheduler.
 
-
 Selecting IO schedulers
 ---
-To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
-'noop', 'as' and 'cfq' (the default) are also available. IO schedulers are
-assigned globally at boot time only presently. It's also possible to change
-the IO scheduler for a determined device on the fly, as described in
-Documentation/block/switching-sched.txt.
-
+Refer to Documentation/block/switching-sched.txt for information on
+selecting an io scheduler on a per-device basis.
 
 Anticipatory IO scheduler Policies
 --
@@ -115,7 +110,7 @@ statistics (average think time, average seek distance) on the process
 that submitted the just completed request are examined.  If it seems
 likely that that process will submit another request soon, and that
 request is likely to be near the just completed request, then the IO
-scheduler will stop dispatching more read requests for up time (antic_expire)
+scheduler will stop dispatching more read requests for up to (antic_expire)
 milliseconds, hoping that process will submit a new request near the one
 that just completed.  If such a request is made, then it is dispatched
 immediately.  If the antic_expire wait time expires, then the IO scheduler
@@ -165,3 +160,13 @@ The parameters are:
 for big seek time devices though not a linear correspondence - most
 processes have only a few ms thinktime.
 
+In addition to the tunables above there is a read-only file named est_time
+which, when read, will show:
+
+- The probability of a task exiting without a cooperating task
+  submitting an anticipated IO.
+
+- The current mean think time.
+
+- The seek distance used to determine if an incoming IO is better.
+
diff --git a/Documentation/block/deadline-iosched.txt b/Documentation/block/deadline-iosched.txt
index be08ffd..0a89839 100644
--- a/Documentation/block/deadline-iosched.txt
+++ b/Documentation/block/deadline-iosched.txt
@@ -5,16 +5,10 @@ This little file attempts to document how the deadline io scheduler works.
 In particular, it will clarify the meaning of the exposed tunables that may be
 of interest to power users.
 
-Each io queue has a set of io scheduler tunables associated with it. These
-tunables control how the io scheduler works. You can find these entries
-in:
-
-/sys/block/<device>/queue/iosched
-
-assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted,
-you can do so by typing:
-
-# mount none /sys -t sysfs
+Selecting IO schedulers
+---
+Refer to Documentation/block/switching-sched.txt for information on
+selecting an io scheduler on a per-device basis.
 
 
 
@@ -41,14 +35,11 @@ fifo_batch
 
 When a read request expires its deadline, we must move some requests from
 the sorted io scheduler list to the block device dispatch queue. fifo_batch
-controls how many requests we move, based on the cost of each request. A
-request is either qualified as a seek or a stream. The io scheduler knows
-the last request that was serviced by the drive (or will be serviced right
-before this one). See seek_cost and stream_unit.
+controls how many requests we move.
 
 
-write_starved	(number of dispatches)
--
+writes_starved	(number of dispatches)
+--
 
 When we have to move requests from the io scheduler queue to the block
 device dispatch queue, we always give a preference to reads. However, we
diff --git a/Documentation/blo


Re: Linux Kernel Markers - performance characterization with large IO load on large-ish system

2007-09-26 Thread Alan D. Brunelle

Mathieu Desnoyers wrote:

* Alan D. Brunelle ([EMAIL PROTECTED]) wrote:
Taking Linux 2.6.23-rc6 + 2.6.23-rc6-mm1 as a basis, I took some sample 
runs of the following on both it and after applying Mathieu Desnoyers' 
11-patch sequence (19 September 2007).


   * 32-way IA64 + 132GiB + 10 FC adapters + 10 HP MSA 1000s (one 72GiB
 volume per MSA used)

   * 10 runs with each configuration, averages shown below
 o 2.6.23-rc6 + 2.6.23-rc6-mm1 without blktrace running
 o 2.6.23-rc6 + 2.6.23-rc6-mm1 with blktrace running
 o 2.6.23-rc6 + 2.6.23-rc6-mm1 + markers without blktrace running
 o 2.6.23-rc6 + 2.6.23-rc6-mm1 + markers with blktrace running

   * A run consists of doing the following in parallel:
 o Make an ext3 FS on each of the 10 volumes
 o Mount & unmount each volume
   + The unmounting generates a tremendous amount of writes
 to the disks - thus stressing the intended storage
 devices (10 volumes) plus the separate volume for all
 the blktrace data (when blk tracing is enabled).
   + Note the times reported below only cover the
 make/mount/unmount time - the actual blktrace runs
 extended beyond the times measured (took quite a while
 for the blk trace data to be output). We're only
 concerned with the impact on the "application"
 performance in this instance.

Results are:

Kernel                                 w/out BT   STDDEV    w/ BT      STDDEV
-------------------------------------  ---------  ------    ---------  ------
2.6.23-rc6 + 2.6.23-rc6-mm1            14.679982  0.34      27.754796  2.09
2.6.23-rc6 + 2.6.23-rc6-mm1 + markers  14.993041  0.59      26.694993  3.23



Interesting results, although we cannot say any of the solutions has much
impact due to the std dev.

Also, it could be interesting to add the "blktrace compiled out" as a
base line.

Thanks for running those tests,

Mathieu

Mathieu:

Here are the results from 6 different kernels (including ones with 
blktrace not configured in), this time performing 40 runs per kernel.


 o  All kernels start off with Linux 2.6.23-rc6 + 2.6.23-rc6-mm1

 o  '- bt cfg' or '+ bt cfg' means a kernel without or with blktrace 
configured respectively.


 o  '- markers' or '+ markers' means a kernel without or with the 
11-patch marker series respectively.


38 runs without blk traces being captured (dropped hi/lo value from 40 runs)

Kernel Options      Min val    Avg val    Max val    Std Dev
------------------  ---------  ---------  ---------  ---------
- markers - bt cfg  15.349127  16.169459  16.372980   0.184417
+ markers - bt cfg  15.280382  16.202398  16.409257   0.191861

- markers + bt cfg  14.464366  14.754347  16.052306   0.463665
+ markers + bt cfg  14.421765  14.644406  15.690871   0.233885

38 runs with blk traces being captured (dropped hi/lo value from 40 runs)

Kernel Options      Min val    Avg val    Max val    Std Dev
------------------  ---------  ---------  ---------  ---------
- markers + bt cfg  24.675859  28.480446  32.571484   1.713603
+ markers + bt cfg  18.713280  27.054927  31.684325   2.857186
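
The summary columns above come from a simple reduction of the per-run
wall-clock times: sort the 40 samples, drop the single highest and lowest,
and report min/avg/max/stddev over the remaining 38. A small sketch of that
reduction (input handling is illustrative; it reads one time per line from
stdin and prints the population standard deviation):

/*
 * Sketch: trimmed summary statistics for a set of run times.
 * Build with -lm for sqrt().
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
	double d = *(const double *)a - *(const double *)b;
	return (d > 0) - (d < 0);
}

int main(void)
{
	double t[1024], sum = 0.0, sumsq = 0.0, avg, var;
	int i, n = 0;

	while (n < 1024 && scanf("%lf", &t[n]) == 1)
		n++;
	if (n < 3)
		return 1;

	qsort(t, n, sizeof(t[0]), cmp);

	/* Drop the single lowest and highest runs, keep indices 1..n-2. */
	for (i = 1; i < n - 1; i++) {
		sum += t[i];
		sumsq += t[i] * t[i];
	}
	n -= 2;
	avg = sum / n;
	var = sumsq / n - avg * avg;

	printf("min %.6f  avg %.6f  max %.6f  stddev %.6f\n",
	       t[1], avg, t[n], sqrt(var > 0.0 ? var : 0.0));
	return 0;
}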

 o  It is not at all clear why running without blk trace configured
into the kernel runs slower than with blk trace configured in (a 9.6 to
10.6% reduction).

 o  The data is still not conclusive with respect to whether the marker
patches change performance characteristics when we're not gathering
traces. It appears that any change in performance is minimal at worst
for this test.

 o  The data so far still doesn't conclusively show a win in this case
even when we are capturing traces, although the average certainly seems
to be in its favor.
   
One concern that I should be able to deal easily with is the choice of 
the IO scheduler being used for both the volume being used to perform 
the test on, as well as the one used for storing blk traces (when 
enabled). Right now I am using the default CFQ, when perhaps NOOP or 
DEADLINE would be a better choice. If there is enough interest in seeing 
how that changes things I could try to get some runs in later this week.


Alan D. Brunelle
Hewlett-Packard / Open Source and Linux Organization / Scalability and 
Performance Group


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux Kernel Markers - performance characterization with large IO load on large-ish system

2007-09-25 Thread Alan D. Brunelle
Taking Linux 2.6.23-rc6 + 2.6.23-rc6-mm1 as a basis, I took some sample 
runs of the following on both it and after applying Mathieu Desnoyers' 
11-patch sequence (19 September 2007).


   * 32-way IA64 + 132GiB + 10 FC adapters + 10 HP MSA 1000s (one 72GiB
 volume per MSA used)

   * 10 runs with each configuration, averages shown below
 o 2.6.23-rc6 + 2.6.23-rc6-mm1 without blktrace running
 o 2.6.23-rc6 + 2.6.23-rc6-mm1 with blktrace running
 o 2.6.23-rc6 + 2.6.23-rc6-mm1 + markers without blktrace running
 o 2.6.23-rc6 + 2.6.23-rc6-mm1 + markers with blktrace running

   * A run consists of doing the following in parallel:
 o Make an ext3 FS on each of the 10 volumes
 o Mount & unmount each volume
   + The unmounting generates a tremendous amount of writes
 to the disks - thus stressing the intended storage
 devices (10 volumes) plus the separate volume for all
 the blktrace data (when blk tracing is enabled).
   + Note the times reported below only cover the
 make/mount/unmount time - the actual blktrace runs
 extended beyond the times measured (took quite a while
 for the blk trace data to be output). We're only
 concerned with the impact on the "application"
 performance in this instance.
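
A run like the one described above boils down to launching the per-volume
work in parallel and timing the whole batch; a minimal driver sketch in C
(volume and mount-point names are illustrative, error handling is minimal):

/*
 * Sketch: fork one child per volume; each child makes an ext3
 * filesystem, mounts it, and unmounts it.  The run time is the wall
 * clock until all children have finished.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_VOLS 10

int main(void)
{
	char cmd[256];
	int i;

	for (i = 0; i < NR_VOLS; i++) {
		if (fork() == 0) {
			snprintf(cmd, sizeof(cmd),
				 "mkfs.ext3 -q /dev/vol%d && "
				 "mount /dev/vol%d /mnt/vol%d && "
				 "umount /mnt/vol%d",
				 i, i, i, i);
			exit(system(cmd) ? 1 : 0);
		}
	}
	for (i = 0; i < NR_VOLS; i++)
		wait(NULL);
	return 0;
}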

Results are:

Kernel                                 w/out BT   STDDEV    w/ BT      STDDEV
-------------------------------------  ---------  ------    ---------  ------
2.6.23-rc6 + 2.6.23-rc6-mm1            14.679982  0.34      27.754796  2.09
2.6.23-rc6 + 2.6.23-rc6-mm1 + markers  14.993041  0.59      26.694993  3.23

It looks to be about a 2.1% increase in time to do the make/mount/unmount 
operations with the marker patches in place and no blktrace operations. 
With the blktrace operations in place we see about a 3.8% decrease in 
time to do the same ops.


When our Oracle benchmarking machine frees up, and when the 
marker/blktrace patches are more stable, we'll try to get some "real" 
Oracle benchmark runs done to gauge the impact of the marker changes on 
performance...


Alan D. Brunelle
Hewlett-Packard / Open Source and Linux Organization / Scalability and 
Performance Group


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Fix remap handling by blktrace

2007-08-07 Thread Alan D. Brunelle



This patch provides more information concerning REMAP operations on block
IOs. The additional information provides clearer details at the user level,
and supports post-processing analysis in btt.

o  Adds in partition remaps on the same device.
o  Fixed up the remap information in DM to be in the right order
o  Sent up mapped-from and mapped-to device information

Signed-off-by: Alan D. Brunelle <[EMAIL PROTECTED]>

---
 block/ll_rw_blk.c            |    4 ++++
 drivers/md/dm.c              |    4 ++--
 include/linux/blktrace_api.h |    3 ++-
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 8c2caff..a15845c 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -3047,6 +3047,10 @@ static inline void blk_partition_remap(struct bio *bio)
 
 		bio->bi_sector += p->start_sect;
 		bio->bi_bdev = bdev->bd_contains;
+
+		blk_add_trace_remap(bdev_get_queue(bio->bi_bdev), bio,
+bdev->bd_dev, bio->bi_sector,
+bio->bi_sector - p->start_sect);
 	}
 }
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 141ff9f..2120155 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -580,8 +580,8 @@ static void __map_bio(struct dm_target *ti, struct bio *clone,
 		/* the bio has been remapped so dispatch it */
 
 		blk_add_trace_remap(bdev_get_queue(clone->bi_bdev), clone,
-tio->io->bio->bi_bdev->bd_dev, sector,
-clone->bi_sector);
+tio->io->bio->bi_bdev->bd_dev,
+clone->bi_sector, sector);
 
 		generic_make_request(clone);
 	} else if (r < 0 || r == DM_MAPIO_REQUEUE) {
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index 90874a5..7b5d56b 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -105,7 +105,7 @@ struct blk_io_trace {
  */
 struct blk_io_trace_remap {
 	__be32 device;
-	u32 __pad;
+	__be32 device_from;
 	__be64 sector;
 };
 
@@ -272,6 +272,7 @@ static inline void blk_add_trace_remap(struct request_queue *q, struct bio *bio,
 		return;
 
 	r.device = cpu_to_be32(dev);
+	r.device_from = cpu_to_be32(bio->bi_bdev->bd_dev);
 	r.sector = cpu_to_be64(to);
 
	__blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r);
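
For post-processing tools such as btt, the payload now carries both devices,
with every field big-endian on the wire. A userspace sketch of the byte-order
handling when decoding it (the struct mirrors the one in the patch above; the
helper names and the sample values in main() are illustrative):

/*
 * Sketch: decoding the remap payload in a userspace post-processing
 * tool.  The layout mirrors struct blk_io_trace_remap in the patch
 * above; all fields are big-endian on the wire.
 */
#include <endian.h>
#include <stdint.h>
#include <stdio.h>

struct remap_payload {
	uint32_t device;	/* __be32 */
	uint32_t device_from;	/* __be32 */
	uint64_t sector;	/* __be64 */
};

/* Kernel dev_t packs a 12-bit major above a 20-bit minor. */
#define DEV_MAJOR(d)	((unsigned int)((d) >> 20))
#define DEV_MINOR(d)	((unsigned int)((d) & ((1U << 20) - 1)))

static void show_remap(const struct remap_payload *r)
{
	uint32_t dev  = be32toh(r->device);
	uint32_t from = be32toh(r->device_from);

	printf("device      = (%u,%u)\n", DEV_MAJOR(dev), DEV_MINOR(dev));
	printf("device_from = (%u,%u)\n", DEV_MAJOR(from), DEV_MINOR(from));
	printf("sector      = %llu\n",
	       (unsigned long long)be64toh(r->sector));
}

int main(void)
{
	/* Example payload as it would appear on the wire. */
	struct remap_payload r = {
		.device      = htobe32((8U << 20) | 16),
		.device_from = htobe32((253U << 20) | 0),
		.sector      = htobe64(34816ULL),
	};

	show_remap(&r);
	return 0;
}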


Re: CFQ IO scheduler patch series - AIM7 DBase results on a 16-way IA64

2007-05-21 Thread Alan D. Brunelle

Jens Axboe wrote:

On Mon, May 21 2007, Alan D. Brunelle wrote:
  

Jens Axboe wrote:


On Tue, May 01 2007, Alan D. Brunelle wrote:
 
  

Jens Axboe wrote:
   


On Mon, Apr 30 2007, Alan D. Brunelle wrote:
 
  
The results from a single run of an AIM7 DBase load on a 16-way ia64 
box (64GB RAM + 144 FC disks) showed a slight regression (~0.5%) by 
adding in this patch. (Graph can be found at   
http://free.linux.hp.com/~adb/cfq/cfq_dbase.png   ) It is only a single 
set of runs, on a single platform, but it is something to keep an eye 
on as the regression showed itself across the complete run.
   


Do you know if this regression is due to worse IO performance, or
increased system CPU usage?
 
  
We performed two point runs yesterday (20,000 and 50,000 tasks) and here 
are the results:


Kernel  Tasks   Jobs per Minute  %sys (avg)
------  ------  ---------------  ----------
2.6.21  20,000         60,831.1      39.83%
CFQ br  20,000         60,237.4      40.80%
                         -0.98%      +2.44%

2.6.21  50,000         60,881.6      40.43%
CFQ br  50,000         60,400.6      40.80%
                         -0.79%      +0.92%

So we're seeing a slight IO performance regression with a slight 
increase in %system with the CFQ branch. (A chart of the complete run 
values is up on  http://free.linux.hp.com/~adb/cfq/cfq_20k50k.png  ).
   


Alan, can you repeat that same run with this patch applied? It
reinstates the cfq lookup hash, which could account for increased system
utilization.
 
  

Hi Jens -

This test was performed over the weekend, results are updated on

http://free.linux.hp.com/~adb/cfq/cfq_dbase.png



Thanks a lot, Alan! So the cfq hash does indeed improve things a little,
that's a shame. I guess I'll just reinstate the hash lookup.

  
You're welcome Jens, but remember: It's one set of data; from one 
benchmark; on one architecture; on one platform...don't know if you 
should scrap the whole thing for that! :-) At the very least, I could 
look into trying it out on another architecture. Let me see what I can 
dig up...


Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CFQ IO scheduler patch series - AIM7 DBase results on a 16-way IA64

2007-05-21 Thread Alan D. Brunelle

Jens Axboe wrote:

On Tue, May 01 2007, Alan D. Brunelle wrote:
  

Jens Axboe wrote:


On Mon, Apr 30 2007, Alan D. Brunelle wrote:
  
The results from a single run of an AIM7 DBase load on a 16-way ia64 box 
(64GB RAM + 144 FC disks) showed a slight regression (~0.5%) by adding 
in this patch. (Graph can be found at   
http://free.linux.hp.com/~adb/cfq/cfq_dbase.png   ) It is only a single 
set of runs, on a single platform, but it is something to keep an eye on 
as the regression showed itself across the complete run.


Do you know if this regression is due to worse IO performance, or
increased system CPU usage?
  
We performed two point runs yesterday (20,000 and 50,000 tasks) and here 
are the results:


Kernel  Tasks   Jobs per Minute  %sys (avg)
------  ------  ---------------  ----------
2.6.21  20,000         60,831.1      39.83%
CFQ br  20,000         60,237.4      40.80%
                         -0.98%      +2.44%

2.6.21  50,000         60,881.6      40.43%
CFQ br  50,000         60,400.6      40.80%
                         -0.79%      +0.92%

So we're seeing a slight IO performance regression with a slight 
increase in %system with the CFQ branch. (A chart of the complete run 
values is up on  http://free.linux.hp.com/~adb/cfq/cfq_20k50k.png  ).



Alan, can you repeat that same run with this patch applied? It
reinstates the cfq lookup hash, which could account for increased system
utilization.
  


Hi Jens -

This test was performed over the weekend, results are updated on

http://free.linux.hp.com/~adb/cfq/cfq_dbase.png

Alan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CFQ IO scheduler patch series - AIM7 DBase results on a 16-way IA64

2007-05-01 Thread Alan D. Brunelle

Jens Axboe wrote:

On Mon, Apr 30 2007, Alan D. Brunelle wrote:
The results from a single run of an AIM7 DBase load on a 16-way ia64 box 
(64GB RAM + 144 FC disks) showed a slight regression (~0.5%) by adding 
in this patch. (Graph can be found at   
http://free.linux.hp.com/~adb/cfq/cfq_dbase.png   ) It is only a single 
set of runs, on a single platform, but it is something to keep an eye on 
as the regression showed itself across the complete run.


Do you know if this regression is due to worse IO performance, or
increased system CPU usage?
We performed two point runs yesterday (20,000 and 50,000 tasks) and here 
are the results:


Kernel  Tasks   Jobs per Minute  %sys (avg)
------  ------  ---------------  ----------
2.6.21  20,000         60,831.1      39.83%
CFQ br  20,000         60,237.4      40.80%
                         -0.98%      +2.44%

2.6.21  50,000         60,881.6      40.43%
CFQ br  50,000         60,400.6      40.80%
                         -0.79%      +0.92%

So we're seeing a slight IO performance regression with a slight 
increase in %system with the CFQ branch. (A chart of the complete run 
values is up on  http://free.linux.hp.com/~adb/cfq/cfq_20k50k.png  ).


Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CFQ IO scheduler patch series - AIM7 DBase results on a 16-way IA64

2007-04-30 Thread Alan D. Brunelle

Jens Axboe wrote:

On Mon, Apr 30 2007, Alan D. Brunelle wrote:
  
The results from a single run of an AIM7 DBase load on a 16-way ia64 box 
(64GB RAM + 144 FC disks) showed a slight regression (~0.5%) by adding 
in this patch. (Graph can be found at   
http://free.linux.hp.com/~adb/cfq/cfq_dbase.png   ) It is only a single 
set of runs, on a single platform, but it is something to keep an eye on 
as the regression showed itself across the complete run.



Do you know if this regression is due to worse IO performance, or
increased system CPU usage?

  
Unfortunately, the runs generate different X points - I'm going to try 
and get a second run with the same X-points, and then I can compare 
iostat results (these are being collected).


I do have some iostat data from these runs, and I am trying to make 
sense of them. But, with only about a 0.5% difference in performance, 
and different X values, not much can be gleaned. We'll see when the 
second run of a kernel can be done, and I'll get back to you on that.


Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


CFQ IO scheduler patch series - AIM7 DBase results on a 16-way IA64

2007-04-30 Thread Alan D. Brunelle
The results from a single run of an AIM7 DBase load on a 16-way ia64 box 
(64GB RAM + 144 FC disks) showed a slight regression (~0.5%) by adding 
in this patch. (Graph can be found at   
http://free.linux.hp.com/~adb/cfq/cfq_dbase.png   ) It is only a single 
set of runs, on a single platform, but it is something to keep an eye on 
as the regression showed itself across the complete run.


Alan D. Brunelle

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH linux-2.6-block.git] Fix blktrace trace ordering for plug branch

2007-04-27 Thread Alan D. Brunelle
The attached patch will correct the ordering of trace output between 
request queue insertions (I) and unplug calls (U). Right now the insert 
precedes the unplug, which just isn't right:


65,128  0167.699868965  7882  Q   R 0 + 1 [aiod]
65,128  0267.699876462  7882  G   R 0 + 1 [aiod]
65,128  0367.699878286  7882  P   W [aiod]
65,128  0467.699880491  7882  I   R 0 + 1 [aiod]
65,128  0567.699887589  7882  U   R [aiod] 1
65,128  0667.69989831754  D   R 0 + 1 [kblockd/0]
65,128  2  15367.700126590 0  C   R 0 + 1 [0]

With the patch provided the unplug comes first:

65,128  31 0.0  7045  Q   R 0 + 1 [aiod]
65,128  32 0.02295  7045  G   R 0 + 1 [aiod]
65,128  33 0.02617  7045  P   W [aiod]
65,128  34 0.03685  7045  U   R [aiod] 1
65,128  35 0.04107  7045  I   R 0 + 1 [aiod]
65,128  36 0.0949157  D   R 0 + 1 [kblockd/3]
65,128  21 0.000232447 0  C   R 0 + 1 [0]

Jens: If you agree, the patch can be applied to your plug branch for 
git://git.kernel.dk/data/git/linux-2.6-block.git


Thanks,
Alan
From: Alan D. Brunelle <[EMAIL PROTECTED]>

Fix unplug/insert trace inversion problem.

Signed-off-by: Alan D. Brunelle <[EMAIL PROTECTED]>
---
 block/ll_rw_blk.c      |    8 ++++----
 include/linux/blkdev.h |    1 +
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 46d29f7..3bec97f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -2981,6 +2981,7 @@ out_unlock:
 		if (bio_data_dir(bio) == WRITE && ioc->qrcu_idx == -1)
			ioc->qrcu_idx = qrcu_read_lock(&q->qrcu);
		list_add_tail(&req->queuelist, &ioc->plugged_list);
+		ioc->plugged_list_len++;
 	}
 
 out:
@@ -3720,7 +3721,6 @@ void blk_unplug_current(void)
 	struct io_context *ioc = current->io_context;
 	struct request *req;
 	request_queue_t *q;
-	int nr_unplug;
 
 	if (!ioc)
 		return;
@@ -3735,19 +3735,19 @@ void blk_unplug_current(void)
	if (list_empty(&ioc->plugged_list))
 		goto out;
 
-	nr_unplug = 0;
+	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, ioc->plugged_list_len);
+
 	spin_lock_irq(q->queue_lock);
 	do {
 		req = list_entry_rq(ioc->plugged_list.next);
		list_del_init(&req->queuelist);
 		add_request(q, req);
-		nr_unplug++;
	} while (!list_empty(&ioc->plugged_list));
+	ioc->plugged_list_len = 0;
 
 	spin_unlock_irq(q->queue_lock);
 
	queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, nr_unplug);
 
 out:
 	/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f8cdd44..848564c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -113,6 +113,7 @@ struct io_context {
 	 */
 	int plugged;
 	int qrcu_idx;
+	int plugged_list_len;
 	struct list_head plugged_list;
 	struct request_queue *plugged_queue;
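
The functional part of the change is emitting the unplug trace, with the
correct batch size, before the requests are re-inserted; plugged_list_len
just keeps that count available without re-walking the list. A tiny
userspace analogue of the pattern (names are illustrative, not the kernel's):

/*
 * Sketch: keep a length alongside a pending list so the "unplug" event
 * can report the batch depth up front, before the list is drained.
 */
#include <stdio.h>

struct node {
	struct node *next;
	int id;
};

struct plug_ctx {
	struct node *head;
	int len;
};

static void plug_add(struct plug_ctx *ctx, struct node *n)
{
	n->next = ctx->head;
	ctx->head = n;
	ctx->len++;
}

static void unplug(struct plug_ctx *ctx)
{
	/* Report the depth first, mirroring the trace call moving ahead
	 * of add_request() in the patch, then drain the list. */
	printf("unplug: %d requests\n", ctx->len);
	while (ctx->head) {
		struct node *n = ctx->head;

		ctx->head = n->next;
		printf("  dispatch request %d\n", n->id);
	}
	ctx->len = 0;
}

int main(void)
{
	struct node a = { NULL, 1 }, b = { NULL, 2 };
	struct plug_ctx ctx = { NULL, 0 };

	plug_add(&ctx, &a);
	plug_add(&ctx, &b);
	unplug(&ctx);
	return 0;
}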
 


Re: [PATCH 5/15] cfq-iosched: speed up rbtree handling

2007-04-26 Thread Alan D. Brunelle

Jens Axboe wrote:

On Wed, Apr 25 2007, Jens Axboe wrote:

On Wed, Apr 25 2007, Jens Axboe wrote:

On Wed, Apr 25 2007, Alan D. Brunelle wrote:

Hi Jens -

The attached patch speeds it up even more - I'm finding a >9% reduction 
in %system with no loss in IO performance. This just sets the cached 
element when the first is looked for.

Interesting, good thinking. It should not change the IO pattern, as the
end result should be the same. Thanks Alan, will commit!

I'll give elevator.c the same treatment, should be even more beneficial.
Stay tuned for a test patch.

Something like this, totally untested (it compiles). I initially wanted
to fold the cfq addon into the elevator.h provided implementation, but
that requires more extensive changes. Given how little code it is, I
think I'll keep them seperate.


Booted, seems to work fine for me. In a null ended IO test, I get about
a 1-2% speedup for a single queue of depth 64 using libaio. So it's
definitely worth it, will commit.

After longer runs last night, I think the patched elevator code /does/ 
help (albeit ever so slightly - about 0.6% performance improvement at a 
1.1% %system overhead).


 rkB_s      %system  Kernel
 ---------  -------  -----------------------------------------------------
 1022942.2     3.69  Original patch + fix to cfq_rb_first
 1029087.0     3.73  This patch stream (including fixes to elevator code)

Alan



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/15] cfq-iosched: speed up rbtree handling

2007-04-25 Thread Alan D. Brunelle

Hi Jens -

The attached patch speeds it up even more - I'm finding a >9% reduction 
in %system with no loss in IO performance. This just sets the cached 
element when the first is looked for.


Alan
From: Alan D. Brunelle <[EMAIL PROTECTED]>

Update cached leftmost every time it is found.

Signed-off-by: Alan D. Brunelle <[EMAIL PROTECTED]>
---

 block/cfq-iosched.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8093733..a86a7c3 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -388,10 +388,10 @@ cfq_choose_req(struct cfq_data *cfqd, struct request 
*rq1, struct request *rq2)
  */
 static struct rb_node *cfq_rb_first(struct cfq_rb_root *root)
 {
-   if (root->left)
-   return root->left;
+   if (!root->left)
+   root->left = rb_first(&root->rb);
 
-   return rb_first(&root->rb);
+   return root->left;
 }
 
 static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
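
The patch memoizes the leftmost node so repeated "first request" lookups are
O(1) until the cache is invalidated (e.g. when the cached node is erased). A
userspace sketch of the same caching idea over a plain array (illustrative,
not the kernel rbtree API):

/*
 * Sketch: cache the index of the minimum element, refreshing it lazily
 * only after it has been invalidated by a modification.
 */
#include <limits.h>
#include <stdio.h>

struct min_cache {
	int vals[64];
	int n;
	int cached_idx;		/* -1 means the cache is invalid */
};

static int first(struct min_cache *c)
{
	if (c->cached_idx < 0 && c->n > 0) {	/* slow path: walk once */
		int i, best = 0;

		for (i = 1; i < c->n; i++)
			if (c->vals[i] < c->vals[best])
				best = i;
		c->cached_idx = best;
	}
	return c->cached_idx >= 0 ? c->vals[c->cached_idx] : INT_MAX;
}

static void insert(struct min_cache *c, int v)
{
	c->vals[c->n++] = v;
	c->cached_idx = -1;	/* conservatively invalidate */
}

int main(void)
{
	struct min_cache c = { .n = 0, .cached_idx = -1 };

	insert(&c, 7);
	insert(&c, 3);
	insert(&c, 9);
	printf("first = %d\n", first(&c));	/* walks the array once */
	printf("first = %d\n", first(&c));	/* served from the cache */
	return 0;
}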


Re: [PATCH 0/15] CFQ IO scheduler patch series

2007-04-25 Thread Alan D. Brunelle
Using the patches posted yesterday 
(http://marc.info/?l=linux-kernel&m=117740312628325&w=2) here are some 
quick read results (as measured by iostat over a 5 minute period, taken 
in 6 second intervals) on a 4-way IA64 box with 42 disks (24 FC and 18 
U320), 42 processes (1 per disk) with 256 AIOs (16KB) outstanding at all 
times per device:


2.6.21-rc7:   1,006.023 MB/second
2.6.21-rc7 + new CFQ IO scheduler: 1,030.767 MB/second

showing about a 2.46% performance improvement with a 2.43% increase in 
%system used (3.738% -> 3.829%).


Interestingly enough this patch also seems to remove some noise during 
the run - see the chart at http://free.linux.hp.com/~adb/cfq/rkb_s.png


Alan D. Brunelle
HP / Open Source and Linux Organization / Scalability and Performance Group

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

