Re: IO queueing and complete affinity w/ threads: Some results
Taking a step back, I went to a very simple test environment:

 o 4-way IA64
 o 2 disks (on separate RAID controllers, handled by separate ports on the
   same FC HBA - generates different IRQs)
 o Write-cached tests - keep all IOs inside the RAID controller's cache, so
   no perturbations due to platter accesses

Basically:

 o CPU 0 handled IRQs for /dev/sds
 o CPU 2 handled IRQs for /dev/sdaa

We placed an IO generator on CPU 1 (for /dev/sds) and CPU 3 (for /dev/sdaa). The IO generator performed 4KiB sequential direct AIOs in a very small range (2MB - well within the controller cache on the external storage device). We have found that this is a simple way to maximize throughput, and thus to watch the system for effects without worrying about odd seek & other platter-induced issues. Each test took about 6 minutes to run (each ran a specific amount of IO, so we could compare & contrast system measurements).

First: overall performance

2.6.24 (no patches)              : 106.90 MB/sec
2.6.24 + original patches + rq=0 : 103.09 MB/sec
                            rq=1 :  98.81 MB/sec
2.6.24 + kthreads patches + rq=0 : 106.85 MB/sec
                            rq=1 : 107.16 MB/sec

So the kthreads patches work much better here - on par with or better than straight 2.6.24.

I also ran Caliper (akin to Oprofile; proprietary and ia64-specific, sorry) and looked at the cycles used. On ia64, back-end bubbles are deadly, and can be caused by cache misses &c. Looking at the gross data:

Kernel                             CPU_CYCLES         BACK_END_BUBBLES  100.0*(BEB/CC)
2.6.24 (no patches)              : 2,357,215,454,852  231,547,237,267    9.8%
2.6.24 + original patches + rq=0 : 2,444,895,579,790  242,719,920,828    9.9%
                            rq=1 : 2,551,175,203,455  148,586,145,513    5.8%
2.6.24 + kthreads patches + rq=0 : 2,359,376,156,043  255,563,975,526   10.8%
                            rq=1 : 2,350,539,631,362  208,888,961,094    8.9%

For both the original & kthreads patches we see a /significant/ drop in bubbles when setting rq=1 over rq=0.
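The last column is simply the bubble count as a fraction of total cycles. A quick sanity check of the baseline row, using the raw Caliper counts from the table above (just arithmetic on the reported numbers, not part of the original tooling):

```python
# Sanity-check the 100.0 * (BEB/CC) column for the baseline kernel,
# using the raw Caliper counts reported above.
cc  = 2_357_215_454_852   # CPU_CYCLES, 2.6.24 (no patches)
beb =   231_547_237_267   # BACK_END_BUBBLES, same run

bubble_pct = 100.0 * beb / cc
print(f"{bubble_pct:.1f}%")   # -> 9.8%
```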
This shows up in extra CPU cycles available (not spent in %system) - a graph is provided at http://free.linux.hp.com/~adb/jens/cached_mps.png - it shows the stats extracted from running mpstat in conjunction with the IO runs. Combining %sys & %soft IRQ, we see:

Kernel                             % user   % sys   % iowait  % idle
2.6.24 (no patches)              : 0.141%  10.088%   43.949%  45.819%
2.6.24 + original patches + rq=0 : 0.123%  11.361%   43.507%  45.008%
                            rq=1 : 0.156%   6.030%   44.021%  49.794%
2.6.24 + kthreads patches + rq=0 : 0.163%  10.402%   43.744%  45.686%
                            rq=1 : 0.156%   8.160%   41.880%  49.804%

The good news (I think) is that even with rq=0 the kthreads patches give on-par performance with 2.6.24, so the default case should be OK... I've only done a few runs by hand with this - these results are from one representative run of the bunch - but I believe this shows what this patch stream is intended to do: optimize placement of IO completion handling to minimize cache & TLB disruptions. Freeing up cycles in the kernel is always helpful! :-)

I'm going to try similar runs on an AMD64 w/ Oprofile and see what results I get there... (BTW: I'll be dropping testing of the original patch sequence; the kthreads patches look better in general - both in terms of code & results, coincidence?)

Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
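The "extra CPU cycles available" can be put in concrete terms by differencing the %sys column against the unpatched baseline (values taken from the mpstat table above; this is just arithmetic on the reported numbers):

```python
# Percentage points of system CPU freed by each rq=1 configuration,
# relative to the unpatched 2.6.24 baseline (%sys + %soft from mpstat).
baseline_sys = 10.088
sys_pct = {
    "original rq=1": 6.030,
    "kthreads rq=1": 8.160,
}

for name, pct in sys_pct.items():
    print(f"{name}: {baseline_sys - pct:+.3f} points of CPU freed")
```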
Re: IO queueing and complete affinity w/ threads: Some results
Comparative results between the original affinity patch and the kthreads-based patch on the 32-way running the kernel make sequence. It may be easier to compare/contrast with the graphs provided at http://free.linux.hp.com/~adb/jens/kernmk.png (kernmk.agr also provided, if you want to run xmgrace by hand).

Tests are:

1. Make Ext2 FS on each of 12 64GB devices in parallel; times include: mkfs, mount & unmount
2. Untar a full Linux source code tree onto the devices in parallel; times include: mount, untar, unmount
3. Make (-j4) of the full source code tree; times include: mount, make -j4, unmount
4. Clean full source code tree; times include: mount, make clean, unmount

The results are so close amongst all the runs (given the large-ish standard deviations) that we probably can't deduce much from this. A bit of a concern on the top two graphs - mkfs & untar - it certainly appears that the kthreads version is a little slower (about a 2.9% difference across the values for the mkfs runs, and 3.5% for the untar operations). On the make runs, however, we saw hardly any difference between the runs at all...

We are trying to set up some AIM7 tests on a different system over the weekend (15 February - 18 February 2008); I'll post those results on the 18th or 19th if we can pull it off. [I'll also try to steal time on the 32-way to run a straight 2.6.24 kernel, do these runs again, and post those results.]

For the tables below:

 o q0  == queue_affinity set to -1
 o q1  == queue_affinity set to the CPU managing the IRQ for each device
 o c0  == completion_affinity set to -1
 o c1  == completion_affinity set to the CPU managing the IRQ for each device
 o rq0 == rq_affinity set to 0
 o rq1 == rq_affinity set to 1

This 4-test sequence was run 10 times (for each kernel), and the results averaged.
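The tables below report Min/Avg/Max/Std Dev over the 10 runs. A minimal sketch of how such a summary can be produced (the run times here are made-up placeholders, not measured data, and it isn't stated whether the original used the sample or population form of the standard deviation - this sketch uses the sample form):

```python
import statistics

def summarize(times):
    """Return (min, avg, max, sample std dev), the four columns of the
    per-test tables."""
    return (min(times),
            statistics.mean(times),
            max(times),
            statistics.stdev(times))

# Hypothetical per-run times (seconds) for one config, e.g. q0.c0.rq0:
runs = [17.8, 30.3, 31.0, 32.1, 33.3]
mn, avg, mx, sd = summarize(runs)
print(f"{mn:.3f} {avg:.3f} {mx:.3f} {sd:.3f}")
```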
As posted yesterday, here are the original patch sequence results:

mkfs         Min      Avg      Max   Std Dev
---------  -------  -------  -------  -------
q0.c0.rq0   17.814   30.322   33.263    4.551
q0.c0.rq1   17.540   30.058   32.885    4.321
q0.c1.rq0   17.770   31.328   32.958    3.121
q1.c0.rq0   17.907   31.032   32.767    3.515
q1.c1.rq0   16.891   30.319   33.097    4.624

untar        Min      Avg      Max   Std Dev
---------  -------  -------  -------  -------
q0.c0.rq0   19.747   21.971   26.292    1.215
q0.c0.rq1   19.680   22.365   36.395    2.010
q0.c1.rq0   18.823   21.390   24.455    0.976
q1.c0.rq0   18.433   21.500   23.371    1.009
q1.c1.rq0   19.414   21.761   34.115    1.378

make         Min      Avg      Max   Std Dev
---------  -------  -------  -------  -------
q0.c0.rq0  527.418  543.296  552.030    5.384
q0.c0.rq1  526.265  542.312  549.477    5.467
q0.c1.rq0  528.935  544.940  553.823    4.746
q1.c0.rq0  529.432  544.399  553.212    5.166
q1.c1.rq0  527.638  543.577  551.323    5.478

clean        Min      Avg      Max   Std Dev
---------  -------  -------  -------  -------
q0.c0.rq0   16.962   20.308   33.775    3.179
q0.c0.rq1   17.436   20.156   29.370    3.097
q0.c1.rq0   17.061   20.111   31.504    2.791
q1.c0.rq0   16.745   20.247   29.327    2.953
q1.c1.rq0   17.346   20.316   31.178    3.283

And for the kthreads-based kernel:

mkfs         Min      Avg      Max   Std Dev
---------  -------  -------  -------  -------
q0.c0.rq0   16.686   31.069   33.361    3.452
q0.c0.rq1   16.976   31.719   32.869    2.395
q0.c1.rq0   16.857   31.345   33.410    3.209
q1.c0.rq0   17.317   31.997   34.444    3.099
q1.c1.rq0   16.791   32.266   33.378    2.035

untar        Min      Avg      Max   Std Dev
---------  -------  -------  -------  -------
q0.c0.rq0   19.769   22.398   25.196    1.076
q0.c0.rq1   19.742   22.517   38.498    1.733
q0.c1.rq0   20.071   22.698   36.160    2.259
q1.c0.rq0   19.910   22.377   35.640    1.528
q1.c1.rq0   19.448   22.339   24.887    0.926

make         Min      Avg      Max   Std Dev
---------  -------  -------  -------  -------
q0.c0.rq0  526.971  542.820  550.591    4.607
q0.c0.rq1  527.320  544.422  550.504    3.798
q0.c1.rq0  527.367  543.856  550.331    4.152
q1.c0.rq0  527.406  543.636  552.947    4.315
q1.c1.rq0  528.921  544.594  550.832    3.786

clean        Min      Avg      Max   Std Dev
---------  -------  -------  -------  -------
q0.c0.rq0   16.644   20.242   29.524    2.991
q0.c0.rq1   16.942   20.008   29.729    2.845
q0.c1.rq0   17.205   20.117   29.851    2.661
q1.c0.rq0   17.400   20.147   32.581    2.862
q1.c1.rq0   16.799   20.072   31.883    2.872
Re: IO queueing and complete affinity w/ threads: Some results
Alan D. Brunelle wrote:
> Hopefully, the first column is self-explanatory - these are the settings
> applied to the queue_affinity, completion_affinity and rq_affinity tunables.
> Due to the fact that the standard deviations are so large coupled with the
> very close average results, I'm not seeing anything in this set of tests to
> favor any of the combinations...

Not quite: Q or C = 0 really means Q or C set to -1 (default); Q or C = 1 means placing that thread on the CPU managing the IRQ. Sorry...

Alan
Re: IO queueing and complete affinity w/ threads: Some results
Back on the 32-way: in this set of tests we're running 12 disks spread out through the 8 cells of the 32-way. Each disk will have an Ext2 FS placed on it, a clean Linux kernel source untar()ed onto it, then a full make (-j4) and then a make clean performed. The 12 series are done in parallel - so each disk will have:

 o mkfs
 o tar x
 o make
 o make clean

performed. This was performed ten times, and the overall averages are presented below - note this is Jens' original patch sequence, NOT the kthreads one (those results available tomorrow, hopefully).

mkfs         Min      Avg      Max   Std Dev
---------  -------  -------  -------  -------
q0.c0.rq0   17.814   30.322   33.263    4.551
q0.c0.rq1   17.540   30.058   32.885    4.321
q0.c1.rq0   17.770   31.328   32.958    3.121
q1.c0.rq0   17.907   31.032   32.767    3.515
q1.c1.rq0   16.891   30.319   33.097    4.624

untar        Min      Avg      Max   Std Dev
---------  -------  -------  -------  -------
q0.c0.rq0   19.747   21.971   26.292    1.215
q0.c0.rq1   19.680   22.365   36.395    2.010
q0.c1.rq0   18.823   21.390   24.455    0.976
q1.c0.rq0   18.433   21.500   23.371    1.009
q1.c1.rq0   19.414   21.761   34.115    1.378

make         Min      Avg      Max   Std Dev
---------  -------  -------  -------  -------
q0.c0.rq0  527.418  543.296  552.030    5.384
q0.c0.rq1  526.265  542.312  549.477    5.467
q0.c1.rq0  528.935  544.940  553.823    4.746
q1.c0.rq0  529.432  544.399  553.212    5.166
q1.c1.rq0  527.638  543.577  551.323    5.478

clean        Min      Avg      Max   Std Dev
---------  -------  -------  -------  -------
q0.c0.rq0   16.962   20.308   33.775    3.179
q0.c0.rq1   17.436   20.156   29.370    3.097
q0.c1.rq0   17.061   20.111   31.504    2.791
q1.c0.rq0   16.745   20.247   29.327    2.953
q1.c1.rq0   17.346   20.316   31.178    3.283

Hopefully, the first column is self-explanatory - these are the settings applied to the queue_affinity, completion_affinity and rq_affinity tunables. The standard deviations are so large, coupled with very close average results, that I'm not seeing anything in this set of tests to favor any of the combinations...

As noted, I will have the machine run the kthreads variant of the patch stream tonight, and then I have to go back and run a non-patched kernel to see if there are any /regressions/.
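The shape of the per-disk measurement - run each stage in order and record its wall-clock time - can be sketched as below. This is an illustrative harness, not the actual scripts used; the commands shown in the comment are placeholders:

```python
import subprocess
import time

def time_stages(work_dir, stages):
    """Run each named stage (a shell command) in order in work_dir and
    record wall-clock durations, the measurement behind the tables above."""
    durations = {}
    for name, cmd in stages:
        t0 = time.perf_counter()
        subprocess.run(cmd, shell=True, check=True, cwd=work_dir)
        durations[name] = time.perf_counter() - t0
    return durations

# e.g. (placeholder commands, one such sequence per disk, run in parallel):
# time_stages("/mnt/disk0", [("mkfs", "mkfs.ext2 /dev/sdX"),
#                            ("untar", "tar xf linux.tar"),
#                            ("make", "make -j4"),
#                            ("clean", "make clean")])
```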
Alan
Re: IO queueing and complete affinity w/ threads: Some results
Whilst running a series of file system related loads on our 32-way*, I dropped down to a 16-way w/ only 24 disks, and ran two kernels: the original set of Jens' patches and then his subsequent kthreads-based set. Here are the results:

Original:

A Q C | MBPS    Avg Lat   StdDev | Q-local  Q-remote | C-local  C-remote
------+--------------------------+-------------------+------------------
X X X | 1850.4  0.413880  0.0109 |     0.0   55860.8 |     0.0   27946.9
X X A | 1850.6  0.413848  0.0106 |     0.0   55859.2 |     0.0   27946.1
X X I | 1850.6  0.413830  0.0107 |     0.0   55858.5 | 27945.8       0.0
------+--------------------------+-------------------+------------------
X A X | 1850.0  0.413949  0.0106 | 55843.7       0.0 |     0.0   27938.3
X A A | 1850.2  0.413931  0.0107 | 55844.2       0.0 |     0.0   27938.6
X A I | 1850.4  0.413862  0.0107 | 55854.3       0.0 | 27943.7       0.0
------+--------------------------+-------------------+------------------
X I X | 1850.9  0.413764  0.0107 |     0.0   55866.2 |     0.0   27949.6
X I A | 1850.5  0.413854  0.0108 |     0.0   55855.0 |     0.0   27944.0
X I I | 1850.4  0.413848  0.0105 |     0.0   55854.6 | 27943.8       0.0
======+==========================+===================+==================
I X X | 1570.7  0.487686  0.0142 |     0.0   47406.1 |     0.0   23719.5
I X A | 1570.8  0.487666  0.0143 |     0.0   47409.3 | 23721.2       0.0
I X I | 1570.8  0.487664  0.0142 |     0.0   47410.7 | 23721.8       0.0
------+--------------------------+-------------------+------------------
I A X | 1570.9  0.487642  0.0144 | 47412.2       0.0 |     0.0   23722.6
I A A | 1570.8  0.487647  0.0141 | 47411.2       0.0 | 23722.1       0.0
I A I | 1570.8  0.487651  0.0143 | 47410.8       0.0 | 23721.9       0.0
------+--------------------------+-------------------+------------------
I I X | 1570.8  0.487683  0.0142 | 47410.2       0.0 |     0.0   23721.6
I I A | 1571.1  0.487591  0.0146 | 47415.0       0.0 | 23724.0       0.0
I I I | 1571.0  0.487623  0.0143 | 47412.5       0.0 | 23722.8       0.0
======+==========================+===================+==================
rq=0  | 1726.7  0.443562  0.0120 | 52118.6       0.0 |  2138.6   23937.2
rq=1  | 1820.5  0.420729  0.0110 | 54938.2       0.0 |     0.0   27485.6

kthreads-based:

A Q C | MBPS    Avg Lat   StdDev | Q-local  Q-remote | C-local  C-remote
------+--------------------------+-------------------+------------------
X X X | 1850.5  0.413867  0.0107 |     0.0   55854.7 |     0.0   27943.8
X X A | 1850.9  0.413763  0.0107 |     0.0   55867.0 |     0.0   27950.0
X X I | 1850.3  0.413911  0.0109 |     0.0   55849.0 | 27941.0       0.0
------+--------------------------+-------------------+------------------
X A X | 1851.0  0.413730  0.0107 | 55871.4       0.0 |     0.0   27952.2
X A A | 1850.1  0.413919  0.0107 | 55845.5       0.0 |     0.0   27939.2
X A I | 1850.8  0.413789  0.0108 | 55864.8       0.0 | 27948.9       0.0
------+--------------------------+-------------------+------------------
X I X | 1850.5  0.413849  0.0107 |     0.0   55856.5 |     0.0   27944.8
X I A | 1850.6  0.413818  0.0108 |     0.0   55860.2 |     0.0   27946.6
X I I | 1850.8  0.413764  0.0108 |     0.0   55866.7 | 27949.8       0.0
======+==========================+===================+==================
I X X | 1570.9  0.487662  0.0145 |     0.0   47410.1 |     0.0   23721.6
I X A | 1570.7  0.487691  0.0142 |     0.0   47406.9 | 23720.0       0.0
I X I | 1570.7  0.487688  0.0141 |     0.0   47406.5 | 23719.8       0.0
------+--------------------------+-------------------+------------------
I A X | 1570.9  0.487661  0.0144 | 47415.4       0.0 |     0.0   23724.2
I A A | 1570.8  0.487648  0.0141 | 47409.1       0.0 | 23721.0       0.0
I A I | 1570.7  0.487667  0.0141 | 47406.1       0.0 | 23719.5       0.0
------+--------------------------+-------------------+------------------
I I X | 1570.8  0.487691  0.0142 | 47409.3       0.0 |     0.0   23721.2
I I A | 1570.9  0.487644  0.0142 | 47408.8       0.0 | 23720.9       0.0
I I I | 1570.6  0.487671  0.0141 | 47412.5       0.0 | 23722.8       0.0
======+==========================+===================+==================
rq=0  | 1742.1  0.439676  0.0118 | 52578.1       0.0 |  3602.6   22703.0
rq=1  | 1745.0  0.438918  0.0115 | 52666.3       0.0 |  3473.0   22876.6

For the first 18 sets on both kernels the results are very similar; the last two rq=0/1 sets are perturbed too much by application placement (I would guess). Have to think about that some more.

Alan

* What I'm doing on the 32-way is to compare and contrast mkfs, untar, kernel make & kernel clean times with different combinations of Q, C and RQ. [[This is currently with the "Jens original" patch; if things go well, I can do an overnight run with the kthreads-based patch.]]
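One way to quantify the application-placement effect in those last two rows is the fraction of completions that landed locally (C-local / (C-local + C-remote), values taken from the rq=0/rq=1 rows above; this is just arithmetic on the reported numbers):

```python
# Local-completion fraction for the scheduler-placed (rq=0/rq=1) runs,
# using the C-local / C-remote values from the 16-way tables above.
rows = {
    "original rq=0": (2138.6, 23937.2),
    "original rq=1": (0.0,    27485.6),
    "kthreads rq=0": (3602.6, 22703.0),
    "kthreads rq=1": (3473.0, 22876.6),
}
for name, (local, remote) in rows.items():
    pct = 100.0 * local / (local + remote)
    print(f"{name}: {pct:.1f}% of completions local")
```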
IO queueing and complete affinity w/ threads: Some results
The test case chosen may not be a very good start, but anyway, here are some initial test results with the "nasty arch bits". This was performed on a 32-way ia64 box with 1 terabyte of RAM and 144 FC disks (contained in 24 HP MSA1000 RAID controllers attached to 12 dual-port adapters). Each test case was run for 3 minutes. I had one application per device performing a large amount of direct/asynchronous large reads. Here's the table of results, with explanation below (results are for all 144 devices, either accumulated (MBPS) or averaged (other columns)):

A Q C | MBPS    Avg Lat   StdDev | Q-local  Q-remote | C-local  C-remote
------+--------------------------+-------------------+------------------
X X X | 3859.9  1.190067  0.0502 |     0.0   19484.7 |     0.0    9758.8
X X A | 3856.3  1.191220  0.0490 |     0.0   19467.2 |     0.0    9750.1
X X I | 3850.3  1.192992  0.0508 |     0.0   19437.3 |  9735.1       0.0
------+--------------------------+-------------------+------------------
X A X | 3853.9  1.191891  0.0503 | 19455.4       0.0 |     0.0    9744.2
X A A | 3853.5  1.191935  0.0507 | 19453.2       0.0 |     0.0    9743.1
X A I | 3856.6  1.191043  0.0512 | 19468.7       0.0 |  9750.8       0.0
------+--------------------------+-------------------+------------------
X I X | 3854.7  1.191674  0.0491 |     0.0   19459.8 |     0.0    9746.4
X I A | 3855.3  1.191434  0.0501 |     0.0   19461.9 |     0.0    9747.4
X I I | 3856.2  1.191128  0.0506 |     0.0   19466.6 |  9749.8       0.0
======+==========================+===================+==================
I X X | 3857.0  1.190987  0.0500 |     0.0   19471.9 |     0.0    9752.5
I X A | 3856.5  1.191082  0.0496 |     0.0   19469.4 |  9751.2       0.0
I X I | 3853.7  1.191938  0.0500 |     0.0   19456.2 |  9744.6       0.0
------+--------------------------+-------------------+------------------
I A X | 3854.8  1.191675  0.0502 | 19461.5       0.0 |     0.0    9747.2
I A A | 3855.1  1.191464  0.0503 | 19464.0       0.0 |  9748.5       0.0
I A I | 3854.9  1.191627  0.0483 | 19461.7       0.0 |  9747.4       0.0
------+--------------------------+-------------------+------------------
I I X | 3853.4  1.192070  0.0484 | 19454.8       0.0 |     0.0    9743.9
I I A | 3852.2  1.192403  0.0502 | 19448.5       0.0 |  9740.8       0.0
I I I | 3854.0  1.191822  0.0499 | 19457.9       0.0 |  9745.5       0.0
======+==========================+===================+==================
rq=0  | 3854.8  1.191680  0.0480 | 19459.7       0.0 |   202.9    9543.5
rq=1  | 3854.0  1.191965  0.0483 | 19457.0       0.0 |   403.1    9341.9

The variables being played with:

 o 'A' - When set to X, the application was placed on a CPU other than the
   one handling IRQs for the device (in another cell).
 o 'Q' - When set to X, queue affinity was placed in another cell from the
   application OR completion OR IRQ; when set to 'A' it was pegged onto the
   same CPU as the application; when set to 'I' it was set to the CPU that
   was managing the IRQ for its device.
 o 'C' - Likewise for the completion affinity: 'X' means on another cell
   besides the one containing the application or the queueing or the IRQ
   handling CPU, 'A' means put on the same CPU as the application, and 'I'
   means put on the same CPU as the IRQ handler.
 o For the last two rows, we set Q == C == -1 and let the application go to
   any CPU (as dictated by the scheduler). Then we had 'rq_affinity' set to
   0 or 1.

The resulting columns include:

 o MBPS - Total megabytes per second (so we're seeing about 3.8 gigabytes
   per second for the system)
 o Avg Lat - Average per-IO measured latency in seconds (note: I had
   upwards of 128 x 256K IOs going on per device across the system)
 o StdDev - Average standard deviation across the devices
 o Q-local & Q-remote - average number of queue operations handled locally
   and remotely, respectively (average per device)
 o C-local & C-remote - average number of completion operations handled
   locally and remotely, respectively (average per device)

As noted above, I'm not so sure this is the best test case - it's rather artificial. I was hoping to see some differences based upon affinitization, but whilst there appear to be some trends, the results are so close (0.2% difference from best to worst case MBPS, and the standard deviations on the latencies are +/- within the groups) that I doubt there is anything definitive. Unfortunately, most of the disks are being used for real data right now, so I can't perform significant write tests (with file systems in place, say) which would be more real-worldly.
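Applying a given Q/C/rq combination comes down to writing the tunables for each device. A minimal sketch of that step - the tunable names (queue_affinity, completion_affinity, rq_affinity) come from Jens' patch set, but the exact sysfs location is an assumption here, so treat the paths as illustrative:

```python
from pathlib import Path

def set_block_tunable(sysfs_queue_dir, name, value):
    """Write one affinity tunable (e.g. rq_affinity) for a block device.
    The sysfs layout used here is an assumption, not confirmed by the
    original posts."""
    Path(sysfs_queue_dir, name).write_text(f"{value}\n")

# e.g., for /dev/sds with its IRQ handled on CPU 0 (hypothetical paths):
# set_block_tunable("/sys/block/sds/queue", "queue_affinity", 0)
# set_block_tunable("/sys/block/sds/queue", "completion_affinity", -1)
# set_block_tunable("/sys/block/sds/queue", "rq_affinity", 1)
```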
I do have access to about 24 of the disks, so I will try to place file systems on those and do some tests. [I won't be able to use XFS without going through some hoops - it's a Red Hat installation right now, and they don't support XFS out of the box...] BTW: The Q/C local/remote columns were put in place to make sure that I had things set up right, and for the first 18 cases I think they look
Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)
Jens Axboe wrote:
> Hi,
>
> Here's a variant using kernel threads only, the nasty arch bits are then
> not needed. Works for me, no performance testing (that's a hint for Alan
> to try and queue up some testing for this variant as well :-)

I'll get to that, working my way through the first batch of testing on a NUMA platform. [[If anybody has ideas on specific testing to do, that would be helpful.]] I do plan on running some AIM7 tests, as those have shown improvement in other types of affinity changes in the kernel, and some of them have "interesting" IO load characteristics.

Alan
Re: [PATCH 0/8] IO queuing and complete affinity
Jens Axboe wrote: > Hi, > > Since I'll be on vacation next week, I thought I'd send this out in > case people wanted to play with it. It works here, but I haven't done > any performance numbers at all. > > Patches 1-7 are all preparation patches for #8, which contains the > real changes. I'm not particularly happy with the arch implementation > for raising a softirq on another CPU, but it should be fast enough > so suffice for testing. > > Anyway, this patchset is mainly meant as a playground for testing IO > affinity. It allows you to set three values per queue, see the files > in the /sys/block//queue directory: > > completion_affinity > Only allow completions to happen on the defined CPU mask. > queue_affinity > Only allow queuing to happen on the defined CPU mask. > rq_affinity > Always complete a request on the same CPU that queued it. > > As you can tell, there's some overlap to allow for experimentation. > rq_affinity will override completion_affinity, so it's possible to > have completions on a CPU that isn't set in that mask. The interface > is currently limited to all CPUs or a specific CPU, but the implementation > is supports (and works with) cpu masks. The logic is in > blk_queue_set_cpumask(), it should be easy enough to change this to > echo a full mask, or allow OR'ing of CPU masks when a new CPU is passed in. > For now, echo a CPU number to set that CPU, or use -1 to set all CPUs. > The default is all CPUs for no change in behaviour. > > Patch set is against current git as of this morning. The code is also in > the block git repo, branch is io-cpu-affinity. 
> > git://git.kernel.dk/linux-2.6-block.git io-cpu-affinity

FYI: on a kernel with this patch set, running on a 4-way ia64 (non-NUMA) w/ a FC disk, I crafted a test with 135 combinations:

o Having the issuing application pegged on each CPU - or - left alone (run
  on any CPU), yields 5 possibilities
o Having the queue affinity on each CPU, or any (-1), yields 5 possibilities
o Having the completion affinity on each CPU, or any (-1), yields 5
  possibilities

and

o Having the issuing application pegged on each CPU - or - left alone (run
  on any CPU), yields 5 possibilities
o Having rq_affinity set to 0 or 1, yields 2 possibilities.

Each test was for 10 minutes, and ran overnight just fine. The difference amongst the 135 resulting values (based upon per-IO latency seen at the application layer) was <<1% (0.32% to be exact). This would seem to indicate that there isn't a penalty for running with this code, and it seems relatively stable given this. The application used was doing 64KiB asynchronous direct reads, and had a minimum average per-IO latency of 42.426310 milliseconds, an average of 42.486557 milliseconds (std dev of 0.0041561), and a max of 42.561360 milliseconds. I'm going to do some runs on a 16-way NUMA box, w/ a lot of disks today, to see if we see gains in that environment.

Alan D. Brunelle
HP OSLO SP
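The 135-combination count above (5 x 5 x 5 affinity cases plus 5 x 2 rq_affinity cases) can be sanity-checked with a few lines of Python:

```python
from itertools import product

# "any" (left alone / -1) plus each of the 4 CPUs on the test box
placements = ["any", 0, 1, 2, 3]

# application placement x queue affinity x completion affinity
affinity_cases = list(product(placements, placements, placements))  # 5*5*5 = 125

# application placement x rq_affinity (with Q == C == -1)
rq_cases = list(product(placements, [0, 1]))                        # 5*2 = 10

print(len(affinity_cases) + len(rq_cases))  # 135
```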
Re: 2.6.24 regression w/ QLA2300
Andrew Vasquez wrote: > On Tue, 05 Feb 2008, Andrew Vasquez wrote: > >> On Tue, 05 Feb 2008, Alan D. Brunelle wrote: >> >>> commit 9b73e76f3cf63379dcf45fcd4f112f5812418d0a >>> Merge: 50d9a12... 23c3e29... >>> Author: Linus Torvalds <[EMAIL PROTECTED]> >>> Date: Fri Jan 25 17:19:08 2008 -0800 >>> >>> Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 >>> >>> * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: >>> (200 commits) >>> >>> I believe a regression was introduced. I'm running on a 4-way IA64, >>> with straight 2.6.24 and 2 dual-port cards: >>> >>> 40:01.0 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03) >>> 40:01.1 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03) >>> c0:01.0 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03) >>> c0:01.1 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03) >>> >>> the adapters failed initialization. In particular, I narrowed it down >>> to failing the qla2x00_mbox_command call within qla2x00_init_firmware >>> function. I went and removed the qla2x00-related parts of this (large-ish) >>> merge, and the 4 ports initialized just fine. >> Could you load the (default 2.6.24) driver with >> ql2xextended_error_logging modules parameter set: >> >> # insmod qla2xxx ql2xextended_error_logging=1 >> >> and send the resultant kernel logs? > > Could you try the patch referenced here: > > qla2xxx: Correct issue where incorrect init-fw mailbox command was used on > non-NPIV capable ISPs. > http://article.gmane.org/gmane.linux.scsi/38240 > > Thanks, av The referenced patch worked fine Andrew, thanks much! Alan
Re: 2.6.24 regression w/ QLA2300
Andrew Vasquez wrote: > On Tue, 05 Feb 2008, Alan D. Brunelle wrote: > >> commit 9b73e76f3cf63379dcf45fcd4f112f5812418d0a >> Merge: 50d9a12... 23c3e29... >> Author: Linus Torvalds <[EMAIL PROTECTED]> >> Date: Fri Jan 25 17:19:08 2008 -0800 >> >> Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 >> >> * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (200 >> commits) >> >> I believe a regression was introduced. I'm running on a 4-way IA64, >> with straight 2.6.24 and 2 dual-port cards: >> >> 40:01.0 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03) >> 40:01.1 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03) >> c0:01.0 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03) >> c0:01.1 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 03) >> >> the adapters failed initialization. In particular, I narrowed it down >> to failing the qla2x00_mbox_command call within qla2x00_init_firmware >> function. I went and removed the qla2x00-related parts of this (large-ish) >> merge, and the 4 ports initialized just fine. > > Could you load the (default 2.6.24) driver with > ql2xextended_error_logging modules parameter set: > > # insmod qla2xxx ql2xextended_error_logging=1 > > and send the resultant kernel logs? Here's the output to the console (if there are other logs you need, let me know). I'll try the patch next, and sorry, hadn't realized merges were still coming in under 2.6.24 in Linus' tree... QLogic Fibre Channel HBA Driver ACPI: PCI Interrupt :40:01.0[A] -> GSI 38 (level, low) -> IRQ 58 qla2xxx :40:01.0: Found an ISP2312, irq 58, iobase 0xc000a0041000 qla2xxx :40:01.0: Configuring PCI space... qla2x00_get_flash_version(): Unrecognized code type ff at pcids da1c. qla2x00_get_flash_version(): Unrecognized code type ff at pcids 1f61c. qla2xxx :40:01.0: Configure NVRAM parameters... qla2xxx :40:01.0: Verifying loaded RISC code... 
scsi(14): Load RISC code scsi(14): Verifying Checksum of loaded RISC code. scsi(14): Checksum OK, start firmware. qla2xxx :40:01.0: Allocated (412 KB) for firmware dump... scsi(14): Issue init firmware. qla2x00_mailbox_command(14): FAILED. mbx0=4001, mbx1=0, mbx2=ba8a, cmd=48 qla2x00_init_firmware(14): failed=102 mb0=4001. scsi(14): Init firmware FAILED . qla2xxx :40:01.0: Failed to initialize adapter scsi(14): Failed to initialize adapter - Adapter flags 10. ACPI: PCI Interrupt :40:01.1[B] -> GSI 39 (level, low) -> IRQ 59 qla2xxx :40:01.1: Found an ISP2312, irq 59, iobase 0xc000a004 qla2xxx :40:01.1: Configuring PCI space... qla2x00_get_flash_version(): Unrecognized code type ff at pcids da1c. qla2x00_get_flash_version(): Unrecognized code type ff at pcids 1f61c. qla2xxx :40:01.1: Configure NVRAM parameters... qla2xxx :40:01.1: Verifying loaded RISC code... scsi(15): Load RISC code scsi(15): Verifying Checksum of loaded RISC code. scsi(15): Checksum OK, start firmware. qla2xxx :40:01.1: Allocated (412 KB) for firmware dump... scsi(15): Issue init firmware. qla2x00_mailbox_command(15): FAILED. mbx0=4001, mbx1=0, mbx2=bac6, cmd=48 qla2x00_init_firmware(15): failed=102 mb0=4001. scsi(15): Init firmware FAILED . qla2xxx :40:01.1: Failed to initialize adapter scsi(15): Failed to initialize adapter - Adapter flags 10. ACPI: PCI Interrupt :c0:01.0[A] -> GSI 71 (level, low) -> IRQ 60 qla2xxx :c0:01.0: Found an ISP2312, irq 60, iobase 0xc000e0041000 qla2xxx :c0:01.0: Configuring PCI space... qla2x00_get_flash_version(): Unrecognized code type ff at pcids c61c. qla2x00_get_flash_version(): Unrecognized code type ff at pcids 1da1c. qla2xxx :c0:01.0: Configure NVRAM parameters... qla2xxx :c0:01.0: Verifying loaded RISC code... scsi(16): Load RISC code scsi(16): Verifying Checksum of loaded RISC code. scsi(16): Checksum OK, start firmware. qla2xxx :c0:01.0: Allocated (412 KB) for firmware dump... scsi(16): Issue init firmware. qla2x00_mailbox_command(16): FAILED. 
mbx0=4001, mbx1=0, mbx2=bae3, cmd=48 qla2x00_init_firmware(16): failed=102 mb0=4001. scsi(16): Init firmware FAILED . qla2xxx :c0:01.0: Failed to initialize adapter scsi(16): Failed to initialize adapter - Adapter flags 10. ACPI: PCI Interrupt :c0:01.1[B] -> GSI 72 (level, low) -> IRQ 61 qla2xxx :c0:01.1: Found an ISP2312, irq 61, iobase 0xc000e004 qla2xxx :c0:01.1: Configuring PCI space... qla2x00_get_flash_version(): Unrecognized code type ff at pcids c61c. qla2x00_get_flash_version(): Unrecognized code type ff at pcids 1da1c. qla2xxx :c0:01.1: Configure NVRAM parameters... qla2xxx :c0:01.1: Verifying loaded RISC c
[PATCH] Moved UNPLUG traces to match 1-to-1 with PLUG traces
Currently, with DM (and probably MD) we can receive streams of multiple PLUG and/or UNPLUG traces on the lower devices:

  8,32   1   910438    25.383725302 12843  P   N [mkfs.ext2]
  8,32   1   911627    25.385613612 12843  P   N [mkfs.ext2]
  8,32   1   911819    25.385931255 12843  P   N [mkfs.ext2]
  8,32   1   913176    25.388396840 12843  P   N [mkfs.ext2]
  8,32   1   914170    25.391634524 12843  P   N [mkfs.ext2]
  8,32   1   915239    25.393325078 12843  P   N [mkfs.ext2]
  8,32   1   915473    25.397930230     0  UT  N [swapper] 18
  8,32   1   915474    25.397936145    32  U   N [kblockd/1] 18
  8,32   1   915497    25.594446953 12843  P   N [mkfs.ext2]
  8,32   1   916806    25.596543309 12843  P   N [mkfs.ext2]
  8,32   1   918351    25.599276485 12843  P   N [mkfs.ext2]
  8,32   1   918377    25.599313544 12843  U   N [mkfs.ext2] 9

The PLUG traces are "protected" by the test-and-clear functionality in blk_plug_device; however, the UNPLUG traces have no such protection. And in the case where MD or DM were involved, the upper-level dev as well as the lower-level devs would both go through blk_unplug, which would generate extra UNPLUGs. With the proposed change, I only see a good one-to-one mapping of PLUG and UNPLUG traces on the underlying devices. However, we no longer see the UNPLUG traces on the MD or DM devices, which one could argue makes sense because (a) those devices don't have request queues managed by the block layer, and thus (b) they never had any notion of having been plugged. (So we saw UNPLUG traces on devices that never had PLUG traces.)
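A toy model (plain Python, illustration only - not the kernel code) of why moving the UNPLUG trace behind the same test-and-clear that guards the PLUG trace restores the one-to-one pairing:

```python
class ToyQueue:
    """Toy model of block-layer plugging: compare an unguarded UNPLUG
    trace with one emitted only when the plug flag is actually cleared."""

    def __init__(self, guarded):
        self.guarded = guarded
        self.plugged = False
        self.traces = []

    def plug(self):
        # like blk_plug_device: trace only on the False -> True transition
        if not self.plugged:
            self.plugged = True
            self.traces.append("P")

    def unplug(self):
        if self.guarded:
            # patched behaviour: trace only when we really clear the plug
            if self.plugged:
                self.plugged = False
                self.traces.append("U")
        else:
            # old behaviour: every unplug call traces, plugged or not
            self.plugged = False
            self.traces.append("U")

old, new = ToyQueue(guarded=False), ToyQueue(guarded=True)
for q in (old, new):
    q.plug()
    q.plug()      # repeated plug: the test-and-clear keeps it to one P
    q.unplug()
    q.unplug()    # e.g. the DM device and the lower device both unplugging

print(old.traces)  # ['P', 'U', 'U']  -- a U with no matching P
print(new.traces)  # ['P', 'U']       -- one-to-one P/U pairing
```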
A similar stream with the new patch:

  8,32   1   882908    24.179721271  7539  P   N [mkfs.ext2]
  8,32   1   884121    24.182232467  7539  U   N [mkfs.ext2] 10
  8,32   1   884129    24.182305789  7539  P   N [mkfs.ext2]
  8,32   1   885478    24.184748842  7539  U   N [mkfs.ext2] 14
  8,32   1   885487    24.184791013  7539  P   N [mkfs.ext2]
  8,32   1   886336    24.186479185  7539  U   N [mkfs.ext2] 15
  8,32   1   886343    24.186516024  7539  P   N [mkfs.ext2]
  8,32   1   886414    24.186637771  7539  U   N [mkfs.ext2] 14
  8,32   1   886420    24.186649173  7539  P   N [mkfs.ext2]
  8,32   1   886726    24.193329121     0  UT  N [swapper] 15
  8,32   1   886727    24.193336428    32  U   N [kblockd/1] 15
  8,32   1   886748    24.354771821  7539  P   N [mkfs.ext2]
  8,32   1   888081    24.357279380  7539  U   N [mkfs.ext2] 4
  8,32   1   888090    24.357323899  7539  P   N [mkfs.ext2]
  8,32   1   888934    24.358969886  7539  U   N [mkfs.ext2] 5
  8,32   1   888942    24.359019161  7539  P   N [mkfs.ext2]
  8,32   1   890147    24.361314613  7539  U   N [mkfs.ext2] 8

The proposed patch was tested with a 2.6.22-based kernel, and compile-tested with a 2.6.24-based tree from 31 January 2008 (85004cc367abc000aa36c0d0e270ab609a68b0cb).

Signed-off-by: Alan D.
Brunelle <[EMAIL PROTECTED]>
---
 block/blk-core.c |   12 ++++--------
 1 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8ff9944..1d148b5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -218,6 +218,9 @@ int blk_remove_plug(struct request_queue *q)
 	if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
 		return 0;

+	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+			q->rq.count[READ] + q->rq.count[WRITE]);
+
 	del_timer(&q->unplug_timer);
 	return 1;
 }
@@ -271,9 +274,6 @@ void blk_unplug_work(struct work_struct *work)
 	struct request_queue *q = container_of(work, struct request_queue,
 						unplug_work);

-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
-			q->rq.count[READ] + q->rq.count[WRITE]);
-
 	q->unplug_fn(q);
 }

@@ -292,12 +292,8 @@ void blk_unplug(struct request_queue *q)
 	/*
 	 * devices don't necessarily have an ->unplug_fn defined
 	 */
-	if (q->unplug_fn) {
-		blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
-				q->rq.count[READ] + q->rq.count[WRITE]);
-
+	if (q->unplug_fn)
 		q->unplug_fn(q);
-	}
 }
 EXPORT_SYMBOL(blk_unplug);
--
1.5.2.5
[PATCH] Moved UNPLUG traces to match 1-to-1 with PLUG traces
Currently, with DM (and probably MD) we can receive streams of multiple PLUG and/or UNPLUG traces on the lower devices: 8,32 1 91043825.383725302 12843 P N [mkfs.ext2] 8,32 1 91162725.385613612 12843 P N [mkfs.ext2] 8,32 1 91181925.385931255 12843 P N [mkfs.ext2] 8,32 1 91317625.388396840 12843 P N [mkfs.ext2] 8,32 1 91417025.391634524 12843 P N [mkfs.ext2] 8,32 1 91523925.393325078 12843 P N [mkfs.ext2] 8,32 1 91547325.397930230 0 UT N [swapper] 18 8,32 1 91547425.39793614532 U N [kblockd/1] 18 8,32 1 91549725.594446953 12843 P N [mkfs.ext2] 8,32 1 91680625.596543309 12843 P N [mkfs.ext2] 8,32 1 91835125.599276485 12843 P N [mkfs.ext2] 8,32 1 91837725.599313544 12843 U N [mkfs.ext2] 9 The PLUG traces are protected by the test-and-clear functionality in blk_plug_device, however the UNPLUG traces has no such protection. And in the case where MD or DM were involved, the upper level dev as well as the lower level devs would both go through blk_unplug which would generate extra UNPLUGS. With the proposed change, I only see a good one-to-one mapping of PLUG and UNPLUG traces on the underlying devices. However, we no longer see the UNPLUG traces on the MD or DM devices, which one could argue makes sense because (a) those devices don't have request queues managed by the block layer, and thus (b) they never had any notion of having been plugged. (So we saw UNPLUG traces on devices that never had PLUG traces.) 
A similar stream with the new patch: 8,32 1 88290824.179721271 7539 P N [mkfs.ext2] 8,32 1 88412124.182232467 7539 U N [mkfs.ext2] 10 8,32 1 88412924.182305789 7539 P N [mkfs.ext2] 8,32 1 88547824.184748842 7539 U N [mkfs.ext2] 14 8,32 1 88548724.184791013 7539 P N [mkfs.ext2] 8,32 1 88633624.186479185 7539 U N [mkfs.ext2] 15 8,32 1 88634324.186516024 7539 P N [mkfs.ext2] 8,32 1 88641424.186637771 7539 U N [mkfs.ext2] 14 8,32 1 88642024.186649173 7539 P N [mkfs.ext2] 8,32 1 88672624.193329121 0 UT N [swapper] 15 8,32 1 88672724.19333642832 U N [kblockd/1] 15 8,32 1 88674824.354771821 7539 P N [mkfs.ext2] 8,32 1 88808124.357279380 7539 U N [mkfs.ext2] 4 8,32 1 88809024.357323899 7539 P N [mkfs.ext2] 8,32 1 88893424.358969886 7539 U N [mkfs.ext2] 5 8,32 1 88894224.359019161 7539 P N [mkfs.ext2] 8,32 1 89014724.361314613 7539 U N [mkfs.ext2] 8 The proposed patch was tested with a 2.6.22-based kernel, and compile tested with a 2.6.24-based tree from 31 January 2008 (85004cc367abc000aa36c0d0e270ab609a68b0cb). Signed-off-by: Alan D. 
Brunelle <[EMAIL PROTECTED]>
---
 block/blk-core.c |   12 ++++--------
 1 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8ff9944..1d148b5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -218,6 +218,9 @@ int blk_remove_plug(struct request_queue *q)
 	if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
 		return 0;
 
+	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+			q->rq.count[READ] + q->rq.count[WRITE]);
+
 	del_timer(&q->unplug_timer);
 	return 1;
 }
@@ -271,9 +274,6 @@ void blk_unplug_work(struct work_struct *work)
 	struct request_queue *q =
 		container_of(work, struct request_queue, unplug_work);
 
-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
-			q->rq.count[READ] + q->rq.count[WRITE]);
-
 	q->unplug_fn(q);
 }
@@ -292,12 +292,8 @@ void blk_unplug(struct request_queue *q)
 	/*
 	 * devices don't necessarily have an ->unplug_fn defined
 	 */
-	if (q->unplug_fn) {
-		blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
-				q->rq.count[READ] + q->rq.count[WRITE]);
-
+	if (q->unplug_fn)
 		q->unplug_fn(q);
-	}
 }
 EXPORT_SYMBOL(blk_unplug);
--
1.5.2.5
Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
Ray Lee wrote: Out of curiosity, what are the mount options for the freshly created ext3 fs? In particular, are you using noatime, nodiratime? Ray

Nope, just mount. However, the tool I'm using to read the large file & overwrite the large file does open with O_NOATIME for reads... The tool used to read the files in the read-a-tree test is dd, and I doubt(?) it does an O_NOATIME...

Alan
Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
Alan D. Brunelle wrote:

Read large file:
Kernel    Min      Avg      Max      Std Dev   %user   %system   %iowait
------------------------------------------------------------------------
base :    201.6    215.1    275.5     22.8     0.26%    4.69%    33.54%
arjan:    198.0    210.3    261.5     18.5     0.33%   10.24%    54.00%

Read a tree:
Kernel    Min      Avg      Max      Std Dev   %user   %system   %iowait
------------------------------------------------------------------------
base :   3518.2   4631.3   5991.3    784.6     0.19%    3.29%    23.56%
arjan:   5731.6   6849.8       .4    731.6     0.32%    9.90%    52.70%

Overwrite large file:
Kernel    Min      Avg      Max      Std Dev   %user   %system   %iowait
------------------------------------------------------------------------
base :    104.2    147.7    239.5     38.4     0.02%    0.05%     1.08%
arjan:    106.2    149.7    239.2     38.4     0.25%    0.79%    14.97%

I'm going to try to do some cleanup work on the iostat CPU results - the reason %user & %system are so low is (I think) because they also include a lot of 0% results from the tail of the runs (as the unmount is going on, I think). I'm going to try to extract results for just the "meat" of the runs.

Alan
Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
Here are the results for the latest tests, some notes:

o The machine actually has 8GiB of RAM, so the tests still may end up using (some) page cache. (But at least it was the same for both kernels! :-) )
o Sorry the results took so long - the updated tree size caused the runs to take > 12 hours...
o The longer runs seemed to bring down the standard deviation a bit, although they are still quite large.
o 10 runs per test (read large file, read a tree, overwrite large file), with averages presented.
o The first 4 columns (min, avg, max, std dev) refer to the run lengths for the tests - real time, in seconds.
o The last 3 columns are extracted from iostat results over the course of the whole run.
o The read-a-tree test certainly stands out - the other 2 large-file manipulations have the two kernels within a couple of percent, but the read-a-tree test has Arjan's patch taking about 47%(!) longer on average. The increased %iowait & %system time in all 3 cases is interesting.

Read large file:
Kernel    Min      Avg      Max      Std Dev   %user   %system   %iowait
------------------------------------------------------------------------
base :    201.6    215.1    275.5     22.8     0.26%    4.69%    33.54%
arjan:    198.0    210.3    261.5     18.5     0.33%   10.24%    54.00%

Read a tree:
Kernel    Min      Avg      Max      Std Dev   %user   %system   %iowait
------------------------------------------------------------------------
base :   3518.2   4631.3   5991.3    784.6     0.19%    3.29%    23.56%
arjan:   5731.6   6849.8       .4    731.6     0.32%    9.90%    52.70%

Overwrite large file:
Kernel    Min      Avg      Max      Std Dev   %user   %system   %iowait
------------------------------------------------------------------------
base :    104.2    147.7    239.5     38.4     0.02%    0.05%     1.08%
arjan:    106.2    149.7    239.2     38.4     0.25%    0.79%    14.97%

Let me know if there is anything else I can do to elaborate, or if you have suggestions for further testing.

Alan
Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
Oh, and the runs were done in single-user mode...

Alan
Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
Arjan van de Ven wrote: On Wed, 14 Nov 2007 18:18:05 +0100 Ingo Molnar <[EMAIL PROTECTED]> wrote: * Andrew Morton <[EMAIL PROTECTED]> wrote: ooh, more performance testing. Thanks

* The overwriter task (on an 8GiB file), average over 10 runs:
  o 2.6.24 - 300.88226 seconds
  o 2.6.24 + Arjan's patch - 403.85505 seconds
* The read-a-different-kernel-tree task, average over 10 runs:
  o 2.6.24 - 46.8145945549 seconds
  o 2.6.24 + Arjan's patch - 39.6430601119 seconds
* The large-linear-read task (on an 8GiB file), average over 10 runs:
  o 2.6.24 - 290.32522 seconds
  o 2.6.24 + Arjan's patch - 386.34860 seconds

These are *large* differences, making this a very significant patch. Much care is needed now. and the numbers suggest it's mostly a severe performance regression. That's not what i have expected - ho hum. Apologies for my earlier "please merge it already!" whining. that's.. not automatic; it depends on what the right thing is :-( What for sure changes is who gets to do IO. Some of the tests we ran internally (we didn't publish yet because we saw REALLY large variations for most of them even without any patch) show for example that "dbench" got slower. But.. dbench gets slower when things get more fair, and faster when things get unfair. What conclusion you draw out of that is a whole different matter and depends on exactly what the test is doing, and what is the right thing for the OS to do in terms of who gets to do the IO. This makes the patch more tricky than the one-line change suggests, and this is also why I haven't published a ton of data yet; it's hard to get useful tests for this (and the variation of the 2.6.23+ kernels makes it even harder to do anything meaningful ;-( )

I'd also like to point out here that the run-to-run deviation was indeed quite large for both the unpatched and patched kernels; I'll report on that information with the next set of results...

Alan
Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
Andrew Morton wrote: (cc lkml restored, with permission) On Wed, 14 Nov 2007 10:48:10 -0500 "Alan D. Brunelle" <[EMAIL PROTECTED]> wrote: Andrew Morton wrote: On Mon, 15 Oct 2007 16:13:15 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote: Since you have been involved a lot with ext3 development, which kinds of workloads do you think will show a performance degradation with Arjan's patch? What should I test? Gee. Expect the unexpected ;) One problem might be when kjournald is doing its ordered-mode data writeback at the start of commit. That writeout will now be higher-priority and might impact other tasks which are doing synchronous file overwrites (ie: no commits) or O_DIRECT reads or writes or just plain old reads. If the aggregate number of seeks over the long term is the same as before then of course the overall throughput should be the same, in which case the impact might only be upon latency. However if for some reason the total number of seeks is increased then there will be throughput impacts as well. So as a starting point I guess one could set up a copy-a-kernel-tree-in-a-loop running in the background and then see what impact that has upon a large-linear-read, upon a read-a-different-kernel-tree and upon some database-style multithreaded O_DIRECT reader/overwriter.

Hi folks - I noticed this thread recently (sorry, a month late and a dollar short), but it was very interesting to me, and as I had some cycles yesterday I did a quick run of what Andrew suggested here (using 2.6.24-rc2 without and then with Arjan's patch). After doing the runs last night I'm not overly happy with the test setup, but before redoing that, I thought I'd ask to make sure that this patch was still being considered?
I'd consider its status to be "might be a good idea, more performance testing needed". And, btw, the results from the first pass were rather mixed - ooh, more performance testing. Thanks

* The overwriter task (on an 8GiB file), average over 10 runs:
  o 2.6.24 - 300.88226 seconds
  o 2.6.24 + Arjan's patch - 403.85505 seconds
* The read-a-different-kernel-tree task, average over 10 runs:
  o 2.6.24 - 46.8145945549 seconds
  o 2.6.24 + Arjan's patch - 39.6430601119 seconds
* The large-linear-read task (on an 8GiB file), average over 10 runs:
  o 2.6.24 - 290.32522 seconds
  o 2.6.24 + Arjan's patch - 386.34860 seconds

These are *large* differences, making this a very significant patch. Much care is needed now. Could you expand a bit on what you're testing here? I think that in one process you're doing a continuous copy-a-kernel-tree and in the other process you're doing the above three things, yes?

The test works like this:

1. I ensure that the device under test (DUT) is set to run the CFQ scheduler.
   1. It is a Fibre Channel 72GiB disk
   2. Single partition...
2. Put an Ext3 FS on the partition (mkfs.ext3 -b 4096)
3. Mount the device, and then:
   1. Put an 8GiB file on the new FS
   2. Put 3 copies of a Linux tree (w/ objs & kernel & such) onto the FS in separate directories
      1. Note: I'm going to do runs with 6 copies to each directory tree to get to about 4.2GiB per directory tree
4. Then, for each of the tests:
   1. Remount the device (purge page cache by umount & then mount)
   2. Start up a copy of 1 kernel tree to another tree (you hadn't specified if the copy in the background should be to a new area or not, so I'm just re-using the same area so we don't have to worry about removing the old). I keep doing the copy as long as the tests are going
   3. Perform the test (10 times)

The tests are:

* Linear read of a large file (8GiB)
* Tree read (foreach file in the tree, dd it to /dev/null)
* Overwrite of that large file: was doing 256KiB random read/writes, will go down to 4KiB read/writes as that is more realistic I'd guess

I'm going to try to get the comparisons done by tomorrow; the results should be very different due to the changes noted above (going to 4.2GiB trees instead of 700MiB, going to 4K instead of 256K read/writes). This may cause the runs to be much longer, and then I won't get it done as quickly...

I guess the other things we should look at are the impact on the continuously-copy-a-kernel-tree process and also the overall IO throughput. These things will of course be related. If the overall system-wide IO throughput increases with the patch then we probably have a no-brainer. If (as I suspect) the overall IO throughput is decreased
[PATCH] Add UNPLUG traces to all appropriate places
Added blk_unplug interface, allowing all invocations of unplugs to result in a generated blktrace UNPLUG. Previously, we saw a PLUG on each of the underlying devices, and an UNPLUG on the volume. This patch ensures that we see the UNPLUG calls for each of the underlying devices.

Signed-off-by: Alan D. Brunelle <[EMAIL PROTECTED]>
---
 block/ll_rw_blk.c      |   24 +++-
 drivers/md/bitmap.c    |    3 +--
 drivers/md/dm-table.c  |    3 +--
 drivers/md/linear.c    |    3 +--
 drivers/md/md.c        |    4 ++--
 drivers/md/multipath.c |    3 +--
 drivers/md/raid0.c     |    3 +--
 drivers/md/raid1.c     |    3 +--
 drivers/md/raid10.c    |    3 +--
 drivers/md/raid5.c     |    3 +--
 include/linux/blkdev.h |    1 +
 11 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 75c98d5..37f8e9c 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1634,15 +1634,7 @@ static void blk_backing_dev_unplug(struct backing_dev_info *bdi,
 {
 	struct request_queue *q = bdi->unplug_io_data;
 
-	/*
-	 * devices don't necessarily have an ->unplug_fn defined
-	 */
-	if (q->unplug_fn) {
-		blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
-				q->rq.count[READ] + q->rq.count[WRITE]);
-
-		q->unplug_fn(q);
-	}
+	blk_unplug(q);
 }
 
 static void blk_unplug_work(struct work_struct *work)
@@ -1666,6 +1658,20 @@ static void blk_unplug_timeout(unsigned long data)
 	kblockd_schedule_work(&q->unplug_work);
 }
 
+void blk_unplug(struct request_queue *q)
+{
+	/*
+	 * devices don't necessarily have an ->unplug_fn defined
+	 */
+	if (q->unplug_fn) {
+		blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+				q->rq.count[READ] + q->rq.count[WRITE]);
+
+		q->unplug_fn(q);
+	}
+}
+EXPORT_SYMBOL(blk_unplug);
+
 /**
  * blk_start_queue - restart a previously stopped queue
  * @q:	The request_queue in question

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 7c426d0..1b1ef31 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1207,8 +1207,7 @@ int bitmap_startwrite(struct bitmap *bitmap, sector_t offset, unsigned long sect
 			prepare_to_wait(&bitmap->overflow_wait, &__wait,
					TASK_UNINTERRUPTIBLE);
 			spin_unlock_irq(&bitmap->lock);
-			bitmap->mddev->queue
-				->unplug_fn(bitmap->mddev->queue);
+			blk_unplug(bitmap->mddev->queue);
 			schedule();
 			finish_wait(&bitmap->overflow_wait, &__wait);
 			continue;

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 5a7eb65..e298d8d 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1000,8 +1000,7 @@ void dm_table_unplug_all(struct dm_table *t)
 		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
 		struct request_queue *q = bdev_get_queue(dd->bdev);
 
-		if (q->unplug_fn)
-			q->unplug_fn(q);
+		blk_unplug(q);
 	}
 }

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 56a11f6..3dac1cf 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -87,8 +87,7 @@ static void linear_unplug(struct request_queue *q)
 	for (i=0; i < mddev->raid_disks; i++) {
 		struct request_queue *r_queue = bdev_get_queue(conf->disks[i].rdev->bdev);
 
-		if (r_queue->unplug_fn)
-			r_queue->unplug_fn(r_queue);
+		blk_unplug(r_queue);
 	}
 }

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 808cd95..cef9ebd 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5445,7 +5445,7 @@ void md_do_sync(mddev_t *mddev)
 		 * about not overloading the IO subsystem.
		 * (things like an
		 * e2fsck being done on the RAID array should execute fast)
		 */
-		mddev->queue->unplug_fn(mddev->queue);
+		blk_unplug(mddev->queue);
 		cond_resched();
 
 		currspeed = ((unsigned long)(io_sectors-mddev->resync_mark_cnt))/2
@@ -5464,7 +5464,7 @@ void md_do_sync(mddev_t *mddev)
 	 * this also signals 'finished resyncing' to md_stop
 	 */
  out:
-	mddev->queue->unplug_fn(mddev->queue);
+	blk_unplug(mddev->queue);
 
 	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));

diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index b35731c..eb631eb 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -125,8 +125,7 @@ static void unplug_slaves(mddev_t *mddev)
 		atomic_inc(&rdev->nr_pending);
 		rcu_read_unlock();
 
-		if (r_queue->unplug_fn)
-			r_queue->unplug_fn(r_queue);
+		blk_unplug(r_queue);
 
 		rdev_dec_pending(rdev, mddev);
 		rcu_read_lock();

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index c05..f8e5917 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -35,8 +35,7 @@ static void raid0_unplug(struct request_queue *q)
 	for (i=0; i < mddev->raid_disks; i++) {
 		struct request_queue *r_queue = bdev_get_queue(devlist[i]->bdev
Re: Linux Kernel Markers - performance characterization with large IO load on large-ish system
Mathieu Desnoyers wrote:
>> remember that we have seen and discussed something like this before,
>> it's still a puzzle to me...
>>
>> > I do wonder about that performance _increase_ with blktrace enabled. I
>
> Interesting question indeed.
>
> In those tests, when blktrace is running, are the relay buffers only
> written to or are they also read?
>
blktrace (the utility) was running too - so the relay buffers /were/ being read and stored out to disk elsewhere.

> Running the tests without consuming the buffers (in overwrite mode)
> would tell us more about the nature of the disturbance causing the
> performance increase.
>
I'd have to write a utility to enable the traces, but then not read. Let me think about that.

> Also, a kernel trace could help us understand more thoroughly what is
> happening there.. is it caused by the scheduler ? memory allocation ?
> data cache alignment ?
>
Yep - when I get some time, I'll look into that. [Clearly not a gating issue for marker support...]

> I would suggest that you try aligning the block layer data structures
> accessed by blktrace on L2 cacheline size and compare the results (when
> blktrace is disabled).
>
Again, when I get some time! :-)

Alan
[PATCH] Some IO scheduler cleanup in Documentation/block
[PATCH] Some IO scheduler cleanup in Documentation/block

as-iosched.txt:
o Changed IO scheduler selection text to a reference to the
  switching-sched.txt file.
o Fixed typo: 'for up time...' -> 'for up to...'
o Added short description of the est_time file.

deadline-iosched.txt:
o Changed IO scheduler selection text to a reference to the
  switching-sched.txt file.
o Removed references to non-existent seek-cost and stream_unit.
o Fixed typo: 'write_starved' -> 'writes_starved'

switching-sched.txt:
o Added in boot-time argument to set the default IO scheduler.
  (From as-iosched.txt)
o Added in sysfs mount instructions. (From deadline-iosched.txt)

Signed-off-by: Alan D. Brunelle <[EMAIL PROTECTED]>
---
 Documentation/block/as-iosched.txt       | 21 +
 Documentation/block/deadline-iosched.txt | 23 +++
 Documentation/block/switching-sched.txt  | 21 +
 3 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/Documentation/block/as-iosched.txt b/Documentation/block/as-iosched.txt
index a598fe1..738b72b 100644
--- a/Documentation/block/as-iosched.txt
+++ b/Documentation/block/as-iosched.txt
@@ -20,15 +20,10 @@
 actually has a head for each physical device in the logical RAID device.
 However, setting the antic_expire (see tunable parameters below) produces
 very similar behavior to the deadline IO scheduler.

-Selecting IO schedulers
----
-To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
-'noop', 'as' and 'cfq' (the default) are also available. IO schedulers are
-assigned globally at boot time only presently. It's also possible to change
-the IO scheduler for a determined device on the fly, as described in
-Documentation/block/switching-sched.txt.
-
+Selecting IO schedulers
+---
+Refer to Documentation/block/switching-sched.txt for information on
+selecting an io scheduler on a per-device basis.

 Anticipatory IO scheduler Policies
 --
@@ -115,7 +110,7 @@
 statistics (average think time, average seek distance) on the
 process that submitted the just completed request are examined.
 If it seems likely that that process will submit another request soon, and
 that request is likely to be near the just completed request, then the IO
-scheduler will stop dispatching more read requests for up time (antic_expire)
+scheduler will stop dispatching more read requests for up to (antic_expire)
 milliseconds, hoping that process will submit a new request near the one
 that just completed. If such a request is made, then it is dispatched
 immediately. If the antic_expire wait time expires, then the IO scheduler
@@ -165,3 +160,13 @@ The parameters are:
 for big seek time devices though not a linear correspondence - most
 processes have only a few ms thinktime.

+In addition to the tunables above there is a read-only file named est_time
+which, when read, will show:
+
+- The probability of a task exiting without a cooperating task
+  submitting an anticipated IO.
+
+- The current mean think time.
+
+- The seek distance used to determine if an incoming IO is better.
+
diff --git a/Documentation/block/deadline-iosched.txt b/Documentation/block/deadline-iosched.txt
index be08ffd..0a89839 100644
--- a/Documentation/block/deadline-iosched.txt
+++ b/Documentation/block/deadline-iosched.txt
@@ -5,16 +5,10 @@
 This little file attempts to document how the deadline io scheduler works.
 In particular, it will clarify the meaning of the exposed tunables that may
 be of interest to power users.

-Each io queue has a set of io scheduler tunables associated with it. These
-tunables control how the io scheduler works. You can find these entries
-in:
-
-/sys/block/<device>/queue/iosched
-
-assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted,
-you can do so by typing:
-
-# mount none /sys -t sysfs
+Selecting IO schedulers
+---
+Refer to Documentation/block/switching-sched.txt for information on
+selecting an io scheduler on a per-device basis.

@@ -41,14 +35,11 @@ fifo_batch

 When a read request expires its deadline, we must move some requests from
 the sorted io scheduler list to the block device dispatch queue. fifo_batch
-controls how many requests we move, based on the cost of each request. A
-request is either qualified as a seek or a stream. The io scheduler knows
-the last request that was serviced by the drive (or will be serviced right
-before this one). See seek_cost and stream_unit.
+controls how many requests we move.

-write_starved (number of dispatches)
---
+writes_starved (number of dispatches)
+--

 When we have to move requests from the io scheduler queue to the block
 device dispatch queue, we always give a preference to reads. However, we

diff --git a/Documentation/blo
Re: Linux Kernel Markers - performance characterization with large IO load on large-ish system
Mathieu Desnoyers wrote:

> * Alan D. Brunelle ([EMAIL PROTECTED]) wrote:
>> Taking Linux 2.6.23-rc6 + 2.6.23-rc6-mm1 as a basis, I took some sample
>> runs of the following on both it and after applying Mathieu Desnoyers
>> 11-patch sequence (19 September 2007).
>>
>> * 32-way IA64 + 132GiB + 10 FC adapters + 10 HP MSA 1000s (one 72GiB
>>   volume per MSA used)
>> * 10 runs with each configuration, averages shown below
>>   o 2.6.23-rc6 + 2.6.23-rc6-mm1 without blktrace running
>>   o 2.6.23-rc6 + 2.6.23-rc6-mm1 with blktrace running
>>   o 2.6.23-rc6 + 2.6.23-rc6-mm1 + markers without blktrace running
>>   o 2.6.23-rc6 + 2.6.23-rc6-mm1 + markers with blktrace running
>> * A run consists of doing the following in parallel:
>>   o Make an ext3 FS on each of the 10 volumes
>>   o Mount & unmount each volume
>>     + The unmounting generates a tremendous amount of writes to the
>>       disks - thus stressing the intended storage devices (10 volumes)
>>       plus the separate volume for all the blktrace data (when blk
>>       tracing is enabled).
>>     + Note the times reported below only cover the make/mount/unmount
>>       time - the actual blktrace runs extended beyond the times
>>       measured (took quite a while for the blk trace data to be
>>       output). We're only concerned with the impact on the
>>       "application" performance in this instance.
>>
>> Results are:
>>
>> Kernel                                 w/out BT   STDDEV  w/ BT      STDDEV
>> -------------------------------------  ---------  ------  ---------  ------
>> 2.6.23-rc6 + 2.6.23-rc6-mm1            14.679982  0.34    27.754796  2.09
>> 2.6.23-rc6 + 2.6.23-rc6-mm1 + markers  14.993041  0.59    26.694993  3.23
>
> Interesting results, although we cannot say any of the solutions has
> much impact due to the std dev. Also, it could be interesting to add the
> "blktrace compiled out" as a base line.
>
> Thanks for running those tests,
>
> Mathieu

Mathieu:

Here are the results from 6 different kernels (including ones with
blktrace not configured in), now performing 40 runs per kernel.

o All kernels start off with Linux 2.6.23-rc6 + 2.6.23-rc6-mm1
o '- bt cfg' or '+ bt cfg' means a kernel without or with blktrace
  configured respectively.
o '- markers' or '+ markers' means a kernel without or with the 11-patch
  marker series respectively.

38 runs without blk traces being captured (dropped hi/lo value from 40 runs)

Kernel Options      Min val    Avg val    Max val    Std Dev
------------------  ---------  ---------  ---------  --------
- markers - bt cfg  15.349127  16.169459  16.372980  0.184417
+ markers - bt cfg  15.280382  16.202398  16.409257  0.191861
- markers + bt cfg  14.464366  14.754347  16.052306  0.463665
+ markers + bt cfg  14.421765  14.644406  15.690871  0.233885

38 runs with blk traces being captured (dropped hi/lo value from 40 runs)

Kernel Options      Min val    Avg val    Max val    Std Dev
------------------  ---------  ---------  ---------  --------
- markers + bt cfg  24.675859  28.480446  32.571484  1.713603
+ markers + bt cfg  18.713280  27.054927  31.684325  2.857186

o It is not at all clear why running without blk trace configured into
  the kernel runs slower than with blk trace configured in. (9.6 to 10.6%
  reduction)
o The data is still not conclusive with respect to whether the marker
  patches change performance characteristics when we're not gathering
  traces. It appears that any change in performance is minimal at worst
  for this test.
o The data so far still doesn't conclusively show a win in this case even
  when we are capturing traces, although the average certainly seems to
  be in its favor.

One concern that I should be able to deal easily with is the choice of
the IO scheduler being used for both the volume being used to perform the
test on, as well as the one used for storing blk traces (when enabled).
Right now I was using the default CFQ, when perhaps NOOP or DEADLINE
would be a better choice. If there is enough interest in seeing how that
changes things I could try to get some runs in later this week.

Alan D.
Brunelle
Hewlett-Packard / Open Source and Linux Organization / Scalability and Performance Group
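The "dropped hi/lo value from 40 runs" reduction used in the tables above is easy to reproduce. A minimal sketch (the sample times below are made up for illustration, not the measured data, and it uses the population standard deviation - the original tooling may have used the sample form):

```python
import math

def trimmed_stats(samples):
    """Drop the single highest and lowest values, then return
    (min, avg, max, population std dev) of the remainder - i.e. the
    reduction from 40 runs down to 38 used in the tables above."""
    if len(samples) < 3:
        raise ValueError("need at least 3 samples to drop hi/lo")
    trimmed = sorted(samples)[1:-1]
    n = len(trimmed)
    avg = sum(trimmed) / n
    var = sum((x - avg) ** 2 for x in trimmed) / n
    return min(trimmed), avg, max(trimmed), math.sqrt(var)

# Illustrative run times in seconds (NOT the measured data).
runs = [14.2, 15.1, 15.3, 15.4, 15.6, 16.9]
lo, avg, hi, sd = trimmed_stats(runs)
print(lo, avg, hi, round(sd, 6))
```

Note how dropping the single outliers (14.2 and 16.9 here) pulls the std dev down sharply, which is presumably why the hi/lo trim was applied before reporting.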
Re: Linux Kernel Markers - performance characterization with large IO load on large-ish system
Taking Linux 2.6.23-rc6 + 2.6.23-rc6-mm1 as a basis, I took some sample
runs of the following on both it and after applying Mathieu Desnoyers
11-patch sequence (19 September 2007).

* 32-way IA64 + 132GiB + 10 FC adapters + 10 HP MSA 1000s (one 72GiB
  volume per MSA used)
* 10 runs with each configuration, averages shown below
  o 2.6.23-rc6 + 2.6.23-rc6-mm1 without blktrace running
  o 2.6.23-rc6 + 2.6.23-rc6-mm1 with blktrace running
  o 2.6.23-rc6 + 2.6.23-rc6-mm1 + markers without blktrace running
  o 2.6.23-rc6 + 2.6.23-rc6-mm1 + markers with blktrace running
* A run consists of doing the following in parallel:
  o Make an ext3 FS on each of the 10 volumes
  o Mount & unmount each volume
    + The unmounting generates a tremendous amount of writes to the
      disks - thus stressing the intended storage devices (10 volumes)
      plus the separate volume for all the blktrace data (when blk
      tracing is enabled).
    + Note the times reported below only cover the make/mount/unmount
      time - the actual blktrace runs extended beyond the times measured
      (took quite a while for the blk trace data to be output). We're
      only concerned with the impact on the "application" performance in
      this instance.

Results are:

Kernel                                 w/out BT   STDDEV  w/ BT      STDDEV
-------------------------------------  ---------  ------  ---------  ------
2.6.23-rc6 + 2.6.23-rc6-mm1            14.679982  0.34    27.754796  2.09
2.6.23-rc6 + 2.6.23-rc6-mm1 + markers  14.993041  0.59    26.694993  3.23

It looks to be about a 2.1% increase in time to do the make/mount/unmount
operations with the marker patches in place and no blktrace operations.
With the blktrace operations in place we see about a 3.8% decrease in
time to do the same ops.

When our Oracle benchmarking machine frees up, and when the
marker/blktrace patches are more stable, we'll try to get some "real"
Oracle benchmark runs done to gauge the impact of the markers changes to
performance...

Alan D.
Brunelle
Hewlett-Packard / Open Source and Linux Organization / Scalability and Performance Group
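As a sanity check, the 2.1% and 3.8% figures follow directly from the table averages:

```python
def pct_change(base, new):
    """Relative change of `new` vs `base`, in percent."""
    return (new - base) / base * 100.0

# Average times (seconds) from the table above.
no_bt_base, no_bt_markers = 14.679982, 14.993041  # without blktrace running
bt_base, bt_markers = 27.754796, 26.694993        # with blktrace running

print(f"markers, no blktrace: {pct_change(no_bt_base, no_bt_markers):+.1f}%")
print(f"markers, w/ blktrace: {pct_change(bt_base, bt_markers):+.1f}%")
```

This yields roughly +2.1% (slower) without blktrace and -3.8% (faster) with it, matching the text.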
[PATCH] Fix remap handling by blktrace
This patch provides more information concerning REMAP operations on block
IOs. The additional information provides clearer details at the user
level, and supports post-processing analysis in btt.

o Adds in partition remaps on the same device.
o Fixed up the remap information in DM to be in the right order
o Sent up mapped-from and mapped-to device information

Signed-off-by: Alan D. Brunelle <[EMAIL PROTECTED]>
---
 block/ll_rw_blk.c            | 4 ++++
 drivers/md/dm.c              | 4 ++--
 include/linux/blktrace_api.h | 3 ++-
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 8c2caff..a15845c 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -3047,6 +3047,10 @@ static inline void blk_partition_remap(struct bio *bio)
 		bio->bi_sector += p->start_sect;
 		bio->bi_bdev = bdev->bd_contains;
+
+		blk_add_trace_remap(bdev_get_queue(bio->bi_bdev), bio,
+				    bdev->bd_dev, bio->bi_sector,
+				    bio->bi_sector - p->start_sect);
 	}
 }

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 141ff9f..2120155 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -580,8 +580,8 @@ static void __map_bio(struct dm_target *ti, struct bio *clone,
 		/* the bio has been remapped so dispatch it */
 		blk_add_trace_remap(bdev_get_queue(clone->bi_bdev), clone,
-				    tio->io->bio->bi_bdev->bd_dev, sector,
-				    clone->bi_sector);
+				    tio->io->bio->bi_bdev->bd_dev,
+				    clone->bi_sector, sector);
 		generic_make_request(clone);
 	} else if (r < 0 || r == DM_MAPIO_REQUEUE) {

diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index 90874a5..7b5d56b 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -105,7 +105,7 @@ struct blk_io_trace {
  */
 struct blk_io_trace_remap {
 	__be32 device;
-	u32 __pad;
+	__be32 device_from;
 	__be64 sector;
 };

@@ -272,6 +272,7 @@ static inline void blk_add_trace_remap(struct request_queue *q, struct bio *bio,
 		return;

 	r.device = cpu_to_be32(dev);
+	r.device_from = cpu_to_be32(bio->bi_bdev->bd_dev);
 	r.sector = cpu_to_be64(to);

 	__blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP,
			!bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r);
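For a userspace trace consumer, the wire-format change above is that the old 32-bit pad now carries the mapped-from device. A hedged sketch of decoding the (big-endian) remap payload with Python's struct module - the field layout is taken from the struct above, but the helper name and sample values are mine:

```python
import struct

# __be32 device, __be32 device_from, __be64 sector (per the patched struct)
REMAP_FMT = ">IIQ"

def decode_remap(payload: bytes) -> dict:
    """Unpack a blk_io_trace_remap payload into named fields."""
    device, device_from, sector = struct.unpack(REMAP_FMT, payload)
    return {"device": device, "device_from": device_from, "sector": sector}

# Round-trip example with made-up values.
raw = struct.pack(REMAP_FMT, 0x800010, 0x800011, 123456789)
print(decode_remap(raw))
```

An old-format reader would still parse the record (same 16-byte size), but would see the mapped-from device where it previously saw padding - which is presumably why reusing the pad field keeps the change backward compatible on the wire.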
Re: CFQ IO scheduler patch series - AIM7 DBase results on a 16-way IA64
Jens Axboe wrote:
> On Mon, May 21 2007, Alan D. Brunelle wrote:
>> Jens Axboe wrote:
>>> On Tue, May 01 2007, Alan D. Brunelle wrote:
>>>> Jens Axboe wrote:
>>>>> On Mon, Apr 30 2007, Alan D. Brunelle wrote:
>>>>>> The results from a single run of an AIM7 DBase load on a 16-way
>>>>>> ia64 box (64GB RAM + 144 FC disks) showed a slight regression
>>>>>> (~0.5%) by adding in this patch. (Graph can be found at
>>>>>> http://free.linux.hp.com/~adb/cfq/cfq_dbase.png ) It is only a
>>>>>> single set of runs, on a single platform, but it is something to
>>>>>> keep an eye on as the regression showed itself across the complete
>>>>>> run.
>>>>> Do you know if this regression is due to worse IO performance, or
>>>>> increased system CPU usage?
>>>> We performed two point runs yesterday (20,000 and 50,000 tasks) and
>>>> here are the results:
>>>>
>>>> Kernel  Tasks  Jobs per Minute  %sys (avg)
>>>> ------  -----  ---------------  ----------
>>>> 2.6.21  20000  60,831.1         39.83%
>>>> CFQ br  20000  60,237.4         40.80%
>>>>                -0.98%           +2.44%
>>>> 2.6.21  50000  60,881.6         40.43%
>>>> CFQ br  50000  60,400.6         40.80%
>>>>                -0.79%           +0.92%
>>>>
>>>> So we're seeing a slight IO performance regression with a slight
>>>> increase in %system with the CFQ branch. (A chart of the complete
>>>> run values is up on
>>>> http://free.linux.hp.com/~adb/cfq/cfq_20k50k.png ).
>>> Alan, can you repeat that same run with this patch applied? It
>>> reinstates the cfq lookup hash, which could account for increased
>>> system utilization.
>> Hi Jens -
>>
>> This test was performed over the weekend, results are updated on
>> http://free.linux.hp.com/~adb/cfq/cfq_dbase.png
> Thanks a lot, Alan! So the cfq hash does indeed improve things a
> little, that's a shame. I guess I'll just reinstate the hash lookup.

You're welcome Jens, but remember: It's one set of data; from one
benchmark; on one architecture; on one platform... don't know if you
should scrap the whole thing for that! :-) At the very least, I could
look into trying it out on another architecture. Let me see what I can
dig up...
Alan
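The quoted deltas (-0.98% / +2.44% at 20,000 tasks, -0.79% / +0.92% at 50,000) can be double-checked from the raw table values:

```python
def delta_pct(base, new):
    """Signed relative change of `new` vs `base`, in percent."""
    return (new - base) / base * 100.0

# (tasks, 2.6.21 JPM, CFQ-branch JPM, 2.6.21 %sys, CFQ-branch %sys)
rows = [
    (20000, 60831.1, 60237.4, 39.83, 40.80),
    (50000, 60881.6, 60400.6, 40.43, 40.80),
]
for tasks, jpm_base, jpm_cfq, sys_base, sys_cfq in rows:
    print(tasks,
          f"JPM {delta_pct(jpm_base, jpm_cfq):+.2f}%",
          f"%sys {delta_pct(sys_base, sys_cfq):+.2f}%")
```

The signs line up with the conclusion in the thread: slightly fewer jobs per minute, slightly more %system with the CFQ branch.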
Re: CFQ IO scheduler patch series - AIM7 DBase results on a 16-way IA64
Jens Axboe wrote:
> On Tue, May 01 2007, Alan D. Brunelle wrote:
>> Jens Axboe wrote:
>>> On Mon, Apr 30 2007, Alan D. Brunelle wrote:
>>>> The results from a single run of an AIM7 DBase load on a 16-way ia64
>>>> box (64GB RAM + 144 FC disks) showed a slight regression (~0.5%) by
>>>> adding in this patch. (Graph can be found at
>>>> http://free.linux.hp.com/~adb/cfq/cfq_dbase.png ) It is only a
>>>> single set of runs, on a single platform, but it is something to
>>>> keep an eye on as the regression showed itself across the complete
>>>> run.
>>> Do you know if this regression is due to worse IO performance, or
>>> increased system CPU usage?
>> We performed two point runs yesterday (20,000 and 50,000 tasks) and
>> here are the results:
>>
>> Kernel  Tasks  Jobs per Minute  %sys (avg)
>> ------  -----  ---------------  ----------
>> 2.6.21  20000  60,831.1         39.83%
>> CFQ br  20000  60,237.4         40.80%
>>                -0.98%           +2.44%
>> 2.6.21  50000  60,881.6         40.43%
>> CFQ br  50000  60,400.6         40.80%
>>                -0.79%           +0.92%
>>
>> So we're seeing a slight IO performance regression with a slight
>> increase in %system with the CFQ branch. (A chart of the complete run
>> values is up on http://free.linux.hp.com/~adb/cfq/cfq_20k50k.png ).
> Alan, can you repeat that same run with this patch applied? It
> reinstates the cfq lookup hash, which could account for increased
> system utilization.

Hi Jens -

This test was performed over the weekend, results are updated on
http://free.linux.hp.com/~adb/cfq/cfq_dbase.png

Alan
Re: CFQ IO scheduler patch series - AIM7 DBase results on a 16-way IA64
Jens Axboe wrote:
On Mon, May 21 2007, Alan D. Brunelle wrote:
Jens Axboe wrote:
On Tue, May 01 2007, Alan D. Brunelle wrote:
Jens Axboe wrote:
On Mon, Apr 30 2007, Alan D. Brunelle wrote:

The results from a single run of an AIM7 DBase load on a 16-way ia64 box (64GB RAM + 144 FC disks) showed a slight regression (~0.5%) after adding in this patch. (A graph can be found at http://free.linux.hp.com/~adb/cfq/cfq_dbase.png ) It is only a single set of runs, on a single platform, but it is something to keep an eye on, as the regression showed itself across the complete run.

Do you know if this regression is due to worse IO performance, or increased system CPU usage?

We performed two point runs yesterday (20,000 and 50,000 tasks), and here are the results:

  Kernel   Tasks    Jobs per Minute      %sys (avg)
  ------   ------   -----------------    ---------------
  2.6.21   20,000   60,831.1             39.83%
  CFQ br   20,000   60,237.4 (-0.98%)    40.80% (+2.44%)
  2.6.21   50,000   60,881.6             40.43%
  CFQ br   50,000   60,400.6 (-0.79%)    40.80% (+0.92%)

So we're seeing a slight IO performance regression along with a slight increase in %system with the CFQ branch. (A chart of the complete run values is up at http://free.linux.hp.com/~adb/cfq/cfq_20k50k.png ).

Alan, can you repeat that same run with this patch applied? It reinstates the cfq lookup hash, which could account for the increased system utilization.

Hi Jens - This test was performed over the weekend; the results are updated on http://free.linux.hp.com/~adb/cfq/cfq_dbase.png

Thanks a lot, Alan! So the cfq hash does indeed improve things a little; that's a shame. I guess I'll just reinstate the hash lookup.

You're welcome, Jens, but remember: it's one set of data, from one benchmark, on one architecture, on one platform... I don't know if you should scrap the whole thing for that! :-) At the very least, I could look into trying it out on another architecture. Let me see what I can dig up...

Alan
Re: CFQ IO scheduler patch series - AIM7 DBase results on a 16-way IA64
Jens Axboe wrote:
On Mon, Apr 30 2007, Alan D. Brunelle wrote:

The results from a single run of an AIM7 DBase load on a 16-way ia64 box (64GB RAM + 144 FC disks) showed a slight regression (~0.5%) after adding in this patch. (A graph can be found at http://free.linux.hp.com/~adb/cfq/cfq_dbase.png ) It is only a single set of runs, on a single platform, but it is something to keep an eye on, as the regression showed itself across the complete run.

Do you know if this regression is due to worse IO performance, or increased system CPU usage?

We performed two point runs yesterday (20,000 and 50,000 tasks), and here are the results:

  Kernel   Tasks    Jobs per Minute      %sys (avg)
  ------   ------   -----------------    ---------------
  2.6.21   20,000   60,831.1             39.83%
  CFQ br   20,000   60,237.4 (-0.98%)    40.80% (+2.44%)
  2.6.21   50,000   60,881.6             40.43%
  CFQ br   50,000   60,400.6 (-0.79%)    40.80% (+0.92%)

So we're seeing a slight IO performance regression along with a slight increase in %system with the CFQ branch. (A chart of the complete run values is up at http://free.linux.hp.com/~adb/cfq/cfq_20k50k.png ).

Alan
Re: CFQ IO scheduler patch series - AIM7 DBase results on a 16-way IA64
Jens Axboe wrote:
On Mon, Apr 30 2007, Alan D. Brunelle wrote:

The results from a single run of an AIM7 DBase load on a 16-way ia64 box (64GB RAM + 144 FC disks) showed a slight regression (~0.5%) after adding in this patch. (A graph can be found at http://free.linux.hp.com/~adb/cfq/cfq_dbase.png ) It is only a single set of runs, on a single platform, but it is something to keep an eye on, as the regression showed itself across the complete run.

Do you know if this regression is due to worse IO performance, or increased system CPU usage?

Unfortunately, the runs generate different X points. I'm going to try to get a second run with the same X points, and then I can compare iostat results (these are being collected). I do have some iostat data from these runs, and I am trying to make sense of it. But with only about a 0.5% difference in performance, and different X values, not much can be gleaned. We'll see when the second run of a kernel can be done, and I'll get back to you on that.

Alan
CFQ IO scheduler patch series - AIM7 DBase results on a 16-way IA64
The results from a single run of an AIM7 DBase load on a 16-way ia64 box (64GB RAM + 144 FC disks) showed a slight regression (~0.5%) after adding in this patch. (A graph can be found at http://free.linux.hp.com/~adb/cfq/cfq_dbase.png ) It is only a single set of runs, on a single platform, but it is something to keep an eye on, as the regression showed itself across the complete run.

Alan D. Brunelle
[PATCH linux-2.6-block.git] Fix blktrace trace ordering for plug branch
The attached patch corrects the ordering of trace output between request queue insertions (I) and unplug calls (U). Right now the insert precedes the unplug, which just isn't right:

  65,128  0    1  67.699868965  7882  Q  R 0 + 1 [aiod]
  65,128  0    2  67.699876462  7882  G  R 0 + 1 [aiod]
  65,128  0    3  67.699878286  7882  P  W [aiod]
  65,128  0    4  67.699880491  7882  I  R 0 + 1 [aiod]
  65,128  0    5  67.699887589  7882  U  R [aiod] 1
  65,128  0    6  67.699898317    54  D  R 0 + 1 [kblockd/0]
  65,128  2  153  67.700126590     0  C  R 0 + 1 [0]

With the patch provided, the unplug comes first:

  65,128  3    1  0.0           7045  Q  R 0 + 1 [aiod]
  65,128  3    2  0.02295       7045  G  R 0 + 1 [aiod]
  65,128  3    3  0.02617       7045  P  W [aiod]
  65,128  3    4  0.03685       7045  U  R [aiod] 1
  65,128  3    5  0.04107       7045  I  R 0 + 1 [aiod]
  65,128  3    6  0.0949157           D  R 0 + 1 [kblockd/3]
  65,128  2    1  0.000232447      0  C  R 0 + 1 [0]

Jens: If you agree, the patch can be applied to your plug branch for git://git.kernel.dk/data/git/linux-2.6-block.git

Thanks,
Alan

From: Alan D. Brunelle <[EMAIL PROTECTED]>

Fix unplug/insert trace inversion problem.

Signed-off-by: Alan D. Brunelle <[EMAIL PROTECTED]>
---
 block/ll_rw_blk.c      |    8 ++++----
 include/linux/blkdev.h |    1 +
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 46d29f7..3bec97f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -2981,6 +2981,7 @@ out_unlock:
 		if (bio_data_dir(bio) == WRITE && ioc->qrcu_idx == -1)
 			ioc->qrcu_idx = qrcu_read_lock(&q->qrcu);
 		list_add_tail(&req->queuelist, &ioc->plugged_list);
+		ioc->plugged_list_len++;
 	}
 out:
@@ -3720,7 +3721,6 @@ void blk_unplug_current(void)
 	struct io_context *ioc = current->io_context;
 	struct request *req;
 	request_queue_t *q;
-	int nr_unplug;
 
 	if (!ioc)
 		return;
@@ -3735,19 +3735,19 @@ void blk_unplug_current(void)
 	if (list_empty(&ioc->plugged_list))
 		goto out;
 
-	nr_unplug = 0;
+	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, ioc->plugged_list_len);
+
 	spin_lock_irq(q->queue_lock);
 	do {
 		req = list_entry_rq(ioc->plugged_list.next);
 		list_del_init(&req->queuelist);
 		add_request(q, req);
-		nr_unplug++;
 	} while (!list_empty(&ioc->plugged_list));
+	ioc->plugged_list_len = 0;
 	spin_unlock_irq(q->queue_lock);
 
 	queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, nr_unplug);
 
 out:
 	/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f8cdd44..848564c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -113,6 +113,7 @@ struct io_context {
 	 */
 	int plugged;
 	int qrcu_idx;
+	int plugged_list_len;
 	struct list_head plugged_list;
 	struct request_queue *plugged_queue;
Re: [PATCH 5/15] cfq-iosched: speed up rbtree handling
Jens Axboe wrote:
On Wed, Apr 25 2007, Jens Axboe wrote:
On Wed, Apr 25 2007, Jens Axboe wrote:
On Wed, Apr 25 2007, Alan D. Brunelle wrote:

Hi Jens - The attached patch speeds it up even more - I'm finding a >9% reduction in %system with no loss in IO performance. This just sets the cached element when the first is looked for.

Interesting, good thinking. It should not change the IO pattern, as the end result should be the same. Thanks Alan, will commit! I'll give elevator.c the same treatment, should be even more beneficial. Stay tuned for a test patch.

Something like this, totally untested (it compiles). I initially wanted to fold the cfq addon into the elevator.h provided implementation, but that requires more extensive changes. Given how little code it is, I think I'll keep them separate.

Booted, seems to work fine for me. In a null ended IO test, I get about a 1-2% speedup for a single queue of depth 64 using libaio. So it's definitely worth it, will commit.

After longer runs last night, I think the patched elevator code /does/ help (albeit ever so slightly - about 0.6% performance improvement at a 1.1% %system overhead).

  rkB/s      %system  Kernel
  ---------  -------  ------
  1022942.2  3.69     Original patch + fix to cfq_rb_first
  1029087.0  3.73     This patch stream (including fixes to elevator code)

Alan
Re: [PATCH 5/15] cfq-iosched: speed up rbtree handling
Hi Jens - The attached patch speeds it up even more - I'm finding a >9% reduction in %system with no loss in IO performance. This just sets the cached element when the first is looked for.

Alan

From: Alan D. Brunelle <[EMAIL PROTECTED]>

Update cached leftmost every time it is found.

Signed-off-by: Alan D. Brunelle <[EMAIL PROTECTED]>
---
 block/cfq-iosched.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8093733..a86a7c3 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -388,10 +388,10 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
  */
 static struct rb_node *cfq_rb_first(struct cfq_rb_root *root)
 {
-	if (root->left)
-		return root->left;
+	if (!root->left)
+		root->left = rb_first(&root->rb);
 
-	return rb_first(&root->rb);
+	return root->left;
 }
 
 static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
Re: [PATCH 0/15] CFQ IO scheduler patch series
Using the patches posted yesterday (http://marc.info/?l=linux-kernel&m=117740312628325&w=2), here are some quick read results (as measured by iostat over a 5 minute period, taken in 6 second intervals) on a 4-way IA64 box with 42 disks (24 FC and 18 U320), with 42 processes (1 per disk) and 256 AIOs (16KB) outstanding at all times per device:

  2.6.21-rc7:                        1,006.023 MB/second
  2.6.21-rc7 + new CFQ IO scheduler: 1,030.767 MB/second

showing about a 2.46% performance improvement with a 2.43% increase in %system used (3.738% -> 3.829%). Interestingly enough, this patch also seems to remove some noise during the run - see the chart at http://free.linux.hp.com/~adb/cfq/rkb_s.png

Alan D. Brunelle
HP / Open Source and Linux Organization / Scalability and Performance Group