Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

* Colin Fowler <[EMAIL PROTECTED]> wrote:

> > there are a handful of 'scheduler feature bits' in
> > /proc/sys/kernel/sched_features:
> >
> >  enum {
> >          SCHED_FEAT_NEW_FAIR_SLEEPERS    = 1,
> >          SCHED_FEAT_WAKEUP_PREEMPT       = 2,
> >          SCHED_FEAT_START_DEBIT          = 4,
> >          SCHED_FEAT_TREE_AVG             = 8,
> >          SCHED_FEAT_APPROX_AVG           = 16,
> >  };
>
> Toggling SCHED_FEAT_NEW_FAIR_SLEEPERS to 0 or SCHED_FEAT_WAKEUP_PREEMPT
> to 0 gives me results more in line with my 2.6.22 results. Toggling
> them both to 0 gives me slightly better results than 2.6.22!

ok, but it would be nice to avoid having to turn these off. Could you
try whether tuning the /proc/sys/kernel/*granularity* values (in
particular wakeup_granularity) has any positive effect on your workload?

also, could you run your workload as SCHED_BATCH [via schedtool -B],
does that improve the results as well on a default-tuned kernel?

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

> there are a handful of 'scheduler feature bits' in
> /proc/sys/kernel/sched_features:
>
>  enum {
>          SCHED_FEAT_NEW_FAIR_SLEEPERS    = 1,
>          SCHED_FEAT_WAKEUP_PREEMPT       = 2,
>          SCHED_FEAT_START_DEBIT          = 4,
>          SCHED_FEAT_TREE_AVG             = 8,
>          SCHED_FEAT_APPROX_AVG           = 16,
>  };

Toggling SCHED_FEAT_NEW_FAIR_SLEEPERS to 0 or SCHED_FEAT_WAKEUP_PREEMPT
to 0 gives me results more in line with my 2.6.22 results. Toggling them
both to 0 gives me slightly better results than 2.6.22!

> /sys/devices/system/cpu/sched_mc_power_savings
>
> does that change the results?

no measurable difference on this toggle that I can see.
Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

Hi Ingo, I have permission for a binary-only release (I mailed the
supervisor immediately after your earlier mail). I'm sure the abstract
code simulating the workload will be alright too, but I need time to put
it together as I'm a bit swamped at the moment. I hope to have it in the
next few days.

regards,
   Colin

On Jan 16, 2008 4:19 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> * Colin Fowler <[EMAIL PROTECTED]> wrote:
>
> > Hi Ingo, I'll need to convince my supervisor first if I can release a
> > binary. Technically anything like this needs to go through our
> > University's "innovations department" and requires lengthy paperwork
> > and NDAs :(.
>
> a binary wouldn't work for me anyway. But you could try to write a
> "workload simulator": just pick out the pthread ops and replace the
> worker functions with some dummy stuff that just touches an array that
> has similar size to the tiles (in a tight loop). Make sure it has
> similar context-switch rate and idle percentage as your real workload -
> then send us the .c file. As long as it's a single .c file that runs
> for a few seconds and outputs a precise enough "run time" result,
> kernel developers would pick it up and use it for optimizations. To get
> the # of cpus automatically you can do:
>
>         cpus = system("exit `grep processor /proc/cpuinfo | wc -l`");
>         cpus = WEXITSTATUS(cpus);
>
> and start as many threads as there are CPUs in the system.
>
>         Ingo
Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

* Colin Fowler <[EMAIL PROTECTED]> wrote:

> Hi Ingo, I'll need to convince my supervisor first if I can release a
> binary. Technically anything like this needs to go through our
> University's "innovations department" and requires lengthy paperwork
> and NDAs :(.

a binary wouldn't work for me anyway. But you could try to write a
"workload simulator": just pick out the pthread ops and replace the
worker functions with some dummy stuff that just touches an array that
has similar size to the tiles (in a tight loop). Make sure it has
similar context-switch rate and idle percentage as your real workload -
then send us the .c file. As long as it's a single .c file that runs for
a few seconds and outputs a precise enough "run time" result, kernel
developers would pick it up and use it for optimizations. To get the #
of cpus automatically you can do:

        cpus = system("exit `grep processor /proc/cpuinfo | wc -l`");
        cpus = WEXITSTATUS(cpus);

and start as many threads as there are CPUs in the system.

	Ingo
Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

Hi Ingo, I'll need to convince my supervisor first if I can release a
binary. Technically anything like this needs to go through our
University's "innovations department" and requires lengthy paperwork
and NDAs :(.

regards,
   Colin

On Jan 16, 2008 3:35 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> * Colin Fowler <[EMAIL PROTECTED]> wrote:
>
> > > and context-switches 45K times a second. Do you know what is going
> > > on there? I thought ray-tracing is something that can be
> > > parallelized pretty efficiently, without having to contend and
> > > schedule too much.
> >
> > This is a RTRT (real-time ray tracing) system and as a result differs
> > from traditional offline ray-tracers as it is optimised for speed.
> > The benchmark I ran while these data were collected renders an 80K
> > polygon scene to a 512x512 buffer at just over 100fps.
> >
> > The context switches are most likely caused by the pthreads
> > synchronisation code. There are two mutexes. Each job is a 32x32 tile
> > and each mutex is therefore unlocked (512/32) * (512/32) * 100 (for
> > 100fps) * 2 =~ 50K times a second. That's very likely where our
> > context switches are coming from. Larger tile sizes would of course
> > reduce the locking overhead, but then the ray-tracer suffers from
> > load imbalance as some tiles are much quicker to render than others.
> > Empirically we've found that this tile size works the best for us.
> >
> > The CPU idling occurs as the system doesn't yet perform asynchronous
> > rendering. When all tiles in the current job queue are finished the
> > current frame is done. At this point all worker threads sleep while
> > the master thread blits the image to the screen and fills the job
> > queue for the next frame. The data probably show that one CPU is kept
> > maxed and the others reach about 90% most of the time. This is
> > something on my TODO list to fix along with a myriad of other
> > optimisations :)
>
> is this something i could run myself and see how it behaves with
> various scheduler settings? (if yes, where can i download it and is
> there any sample scene that would show similar effects.)
>
>         Ingo
Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

* Colin Fowler <[EMAIL PROTECTED]> wrote:

> > and context-switches 45K times a second. Do you know what is going
> > on there? I thought ray-tracing is something that can be
> > parallelized pretty efficiently, without having to contend and
> > schedule too much.
>
> This is a RTRT (real-time ray tracing) system and as a result differs
> from traditional offline ray-tracers as it is optimised for speed. The
> benchmark I ran while these data were collected renders an 80K polygon
> scene to a 512x512 buffer at just over 100fps.
>
> The context switches are most likely caused by the pthreads
> synchronisation code. There are two mutexes. Each job is a 32x32 tile
> and each mutex is therefore unlocked (512/32) * (512/32) * 100 (for
> 100fps) * 2 =~ 50K times a second. That's very likely where our
> context switches are coming from. Larger tile sizes would of course
> reduce the locking overhead, but then the ray-tracer suffers from load
> imbalance as some tiles are much quicker to render than others.
> Empirically we've found that this tile size works the best for us.
>
> The CPU idling occurs as the system doesn't yet perform asynchronous
> rendering. When all tiles in the current job queue are finished the
> current frame is done. At this point all worker threads sleep while
> the master thread blits the image to the screen and fills the job
> queue for the next frame. The data probably show that one CPU is kept
> maxed and the others reach about 90% most of the time. This is
> something on my TODO list to fix along with a myriad of other
> optimisations :)

is this something i could run myself and see how it behaves with various
scheduler settings? (if yes, where can i download it and is there any
sample scene that would show similar effects.)

	Ingo
Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

Hi Ingo, I'll get the results tomorrow as I'm now out of the office, but
I can perhaps answer some of your queries now.

On Jan 15, 2008 10:06 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> hm, the system has considerable idle time left:
>
>  r  b swpd    free   buff   cache  si so bi bo  in    cs us sy id wa
>  8  0    0 1201920 683840 1039100   0  0  3  2      2746  1  0 99  0
>  2  0    0 1202168 683840 1039112   0  0  0  0 245 45339 80  2 17  0
>  2  0    0 1202168 683840 1039112   0  0  0  0 263 47349 84  3 14  0
>  2  0    0 1202300 683848 1039112   0  0  0 76 255 47057 84  3 13  0
>
> and context-switches 45K times a second. Do you know what is going on
> there? I thought ray-tracing is something that can be parallelized
> pretty efficiently, without having to contend and schedule too much.

This is a RTRT (real-time ray tracing) system and as a result differs
from traditional offline ray-tracers as it is optimised for speed. The
benchmark I ran while these data were collected renders an 80K polygon
scene to a 512x512 buffer at just over 100fps.

The context switches are most likely caused by the pthreads
synchronisation code. There are two mutexes. Each job is a 32x32 tile
and each mutex is therefore unlocked (512/32) * (512/32) * 100 (for
100fps) * 2 =~ 50K times a second. That's very likely where our context
switches are coming from. Larger tile sizes would of course reduce the
locking overhead, but then the ray-tracer suffers from load imbalance as
some tiles are much quicker to render than others. Empirically we've
found that this tile size works the best for us.

The CPU idling occurs as the system doesn't yet perform asynchronous
rendering. When all tiles in the current job queue are finished the
current frame is done. At this point all worker threads sleep while the
master thread blits the image to the screen and fills the job queue for
the next frame. The data probably show that one CPU is kept maxed and
the others reach about 90% most of the time. This is something on my
TODO list to fix along with a myriad of other optimisations :)

regards,
   Colin
Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

* Colin Fowler <[EMAIL PROTECTED]> wrote:

> These data may be much better for you. It's a single 15 second data
> collection run only when the actual ray-tracing is happening. These
> data do not therefore cover the data structure building phase.
>
> http://vangogh.cs.tcd.ie/fowler/cfs2/

hm, the system has considerable idle time left:

 r  b swpd    free   buff   cache  si so bi bo  in    cs us sy id wa
 8  0    0 1201920 683840 1039100   0  0  3  2      2746  1  0 99  0
 2  0    0 1202168 683840 1039112   0  0  0  0 245 45339 80  2 17  0
 2  0    0 1202168 683840 1039112   0  0  0  0 263 47349 84  3 14  0
 2  0    0 1202300 683848 1039112   0  0  0 76 255 47057 84  3 13  0

and context-switches 45K times a second. Do you know what is going on
there? I thought ray-tracing is something that can be parallelized
pretty efficiently, without having to contend and schedule too much.

could you try to do a similar capture on 2.6.22 as well (under the same
phase of the same workload), as comparison?

there are a handful of 'scheduler feature bits' in
/proc/sys/kernel/sched_features:

 enum {
         SCHED_FEAT_NEW_FAIR_SLEEPERS    = 1,
         SCHED_FEAT_WAKEUP_PREEMPT       = 2,
         SCHED_FEAT_START_DEBIT          = 4,
         SCHED_FEAT_TREE_AVG             = 8,
         SCHED_FEAT_APPROX_AVG           = 16,
 };

 const_debug unsigned int sysctl_sched_features =
                 SCHED_FEAT_NEW_FAIR_SLEEPERS    * 1 |
                 SCHED_FEAT_WAKEUP_PREEMPT       * 1 |
                 SCHED_FEAT_START_DEBIT          * 1 |
                 SCHED_FEAT_TREE_AVG             * 0 |
                 SCHED_FEAT_APPROX_AVG           * 0;

[as of 2.6.24-rc7]

could you try to turn some of them off/on? In particular toggling
WAKEUP_PREEMPT might have an effect, and NEW_FAIR_SLEEPERS might have an
effect as well. (TREE_AVG and APPROX_AVG probably have little effect.)

other debug tunables you might want to look into are in the
/proc/sys/kernel/sched_domains hierarchy.

also, if you toggle:

  /sys/devices/system/cpu/sched_mc_power_savings

does that change the results?

	Ingo
Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

These data may be much better for you. It's a single 15 second data
collection run only when the actual ray-tracing is happening. These
data do not therefore cover the data structure building phase.

http://vangogh.cs.tcd.ie/fowler/cfs2/

   Colin

On Jan 14, 2008 10:42 PM, Colin Fowler <[EMAIL PROTECTED]> wrote:
> Hi Ingo, thanks for the reply.
>
> Modifying /proc/sys/kernel/sched_latency_ns to be double may have in
> fact made things slightly worse. I used 2.6.24-rc7.
>
> Your script was only written to run for 15 seconds, so I ran it
> multiple times so it covered most of the benchmark.
>
> Another issue with these data may be that for much of the benchmark I
> am building data structures utilizing at most 1 to 3 cores. I'm not
> concerned with these timings personally as this is considered the
> offline part of the render. Once these data structures are built I
> proceed to render across 8 cores. This is the section of the benchmark
> I get my timings from (I use RDTSC before and after the render
> segment). The majority of the overall time taken for a run is
> therefore data structure building. I do not time this.
>
> Colin.
>
> On Jan 14, 2008 6:55 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >
> > * Colin Fowler <[EMAIL PROTECTED]> wrote:
> >
> > > Benchmark: A ray-trace is performed 500 times on 17 separate
> > > scenes. Workload is distributed by tiling the framebuffer into N
> > > 32x32 pixel tiles. Each CPU grabs one of N tiles out of the queue
> > > and repeats until no jobs are left. Rendering is to a shared
> > > framebuffer (obviously this causes problems with caching). Locking
> > > and synchronization is done using pthreads.
> > >
> > > Other details: The system is cleanly booted for each run. No I/O
> > > is performed during the timed portions of the test. The benchmark
> > > does however read a model file from the drive and build a data
> > > structure from it before each timed portion.
> > >
> > > On the 2.6.22 series of kernels results are pretty much the same.
> > > On 2.6.23 series kernels I see a loss in speed of ~2% across the
> > > board. On 2.6.24-rc7 that loss in speed is perhaps very slightly
> > > worse (~3%).
> > > 2.6.22 kernels tested: 22.9 (Ubuntu stock kernel), 22.14, 22.15
> > > 2.6.23 kernels tested: 23.1, 23.3, 23.13
> > > 2.6.24 kernels tested: 24-rc7
> > >
> > > I have my kernel compiled to use the SLAB allocator. All other
> > > tweaking options are set as defaults. My config files are
> > > available at http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm
> > > configuring something wrong for the type of work I do?
> >
> > Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and
> > double the value of /proc/sys/kernel/sched_latency_ns - does that
> > make any difference? Please also run the following script while the
> > ray-trace app is running:
> >
> >   http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
> >
> > and send me the output of it, so that we can have an idea about
> > what's going on in your system during this workload.
> >
> >         Ingo
Re: Performance loss 2.6.22-22.6.23-2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
These data may be much better for you. It's a single 15 second data collection run only when the actual ray-tracing is happening. These data do not therefore cover the data structure building phase. http://vangogh.cs.tcd.ie/fowler/cfs2/ Colin On Jan 14, 2008 10:42 PM, Colin Fowler [EMAIL PROTECTED] wrote: Hi Ingo, thanks for the reply. Modifying /proc/sys/kernel/sched_latency_ns to be double may have in fact made things slightly worse. I used 24-rc7 Your script was only written to run for 15 seconds, so I ran it so it multiple times so it covered most of the benchmark. Other issues with these data may be that for much of the benchmark I am building data structures utilizing at most 1 to 3 cores. I'm not concerned with these timings personally as this is considered the offline part of the render. Once these data structures are built I proceed to render across 8 cores. This is the section of the benchmark I get my timings from ( I use RDTSC before and after the render segment). The majority of the overall time taken for a run is therefore data structure building. I do not time this. Colin. On Jan 14, 2008 6:55 PM, Ingo Molnar [EMAIL PROTECTED] wrote: * Colin Fowler [EMAIL PROTECTED] wrote: Benchmark : A ray-trace is performed on 500 times on 17 separate scenes. Workload is distributed by tiling the framebuffer into N 32x32 pixel tiles. Each CPU grabs one of N tiles out of the queue and repeats until no jobs are left. Rendering is to a shared framebuffer (obviously this causes problems with caching). Locking and synchronization is done using pthreads. Other details: The system is cleanly booted for each run. No I/O is performed during the timed portions of the test. The benchmark does however read a model file from the drive and build a data structure from it before each timed portion. On the 2.6.22 series of kernels results are pretty much the same. On 2.6.23 series kernels I see a loss in speed of ~2% across the board. 
On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%). 2.6.22 Kernels tested: 22.9(Ubuntu Stock Kernel), 22.14, 22.15 2.6.23 Kernels tested: 23.1, 23.3, 23.13 2.6.24 Kernels tested: 24-rc7 I have my kernel compiled to use the SLAB allocator. All other tweaking options are set as defaults. My config files are available at http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring something wrong for the type of work I do? Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and double the value of /proc/sys/kernel/sched_latency_ns - does that make any difference? Please also run the following script while the ray-trace app is running: http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh and send me the output of it, so that we can have an idea about what's going on in your system during this workload. Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Performance loss 2.6.22-22.6.23-2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
* Colin Fowler [EMAIL PROTECTED] wrote: These data may be much better for you. It's a single 15 second data collection run only when the actual ray-tracing is happening. These data do not therefore cover the data structure building phase. http://vangogh.cs.tcd.ie/fowler/cfs2/ hm, the system has considerable idle time left: r b swpd free buff cache si so bi bo incs us sy id wa 8 00 1201920 683840 1039100 0 0 3 2 2746 1 0 99 0 2 00 1202168 683840 1039112 0 0 0 0 245 45339 80 2 17 0 2 00 1202168 683840 1039112 0 0 0 0 263 47349 84 3 14 0 2 00 1202300 683848 1039112 0 0 0 76 255 47057 84 3 13 0 and context-switches 45K times a second. Do you know what is going on there? I thought ray-tracing is something that can be parallelized pretty efficiently, without having to contend and schedule too much. could you try to do a similar capture on 2.6.22 as well (under the same phase of the same workload), as comparison? there are a handful of 'scheduler feature bits' in /proc/sys/kernel/sched_features: enum { SCHED_FEAT_NEW_FAIR_SLEEPERS= 1, SCHED_FEAT_WAKEUP_PREEMPT = 2, SCHED_FEAT_START_DEBIT = 4, SCHED_FEAT_TREE_AVG = 8, SCHED_FEAT_APPROX_AVG = 16, }; const_debug unsigned int sysctl_sched_features = SCHED_FEAT_NEW_FAIR_SLEEPERS* 1 | SCHED_FEAT_WAKEUP_PREEMPT * 1 | SCHED_FEAT_START_DEBIT * 1 | SCHED_FEAT_TREE_AVG * 0 | SCHED_FEAT_APPROX_AVG * 0; [as of 2.6.24-rc7] could you try to turn some of them off/on. In particular toggling WAKEUP_PREEMPT might have an effect, and NEW_FAIR_SLEEPERS might have an effect as well. (TREE_AVG and APPROX_AVG has probably little effect) other debug-tunables you might want to look into are in the /proc/sys/kernel/sched_domains hierarchy. also, if you toggle: /sys/devices/system/cpu/sched_mc_power_savings does that change the results? 
	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
Hi Ingo,

I'll get the results tomorrow as I'm now out of the office, but I can
perhaps answer some of your queries now.

On Jan 15, 2008 10:06 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> hm, the system has considerable idle time left:
>
>   r  b swpd    free   buff   cache  si so bi bo  in    cs us sy id wa
>   8  0    0 1201920 683840 1039100   0  0  3  2  27    46  1  0 99  0
>   2  0    0 1202168 683840 1039112   0  0  0  0 245 45339 80  2 17  0
>   2  0    0 1202168 683840 1039112   0  0  0  0 263 47349 84  3 14  0
>   2  0    0 1202300 683848 1039112   0  0  0 76 255 47057 84  3 13  0
>
> and context-switches 45K times a second. Do you know what is going on
> there? I thought ray-tracing is something that can be parallelized
> pretty efficiently, without having to contend and schedule too much.

This is a RTRT (real-time ray tracing) system and as a result differs
from traditional offline ray-tracers, as it is optimised for speed. The
benchmark I ran while these data were collected renders an 80K polygon
scene to a 512x512 buffer at just over 100fps.

The context switches are most likely caused by the pthreads
synchronisation code. There are two mutexes. Each job is a 32x32 tile,
and each mutex is therefore unlocked (512/32) * (512/32) * 100 (for
100fps) * 2 ~= 50k times a second. That's very likely where our context
switches are coming from. Larger tile sizes would of course reduce the
locking overhead, but then the ray-tracer suffers from load imbalance,
as some tiles are much quicker to render than others. Empirically we've
found that this tile size works the best for us.

The CPU idling occurs as the system doesn't yet perform asynchronous
rendering. When all tiles in the current job queue are finished, the
current frame is done. At this point all worker threads sleep while the
master thread blits the image to the screen and fills the job queue for
the next frame. The data probably show that one CPU is kept maxed and
the others reach about 90% most of the time.
This is something on my TODO list to fix along with a myriad of other
optimisations :)

regards,
 Colin
Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
Hi Ingo, thanks for the reply.

Modifying /proc/sys/kernel/sched_latency_ns to be double may have in
fact made things slightly worse. I used 24-rc7.

Your script was only written to run for 15 seconds, so I ran it
multiple times so that it covered most of the benchmark. Another issue
with these data may be that for much of the benchmark I am building
data structures, utilizing at most 1 to 3 cores. I'm not concerned with
these timings personally, as this is considered the offline part of the
render. Once these data structures are built I proceed to render across
8 cores. This is the section of the benchmark I get my timings from (I
use RDTSC before and after the render segment). The majority of the
overall time taken for a run is therefore data structure building. I do
not time this.

Colin.

On Jan 14, 2008 6:55 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> * Colin Fowler <[EMAIL PROTECTED]> wrote:
>
> > Benchmark: A ray-trace is performed 500 times on 17 separate
> > scenes. Workload is distributed by tiling the framebuffer into N
> > 32x32 pixel tiles. Each CPU grabs one of N tiles out of the queue
> > and repeats until no jobs are left. Rendering is to a shared
> > framebuffer (obviously this causes problems with caching). Locking
> > and synchronization is done using pthreads.
> >
> > Other details: The system is cleanly booted for each run. No I/O is
> > performed during the timed portions of the test. The benchmark does
> > however read a model file from the drive and build a data structure
> > from it before each timed portion.
> >
> > On the 2.6.22 series of kernels results are pretty much the same. On
> > 2.6.23 series kernels I see a loss in speed of ~2% across the board.
> > On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
> > 2.6.22 kernels tested: 22.9 (Ubuntu stock kernel), 22.14, 22.15
> > 2.6.23 kernels tested: 23.1, 23.3, 23.13
> > 2.6.24 kernels tested: 24-rc7
> >
> > I have my kernel compiled to use the SLAB allocator. All other
> > tweaking options are set as defaults. My config files are available at
> > http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring
> > something wrong for the type of work I do?
>
> Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and double
> the value of /proc/sys/kernel/sched_latency_ns - does that make any
> difference? Please also run the following script while the ray-trace
> app is running:
>
>   http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
>
> and send me the output of it, so that we can have an idea about what's
> going on in your system during this workload.
>
>	Ingo
Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
Forgot to add that the results are at http://vangogh.cs.tcd.ie/fowler/cfs/

On Jan 14, 2008 10:42 PM, Colin Fowler <[EMAIL PROTECTED]> wrote:
> Hi Ingo, thanks for the reply.
>
> Modifying /proc/sys/kernel/sched_latency_ns to be double may have in
> fact made things slightly worse. I used 24-rc7.
>
> Your script was only written to run for 15 seconds, so I ran it
> multiple times so that it covered most of the benchmark. Another issue
> with these data may be that for much of the benchmark I am building
> data structures, utilizing at most 1 to 3 cores. I'm not concerned
> with these timings personally, as this is considered the offline part
> of the render. Once these data structures are built I proceed to
> render across 8 cores. This is the section of the benchmark I get my
> timings from (I use RDTSC before and after the render segment). The
> majority of the overall time taken for a run is therefore data
> structure building. I do not time this.
>
> Colin.
>
> On Jan 14, 2008 6:55 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >
> > * Colin Fowler <[EMAIL PROTECTED]> wrote:
> >
> > > Benchmark: A ray-trace is performed 500 times on 17 separate
> > > scenes. Workload is distributed by tiling the framebuffer into N
> > > 32x32 pixel tiles. Each CPU grabs one of N tiles out of the queue
> > > and repeats until no jobs are left. Rendering is to a shared
> > > framebuffer (obviously this causes problems with caching). Locking
> > > and synchronization is done using pthreads.
> > >
> > > Other details: The system is cleanly booted for each run. No I/O is
> > > performed during the timed portions of the test. The benchmark does
> > > however read a model file from the drive and build a data structure
> > > from it before each timed portion.
> > >
> > > On the 2.6.22 series of kernels results are pretty much the same. On
> > > 2.6.23 series kernels I see a loss in speed of ~2% across the board.
> > > On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
> > > 2.6.22 kernels tested: 22.9 (Ubuntu stock kernel), 22.14, 22.15
> > > 2.6.23 kernels tested: 23.1, 23.3, 23.13
> > > 2.6.24 kernels tested: 24-rc7
> > >
> > > I have my kernel compiled to use the SLAB allocator. All other
> > > tweaking options are set as defaults. My config files are available at
> > > http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring
> > > something wrong for the type of work I do?
> >
> > Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and double
> > the value of /proc/sys/kernel/sched_latency_ns - does that make any
> > difference? Please also run the following script while the ray-trace
> > app is running:
> >
> >   http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
> >
> > and send me the output of it, so that we can have an idea about
> > what's going on in your system during this workload.
> >
> >	Ingo
Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
* Colin Fowler <[EMAIL PROTECTED]> wrote:

> Benchmark: A ray-trace is performed 500 times on 17 separate scenes.
> Workload is distributed by tiling the framebuffer into N 32x32 pixel
> tiles. Each CPU grabs one of N tiles out of the queue and repeats
> until no jobs are left. Rendering is to a shared framebuffer
> (obviously this causes problems with caching). Locking and
> synchronization is done using pthreads.
>
> Other details: The system is cleanly booted for each run. No I/O is
> performed during the timed portions of the test. The benchmark does
> however read a model file from the drive and build a data structure
> from it before each timed portion.
>
> On the 2.6.22 series of kernels results are pretty much the same. On
> 2.6.23 series kernels I see a loss in speed of ~2% across the board.
> On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
> 2.6.22 kernels tested: 22.9 (Ubuntu stock kernel), 22.14, 22.15
> 2.6.23 kernels tested: 23.1, 23.3, 23.13
> 2.6.24 kernels tested: 24-rc7
>
> I have my kernel compiled to use the SLAB allocator. All other
> tweaking options are set as defaults. My config files are available at
> http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring
> something wrong for the type of work I do?

Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and double
the value of /proc/sys/kernel/sched_latency_ns - does that make any
difference? Please also run the following script while the ray-trace
app is running:

  http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

and send me the output of it, so that we can have an idea about what's
going on in your system during this workload.

	Ingo
Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
Please CC me as I'm not subscribed.

I have (what is to me) a strange and very repeatable slowdown for a CPU
intensive benchmark on my system on newer kernels.

Hardware: Dell Precision 470, 2x 2.0GHz quad-core Xeon E5335 CPUs, 4GB
ECC RAM
OS: Ubuntu x86_64 7.10 (Gutsy Gibbon)
Compiler: gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)

Benchmark: A ray-trace is performed 500 times on 17 separate scenes.
Workload is distributed by tiling the framebuffer into N 32x32 pixel
tiles. Each CPU grabs one of N tiles out of the queue and repeats until
no jobs are left. Rendering is to a shared framebuffer (obviously this
causes problems with caching). Locking and synchronization is done
using pthreads.

Other details: The system is cleanly booted for each run. No I/O is
performed during the timed portions of the test. The benchmark does
however read a model file from the drive and build a data structure
from it before each timed portion.

On the 2.6.22 series of kernels results are pretty much the same. On
2.6.23 series kernels I see a loss in speed of ~2% across the board. On
2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
2.6.22 kernels tested: 22.9 (Ubuntu stock kernel), 22.14, 22.15
2.6.23 kernels tested: 23.1, 23.3, 23.13
2.6.24 kernels tested: 24-rc7

I have my kernel compiled to use the SLAB allocator. All other tweaking
options are set as defaults. My config files are available at
http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring
something wrong for the type of work I do?

regards,
 Colin