Re: Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-18 Thread Ingo Molnar

* Colin Fowler <[EMAIL PROTECTED]> wrote:

> > there are a handful of 'scheduler feature bits' in
> > /proc/sys/kernel/sched_features:
> >
> > enum {
> > SCHED_FEAT_NEW_FAIR_SLEEPERS= 1,
> > SCHED_FEAT_WAKEUP_PREEMPT   = 2,
> > SCHED_FEAT_START_DEBIT  = 4,
> > SCHED_FEAT_TREE_AVG = 8,
> > SCHED_FEAT_APPROX_AVG   = 16,
> > };
> >
> 
> Toggling SCHED_FEAT_NEW_FAIR_SLEEPERS to 0 or 
> SCHED_FEAT_WAKEUP_PREEMPT to 0 gives me results more in line with my 
> 2.6.22 results. Toggling them both to 0 gives me slightly better 
> results than 2.6.22!

ok, but it would be nice to avoid having to turn these off. Could you 
try whether tuning the /proc/sys/kernel/*granularity* values (in 
particular wakeup_granularity) has any positive effect on your workload?

also, could you run your workload as SCHED_BATCH [via schedtool -B]? Does 
that improve the results as well on a default-tuned kernel?

Ingo
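
[A minimal sketch of what the SCHED_BATCH suggestion amounts to when done
from inside the benchmark instead of via schedtool -B; the error handling
and the point where the worker threads get started are illustrative
assumptions, not something taken from this thread:]

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param param = { .sched_priority = 0 };

        /* SCHED_BATCH marks the process as CPU-bound and non-interactive,
         * so the scheduler treats it less like a latency-sensitive task -
         * the in-process equivalent of launching under "schedtool -B". */
        if (sched_setscheduler(0, SCHED_BATCH, &param) == -1) {
                perror("sched_setscheduler(SCHED_BATCH)");
                return 1;
        }

        /* ... spawn the render worker threads here; with the default
         * pthread attributes they inherit the SCHED_BATCH policy ... */
        return 0;
}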


Re: Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-16 Thread Colin Fowler
> there are a handful of 'scheduler feature bits' in
> /proc/sys/kernel/sched_features:
>
> enum {
> SCHED_FEAT_NEW_FAIR_SLEEPERS= 1,
> SCHED_FEAT_WAKEUP_PREEMPT   = 2,
> SCHED_FEAT_START_DEBIT  = 4,
> SCHED_FEAT_TREE_AVG = 8,
> SCHED_FEAT_APPROX_AVG   = 16,
> };
>

Toggling SCHED_FEAT_NEW_FAIR_SLEEPERS to 0 or
SCHED_FEAT_WAKEUP_PREEMPT to 0 gives me results more in line with my
2.6.22 results. Toggling them both to 0 gives me slightly better
results than 2.6.22!

>   /sys/devices/system/cpu/sched_mc_power_savings
>
> does that change the results?
>

no measurable difference on this toggle that I can see.


Re: Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-16 Thread Colin Fowler
Hi Ingo,

I have permission for a binary-only release (I mailed the supervisor
immediately after your earlier mail). I'm sure the abstract code
simulating the workload will be alright too, but I need time to put it
together as I'm a bit swamped at the moment. I hope to have it ready in
the next few days.

regards,
   Colin

On Jan 16, 2008 4:19 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> * Colin Fowler <[EMAIL PROTECTED]> wrote:
>
> > Hi Ingo, I'll need to convince my supervisor first if I can release a
> > binary. Technically anything like this needs to go through our
> > University's "innovations department" and requires lengthy paperwork
> > and NDAs :(.
>
> a binary wouldn't work for me anyway. But you could try to write a
> "workload simulator": just pick out the pthread ops and replace the
> worker functions with some dummy stuff that just touches an array that
> has similar size to the tiles (in a tight loop). Make sure it has
> similar context-switch rate and idle percentage as your real workload -
> then send us the .c file. As long as it's a single .c file that runs for
> a few seconds and outputs a precise enough "run time" result, kernel
> developers would pick it up and use it for optimizations. To get the #
> of cpus automatically you can do:
>
> cpus = system("exit `grep processor /proc/cpuinfo  | wc -l`");
> cpus = WEXITSTATUS(cpus);
>
> and start as many threads as there are CPUs in the system.
>
> Ingo
>


Re: Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-16 Thread Ingo Molnar

* Colin Fowler <[EMAIL PROTECTED]> wrote:

> Hi Ingo, I'll need to convince my supervisor first if I can release a 
> binary. Technically anything like this needs to go through our 
> University's "innovations department" and requires lengthy paperwork 
> and NDAs :(.

a binary wouldn't work for me anyway. But you could try to write a 
"workload simulator": just pick out the pthread ops and replace the 
worker functions with some dummy stuff that just touches an array that 
has similar size to the tiles (in a tight loop). Make sure it has 
similar context-switch rate and idle percentage as your real workload - 
then send us the .c file. As long as it's a single .c file that runs for 
a few seconds and outputs a precise enough "run time" result, kernel 
developers would pick it up and use it for optimizations. To get the # 
of cpus automatically you can do:

cpus = system("exit `grep processor /proc/cpuinfo  | wc -l`");
cpus = WEXITSTATUS(cpus);

and start as many threads as there are CPUs in the system.

Ingo
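
[A rough, self-contained starting point for such a simulator (build with
"gcc -O2 -pthread"); the tile count, the dummy per-tile work and the
create/join-per-frame structure are placeholder assumptions to be tuned
until the context-switch rate and idle time roughly match the real
renderer, which keeps its workers alive across frames:]

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define FRAMES     500
#define TILES      256               /* (512/32) * (512/32) tiles per frame */
#define TILE_BYTES (32 * 32 * 4)     /* one RGBA tile */

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int tiles_left;

static void *worker(void *arg)
{
        unsigned char tile[TILE_BYTES];
        int i;

        (void)arg;
        for (;;) {
                pthread_mutex_lock(&queue_lock);
                if (tiles_left <= 0) {
                        pthread_mutex_unlock(&queue_lock);
                        return NULL;
                }
                tiles_left--;
                pthread_mutex_unlock(&queue_lock);

                /* dummy "render": just touch a tile-sized array in a loop */
                for (i = 0; i < 200; i++)
                        memset(tile, i, sizeof(tile));
        }
}

int main(void)
{
        pthread_t *threads;
        struct timeval t0, t1;
        int cpus, i, frame;

        /* CPU detection as suggested above */
        cpus = system("exit `grep processor /proc/cpuinfo | wc -l`");
        cpus = WEXITSTATUS(cpus);
        if (cpus < 1)
                cpus = 1;

        threads = calloc(cpus, sizeof(*threads));
        gettimeofday(&t0, NULL);
        for (frame = 0; frame < FRAMES; frame++) {
                tiles_left = TILES;
                for (i = 0; i < cpus; i++)
                        pthread_create(&threads[i], NULL, worker, NULL);
                for (i = 0; i < cpus; i++)
                        pthread_join(threads[i], NULL);
                /* the master would blit and refill the job queue here */
        }
        gettimeofday(&t1, NULL);

        printf("run time: %.3f s\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
        free(threads);
        return 0;
}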


Re: Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-16 Thread Colin Fowler
Hi Ingo, I'll need to convince my supervisor first if I can release a
binary. Technically anything like this needs to go through our
University's "innovations department" and requires lengthy paperwork
and NDAs :(.

regards,
Colin

On Jan 16, 2008 3:35 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> * Colin Fowler <[EMAIL PROTECTED]> wrote:
>
>
> > > and context-switches 45K times a second. Do you know what is going
> > > on there? I thought ray-tracing is something that can be
> > > parallelized pretty efficiently, without having to contend and
> > > schedule too much.
> >
> > This is an RTRT (real-time ray tracing) system and as a result differs
> > from traditional offline ray-tracers as it is optimised for speed. The
> > benchmark I ran while these data were collected renders an 80K polygon
> > scene to a 512x512 buffer at just over 100fps.
> >
> > The context switches are most likely caused by the pthreads
> > synchronisation code. There are two mutexes. Each job is a 32x32 tile,
> > so the mutexes are unlocked about (512/32) * (512/32) * 100 (for
> > 100fps) * 2 =~ 50k times a second. That's very likely where our context
> > switches are coming from. Larger tile sizes would of course reduce the
> > locking overhead, but then the ray-tracer suffers from load imbalance as
> > some tiles are much quicker to render than others. Empirically we've
> > found that this tile-size works the best for us.
> >
> > The CPU idling occurs as the system doesn't yet perform asynchronous
> > rendering. When all tiles in a current job queue are finished the
> > current frame is done. At this point all worker threads sleep while
> > the master thread blits the image to the screen and fills the job
> > queue for the next frame. The data probably shows that one CPU is kept
> > maxed and the others reach about 90% most of the time. This is
> > something on my TODO list to fix along with a myriad of other
> > optimisations :)
>
> is this something i could run myself and see how it behaves with various
> scheduler settings? (if yes, where can i download it and is there any
> sample scene that would show similar effects.)
>
> Ingo
>


Re: Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-16 Thread Ingo Molnar

* Colin Fowler <[EMAIL PROTECTED]> wrote:

> > and context-switches 45K times a second. Do you know what is going 
> > on there? I thought ray-tracing is something that can be 
> > parallelized pretty efficiently, without having to contend and 
> > schedule too much.
> 
> This is an RTRT (real-time ray tracing) system and as a result differs 
> from traditional offline ray-tracers as it is optimised for speed. The 
> benchmark I ran while these data were collected renders an 80K polygon 
> scene to a 512x512 buffer at just over 100fps.
> 
> The context switches are most likely caused by the pthreads 
> synchronisation code. There are two mutexes. Each job is a 32x32 tile, 
> so the mutexes are unlocked about (512/32) * (512/32) * 100 (for 
> 100fps) * 2 =~ 50k times a second. That's very likely where our context 
> switches are coming from. Larger tile sizes would of course reduce the 
> locking overhead, but then the ray-tracer suffers from load imbalance as 
> some tiles are much quicker to render than others. Empirically we've 
> found that this tile-size works the best for us.
> 
> The CPU idling occurs as the system doesn't yet perform asynchronous 
> rendering. When all tiles in a current job queue are finished the 
> current frame is done. At this point all worker threads sleep while 
> the master thread blits the image to the screen and fills the job 
> queue for the next frame. The data probably shows that one CPU is kept 
> maxed and the others reach about 90% most of the time. This is 
> something on my TODO list to fix along with a myriad of other 
> optimisations :)

is this something i could run myself and see how it behaves with various 
scheduler settings? (if yes, where can i download it and is there any 
sample scene that would show similar effects.)

Ingo


Re: Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-15 Thread Colin Fowler
Hi Ingo,
  I'll get the results tomorrow as I'm now out of the office, but
I can perhaps answer some of your queries now.

On Jan 15, 2008 10:06 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:

> hm, the system has considerable idle time left:
>
>  r  b  swpd    free   buff   cache si so bi bo  in    cs us sy id wa
>  8  0     0 1201920 683840 1039100  0  0  3  2  27    46  1  0 99  0
>  2  0     0 1202168 683840 1039112  0  0  0  0 245 45339 80  2 17  0
>  2  0     0 1202168 683840 1039112  0  0  0  0 263 47349 84  3 14  0
>  2  0     0 1202300 683848 1039112  0  0  0 76 255 47057 84  3 13  0
>
> and context-switches 45K times a second. Do you know what is going on
> there? I thought ray-tracing is something that can be parallelized
> pretty efficiently, without having to contend and schedule too much.
>

This is an RTRT (real-time ray tracing) system and as a result differs
from traditional offline ray-tracers as it is optimised for speed. The
benchmark I ran while these data were collected renders an 80K polygon
scene to a 512x512 buffer at just over 100fps.

The context switches are most likely caused by the pthreads
synchronisation code. There are two mutexes. Each job is a 32x32 tile,
so the mutexes are unlocked about (512/32) * (512/32) * 100 (for
100fps) * 2 =~ 50k times a second. That's very likely where our context
switches are coming from. Larger tile sizes would of course reduce the
locking overhead, but then the ray-tracer suffers from load imbalance as
some tiles are much quicker to render than others. Empirically we've
found that this tile-size works the best for us.

The CPU idling occurs as the system doesn't yet perform asynchronous
rendering. When all tiles in a current job queue are finished the
current frame is done. At this point all worker threads sleep while
the master thread blits the image to the screen and fills the job
queue for the next frame. The data probably shows that one CPU is kept
maxed and the others reach about 90% most of the time. This is
something on my TODO list to fix along with a myriad of other
optimisations :)

regards,
Colin
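
[To illustrate the handshake described above (not the actual renderer
code, and the counts are made up): persistent workers pull tile jobs from
a mutex-protected counter and sleep on a condition variable between
frames, while the master refills the queue and waits for frame
completion -- which is where both the heavy lock traffic and the
per-frame idle gap come from:]

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTILES  256                 /* (512/32) * (512/32) jobs per frame */
#define NFRAMES 100

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  more_work  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  frame_done = PTHREAD_COND_INITIALIZER;
static int tiles_left, tiles_done;

static void *worker(void *arg)
{
        (void)arg;
        for (;;) {
                pthread_mutex_lock(&lock);
                while (tiles_left == 0)         /* sleep between frames */
                        pthread_cond_wait(&more_work, &lock);
                tiles_left--;
                pthread_mutex_unlock(&lock);

                usleep(50);                     /* stand-in for rendering one tile */

                pthread_mutex_lock(&lock);
                if (++tiles_done == NTILES)     /* last tile of this frame */
                        pthread_cond_signal(&frame_done);
                pthread_mutex_unlock(&lock);
        }
}

int main(void)
{
        long i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        pthread_t tid;
        int frame;

        if (ncpus < 1)
                ncpus = 1;
        for (i = 0; i < ncpus; i++)
                pthread_create(&tid, NULL, worker, NULL);

        for (frame = 0; frame < NFRAMES; frame++) {
                pthread_mutex_lock(&lock);
                tiles_done = 0;
                tiles_left = NTILES;            /* master fills the job queue */
                pthread_cond_broadcast(&more_work);
                while (tiles_done < NTILES)     /* master waits for the frame */
                        pthread_cond_wait(&frame_done, &lock);
                pthread_mutex_unlock(&lock);
                /* the master would blit the framebuffer to the screen here */
        }
        printf("rendered %d frames\n", NFRAMES);
        return 0;                               /* process exit ends the workers */
}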


Re: Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-15 Thread Ingo Molnar

* Colin Fowler <[EMAIL PROTECTED]> wrote:

> These data may be much better for you. It's a single 15 second data 
> collection run only when the actual ray-tracing is happening. These 
> data do not therefore cover the data structure building phase.
> 
> http://vangogh.cs.tcd.ie/fowler/cfs2/

hm, the system has considerable idle time left:

 r  b  swpd    free   buff   cache si so bi bo  in    cs us sy id wa
 8  0     0 1201920 683840 1039100  0  0  3  2  27    46  1  0 99  0
 2  0     0 1202168 683840 1039112  0  0  0  0 245 45339 80  2 17  0
 2  0     0 1202168 683840 1039112  0  0  0  0 263 47349 84  3 14  0
 2  0     0 1202300 683848 1039112  0  0  0 76 255 47057 84  3 13  0

and context-switches 45K times a second. Do you know what is going on 
there? I thought ray-tracing is something that can be parallelized 
pretty efficiently, without having to contend and schedule too much.

could you try to do a similar capture on 2.6.22 as well (under the same 
phase of the same workload), as comparison?

there are a handful of 'scheduler feature bits' in 
/proc/sys/kernel/sched_features:

enum {
SCHED_FEAT_NEW_FAIR_SLEEPERS= 1,
SCHED_FEAT_WAKEUP_PREEMPT   = 2,
SCHED_FEAT_START_DEBIT  = 4,
SCHED_FEAT_TREE_AVG = 8,
SCHED_FEAT_APPROX_AVG   = 16,
};

const_debug unsigned int sysctl_sched_features =
SCHED_FEAT_NEW_FAIR_SLEEPERS* 1 |
SCHED_FEAT_WAKEUP_PREEMPT   * 1 |
SCHED_FEAT_START_DEBIT  * 1 |
SCHED_FEAT_TREE_AVG * 0 |
SCHED_FEAT_APPROX_AVG   * 0;

[as of 2.6.24-rc7]

could you try to turn some of them off/on. In particular toggling 
WAKEUP_PREEMPT might have an effect, and NEW_FAIR_SLEEPERS might have an 
effect as well. (TREE_AVG and APPROX_AVG probably have little effect.)

other debug-tunables you might want to look into are in the 
/proc/sys/kernel/sched_domains hierarchy.

also, if you toggle:

  /sys/devices/system/cpu/sched_mc_power_savings

does that change the results?

Ingo
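
[For reference, turning off NEW_FAIR_SLEEPERS and WAKEUP_PREEMPT while
leaving the other bits alone is just a read-modify-write of the bitmask
(normally a one-line "echo" as root); a small sketch, assuming the
integer /proc/sys/kernel/sched_features format of this 2.6.23/2.6.24-era
CONFIG_SCHED_DEBUG interface:]

#include <stdio.h>

int main(void)
{
        const char *path = "/proc/sys/kernel/sched_features";
        unsigned int features;
        FILE *f = fopen(path, "r");

        if (!f || fscanf(f, "%u", &features) != 1) {
                perror(path);
                return 1;
        }
        fclose(f);

        /* bit values taken from the enum above */
        features &= ~(1u /* NEW_FAIR_SLEEPERS */ | 2u /* WAKEUP_PREEMPT */);

        f = fopen(path, "w");                    /* needs root */
        if (!f || fprintf(f, "%u\n", features) < 0) {
                perror(path);
                return 1;
        }
        fclose(f);
        printf("sched_features is now %u\n", features);
        return 0;
}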


Re: Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-15 Thread Colin Fowler
These data may be much better for you. It's a single 15 second data
collection run only when the actual ray-tracing is happening. These
data do not therefore cover the data structure building phase.

http://vangogh.cs.tcd.ie/fowler/cfs2/

Colin


On Jan 14, 2008 10:42 PM, Colin Fowler <[EMAIL PROTECTED]> wrote:
> Hi Ingo, thanks for the reply.
>
> Doubling /proc/sys/kernel/sched_latency_ns may in fact have made
> things slightly worse. I used 2.6.24-rc7.
>
> Your script was only written to run for 15 seconds, so I ran it
> multiple times so that it covered most of the benchmark.
> Other issues with these data may be that for much of the benchmark I
> am building data structures utilizing at most 1 to 3 cores. I'm not
> concerned with these timings personally as this is considered the
> offline part of the render. Once these data structures are built I
> proceed to render across 8 cores. This is the section of the benchmark
> I get my timings from (I use RDTSC before and after the render
> segment). The majority of the overall time taken for a run is
> therefore data structure building. I do not time this.
>
> Colin.
>
>
>
> On Jan 14, 2008 6:55 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >
> > * Colin Fowler <[EMAIL PROTECTED]> wrote:
> >
> > > Benchmark : A ray-trace is performed 500 times on 17 separate
> > > scenes. Workload is distributed by tiling the framebuffer into N 32x32
> > > pixel tiles. Each CPU grabs one of N tiles out of the queue and
> > > repeats until no jobs are left. Rendering is to a shared framebuffer
> > > (obviously this causes problems with caching). Locking and
> > > synchronization is done using pthreads.
> > >
> > > Other details: The system is cleanly booted for each run. No I/O is
> > > performed during the timed portions of the test. The benchmark does
> > > however read a model file from the drive and build a data structure
> > > from it before each timed portion.
> > >
> > > On the 2.6.22 series of kernels results are pretty much the same. On
> > > 2.6.23 series kernels I see a loss in speed of ~2% across the board.
> > > On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
> > > 2.6.22 Kernels tested: 22.9(Ubuntu Stock Kernel), 22.14, 22.15
> > > 2.6.23 Kernels tested: 23.1, 23.3,  23.13
> > > 2.6.24 Kernels tested: 24-rc7
> > >
> > > I have my kernel compiled to use the SLAB allocator. All other
> > > tweaking options are set as defaults. My config files are available at
> > > http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring
> > > something wrong for the type of work I do?
> >
> > Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and double
> > the value of /proc/sys/kernel/sched_latency_ns - does that make any
> > difference? Please also run the following script while the ray-trace app
> > is running:
> >
> >   http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
> >
> > and send me the output of it, so that we can have an idea about what's
> > going on in your system during this workload.
> >
> > Ingo
> >
>


Re: Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-14 Thread Colin Fowler
Hi Ingo, thanks for the reply.

Doubling /proc/sys/kernel/sched_latency_ns may in fact have made things
slightly worse. I used 2.6.24-rc7.

Your script was only written to run for 15 seconds, so I ran it
multiple times so that it covered most of the benchmark.
Other issues with these data may be that for much of the benchmark I
am building data structures utilizing at most 1 to 3 cores. I'm not
concerned with these timings personally as this is considered the
offline part of the render. Once these data structures are built I
proceed to render across 8 cores. This is the section of the benchmark
I get my timings from (I use RDTSC before and after the render
segment). The majority of the overall time taken for a run is
therefore data structure building. I do not time this.

Colin.
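
[A sketch of the kind of RDTSC bracketing mentioned above; the rdtsc()
helper and the render_frame() placeholder are illustrative, not from this
thread, and on a multi-socket box the usual caveats about unsynchronised
or frequency-scaled TSCs apply:]

#include <stdint.h>
#include <stdio.h>

/* read the x86 time-stamp counter (returned in EDX:EAX) */
static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;

        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        uint64_t start, cycles;

        start = rdtsc();
        /* render_frame();  -- the timed render segment would go here */
        cycles = rdtsc() - start;

        printf("render segment: %llu cycles\n", (unsigned long long)cycles);
        return 0;
}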


On Jan 14, 2008 6:55 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> * Colin Fowler <[EMAIL PROTECTED]> wrote:
>
> > Benchmark : A ray-trace is performed 500 times on 17 separate
> > scenes. Workload is distributed by tiling the framebuffer into N 32x32
> > pixel tiles. Each CPU grabs one of N tiles out of the queue and
> > repeats until no jobs are left. Rendering is to a shared framebuffer
> > (obviously this causes problems with caching). Locking and
> > synchronization is done using pthreads.
> >
> > Other details: The system is cleanly booted for each run. No I/O is
> > performed during the timed portions of the test. The benchmark does
> > however read a model file from the drive and build a data structure
> > from it before each timed portion.
> >
> > On the 2.6.22 series of kernels results are pretty much the same. On
> > 2.6.23 series kernels I see a loss in speed of ~2% across the board.
> > On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
> > 2.6.22 Kernels tested: 22.9(Ubuntu Stock Kernel), 22.14, 22.15
> > 2.6.23 Kernels tested: 23.1, 23.3,  23.13
> > 2.6.24 Kernels tested: 24-rc7
> >
> > I have my kernel compiled to use the SLAB allocator. All other
> > tweaking options are set as defaults. My config files are available at
> > http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring
> > something wrong for the type of work I do?
>
> Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and double
> the value of /proc/sys/kernel/sched_latency_ns - does that make any
> difference? Please also run the following script while the ray-trace app
> is running:
>
>   http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
>
> and send me the output of it, so that we can have an idea about what's
> going on in your system during this workload.
>
> Ingo
>


Re: Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-14 Thread Colin Fowler
Forgot to add that the results are at http://vangogh.cs.tcd.ie/fowler/cfs/


On Jan 14, 2008 10:42 PM, Colin Fowler <[EMAIL PROTECTED]> wrote:
> Hi Ingo, thanks for the reply.
>
> Doubling /proc/sys/kernel/sched_latency_ns may in fact have made
> things slightly worse. I used 2.6.24-rc7.
>
> Your script was only written to run for 15 seconds, so I ran it
> multiple times so that it covered most of the benchmark.
> Other issues with these data may be that for much of the benchmark I
> am building data structures utilizing at most 1 to 3 cores. I'm not
> concerned with these timings personally as this is considered the
> offline part of the render. Once these data structures are built I
> proceed to render across 8 cores. This is the section of the benchmark
> I get my timings from (I use RDTSC before and after the render
> segment). The majority of the overall time taken for a run is
> therefore data structure building. I do not time this.
>
> Colin.
>
>
>
> On Jan 14, 2008 6:55 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >
> > * Colin Fowler <[EMAIL PROTECTED]> wrote:
> >
> > > Benchmark : A ray-trace is performed 500 times on 17 separate
> > > scenes. Workload is distributed by tiling the framebuffer into N 32x32
> > > pixel tiles. Each CPU grabs one of N tiles out of the queue and
> > > repeats until no jobs are left. Rendering is to a shared framebuffer
> > > (obviously this causes problems with caching). Locking and
> > > synchronization is done using pthreads.
> > >
> > > Other details: The system is cleanly booted for each run. No I/O is
> > > performed during the timed portions of the test. The benchmark does
> > > however read a model file from the drive and build a data structure
> > > from it before each timed portion.
> > >
> > > On the 2.6.22 series of kernels results are pretty much the same. On
> > > 2.6.23 series kernels I see a loss in speed of ~2% across the board.
> > > On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
> > > 2.6.22 Kernels tested: 22.9(Ubuntu Stock Kernel), 22.14, 22.15
> > > 2.6.23 Kernels tested: 23.1, 23.3,  23.13
> > > 2.6.24 Kernels tested: 24-rc7
> > >
> > > I have my kernel compiled to use the SLAB allocator. All other
> > > tweaking options are set as defaults. My config files are available at
> > > http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring
> > > something wrong for the type of work I do?
> >
> > Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and double
> > the value of /proc/sys/kernel/sched_latency_ns - does that make any
> > difference? Please also run the following script while the ray-trace app
> > is running:
> >
> >   http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
> >
> > and send me the output of it, so that we can have an idea about what's
> > going on in your system during this workload.
> >
> > Ingo
> >
>


Re: Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-14 Thread Ingo Molnar

* Colin Fowler <[EMAIL PROTECTED]> wrote:

> Benchmark : A ray-trace is performed 500 times on 17 separate 
> scenes. Workload is distributed by tiling the framebuffer into N 32x32 
> pixel tiles. Each CPU grabs one of N tiles out of the queue and 
> repeats until no jobs are left. Rendering is to a shared framebuffer 
> (obviously this causes problems with caching). Locking and 
> synchronization is done using pthreads.
> 
> Other details: The system is cleanly booted for each run. No I/O is 
> performed during the timed portions of the test. The benchmark does 
> however read a model file from the drive and build a data structure 
> from it before each timed portion.
> 
> On the 2.6.22 series of kernels results are pretty much the same. On
> 2.6.23 series kernels I see a loss in speed of ~2% across the board.
> On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
> 2.6.22 Kernels tested: 22.9(Ubuntu Stock Kernel), 22.14, 22.15
> 2.6.23 Kernels tested: 23.1, 23.3,  23.13
> 2.6.24 Kernels tested: 24-rc7
> 
> I have my kernel compiled to use the SLAB allocator. All other 
> tweaking options are set as defaults. My config files are available at 
> http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring 
> something wrong for the type of work I do?

Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and double 
the value of /proc/sys/kernel/sched_latency_ns - does that make any 
difference? Please also run the following script while the ray-trace app 
is running:

  http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

and send me the output of it, so that we can have an idea about what's 
going on in your system during this workload.

Ingo
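
[The sched_latency_ns doubling is normally a one-line "echo" as root; done
programmatically it is just a read-modify-write of the sysctl. A sketch,
assuming the file exists (CONFIG_SCHED_DEBUG=y) and the program runs as
root:]

#include <stdio.h>

int main(void)
{
        const char *path = "/proc/sys/kernel/sched_latency_ns";
        unsigned long long latency;
        FILE *f = fopen(path, "r");

        if (!f || fscanf(f, "%llu", &latency) != 1) {
                perror(path);
                return 1;
        }
        fclose(f);

        f = fopen(path, "w");                    /* needs root */
        if (!f || fprintf(f, "%llu\n", latency * 2) < 0) {
                perror(path);
                return 1;
        }
        fclose(f);
        printf("sched_latency_ns: %llu -> %llu\n", latency, latency * 2);
        return 0;
}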


Performance loss 2.6.22->22.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

2008-01-14 Thread Colin Fowler
Please CC me as I'm not subscribed.

I have (what is to me) a strange and very repeatable slowdown for a
CPU intensive benchmark on my system  on newer kernels.

Hardware : Dell Precision 470.
CPU 2x2.0GHz Quad Core Xeon E5335 CPUs
Memory 4GB ECC RAM.
OS Ubuntu x86_64 7.10 (Gutsy Gibbon)
Compiler : gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)

Benchmark : A ray-trace is performed 500 times on 17 separate
scenes. Workload is distributed by tiling the framebuffer into N 32x32
pixel tiles. Each CPU grabs one of N tiles out of the queue and
repeats until no jobs are left. Rendering is to a shared framebuffer
(obviously this causes problems with caching). Locking and
synchronization is done using pthreads.

Other details: The system is cleanly booted for each run. No I/O is
performed during the timed portions of the test. The benchmark does
however read a model file from the drive and build a data structure
from it before each timed portion.


On the 2.6.22 series of kernels results are pretty much the same. On
2.6.23 series kernels I see a loss in speed of ~2% across the board.
On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
2.6.22 Kernels tested: 22.9(Ubuntu Stock Kernel), 22.14, 22.15
2.6.23 Kernels tested: 23.1, 23.3,  23.13
2.6.24 Kernels tested: 24-rc7

I have my kernel compiled to use the SLAB allocator. All other
tweaking options are set as defaults. My config files are available at
http://vangogh.cs.tcd.ie/fowler/configs  . Perhaps I'm configuring
something wrong for the type of work I do?

regards,
Colin

