[patch 17/24] Immediate Values - Documentation
Changelog:
- Remove imv_set_early (removed from API).
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Rusty Russell <[EMAIL PROTECTED]>
---
 Documentation/immediate.txt | 221
 1 file changed, 221 insertions(+)

Index: linux-2.6-lttng/Documentation/immediate.txt
===
--- /dev/null 1970-01-01 00:00:00.0 +
+++ linux-2.6-lttng/Documentation/immediate.txt 2007-11-03 20:28:58.0 -0400
@@ -0,0 +1,221 @@
+                  Using the Immediate Values
+
+                      Mathieu Desnoyers
+
+
+This document introduces Immediate Values and their use.
+
+
+* Purpose of immediate values
+
+Immediate values are used to compile into the kernel variables that sit within
+the instruction stream. They are meant to be rarely updated but read often.
+Using immediate values for these variables saves cache lines.
+
+This infrastructure specializes in dynamically patching the values in the
+instruction stream while multiple CPUs are running, without disturbing normal
+system behavior.
+
+Code meant to be rarely enabled at runtime can be compiled in by using
+if (unlikely(imv_read(var))) as the condition surrounding the code. The
+smallest data type required for the test (an 8-bit char) is preferred, since
+some architectures, such as powerpc, only allow immediate values of up to
+16 bits.
+
+
+* Usage
+
+In order to use the "immediate" macros, you should include linux/immediate.h.
+
+#include <linux/immediate.h>
+
+DEFINE_IMV(char, this_immediate);
+EXPORT_IMV_SYMBOL(this_immediate);
+
+
+And use, in the body of a function:
+
+Use imv_set(this_immediate) to set the immediate value.
+
+Use imv_read(this_immediate) to read the immediate value.
+
+The immediate mechanism supports inserting multiple instances of the same
+immediate. Immediate values can be put in inline functions, inlined static
+functions, and unrolled loops.
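The usage described above can be sketched in plain C. This is only a userspace illustration under stated assumptions: the macro bodies below are stand-ins that back the immediate with an ordinary global variable, the two-argument form of imv_set() is assumed because the text does not show how the new value is passed, and trace_event(), imv_demo() and event_count are hypothetical names. None of this is the kernel's own definition.

```c
#include <assert.h>

/*
 * Userspace stand-ins (assumptions for illustration only): the immediate
 * is backed by a plain global, so a read is an ordinary memory load here
 * rather than a value patched into the instruction stream.
 */
#define DEFINE_IMV(type, name)  type name##__imv
#define imv_read(name)          (name##__imv)
#define imv_set(name, value)    ((void)(name##__imv = (value)))
#define unlikely(x)             (!!(x))   /* stand-in for the kernel hint */

DEFINE_IMV(char, this_immediate);

static unsigned long event_count;        /* hypothetical counter */

/* Rarely enabled code, guarded by the immediate value. */
void trace_event(void)
{
	if (unlikely(imv_read(this_immediate)))
		event_count++;
}

/* Toggle the immediate and return how many events were counted. */
unsigned long imv_demo(void)
{
	event_count = 0;

	imv_set(this_immediate, 0);
	trace_event();                   /* disabled: not counted */
	assert(event_count == 0);

	imv_set(this_immediate, 1);
	trace_event();                   /* enabled: counted */
	trace_event();                   /* enabled: counted */

	imv_set(this_immediate, 0);
	trace_event();                   /* disabled again: not counted */

	return event_count;
}
```

Calling imv_demo() exercises both states of the guard. In the kernel the same if (unlikely(imv_read(...))) pattern would test a value carried in the instruction stream, which is where the cache-line saving comes from.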
+
+If you have to read the immediate values from a function declared as __init or
+__exit, you should explicitly use _imv_read(), which will fall back on a
+global variable read. Failing to do so will leave a reference to the __init
+section after it is freed (it would generate a modpost warning).
+
+You can choose to set an initial static value to the immediate by using, for
+instance:
+
+DEFINE_IMV(long, myptr) = 10;
+
+
+* Optimization for a given architecture
+
+One can implement optimized immediate values for a given architecture by
+replacing asm-$ARCH/immediate.h.
+
+
+* Performance improvement
+
+
+  * Memory hit for a data-based branch
+
+Here are the results on a 3GHz Pentium 4:
+
+number of tests: 100
+number of branches per test: 10
+memory hit cycles per iteration (mean): 636.611
+L1 cache hit cycles per iteration (mean): 89.6413
+instruction stream based test, cycles per iteration (mean): 85.3438
+Just getting the pointer from a modulo on a pseudo-random value, doing
+  nothing with it, cycles per iteration (mean): 77.5044
+
+So:
+Base case:                      77.50 cycles
+instruction stream based test:  +7.8394 cycles
+L1 cache hit based test:        +12.1369 cycles
+Memory load based test:         +559.1066 cycles
+
+So let's say we have a ping flood coming at
+(14014 packets transmitted, 14014 received, 0% packet loss, time 1826ms)
+7674 packets per second. If we put 2 markers for irq entry/exit, it
+brings us to 15348 marker sites executed per second.
+
+(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029
+We therefore have a 0.29% slowdown just on this case.
+
+Compared to this, the instruction stream based test will cause a
+slowdown of:
+
+(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
+For a 0.004% slowdown.
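The two slowdown estimates above follow the same formula, which can be written as a small C helper. This is a sketch for checking the arithmetic only; the function names are hypothetical, and the inputs are the figures quoted in the text (15348 marker sites per second, a 3 GHz clock, and the measured extra cycles per site).

```c
#include <assert.h>

/*
 * Fraction of CPU time spent on marker sites:
 * (executions per second) * (extra cycles per execution) / (cycles per second)
 */
double slowdown_fraction(double exec_per_sec, double cycles_per_exec,
			 double cpu_hz)
{
	assert(cpu_hz > 0.0);
	return exec_per_sec * cycles_per_exec / cpu_hz;
}

/* Memory load based test: about 0.0029, i.e. a 0.29% slowdown. */
double memory_load_slowdown(void)
{
	return slowdown_fraction(15348.0, 559.0, 3e9);
}

/* Instruction stream based test: about 0.00004, i.e. a 0.004% slowdown. */
double instruction_stream_slowdown(void)
{
	return slowdown_fraction(15348.0, 7.84, 3e9);
}
```

Multiplying the result by 100 gives the percentage figures quoted in the text.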
+
+If we plan to use this for memory allocation, spinlock, and all sorts of
+very high event rate tracing, we can assume it will execute 10 to 100
+times more sites per second, which brings us to up to 0.4% slowdown with
+the instruction stream based test compared to up to 29% slowdown with the
+memory load based test on a system with high memory pressure.
+
+
+
+  * Markers impact under heavy memory load
+
+These results come from running a kernel with my LTTng instrumentation set,
+in a test that generates memory pressure (from userspace) by thrashing L1
+and L2 caches between calls to getppid() (note: syscall_trace is active and
+calls a marker upon syscall entry and syscall exit; markers are disarmed).
+This test is done in user-space, so there are some delays due to incoming
+IRQs and to the scheduler. (UP 2.6.22-rc6-mm1 kernel, task with -20
+nice level)
+
+My first set of results, linear cache thrashing, turned out not to be
+very interesting, because it seems like the linearity of the memset on a
+full array is somehow detected and it does not "really" thrash the
+caches.
+
+Now the most interesting result: Random walk L1 and L2 thrashing
+surrounding a getppid() call.
+
+- Markers compiled out (but