Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64
From: Andi Kleen <[EMAIL PROTECTED]>
Date: Tue, 20 Nov 2007 04:25:34 +0100

> > Although we have a per-cpu area base in a fixed global register
> > for addressing, the above isn't beneficial on sparc64 because
> > the atomic is much slower than doing a:
> >
> > 	local_irq_disable();
> > 	nonatomic_percpu_memory_op();
> > 	local_irq_enable();
>
> Again, might be pointing out the obvious, but you of course need
> save_flags()/restore_flags(), not disable/enable().

Right, but on sparc64 the cost is the same for that, unlike on x86 et al.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64
On Tue, 20 Nov 2007, Andi Kleen wrote:

> > Although we have a per-cpu area base in a fixed global register
> > for addressing, the above isn't beneficial on sparc64 because
> > the atomic is much slower than doing a:
> >
> > 	local_irq_disable();
> > 	nonatomic_percpu_memory_op();
> > 	local_irq_enable();
>
> Again, might be pointing out the obvious, but you of course need
> save_flags()/restore_flags(), not disable/enable().
>
> If it was just disable/enable, x86 could do it much faster too and
> Christoph probably would never have felt the need to approach this
> project for his SLUB fast path.

I already have no need for that anymore with the material now in Andrew's
tree. However, this cuts out another 6 cycles from the fastpath, and I
found that the same principles reduce overhead all over the kernel.
Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64
> Although we have a per-cpu area base in a fixed global register
> for addressing, the above isn't beneficial on sparc64 because
> the atomic is much slower than doing a:
>
> 	local_irq_disable();
> 	nonatomic_percpu_memory_op();
> 	local_irq_enable();

Again, might be pointing out the obvious, but you of course need
save_flags()/restore_flags(), not disable/enable().

If it was just disable/enable, x86 could do it much faster too and
Christoph probably would never have felt the need to approach this
project for his SLUB fast path.

-Andi
Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64
On Mon, 19 Nov 2007, David Miller wrote:

> From: Christoph Lameter <[EMAIL PROTECTED]>
> Date: Mon, 19 Nov 2007 17:59:33 -0800 (PST)
>
> > In that case the generic fallbacks can just provide what you already
> > have.
>
> I understand, I was just letting you know why we probably won't
> take advantage of this new stuff :-)

On the other hand: the pointer array removal and the allocation density
improvements of cpu_alloc should also help sparc to increase the
information density in cachelines and thus increase the overall speed of
your 64p box.
Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Mon, 19 Nov 2007 17:59:33 -0800 (PST)

> In that case the generic fallbacks can just provide what you already
> have.

I understand, I was just letting you know why we probably won't
take advantage of this new stuff :-)
Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64
On Mon, 19 Nov 2007, David Miller wrote:

> Although we have a per-cpu area base in a fixed global register
> for addressing, the above isn't beneficial on sparc64 because
> the atomic is much slower than doing a:
>
> 	local_irq_disable();
> 	nonatomic_percpu_memory_op();
> 	local_irq_enable();
>
> local_irq_{disable,enable}() together is about 18 cycles.
> Just the cmpxchg() part of the atomic sequence is at least
> 32 cycles and requires a loop:
>
> 	while (1) {
> 		x = ld();
> 		if (cmpxchg(x, op(x)))
> 			break;
> 	}
>
> which bloats up the atomic version even more.

In that case the generic fallbacks can just provide what you already
have.
Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64
From: [EMAIL PROTECTED]
Date: Mon, 19 Nov 2007 17:11:32 -0800

> Before:
>
> 	mov    %gs:0x8,%rdx		Get smp_processor_id
> 	mov    tableoffset,%rax		Get table base
> 	incq   varoffset(%rax,%rdx,1)	Perform the operation with a complex
> 					lookup adding the var offset
>
> An interrupt or a reschedule action can move the execution thread to
> another processor if interrupt or preempt is not disabled. Then the
> variable of the wrong processor may be updated in a racy way.
>
> After:
>
> 	incq   %gs:varoffset(%rip)
>
> A single instruction that is safe from interrupts or moving of the
> execution thread. It will reliably operate on the current processor's
> data area.
>
> Other platforms can also perform address relocation plus atomic ops on
> a memory location. Exploiting the atomicity of instructions vs.
> interrupts is therefore possible and will reduce the cpu op processing
> overhead.
>
> F.e. on IA64 we have a per cpu virtual mapping of the per cpu area. If
> we add an offset to the per cpu area variable address then we can
> guarantee that we always hit the per cpu area local to a processor.
>
> Other platforms (SPARC?) have registers that can be used to form
> addresses. If the cpu area address is in one of those then atomic per
> cpu modifications can be generated for those platforms in the same way.

Although we have a per-cpu area base in a fixed global register
for addressing, the above isn't beneficial on sparc64 because
the atomic is much slower than doing a:

	local_irq_disable();
	nonatomic_percpu_memory_op();
	local_irq_enable();

local_irq_{disable,enable}() together is about 18 cycles.
Just the cmpxchg() part of the atomic sequence is at least
32 cycles and requires a loop:

	while (1) {
		x = ld();
		if (cmpxchg(x, op(x)))
			break;
	}

which bloats up the atomic version even more.
Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64
Duh, a patch did not make it due to XXX in the header. I will put this also
on the git.kernel.org slab tree, branch cpudata.

x86_64: Strip down PDA operations through the use of CPU_XXX operations.

The *_pda operations behave in the same way as the CPU_XXX ops. They both
access data that is relative to a segment register. So strip out as much
as we can.

What is left after this patchset are some special pda ops for x86_64:

	or_pda()
	test_and_clear_bit_pda()

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/asm-x86/current_64.h     |    2 +-
 include/asm-x86/pda.h            |   34 --------------------------------
 include/asm-x86/thread_info_64.h |    2 ++
 3 files changed, 7 insertions(+), 31 deletions(-)

Index: linux-2.6/include/asm-x86/pda.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pda.h	2007-11-19 16:17:22.325639807 -0800
+++ linux-2.6/include/asm-x86/pda.h	2007-11-19 16:24:13.569640223 -0800
@@ -81,36 +81,10 @@ extern struct x8664_pda _proxy_pda;
 		}						\
 	} while (0)
 
-#define pda_from_op(op,field) ({				\
-	typeof(_proxy_pda.field) ret__;				\
-	switch (sizeof(_proxy_pda.field)) {			\
-	case 2:							\
-		asm(op "w %%gs:%c1,%0" :			\
-		    "=r" (ret__) :				\
-		    "i" (pda_offset(field)),			\
-		    "m" (_proxy_pda.field));			\
-		break;						\
-	case 4:							\
-		asm(op "l %%gs:%c1,%0" :			\
-		    "=r" (ret__) :				\
-		    "i" (pda_offset(field)),			\
-		    "m" (_proxy_pda.field));			\
-		break;						\
-	case 8:							\
-		asm(op "q %%gs:%c1,%0" :			\
-		    "=r" (ret__) :				\
-		    "i" (pda_offset(field)),			\
-		    "m" (_proxy_pda.field));			\
-		break;						\
-	default:						\
-		__bad_pda_field();				\
-	}							\
-	ret__; })
-
-#define read_pda(field) pda_from_op("mov",field)
-#define write_pda(field,val) pda_to_op("mov",field,val)
-#define add_pda(field,val) pda_to_op("add",field,val)
-#define sub_pda(field,val) pda_to_op("sub",field,val)
+#define read_pda(field) CPU_READ(per_cpu_var(pda).field)
+#define write_pda(field,val) CPU_WRITE(per_cpu_var(pda).field, val)
+#define add_pda(field,val) CPU_ADD(per_cpu_var(pda).field, val)
+#define sub_pda(field,val) CPU_SUB(per_cpu_var(pda).field, val)
 
 #define or_pda(field,val) pda_to_op("or",field,val)
 
 /* This is not atomic against other CPUs -- CPU preemption needs to be off */

Index: linux-2.6/include/asm-x86/current_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/current_64.h	2007-11-19 15:45:03.470390243 -0800
+++ linux-2.6/include/asm-x86/current_64.h	2007-11-19 16:24:13.569640223 -0800
@@ -4,7 +4,7 @@
 #if !defined(__ASSEMBLY__)
 struct task_struct;
 
-#include <asm/pda.h>
+#include <asm/percpu.h>
 
 static inline struct task_struct *get_current(void)
 {

Index: linux-2.6/include/asm-x86/thread_info_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/thread_info_64.h	2007-11-19 15:45:03.482390495 -0800
+++ linux-2.6/include/asm-x86/thread_info_64.h	2007-11-19 16:24:13.569640223 -0800
@@ -41,6 +41,8 @@ struct thread_info {
  * preempt_count needs to be 1 initially, until the scheduler is functional.
  */
 #ifndef __ASSEMBLY__
+#include <asm/percpu_64.h>
+
 #define INIT_THREAD_INFO(tsk)				\
 {							\
 	.task		= &tsk,				\