Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64

2007-11-19 Thread David Miller
From: Andi Kleen <[EMAIL PROTECTED]>
Date: Tue, 20 Nov 2007 04:25:34 +0100

> 
> > Although we have a per-cpu area base in a fixed global register
> > for addressing, the above isn't beneficial on sparc64 because
> > the atomic is much slower than doing a:
> >
> > local_irq_disable();
> > nonatomic_percpu_memory_op();
> > local_irq_enable();
> 
> Again, this might be pointing out the obvious, but you of course need
> save_flags()/restore_flags(), not disable()/enable().

Right, but on sparc64 the cost is the same either way, unlike
x86 et al.


Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64

2007-11-19 Thread Christoph Lameter
On Tue, 20 Nov 2007, Andi Kleen wrote:

> 
> > Although we have a per-cpu area base in a fixed global register
> > for addressing, the above isn't beneficial on sparc64 because
> > the atomic is much slower than doing a:
> >
> > local_irq_disable();
> > nonatomic_percpu_memory_op();
> > local_irq_enable();
> 
> Again, this might be pointing out the obvious, but you of course need
> save_flags()/restore_flags(), not disable()/enable().
> 
> If it were just disable/enable, x86 could do it much faster too, and
> Christoph probably would never have felt the need to approach this
> project for his SLUB fast path.

I no longer have any need for that with the material now in Andrew's
tree. However, this cuts out another 6 cycles from the fast path, and I
found that the same principles reduce overhead all over the kernel.




Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64

2007-11-19 Thread Andi Kleen

> Although we have a per-cpu area base in a fixed global register
> for addressing, the above isn't beneficial on sparc64 because
> the atomic is much slower than doing a:
>
>   local_irq_disable();
>   nonatomic_percpu_memory_op();
>   local_irq_enable();

Again, this might be pointing out the obvious, but you of course need
save_flags()/restore_flags(), not disable()/enable().
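
(As an aside, a minimal sketch of the flags-saving form being described
here; "counter" stands in for a hypothetical per-cpu variable, this is
not actual kernel code:)

        unsigned long flags;

        local_irq_save(flags);          /* remember the current IRQ state */
        __get_cpu_var(counter)++;       /* plain, non-atomic per-cpu update */
        local_irq_restore(flags);       /* restore rather than blindly enable */

A bare local_irq_enable() at the end would wrongly re-enable interrupts
when the caller already had them disabled.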

If it were just disable/enable, x86 could do it much faster too, and
Christoph probably would never have felt the need to approach this
project for his SLUB fast path.

-Andi


Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64

2007-11-19 Thread Christoph Lameter
On Mon, 19 Nov 2007, David Miller wrote:

> From: Christoph Lameter <[EMAIL PROTECTED]>
> Date: Mon, 19 Nov 2007 17:59:33 -0800 (PST)
> 
> > In that case the generic fallbacks can just provide what you already
> > have.
> 
> I understand, I was just letting you know why we probably won't
> take advantage of this new stuff :-)

On the other hand: the pointer array removal and the allocation density
improvements of cpu_alloc should also help sparc by increasing the
information density in cachelines and thus the overall speed of your
64p box.
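
(Schematically, a hedged illustration of what the pointer array removal
refers to; old_style()/new_style() are made-up names and the struct is a
simplification of the existing allocpercpu bookkeeping, not the actual
cpu_alloc code:)

        /* Before: each per-cpu object carries its own array of per-cpu
         * pointers, so every access is two dependent loads. */
        struct percpu_data {
                void *ptrs[NR_CPUS];            /* one pointer per cpu */
        };

        void *old_style(struct percpu_data *pdata, int cpu)
        {
                return pdata->ptrs[cpu];        /* fetch pointer, then the data */
        }

        /* After (cpu_alloc style): objects are packed into one per-cpu
         * area and reached by adding a small offset to that cpu's base;
         * no per-object pointer array at all. */
        void *new_style(void *per_cpu_base[], unsigned long offset, int cpu)
        {
                return (char *)per_cpu_base[cpu] + offset;
        }

Packing many objects into the same per-cpu area is also where the
cacheline density win comes from.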



Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64

2007-11-19 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Mon, 19 Nov 2007 17:59:33 -0800 (PST)

> In that case the generic fallbacks can just provide what you already
> have.

I understand, I was just letting you know why we probably won't
take advantage of this new stuff :-)


Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64

2007-11-19 Thread Christoph Lameter
On Mon, 19 Nov 2007, David Miller wrote:

> Although we have a per-cpu area base in a fixed global register
> for addressing, the above isn't beneficial on sparc64 because
> the atomic is much slower than doing a:
> 
>   local_irq_disable();
>   nonatomic_percpu_memory_op();
>   local_irq_enable();
> 
> local_irq_{disable,enable}() together is about 18 cycles.
> Just the cmpxchg() part of the atomic sequence is at least
> 32 cycles and requires a loop:
> 
>   while (1) {
>   x = ld();
>   if (cmpxchg(x, op(x)))
>   break;
>   }
> 
> which bloats up the atomic version even more.

In that case the generic fallbacks can just provide what you already have.



Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64

2007-11-19 Thread David Miller
From: [EMAIL PROTECTED]
Date: Mon, 19 Nov 2007 17:11:32 -0800

> Before:
> 
> mov    %gs:0x8,%rdx           Get smp_processor_id
> mov    tableoffset,%rax       Get table base
> incq   varoffset(%rax,%rdx,1) Perform the operation with a complex lookup
>                               adding the var offset
> 
> An interrupt or a reschedule action can move the execution thread to another
> processor if interrupts or preemption are not disabled. Then the variable of
> the wrong processor may be updated in a racy way.
> 
> After:
> 
> incq   %gs:varoffset(%rip)
> 
> A single instruction that is safe against interrupts and migration of the
> execution thread. It will reliably operate on the current processor's data area.
> 
> Other platforms can also perform address relocation plus atomic ops on
> a memory location. Exploiting the atomicity of instructions vs. interrupts
> is therefore possible and will reduce the cpu op processing overhead.
> 
> E.g. on IA64 we have a per-cpu virtual mapping of the per-cpu area. If
> we add an offset to a per-cpu variable's address then we can guarantee
> that we always hit the per-cpu area local to the processor.
> 
> Other platforms (SPARC?) have registers that can be used to form addresses.
> If the cpu area address is in one of those then atomic per cpu modifications
> can be generated for those platforms in the same way.

Although we have a per-cpu area base in a fixed global register
for addressing, the above isn't beneficial on sparc64 because
the atomic is much slower than doing a:

local_irq_disable();
nonatomic_percpu_memory_op();
local_irq_enable();

local_irq_{disable,enable}() together is about 18 cycles.
Just the cmpxchg() part of the atomic sequence is at least
32 cycles and requires a loop:

	while (1) {
		x = ld();
		if (cmpxchg(x, op(x)))
			break;
	}

which bloats up the atomic version even more.
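
Spelled out, a minimal sketch of the two update styles being compared
("nr_events" is a hypothetical per-cpu counter; illustrative only, not
actual kernel code):

        /* 1. Interrupt disable plus a plain memory op: about 18 cycles
         *    of IRQ state manipulation on sparc64. */
        unsigned long flags;

        local_irq_save(flags);
        __get_cpu_var(nr_events)++;
        local_irq_restore(flags);

        /* 2. Atomic update via cmpxchg: at least 32 cycles per attempt
         *    on sparc64, plus the retry loop. */
        unsigned long *p = &__get_cpu_var(nr_events);
        unsigned long old;

        do {
                old = *p;
        } while (cmpxchg(p, old, old + 1) != old);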


Re: [rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64

2007-11-19 Thread Christoph Lameter
Duh, a patch did not make it due to the XXX in the header.

I will also put this on the git.kernel.org slab tree, branch cpudata.


x86_64: Strip down PDA operations through the use of CPU_XXX operations.

The *_pda operations behave in the same way as the CPU_XXX ops: both access
data relative to a segment register. So strip out as much as we can.

What is left after this patchset are a few special pda ops for x86_64:

or_pda()
test_and_clear_bit_pda()

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/asm-x86/current_64.h     |    2 +-
 include/asm-x86/pda.h            |   34 --
 include/asm-x86/thread_info_64.h |    2 ++
 3 files changed, 7 insertions(+), 31 deletions(-)

Index: linux-2.6/include/asm-x86/pda.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pda.h  2007-11-19 16:17:22.325639807 -0800
+++ linux-2.6/include/asm-x86/pda.h  2007-11-19 16:24:13.569640223 -0800
@@ -81,36 +81,10 @@ extern struct x8664_pda _proxy_pda;
}   \
} while (0)
 
-#define pda_from_op(op,field) ({   \
-   typeof(_proxy_pda.field) ret__; \
-   switch (sizeof(_proxy_pda.field)) { \
-   case 2: \
-   asm(op "w %%gs:%c1,%0" :\
-   "=r" (ret__) :  \
-   "i" (pda_offset(field)),\
-   "m" (_proxy_pda.field));\
-break; \
-   case 4: \
-   asm(op "l %%gs:%c1,%0": \
-   "=r" (ret__):   \
-   "i" (pda_offset(field)),\
-   "m" (_proxy_pda.field));\
-break; \
-   case 8: \
-   asm(op "q %%gs:%c1,%0": \
-   "=r" (ret__) :  \
-   "i" (pda_offset(field)),\
-   "m" (_proxy_pda.field));\
-break; \
-   default:\
-   __bad_pda_field();  \
-   }   \
-   ret__; })
-
-#define read_pda(field) pda_from_op("mov",field)
-#define write_pda(field,val) pda_to_op("mov",field,val)
-#define add_pda(field,val) pda_to_op("add",field,val)
-#define sub_pda(field,val) pda_to_op("sub",field,val)
+#define read_pda(field) CPU_READ(per_cpu_var(pda).field)
+#define write_pda(field,val) CPU_WRITE(per_cpu_var(pda).field, val)
+#define add_pda(field,val) CPU_ADD(per_cpu_var(pda).field, val)
+#define sub_pda(field,val) CPU_SUB(per_cpu_var(pda).field, val)
 #define or_pda(field,val) pda_to_op("or",field,val)
 
 /* This is not atomic against other CPUs -- CPU preemption needs to be off */
Index: linux-2.6/include/asm-x86/current_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/current_64.h  2007-11-19 15:45:03.470390243 -0800
+++ linux-2.6/include/asm-x86/current_64.h  2007-11-19 16:24:13.569640223 -0800
@@ -4,7 +4,7 @@
 #if !defined(__ASSEMBLY__) 
 struct task_struct;
 
-#include <asm/pda.h>
+#include <asm/percpu.h>
 
 static inline struct task_struct *get_current(void) 
 { 
Index: linux-2.6/include/asm-x86/thread_info_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/thread_info_64.h  2007-11-19 15:45:03.482390495 -0800
+++ linux-2.6/include/asm-x86/thread_info_64.h  2007-11-19 16:24:13.569640223 -0800
@@ -41,6 +41,8 @@ struct thread_info {
  * preempt_count needs to be 1 initially, until the scheduler is functional.
  */
 #ifndef __ASSEMBLY__
+#include <asm/percpu_64.h>
+
 #define INIT_THREAD_INFO(tsk)  \
 {  \
	.task  = &tsk,  \
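
For context, a short usage sketch: callers keep exactly the same form,
only the implementation behind the macros moves to the generic CPU ops.
(pcurrent, kernelstack and irqcount are existing x86_64 pda fields; "v"
is just a placeholder value.)

        struct task_struct *t = read_pda(pcurrent);  /* now CPU_READ(per_cpu_var(pda).pcurrent) */

        write_pda(kernelstack, v);      /* now CPU_WRITE(...) */
        add_pda(irqcount, 1);           /* now CPU_ADD(..., 1) */
        sub_pda(irqcount, 1);           /* now CPU_SUB(..., 1) */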


