Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-04-02 Thread Siddha, Suresh B
On Mon, Apr 02, 2007 at 05:23:20PM -0700, Christoph Lameter wrote:
> On Mon, 2 Apr 2007, Siddha, Suresh B wrote:
> 
> > Set the node_possible_map at runtime. On a non NUMA system,
> > num_possible_nodes() will now say '1'
> 
> How does this relate to nr_node_ids?

With this patch, nr_node_ids on non NUMA will also be '1', and,
as before, nr_node_ids is the same as num_possible_nodes().

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-04-02 Thread Christoph Lameter
On Mon, 2 Apr 2007, Siddha, Suresh B wrote:

> Set the node_possible_map at runtime. On a non NUMA system,
> num_possible_nodes() will now say '1'

How does this relate to nr_node_ids?



Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-04-02 Thread Siddha, Suresh B
On Fri, Mar 23, 2007 at 03:12:10PM +0100, Andi Kleen wrote:
> > But that is based on compile time option, isn't it? Perhaps I need
> > to use some other mechanism to find out the platform is not NUMA capable..
> 
> We can probably make it runtime on x86. That will be needed sooner or 
> later for correct NUMA hotplug support anyways.

How about this patch? Thanks.

---
From: Suresh Siddha <[EMAIL PROTECTED]>
[patch] x86_64: set node_possible_map at runtime.

Set the node_possible_map at runtime. On a non NUMA system,
num_possible_nodes() will now say '1'

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/arch/x86_64/mm/k8topology.c b/arch/x86_64/mm/k8topology.c
index b5b8dba..d6f4447 100644
--- a/arch/x86_64/mm/k8topology.c
+++ b/arch/x86_64/mm/k8topology.c
@@ -49,11 +49,8 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
int found = 0;
u32 reg;
unsigned numnodes;
-   nodemask_t nodes_parsed;
unsigned dualcore = 0;
 
-   nodes_clear(nodes_parsed);
-
if (!early_pci_allowed())
return -1;
 
@@ -102,7 +99,7 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
   nodeid, (base>>8)&3, (limit>>8) & 3); 
return -1; 
}   
-   if (node_isset(nodeid, nodes_parsed)) { 
+   if (node_isset(nodeid, node_possible_map)) { 
printk(KERN_INFO "Node %d already present. Skipping\n", 
   nodeid);
continue;
@@ -155,7 +152,7 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 
prevbase = base;
 
-   node_set(nodeid, nodes_parsed);
+   node_set(nodeid, node_possible_map);
} 
 
if (!found)
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 41b8fb0..5f7d4d8 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -383,6 +383,7 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
   i,
   nodes[i].start, nodes[i].end,
   (nodes[i].end - nodes[i].start) >> 20);
+   node_set(i, node_possible_map);
node_set_online(i);
}
memnode_shift = compute_hash_shift(nodes, numa_fake);
@@ -405,6 +406,8 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 { 
int i;
 
+   nodes_clear(node_possible_map);
+
 #ifdef CONFIG_NUMA_EMU
if (numa_fake && !numa_emulation(start_pfn, end_pfn))
return;
@@ -432,6 +435,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
memnodemap[0] = 0;
nodes_clear(node_online_map);
node_set_online(0);
+   node_set(0, node_possible_map);
for (i = 0; i < NR_CPUS; i++)
numa_set_node(i, 0);
node_to_cpumask[0] = cpumask_of_cpu(0);
diff --git a/arch/x86_64/mm/srat.c b/arch/x86_64/mm/srat.c
index 2efe215..9f26e2b 100644
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -25,7 +25,6 @@ int acpi_numa __initdata;
 
 static struct acpi_table_slit *acpi_slit;
 
-static nodemask_t nodes_parsed __initdata;
 static struct bootnode nodes[MAX_NUMNODES] __initdata;
 static struct bootnode nodes_add[MAX_NUMNODES];
 static int found_add_area __initdata;
@@ -43,7 +42,7 @@ static __init int setup_node(int pxm)
 static __init int conflicting_nodes(unsigned long start, unsigned long end)
 {
int i;
-   for_each_node_mask(i, nodes_parsed) {
+   for_each_node_mask(i, node_possible_map) {
struct bootnode *nd = &nodes[i];
if (nd->start == nd->end)
continue;
@@ -321,7 +320,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
}
nd = &nodes[node];
oldnode = *nd;
-   if (!node_test_and_set(node, nodes_parsed)) {
+   if (!node_test_and_set(node, node_possible_map)) {
nd->start = start;
nd->end = end;
} else {
@@ -344,7 +343,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
printk(KERN_NOTICE "SRAT: Hotplug region ignored\n");
*nd = oldnode;
if ((nd->start | nd->end) == 0)
-   node_clear(node, nodes_parsed);
+   node_clear(node, node_possible_map);
}
 }
 
@@ -356,7 +355,7 @@ static int nodes_cover_memory(void)
unsigned long pxmram, e820ram;
 
pxmram = 0;
-   for_each_node_mask(i, nodes_parsed) {
+   for_each_node_mask(i, node_possible_map) {
unsigned long s = nodes[i].start >> PAGE_SHIFT;
unsigned long e = nodes[i].end >> PAGE_SHIFT;
pxmram += e - s;
@@ -380,7 +379,7 @@ static int nodes_cover_memory(void)
 static void unparse_node(int node)
 {
int i;


Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-03-23 Thread Andi Kleen
On Thu, Mar 22, 2007 at 06:25:16PM -0700, Christoph Lameter wrote:
> On Thu, 22 Mar 2007, Siddha, Suresh B wrote:
> 
> > > You should check num_possible_nodes(), or nr_node_ids (this one is 
> > > cheaper, 
> > > its a variable instead of a function call)
> > 
> > But that is based on compile time option, isn't it? Perhaps I need
> > to use some other mechanism to find out the platform is not NUMA capable..
> 
> No, it's runtime.

I don't see any code that would ever change the mask from the compile
default. But that is easy to fix.

-Andi



Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-03-23 Thread Andi Kleen
> But that is based on compile time option, isn't it? Perhaps I need
> to use some other mechanism to find out the platform is not NUMA capable..

We can probably make it runtime on x86. That will be needed sooner or 
later for correct NUMA hotplug support anyways.

-Andi


Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-03-22 Thread Christoph Lameter
On Thu, 22 Mar 2007, Siddha, Suresh B wrote:

> > You should check num_possible_nodes(), or nr_node_ids (this one is cheaper, 
> > its a variable instead of a function call)
> 
> But that is based on compile time option, isn't it? Perhaps I need
> to use some other mechanism to find out the platform is not NUMA capable..

No, it's runtime.



Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-03-22 Thread Eric Dumazet

Siddha, Suresh B wrote:

On Thu, Mar 22, 2007 at 11:12:39PM +0100, Eric Dumazet wrote:

Siddha, Suresh B wrote:

+   if (num_online_nodes() == 1)
+   use_alien_caches = 0;
+

Unfortunately this part is wrong.


oops.

You should check num_possible_nodes(), or nr_node_ids (this one is cheaper, 
its a variable instead of a function call)


But that is based on compile time option, isn't it? Perhaps I need
to use some other mechanism to find out the platform is not NUMA capable..


nr_node_ids is defined to 1 if you compile a non NUMA kernel.

If CONFIG_NUMA is on, then nr_node_ids is a variable that is filled with the 
maximum node id of the possible nodes (+1). If your machine is not CPU hot-plug 
capable, and you have say one node (one dual-core processor, for example), 
then nr_node_ids will be set to 1

(see mm/page_alloc.c function setup_nr_node_ids() )

So this is OK for your need...



Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-03-22 Thread Siddha, Suresh B
On Thu, Mar 22, 2007 at 11:12:39PM +0100, Eric Dumazet wrote:
> Siddha, Suresh B wrote:
> >+if (num_online_nodes() == 1)
> >+use_alien_caches = 0;
> >+
> 
> Unfortunately this part is wrong.

oops.

> 
> You should check num_possible_nodes(), or nr_node_ids (this one is cheaper, 
> its a variable instead of a function call)

But that is based on compile time option, isn't it? Perhaps I need
to use some other mechanism to find out the platform is not NUMA capable..

thanks,
suresh


Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-03-22 Thread Eric Dumazet

Siddha, Suresh B wrote:

Christoph,

While we are at this topic, recently I had reports that
cache_free_alien() is costly on non NUMA platforms too (similar
to the cache miss issues that Eric was referring to on NUMA)
and the appended patch seems to fix it for non NUMA, at least.

Appended patch gives a nice 1% perf improvement on non-NUMA platform
with database workload.

Please comment or Ack for mainline :)


I have one comment :)


@@ -1394,6 +1394,9 @@ void __init kmem_cache_init(void)
int order;
int node;
 
+	if (num_online_nodes() == 1)
+		use_alien_caches = 0;
+


Unfortunately this part is wrong.

You should check num_possible_nodes(), or nr_node_ids (this one is cheaper, 
its a variable instead of a function call)


I wonder if we could add a new SLAB_NUMA_BYPASS, so that we can declare some 
kmem_cache as non-NUMA aware (for example, I feel network skbs don't need the 
NUMA overhead)





Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-03-22 Thread Christoph Lameter
On Thu, 22 Mar 2007, Siddha, Suresh B wrote:

> @@ -1394,6 +1394,9 @@ void __init kmem_cache_init(void)
>   int order;
>   int node;
>  
> + if (num_online_nodes() == 1)
> + use_alien_caches = 0;
> +

What happens if you bring up a second node?



non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-03-22 Thread Siddha, Suresh B
Christoph,

While we are at this topic, recently I had reports that
cache_free_alien() is costly on non NUMA platforms too (similar
to the cache miss issues that Eric was referring to on NUMA)
and the appended patch seems to fix it for non NUMA, at least.

Appended patch gives a nice 1% perf improvement on non-NUMA platform
with database workload.

Please comment or Ack for mainline :)

thanks,
suresh
---

Subject: [patch] slab: skip cache_free_alien() on non NUMA
From: Suresh Siddha <[EMAIL PROTECTED]>

Set use_alien_caches to 0 on non NUMA platforms, and avoid calling
cache_free_alien() when use_alien_caches is not set. This avoids
the cache miss that happens while dereferencing slabp to get nodeid.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/mm/slab.c b/mm/slab.c
index 8fdaffa..146082d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1146,7 +1146,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 * Make sure we are not freeing a object from another node to the array
 * cache on this cpu.
 */
-   if (likely(slabp->nodeid == node) || unlikely(!use_alien_caches))
+   if (likely(slabp->nodeid == node))
return 0;
 
l3 = cachep->nodelists[node];
@@ -1394,6 +1394,9 @@ void __init kmem_cache_init(void)
int order;
int node;
 
+   if (num_online_nodes() == 1)
+   use_alien_caches = 0;
+
for (i = 0; i < NUM_INIT_LISTS; i++) {
kmem_list3_init(_list3[i]);
if (i < MAX_NUMNODES)
@@ -3563,7 +3566,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp)
check_irq_off();
objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
 
-   if (cache_free_alien(cachep, objp))
+   if (use_alien_caches && cache_free_alien(cachep, objp))
return;
 
if (likely(ac->avail < ac->limit)) {


Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-21 Thread Christoph Lameter
On Wed, 21 Mar 2007, Eric Dumazet wrote:

> If numa_node_id() is equal to the node of the page containing the first byte
> of the object, then object is on the local node. Or what ?

No. The slab (the page you are referring to) may have been allocated for 
another node and been tracked via the node structs of that other node. We 
were just falling back to the node that now appears to be local.




Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-21 Thread Eric Dumazet

Christoph Lameter wrote:

On Wed, 21 Mar 2007, Eric Dumazet wrote:


The fast path is to put the pointer, into the cpu array cache. This object
might be given back some cycles later, because of a kmem_cache_alloc() : No
need to access the two cache lines (struct page, struct slab)


If you do that then the slab will no longer return objects from the 
desired nodes. The assumption is that cpu array objects are from the local 
node.


Me confused.

How the following could be wrong ?

static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
{
int mynode = numa_node_id();
int objnode = virt_to_nid(objp); // or whatever

if (mynode == objnode)
return 0;
...
}

If numa_node_id() is equal to the node of the page containing the first byte 
of the object, then object is on the local node. Or what ?


Thank you


Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-21 Thread Christoph Lameter
On Wed, 21 Mar 2007, Eric Dumazet wrote:

> The fast path is to put the pointer, into the cpu array cache. This object
> might be given back some cycles later, because of a kmem_cache_alloc() : No
> need to access the two cache lines (struct page, struct slab)

If you do that then the slab will no longer return objects from the 
desired nodes. The assumption is that cpu array objects are from the local 
node.


Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-21 Thread Christoph Lameter
On Wed, 21 Mar 2007, Eric Dumazet wrote:

> If numa_node_id() is equal to the node of the page containing the first byte
> of the object, then object is on the local node. Or what ?

No. The slab (the page you are referring to) may have been allocated for 
another node and been tracked via the node structs of that other node. We 
were just falling back to the node that now appears to be local.
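Christoph's point can be made concrete with a toy model (illustrative C only, not kernel code; every name below is invented): a slab is *tracked* under the node it was requested for, while its backing page may come from a fallback node, so a page-based check and a slab-based check can disagree.

```c
#include <assert.h>

/* Toy model of the mismatch: nodeid is the alloc-time target node the
 * slab is tracked under; page_nid is where the page actually came from. */
struct toy_slab {
    int nodeid;
    int page_nid;
};

/* Pretend the page allocator fell back from the requested node. */
static struct toy_slab alloc_slab_for_node(int requested, int fallback)
{
    struct toy_slab s;
    s.nodeid = requested;   /* tracked under the requested node */
    s.page_nid = fallback;  /* memory really lives elsewhere */
    return s;
}

/* Eric's proposal: decide "alien or not" from the page's node. */
static int is_alien_by_page(const struct toy_slab *s, int cpu_node)
{
    return s->page_nid != cpu_node;
}

/* Current SLAB behaviour: decide from the tracking node. */
static int is_alien_by_slab(const struct toy_slab *s, int cpu_node)
{
    return s->nodeid != cpu_node;
}
```

With a fallback allocation the two checks give opposite answers, which is exactly the case Christoph is worried about.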




Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-20 Thread Eric Dumazet

Christoph Lameter a écrit :

> On Tue, 20 Mar 2007, Eric Dumazet wrote:
> 
> > I understand we want to do special things (fallback and such tricks) at
> > allocation time, but I believe that we can just trust the real nid of memory
> > at free time.
> 
> Sorry no. The node at allocation time determines which node specific
> structure tracks the slab. If we fall back then the node is allocated
> from one node but entered in the node structure of another. Thus you
> cannot free the slab without knowing the node at allocation time.


I think you don't understand my point.

When we enter kmem_cache_free(), we are not freeing a slab, but an object, 
knowing a pointer to this object.


The fast path is to put the pointer, into the cpu array cache. This object 
might be given back some cycles later, because of a kmem_cache_alloc() : No 
need to access the two cache lines (struct page, struct slab)


This fast path could be entered by checking the node of the page, which is 
faster, but possibly different from virt_to_slab(obj)->nodeid. Do we care? 
Definitely not: the node is guaranteed to be correct.


Then, if we must flush the cpu array cache because it is full, we *may* access 
the slabs of the objects we are flushing, and then check the 
virt_to_slab(obj)->nodeid to be able to do the correct thing.


Fortunately, flushing the cache array is not a frequent event, and the cost of 
accessing the cache lines (struct page, struct slab) can be amortized because 
several 'transferred or freed' objects might share them at this time.
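Eric's proposed fast path could be sketched roughly like this (a toy model, not the kernel implementation; all names and the array size are invented): compare the object's node cheaply, push local objects onto the per-cpu array, and only pay for slab metadata when the array is drained.

```c
#include <assert.h>

#define TOY_AC_SIZE 4

/* Minimal stand-in for SLAB's per-cpu array cache. */
struct toy_array_cache {
    int avail;
    void *entries[TOY_AC_SIZE];
};

static int toy_drains;  /* counts how often the slow path ran */

static void toy_drain(struct toy_array_cache *ac)
{
    /* Slow path: here the real code would walk each object's slab
     * and consult slabp->nodeid; we only count invocations. */
    toy_drains++;
    ac->avail = 0;
}

/* obj_node stands in for a cheap pfn_to_nid()/page_to_nid() lookup. */
static void toy_free(struct toy_array_cache *ac, void *obj,
                     int obj_node, int cpu_node)
{
    if (obj_node != cpu_node)
        return;                 /* alien object: handled elsewhere */
    if (ac->avail == TOY_AC_SIZE)
        toy_drain(ac);          /* infrequent flush amortizes the cost */
    ac->entries[ac->avail++] = obj;
}
```

The point of the sketch is only that the per-object fast path never touches struct slab; the expensive lookups are confined to the (rare) drain.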



Actually I had to disable NUMA on my platforms because it is just overkill and 
slower. Maybe it's something OK for very big machines, and not dual-node 
Opterons? Let me know so that I don't waste your time (and mine)



Thank you


Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-20 Thread Christoph Lameter
On Wed, 21 Mar 2007, Andi Kleen wrote:

> > We usually use page_to_nid(). Sure this will determine the node the object 
> > resides on. But this may not be the node on which the slab is tracked 
> > since there may have been a fallback at alloc time.
> 
> How about your slab rewrite?  I assume it would make more sense to fix
> such problems in that code instead of the old which is going to be replaced
> at some point.

The slab rewrite first allocates a page and then determines where it 
came from instead of requiring the page allocator to allocate from a 
certain node. Plus SLUB does not keep per cpu or per node object queues. 
So the problem does not occur in the same way. The per cpu slab in SLUB 
can contain objects from another node whereas SLAB can only put node local 
objects on its queues.




Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-20 Thread Andi Kleen
> We usually use page_to_nid(). Sure this will determine the node the object 
> resides on. But this may not be the node on which the slab is tracked 
> since there may have been a fallback at alloc time.

How about your slab rewrite?  I assume it would make more sense to fix
such problems in that code instead of the old which is going to be replaced
at some point.

-Andi
> 


Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-20 Thread Christoph Lameter
On Tue, 20 Mar 2007, Eric Dumazet wrote:

> I understand we want to do special things (fallback and such tricks) at
> allocation time, but I believe that we can just trust the real nid of memory
> at free time.

Sorry no. The node at allocation time determines which node specific 
structure tracks the slab. If we fall back then the node is allocated 
from one node but entered in the node structure of another. Thus you 
cannot free the slab without knowing the node at allocation time.



Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-20 Thread Christoph Lameter
On Tue, 20 Mar 2007, Andi Kleen wrote:

> > > Is it possible virt_to_slab(objp)->nodeid being different from 
> > > pfn_to_nid(objp) ?
> > 
> > It is possible the page allocator falls back to another node than 
> > requested. We would need to check that this never occurs.
> 
> The only way to ensure that would be to set a strict mempolicy.
> But I'm not sure that's a good idea -- after all you don't want
> to fail an allocation in this case.
> 
> But pfn_to_nid on the object like proposed by Eric should work anyways.
> But I'm not sure the tables used for that will be more often cache hot
> than the slab.

We usually use page_to_nid(). Sure this will determine the node the object 
resides on. But this may not be the node on which the slab is tracked 
since there may have been a fallback at alloc time.



Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-20 Thread Eric Dumazet

Andi Kleen a écrit :

> > > Is it possible virt_to_slab(objp)->nodeid being different from
> > > pfn_to_nid(objp) ?
> > 
> > It is possible the page allocator falls back to another node than
> > requested. We would need to check that this never occurs.
> 
> The only way to ensure that would be to set a strict mempolicy.
> But I'm not sure that's a good idea -- after all you don't want
> to fail an allocation in this case.
> 
> But pfn_to_nid on the object like proposed by Eric should work anyways.
> But I'm not sure the tables used for that will be more often cache hot
> than the slab.


pfn_to_nid() on most x86_64 machines accesses one cache line (struct memnode).

Node 0 MemBase  Limit 00028000
Node 1 MemBase 00028000 Limit 00048000
NUMA: Using 31 for the hash shift.

In this example, we use only 8 bytes of memnode.embedded_map[] to find the nid of 
all 16 GB of RAM. On the profiles I have, memnode is always hot (no cache miss on it).


While virt_to_slab() has to access :

1) struct page -> page_get_slab() (page->lru.prev) (one cache miss)
2) struct slab -> nodeid (one other cache miss)


So using pfn_to_nid() would avoid 2 cache misses.
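The memnode lookup described above can be modeled as a single shift plus a byte-array index (a toy sketch; the shift value and map contents below are invented, not the real machine's values):

```c
#include <assert.h>
#include <stdint.h>

/* "NUMA: Using 31 for the hash shift": one map byte per 2 GB chunk. */
#define TOY_MEMNODE_SHIFT 31

/* 8 bytes cover 16 GB of physical address space: first five 2 GB
 * chunks on node 0, the remaining three on node 1 (made-up layout). */
static const uint8_t toy_memnode_map[8] = { 0, 0, 0, 0, 0, 1, 1, 1 };

/* The whole lookup: one shift, one byte load from a hot cache line. */
static int toy_phys_to_nid(uint64_t paddr)
{
    return toy_memnode_map[paddr >> TOY_MEMNODE_SHIFT];
}
```

This is why Eric expects the node check itself to be nearly free compared with touching struct page and struct slab.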

I understand we want to do special things (fallback and such tricks) at 
allocation time, but I believe that we can just trust the real nid of memory 
at free time.





Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-20 Thread Andi Kleen
> > Is it possible virt_to_slab(objp)->nodeid being different from 
> > pfn_to_nid(objp) ?
> 
> It is possible the page allocator falls back to another node than 
> requested. We would need to check that this never occurs.

The only way to ensure that would be to set a strict mempolicy.
But I'm not sure that's a good idea -- after all you don't want
to fail an allocation in this case.

But pfn_to_nid on the object like proposed by Eric should work anyways.
But I'm not sure the tables used for that will be more often cache hot
than the slab.

-Andi


Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-20 Thread Christoph Lameter
On Tue, 20 Mar 2007, Eric Dumazet wrote:

> 
> I noticed on a small x86_64 NUMA setup (2 nodes) that cache_free_alien() is 
> very expensive.
> This is because of a cache miss on struct slab.
> At the time an object is freed (call to kmem_cache_free() for example), the 
> underlying 'struct slab' is not anymore cache-hot.
> 
> struct slab *slabp = virt_to_slab(objp);
> nodeid = slabp->nodeid; // cache miss
> 
> So we currently need slab only to lookup nodeid, to be able to use the cachep 
> cpu cache, or not.
> 
> Couldn't we use something less expensive, like pfn_to_nid() ?

Nodeid describes the node that the slab was allocated for, which may 
not be equal to the node that the page came from. But if GFP_THISNODE is 
used then these should always be the same. That those two nodeids differ 
was certainly frequent before the GFP_THISNODE work went in. Now 
it may just occur in corner cases, perhaps during bootup on 
some machines that boot with empty nodes. I vaguely recall a powerpc 
issue.

> Is it possible virt_to_slab(objp)->nodeid being different from 
> pfn_to_nid(objp) ?

It is possible the page allocator falls back to another node than 
requested. We would need to check that this never occurs.

If we are sure then we could drop the nodeid field completely.



[RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-20 Thread Eric Dumazet
Hi

I noticed on a small x86_64 NUMA setup (2 nodes) that cache_free_alien() is 
very expensive.
This is because of a cache miss on struct slab.
At the time an object is freed (by a call to kmem_cache_free(), for example), the 
underlying 'struct slab' is no longer cache-hot.

struct slab *slabp = virt_to_slab(objp);
nodeid = slabp->nodeid; // cache miss

So we currently need the slab only to look up nodeid, to decide whether the cachep 
cpu cache can be used or not.

Couldn't we use something less expensive, like pfn_to_nid() ?
On x86_64 pfn_to_nid usually shares one cache line for all objects (struct 
memnode)

Is it possible virt_to_slab(objp)->nodeid being different from pfn_to_nid(objp) 
?

Eric

