Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
On Mon, Apr 02, 2007 at 05:23:20PM -0700, Christoph Lameter wrote:
> On Mon, 2 Apr 2007, Siddha, Suresh B wrote:
> > Set the node_possible_map at runtime. On a non NUMA system,
> > num_possible_nodes() will now say '1'
>
> How does this relate to nr_node_ids?

With this patch, nr_node_ids on non NUMA will also be '1' and, as before,
nr_node_ids is the same as num_possible_nodes().

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
On Mon, 2 Apr 2007, Siddha, Suresh B wrote:
> Set the node_possible_map at runtime. On a non NUMA system,
> num_possible_nodes() will now say '1'

How does this relate to nr_node_ids?
Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
On Fri, Mar 23, 2007 at 03:12:10PM +0100, Andi Kleen wrote:
> > But that is based on compile time option, isn't it? Perhaps I need
> > to use some other mechanism to find out the platform is not NUMA capable..
>
> We can probably make it runtime on x86. That will be needed sooner or
> later for correct NUMA hotplug support anyways.

How about this patch? Thanks.
---

From: Suresh Siddha <[EMAIL PROTECTED]>
[patch] x86_64: set node_possible_map at runtime.

Set the node_possible_map at runtime. On a non NUMA system,
num_possible_nodes() will now say '1'

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/arch/x86_64/mm/k8topology.c b/arch/x86_64/mm/k8topology.c
index b5b8dba..d6f4447 100644
--- a/arch/x86_64/mm/k8topology.c
+++ b/arch/x86_64/mm/k8topology.c
@@ -49,11 +49,8 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 	int found = 0;
 	u32 reg;
 	unsigned numnodes;
-	nodemask_t nodes_parsed;
 	unsigned dualcore = 0;
 
-	nodes_clear(nodes_parsed);
-
 	if (!early_pci_allowed())
 		return -1;
@@ -102,7 +99,7 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 			       nodeid, (base>>8)&3, (limit>>8) & 3);
 			return -1;
 		}
-		if (node_isset(nodeid, nodes_parsed)) {
+		if (node_isset(nodeid, node_possible_map)) {
 			printk(KERN_INFO "Node %d already present. Skipping\n",
 			       nodeid);
 			continue;
@@ -155,7 +152,7 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 		prevbase = base;
 
-		node_set(nodeid, nodes_parsed);
+		node_set(nodeid, node_possible_map);
 	}
 
 	if (!found)
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 41b8fb0..5f7d4d8 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -383,6 +383,7 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
 		       i, nodes[i].start, nodes[i].end,
 		       (nodes[i].end - nodes[i].start) >> 20);
+		node_set(i, node_possible_map);
 		node_set_online(i);
 	}
 	memnode_shift = compute_hash_shift(nodes, numa_fake);
@@ -405,6 +406,8 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 {
 	int i;
 
+	nodes_clear(node_possible_map);
+
 #ifdef CONFIG_NUMA_EMU
 	if (numa_fake && !numa_emulation(start_pfn, end_pfn))
 		return;
@@ -432,6 +435,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 	memnodemap[0] = 0;
 	nodes_clear(node_online_map);
 	node_set_online(0);
+	node_set(0, node_possible_map);
 	for (i = 0; i < NR_CPUS; i++)
 		numa_set_node(i, 0);
 	node_to_cpumask[0] = cpumask_of_cpu(0);
diff --git a/arch/x86_64/mm/srat.c b/arch/x86_64/mm/srat.c
index 2efe215..9f26e2b 100644
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -25,7 +25,6 @@ int acpi_numa __initdata;
 
 static struct acpi_table_slit *acpi_slit;
 
-static nodemask_t nodes_parsed __initdata;
 static struct bootnode nodes[MAX_NUMNODES] __initdata;
 static struct bootnode nodes_add[MAX_NUMNODES];
 static int found_add_area __initdata;
@@ -43,7 +42,7 @@ static __init int setup_node(int pxm)
 static __init int conflicting_nodes(unsigned long start, unsigned long end)
 {
 	int i;
-	for_each_node_mask(i, nodes_parsed) {
+	for_each_node_mask(i, node_possible_map) {
 		struct bootnode *nd = &nodes[i];
 		if (nd->start == nd->end)
 			continue;
@@ -321,7 +320,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 	}
 	nd = &nodes[node];
 	oldnode = *nd;
-	if (!node_test_and_set(node, nodes_parsed)) {
+	if (!node_test_and_set(node, node_possible_map)) {
 		nd->start = start;
 		nd->end = end;
 	} else {
@@ -344,7 +343,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 		printk(KERN_NOTICE "SRAT: Hotplug region ignored\n");
 		*nd = oldnode;
 		if ((nd->start | nd->end) == 0)
-			node_clear(node, nodes_parsed);
+			node_clear(node, node_possible_map);
 	}
 }
@@ -356,7 +355,7 @@ static int nodes_cover_memory(void)
 	unsigned long pxmram, e820ram;
 
 	pxmram = 0;
-	for_each_node_mask(i, nodes_parsed) {
+	for_each_node_mask(i, node_possible_map) {
 		unsigned long s = nodes[i].start >> PAGE_SHIFT;
 		unsigned long e = nodes[i].end >> PAGE_SHIFT;
 		pxmram += e - s;
@@ -380,7 +379,7 @@ static int nodes_cover_memory(void)
 
 static void unparse_node(int node)
 {
 	int i;
Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
On Thu, Mar 22, 2007 at 06:25:16PM -0700, Christoph Lameter wrote:
> On Thu, 22 Mar 2007, Siddha, Suresh B wrote:
> > > You should check num_possible_nodes(), or nr_node_ids (this one is
> > > cheaper, its a variable instead of a function call)
> >
> > But that is based on compile time option, isn't it? Perhaps I need
> > to use some other mechanism to find out the platform is not NUMA capable..
>
> No its runtime.

I don't see any code that would ever change the mask from the compile
default. But that is easy to fix.

-Andi
Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
> But that is based on compile time option, isn't it? Perhaps I need
> to use some other mechanism to find out the platform is not NUMA capable..

We can probably make it runtime on x86. That will be needed sooner or
later for correct NUMA hotplug support anyways.

-Andi
Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
On Thu, 22 Mar 2007, Siddha, Suresh B wrote:
> > You should check num_possible_nodes(), or nr_node_ids (this one is cheaper,
> > its a variable instead of a function call)
>
> But that is based on compile time option, isn't it? Perhaps I need
> to use some other mechanism to find out the platform is not NUMA capable..

No its runtime.
Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
Siddha, Suresh B wrote:
> On Thu, Mar 22, 2007 at 11:12:39PM +0100, Eric Dumazet wrote:
> > Siddha, Suresh B wrote:
> > > +	if (num_online_nodes() == 1)
> > > +		use_alien_caches = 0;
> > > +
> >
> > Unfortunately this part is wrong.
>
> oops.
>
> > You should check num_possible_nodes(), or nr_node_ids (this one is cheaper,
> > its a variable instead of a function call)
>
> But that is based on compile time option, isn't it? Perhaps I need
> to use some other mechanism to find out the platform is not NUMA capable..

nr_node_ids is defined to 1 if you compile a non NUMA kernel.

If CONFIG_NUMA is on, then nr_node_ids is a variable, filled with the
maximum possible node id, plus 1.

If your machine is not CPU hot plug capable, and you have say one node
(one dual core processor for example), then nr_node_ids will be set to 1
(see the setup_nr_node_ids() function in mm/page_alloc.c).

So this is OK for your need...
Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
On Thu, Mar 22, 2007 at 11:12:39PM +0100, Eric Dumazet wrote:
> Siddha, Suresh B wrote:
> > +	if (num_online_nodes() == 1)
> > +		use_alien_caches = 0;
> > +
>
> Unfortunately this part is wrong.

oops.

> You should check num_possible_nodes(), or nr_node_ids (this one is cheaper,
> its a variable instead of a function call)

But that is based on compile time option, isn't it? Perhaps I need
to use some other mechanism to find out the platform is not NUMA capable..

thanks,
suresh
Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
Siddha, Suresh B wrote:
> Christoph, While we are at this topic, recently I had reports that
> cache_free_alien() is costly on non NUMA platforms too (similar to the
> cache miss issues that Eric was referring to on NUMA) and the appended
> patch seems to fix it for non NUMA at least.
>
> Appended patch gives a nice 1% perf improvement on non-NUMA platform
> with database workload. Please comment or Ack for mainline :)

I have one comment :)

> @@ -1394,6 +1394,9 @@ void __init kmem_cache_init(void)
> 	int order;
> 	int node;
>
> +	if (num_online_nodes() == 1)
> +		use_alien_caches = 0;
> +

Unfortunately this part is wrong.

You should check num_possible_nodes(), or nr_node_ids (this one is cheaper,
its a variable instead of a function call)

I wonder if we could add a new SLAB_NUMA_BYPASS, so that we can declare
some kmem_cache as non NUMA aware (for example, I feel network skb dont
need the NUMA overhead)
Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
On Thu, 22 Mar 2007, Siddha, Suresh B wrote:
> @@ -1394,6 +1394,9 @@ void __init kmem_cache_init(void)
> 	int order;
> 	int node;
>
> +	if (num_online_nodes() == 1)
> +		use_alien_caches = 0;
> +

What happens if you bring up a second node?
non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
Christoph, While we are at this topic, recently I had reports that
cache_free_alien() is costly on non NUMA platforms too (similar to the
cache miss issues that Eric was referring to on NUMA) and the appended
patch seems to fix it for non NUMA at least.

Appended patch gives a nice 1% perf improvement on non-NUMA platform
with database workload. Please comment or Ack for mainline :)

thanks,
suresh
---
Subject: [patch] slab: skip cache_free_alien() on non NUMA
From: Suresh Siddha <[EMAIL PROTECTED]>

set use_alien_caches to 0 on non NUMA platforms. And avoid calling
the cache_free_alien() when use_alien_caches is not set. This will avoid
the cache miss that happens while dereferencing slabp to get nodeid.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/mm/slab.c b/mm/slab.c
index 8fdaffa..146082d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1146,7 +1146,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 	 * Make sure we are not freeing a object from another node to the array
 	 * cache on this cpu.
 	 */
-	if (likely(slabp->nodeid == node) || unlikely(!use_alien_caches))
+	if (likely(slabp->nodeid == node))
 		return 0;
 
 	l3 = cachep->nodelists[node];
@@ -1394,6 +1394,9 @@ void __init kmem_cache_init(void)
 	int order;
 	int node;
 
+	if (num_online_nodes() == 1)
+		use_alien_caches = 0;
+
 	for (i = 0; i < NUM_INIT_LISTS; i++) {
 		kmem_list3_init(&initkmem_list3[i]);
 		if (i < MAX_NUMNODES)
@@ -3563,7 +3566,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp)
 	check_irq_off();
 	objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
 
-	if (cache_free_alien(cachep, objp))
+	if (use_alien_caches && cache_free_alien(cachep, objp))
 		return;
 
 	if (likely(ac->avail < ac->limit)) {
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
On Wed, 21 Mar 2007, Eric Dumazet wrote:
> If numa_node_id() is equal to the node of the page containing the first byte
> of the object, then object is on the local node. Or what ?

No. The slab (the page you are referring to) may have been allocated for
another node and been tracked via the node structs of that other node. We
were just falling back to the node that now appears to be local.
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
Christoph Lameter wrote:
> On Wed, 21 Mar 2007, Eric Dumazet wrote:
> > The fast path is to put the pointer, into the cpu array cache. This object
> > might be given back some cycles later, because of a kmem_cache_alloc() : No
> > need to access the two cache lines (struct page, struct slab)
>
> If you do that then the slab will no longer return objects from the
> desired nodes. The assumption is that cpu array objects are from the
> local node.

Me confused. How could the following be wrong?

static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
{
	int mynode = numa_node_id();
	int objnode = virt_to_nid(objp); /* or whatever */

	if (mynode == objnode)
		return 0;
	...
}

If numa_node_id() is equal to the node of the page containing the first
byte of the object, then object is on the local node. Or what ?

Thank you
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
On Wed, 21 Mar 2007, Eric Dumazet wrote:
> The fast path is to put the pointer, into the cpu array cache. This object
> might be given back some cycles later, because of a kmem_cache_alloc() : No
> need to access the two cache lines (struct page, struct slab)

If you do that then the slab will no longer return objects from the
desired nodes. The assumption is that cpu array objects are from the
local node.
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp-nodeid;
On Wed, 21 Mar 2007, Eric Dumazet wrote: The fast path is to put the pointer, into the cpu array cache. This object might be given back some cycles later, because of a kmem_cache_alloc() : No need to access the two cache lines (struct page, struct slab) If you do that then the slab will no longer return objects from the desired nodes. The assumption is that cpu array objects are from the local node. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp-nodeid;
Christoph Lameter a écrit : On Wed, 21 Mar 2007, Eric Dumazet wrote: The fast path is to put the pointer, into the cpu array cache. This object might be given back some cycles later, because of a kmem_cache_alloc() : No need to access the two cache lines (struct page, struct slab) If you do that then the slab will no longer return objects from the desired nodes. The assumption is that cpu array objects are from the local node. Me confused. How the following could be wrong ? static inline int cache_free_alien(struct kmem_cache *cachep, void *objp) { int mynode = numa_node_id(); int objnode = virt_to_nid(objp); // or whatever if (mynode == objnode) return 0; ... } If numa_node_id() is equal to the node of the page containing the first byte of the object, then object is on the local node. Or what ? Thank you - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
On Wed, 21 Mar 2007, Eric Dumazet wrote:
> If numa_node_id() is equal to the node of the page containing the first
> byte of the object, then the object is on the local node. Or what?

No. The slab (the page you are referring to) may have been allocated for
another node and been tracked via the node structs of that other node. We
were just falling back to the node that now appears to be local.
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
Christoph Lameter wrote:
> On Tue, 20 Mar 2007, Eric Dumazet wrote:
>> I understand we want to do special things (fallback and such tricks) at
>> allocation time, but I believe that we can just trust the real nid of
>> memory at free time.
>
> Sorry no. The node at allocation time determines which node specific
> structure tracks the slab. If we fall back then the node is allocated
> from one node but entered in the node structure of another. Thus you
> cannot free the slab without knowing the node at allocation time.

I think you don't understand my point.

When we enter kmem_cache_free(), we are not freeing a slab, but an object,
knowing a pointer to this object.

The fast path is to put the pointer into the cpu array cache. This object
might be given back some cycles later, because of a kmem_cache_alloc():
no need to access the two cache lines (struct page, struct slab).

This fast path could be entered checking the node of the page, which is
faster, but eventually different from virt_to_slab(obj)->nodeid. Do we
care? Definitely not: the node is guaranteed to be correct.

Then, if we must flush the cpu array cache because it is full, we *may*
access the slabs of the objects we are flushing, and then check
virt_to_slab(obj)->nodeid to be able to do the correct thing.

Fortunately, flushing the cache array is not a frequent event, and the
cost of access to the cache lines (struct page, struct slab) can be
amortized because several 'transferred or freed' objects might share them
at this time.

Actually I had to disable NUMA on my platforms because it is just overkill
and slower. Maybe it's something OK for very big machines, and not dual
node Opterons? Let me know so that I don't waste your time (and mine).

Thank you
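[Editor's note: the split Eric describes, a fast path that only needs the page's node and a slow flush path that is the only place to read struct slab, can be modeled in userspace. All names here (model_free, model_flush, tracking_nid, etc.) are hypothetical illustrations, not the kernel's actual code.]

```c
#include <assert.h>

enum { CACHE_SIZE = 4 };

struct model_slab {
	int page_nid;      /* node the backing page physically lives on */
	int tracking_nid;  /* node whose structures track this slab */
};

struct model_obj {
	struct model_slab *slab;
};

struct cpu_cache {
	int my_nid;
	struct model_obj *entries[CACHE_SIZE];
	int avail;
	int slab_accesses; /* how often we had to touch struct slab */
};

/* Slow path: only here do we dereference obj->slab to learn the
 * tracking node, amortized over a whole batch of objects. */
static void model_flush(struct cpu_cache *cc)
{
	for (int i = 0; i < cc->avail; i++) {
		int nid = cc->entries[i]->slab->tracking_nid;
		(void)nid; /* would hand the object back to node 'nid' */
		cc->slab_accesses++;
	}
	cc->avail = 0;
}

/* Fast path: decide local vs alien from the page's node alone,
 * never touching struct slab. */
static int model_free(struct cpu_cache *cc, struct model_obj *obj,
		      int obj_page_nid)
{
	if (obj_page_nid != cc->my_nid)
		return 0; /* alien object: caller takes the slow route */
	if (cc->avail == CACHE_SIZE)
		model_flush(cc);
	cc->entries[cc->avail++] = obj;
	return 1;
}
```

In this toy version, local frees never read `tracking_nid`; the field is consulted only when the per-cpu array overflows, which is Eric's claim about where the struct slab cache misses can be amortized.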
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
On Wed, 21 Mar 2007, Andi Kleen wrote:
> > We usually use page_to_nid(). Sure this will determine the node the
> > object resides on. But this may not be the node on which the slab is
> > tracked since there may have been a fallback at alloc time.
>
> How about your slab rewrite? I assume it would make more sense to fix
> such problems in that code instead of the old which is going to be
> replaced at some point.

The slab rewrite first allocates a page and then determines where it came
from, instead of requiring the page allocator to allocate from a certain
node. Plus SLUB does not keep per cpu or per node object queues. So the
problem does not occur in the same way. The per cpu slab in SLUB can
contain objects from another node, whereas SLAB can only put node local
objects on its queues.
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
> We usually use page_to_nid(). Sure this will determine the node the
> object resides on. But this may not be the node on which the slab is
> tracked since there may have been a fallback at alloc time.

How about your slab rewrite? I assume it would make more sense to fix such
problems in that code instead of the old which is going to be replaced at
some point.

-Andi
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
On Tue, 20 Mar 2007, Eric Dumazet wrote:
> I understand we want to do special things (fallback and such tricks) at
> allocation time, but I believe that we can just trust the real nid of
> memory at free time.

Sorry no. The node at allocation time determines which node specific
structure tracks the slab. If we fall back then the slab is allocated from
one node but entered in the node structure of another. Thus you cannot
free the slab without knowing the node at allocation time.
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
On Tue, 20 Mar 2007, Andi Kleen wrote:
> > > Is it possible virt_to_slab(objp)->nodeid being different from
> > > pfn_to_nid(objp) ?
> >
> > It is possible the page allocator falls back to another node than
> > requested. We would need to check that this never occurs.
>
> The only way to ensure that would be to set a strict mempolicy.
> But I'm not sure that's a good idea -- after all you don't want
> to fail an allocation in this case.
>
> But pfn_to_nid on the object like proposed by Eric should work anyways.
> But I'm not sure the tables used for that will be more often cache hot
> than the slab.

We usually use page_to_nid(). Sure this will determine the node the object
resides on. But this may not be the node on which the slab is tracked
since there may have been a fallback at alloc time.
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
Andi Kleen wrote:
>>> Is it possible virt_to_slab(objp)->nodeid being different from
>>> pfn_to_nid(objp) ?
>>
>> It is possible the page allocator falls back to another node than
>> requested. We would need to check that this never occurs.
>
> The only way to ensure that would be to set a strict mempolicy.
> But I'm not sure that's a good idea -- after all you don't want
> to fail an allocation in this case.
>
> But pfn_to_nid on the object like proposed by Eric should work anyways.
> But I'm not sure the tables used for that will be more often cache hot
> than the slab.

pfn_to_nid() on most x86_64 machines accesses one cache line (struct
memnode).

Node 0 MemBase Limit 00028000
Node 1 MemBase 00028000 Limit 00048000
NUMA: Using 31 for the hash shift.

On this example, we use only 8 bytes of memnode.embedded_map[] to find the
nid of all 16 GB of RAM. On the profiles I have, memnode is always hot (no
cache miss on it).

While virt_to_slab() has to access:
1) struct page -> page_get_slab() (page->lru.prev) (one cache miss)
2) struct slab -> nodeid (one other cache miss)

So using pfn_to_nid() would avoid 2 cache misses.

I understand we want to do special things (fallback and such tricks) at
allocation time, but I believe that we can just trust the real nid of
memory at free time.
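[Editor's note: the one-cache-line lookup Eric is pointing at can be sketched in userspace. The names and the exact node layout below are assumptions modeled on the x86_64 memnode hash (physical address shifted right by the hash shift indexes a small shared byte array), not the kernel's exact code.]

```c
#include <assert.h>
#include <stdint.h>

#define MEMNODE_SHIFT 31 /* "Using 31 for the hash shift" in Eric's log */

/* 16 GB of RAM at 2 GB (1 << 31) granularity needs 8 map entries,
 * matching "only 8 bytes of memnode.embedded_map[]". Layout here is
 * illustrative: first 8 GB on node 0, next 8 GB on node 1. */
static const uint8_t memnode_map[8] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Resolve the node of any physical address with a single indexed
 * load from one shared, likely-hot cache line. */
static int model_phys_to_nid(uint64_t paddr)
{
	return memnode_map[paddr >> MEMNODE_SHIFT];
}
```

The whole table fits in 8 bytes, which is why Eric sees no cache misses on it in his profiles, in contrast to the per-object struct page and struct slab lines.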
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
> > Is it possible virt_to_slab(objp)->nodeid being different from
> > pfn_to_nid(objp) ?
>
> It is possible the page allocator falls back to another node than
> requested. We would need to check that this never occurs.

The only way to ensure that would be to set a strict mempolicy.
But I'm not sure that's a good idea -- after all you don't want
to fail an allocation in this case.

But pfn_to_nid on the object like proposed by Eric should work anyways.
But I'm not sure the tables used for that will be more often cache hot
than the slab.

-Andi
Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
On Tue, 20 Mar 2007, Eric Dumazet wrote:
> I noticed on a small x86_64 NUMA setup (2 nodes) that cache_free_alien()
> is very expensive. This is because of a cache miss on struct slab.
> At the time an object is freed (call to kmem_cache_free() for example),
> the underlying 'struct slab' is not cache-hot anymore.
>
> struct slab *slabp = virt_to_slab(objp);
> nodeid = slabp->nodeid; // cache miss
>
> So we currently need the slab only to look up nodeid, to be able to use
> the cachep cpu cache, or not.
>
> Couldn't we use something less expensive, like pfn_to_nid() ?

Nodeid describes the node that the slab was allocated for, which may not
be equal to the node that the page came from. But if GFP_THISNODE is used
then these should always be the same.

That those two nodeids differ was certainly frequent before the
GFP_THISNODE work went in. Now it may just occur in corner cases. Perhaps
during bootup on some machines that boot with empty nodes. I vaguely
recall a powerpc issue.

> Is it possible virt_to_slab(objp)->nodeid being different from
> pfn_to_nid(objp) ?

It is possible the page allocator falls back to another node than
requested. We would need to check that this never occurs. If we are sure
then we could drop the nodeid field completely.
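[Editor's note: Christoph's distinction between the node a slab was allocated *for* and the node its page came *from* can be shown with a toy model. All names below (toy_track, toy_find, etc.) are hypothetical; the point is only that a slab tracked under its requested node cannot be found by searching the page's node after a fallback.]

```c
#include <assert.h>
#include <stddef.h>

enum { NODES = 2 };

struct toy_slab {
	int page_nid;          /* where the memory really is */
	struct toy_slab *next; /* linkage on a per-node list */
};

static struct toy_slab *node_lists[NODES];

/* Allocation was requested for 'wanted_nid'; the page allocator may
 * have fallen back to another node, but SLAB-style bookkeeping still
 * enters the slab under 'wanted_nid'. */
static void toy_track(struct toy_slab *s, int wanted_nid)
{
	s->next = node_lists[wanted_nid];
	node_lists[wanted_nid] = s;
}

/* Search node 'nid's list for the slab; returns 1 if found. */
static int toy_find(struct toy_slab *s, int nid)
{
	for (struct toy_slab *p = node_lists[nid]; p; p = p->next)
		if (p == s)
			return 1;
	return 0;
}
```

A slab whose page came from node 1 but was requested for node 0 shows up only on node 0's list, so freeing by `page_nid` would consult the wrong per-node structure. With GFP_THISNODE the two node ids coincide and the distinction disappears.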
[RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
Hi

I noticed on a small x86_64 NUMA setup (2 nodes) that cache_free_alien()
is very expensive. This is because of a cache miss on struct slab.

At the time an object is freed (call to kmem_cache_free() for example),
the underlying 'struct slab' is not cache-hot anymore.

	struct slab *slabp = virt_to_slab(objp);
	nodeid = slabp->nodeid; /* cache miss */

So we currently need the slab only to look up nodeid, to be able to use
the cachep cpu cache, or not.

Couldn't we use something less expensive, like pfn_to_nid()?

On x86_64, pfn_to_nid() usually shares one cache line for all objects
(struct memnode).

Is it possible for virt_to_slab(objp)->nodeid to be different from
pfn_to_nid(objp)?

Eric
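[Editor's note: the cost Eric measures comes from two dependent loads. The sketch below models them in userspace; the names are hypothetical stand-ins for the 2.6-era SLAB layout (virt_to_page, the struct slab pointer stashed in page->lru.prev), not the kernel's exact code.]

```c
#include <assert.h>

struct sketch_slab {
	int nodeid; /* node recorded for the slab at allocation time */
};

struct sketch_page {
	struct sketch_slab *slab; /* SLAB stashes this in page->lru.prev */
};

/* Stand-in for virt_to_page(): a 1-entry "mem_map" for illustration.
 * The real kernel indexes mem_map by the object's page frame number. */
static struct sketch_page *sketch_virt_to_page(void *obj,
					       struct sketch_page *mem_map)
{
	(void)obj;
	return mem_map;
}

/* virt_to_slab(objp)->nodeid as a pointer chase: object -> struct page
 * (first likely-cold line) -> struct slab (second likely-cold line). */
static int sketch_obj_to_nid(void *obj, struct sketch_page *mem_map)
{
	struct sketch_page *pg = sketch_virt_to_page(obj, mem_map); /* load 1 */
	return pg->slab->nodeid;                                    /* load 2 */
}
```

Both loads are to memory that was last touched when the object was allocated, which is why Eric sees them as cache misses on the free path; his pfn_to_nid() alternative replaces the chase with a single lookup in one shared table.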