Re: [RFC 1/2] workqueue: use the nearest NUMA node, not the local one
[ Apologies for replying from a different address, we have a service outage at work. ]

On Fri, Jul 18, 2014 at 1:11 AM, Lai Jiangshan la...@cn.fujitsu.com wrote:

Hi, Thank you for your response! I'm curious about what will happen when we call alloc_pages_node(memoryless_node).

alloc_pages_node() is only involved in one of the possible paths (maybe this occurs on x86 with THREAD_INFO > PAGE_SIZE?). On powerpc, though, that's not the case. Details:

1. pool->node is used in the invocation of kthread_create_on_node() in create_worker().
2. kthread_create_on_node() sets up a struct kthread_create_info with create->node = node and wakes up kthreadd.
3. kthreadd calls create_kthread(), which sets current->pref_node_fork = create->node.
4. dup_task_struct() calls node = tsk_fork_get_node() before invoking alloc_task_struct_node(node) and alloc_thread_info_node(node).
5. tsk_fork_get_node() returns current->pref_node_fork for kthreadd.
6. alloc_task_struct_node() calls kmem_cache_alloc_node(, GFP_KERNEL, node).
7. alloc_thread_info_node() either calls kmem_cache_alloc_node(, GFP_KERNEL, node) or alloc_kmem_pages_node(node, GFP_KERNEL,), depending on the size of THREAD_INFO relative to PAGE_SIZE.
8a. alloc_kmem_pages_node() -> alloc_pages_node() -> __alloc_pages() with a zonelist built from node_zonelist(). This should lead to proper fallback.
8b. kmem_cache_alloc_node() calls slab_alloc_node().
9. For a memoryless node, we will trigger the following:

	if (unlikely(!object || !node_match(page, node))) {
		object = __slab_alloc(s, gfpflags, node, addr, c);
		stat(s, ALLOC_SLOWPATH);
	}

10. __slab_alloc() in turn will:

	if (unlikely(!node_match(page, node))) {
		stat(s, ALLOC_NODE_MISMATCH);
		deactivate_slab(s, page, c->freelist);
		c->page = NULL;
		c->freelist = NULL;
		goto new_slab;
	}

deactivating the slab. Thus, every kthread created with a node specification leads to a single object on a slab. We see an explosion in the slab consumption, all of which is unreclaimable.
Anton originally proposed not deactivating slabs when we *know* the allocation will be remote (i.e., from a memoryless node). Joonsoo and Christoph disagreed with this and proposed alternative solutions, which weren't agreed upon at the time.

If the memory is allocated from the most preferable node for the @memoryless_node, why do we need to bother and use cpu_to_mem() at the caller site?

The reason is that the node passed is a hint into the MM subsystem of what node we want memory to come from. Well, I take that back; I think semantically there are two ways to interpret the node parameter:

1) the NUMA node we want memory from, or
2) the NUMA node we expect memory from.

The path through the MM above sort of conflates the two: the caller specified an impossible request (which the MM subsystem technically has the information to detect, but isn't using at this point) for memory from a node that has none. We could change the core MM to do better in the presence of memoryless nodes, and should, but this seems far less invasive and does the right thing. Semantically, I think the workqueue's pool->node is meant to be the node from which we want memory allocated, which is the node with memory closest to the CPU.

Thanks, Nish

If not, why does the memory allocation subsystem refuse to find a preferable node for @memoryless_node in this case? Is this intentional, or can it not find one in some cases?

Thanks, Lai

Added Cc to Tejun (workqueue maintainer).

On 07/18/2014 07:09 AM, Nishanth Aravamudan wrote:

In the presence of memoryless nodes, the workqueue code incorrectly uses cpu_to_node() to determine what node to prefer memory allocations come from. cpu_to_mem() should be used instead, which will use the nearest NUMA node with memory.
Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 35974ac..0bba022 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 	for_each_node(node) {
 		if (cpumask_subset(pool->attrs->cpumask,
 				   wq_numa_possible_cpumask[node])) {
-			pool->node = node;
+			/*
+			 * We could use local_memory_node(node) here,
+			 * but it is expensive and the following caches
+			 * the same value.
+			 */
+			pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask));
 			break;
 		}
 	}
@@ -4921,7 +4926,7 @@ static int __init init_workqueues(void)
 		pool->cpu = cpu;
 		cpumask_copy(pool->attrs->cpumask,
Re: [RFC 0/2] Memoryless nodes and kworker
Hi Tejun,

On Fri, Jul 18, 2014 at 4:20 AM, Tejun Heo t...@kernel.org wrote:

On Thu, Jul 17, 2014 at 04:09:23PM -0700, Nishanth Aravamudan wrote:

[Apologies for the large Cc list, but I believe we have the following interested parties:
 - x86 (recently posted memoryless node support)
 - ia64 (existing memoryless node support)
 - ppc (existing memoryless node support)
 - previous discussion of how to solve Anton's issue with slab usage
 - workqueue contributors/maintainers]

Well, you forgot to cc me.

Ah, I'm very sorry! That's what I get for editing e-mails... Thank you for your reply!

...

It turns out we see this large slab usage due to using the wrong NUMA information when creating kthreads. Two changes are required, one of which is in the workqueue code and one of which is in the powerpc initialization. Note that ia64 may want to consider something similar.

Wasn't there a thread on this exact subject a few weeks ago? Was that someone else? Memoryless-node detail leaking out of the allocator proper isn't a good idea. Please allow allocator users to specify the nodes they're on and let the allocator layer deal with mapping that to whatever is appropriate. Please don't push that to everybody.

I didn't send anything for the workqueue logic anytime recently. Jiang sent out a patchset for x86 memoryless node support, which may have touched kernel/workqueue.c.

So, to be clear, this is not *necessarily* about memoryless nodes. It's about the intended semantics. The workqueue code currently calls cpu_to_node() in a few places, and passes that node into the core MM as a hint about where the memory should come from. However, when memoryless nodes are present, that hint is guaranteed to be wrong, as it's the nearest NUMA node to the CPU (which happens to be the one it's on), not the nearest NUMA node with memory. The hint is correctly specified as cpu_to_mem(), which does the right thing in the presence or absence of memoryless nodes.
And I think it encapsulates the hint's semantics correctly -- please give me memory from where I expect it, which is the closest NUMA node.

I guess we could also change tsk_fork_get_node() to return local_memory_node(tsk->pref_node_fork), but that can be a bit expensive, as it generates a new zonelist each time to determine the first fallback node. We get the exact same semantics (because cpu_to_mem() caches the result of local_memory_node()) by using cpu_to_mem() directly.

Again, apologies for not Cc'ing you originally.

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC 0/2] Memoryless nodes and kworker
Hi Tejun,

[I found the other thread where you made these points -- thank you for expressing them so clearly again!]

On Fri, Jul 18, 2014 at 11:00 AM, Tejun Heo t...@kernel.org wrote:

Hello,

On Fri, Jul 18, 2014 at 10:42:29AM -0700, Nish Aravamudan wrote:

So, to be clear, this is not *necessarily* about memoryless nodes. It's about the intended semantics. The workqueue code currently calls cpu_to_node() in a few places, and passes that node into the core MM as a hint about where the memory should come from. However, when memoryless nodes are present, that hint is guaranteed to be wrong, as it's the nearest NUMA node to the CPU (which happens to be the one it's on), not the nearest NUMA node with memory. The hint is correctly specified as cpu_to_mem(),

It's telling the allocator the node the CPU is on. Choosing and falling back the actual allocation is the allocator's job.

Ok, I agree with you then, if that's all the semantic is supposed to be. But looking at the comment for kthread_create_on_node():

 * If thread is going to be bound on a particular cpu, give its node
 * in @node, to get NUMA affinity for kthread stack, or else give -1.

the API interprets it as a suggestion for the affinity itself, *not* the node the kthread should be on. Piddly, yes, but actually I have another thought altogether, and in reviewing Jiang's patches this seems like the right approach: why aren't these callers using kthread_create_on_cpu()? That API was already changed to use cpu_to_mem() [so one change, rather than changes all over the kernel source]. We could change it back to cpu_to_node() and push down the knowledge about the fallback.

which does the right thing in the presence or absence of memoryless nodes. And I think it encapsulates the hint's semantics correctly -- please give me memory from where I expect it, which is the closest NUMA node.

I don't think it does. It loses information at too high a layer.
Workqueue here doesn't care how the memory subsystem is structured; it's just telling the allocator where it's at and expecting it to do the right thing. Please consider the following scenario:

	A - B - C - D - E

Let's say C is a memoryless node. If we map from C to either B or D in individual users, and that node can't serve that memory request, the allocator would fall back to A or E respectively, when the right thing to do would be falling back to D or B respectively, right?

Yes, this is a good point. But honestly, we're not really even to the point of talking about fallback here; at least in my testing, going off-node at all causes SLUB-configured slabs to deactivate, which then leads to an explosion in the unreclaimable slab.

This isn't a huge issue, but it shows that this is the wrong layer to deal with this issue. Let the allocator users express where they are. Choosing and falling back belong to the memory allocator. That's the only place which has all the necessary information, and those details must be contained there. Please don't leak it to memory allocator users.

Ok, I will continue to work at that level of abstraction.

Thanks, Nish
Re: [RFC 0/2] Memoryless nodes and kworker
On Fri, Jul 18, 2014 at 11:19 AM, Tejun Heo t...@kernel.org wrote:

Hello,

On Fri, Jul 18, 2014 at 11:12:01AM -0700, Nish Aravamudan wrote:

why aren't these callers using kthread_create_on_cpu()? That API was

It is using that. There just are other data structures too.

Sorry, I might not have been clear. Why are any callers of the form kthread_create_on_node(..., cpu_to_node(cpu), ...) not using kthread_create_on_cpu(..., cpu, ...)? In all of Linus' tree, there are only two call sites of kthread_create_on_cpu() -- smpboot_create_threads() and smpboot_register_percpu_thread(). Neither of those seems to be used by the workqueue code that I can see as of yet.

already changed to use cpu_to_mem() [so one change, rather than changes all over the kernel source]. We could change it back to cpu_to_node() and push down the knowledge about the fallback.

And once it's properly solved, please convert kthread back to use cpu_to_node() too. We really shouldn't be sprinkling the new, subtly different variant across the kernel. It's wrong and confusing.

I understand what you mean, but it's equally wrong for the kernel to be wasting GBs of slab. Different kinds of wrongness :)

Yes, this is a good point. But honestly, we're not really even to the point of talking about fallback here; at least in my testing, going off-node at all causes SLUB-configured slabs to deactivate, which then leads to an explosion in the unreclaimable slab.

I don't think moving the logic inside the allocator proper is a huge amount of work, and this isn't the first spillage of this subtlety out of the allocator proper. Fortunately, it hasn't spread too much yet. Let's please stop it here. I'm not saying you shouldn't or can't fix the off-node allocation.

It seems like an additional reasonable approach would be to provide a suitable _cpu() API for the allocators.
I'm not sure why saying that callers should know about NUMA (in order to call cpu_to_node() in every caller) is any better than saying that callers should know about memoryless nodes (in order to call cpu_to_mem() in every caller instead) -- when, at least in several cases that I've seen, the relevant datum is what CPU we're expecting to run on or are running on. The _cpu API would say: please allocate memory local to this CPU, wherever it is.

Thanks, Nish
Re: [PATCH] Fake NUMA emulation for PowerPC (Take 2)
On 1/27/08, Balbir Singh [EMAIL PROTECTED] wrote:

* Paul Mackerras [EMAIL PROTECTED] [2008-01-27 22:55:43]:

Balbir Singh writes:

Here's a better and more complete fix for the problem. Could you please see if it works for you? I tested it on a real NUMA box and it seemed to work fine there.

There are a couple of other changes in behaviour that your patch introduces, and I'd like to understand them better before taking the patch. First, with your patch we don't set nodes online if they end up having no memory in them because of the memory limit, whereas previously we did. Secondly, in the case where we don't have NUMA information, we now set node 0 online after adding each LMB, whereas previously we only set it online once. If in fact these changes are benign, then your patch description should mention them and explain why they are benign.

Yes, they are. I'll try and justify the changes with a good, detailed changelog. If people prefer it, I can hide fake NUMA nodes under a config option, so that it does not come enabled by default.

Sigh, there already *is* a fake NUMA config option, CONFIG_NUMA_EMU:

	Enable NUMA emulation. A flat machine will be split into virtual
	nodes when booted with "numa=fake=N", where N is the number of
	nodes. This is only useful for debugging.

I have to assume your patch is implementing the same feature for powerpc (really just extending the x86_64 one), and thus should share the config option. Any chance you can just make some of that code common? Maybe as a follow-on patch. I expect that some of Mel's (added to Cc) work to allow NUMA to be set up on x86 more easily will flow quite simply into adding fake NUMA support there as well, so moving the code to a common place (at least the parsing) makes sense.

I also feel like you want to be able to online memoryless nodes -- that's where we've been hitting a number of bugs lately in the VM. I can't tell from Paul's comment whether your patch prevents that from being faked or not.
Thanks, Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: crash in kmem_cache_init
On 1/22/08, Olaf Hering [EMAIL PROTECTED] wrote:

On Tue, Jan 22, Mel Gorman wrote:

http://www.csn.ul.ie/~mel/postings/slab-20080122/partial-revert-slab-changes.patch
... Can you please check on your machine if it fixes your problem?

It does not fix or change the nature of the crash.

Olaf, please confirm whether you need the patch below as well as the revert to make your machine boot.

It crashes now in a different way if the patch below is applied:

Was this with the revert Mel mentioned applied as well? I get the feeling both patches are needed to fix up the memoryless SLAB issue.

Linux version 2.6.24-rc8-ppc64 ([EMAIL PROTECTED]) (gcc version 4.1.2 20070115 (prerelease) (SUSE Linux)) #43 SMP Tue Jan 22 22:39:05 CET 2008
<snip>
early_node_map[1] active PFN ranges
    1: 0 - 892928
<snip>
Unable to handle kernel paging request for data at address 0x0058
Faulting instruction address: 0xc00fe018
cpu 0x0: Vector: 300 (Data Access) at [c075bac0]
    pc: c00fe018: .setup_cpu_cache+0x184/0x1f4
    lr: c00fdfa8: .setup_cpu_cache+0x114/0x1f4
    sp: c075bd40
   msr: 80009032
   dar: 58
 dsisr: 4200
  current = 0xc0665a50
  paca    = 0xc0666380
    pid = 0, comm = swapper
enter ? for help
[c075bd40] c00fb368 .kmem_cache_create+0x3c0/0x478 (unreliable)
[c075be20] c05e6780 .kmem_cache_init+0x284/0x4f4
[c075bee0] c05bf8ec .start_kernel+0x2f8/0x3fc
[c075bf90] c0008590 .start_here_common+0x60/0xd0
0:mon

0xc00fe018 is in setup_cpu_cache (/home/olaf/kernel/git/linux-2.6-numa/mm/slab.c:2111).

2106		BUG_ON(!cachep->nodelists[node]);
2107		kmem_list3_init(cachep->nodelists[node]);

I might be barking up the wrong tree, but this block above is supposed to set up the cachep->nodelists[*] that are used immediately below. But if the loop wasn't changed from N_NORMAL_MEMORY to N_ONLINE or whatever, you might get a bad access right below for node 0, which has no memory, if that's the node we're running on...
2108			}
2109		}
2110	}
2111	cachep->nodelists[numa_node_id()]->next_reap =
2112			jiffies + REAPTIMEOUT_LIST3 +
2113			((unsigned long)cachep) % REAPTIMEOUT_LIST3;
2114
2115	cpu_cache_get(cachep)->avail = 0;

Thanks, Nish
Re: crash in kmem_cache_init
On 1/18/08, Christoph Lameter [EMAIL PROTECTED] wrote:

Could you try this patch?

Memoryless nodes: Set N_NORMAL_MEMORY for a node if we do not support HIGHMEM

It seems that we scan through zones to set N_NORMAL_MEMORY only if CONFIG_HIGHMEM and CONFIG_NUMA are set. We need to set N_NORMAL_MEMORY in the !CONFIG_HIGHMEM case as well.

I'm testing this exact patch right now on the machine Mel saw the issues with.

Thanks, Nish
Re: [PATCH 1/2] powerpc: add hugepagesz boot-time parameter
On 11/28/07, Randy Dunlap [EMAIL PROTECTED] wrote:

On Tue, 27 Nov 2007 23:03:10 -0600 Jon Tollefson wrote:

This patch adds the hugepagesz boot-time parameter for ppc64 that lets you pick the size for your huge pages. The choices available are 64K and 16M. It defaults to 16M (previously the only choice) if nothing or an invalid choice is specified. Tested 64K huge pages with the libhugetlbfs 1.2 release with its 'make func' and 'make stress' test invocations.

This patch requires the patch posted by Mel Gorman that adds HUGETLB_PAGE_SIZE_VARIABLE: [PATCH] Fix boot problem with iSeries lacking hugepage support, on 2007-11-15.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---
 Documentation/kernel-parameters.txt |    1
 arch/powerpc/mm/hash_utils_64.c     |   11 +
 arch/powerpc/mm/hugetlbpage.c       |   41
 include/asm-powerpc/mmu-hash64.h    |    1
 mm/hugetlb.c                        |    1
 5 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 33121d6..2fc1fb8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -685,6 +685,7 @@ and is between 256 and 4096 characters. It is defined in the file
 			See Documentation/isdn/README.HiSax.
 	hugepages=	[HW,X86-32,IA-64] Maximal number of HugeTLB pages.
+	hugepagesz=	[HW,IA-64,PPC] The size of the HugeTLB pages.

Any chance of spelling it as hugepagesize so that it's a little less cryptic and more difficult to typo as hugepages? (i.e., less confusion between them)

It already exists as hugepagesz= for IA64. Changing it to hugepagesize would either make ppc different from IA64, or require keeping both so as to make IA64 setups continue working as before.

Thanks, Nish
Re: [PATCH 1/2] powerpc: add hugepagesz boot-time parameter
On 11/28/07, Randy Dunlap [EMAIL PROTECTED] wrote:

Nish Aravamudan wrote:

On 11/28/07, Randy Dunlap [EMAIL PROTECTED] wrote:

On Tue, 27 Nov 2007 23:03:10 -0600 Jon Tollefson wrote:

This patch adds the hugepagesz boot-time parameter for ppc64 that lets you pick the size for your huge pages. The choices available are 64K and 16M. It defaults to 16M (previously the only choice) if nothing or an invalid choice is specified. Tested 64K huge pages with the libhugetlbfs 1.2 release with its 'make func' and 'make stress' test invocations.

This patch requires the patch posted by Mel Gorman that adds HUGETLB_PAGE_SIZE_VARIABLE: [PATCH] Fix boot problem with iSeries lacking hugepage support, on 2007-11-15.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---
 Documentation/kernel-parameters.txt |    1
 arch/powerpc/mm/hash_utils_64.c     |   11 +
 arch/powerpc/mm/hugetlbpage.c       |   41
 include/asm-powerpc/mmu-hash64.h    |    1
 mm/hugetlb.c                        |    1
 5 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 33121d6..2fc1fb8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -685,6 +685,7 @@ and is between 256 and 4096 characters. It is defined in the file
 			See Documentation/isdn/README.HiSax.
 	hugepages=	[HW,X86-32,IA-64] Maximal number of HugeTLB pages.
+	hugepagesz=	[HW,IA-64,PPC] The size of the HugeTLB pages.

Any chance of spelling it as hugepagesize so that it's a little less cryptic and more difficult to typo as hugepages? (i.e., less confusion between them)

It already exists as hugepagesz= for IA64. Changing it to hugepagesize would either make ppc different from IA64, or require keeping both so as to make IA64 setups continue working as before.

Oh, but it wasn't in Doc/kernel-parameters.txt? :(

Nope :( I wonder how many other kernel parameters are in the same boat? Where's an RPJDay-script when you need it?

OK, just leave it as is, I think.
Yeah, I guess that's probably easiest... unfortunately.

-Nish
Re: 2.6.23-rc7-mm1 -- powerpc rtas panic
On 10/5/07, Linas Vepstas [EMAIL PROTECTED] wrote:

On Thu, Oct 04, 2007 at 05:01:47PM -0700, Nish Aravamudan wrote:

On 10/2/07, Tony Breeds [EMAIL PROTECTED] wrote:

On Wed, Oct 03, 2007 at 10:30:16AM +1000, Michael Ellerman wrote:

I realise it'll make the patch bigger, but this doesn't seem like a particularly good name for the variable anymore.

Sure, what about?

Clarify when RTAS logging is enabled.

Signed-off-by: Tony Breeds [EMAIL PROTECTED]

For what it's worth, on a different ppc64 box, this resolves a similar panic for me.

Tested-by: Nishanth Aravamudan [EMAIL PROTECTED]

For the reasons explained, I'd really like to nack Tony's patch.

I see. Can you reply in this thread with the patch you mentioned in your other reply? (Or point me to a copy of it.)

Thanks, Nish
Re: 2.6.23-rc7-mm1 -- powerpc rtas panic
On 10/2/07, Tony Breeds [EMAIL PROTECTED] wrote:

On Wed, Oct 03, 2007 at 10:30:16AM +1000, Michael Ellerman wrote:

I realise it'll make the patch bigger, but this doesn't seem like a particularly good name for the variable anymore.

Sure, what about?

Clarify when RTAS logging is enabled.

Signed-off-by: Tony Breeds [EMAIL PROTECTED]

For what it's worth, on a different ppc64 box, this resolves a similar panic for me.

Tested-by: Nishanth Aravamudan [EMAIL PROTECTED]

Thanks, Nish