Re: [PATCH] module, fix percpu reserved memory exhaustion
Hello, Rusty.

On Sat, Jan 12, 2013 at 11:36:09AM +1030, Rusty Russell wrote:
> > I've been thinking about that.  The problem is that at the same time the
> > kvm problem occurs I'm attempting to load a debug module that I've written
> > to debug some cpu timer issues that allocates a large amount of percpu
> > data (~.5K/cpu).  While extending PERCPU_MODULE_RESERVE to 10k might work
> > now, it might not work tomorrow if I have the need to increase the size
> > of my log buffer.
>
> Well, it looks like PERCPU_MODULE_RESERVE is actually very generous; it
> could easily be halved.  I guess this is because dynamic per-cpu data is
> now such a nice alternative (thanks to Tejun).

Heh, thanks. :)  There are two percpu constants that I basically pulled out
of my ass.

* PERCPU_MODULE_RESERVE: It's used on architectures with a data model which
  restricts the distance between code and data.  e.g. x86_64 generates code
  which can't address the full 64 bits for static data, and as percpu
  addressing depends on the regular address generation to derive percpu
  addresses, the percpu offsets need to be restricted somehow.  The percpu
  allocator currently achieves this by reserving a certain amount in the
  first percpu chunk, which is guaranteed to have very low (starting from 0)
  percpu address space offsets.  On those archs, this becomes the hard limit
  for all module static percpu areas.  Maybe 8k is way too high and we can
  go with 4k.  Worth a try I guess.

* PERCPU_DYNAMIC_RESERVE: The amount of space at the end of the first chunk
  used for dynamic allocation.  This isn't as critical as the above, and the
  only reason it exists is because the first chunk is often embedded in the
  kernel linear address space instead of the vmalloc area, and the former is
  often mapped with a larger page table size than the latter.  The goal of
  PERCPU_DYNAMIC_RESERVE is to have just enough space to cover the usual
  percpu allocations without wasting too much space, to minimize impact on
  TLB pressure.
The current number isn't entirely made up.  I did some testing with some
random config.  Don't know how it holds up now.  It probably can be
increased given that we developed quite a few percpu dynamic allocations
since then.  It would be awesome to have a way to somehow dynamically
adjust this one; unfortunately, that would require persistent data across
boots. :(

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
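[Editor's note] The reserved first chunk described above can be pictured as a simple bump allocator over a fixed region: every module's static percpu area must be carved out below PERCPU_MODULE_RESERVE, which is what makes that constant a hard limit. A minimal userspace sketch of the idea, with hypothetical helper names (the kernel's real allocator in mm/percpu.c is far more involved):

```c
#include <stddef.h>

/* Userspace model, not kernel code: static percpu areas of modules must
 * fit inside a fixed reserve whose offsets start at 0, so the reserve
 * size is a hard limit for all loaded modules combined. */
#define PERCPU_MODULE_RESERVE (8 << 10)	/* 8k, as discussed above */

static size_t reserved_used;	/* bytes handed out from the reserve */

/* Returns the low percpu offset of the allocation, or -1 when the
 * reserve is exhausted (the "alloc from reserved chunk failed" case). */
long reserved_chunk_alloc(size_t size, size_t align)
{
	size_t start = (reserved_used + align - 1) & ~(align - 1);

	if (start + size > PERCPU_MODULE_RESERVE)
		return -1;
	reserved_used = start + size;
	return (long)start;
}

void reserved_chunk_reset(void)
{
	reserved_used = 0;
}
```

With the 304-byte, 32-aligned kvm allocation from the bug report, each grab consumes a 320-byte stride, so an 8k reserve tolerates only a couple dozen of them before failing.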
Re: [PATCH] module, fix percpu reserved memory exhaustion
On 01/11/2013 08:06 PM, Rusty Russell wrote:
> Prarit Bhargava writes:
>> On 01/10/2013 10:48 PM, Rusty Russell wrote:
>> The timings were similar.  I didn't see any huge delays, etc.  Can the
>> relocations really cause a long delay?  I thought we were pretty much
>> writing values to memory...
>
> For x86 that's true, but look at what ppc64 has to do for example.  I'm
> guessing you don't have a giant Nvidia proprietary driver module
> loading, either.

Ah -- I see.  I hadn't thought much about the other arches, and I see what
ppc64 does ...

>> [I should point out that I'm booting a 32 physical/64 logical, with 64GB
>> of memory]
>
> I figured it had to be something big ;)

:)  Imagine what happens at 4096 cpus (SGI territory).

I'm wondering about that kvm commit.  Maybe the systemd/udev rule needs to
be rewritten to avoid a 'kvm loading flood' during boot ...  I'll talk with
Kay Sievers about it to see if there's a way around that.

> OTOH, Tested-by: means it actually fixed someone's problem.

Got it.  For the record, over-the-weekend testing didn't show any bizarre
results.  The boot times were all around 20-23 seconds.

Tested-by: Prarit Bhargava

P.
Re: [PATCH] module, fix percpu reserved memory exhaustion
Prarit Bhargava writes:
> On 01/10/2013 10:48 PM, Rusty Russell wrote:
> The timings were similar.  I didn't see any huge delays, etc.  Can the
> relocations really cause a long delay?  I thought we were pretty much
> writing values to memory...

For x86 that's true, but look at what ppc64 has to do for example.  I'm
guessing you don't have a giant Nvidia proprietary driver module loading,
either.

It just makes me nervous; this kind of boot slowdown typically won't get
diagnosed for several releases, if ever.  Now I've done the work, I'm
going to apply my patch (with an additional fix: I forgot to change kgdb,
which traverses the module list too).

> [I should point out that I'm booting a 32 physical/64 logical, with 64GB
> of memory]

I figured it had to be something big ;)

>> We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
>> allmodconfig build, there are only three modules with per-cpu data,
>> totalling 328 bytes.  So it's not reasonable to increase that number to
>> paper over this.
>
> I've been thinking about that.  The problem is that at the same time the
> kvm problem occurs I'm attempting to load a debug module that I've
> written to debug some cpu timer issues that allocates a large amount of
> percpu data (~.5K/cpu).  While extending PERCPU_MODULE_RESERVE to 10k
> might work now, it might not work tomorrow if I have the need to
> increase the size of my log buffer.

Well, it looks like PERCPU_MODULE_RESERVE is actually very generous; it
could easily be halved.  I guess this is because dynamic per-cpu data is
now such a nice alternative (thanks to Tejun).

> Tested-by: Prarit Bhargava
>
> Rusty, you can change that to an Acked-by if you prefer that.  I know
> some engineers prefer one over the other.  I'll also continue doing some
> reboot testing and will email back in a few days to let you know what
> the timing looks like.

There seem to be two kinds of Acked-by:

1) Acked-by: <maintainer>.  ie. "this should go through my tree, but it's
   going via someone else."
I like this: shows the normal maintainer is aware of the change.

2) Acked-by: <random>.  ie. "I like the concept of the patch though I
   haven't actually read it or tested it."  Completely useless.

OTOH, Tested-by: means it actually fixed someone's problem.

Thanks!
Rusty.
Re: [PATCH] module, fix percpu reserved memory exhaustion
On 01/10/2013 10:48 PM, Rusty Russell wrote:
> Prarit Bhargava writes:
>> [   15.478160] kvm: Could not allocate 304 bytes percpu data
>> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc
>> from reserved chunk failed
> ...
>> What is happening is systemd is loading an instance of the kvm module
>> for each cpu found (see commit e9bda3b).  When the module load occurs
>> the kernel currently allocates the module's percpu data area prior to
>> checking to see if the module is already loaded or is in the process of
>> being loaded.  If the module is already loaded, or finishes loading,
>> the module loading code releases the current instance's module percpu
>> data.
>
> Wow, what a cool bug!  Classic unforeseen side-effect.
>
> I'd prefer not to do relocations with the module_lock held: it can be
> relatively slow.  Yet we can't do relocations before the per-cpu
> allocation, obviously.  Did you do boot timings before and after?

Heh ... I did! :)  I had a lot of concerns about moving the mutex around,
so I put in a printk at the end of boot to see how long the boot time
actually was.

From the stock kernel:

[   22.893015] PRARIT: FINAL BOOT MESSAGE

From the stock kernel + my patch:

[   22.673214] PRARIT: FINAL BOOT MESSAGE

Both kernel boots showed the problem with kvm loading.  A quick grep
through my bootlogs of stock kernel + my patch doesn't show anything
greater than 23.539392 and less than 20.980321.  Those numbers are similar
to the numbers from the stock kernel (23.569450 - 20.898321).  ie) I don't
think there's an increase due to calling the relocation under the module
mutex, and if there is it is definitely lost within the noise of boot.

The timings were similar.  I didn't see any huge delays, etc.  Can the
relocations really cause a long delay?  I thought we were pretty much
writing values to memory...
[I should point out that I'm booting a 32 physical/64 logical, with 64GB
of memory]

> An alternative would be to put the module into the list even earlier
> (say, just after layout_and_allocate) so we could block on concurrent
> loads at that point.  But then we have to make sure no one looks in the
> module too early before it's completely set up, and that's complicated
> and error-prone too.  A separate list is kind of icky.

Yeah -- that was my first attempt actually, and it got very complex very
quickly.  I abandoned that approach in favor of moving the percpu
allocations under the lock.  I thought that was likely the easiest
approach.

> We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
> allmodconfig build, there are only three modules with per-cpu data,
> totalling 328 bytes.  So it's not reasonable to increase that number to
> paper over this.

I've been thinking about that.  The problem is that at the same time the
kvm problem occurs I'm attempting to load a debug module that I've written
to debug some cpu timer issues that allocates a large amount of percpu
data (~.5K/cpu).  While extending PERCPU_MODULE_RESERVE to 10k might work
now, it might not work tomorrow if I have the need to increase the size of
my log buffer.

... that is ;), I prefer your and my approach of fixing this problem.

> This is what a new boot state looks like (pains not to break ksplice).
> It's two patches, but I'll just post them back to back:
>
> module: add new state MODULE_STATE_UNFORMED
>
> You should never look at such a module, so it's excised from all paths
> which traverse the modules list.
>
> We add the state at the end, to avoid gratuitous ABI break (ksplice).
>
> Signed-off-by: Rusty Russell

Sure, but I'm always nervous about expanding any state machine ;).
That's just me though :).

> module: put modules in list much earlier.
> Prarit's excellent bug report:
>> In recent Fedora releases (F17 & F18) some users have reported seeing
>> messages similar to
>>
>> [   15.478160] kvm: Could not allocate 304 bytes percpu data
>> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc from
>> reserved chunk failed
>>
>> during system boot.  In some cases, users have also reported seeing
>> this message along with a failed load of other modules.
>>
>> What is happening is systemd is loading an instance of the kvm module
>> for each cpu found (see commit e9bda3b).  When the module load occurs
>> the kernel currently allocates the module's percpu data area prior to
>> checking to see if the module is already loaded or is in the process of
>> being loaded.  If the module is already loaded, or finishes loading,
>> the module loading code releases the current instance's module percpu
>> data.
>
> Now we have a new state MODULE_STATE_UNFORMED, we can insert the
> module into the list (and thus guarantee its uniqueness) before we
> allocate the per-cpu region.
>
> Reported-by: Prarit Bhargava
> Signed-off-by: Rusty Russell
> Tested-by: Prarit Bhargava

Rusty, you can change that to an Acked-by if you prefer that.  I know some
engineers prefer one over the other.
Re: [PATCH] module, fix percpu reserved memory exhaustion
Prarit Bhargava writes:
> [   15.478160] kvm: Could not allocate 304 bytes percpu data
> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc
> from reserved chunk failed
...
> What is happening is systemd is loading an instance of the kvm module
> for each cpu found (see commit e9bda3b).  When the module load occurs
> the kernel currently allocates the module's percpu data area prior to
> checking to see if the module is already loaded or is in the process of
> being loaded.  If the module is already loaded, or finishes loading,
> the module loading code releases the current instance's module percpu
> data.

Wow, what a cool bug!  Classic unforeseen side-effect.

I'd prefer not to do relocations with the module_lock held: it can be
relatively slow.  Yet we can't do relocations before the per-cpu
allocation, obviously.  Did you do boot timings before and after?

An alternative would be to put the module into the list even earlier
(say, just after layout_and_allocate) so we could block on concurrent
loads at that point.  But then we have to make sure no one looks in the
module too early before it's completely set up, and that's complicated
and error-prone too.  A separate list is kind of icky.

We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
allmodconfig build, there are only three modules with per-cpu data,
totalling 328 bytes.  So it's not reasonable to increase that number to
paper over this.

This is what a new boot state looks like (pains not to break ksplice).
It's two patches, but I'll just post them back to back:

module: add new state MODULE_STATE_UNFORMED

You should never look at such a module, so it's excised from all paths
which traverse the modules list.

We add the state at the end, to avoid gratuitous ABI break (ksplice).
Signed-off-by: Rusty Russell

diff --git a/include/linux/module.h b/include/linux/module.h
index 7760c6d..4432373 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -199,11 +199,11 @@ struct module_use {
 	struct module *source, *target;
 };

-enum module_state
-{
-	MODULE_STATE_LIVE,
-	MODULE_STATE_COMING,
-	MODULE_STATE_GOING,
+enum module_state {
+	MODULE_STATE_LIVE,	/* Normal state. */
+	MODULE_STATE_COMING,	/* Full formed, running module_init. */
+	MODULE_STATE_GOING,	/* Going away. */
+	MODULE_STATE_UNFORMED,	/* Still setting it up. */
 };

 /**
diff --git a/kernel/module.c b/kernel/module.c
index 41bc118..c3a2ee8 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -188,6 +188,7 @@ struct load_info {
    ongoing or failed initialization etc. */
 static inline int strong_try_module_get(struct module *mod)
 {
+	BUG_ON(mod && mod->state == MODULE_STATE_UNFORMED);
 	if (mod && mod->state == MODULE_STATE_COMING)
 		return -EBUSY;
 	if (try_module_get(mod))
@@ -343,6 +344,9 @@ bool each_symbol_section(bool (*fn)(const struct symsearch *arr,
 #endif
 		};

+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
+
 		if (each_symbol_in_section(arr, ARRAY_SIZE(arr), mod, fn, data))
 			return true;
 	}
@@ -450,16 +454,24 @@ const struct kernel_symbol *find_symbol(const char *name,
 EXPORT_SYMBOL_GPL(find_symbol);

 /* Search for module by name: must hold module_mutex. */
-struct module *find_module(const char *name)
+static struct module *find_module_all(const char *name,
+				      bool even_unformed)
 {
 	struct module *mod;

 	list_for_each_entry(mod, &modules, list) {
+		if (!even_unformed && mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if (strcmp(mod->name, name) == 0)
 			return mod;
 	}
 	return NULL;
 }
+
+struct module *find_module(const char *name)
+{
+	return find_module_all(name, false);
+}
 EXPORT_SYMBOL_GPL(find_module);

 #ifdef CONFIG_SMP
@@ -525,6 +537,8 @@ bool is_module_percpu_address(unsigned long addr)

 	preempt_disable();
 	list_for_each_entry_rcu(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if (!mod->percpu_size)
 			continue;
 		for_each_possible_cpu(cpu) {
@@ -1048,6 +1062,8 @@ static ssize_t show_initstate(struct module_attribute *mattr,
 	case MODULE_STATE_GOING:
 		state = "going";
 		break;
+	default:
+		BUG();
 	}
 	return sprintf(buffer, "%s\n", state);
 }
@@ -1786,6 +1802,8 @@ void set_all_modules_text_rw(void)

 	mutex_lock(&module_mutex);
 	list_for_each_entry_rcu(mod, &modules, list) {
+		if (mod->state == MODULE_STATE_UNFORMED)
+			continue;
 		if ((mod->module_core) && (mod->core_text_size)) {
 			set_page_attributes(mod->module_core,
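[Editor's note] The core of this patch pair can be sketched in userspace: a module enters the list in UNFORMED state, and every normal list walk skips UNFORMED entries, so early insertion guarantees name uniqueness without exposing a half-built module. A simplified model with hypothetical types (the kernel's real find_module_all takes no list argument and must be called under module_mutex):

```c
#include <stddef.h>
#include <string.h>

/* Userspace model of the UNFORMED-skip pattern, not kernel code. */
enum mod_state { STATE_LIVE, STATE_COMING, STATE_GOING, STATE_UNFORMED };

struct mod {
	const char *name;
	enum mod_state state;
	struct mod *next;
};

/* Walk the list; UNFORMED entries are invisible unless explicitly
 * requested, mirroring find_module_all()'s even_unformed flag. */
struct mod *find_module_all(struct mod *head, const char *name,
			    int even_unformed)
{
	for (struct mod *m = head; m; m = m->next) {
		if (!even_unformed && m->state == STATE_UNFORMED)
			continue;	/* still being set up: skip */
		if (strcmp(m->name, name) == 0)
			return m;
	}
	return NULL;
}

/* Normal lookups never see UNFORMED modules. */
struct mod *find_module(struct mod *head, const char *name)
{
	return find_module_all(head, name, 0);
}
```

A second loader can thus use the even_unformed walk to detect an in-flight load and wait, while every other traverser behaves as if the module did not exist yet.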
[PATCH] module, fix percpu reserved memory exhaustion
Rusty,

There is likely some subtlety of moving the module mutex that I'm unaware
of.  What I can say is that this patch seems to resolve the problem for
me, or at least through 100+ reboots I have not seen the problem (I'm
still testing as I write this).

I'm more than willing to hear an alternative approach, or test an
alternative patch.

Thanks,

P.

8<

In recent Fedora releases (F17 & F18) some users have reported seeing
messages similar to

[   15.478121] Pid: 727, comm: systemd-udevd Tainted: GF 3.8.0-rc2+ #1
[   15.478121] Call Trace:
[   15.478131] [] pcpu_alloc+0xa01/0xa60
[   15.478137] [] ? printk+0x61/0x63
[   15.478140] [] __alloc_reserved_percpu+0x13/0x20
[   15.478145] [] load_module+0x1dc2/0x20b0
[   15.478150] [] ? do_page_fault+0xe/0x10
[   15.478152] [] ? page_fault+0x28/0x30
[   15.478155] [] sys_init_module+0xd7/0x120
[   15.478159] [] system_call_fastpath+0x16/0x1b
[   15.478160] kvm: Could not allocate 304 bytes percpu data
[   15.478174] PERCPU: allocation failed, size=304 align=32, alloc
from reserved chunk failed

during system boot.  In some cases, users have also reported seeing this
message along with a failed load of other modules.

As the message indicates, the reserved chunk of percpu memory (where
modules allocate their memory) is exhausted.  A debug printk inserted in
the code shows

[   15.478533] PRARIT size = 304 > chunk->contig_hint = 208

ie) the reserved chunk of percpu has only 208 bytes of available space.

What is happening is systemd is loading an instance of the kvm module for
each cpu found (see commit e9bda3b).  When the module load occurs the
kernel currently allocates the module's percpu data area prior to checking
to see if the module is already loaded or is in the process of being
loaded.  If the module is already loaded, or finishes loading, the module
loading code releases the current instance's module percpu data.
The problem is that these module loads race, and it is possible that all of
the percpu reserved area is consumed by repeated loads of the same module,
which results in the failure of other drivers to load.

This patch moves the module percpu allocation after the check for an
existing instance of the module.

Signed-off-by: Prarit Bhargava
Cc: Rusty Russell
Cc: Mike Galbraith
---
 kernel/module.c | 124 ++-
 1 file changed, 85 insertions(+), 39 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 250092c..e7e9b57 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1929,6 +1929,27 @@ static int verify_export_symbols(struct module *mod)
 	return 0;
 }
 
+static void simplify_percpu_symbols(struct module *mod,
+				    const struct load_info *info)
+{
+	Elf_Shdr *symsec = &info->sechdrs[info->index.sym];
+	Elf_Sym *sym = (void *)symsec->sh_addr;
+	unsigned long secbase;
+	unsigned int i;
+
+	/*
+	 * No need for error checking in this function because
+	 * simplify_symbols has already been called.
+	 */
+	for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) {
+		/* Divert to percpu allocation if a percpu var. */
+		if (sym[i].st_shndx == info->index.pcpu) {
+			secbase = (unsigned long)mod_percpu(mod);
+			sym[i].st_value += secbase;
+		}
+	}
+}
+
 /* Change all symbols so that st_value encodes the pointer directly. */
 static int simplify_symbols(struct module *mod, const struct load_info *info)
 {
@@ -1976,12 +1997,11 @@ static int simplify_symbols(struct module *mod, const struct load_info *info)
 			break;
 
 		default:
-			/* Divert to percpu allocation if a percpu var. */
-			if (sym[i].st_shndx == info->index.pcpu)
-				secbase = (unsigned long)mod_percpu(mod);
-			else
+			/* percpu diverts handled in simplify_percpu_symbols */
+			if (sym[i].st_shndx != info->index.pcpu) {
 				secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
-			sym[i].st_value += secbase;
+				sym[i].st_value += secbase;
+			}
 			break;
 		}
 	}
@@ -2899,11 +2919,29 @@ int __weak module_frob_arch_sections(Elf_Ehdr *hdr,
 	return 0;
 }
 
+static int allocate_percpu(struct module *mod, struct load_info *info)
+{
+	Elf_Shdr *pcpusec;
+	int err;
+
+	pcpusec = &info->sechdrs[info->index.pcpu];
+	if (pcpusec->sh_size) {
+		/* We have a special allocation for this section. */
+		pr_debug("module %s attempting to percpu with size %d\n",
+			 mod->name, pcpusec->sh_size);
+		err = percpu_modalloc(mod,
+				      pcpusec->sh_size,