Re: [PATCH] module, fix percpu reserved memory exhaustion

2013-01-14 Thread Tejun Heo
Hello, Rusty.

On Sat, Jan 12, 2013 at 11:36:09AM +1030, Rusty Russell wrote:
> > I've been thinking about that.  The problem is that at the same time the kvm
> > problem occurs I'm attempting to load a debug module that I've written to debug
> > some cpu timer issues that allocates a large amount of percpu data (~.5K/cpu).
> > While extending PERCPU_MODULE_RESERVE to 10k might work now, it might not work
> > tomorrow if I have the need to increase the size of my log buffer.
> 
> Well, it looks like PERCPU_MODULE_RESERVE is actually very generous; it
> could easily be halved.  I guess this is because dynamic per-cpu data is
> now such a nice alternative (thanks to Tejun).

Heh, thanks. :)

There are two percpu constants that I basically pulled out of my ass.

* PERCPU_MODULE_RESERVE: It's used on architectures whose data model
  restricts the distance between code and data.  e.g. x86_64
  generates code which can't address the full 64bit range for static
  data, and as percpu addressing depends on the regular address
  generation to derive percpu addresses, the percpu offsets need to
  be restricted somehow.  The percpu allocator currently achieves
  this by reserving a certain amount in the first percpu chunk,
  which is guaranteed to have very low (starting from 0) percpu
  address space offsets.

  On those archs, this becomes the hard limit for all module static
  percpu areas.  Maybe 8k is way too high and we can go with 4k.
  Worth a try I guess.

* PERCPU_DYNAMIC_RESERVE: The amount of space at the end of the first
  chunk used for dynamic allocation.  This isn't as critical as the
  above and the only reason it exists is because the first chunk is
  often embedded in the kernel linear address space instead of vmalloc
  area and the former is often mapped with larger page table size than
  the latter.  The goal of PERCPU_DYNAMIC_RESERVE is to have just
  enough space to cover the usual percpu allocations without wasting
  space, so as to minimize the impact on TLB pressure.

  The current number isn't entirely made up.  I did some testing with
  some random config.  Don't know how it holds up now.  It probably
  can be increased, given that quite a few dynamic percpu users have
  been added since then.

  It would be awesome to have a way to somehow dynamically adjust this
  one; unfortunately, that would require persistent data across
  boots. :(
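To make the reserved/dynamic split above concrete, here is a toy user-space model (names and sizes are invented for illustration; this is not the kernel allocator): module static areas carve low offsets out of a fixed reserve at the start of the first chunk, dynamic allocations come from the region after it, and the two pools exhaust independently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the first percpu chunk, for illustration only.
 * TOY_MODULE_RESERVE stands in for PERCPU_MODULE_RESERVE,
 * TOY_DYNAMIC_RESERVE for PERCPU_DYNAMIC_RESERVE; the real
 * allocator is far more involved.
 */
#define TOY_MODULE_RESERVE  8192
#define TOY_DYNAMIC_RESERVE 20480

static size_t toy_reserved_used;
static size_t toy_dynamic_used;

/* Module static percpu data: offsets start at 0 and must stay low. */
static long toy_alloc_reserved(size_t size)
{
	if (toy_reserved_used + size > TOY_MODULE_RESERVE)
		return -1;	/* reserve exhausted: the module load fails */
	toy_reserved_used += size;
	return (long)(toy_reserved_used - size);
}

/* Dynamic percpu data: offsets start after the reserve. */
static long toy_alloc_dynamic(size_t size)
{
	if (toy_dynamic_used + size > TOY_DYNAMIC_RESERVE)
		return -1;
	toy_dynamic_used += size;
	return (long)(TOY_MODULE_RESERVE + toy_dynamic_used - size);
}
```

The property the distance-limited code models depend on is visible in the return values: reserved offsets start at 0 and stay below the reserve size, while dynamic offsets can grow arbitrarily without affecting module static addressing.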

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] module, fix percpu reserved memory exhaustion

2013-01-14 Thread Prarit Bhargava


On 01/11/2013 08:06 PM, Rusty Russell wrote:
> Prarit Bhargava  writes:
>> On 01/10/2013 10:48 PM, Rusty Russell wrote:
>> The timing were similar.  I didn't see any huge delays, etc.  Can the
>> relocations really cause a long delay?  I thought we were pretty much writing
>> values to memory...
> 
> For x86 that's true, but look at what ppc64 has to do for example.  I'm
> guessing you don't have a giant Nvidia proprietary driver module
> loading, either.

Ah -- I see.  I hadn't thought much about the other arches and I see what ppc64
does ...

> 
>> [I should point out that I'm booting a 32 physical/64 logical, with 64GB of 
>> memory]
> 
> I figured it had to be something big ;)

:)  Imagine what happens at 4096 cpus (SGI territory).  I'm wondering about that
kvm commit.  Maybe the systemd/udev rule needs to be rewritten to avoid a 'kvm
loading flood' during boot ... I'll talk with Kay Sievers about it to see if
there's a way around that.

> 
> OTOH, Tested-by: means it actually fixed someone's problem.

Got it.  For the record over-the-weekend testing didn't show any bizarre
results.  The boot times were all around 20-23 seconds.

Tested-by: Prarit Bhargava 

P.




Re: [PATCH] module, fix percpu reserved memory exhaustion

2013-01-11 Thread Rusty Russell
Prarit Bhargava  writes:
> On 01/10/2013 10:48 PM, Rusty Russell wrote:
> The timing were similar.  I didn't see any huge delays, etc.  Can the
> relocations really cause a long delay?  I thought we were pretty much writing
> values to memory...

For x86 that's true, but look at what ppc64 has to do for example.  I'm
guessing you don't have a giant Nvidia proprietary driver module
loading, either.

It just makes me nervous; this kind of boot slowdown typically won't get
diagnosed for several releases, if ever.  Now I've done the work, I'm
going to apply my patch (with an additional fix: I forgot to change
kgdb, which traverses the module list too).

> [I should point out that I'm booting a 32 physical/64 logical, with 64GB of 
> memory]

I figured it had to be something big ;)

>> We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
>> allmodconfig build, there are only three modules with per-cpu data,
>> totalling 328 bytes.  So it's not reasonable to increase that number to
>> paper over this.
>
> I've been thinking about that.  The problem is that at the same time the kvm
> problem occurs I'm attempting to load a debug module that I've written to debug
> some cpu timer issues that allocates a large amount of percpu data (~.5K/cpu).
> While extending PERCPU_MODULE_RESERVE to 10k might work now, it might not work
> tomorrow if I have the need to increase the size of my log buffer.

Well, it looks like PERCPU_MODULE_RESERVE is actually very generous; it
could easily be halved.  I guess this is because dynamic per-cpu data is
now such a nice alternative (thanks to Tejun).

> 
>
> Tested-by: Prarit Bhargava 
>
> Rusty, you can change that to an Acked-by if you prefer that.  I know some
> engineers prefer one over the other.  I'll also continue doing some reboot
> testing and will email back in a few days to let you know what the timing
> looks like.

There seem to be two kinds of Acked-by:

1) Acked-by: maintainer.  ie. "this should go through my tree, but
   it's going via someone else".  I like this: shows the normal
   maintainer is aware of the change.

2) Acked-by: random. ie. "I like the concept of the patch though I
   haven't actually read it or tested it".  Completely useless.

OTOH, Tested-by: means it actually fixed someone's problem.

Thanks!
Rusty.


Re: [PATCH] module, fix percpu reserved memory exhaustion

2013-01-11 Thread Prarit Bhargava


On 01/10/2013 10:48 PM, Rusty Russell wrote:
> Prarit Bhargava  writes:
>> [   15.478160] kvm: Could not allocate 304 bytes percpu data
>> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc
>> from reserved chunk failed
> ...
>> What is happening is systemd is loading an instance of the kvm module for
>> each cpu found (see commit e9bda3b).  When the module load occurs the kernel
>> currently allocates the modules percpu data area prior to checking to see
>> if the module is already loaded or is in the process of being loaded.  If
>> the module is already loaded, or finishes load, the module loading code
>> releases the current instance's module's percpu data.
> 
> Wow, what a cool bug!  Classic unforseen side-effect.
> 
> I'd prefer not to do relocations with the module_lock held: it can be
> relatively slow.  Yet we can't do relocations before the per-cpu
> allocation, obviously.  Did you do boot timings before and after?

Heh ... I did! :)  I had a lot of concerns about moving the mutex around, so I
put in a print at the end of boot to see how long the boot time actually was.

From stock kernel:

[  22.893015] PRARIT: FINAL BOOT MESSAGE

From stock kernel + my patch:

[  22.673214] PRARIT: FINAL BOOT MESSAGE

Both kernel boots showed the problem with kvm loading.  A quick grep through my
bootlogs of stock kernel + my patch don't show anything greater than 23.539392
and less than 20.980321.  Those numbers are similar to the numbers from the
stock kernel (23.569450 - 20.898321).

ie) I don't think there's an increase due to calling the relocation under the
module mutex, and if there is it is definitely lost within the noise of boot.

The timings were similar.  I didn't see any huge delays, etc.  Can the
relocations really cause a long delay?  I thought we were pretty much writing
values to memory...

[I should point out that I'm booting a 32 physical/64 logical, with 64GB of 
memory]

> 
> An alternative would be to put the module into the list even earlier
> (say, just after layout_and_allocate) so we could block on concurrent
> loads at that point.  But then we have to make sure noone looks in the
> module too early before it's completely set up, and that's complicated
> and error-prone too.  A separate list is kind of icky.

Yeah -- that was my first attempt actually, and it got very complex very
quickly.  I abandoned that approach in favor of moving the percpu allocations
under the lock.  I thought that was likely the easiest approach.

> 
> We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
> allmodconfig build, there are only three modules with per-cpu data,
> totalling 328 bytes.  So it's not reasonable to increase that number to
> paper over this.

I've been thinking about that.  The problem is that at the same time the kvm
problem occurs I'm attempting to load a debug module that I've written to debug
some cpu timer issues that allocates a large amount of percpu data (~.5K/cpu).
While extending PERCPU_MODULE_RESERVE to 10k might work now, it might not work
tomorrow if I have the need to increase the size of my log buffer.

... that is ;), I prefer your and my approach of fixing this problem.

> 
> This is what a new boot state looks like (pains not to break ksplice).
> It's two patches, but I'll just post them back to back:
> 
> module: add new state MODULE_STATE_UNFORMED
> 
> You should never look at such a module, so it's excised from all paths
> which traverse the modules list.
> 
> We add the state at the end, to avoid gratuitous ABI break (ksplice).
> 
> Signed-off-by: Rusty Russell 
> 


Sure, but I'm always nervous about expanding any state machine ;).  That's just
me though :).

> 
> module: put modules in list much earlier.
> 
> Prarit's excellent bug report:
>> In recent Fedora releases (F17 & F18) some users have reported seeing
>> messages similar to
>>
>> [   15.478160] kvm: Could not allocate 304 bytes percpu data
>> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc from
>> reserved chunk failed
>>
>> during system boot.  In some cases, users have also reported seeing this
>> message along with a failed load of other modules.
>>
>> What is happening is systemd is loading an instance of the kvm module for
>> each cpu found (see commit e9bda3b).  When the module load occurs the kernel
>> currently allocates the modules percpu data area prior to checking to see
>> if the module is already loaded or is in the process of being loaded.  If
>> the module is already loaded, or finishes load, the module loading code
>> releases the current instance's module's percpu data.
> 
> Now we have a new state MODULE_STATE_UNFORMED, we can insert the
> module into the list (and thus guarantee its uniqueness) before we
> allocate the per-cpu region.
> 
> Reported-by: Prarit Bhargava 
> Signed-off-by: Rusty Russell 
> 



Tested-by: Prarit Bhargava 

Rusty, you can change that to an Acked-by if you prefer that.  I know some
engineers prefer one over the other.  I'll also continue doing some reboot
testing and will email back in a few days to let you know what the timing
looks like.



Re: [PATCH] module, fix percpu reserved memory exhaustion

2013-01-10 Thread Rusty Russell
Prarit Bhargava  writes:
> [   15.478160] kvm: Could not allocate 304 bytes percpu data
> [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc
> from reserved chunk failed
...
> What is happening is systemd is loading an instance of the kvm module for
> each cpu found (see commit e9bda3b).  When the module load occurs the kernel
> currently allocates the modules percpu data area prior to checking to see
> if the module is already loaded or is in the process of being loaded.  If
> the module is already loaded, or finishes load, the module loading code
> releases the current instance's module's percpu data.

Wow, what a cool bug!  Classic unforeseen side-effect.

I'd prefer not to do relocations with the module_lock held: it can be
relatively slow.  Yet we can't do relocations before the per-cpu
allocation, obviously.  Did you do boot timings before and after?

An alternative would be to put the module into the list even earlier
(say, just after layout_and_allocate) so we could block on concurrent
loads at that point.  But then we have to make sure no one looks in the
module too early before it's completely set up, and that's complicated
and error-prone too.  A separate list is kind of icky.
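The early-insertion idea can be sketched in user-space C (a toy with invented names, not the kernel code): under the equivalent of module_mutex, a loader first looks for its name in the list, placeholders included, and bails out before the expensive percpu allocation; only the first loader ever reaches the allocation.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/*
 * Toy sketch of "insert a placeholder early".  In the real patch the
 * placeholder is a struct module in MODULE_STATE_UNFORMED and
 * concurrent loaders are serialized by module_mutex; here the driver
 * is single-threaded and the mutex just mirrors the locking shape.
 */
#define TOY_MAX_MODULES 16

static pthread_mutex_t toy_module_mutex = PTHREAD_MUTEX_INITIALIZER;
static const char *toy_modules[TOY_MAX_MODULES];
static int toy_nr_modules;
static int toy_percpu_allocs;	/* times the expensive path ran */

static int toy_load_module(const char *name)
{
	int i;

	pthread_mutex_lock(&toy_module_mutex);
	for (i = 0; i < toy_nr_modules; i++) {
		if (strcmp(toy_modules[i], name) == 0) {
			pthread_mutex_unlock(&toy_module_mutex);
			return -1;	/* duplicate: nothing allocated */
		}
	}
	toy_modules[toy_nr_modules++] = name;	/* the "unformed" placeholder */
	pthread_mutex_unlock(&toy_module_mutex);

	toy_percpu_allocs++;	/* stands in for percpu alloc + relocation */
	return 0;
}
```

Driving this with one toy_load_module("kvm") per logical cpu leaves toy_percpu_allocs at 1, instead of one reserved-chunk allocation per duplicate attempt.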

We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
allmodconfig build, there are only three modules with per-cpu data,
totalling 328 bytes.  So it's not reasonable to increase that number to
paper over this.
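For scale, some back-of-envelope arithmetic using the numbers from this thread (32-byte alignment padding ignored): the 8k reserve only covers 26 simultaneous 304-byte kvm allocations, well short of one load attempt per logical cpu on a 64-way box.

```c
#include <assert.h>

/* How many same-sized reserved allocations fit in the reserve? */
static int loads_that_fit(int reserve_bytes, int bytes_per_load)
{
	return reserve_bytes / bytes_per_load;
}

/* How big would the reserve have to be to absorb them all? */
static int reserve_needed(int nr_loads, int bytes_per_load)
{
	return nr_loads * bytes_per_load;
}
```

Absorbing all 64 attempts would take reserve_needed(64, 304) = 19456 bytes, which is why bumping the constant only papers over the race.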

This is what a new boot state looks like (pains not to break ksplice).
It's two patches, but I'll just post them back to back:

module: add new state MODULE_STATE_UNFORMED.

You should never look at such a module, so it's excised from all paths
which traverse the modules list.

We add the state at the end, to avoid gratuitous ABI break (ksplice).

Signed-off-by: Rusty Russell 

diff --git a/include/linux/module.h b/include/linux/module.h
index 7760c6d..4432373 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -199,11 +199,11 @@ struct module_use {
struct module *source, *target;
 };
 
-enum module_state
-{
-   MODULE_STATE_LIVE,
-   MODULE_STATE_COMING,
-   MODULE_STATE_GOING,
+enum module_state {
+   MODULE_STATE_LIVE,  /* Normal state. */
+   MODULE_STATE_COMING,/* Full formed, running module_init. */
+   MODULE_STATE_GOING, /* Going away. */
+   MODULE_STATE_UNFORMED,  /* Still setting it up. */
 };
 
 /**
diff --git a/kernel/module.c b/kernel/module.c
index 41bc118..c3a2ee8 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -188,6 +188,7 @@ struct load_info {
ongoing or failed initialization etc. */
 static inline int strong_try_module_get(struct module *mod)
 {
+   BUG_ON(mod && mod->state == MODULE_STATE_UNFORMED);
if (mod && mod->state == MODULE_STATE_COMING)
return -EBUSY;
if (try_module_get(mod))
@@ -343,6 +344,9 @@ bool each_symbol_section(bool (*fn)(const struct symsearch *arr,
 #endif
};
 
+   if (mod->state == MODULE_STATE_UNFORMED)
+   continue;
+
if (each_symbol_in_section(arr, ARRAY_SIZE(arr), mod, fn, data))
return true;
}
@@ -450,16 +454,24 @@ const struct kernel_symbol *find_symbol(const char *name,
 EXPORT_SYMBOL_GPL(find_symbol);
 
 /* Search for module by name: must hold module_mutex. */
-struct module *find_module(const char *name)
+static struct module *find_module_all(const char *name,
+ bool even_unformed)
 {
struct module *mod;
 
list_for_each_entry(mod, &modules, list) {
+   if (!even_unformed && mod->state == MODULE_STATE_UNFORMED)
+   continue;
if (strcmp(mod->name, name) == 0)
return mod;
}
return NULL;
 }
+
+struct module *find_module(const char *name)
+{
+   return find_module_all(name, false);
+}
 EXPORT_SYMBOL_GPL(find_module);
 
 #ifdef CONFIG_SMP
@@ -525,6 +537,8 @@ bool is_module_percpu_address(unsigned long addr)
preempt_disable();
 
list_for_each_entry_rcu(mod, &modules, list) {
+   if (mod->state == MODULE_STATE_UNFORMED)
+   continue;
if (!mod->percpu_size)
continue;
for_each_possible_cpu(cpu) {
@@ -1048,6 +1062,8 @@ static ssize_t show_initstate(struct module_attribute *mattr,
case MODULE_STATE_GOING:
state = "going";
break;
+   default:
+   BUG();
}
return sprintf(buffer, "%s\n", state);
 }
@@ -1786,6 +1802,8 @@ void set_all_modules_text_rw(void)
 
mutex_lock(&module_mutex);
list_for_each_entry_rcu(mod, , list) {
+   if (mod->state == MODULE_STATE_UNFORMED)
+   continue;
if ((mod->module_core) && (mod->core_text_size)) {
set_page_attributes(mod->module_core,

Re: [PATCH] module, fix percpu reserved memory exhaustion

2013-01-10 Thread Rusty Russell
Prarit Bhargava pra...@redhat.com writes:
 [   15.478160] kvm: Could not allocate 304 bytes percpu data
 [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc
 from reserved chunk failed
...
 What is happening is systemd is loading an instance of the kvm module for
 each cpu found (see commit e9bda3b).  When the module load occurs the kernel
 currently allocates the modules percpu data area prior to checking to see
 if the module is already loaded or is in the process of being loaded.  If
 the module is already loaded, or finishes load, the module loading code
 releases the current instance's module's percpu data.

Wow, what a cool bug!  Classic unforseen side-effect.

I'd prefer not to do relocations with the module_lock held: it can be
relatively slow.  Yet we can't do relocations before the per-cpu
allocation, obviously.  Did you do boot timings before and after?

An alternative would be to put the module into the list even earlier
(say, just after layout_and_allocate) so we could block on concurrent
loads at that point.  But then we have to make sure noone looks in the
module too early before it's completely set up, and that's complicated
and error-prone too.  A separate list is kind of icky.

We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
allmodconfig build, there are only three modules with per-cpu data,
totalling 328 bytes.  So it's not reasonable to increase that number to
paper over this.

This is what a new boot state looks like (pains not to break ksplice).
It's two patches, but I'll just post them back to back:

module: add new state MODULE_STATE_UNFORMED.

You should never look at such a module, so it's excised from all paths
which traverse the modules list.

We add the state at the end, to avoid gratuitous ABI break (ksplice).

Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/include/linux/module.h b/include/linux/module.h
index 7760c6d..4432373 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -199,11 +199,11 @@ struct module_use {
struct module *source, *target;
 };
 
-enum module_state
-{
-   MODULE_STATE_LIVE,
-   MODULE_STATE_COMING,
-   MODULE_STATE_GOING,
+enum module_state {
+   MODULE_STATE_LIVE,  /* Normal state. */
+   MODULE_STATE_COMING,/* Full formed, running module_init. */
+   MODULE_STATE_GOING, /* Going away. */
+   MODULE_STATE_UNFORMED,  /* Still setting it up. */
 };
 
 /**
diff --git a/kernel/module.c b/kernel/module.c
index 41bc118..c3a2ee8 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -188,6 +188,7 @@ struct load_info {
ongoing or failed initialization etc. */
 static inline int strong_try_module_get(struct module *mod)
 {
+   BUG_ON(mod  mod-state == MODULE_STATE_UNFORMED);
if (mod  mod-state == MODULE_STATE_COMING)
return -EBUSY;
if (try_module_get(mod))
@@ -343,6 +344,9 @@ bool each_symbol_section(bool (*fn)(const struct symsearch 
*arr,
 #endif
};
 
+   if (mod-state == MODULE_STATE_UNFORMED)
+   continue;
+
if (each_symbol_in_section(arr, ARRAY_SIZE(arr), mod, fn, data))
return true;
}
@@ -450,16 +454,24 @@ const struct kernel_symbol *find_symbol(const char *name,
 EXPORT_SYMBOL_GPL(find_symbol);
 
 /* Search for module by name: must hold module_mutex. */
-struct module *find_module(const char *name)
+static struct module *find_module_all(const char *name,
+ bool even_unformed)
 {
struct module *mod;
 
	list_for_each_entry(mod, &modules, list) {
+   if (!even_unformed && mod->state == MODULE_STATE_UNFORMED)
+   continue;
	if (strcmp(mod->name, name) == 0)
return mod;
}
return NULL;
 }
+
+struct module *find_module(const char *name)
+{
+   return find_module_all(name, false);
+}
 EXPORT_SYMBOL_GPL(find_module);
 
 #ifdef CONFIG_SMP
@@ -525,6 +537,8 @@ bool is_module_percpu_address(unsigned long addr)
preempt_disable();
 
	list_for_each_entry_rcu(mod, &modules, list) {
+   if (mod->state == MODULE_STATE_UNFORMED)
+   continue;
	if (!mod->percpu_size)
continue;
for_each_possible_cpu(cpu) {
@@ -1048,6 +1062,8 @@ static ssize_t show_initstate(struct module_attribute *mattr,
case MODULE_STATE_GOING:
	state = "going";
break;
+   default:
+   BUG();
}
	return sprintf(buffer, "%s\n", state);
 }
@@ -1786,6 +1802,8 @@ void set_all_modules_text_rw(void)
 
	mutex_lock(&module_mutex);
	list_for_each_entry_rcu(mod, &modules, list) {
+   if (mod->state == MODULE_STATE_UNFORMED)
+   continue;
	if ((mod->module_core) && (mod->core_text_size)) {

[PATCH] module, fix percpu reserved memory exhaustion

2013-01-09 Thread Prarit Bhargava
Rusty,

There is likely some subtlety of moving the module mutex that I'm unaware of.
What I can say is that this patch seems to resolve the problem for me, or at
least through 100+ reboots I have not seen the problem (I'm still testing
as I write this).

I'm more than willing to hear an alternative approach, or test an alternative
patch.

Thanks,

P.

8<

In recent Fedora releases (F17 & F18) some users have reported seeing
messages similar to

[   15.478121] Pid: 727, comm: systemd-udevd Tainted: GF 3.8.0-rc2+ #1
[   15.478121] Call Trace:
[   15.478131]  [<ffffffff81153001>] pcpu_alloc+0xa01/0xa60
[   15.478137]  [<ffffffff8163d8b0>] ? printk+0x61/0x63
[   15.478140]  [<ffffffff811532f3>] __alloc_reserved_percpu+0x13/0x20
[   15.478145]  [<ffffffff810c42e2>] load_module+0x1dc2/0x20b0
[   15.478150]  [<ffffffff8164b21e>] ? do_page_fault+0xe/0x10
[   15.478152]  [<ffffffff81647858>] ? page_fault+0x28/0x30
[   15.478155]  [<ffffffff810c46a7>] sys_init_module+0xd7/0x120
[   15.478159]  [<ffffffff8164f859>] system_call_fastpath+0x16/0x1b
[   15.478160] kvm: Could not allocate 304 bytes percpu data
[   15.478174] PERCPU: allocation failed, size=304 align=32, alloc from reserved chunk failed

during system boot.  In some cases, users have also reported seeing this
message along with a failed load of other modules.

As the message indicates, the reserved chunk of percpu memory (where
modules allocate their memory) is exhausted.  A debug printk inserted in
the code shows

[   15.478533] PRARIT size = 304 > chunk->contig_hint = 208

i.e., the reserved chunk of percpu memory has only 208 bytes of available space.

What is happening is that systemd loads an instance of the kvm module for
each cpu found (see commit e9bda3b).  When a module load occurs, the kernel
currently allocates the module's percpu data area before checking whether
the module is already loaded or is in the process of being loaded.  If the
module is already loaded, or finishes loading first, the loading code
releases the current instance's percpu data.

The problem is that these module loads race and it is possible that all of
the percpu reserved area is consumed by repeated loads of the same module
which results in the failure of other drivers to load.

This patch moves the module percpu allocation after the check for an
existing instance of the module.

Signed-off-by: Prarit Bhargava pra...@redhat.com
Cc: Rusty Russell ru...@rustcorp.com.au
Cc: Mike Galbraith efa...@gmx.de
---
 kernel/module.c |  124 ++-
 1 file changed, 85 insertions(+), 39 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 250092c..e7e9b57 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1929,6 +1929,27 @@ static int verify_export_symbols(struct module *mod)
return 0;
 }
 
+static void simplify_percpu_symbols(struct module *mod,
+   const struct load_info *info)
+{
+   Elf_Shdr *symsec = &info->sechdrs[info->index.sym];
+   Elf_Sym *sym = (void *)symsec->sh_addr;
+   unsigned long secbase;
+   unsigned int i;
+
+   /*
+* No need for error checking in this function because
+* simplify_symbols has already been called.
+*/
+   for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) {
+   /* Divert to percpu allocation if a percpu var. */
+   if (sym[i].st_shndx == info->index.pcpu) {
+   secbase = (unsigned long)mod_percpu(mod);
+   sym[i].st_value += secbase;
+   }
+   }
+}
+
 /* Change all symbols so that st_value encodes the pointer directly. */
 static int simplify_symbols(struct module *mod, const struct load_info *info)
 {
@@ -1976,12 +1997,11 @@ static int simplify_symbols(struct module *mod, const struct load_info *info)
break;
 
default:
-   /* Divert to percpu allocation if a percpu var. */
-   if (sym[i].st_shndx == info->index.pcpu)
-   secbase = (unsigned long)mod_percpu(mod);
-   else
+   /* percpu diverts handled in simplify_percpu_symbols */
+   if (sym[i].st_shndx != info->index.pcpu) {
	secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
-   sym[i].st_value += secbase;
+   sym[i].st_value += secbase;
+   }
break;
}
}
@@ -2899,11 +2919,29 @@ int __weak module_frob_arch_sections(Elf_Ehdr *hdr,
return 0;
 }
 
+static int allocate_percpu(struct module *mod, struct load_info *info)
+{
+   Elf_Shdr *pcpusec;
+   int err;
+
+   pcpusec = &info->sechdrs[info->index.pcpu];
+   if (pcpusec->sh_size) {
+   /* We have a special allocation for this section. */
+   pr_debug("module %s attempting to percpu with size %d\n",
+mod->name, pcpusec->sh_size);
+   err = percpu_modalloc(mod,
+ pcpusec->sh_size, 
