Re: 4.0.0-rc4: panic in free_block

2015-03-24 Thread David Miller
From: Bob Picco Date: Tue, 24 Mar 2015 10:57:53 -0400 > Seems solid with 2.6.39 on M7-4. Jalapeño is happy with current sparc.git. Thanks for all the testing, it's been integrated into the -stable queues as well. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the

Re: 4.0.0-rc4: panic in free_block

2015-03-24 Thread Bob Picco
David Miller wrote: [Mon Mar 23 2015, 12:25:30PM EDT] > From: David Miller > Date: Sun, 22 Mar 2015 22:19:06 -0400 (EDT) > > > I'll work on a fix. > > Ok, here is what I committed. David et al., let me know if you still > see the crashes with this applied. > > Of course, I'll queue this

Re: 4.0.0-rc4: panic in free_block

2015-03-24 Thread David Miller
From: Bob Picco bpi...@meloft.net Date: Tue, 24 Mar 2015 10:57:53 -0400 Seems solid with 2.6.39 on M7-4. Jalapeño is happy with current sparc.git. Thanks for all the testing, it's been integrated into the -stable queues as well.

Re: 4.0.0-rc4: panic in free_block

2015-03-24 Thread Bob Picco
David Miller wrote: [Mon Mar 23 2015, 12:25:30PM EDT] From: David Miller da...@davemloft.net Date: Sun, 22 Mar 2015 22:19:06 -0400 (EDT) I'll work on a fix. Ok, here is what I committed. David et al., let me know if you still see the crashes with this applied. Of course, I'll

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Ahern
On 3/23/15 1:35 PM, David Miller wrote: From: David Ahern Date: Mon, 23 Mar 2015 11:34:34 -0600 seems like a formality at this point, but this resolves the panic on the M7-based ldom and baremetal. The T5-8 failed to boot, but it could be a different problem. Specifically, does the T5-8

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Miller
From: "John Stoffel" Date: Mon, 23 Mar 2015 15:56:02 -0400 >> "David" == David Miller writes: > > David> From: "John Stoffel" > David> Date: Mon, 23 Mar 2015 12:51:03 -0400 > >>> Would it make sense to have some memmove()/memcopy() tests on bootup >>> to catch problems like this? I know

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread John Stoffel
> "David" == David Miller writes: David> From: "John Stoffel" David> Date: Mon, 23 Mar 2015 12:51:03 -0400 >> Would it make sense to have some memmove()/memcopy() tests on bootup >> to catch problems like this? I know this is a strange case, and >> probably not too common, but how hard

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Miller
From: Linus Torvalds Date: Mon, 23 Mar 2015 12:47:49 -0700 > On Mon, Mar 23, 2015 at 12:08 PM, David Miller wrote: >> >> Sure you could do that in C, but I really want to avoid using memcpy() >> if dst and src overlap in any way at all. >> >> Said another way, I don't want to codify that "64"

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread Linus Torvalds
On Mon, Mar 23, 2015 at 12:08 PM, David Miller wrote: > > Sure you could do that in C, but I really want to avoid using memcpy() > if dst and src overlap in any way at all. > > Said another way, I don't want to codify that "64" thing. The next > chip could do 128 byte initializing stores. But

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Miller
From: David Ahern Date: Mon, 23 Mar 2015 11:34:34 -0600 > seems like a formality at this point, but this resolves the panic on > the M7-based ldom and baremetal. The T5-8 failed to boot, but it could > be a different problem. Specifically, does the T5-8 boot without my patch applied?

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Miller
From: "John Stoffel" Date: Mon, 23 Mar 2015 12:51:03 -0400 > Would it make sense to have some memmove()/memcopy() tests on bootup > to catch problems like this? I know this is a strange case, and > probably not too common, but how hard would it be to wire up tests > that go through 1 to 128
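
For illustration, a check along the lines John suggests might look like the sketch below. It is only a guess at what such a test could do: the name memmove_selftest(), the reading of "1 to 128" as copy lengths in bytes, the offset range and the fill pattern are all assumptions made for this example, and it exercises whatever memmove() the C library provides rather than the sparc64 assembly discussed in the thread.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LEN  128                    /* assumed: lengths 1..128 bytes */
#define MAX_OFF  128                    /* assumed: offsets 0..128 bytes */
#define BUF_SIZE (MAX_LEN + MAX_OFF)

/* Fill a buffer with a non-repeating-looking byte pattern. */
static void fill_pattern(unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (unsigned char)(i * 7 + 1);
}

/*
 * Hypothetical self-test: for every length and offset, run memmove() in
 * both overlap directions and compare against a result built from an
 * untouched reference copy. Returns the number of failing cases.
 */
int memmove_selftest(void)
{
    unsigned char buf[BUF_SIZE], ref[BUF_SIZE], expect[BUF_SIZE];
    int failures = 0;

    fill_pattern(ref, BUF_SIZE);

    for (size_t len = 1; len <= MAX_LEN; len++) {
        for (size_t off = 0; off <= MAX_OFF; off++) {
            /* dst below src: memmove(p, p + off, len) */
            memcpy(buf, ref, BUF_SIZE);
            memcpy(expect, ref, BUF_SIZE);
            for (size_t i = 0; i < len; i++)
                expect[i] = ref[off + i];
            memmove(buf, buf + off, len);
            if (memcmp(buf, expect, BUF_SIZE) != 0)
                failures++;

            /* dst above src: memmove(p + off, p, len) */
            memcpy(buf, ref, BUF_SIZE);
            memcpy(expect, ref, BUF_SIZE);
            for (size_t i = 0; i < len; i++)
                expect[off + i] = ref[i];
            memmove(buf + off, buf, len);
            if (memcmp(buf, expect, BUF_SIZE) != 0)
                failures++;
        }
    }
    return failures;
}
```

Comparing the whole buffer, not just the destination range, also catches an implementation that writes outside the bytes it was asked to move.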

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Miller
From: Linus Torvalds Date: Mon, 23 Mar 2015 10:00:02 -0700 > Maybe the code could be something like > > void *memmove(void *dst, const void *src, size_t n) > { > // non-overlapping cases > if (src + n <= dst) > return memcpy(dst, src, n); > if (dst +
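
The quoted suggestion is cut off above, so what follows is not Linus's code. It is a minimal sketch of the general idea, assuming the rest would hand only non-overlapping ranges to memcpy() and copy byte by byte otherwise. The name sketch_memmove() and the uintptr_t comparisons are choices made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in; not the kernel's memmove(). */
void *sketch_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uintptr_t da = (uintptr_t)dst;
    uintptr_t sa = (uintptr_t)src;

    /* Non-overlapping cases: memcpy() is allowed to copy in any order. */
    if (sa + n <= da || da + n <= sa)
        return memcpy(dst, src, n);

    if (da < sa) {
        /* Overlap with dst below src: copy low-to-high. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if (da > sa) {
        /* Overlap with dst above src: copy high-to-low. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    /* da == sa: nothing to move. */
    return dst;
}
```

Written this way, memcpy() never sees overlapping buffers, so the result does not depend on which direction or store width a particular memcpy() implementation uses.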

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Ahern
On 3/23/15 10:25 AM, David Miller wrote: [PATCH] sparc64: Fix several bugs in memmove(). Firstly, handle zero length calls properly. Believe it or not there are a few of these happening during early boot. Next, we can't just drop to a memcpy() call in the forward copy case where dst <= src.

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread Linus Torvalds
On Mon, Mar 23, 2015 at 9:25 AM, David Miller wrote: > > Ok, here is what I committed. So I wonder - looking at that assembly, I get the feeling that it isn't any better code than gcc could generate from simple C code. Would it perhaps be better to turn memmove() into C? That's particularly

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread John Stoffel
David> David> [PATCH] sparc64: Fix several bugs in memmove(). David> Firstly, handle zero length calls properly. Believe it or not there David> are a few of these happening during early boot. David> Next, we can't just drop to a memcpy() call in the forward copy case

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Miller
From: David Miller Date: Sun, 22 Mar 2015 22:19:06 -0400 (EDT) > I'll work on a fix. Ok, here is what I committed. David et al., let me know if you still see the crashes with this applied. Of course, I'll queue this up for -stable as well. Thanks! [PATCH] sparc64: Fix

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Miller
From: John Stoffel j...@stoffel.org Date: Mon, 23 Mar 2015 12:51:03 -0400 Would it make sense to have some memmove()/memcopy() tests on bootup to catch problems like this? I know this is a strange case, and probably not too common, but how hard would it be to wire up tests that go through 1

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread John Stoffel
"David" == David Miller da...@davemloft.net writes: David> From: John Stoffel j...@stoffel.org David> Date: Mon, 23 Mar 2015 12:51:03 -0400 Would it make sense to have some memmove()/memcopy() tests on bootup to catch problems like this? I know this is a strange case, and probably not too

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Miller
From: John Stoffel j...@stoffel.org Date: Mon, 23 Mar 2015 15:56:02 -0400 "David" == David Miller da...@davemloft.net writes: David> From: John Stoffel j...@stoffel.org David> Date: Mon, 23 Mar 2015 12:51:03 -0400 Would it make sense to have some memmove()/memcopy() tests on bootup to catch

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread John Stoffel
David> David> [PATCH] sparc64: Fix several bugs in memmove(). David> Firstly, handle zero length calls properly. Believe it or not there David> are a few of these happening during early boot. David> Next, we can't just drop to a memcpy() call in the forward copy case David>

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread Linus Torvalds
On Mon, Mar 23, 2015 at 12:08 PM, David Miller da...@davemloft.net wrote: Sure you could do that in C, but I really want to avoid using memcpy() if dst and src overlap in any way at all. Said another way, I don't want to codify that 64 thing. The next chip could do 128 byte initializing

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Miller
From: Linus Torvalds torva...@linux-foundation.org Date: Mon, 23 Mar 2015 12:47:49 -0700 On Mon, Mar 23, 2015 at 12:08 PM, David Miller da...@davemloft.net wrote: Sure you could do that in C, but I really want to avoid using memcpy() if dst and src overlap in any way at all. Said another

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Ahern
On 3/23/15 1:35 PM, David Miller wrote: From: David Ahern david.ah...@oracle.com Date: Mon, 23 Mar 2015 11:34:34 -0600 seems like a formality at this point, but this resolves the panic on the M7-based ldom and baremetal. The T5-8 failed to boot, but it could be a different problem.

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Miller
From: Linus Torvalds torva...@linux-foundation.org Date: Mon, 23 Mar 2015 10:00:02 -0700 Maybe the code could be something like void *memmove(void *dst, const void *src, size_t n) { // non-overlapping cases if (src + n <= dst) return memcpy(dst, src,

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Miller
From: David Ahern david.ah...@oracle.com Date: Mon, 23 Mar 2015 11:34:34 -0600 seems like a formality at this point, but this resolves the panic on the M7-based ldom and baremetal. The T5-8 failed to boot, but it could be a different problem. Specifically, does the T5-8 boot without my patch

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Miller
From: David Miller da...@davemloft.net Date: Sun, 22 Mar 2015 22:19:06 -0400 (EDT) I'll work on a fix. Ok, here is what I committed. David et al., let me know if you still see the crashes with this applied. Of course, I'll queue this up for -stable as well. Thanks!

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread Linus Torvalds
On Mon, Mar 23, 2015 at 9:25 AM, David Miller da...@davemloft.net wrote: Ok, here is what I committed. So I wonder - looking at that assembly, I get the feeling that it isn't any better code than gcc could generate from simple C code. Would it perhaps be better to turn memmove() into C?

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread David Ahern
On 3/23/15 10:25 AM, David Miller wrote: [PATCH] sparc64: Fix several bugs in memmove(). Firstly, handle zero length calls properly. Believe it or not there are a few of these happening during early boot. Next, we can't just drop to a memcpy() call in the forward copy case where dst <= src.

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Miller
Never mind, I think I figured out the problem. It's the cache initializing stores; we can't do overlapping copies where dst <= src in all cases because of them. A store to an address that is a multiple of the cache line size (which for these instructions is 64 bytes) clears that whole line. But when we're
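
To make the failure mode concrete, here is a toy model in plain C. It does not reproduce the sparc64 copy routines or the real instructions. It assumes only what the message states: a store that lands on a 64-byte line boundary clears the whole line before writing. The names init_store8(), model_forward_copy() and model_copy_is_correct() and the 8-byte chunk size are made up for the sketch.

```c
#include <stddef.h>
#include <string.h>

#define LINE 64   /* line size named in the message */

/*
 * Modelled cache-initializing store: if addr is the start of a line,
 * the whole line is zeroed first, then 8 bytes are written.
 */
static void init_store8(unsigned char *mem, size_t addr,
                        const unsigned char *val)
{
    if (addr % LINE == 0)
        memset(mem + addr, 0, LINE);
    memcpy(mem + addr, val, 8);
}

/* Low-to-high copy in 8-byte chunks; n must be a multiple of 8. */
void model_forward_copy(unsigned char *mem, size_t dst, size_t src, size_t n)
{
    unsigned char tmp[8];

    for (size_t i = 0; i < n; i += 8) {
        memcpy(tmp, mem + src + i, 8);      /* load from source   */
        init_store8(mem, dst + i, tmp);     /* store, maybe wipes */
    }
}

/* Returns 1 if the modelled copy matches memmove() for this layout. */
int model_copy_is_correct(size_t dst, size_t src, size_t n)
{
    unsigned char a[512], b[512];

    for (size_t i = 0; i < sizeof(a); i++)
        a[i] = b[i] = (unsigned char)(i % 251 + 1);   /* never zero */

    model_forward_copy(a, dst, src, n);
    memmove(b + dst, b + src, n);
    return memcmp(a + dst, b + dst, n) == 0;
}
```

In this model, model_copy_is_correct(0, 8, 128) returns 0: the very first store hits offset 0 and wipes bytes 0..63, which still hold source data that has not been loaded yet. A non-overlapping layout such as model_copy_is_correct(0, 256, 128) returns 1.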

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Miller
From: David Ahern Date: Sun, 22 Mar 2015 18:03:30 -0600 > On 3/22/15 5:54 PM, David Miller wrote: >>> I just put it on 4.0.0-rc4 and ditto -- problem goes away, so it >>> clearly suggests the memcpy or memmove are the root cause. >> >> Thanks, didn't notice that. >> >> So, something is amuck. >

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Ahern
On 3/22/15 5:54 PM, David Miller wrote: I just put it on 4.0.0-rc4 and ditto -- problem goes away, so it clearly suggests the memcpy or memmove are the root cause. Thanks, didn't notice that. So, something is amuck. to continue to refine the problem ... I modified only the memmove lines

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Miller
From: Linus Torvalds Date: Sun, 22 Mar 2015 16:49:51 -0700 > On Sun, Mar 22, 2015 at 3:23 PM, David Miller wrote: >> >> Yes, using VIS how we do is alright, and in fact I did an audit of >> this about 1 year ago. This is another one of those "if this is >> wrong, so much stuff would break" >

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Miller
From: David Ahern Date: Sun, 22 Mar 2015 17:35:49 -0600 > I don't know if you caught Bob's message; he has a hack to bypass > memcpy and memmove in mm/slab.c and use a for loop to move entries. With > the hack he is not seeing the problem. > > This is the hack: > > +static void move_entries(void

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread Linus Torvalds
On Sun, Mar 22, 2015 at 3:23 PM, David Miller wrote: > > Yes, using VIS how we do is alright, and in fact I did an audit of > this about 1 year ago. This is another one of those "if this is > wrong, so much stuff would break" Maybe. But it does seem like Bob Picco has narrowed it down to

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Ahern
On 3/22/15 4:23 PM, David Miller wrote: I don't even know which version of memcpy ends up being used on M7. Some of them do things like use VIS. I can follow some regular sparc asm, there's no way I'm even *looking* at that. Is it really ok to use VIS registers in random contexts? Yes, using

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Miller
From: Linus Torvalds Date: Sun, 22 Mar 2015 12:47:08 -0700 > Which was why I was asking how sure you are that memcpy *always* > copies from low to high. Yeah I'm pretty sure. > I don't even know which version of memcpy ends up being used on M7. > Some of them do things like use VIS. I can

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread Linus Torvalds
On Sun, Mar 22, 2015 at 10:36 AM, David Miller wrote: > > And they end up using that byte-at-a-time code, since SLAB and SLUB > do memmove() calls of the form: > > memmove(X + N, X, LEN); Actually, the common case in slab is overlapping but of the form memmove(p, p+x, len); which
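
The memmove(p, p+x, len) form comes up when entries are dropped from the front of a pointer array and the remainder is slid down to index 0. The sketch below shows that pattern on a made-up structure; toy_array_cache, TOY_CAP and toy_drop_front() are hypothetical names for this example and are not the definitions in mm/slab.c.

```c
#include <stddef.h>
#include <string.h>

#define TOY_CAP 16

/* Made-up stand-in for a per-CPU array of cached object pointers. */
struct toy_array_cache {
    size_t avail;             /* number of valid entries */
    void *entry[TOY_CAP];
};

/* Drop the first `batch` entries and slide the rest down to index 0. */
void toy_drop_front(struct toy_array_cache *ac, size_t batch)
{
    if (batch > ac->avail)
        batch = ac->avail;
    ac->avail -= batch;

    /*
     * dst (entry[0]) is below src (entry[batch]); the two ranges
     * overlap whenever more than `batch` entries remain.
     */
    memmove(ac->entry, &ac->entry[batch], sizeof(void *) * ac->avail);
}
```

Here the destination sits below the source, which is the direction a forward-copying memmove() would pass straight to its memcpy() path.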

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread Bob Picco
David Miller wrote: [Sun Mar 22 2015, 01:36:03PM EDT] > From: Linus Torvalds > Date: Sat, 21 Mar 2015 11:49:12 -0700 > > > Davem? I don't read sparc assembly, so I'm *really* not going to try > > to verify that (a) all the memcpy implementations always copy > > low-to-high and (b) that I

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Miller
From: Linus Torvalds Date: Sat, 21 Mar 2015 11:49:12 -0700 > Davem? I don't read sparc assembly, so I'm *really* not going to try > to verify that (a) all the memcpy implementations always copy > low-to-high and (b) that I even read the address comparisons in > memmove.S right. All of the sparc

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Miller
From: Linus Torvalds torva...@linux-foundation.org Date: Sun, 22 Mar 2015 16:49:51 -0700 On Sun, Mar 22, 2015 at 3:23 PM, David Miller da...@davemloft.net wrote: Yes, using VIS how we do is alright, and in fact I did an audit of this about 1 year ago. This is another one of those if this is

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Miller
From: Linus Torvalds torva...@linux-foundation.org Date: Sat, 21 Mar 2015 11:49:12 -0700 Davem? I don't read sparc assembly, so I'm *really* not going to try to verify that (a) all the memcpy implementations always copy low-to-high and (b) that I even read the address comparisons in memmove.S

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread Bob Picco
David Miller wrote: [Sun Mar 22 2015, 01:36:03PM EDT] From: Linus Torvalds torva...@linux-foundation.org Date: Sat, 21 Mar 2015 11:49:12 -0700 Davem? I don't read sparc assembly, so I'm *really* not going to try to verify that (a) all the memcpy implementations always copy

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread Linus Torvalds
On Sun, Mar 22, 2015 at 10:36 AM, David Miller da...@davemloft.net wrote: And they end up using that byte-at-a-time code, since SLAB and SLUB do memmove() calls of the form: memmove(X + N, X, LEN); Actually, the common case in slab is overlapping but of the form memmove(p, p+x,

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Miller
From: David Ahern david.ah...@oracle.com Date: Sun, 22 Mar 2015 17:35:49 -0600 I don't know if you caught Bob's message; he has a hack to bypass memcpy and memmove in mm/slab.c and use a for loop to move entries. With the hack he is not seeing the problem. This is the hack: +static void

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Miller
From: Linus Torvalds torva...@linux-foundation.org Date: Sun, 22 Mar 2015 12:47:08 -0700 Which was why I was asking how sure you are that memcpy *always* copies from low to high. Yeah I'm pretty sure. I don't even know which version of memcpy ends up being used on M7. Some of them do things

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Miller
From: David Ahern david.ah...@oracle.com Date: Sun, 22 Mar 2015 18:03:30 -0600 On 3/22/15 5:54 PM, David Miller wrote: I just put it on 4.0.0-rc4 and ditto -- problem goes away, so it clearly suggests the memcpy or memmove are the root cause. Thanks, didn't notice that. So, something is

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread David Miller
Never mind, I think I figured out the problem. It's the cache initializing stores; we can't do overlapping copies where dst <= src in all cases because of them. A store to an address that is a multiple of the cache line size (which for these instructions is 64 bytes) clears that whole line. But when we're doing

Re: 4.0.0-rc4: panic in free_block

2015-03-22 Thread Linus Torvalds
On Sun, Mar 22, 2015 at 3:23 PM, David Miller da...@davemloft.net wrote: Yes, using VIS how we do is alright, and in fact I did an audit of this about 1 year ago. This is another one of those if this is wrong, so much stuff would break Maybe. But it does seem like Bob Picco has narrowed it

Re: 4.0.0-rc4: panic in free_block

2015-03-21 Thread Linus Torvalds
On Sat, Mar 21, 2015 at 10:45 AM, David Ahern wrote: > > You raise a lot of valid questions and something to look into. But if the > root cause were such a fundamental issue (CPU memory ordering, compiler bug, > etc) why would it only occur on this one code path -- free with SLAB and > NUMA --

Re: 4.0.0-rc4: panic in free_block

2015-03-21 Thread David Ahern
On 3/20/15 6:47 PM, Linus Torvalds wrote: Here's another data point: If I disable NUMA I don't see the problem. Performance drops, but no NULL pointer splats which would have been panics. So the NUMA case triggers the per-node "n->shared" logic, which *should* be protected by "n->list_lock".

Re: 4.0.0-rc4: panic in free_block

2015-03-21 Thread David Ahern
On 3/20/15 6:47 PM, Linus Torvalds wrote: Here's another data point: If I disable NUMA I don't see the problem. Performance drops, but no NULL pointer splats which would have been panics. So the NUMA case triggers the per-node n->shared logic, which *should* be protected by n->list_lock. Maybe

Re: 4.0.0-rc4: panic in free_block

2015-03-21 Thread Linus Torvalds
On Sat, Mar 21, 2015 at 10:45 AM, David Ahern david.ah...@oracle.com wrote: You raise a lot of valid questions and something to look into. But if the root cause were such a fundamental issue (CPU memory ordering, compiler bug, etc) why would it only occur on this one code path -- free with

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread Linus Torvalds
On Fri, Mar 20, 2015 at 5:18 PM, David Ahern wrote: > On 3/20/15 4:49 PM, David Ahern wrote: >> >> I did ask around and apparently this bug is hit only with the new M7 >> processors. DaveM: that's why you are not hitting this. Quite frankly, this smells even more like an architecture bug. It

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Ahern
On 3/20/15 6:34 PM, David Rientjes wrote: On Fri, 20 Mar 2015, David Ahern wrote: Here's another data point: If I disable NUMA I don't see the problem. Performance drops, but no NULL pointer splats which would have been panics. The 128 cpu ldom with NUMA enabled shows the problem every single

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Rientjes
On Fri, 20 Mar 2015, David Ahern wrote: > Here's another data point: If I disable NUMA I don't see the problem. > Performance drops, but no NULL pointer splats which would have been panics. > > The 128 cpu ldom with NUMA enabled shows the problem every single time I do a > kernel compile (-j

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Ahern
On 3/20/15 4:49 PM, David Ahern wrote: On 3/20/15 3:17 PM, Linus Torvalds wrote: In other words, if I read that sparc asm right (and it is very likely that I do *not*), then "objp" is NULL, and that's why you crash. That does appear to be why. I put a WARN_ON before clear_obj_pfmemalloc() if

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Ahern
On 3/20/15 3:17 PM, Linus Torvalds wrote: In other words, if I read that sparc asm right (and it is very likely that I do *not*), then "objp" is NULL, and that's why you crash. That does appear to be why. I put a WARN_ON before clear_obj_pfmemalloc() if objpp[i] is NULL. I got 2 splats during

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread Linus Torvalds
On Fri, Mar 20, 2015 at 8:07 AM, David Ahern wrote: > Instruction DUMP: 86230003 8730f00d 8728f006 <d658c007> 8600c007 8e0ac008 > 2ac1c002 c658e030 d458e028 Ok, so it's d658c007 that faults, which is that ldx [ %g3 + %g7 ], %o3 instruction. Looking at your objdump: > free_block(): >

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Miller
From: David Ahern Date: Fri, 20 Mar 2015 13:54:09 -0600 > Interesting. With -j <64 and talking softly it completes. But -j 128 > and higher always ends in a panic. Please share more details of your configuration.

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread Dave Hansen
On 03/20/2015 09:58 AM, Linus Torvalds wrote: > 128 cpu's is still "unusual", of course, but by no means unheard of, > and I'd have expected others to report it too if it was easy to > trigger on x86-64. FWIW, I configured a kernel with SLAB and kicked off a bunch of compiles on a 160-thread

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Ahern
On 3/20/15 1:47 PM, David Miller wrote: From: David Ahern Date: Fri, 20 Mar 2015 12:05:05 -0600 DaveM: do you mind if I submit a patch to change the default for sparc to SLUB? I think we're jumping the gun about all of this, and doing anything with default Kconfig settings would be entirely

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Miller
From: David Ahern Date: Fri, 20 Mar 2015 12:05:05 -0600 > DaveM: do you mind if I submit a patch to change the default for sparc > to SLUB? I think we're jumping the gun about all of this, and doing anything with default Kconfig settings would be entirely premature until we know what the real

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Miller
From: Linus Torvalds Date: Fri, 20 Mar 2015 09:58:25 -0700 > 128 cpu's is still "unusual" As unusual as the system I do all of my kernel builds on :-)

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Ahern
On 3/20/15 12:53 PM, Linus Torvalds wrote: SLUB should definitely be considered a stable allocator. It's the default allocator for at least Fedora, and that presumably means all of Redhat. SuSE seems to use SLAB still, though, so it must be getting lots of testing on x86 too. Did you test

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread Linus Torvalds
On Fri, Mar 20, 2015 at 11:05 AM, David Ahern wrote: > > Evidently, it is a well known problem internally that goes back to at least > 2.6.39. > > To this point I have not paid attention to the allocators. At what point is > SLUB considered stable for large systems? Is 2.6.39 stable? SLUB should

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Ahern
On 3/20/15 10:58 AM, Linus Torvalds wrote: That said, SLAB is probably also almost unheard of in high-CPU configurations, since slub has all the magical unlocked lists etc for scalability. So maybe it's a generic SLAB bug, and nobody with lots of CPU's is testing SLAB. Evidently, it is a well

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread Linus Torvalds
On Fri, Mar 20, 2015 at 9:53 AM, David Ahern wrote: > > I haven't tried 3.19 yet. Just backed up to 3.18 and it shows the same > problem. And I can reproduce the 4.0 crash in a 128 cpu ldom (VM). Ok, so if 3.18 also has it, then trying 3.19 is pointless, this is obviously an old problem. Which

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Ahern
On 3/20/15 10:48 AM, Linus Torvalds wrote: [ Added Davem and the sparc mailing list, since it happens on sparc and that just makes me suspicious ] On Fri, Mar 20, 2015 at 8:07 AM, David Ahern wrote: I can easily reproduce the panic below doing a kernel build with make -j N, N=128, 256, etc.

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread Linus Torvalds
[ Added Davem and the sparc mailing list, since it happens on sparc and that just makes me suspicious ] On Fri, Mar 20, 2015 at 8:07 AM, David Ahern wrote: > I can easily reproduce the panic below doing a kernel build with make -j N, > N=128, 256, etc. This is a 1024 cpu system running

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread Linus Torvalds
On Fri, Mar 20, 2015 at 5:18 PM, David Ahern david.ah...@oracle.com wrote: On 3/20/15 4:49 PM, David Ahern wrote: I did ask around and apparently this bug is hit only with the new M7 processors. DaveM: that's why you are not hitting this. Quite frankly, this smells even more like an

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Ahern
On 3/20/15 4:49 PM, David Ahern wrote: On 3/20/15 3:17 PM, Linus Torvalds wrote: In other words, if I read that sparc asm right (and it is very likely that I do *not*), then objp is NULL, and that's why you crash. That does appear to be why. I put a WARN_ON before clear_obj_pfmemalloc() if

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Rientjes
On Fri, 20 Mar 2015, David Ahern wrote: Here's another data point: If I disable NUMA I don't see the problem. Performance drops, but no NULL pointer splats which would have been panics. The 128 cpu ldom with NUMA enabled shows the problem every single time I do a kernel compile (-j 128).

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread Linus Torvalds
On Fri, Mar 20, 2015 at 11:05 AM, David Ahern david.ah...@oracle.com wrote: Evidently, it is a well known problem internally that goes back to at least 2.6.39. To this point I have not paid attention to the allocators. At what point is SLUB considered stable for large systems? Is 2.6.39

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread Linus Torvalds
On Fri, Mar 20, 2015 at 9:53 AM, David Ahern david.ah...@oracle.com wrote: I haven't tried 3.19 yet. Just backed up to 3.18 and it shows the same problem. And I can reproduce the 4.0 crash in a 128 cpu ldom (VM). Ok, so if 3.18 also has it, then trying 3.19 is pointless, this is obviously an

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread Linus Torvalds
[ Added Davem and the sparc mailing list, since it happens on sparc and that just makes me suspicious ] On Fri, Mar 20, 2015 at 8:07 AM, David Ahern david.ah...@oracle.com wrote: I can easily reproduce the panic below doing a kernel build with make -j N, N=128, 256, etc. This is a 1024 cpu

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Ahern
On 3/20/15 10:48 AM, Linus Torvalds wrote: [ Added Davem and the sparc mailing list, since it happens on sparc and that just makes me suspicious ] On Fri, Mar 20, 2015 at 8:07 AM, David Ahern david.ah...@oracle.com wrote: I can easily reproduce the panic below doing a kernel build with make -j

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Miller
From: David Ahern david.ah...@oracle.com Date: Fri, 20 Mar 2015 12:05:05 -0600 DaveM: do you mind if I submit a patch to change the default for sparc to SLUB? I think we're jumping the gun about all of this, and doing anything with default Kconfig settings would be entirely premature until we

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread Dave Hansen
On 03/20/2015 09:58 AM, Linus Torvalds wrote: 128 cpu's is still unusual, of course, but by no means unheard of, and I'd have expected others to report it too if it was easy to trigger on x86-64. FWIW, I configured a kernel with SLAB and kicked off a bunch of compiles on a 160-thread x86_64

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Miller
From: Linus Torvalds torva...@linux-foundation.org Date: Fri, 20 Mar 2015 09:58:25 -0700 128 cpu's is still unusual As unusual as the system I do all of my kernel builds on :-)

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Ahern
On 3/20/15 1:47 PM, David Miller wrote: From: David Ahern david.ah...@oracle.com Date: Fri, 20 Mar 2015 12:05:05 -0600 DaveM: do you mind if I submit a patch to change the default for sparc to SLUB? I think we're jumping the gun about all of this, and doing anything with default Kconfig

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Miller
From: David Ahern david.ah...@oracle.com Date: Fri, 20 Mar 2015 13:54:09 -0600 Interesting. With -j <64 and talking softly it completes. But -j 128 and higher always ends in a panic. Please share more details of your configuration.

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread Linus Torvalds
On Fri, Mar 20, 2015 at 8:07 AM, David Ahern david.ah...@oracle.com wrote: Instruction DUMP: 86230003 8730f00d 8728f006 <d658c007> 8600c007 8e0ac008 2ac1c002 c658e030 d458e028 Ok, so it's d658c007 that faults, which is that ldx [ %g3 + %g7 ], %o3 instruction. Looking at your

Re: 4.0.0-rc4: panic in free_block

2015-03-20 Thread David Ahern
On 3/20/15 3:17 PM, Linus Torvalds wrote: In other words, if I read that sparc asm right (and it is very likely that I do *not*), then objp is NULL, and that's why you crash. That does appear to be why. I put a WARN_ON before clear_obj_pfmemalloc() if objpp[i] is NULL. I got 2 splats during