From: Bob Picco
Date: Tue, 24 Mar 2015 10:57:53 -0400
> Seems solid with 2.6.39 on M7-4. Jalapeño is happy with current sparc.git.
Thanks for all the testing, it's been integrated into the -stable
queues as well.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
David Miller wrote: [Mon Mar 23 2015, 12:25:30PM EDT]
> From: David Miller
> Date: Sun, 22 Mar 2015 22:19:06 -0400 (EDT)
>
> > I'll work on a fix.
>
> Ok, here is what I committed. David et al., let me know if you still
> see the crashes with this applied.
>
> Of course, I'll queue this up for -stable as well.
On 3/23/15 1:35 PM, David Miller wrote:
From: David Ahern
Date: Mon, 23 Mar 2015 11:34:34 -0600
seems like a formality at this point, but this resolves the panic on
the M7-based ldom and baremetal. The T5-8 failed to boot, but it could
be a different problem.
Specifically, does the T5-8 boot
From: "John Stoffel"
Date: Mon, 23 Mar 2015 15:56:02 -0400
>> "David" == David Miller writes:
>
> David> From: "John Stoffel"
> David> Date: Mon, 23 Mar 2015 12:51:03 -0400
>
>>> Would it make sense to have some memmove()/memcopy() tests on bootup
>>> to catch problems like this? I know
> "David" == David Miller writes:
David> From: "John Stoffel"
David> Date: Mon, 23 Mar 2015 12:51:03 -0400
>> Would it make sense to have some memmove()/memcopy() tests on bootup
>> to catch problems like this? I know this is a strange case, and
>> probably not too common, but how hard wou
From: Linus Torvalds
Date: Mon, 23 Mar 2015 12:47:49 -0700
> On Mon, Mar 23, 2015 at 12:08 PM, David Miller wrote:
>>
>> Sure you could do that in C, but I really want to avoid using memcpy()
>> if dst and src overlap in any way at all.
>>
>> Said another way, I don't want to codify that "64" th
On Mon, Mar 23, 2015 at 12:08 PM, David Miller wrote:
>
> Sure you could do that in C, but I really want to avoid using memcpy()
> if dst and src overlap in any way at all.
>
> Said another way, I don't want to codify that "64" thing. The next
> chip could do 128 byte initializing stores.
But Da
From: David Ahern
Date: Mon, 23 Mar 2015 11:34:34 -0600
> seems like a formality at this point, but this resolves the panic on
> the M7-based ldom and baremetal. The T5-8 failed to boot, but it could
> be a different problem.
Specifically, does the T5-8 boot without my patch applied?
From: "John Stoffel"
Date: Mon, 23 Mar 2015 12:51:03 -0400
> Would it make sense to have some memmove()/memcopy() tests on bootup
> to catch problems like this? I know this is a strange case, and
> probably not too common, but how hard would it be to wire up tests
> that go through 1 to 128 byte
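A boot-time self-test along the lines John suggests could be sketched in plain C like this. The 1-to-128-byte size range and the idea of sweeping overlap offsets come from his mail; the reference-via-temporary trick, the buffer layout, and all names here are my own choices, not code from the thread:

```c
#include <string.h>

/* Compare memmove() against a trivially-correct reference for every
 * length from 1 to 128 bytes and a handful of overlap offsets.
 * Returns the number of mismatches (0 means memmove behaved). */
static int memmove_selftest(void)
{
        char ref[512], buf[512];
        int bad = 0;

        for (int len = 1; len <= 128; len++) {
                for (int off = -8; off <= 8; off++) {
                        for (int i = 0; i < 512; i++)
                                ref[i] = buf[i] = (char)(i * 131 + len);

                        /* reference: bounce through a temporary so the
                         * two copies never overlap */
                        char tmp[128];
                        memcpy(tmp, ref + 200, len);
                        memcpy(ref + 200 + off, tmp, len);

                        memmove(buf + 200 + off, buf + 200, len);

                        if (memcmp(ref, buf, 512) != 0)
                                bad++;
                }
        }
        return bad;
}
```

An in-kernel version would run this once during early boot and printk() any mismatching (len, off) pair, which would have flagged the overlapping-copy bug discussed in this thread immediately.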
From: Linus Torvalds
Date: Mon, 23 Mar 2015 10:00:02 -0700
> Maybe the code could be something like
>
> void *memmove(void *dst, const void *src, size_t n)
> {
>         // non-overlapping cases
>         if (src + n <= dst)
>                 return memcpy(dst, src, n);
>         if (dst +
On 3/23/15 10:25 AM, David Miller wrote:
[PATCH] sparc64: Fix several bugs in memmove().
Firstly, handle zero length calls properly. Believe it or not there
are a few of these happening during early boot.
Next, we can't just drop to a memcpy() call in the forward copy case
where dst <= src. T
On Mon, Mar 23, 2015 at 9:25 AM, David Miller wrote:
>
> Ok, here is what I committed.
So I wonder - looking at that assembly, I get the feeling that it
isn't any better code than gcc could generate from simple C code.
Would it perhaps be better to turn memmove() into C?
That's particularly tru
David>
David> [PATCH] sparc64: Fix several bugs in memmove().
David> Firstly, handle zero length calls properly. Believe it or not there
David> are a few of these happening during early boot.
David> Next, we can't just drop to a memcpy() call in the forward copy case
David>
From: David Miller
Date: Sun, 22 Mar 2015 22:19:06 -0400 (EDT)
> I'll work on a fix.
Ok, here is what I committed. David et al., let me know if you still
see the crashes with this applied.
Of course, I'll queue this up for -stable as well.
Thanks!
[PATCH] sparc64: Fix s
Nevermind I think I figured out the problem.
It's the cache initializing stores, we can't do overlapping
copies where dst <= src in all cases because of them.
A store to an address that is 0 modulo the cache line size (which for
these instructions is 64 bytes) clears that whole line.
But when we're doing
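The hazard David describes can be simulated in user space. The 64-byte line size is from his mail; the simulated store semantics below are a deliberate simplification of the real block-initializing stores, and every name is illustrative:

```c
#include <string.h>

#define LINE 64

/* Simulate a cache-initializing store: a write to the first byte of a
 * line zeroes the whole line before the store lands. */
static void init_store(char *base, size_t off, char v)
{
        if (off % LINE == 0)
                memset(base + off, 0, LINE);
        base[off] = v;
}

/* Forward byte copy built on initializing stores: unsafe when the
 * regions overlap, because zeroing dst's line can clobber src bytes
 * that have not been read yet. */
static void fwd_copy(char *base, size_t dst, size_t src, size_t n)
{
        for (size_t i = 0; i < n; i++)
                init_store(base, dst + i, base[src + i]);
}
```

With dst = 0 and src = 32 inside a single line, the initializing store at offset 0 zeroes bytes 32..63 before they are read, so everything past the first byte of the destination comes out as zeros instead of the source data.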
From: David Ahern
Date: Sun, 22 Mar 2015 18:03:30 -0600
> On 3/22/15 5:54 PM, David Miller wrote:
>>> I just put it on 4.0.0-rc4 and ditto -- problem goes away, so it
>>> clearly suggests the memcpy or memmove are the root cause.
>>
>> Thanks, didn't notice that.
>>
>> So, something is amuck.
>
On 3/22/15 5:54 PM, David Miller wrote:
I just put it on 4.0.0-rc4 and ditto -- problem goes away, so it
clearly suggests the memcpy or memmove are the root cause.
Thanks, didn't notice that.
So, something is amuck.
to continue to refine the problem ... I modified only the memmove lines
(no
From: Linus Torvalds
Date: Sun, 22 Mar 2015 16:49:51 -0700
> On Sun, Mar 22, 2015 at 3:23 PM, David Miller wrote:
>>
>> Yes, using VIS how we do is alright, and in fact I did an audit of
>> this about 1 year ago. This is another one of those "if this is
>> wrong, so much stuff would break"
>
>
From: David Ahern
Date: Sun, 22 Mar 2015 17:35:49 -0600
> I don't know if you caught Bob's message; he has a hack to bypass
> memcpy and memmove in mm/slab.c use a for loop to move entries. With
> the hack he is not seeing the problem.
>
> This is the hack:
>
> +static void move_entries(void *d
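Bob's hack is truncated above. As I read the description, it replaces the slab memmove() of the object-pointer array with an explicit loop; only the function name survives from the snippet, so the signature and body below are guesswork:

```c
#include <stddef.h>

/* Hypothetical sketch of the workaround: move 'count' object pointers
 * one at a time instead of calling memmove(). For the overlapping
 * slab case where dst < src, a forward loop like this is safe. */
static void move_entries(void **dst, void **src, size_t count)
{
        for (size_t i = 0; i < count; i++)
                dst[i] = src[i];
}
```

That the problem disappears with this loop is what pointed the thread at the memmove()/memcpy() implementations rather than at the slab locking.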
On Sun, Mar 22, 2015 at 3:23 PM, David Miller wrote:
>
> Yes, using VIS how we do is alright, and in fact I did an audit of
> this about 1 year ago. This is another one of those "if this is
> wrong, so much stuff would break"
Maybe. But it does seem like Bob Picco has narrowed it down to memmove
On 3/22/15 4:23 PM, David Miller wrote:
I don't even know which version of memcpy ends up being used on M7.
Some of them do things like use VIS. I can follow some regular sparc
asm, there's no way I'm even *looking* at that. Is it really ok to use
VIS registers in random contexts?
Yes, using VI
From: Linus Torvalds
Date: Sun, 22 Mar 2015 12:47:08 -0700
> Which was why I was asking how sure you are that memcpy *always*
> copies from low to high.
Yeah I'm pretty sure.
> I don't even know which version of memcpy ends up being used on M7.
> Some of them do things like use VIS. I can follo
On Sun, Mar 22, 2015 at 10:36 AM, David Miller wrote:
>
> And they end up using that byte-at-a-time code, since SLAB and SLUB
> do memmove() calls of the form:
>
> memmove(X + N, X, LEN);
Actually, the common case in slab is overlapping but of the form
memmove(p, p+x, len);
which g
David Miller wrote: [Sun Mar 22 2015, 01:36:03PM EDT]
> From: Linus Torvalds
> Date: Sat, 21 Mar 2015 11:49:12 -0700
>
> > Davem? I don't read sparc assembly, so I'm *really* not going to try
> > to verify that (a) all the memcpy implementations always copy
> > low-to-high and (b) that I even
From: Linus Torvalds
Date: Sat, 21 Mar 2015 11:49:12 -0700
> Davem? I don't read sparc assembly, so I'm *really* not going to try
> to verify that (a) all the memcpy implementations always copy
> low-to-high and (b) that I even read the address comparisons in
> memmove.S right.
All of the sparc
On Sat, Mar 21, 2015 at 10:45 AM, David Ahern wrote:
>
> You raise a lot of valid questions and something to look into. But if the
> root cause were such a fundamental issue (CPU memory ordering, compiler bug,
> etc) why would it only occur on this one code path -- free with SLAB and
> NUMA -- and
On 3/20/15 6:47 PM, Linus Torvalds wrote:
Here's another data point: If I disable NUMA I don't see the problem.
Performance drops, but no NULL pointer splats which would have been panics.
So the NUMA case triggers the per-node "n->shared" logic, which
*should* be protected by "n->list_lock".
On Fri, Mar 20, 2015 at 5:18 PM, David Ahern wrote:
> On 3/20/15 4:49 PM, David Ahern wrote:
>>
>> I did ask around and apparently this bug is hit only with the new M7
>> processors. DaveM: that's why you are not hitting this.
Quite frankly, this smells even more like an architecture bug. It
coul
On 3/20/15 6:34 PM, David Rientjes wrote:
On Fri, 20 Mar 2015, David Ahern wrote:
Here's another data point: If I disable NUMA I don't see the problem.
Performance drops, but no NULL pointer splats which would have been panics.
The 128 cpu ldom with NUMA enabled shows the problem every single
On Fri, 20 Mar 2015, David Ahern wrote:
> Here's another data point: If I disable NUMA I don't see the problem.
> Performance drops, but no NULL pointer splats which would have been panics.
>
> The 128 cpu ldom with NUMA enabled shows the problem every single time I do a
> kernel compile (-j 128)
On 3/20/15 4:49 PM, David Ahern wrote:
On 3/20/15 3:17 PM, Linus Torvalds wrote:
In other words, if I read that sparc asm right (and it is very likely
that I do *not*), then "objp" is NULL, and that's why you crash.
That does appear to be why. I put a WARN_ON before
clear_obj_pfmemalloc() if o
On 3/20/15 3:17 PM, Linus Torvalds wrote:
In other words, if I read that sparc asm right (and it is very likely
that I do *not*), then "objp" is NULL, and that's why you crash.
That does appear to be why. I put a WARN_ON before
clear_obj_pfmemalloc() if objpp[i] is NULL. I got 2 splats during
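The debugging pattern David describes is a guard before the dereference. The free path and clear_obj_pfmemalloc() are from the thread; the stand-in WARN_ON macro and loop shape below are mine, so this can run outside the kernel:

```c
#include <stdio.h>

/* User-space stand-in for the kernel's WARN_ON(): print a splat and
 * evaluate to the condition so callers can skip the bad entry. */
#define WARN_ON(cond) \
        ((cond) ? (fprintf(stderr, "WARNING at %s:%d\n", __FILE__, __LINE__), 1) : 0)

static int skipped;

/* Illustrative free loop: warn and skip NULL entries instead of
 * letting the free path dereference them and panic. */
static void free_objects(void **objpp, int nr)
{
        for (int i = 0; i < nr; i++) {
                if (WARN_ON(objpp[i] == NULL)) {
                        skipped++;
                        continue;
                }
                /* ... clear_obj_pfmemalloc() and the real free path ... */
        }
}
```

The payoff is the one David reports: a backtrace-bearing splat at the point of corruption instead of a NULL-pointer panic further down the line.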
On Fri, Mar 20, 2015 at 8:07 AM, David Ahern wrote:
> Instruction DUMP: 86230003 8730f00d 8728f006 8600c007 8e0ac008
> 2ac1c002 c658e030 d458e028
Ok, so it's d658c007 that faults, which is that
ldx [ %g3 + %g7 ], %o3
instruction.
Looking at your objdump:
> free_block():
> /opt/
From: David Ahern
Date: Fri, 20 Mar 2015 13:54:09 -0600
> Interesting. With -j <64 and talking softly it completes. But -j 128
> and higher always ends in a panic.
Please share more details of your configuration.
On 03/20/2015 09:58 AM, Linus Torvalds wrote:
> 128 cpu's is still "unusual", of course, but by no means unheard of,
> and I'f have expected others to report it too if it was wasy to
> trigger on x86-64.
FWIW, I configured a kernel with SLAB and kicked off a bunch of compiles
on a 160-thread x86_6
On 3/20/15 1:47 PM, David Miller wrote:
From: David Ahern
Date: Fri, 20 Mar 2015 12:05:05 -0600
DaveM: do you mind if I submit a patch to change the default for sparc
to SLUB?
I think we're jumping the gun about all of this, and doing anything
with default Kconfig settings would be entirely
From: David Ahern
Date: Fri, 20 Mar 2015 12:05:05 -0600
> DaveM: do you mind if I submit a patch to change the default for sparc
> to SLUB?
I think we're jumping the gun about all of this, and doing anything
with default Kconfig settings would be entirely premature until we
know what the real bu
From: Linus Torvalds
Date: Fri, 20 Mar 2015 09:58:25 -0700
> 128 cpu's is still "unusual"
As unusual as the system I do all of my kernel builds on :-)
On 3/20/15 12:53 PM, Linus Torvalds wrote:
SLUB should definitely be considered a stable allocator. It's the
default allocator for at least Fedora, and that presumably means all
of Redhat.
SuSE seems to use SLAB still, though, so it must be getting lots of
testing on x86 too.
Did you test with
On Fri, Mar 20, 2015 at 11:05 AM, David Ahern wrote:
>
> Evidently, it is a well known problem internally that goes back to at least
> 2.6.39.
>
> To this point I have not paid attention to the allocators. At what point is
> SLUB considered stable for large systems? Is 2.6.39 stable?
SLUB should
On 3/20/15 10:58 AM, Linus Torvalds wrote:
That said, SLAB is probably also almost unheard of in high-CPU
configurations, since slub has all the magical unlocked lists etc for
scalability. So maybe it's a generic SLAB bug, and nobody with lots of
CPU's is testing SLAB.
Evidently, it is a well
On Fri, Mar 20, 2015 at 9:53 AM, David Ahern wrote:
>
> I haven't tried 3.19 yet. Just backed up to 3.18 and it shows the same
> problem. And I can reproduce the 4.0 crash in a 128 cpu ldom (VM).
Ok, so if 3.18 also has it, then trying 3.19 is pointless, this is
obviously an old problem. Which ma
On 3/20/15 10:48 AM, Linus Torvalds wrote:
[ Added Davem and the sparc mailing list, since it happens on sparc
and that just makes me suspicious ]
On Fri, Mar 20, 2015 at 8:07 AM, David Ahern wrote:
I can easily reproduce the panic below doing a kernel build with make -j N,
N=128, 256, etc. Th
[ Added Davem and the sparc mailing list, since it happens on sparc
and that just makes me suspicious ]
On Fri, Mar 20, 2015 at 8:07 AM, David Ahern wrote:
> I can easily reproduce the panic below doing a kernel build with make -j N,
> N=128, 256, etc. This is a 1024 cpu system running 4.0.0-rc4.
I can easily reproduce the panic below doing a kernel build with make -j
N, N=128, 256, etc. This is a 1024 cpu system running 4.0.0-rc4.
The top 3 frames are consistently:
free_block+0x60
cache_flusharray+0xac
kmem_cache_free+0xfc
After that one path has been from __mmdrop and the