Re: kernel crash in setrunqueue

Mike Larkin Wed, 29 Jul 2020 13:17:45 -0700

On Wed, Jul 29, 2020 at 01:03:43PM -0700, Mike Larkin wrote:
> Hi,
>
>  I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This happens
> on GENERIC.MP regardless of whether or not the VM has one cpu or more than
> one. It does not happen on GENERIC kernels.
>
>  The crash will happen fairly quickly after the kernel starts executing
> processes. Sometimes it crashes instantly, sometimes it lasts for a minute
> or two. It rarely makes it to the login prompt. The problem is 100%
> reproducible on two different VMs I have, running on two different
> hypervisors (Hyper-V and ESXi6.7U2).
>
>  I first started noticing the problem on the 24th July snap, but TBH these
> machines were not frequently updated, so the previous snap I had installed
> might have been a couple months old. Whatever older snap was on them before
> worked fine.
>
>  Since this is happening on two different machines with two different VMs,
> I'm gonna rule out hardware issues.
>
>  Crash:
>
> kernel: protection fault trap, code=0
> Stopped at    setrunqueue+0xa2:       addl    $0x1,0x288(%r13)
>
>  Trace:
> ddb{2}> trace
> setrunqueue(27b3d6c24c3fab80, ffff800015e874e0,32) at setrunqueue+0xa2
> sched_barrier_task(ffff800015f1a168) at sched_barrier_task+0x6c
> taskq_thread(ffffffff82121548) at taskq_thread+0x8d
> end trace frame: 0x0, count: -3
>
>  Registers:
> ddb{2}> sh r
> rdi                   0xffffffff821ee728      sched_lock
> rsi                   0xffff800014cc6ff0
> rbp                   0xffff800015ea0e40
> rbx                                    0
> rdx                             0x23ca94      acpi_pdirpa+0x2288fc
> rcx                                  0xc
> rax                                  0xc
> r8                                 0x202
> r9                                   0x2
> r10                                    0
> r11                   0x57f79bf6968709d8
> r12                   0xffff800015e874e0
> r13                   0x27b3d6c24c3fab80
> r14                                 0x32
> r15                   0x27b3d6c24c3fab80
> rip                   0xffffffff81b9df22      setrunqueue+0xa2
> cs                                   0x8
> rflags                                   0x10207      __ALIGN_SIZE+0xf207
> rsp                   0xffff800015ea0df0
> ss                                  0x10
>
>
> The offending instruction is in kern_sched.c:260:
>
>       spc->spc_nrun++;
>
> ... which indicates 'spc' is trash (and it is, based on %r13 above). In my
> tests, %r13 always is this same trash value. That comes from 'ci', which is
> either passed in or chosen by sched_choosecpu. Neither of these functions
> has changed recently, so I'm guessing this corruption is coming from
> something else.
>
>  Anyone have ideas where to start looking? I suppose I could start bisecting,
> but does anyone know of any changes that would affect this area?
>
>  I can send dmesgs if needed, but these are pretty standard VMs, nothing fancy
> configured in them. 4 CPUs, 8GB RAM, etc.
>
> -ml
>


Also I should note that the problem happens with snaps as well as kernels built
from source (-current), so this isn't likely something that is in snaps but not
yet in tree.

-ml
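
One detail that may help when reading the crash: the trap is a protection
fault rather than a page fault, which is consistent with %r13 holding a
non-canonical address (amd64 raises a general protection fault for those).
A minimal sketch of that check, assuming 48-bit virtual addresses (4-level
paging); is_canonical48 is a hypothetical helper written for illustration,
not a kernel function:

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if 'va' is a canonical amd64 virtual
 * address, 0 otherwise.  Assumes 48-bit virtual addresses (4-level
 * paging), where bits 63..47 must be all zeros or all ones.
 */
static int
is_canonical48(uint64_t va)
{
	uint64_t top = va >> 47;	/* the 17 bits 63..47 */

	return (top == 0 || top == 0x1ffff);
}

/*
 * Values taken from the register dump in the report:
 *   r13 = 0x27b3d6c24c3fab80  (used as 'spc' by the faulting instruction)
 *   r12 = 0xffff800015e874e0  (an ordinary-looking kernel pointer)
 * The faulting access is 0x288(%r13), i.e. r13 + 0x288.
 */
static const uint64_t crash_r13 = 0x27b3d6c24c3fab80ULL;
static const uint64_t crash_r12 = 0xffff800015e874e0ULL;
```

Under that assumption, is_canonical48(crash_r13 + 0x288) returns 0 while
is_canonical48(crash_r12) returns 1, so the pointer is not merely stale or
unmapped; it does not look like a kernel address at all.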
