Re: kernel crash in setrunqueue

2020-07-29 Thread Mike Larkin
On Wed, Jul 29, 2020 at 10:14:11PM +0200, Mark Kettenis wrote:
> > Date: Wed, 29 Jul 2020 13:03:43 -0700
> > From: Mike Larkin 
> >
> > Hi,
> >
> >  I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This happens
> > on GENERIC.MP regardless of whether or not the VM has one cpu or more than
> > one. It does not happen on GENERIC kernels.
> >
> >  The crash will happen fairly quickly after the kernel starts executing
> > processes. Sometimes it crashes instantly, sometimes it lasts for a minute
> > or two. It rarely makes it to the login prompt. The problem is 100%
> > reproducible on two different VMs I have, running on two different
> > hypervisors (Hyper-V and ESXi6.7U2).
> >
> >  I first started noticing the problem on the 24th July snap, but TBH these
> > machines were not frequently updated, so the previous snap I had installed
> > might have been a couple months old. Whatever older snap was on them before
> > worked fine.
> >
> >  Since this is happening on two different machines with two different VMs,
> > I'm gonna rule out hardware issues.
> >
> >  Crash:
> >
> > kernel: protection fault trap, code=0
> > Stopped at  setrunqueue+0xa2:   addl    $0x1,0x288(%r13)
> >
> >  Trace:
> > ddb{2}> trace
> > setrunqueue(27b3d6c24c3fab80, 800015e874e0,32) at setrunqueue+0xa2
> > sched_barrier_task(800015f1a168) at sched_barrier_task+0x6c
> > taskq_thread(82121548) at taskq_thread+0x8d
> > end trace frame: 0x0, count: -3
> >
> >  Registers:
> > ddb{2}> sh r
> > rdi     0x821ee728          sched_lock
> > rsi     0x800014cc6ff0
> > rbp     0x800015ea0e40
> > rbx     0
> > rdx     0x23ca94            acpi_pdirpa_0x2288fc
> > rcx     0xc
> > rax     0xc
> > r8      0x202
> > r9      0x2
> > r10     0
> > r11     0x57f79bf6968709d8
> > r12     0x800015e874e0
> > r13     0x27b3d6c24c3fab80
> > r14     0x32
> > r15     0x27b3d6c24c3fab80
> > rip     0x81b9df22          setrunqueue+0xa2
> > cs      0x8
> > rflags  0x10207             __ALIGN_SIZE+0xf207
> > rsp     0x800015ea0df0
> > ss      0x10
> >
> >
> > The offending instruction is in kern_sched.c:260:
> >
> >   spc->spc_nrun++;
> >
> > ... which indicates 'spc' is trash (and it is, based on %r13 above). In my
> > tests, %r13 is always this same trash value. That comes from 'ci', which is
> > either passed in or chosen by sched_choosecpu. Neither of these functions
> > has changed recently, so I'm guessing this corruption is coming from
> > something else.
> >
> >  Anyone have ideas where to start looking? I suppose I could start
> > bisecting, but does anyone know of any changes that would affect this area?
> >
> >  I can send dmesgs if needed, but these are pretty standard VMs,
> > nothing fancy configured in them. 4 CPUs, 8GB RAM, etc.
>
> They're VMs and it turns out that many of the "PV" drivers are/were
> using the intr_barrier() interface the wrong way.
>
> For Hyper-V, see my reply in the "Panic on boot with Hyper-V since Jun
> 17 snapshot" thread on bugs@ from earlier today.
>
> Cheers,
>
> Mark
>

Thanks. I don't subscribe to bugs@ anymore, so that's why I likely missed it.

-ml



Re: kernel crash in setrunqueue

2020-07-29 Thread Mike Larkin
On Wed, Jul 29, 2020 at 01:03:43PM -0700, Mike Larkin wrote:
> Hi,
>
>  I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This happens
> on GENERIC.MP regardless of whether or not the VM has one cpu or more than
> one. It does not happen on GENERIC kernels.
>
>  The crash will happen fairly quickly after the kernel starts executing
> processes. Sometimes it crashes instantly, sometimes it lasts for a minute
> or two. It rarely makes it to the login prompt. The problem is 100%
> reproducible on two different VMs I have, running on two different
> hypervisors (Hyper-V and ESXi6.7U2).
>
>  I first started noticing the problem on the 24th July snap, but TBH these
> machines were not frequently updated, so the previous snap I had installed
> might have been a couple months old. Whatever older snap was on them before
> worked fine.
>
>  Since this is happening on two different machines with two different VMs,
> I'm gonna rule out hardware issues.
>
>  Crash:
>
> kernel: protection fault trap, code=0
> Stopped at  setrunqueue+0xa2:   addl    $0x1,0x288(%r13)
>
>  Trace:
> ddb{2}> trace
> setrunqueue(27b3d6c24c3fab80, 800015e874e0,32) at setrunqueue+0xa2
> sched_barrier_task(800015f1a168) at sched_barrier_task+0x6c
> taskq_thread(82121548) at taskq_thread+0x8d
> end trace frame: 0x0, count: -3
>
>  Registers:
> ddb{2}> sh r
> rdi     0x821ee728          sched_lock
> rsi     0x800014cc6ff0
> rbp     0x800015ea0e40
> rbx     0
> rdx     0x23ca94            acpi_pdirpa_0x2288fc
> rcx     0xc
> rax     0xc
> r8      0x202
> r9      0x2
> r10     0
> r11     0x57f79bf6968709d8
> r12     0x800015e874e0
> r13     0x27b3d6c24c3fab80
> r14     0x32
> r15     0x27b3d6c24c3fab80
> rip     0x81b9df22          setrunqueue+0xa2
> cs      0x8
> rflags  0x10207             __ALIGN_SIZE+0xf207
> rsp     0x800015ea0df0
> ss      0x10
>
>
> The offending instruction is in kern_sched.c:260:
>
>   spc->spc_nrun++;
>
> ... which indicates 'spc' is trash (and it is, based on %r13 above). In my
> tests, %r13 is always this same trash value. That comes from 'ci', which is
> either passed in or chosen by sched_choosecpu. Neither of these functions
> has changed recently, so I'm guessing this corruption is coming from
> something else.
>
>  Anyone have ideas where to start looking? I suppose I could start bisecting,
> but does anyone know of any changes that would affect this area?
>
>  I can send dmesgs if needed, but these are pretty standard VMs, nothing fancy
> configured in them. 4 CPUs, 8GB RAM, etc.
>
> -ml
>

Also I should note that the problem happens with snaps as well as kernels built
from source (-current), so this isn't likely something that is in snaps but not
yet in tree.

-ml



Re: kernel crash in setrunqueue

2020-07-29 Thread Mark Kettenis
> Date: Wed, 29 Jul 2020 13:03:43 -0700
> From: Mike Larkin 
> 
> Hi,
> 
>  I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This happens
> on GENERIC.MP regardless of whether or not the VM has one cpu or more than
> one. It does not happen on GENERIC kernels.
> 
>  The crash will happen fairly quickly after the kernel starts executing
> processes. Sometimes it crashes instantly, sometimes it lasts for a minute
> or two. It rarely makes it to the login prompt. The problem is 100%
> reproducible on two different VMs I have, running on two different
> hypervisors (Hyper-V and ESXi6.7U2).
> 
>  I first started noticing the problem on the 24th July snap, but TBH these
> machines were not frequently updated, so the previous snap I had installed
> might have been a couple months old. Whatever older snap was on them before
> worked fine.
> 
>  Since this is happening on two different machines with two different VMs,
> I'm gonna rule out hardware issues.
> 
>  Crash:
> 
> kernel: protection fault trap, code=0
> Stopped at  setrunqueue+0xa2:   addl    $0x1,0x288(%r13)
> 
>  Trace:
> ddb{2}> trace
> setrunqueue(27b3d6c24c3fab80, 800015e874e0,32) at setrunqueue+0xa2
> sched_barrier_task(800015f1a168) at sched_barrier_task+0x6c
> taskq_thread(82121548) at taskq_thread+0x8d
> end trace frame: 0x0, count: -3
> 
>  Registers:
> ddb{2}> sh r
> rdi     0x821ee728          sched_lock
> rsi     0x800014cc6ff0
> rbp     0x800015ea0e40
> rbx     0
> rdx     0x23ca94            acpi_pdirpa_0x2288fc
> rcx     0xc
> rax     0xc
> r8      0x202
> r9      0x2
> r10     0
> r11     0x57f79bf6968709d8
> r12     0x800015e874e0
> r13     0x27b3d6c24c3fab80
> r14     0x32
> r15     0x27b3d6c24c3fab80
> rip     0x81b9df22          setrunqueue+0xa2
> cs      0x8
> rflags  0x10207             __ALIGN_SIZE+0xf207
> rsp     0x800015ea0df0
> ss      0x10
> 
> 
> The offending instruction is in kern_sched.c:260:
> 
>   spc->spc_nrun++;
> 
> ... which indicates 'spc' is trash (and it is, based on %r13 above). In my
> tests, %r13 is always this same trash value. That comes from 'ci', which is
> either passed in or chosen by sched_choosecpu. Neither of these functions
> has changed recently, so I'm guessing this corruption is coming from
> something else.
> 
>  Anyone have ideas where to start looking? I suppose I could start bisecting,
> but does anyone know of any changes that would affect this area?
> 
>  I can send dmesgs if needed, but these are pretty standard VMs,
> nothing fancy configured in them. 4 CPUs, 8GB RAM, etc.

They're VMs and it turns out that many of the "PV" drivers are/were
using the intr_barrier() interface the wrong way.

For Hyper-V, see my reply in the "Panic on boot with Hyper-V since Jun
17 snapshot" thread on bugs@ from earlier today.

Cheers,

Mark



kernel crash in setrunqueue

2020-07-29 Thread Mike Larkin
Hi,

 I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This happens
on GENERIC.MP regardless of whether or not the VM has one cpu or more than
one. It does not happen on GENERIC kernels.

 The crash will happen fairly quickly after the kernel starts executing
processes. Sometimes it crashes instantly, sometimes it lasts for a minute
or two. It rarely makes it to the login prompt. The problem is 100%
reproducible on two different VMs I have, running on two different
hypervisors (Hyper-V and ESXi6.7U2).

 I first started noticing the problem on the 24th July snap, but TBH these
machines were not frequently updated, so the previous snap I had installed
might have been a couple months old. Whatever older snap was on them before
worked fine.

 Since this is happening on two different machines with two different VMs,
I'm gonna rule out hardware issues.

 Crash:

kernel: protection fault trap, code=0
Stopped at  setrunqueue+0xa2:   addl    $0x1,0x288(%r13)

 Trace:
ddb{2}> trace
setrunqueue(27b3d6c24c3fab80, 800015e874e0,32) at setrunqueue+0xa2
sched_barrier_task(800015f1a168) at sched_barrier_task+0x6c
taskq_thread(82121548) at taskq_thread+0x8d
end trace frame: 0x0, count: -3

 Registers:
ddb{2}> sh r
rdi     0x821ee728          sched_lock
rsi     0x800014cc6ff0
rbp     0x800015ea0e40
rbx     0
rdx     0x23ca94            acpi_pdirpa_0x2288fc
rcx     0xc
rax     0xc
r8      0x202
r9      0x2
r10     0
r11     0x57f79bf6968709d8
r12     0x800015e874e0
r13     0x27b3d6c24c3fab80
r14     0x32
r15     0x27b3d6c24c3fab80
rip     0x81b9df22          setrunqueue+0xa2
cs      0x8
rflags  0x10207             __ALIGN_SIZE+0xf207
rsp     0x800015ea0df0
ss      0x10


The offending instruction is in kern_sched.c:260:

  spc->spc_nrun++;

... which indicates 'spc' is trash (and it is, based on %r13 above). In my
tests, %r13 is always this same trash value. That comes from 'ci', which is
either passed in or chosen by sched_choosecpu. Neither of these functions
has changed recently, so I'm guessing this corruption is coming from something
else.

 Anyone have ideas where to start looking? I suppose I could start bisecting,
but does anyone know of any changes that would affect this area?

 I can send dmesgs if needed, but these are pretty standard VMs, nothing fancy
configured in them. 4 CPUs, 8GB RAM, etc.

-ml