Re: kernel crash in setrunqueue
On Wed, Jul 29, 2020 at 10:14:11PM +0200, Mark Kettenis wrote:
> > Date: Wed, 29 Jul 2020 13:03:43 -0700
> > From: Mike Larkin
> >
> > Hi,
> >
> > I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This
> > happens on GENERIC.MP regardless of whether the VM has one CPU or
> > more than one. It does not happen on GENERIC kernels.
> >
> > The crash will happen fairly quickly after the kernel starts
> > executing processes. Sometimes it crashes instantly, sometimes it
> > lasts for a minute or two. It rarely makes it to the login prompt.
> > The problem is 100% reproducible on two different VMs I have,
> > running on two different hypervisors (Hyper-V and ESXi 6.7U2).
> >
> > I first started noticing the problem on the 24th July snap, but TBH
> > these machines were not frequently updated, so the previous snap I
> > had installed might have been a couple of months old. Whatever older
> > snap was on them before worked fine.
> >
> > Since this is happening on two different machines with two different
> > VMs, I'm going to rule out hardware issues.
> >
> > Crash:
> >
> > kernel: protection fault trap, code=0
> > Stopped at      setrunqueue+0xa2:       addl    $0x1,0x288(%r13)
> >
> > Trace:
> >
> > ddb{2}> trace
> > setrunqueue(27b3d6c24c3fab80,800015e874e0,32) at setrunqueue+0xa2
> > sched_barrier_task(800015f1a168) at sched_barrier_task+0x6c
> > taskq_thread(82121548) at taskq_thread+0x8d
> > end trace frame: 0x0, count: -3
> >
> > Registers:
> >
> > ddb{2}> sh r
> > rdi               0x821ee728    sched_lock
> > rsi           0x800014cc6ff0
> > rbp           0x800015ea0e40
> > rbx                        0
> > rdx                 0x23ca94    acpi_pdirpa+0x2288fc
> > rcx                      0xc
> > rax                      0xc
> > r8                     0x202
> > r9                       0x2
> > r10                        0
> > r11       0x57f79bf6968709d8
> > r12           0x800015e874e0
> > r13       0x27b3d6c24c3fab80
> > r14                     0x32
> > r15       0x27b3d6c24c3fab80
> > rip           0x81b9df22    setrunqueue+0xa2
> > cs                       0x8
> > rflags               0x10207    __ALIGN_SIZE+0xf207
> > rsp           0x800015ea0df0
> > ss                      0x10
> >
> > The offending instruction is in kern_sched.c:260:
> >
> >         spc->spc_nrun++;
> >
> > ... which indicates 'spc' is trash (and it is, based on %r13 above).
> > In my tests, %r13 is always this same trash value. That comes from
> > 'ci', which is either passed in or chosen by sched_choosecpu().
> > Neither of these functions has changed recently, so I'm guessing
> > this corruption is coming from something else.
> >
> > Anyone have ideas where to start looking? I suppose I could start
> > bisecting, but does anyone know of any changes that would affect
> > this area?
> >
> > I can send dmesgs if needed, but these are pretty standard VMs,
> > nothing fancy configured in them. 4 CPUs, 8GB RAM, etc.
>
> They're VMs and it turns out that many of the "PV" drivers are/were
> using the intr_barrier() interface the wrong way.
>
> For Hyper-V, see my reply in the "Panic on boot with Hyper-V since Jun
> 17 snapshot" thread on bugs@ from earlier today.
>
> Cheers,
>
> Mark

Thanks. I don't subscribe to bugs@ anymore, so that's likely why I missed
it.

-ml
Re: kernel crash in setrunqueue
On Wed, Jul 29, 2020 at 01:03:43PM -0700, Mike Larkin wrote:
> Hi,
>
> I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This
> happens on GENERIC.MP regardless of whether the VM has one CPU or more
> than one. It does not happen on GENERIC kernels.
>
> The crash will happen fairly quickly after the kernel starts executing
> processes. Sometimes it crashes instantly, sometimes it lasts for a
> minute or two. It rarely makes it to the login prompt. The problem is
> 100% reproducible on two different VMs I have, running on two different
> hypervisors (Hyper-V and ESXi 6.7U2).
>
> I first started noticing the problem on the 24th July snap, but TBH
> these machines were not frequently updated, so the previous snap I had
> installed might have been a couple of months old. Whatever older snap
> was on them before worked fine.
>
> Since this is happening on two different machines with two different
> VMs, I'm going to rule out hardware issues.
>
> Crash:
>
> kernel: protection fault trap, code=0
> Stopped at      setrunqueue+0xa2:       addl    $0x1,0x288(%r13)
>
> Trace:
>
> ddb{2}> trace
> setrunqueue(27b3d6c24c3fab80,800015e874e0,32) at setrunqueue+0xa2
> sched_barrier_task(800015f1a168) at sched_barrier_task+0x6c
> taskq_thread(82121548) at taskq_thread+0x8d
> end trace frame: 0x0, count: -3
>
> Registers:
>
> ddb{2}> sh r
> rdi               0x821ee728    sched_lock
> rsi           0x800014cc6ff0
> rbp           0x800015ea0e40
> rbx                        0
> rdx                 0x23ca94    acpi_pdirpa+0x2288fc
> rcx                      0xc
> rax                      0xc
> r8                     0x202
> r9                       0x2
> r10                        0
> r11       0x57f79bf6968709d8
> r12           0x800015e874e0
> r13       0x27b3d6c24c3fab80
> r14                     0x32
> r15       0x27b3d6c24c3fab80
> rip           0x81b9df22    setrunqueue+0xa2
> cs                       0x8
> rflags               0x10207    __ALIGN_SIZE+0xf207
> rsp           0x800015ea0df0
> ss                      0x10
>
> The offending instruction is in kern_sched.c:260:
>
>         spc->spc_nrun++;
>
> ... which indicates 'spc' is trash (and it is, based on %r13 above).
> In my tests, %r13 is always this same trash value. That comes from
> 'ci', which is either passed in or chosen by sched_choosecpu().
> Neither of these functions has changed recently, so I'm guessing this
> corruption is coming from something else.
>
> Anyone have ideas where to start looking? I suppose I could start
> bisecting, but does anyone know of any changes that would affect this
> area?
>
> I can send dmesgs if needed, but these are pretty standard VMs, nothing
> fancy configured in them. 4 CPUs, 8GB RAM, etc.
>
> -ml

Also I should note that the problem happens with snaps as well as kernels
built from source (-current), so this isn't likely something that is in
snaps but not yet in tree.

-ml
Re: kernel crash in setrunqueue
> Date: Wed, 29 Jul 2020 13:03:43 -0700
> From: Mike Larkin
>
> Hi,
>
> I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This
> happens on GENERIC.MP regardless of whether the VM has one CPU or more
> than one. It does not happen on GENERIC kernels.
>
> The crash will happen fairly quickly after the kernel starts executing
> processes. Sometimes it crashes instantly, sometimes it lasts for a
> minute or two. It rarely makes it to the login prompt. The problem is
> 100% reproducible on two different VMs I have, running on two different
> hypervisors (Hyper-V and ESXi 6.7U2).
>
> I first started noticing the problem on the 24th July snap, but TBH
> these machines were not frequently updated, so the previous snap I had
> installed might have been a couple of months old. Whatever older snap
> was on them before worked fine.
>
> Since this is happening on two different machines with two different
> VMs, I'm going to rule out hardware issues.
>
> Crash:
>
> kernel: protection fault trap, code=0
> Stopped at      setrunqueue+0xa2:       addl    $0x1,0x288(%r13)
>
> Trace:
>
> ddb{2}> trace
> setrunqueue(27b3d6c24c3fab80,800015e874e0,32) at setrunqueue+0xa2
> sched_barrier_task(800015f1a168) at sched_barrier_task+0x6c
> taskq_thread(82121548) at taskq_thread+0x8d
> end trace frame: 0x0, count: -3
>
> Registers:
>
> ddb{2}> sh r
> rdi               0x821ee728    sched_lock
> rsi           0x800014cc6ff0
> rbp           0x800015ea0e40
> rbx                        0
> rdx                 0x23ca94    acpi_pdirpa+0x2288fc
> rcx                      0xc
> rax                      0xc
> r8                     0x202
> r9                       0x2
> r10                        0
> r11       0x57f79bf6968709d8
> r12           0x800015e874e0
> r13       0x27b3d6c24c3fab80
> r14                     0x32
> r15       0x27b3d6c24c3fab80
> rip           0x81b9df22    setrunqueue+0xa2
> cs                       0x8
> rflags               0x10207    __ALIGN_SIZE+0xf207
> rsp           0x800015ea0df0
> ss                      0x10
>
> The offending instruction is in kern_sched.c:260:
>
>         spc->spc_nrun++;
>
> ... which indicates 'spc' is trash (and it is, based on %r13 above).
> In my tests, %r13 is always this same trash value. That comes from
> 'ci', which is either passed in or chosen by sched_choosecpu().
> Neither of these functions has changed recently, so I'm guessing this
> corruption is coming from something else.
>
> Anyone have ideas where to start looking? I suppose I could start
> bisecting, but does anyone know of any changes that would affect this
> area?
>
> I can send dmesgs if needed, but these are pretty standard VMs,
> nothing fancy configured in them. 4 CPUs, 8GB RAM, etc.

They're VMs and it turns out that many of the "PV" drivers are/were
using the intr_barrier() interface the wrong way.

For Hyper-V, see my reply in the "Panic on boot with Hyper-V since Jun
17 snapshot" thread on bugs@ from earlier today.

Cheers,

Mark
kernel crash in setrunqueue
Hi,

I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This
happens on GENERIC.MP regardless of whether the VM has one CPU or more
than one. It does not happen on GENERIC kernels.

The crash will happen fairly quickly after the kernel starts executing
processes. Sometimes it crashes instantly, sometimes it lasts for a
minute or two. It rarely makes it to the login prompt. The problem is
100% reproducible on two different VMs I have, running on two different
hypervisors (Hyper-V and ESXi 6.7U2).

I first started noticing the problem on the 24th July snap, but TBH
these machines were not frequently updated, so the previous snap I had
installed might have been a couple of months old. Whatever older snap
was on them before worked fine.

Since this is happening on two different machines with two different
VMs, I'm going to rule out hardware issues.

Crash:

kernel: protection fault trap, code=0
Stopped at      setrunqueue+0xa2:       addl    $0x1,0x288(%r13)

Trace:

ddb{2}> trace
setrunqueue(27b3d6c24c3fab80,800015e874e0,32) at setrunqueue+0xa2
sched_barrier_task(800015f1a168) at sched_barrier_task+0x6c
taskq_thread(82121548) at taskq_thread+0x8d
end trace frame: 0x0, count: -3

Registers:

ddb{2}> sh r
rdi               0x821ee728    sched_lock
rsi           0x800014cc6ff0
rbp           0x800015ea0e40
rbx                        0
rdx                 0x23ca94    acpi_pdirpa+0x2288fc
rcx                      0xc
rax                      0xc
r8                     0x202
r9                       0x2
r10                        0
r11       0x57f79bf6968709d8
r12           0x800015e874e0
r13       0x27b3d6c24c3fab80
r14                     0x32
r15       0x27b3d6c24c3fab80
rip           0x81b9df22    setrunqueue+0xa2
cs                       0x8
rflags               0x10207    __ALIGN_SIZE+0xf207
rsp           0x800015ea0df0
ss                      0x10

The offending instruction is in kern_sched.c:260:

        spc->spc_nrun++;

... which indicates 'spc' is trash (and it is, based on %r13 above). In
my tests, %r13 is always this same trash value. That comes from 'ci',
which is either passed in or chosen by sched_choosecpu(). Neither of
these functions has changed recently, so I'm guessing this corruption is
coming from something else.

Anyone have ideas where to start looking? I suppose I could start
bisecting, but does anyone know of any changes that would affect this
area?

I can send dmesgs if needed, but these are pretty standard VMs, nothing
fancy configured in them. 4 CPUs, 8GB RAM, etc.

-ml