On Wed, Jul 29, 2020 at 01:03:43PM -0700, Mike Larkin wrote:
> Hi,
>
> I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This happens
> on GENERIC.MP regardless of whether or not the VM has one cpu or more than
> one. It does not happen on GENERIC kernels.
>
> The crash will happen fairly quickly after the kernel starts executing
> processes. Sometimes it crashes instantly, sometimes it lasts for a minute
> or two. It rarely makes it to the login prompt. The problem is 100%
> reproducible on two different VMs I have, running on two different
> hypervisors (Hyper-V and ESXi6.7U2).
>
> I first started noticing the problem on the 24th July snap, but TBH these
> machines were not frequently updated, so the previous snap I had installed
> might have been a couple months old. Whatever older snap was on them before
> worked fine.
>
> Since this is happening on two different machines with two different VMs,
> I'm gonna rule out hardware issues.
>
> Crash:
>
> kernel: protection fault trap, code=0
> Stopped at setrunqueue+0xa2: addl $0x1,0x288(%r13)
>
> Trace:
> ddb{2}> trace
> setrunqueue(27b3d6c24c3fab80, ffff800015e874e0,32) at setrunqueue+0xa2
> sched_barrier_task(ffff800015f1a168) at sched_barrier_task+0x6c
> taskq_thread(ffffffff82121548) at taskq_thread+0x8d
> end trace frame: 0x0, count: -3
>
> Registers:
> ddb{2}> sh r
> rdi     0xffffffff821ee728 sched_lock
> rsi     0xffff800014cc6ff0
> rbp     0xffff800015ea0e40
> rbx     0
> rdx     0x23ca94 acpi_pdirpa_0x2288fc
> rcx     0xc
> rax     0xc
> r8      0x202
> r9      0x2
> r10     0
> r11     0x57f79bf6968709d8
> r12     0xffff800015e874e0
> r13     0x27b3d6c24c3fab80
> r14     0x32
> r15     0x27b3d6c24c3fab80
> rip     0xffffffff81b9df22 setrunqueue+0xa2
> cs      0x8
> rflags  0x10207 __ALIGN_SIZE+0xf207
> rsp     0xffff800015ea0df0
> ss      0x10
>
>
> The offending instruction is in kern_sched.c:260:
>
> spc->spc_nrun++;
>
> ... which indicates 'spc' is trash (and it is, based on %r13 above). In my
> tests, %r13 is always this same trash value. That comes from 'ci', which is
> either passed in or chosen by sched_choosecpu. Neither of these functions
> has changed recently, so I'm guessing this corruption is coming from
> something else.
>
> Anyone have ideas where to start looking? I suppose I could start bisecting,
> but does anyone know of any changes that would affect this area?
>
> I can send dmesgs if needed, but these are pretty standard VMs, nothing fancy
> configured in them. 4 CPUs, 8GB RAM, etc.
>
> -ml
>

Also I should note that the problem happens with snaps as well as kernels
built from source (-current), so this isn't likely something that is in
snaps but not yet in tree.

-ml
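
For readers following along: one way to see that the %r13 value from the register dump is "trash" is to check whether it is a canonical amd64 virtual address. Dereferencing a non-canonical address raises a general protection fault rather than a page fault, which fits the "protection fault trap" in the report. The sketch below is not from the thread. The helper name `is_canonical` is hypothetical, and the 48-bit virtual address width is an assumption (4-level paging); only the register values come from the ddb output.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if addr is a canonical amd64 virtual address.

    Assumes va_bits implemented address bits (48 with 4-level paging).
    An address is canonical when bits 63..(va_bits - 1) are all zeros
    or all ones, i.e. the top bits sign-extend bit (va_bits - 1).
    """
    top = addr >> (va_bits - 1)                # bits 63..47 for 48-bit VA
    all_ones = (1 << (64 - va_bits + 1)) - 1   # 17 one-bits for 48-bit VA
    return top == 0 or top == all_ones


# Register values copied from the ddb "sh r" output in the report.
registers = {
    "rdi": 0xffffffff821ee728,
    "r12": 0xffff800015e874e0,
    "r13": 0x27b3d6c24c3fab80,
    "rsp": 0xffff800015ea0df0,
}

for name, value in registers.items():
    status = "canonical" if is_canonical(value) else "NOT canonical"
    print(f"{name} {value:#018x} {status}")
```

Under that assumption, only r13 fails the check; the other pointers sit in the kernel half of the address space. This says nothing about where the bad value comes from, only that it cannot be a valid pointer.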