Re: unkillable process consuming 100% cpu
On Mon, Nov 11, 2019 at 01:22:09PM +0100, Hans Petter Selasky wrote:
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index a6e0a16ae..0697d70f4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c

Are you using ports/graphics/drm-devel-kmod? This file does not exist
in drm-current-kmod.

> @@ -236,6 +238,12 @@ static int amdgpu_amdkfd_remove_eviction_fence(struct amdgpu_bo *bo,

Using 'nm *.ko | grep eviction_fence' in /boot/modules shows that none
of the modules contain amdgpu_amdkfd_remove_eviction_fence().

--
Steve
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: unkillable process consuming 100% cpu
On Wed, Nov 13, 2019 at 04:22:19PM +0100, Hans Petter Selasky wrote:
> On 2019-11-13 15:52, Steve Kargl wrote:
> >     at /usr/src/sys/amd64/amd64/trap.c:743
> > #7  0x808b0468 in trap (frame=0xfe00b460e0c0)
> >     at /usr/src/sys/amd64/amd64/trap.c:407
> > #8
> > #9  0x in ?? ()
> > #10 0x817d2c0f in radeon_ttm_tt_to_gtt (ttm=0xf80061eeb248)
> >     at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_ttm.c:720
> > #11 radeon_ttm_tt_set_userptr (ttm=0xf80061eeb248, addr=1,
> >     flags=2147483647)
>
> Hi,
>
> I don't see any function call here. Can you try to double check the
> backtrace?
>
> Which version of FreeBSD is this?
>

% uname -a (trimmed)
FreeBSD 13.0-CURRENT r353571

% kgdb /usr/lib/debug/boot/kernel/kernel.debug vmcore.2
% bt
...
#7  0x808b0468 in trap (frame=0xfe00b460e0c0)
    at /usr/src/sys/amd64/amd64/trap.c:407
#8
#9  0x in ?? ()
#10 0x817d2c0f in radeon_ttm_tt_to_gtt (ttm=0xf80061eeb248)
    at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_ttm.c:720
#11 radeon_ttm_tt_set_userptr (ttm=0xf80061eeb248, addr=1, flags=2147483647)
    at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_ttm.c:804
#12 0x817adc9b in radeon_is_px (dev=0xf8017fe84e00)
    at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_device.c:156

Looking at radeon_ttm.c, line 720 is the if-stmt in this function

static struct radeon_ttm_tt *radeon_ttm_tt_to_gtt(struct ttm_tt *ttm)
{
        if (!ttm || ttm->func != _backend_func)
                return NULL;
        return (struct radeon_ttm_tt *)ttm;
}

(kgdb) p ttm->func
$2 = (struct ttm_backend_func *) 0x231
(kgdb) p _backend_func
$4 = (struct ttm_backend_func *) 0x8186d870

AFAIK, 0x231 is not a valid address.

(kgdb) p *ttm
$5 = {bdev = 0x819021ef, func = 0x231, dummy_read_page = 0x0,
  pages = 0xf800612c, page_flags = 2173789980, num_pages = 0,
  sg = 0x0, glob = 0x2a, swap_storage = 0xf8017fe84e00,
  caching_state = (unknown: 145613312),
  state = (tt_unbound | tt_unpopulated | unknown: 4294965248)}

Moving to frame 12 suggests that the stack is corrupt (whether by the
dump or the crash I don't know).

(kgdb) frame 12
#12 0x817adc9b in radeon_is_px (dev=0xf8017fe84e00)
    at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_device.c:156
156             if (rdev->flags & RADEON_IS_PX)
(kgdb) p *dev
Cannot access memory at address 0xf8017fe84e00
(kgdb) p rdev
$25 = (struct radeon_device *) 0x0

--
Steve
Re: unkillable process consuming 100% cpu
On 2019-11-13 15:52, Steve Kargl wrote:
>     at /usr/src/sys/amd64/amd64/trap.c:743
> #7  0x808b0468 in trap (frame=0xfe00b460e0c0)
>     at /usr/src/sys/amd64/amd64/trap.c:407
> #8
> #9  0x in ?? ()
> #10 0x817d2c0f in radeon_ttm_tt_to_gtt (ttm=0xf80061eeb248)
>     at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_ttm.c:720
> #11 radeon_ttm_tt_set_userptr (ttm=0xf80061eeb248, addr=1,
>     flags=2147483647)

Hi,

I don't see any function call here. Can you try to double check the
backtrace?

Which version of FreeBSD is this?

--HPS
Re: unkillable process consuming 100% cpu
On Wed, Nov 13, 2019 at 09:10:06AM +0100, Hans Petter Selasky wrote:
> On 2019-11-13 01:30, Steve Kargl wrote:
> >
> > I installed the 2nd seqlock.diff, rebuilt
> > drm-current-kmod-4.16.g20191023, rebooted, and have been pounding
> > on the system with workloads that are similar to what the system
> > was doing during the lockups. So far, I cannot get the system to
> > lock up. Looks like your patch fixes (or at least helps). Thanks
> > for taking a look at the problem.
> >
>
> Can you apply the kdb.diff on top and check dmesg for prints?
>

I could not find the amdgpu_amdkfd_gpuvm.c file when I went looking.
Is it autogenerated?

I also spoke too soon. I got a panic after my reply above.

Fatal trap 12: page fault while in kernel mode
cpuid = 5; apic id = 15
fault virtual address   = 0x0
fault code              = supervisor read instruction, page not present
instruction pointer     = 0x20:0x0
stack pointer           = 0x28:0xfe00b460e188
frame pointer           = 0x28:0xfe00b460e1c0
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 877 (X:rcs0)
trap number             = 12
panic: page fault
cpuid = 5
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe00b460dde0
vpanic() at vpanic+0x17e/frame 0xfe00b460de40
panic() at panic+0x43/frame 0xfe00b460dea0
trap_fatal() at trap_fatal+0x388/frame 0xfe00b460df10
trap_pfault() at trap_pfault+0x4f/frame 0xfe00b460df80
trap() at trap+0x288/frame 0xfe00b460e0b0
calltrap() at calltrap+0x8/frame 0xfe00b460e0b0
--- trap 0xc, rip = 0, rsp = 0xfe00b460e188, rbp = 0xfe00b460e1c0 ---
??() at 0/frame 0xfe00b460e1c0
radeon_cs_ioctl() at radeon_cs_ioctl+0xa0b/frame 0xfe00b460e640
drm_ioctl_kernel() at drm_ioctl_kernel+0xf1/frame 0xfe00b460e680
drm_ioctl() at drm_ioctl+0x279/frame 0xfe00b460e770
linux_file_ioctl() at linux_file_ioctl+0x298/frame 0xfe00b460e7d0
kern_ioctl() at kern_ioctl+0x284/frame 0xfe00b460e840
sys_ioctl() at sys_ioctl+0x157/frame 0xfe00b460e910
amd64_syscall() at amd64_syscall+0x273/frame 0xfe00b460ea30
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfe00b460ea30
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x200cc6bfa, rsp = 0x7fffbfffde98, rbp = 0x7fffbfffdec0 ---
Uptime: 5h9m5s
Dumping 1472 out of 16327 MB:..2%..11%..21%..31%..41%..52%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
warning: Source file is more recent than executable.
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:392
#2  0x805de452 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:479
#3  0x805de8a6 in vpanic (fmt=, ap=)
    at /usr/src/sys/kern/kern_shutdown.c:908
#4  0x805de6c3 in panic (fmt=)
    at /usr/src/sys/kern/kern_shutdown.c:835
#5  0x808b0d58 in trap_fatal (frame=0xfe00b460e0c0, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:925
#6  0x808b0daf in trap_pfault (frame=0xfe00b460e0c0, usermode=,
    signo=, ucode=) at /usr/src/sys/amd64/amd64/trap.c:743
#7  0x808b0468 in trap (frame=0xfe00b460e0c0)
    at /usr/src/sys/amd64/amd64/trap.c:407
#8
#9  0x in ?? ()
#10 0x817d2c0f in radeon_ttm_tt_to_gtt (ttm=0xf80061eeb248)
    at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_ttm.c:720
#11 radeon_ttm_tt_set_userptr (ttm=0xf80061eeb248, addr=1, flags=2147483647)
    at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_ttm.c:804
#12 0x817adc9b in radeon_is_px (dev=0xf8017fe84e00)
    at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_device.c:156
#13 0x818a9e81 in drm_ioctl_kernel (linux_file=, func=0xfe00b460e428,
    kdata=0xfe00b31eb000, flags=1521620552)
    at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/drm_ioctl.c:760
#14 0x818aa129 in drm_ioctl (filp=0xf80061198e00, cmd=, arg=65536)
    at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/drm_ioctl.c:856
#15 0x807c8098 in linux_file_ioctl_sub (fp=, filp=, fop=, cmd=, data=, td=)
    at /usr/src/sys/compat/linuxkpi/common/src/linux_compat.c:965
#16 linux_file_ioctl (fp=, cmd=, data=, cred=, td=0xf800612c)
    at /usr/src/sys/compat/linuxkpi/common/src/linux_compat.c:1558
#17 0x8063ed34 in fo_ioctl (fp=, com=3223348326, data=0x7fff,
    active_cred=0xfe001f7e6250, td=0xf800612c)
    at /usr/src/sys/sys/file.h:340
#18 kern_ioctl (td=, fd=9, com=3223348326, data=0x7fff )
    at /usr/src/sys/kern/sys_generic.c:801
#19 0x8063ea37 in sys_ioctl
possible bug in devstat_selectdevs()
I wonder if anyone remembers the devstat code well enough to help me
or, at least, to sanity check my line of thinking.

I am looking at a crash that happened in
devstat_selectdevs(num_selections=27, numdevs=25). At the time of the
crash, logical volumes on a RAID controller were being reconfigured,
so "disks" were coming and going.

The first relevant block of code in the function is:

        /*
         * In this case, we have selected devices before, but the device
         * list has changed since we last selected devices, so we need to
         * either enlarge or reduce the size of the device selection list.
         */
        } else if (*num_selections != numdevs) {
                *dev_select = (struct device_selection *)reallocf(*dev_select,
                        numdevs * sizeof(struct device_selection));
                *select_generation = current_generation;
                init_selections = 1;

So, the dev_select array is realloc-ed to have space for numdevs
elements.

Then we have this:

        if (((init_selected_var != 0) || (init_selections != 0)
         || (perf_select != 0)) && (changed == 0)){
                old_dev_select = (struct device_selection *)malloc(
                    *num_selections * sizeof(struct device_selection));
                if (old_dev_select == NULL) {
                        snprintf(devstat_errbuf, sizeof(devstat_errbuf),
                            "%s: Cannot allocate memory for selection list backup",
                            __func__);
                        return(-1);
                }
                old_num_selections = *num_selections;
==>             bcopy(*dev_select, old_dev_select,
                    sizeof(struct device_selection) * *num_selections);
        }

The crash happened in the bcopy() call. So, we are trying to copy
num_selections (I omit pointer dereferencing) elements from the
dev_select array. But in the previous block we resized the array to
numdevs elements, and in this case numdevs (25) is less than
num_selections (27), so the copy reads past the end of the resized
array.

The code is quite unfamiliar to me. My first instinct is to just clamp
the copy size, but I am not sure if that would be the right thing.
Maybe the realloc of dev_select should be done after bcopy-ing out of
the array? Or maybe it's okay to realloc only if the size is going up?

Any help is appreciated. Thank you very much in advance!
--
Andriy Gapon
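The ordering question in the message above can be illustrated with a minimal sketch. It is not the libdevstat code: struct selection, clamp_count() and backup_then_resize() are hypothetical names made up for illustration, and plain realloc() is used instead of reallocf() so the example builds outside FreeBSD. It shows two of the options mentioned (clamping the copy size, and taking the backup before the array is resized); whether either is the right fix for devstat_selectdevs() is left open, as in the original message.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct device_selection; fields are illustrative. */
struct selection {
	int device_number;
	int selected;
};

/*
 * Option 1: clamp. Returns how many entries may be copied out of an
 * array that currently holds `capacity` entries when `wanted` entries
 * were requested.
 */
size_t
clamp_count(size_t wanted, size_t capacity)
{
	return (wanted < capacity ? wanted : capacity);
}

/*
 * Option 2: back up first, resize second. Copies all old_count entries
 * of *list into a newly allocated *backup while *list still has its old
 * size, and only then resizes *list to new_count entries.
 *
 * Returns 0 on success. Returns -1 on allocation failure, in which case
 * *list is left untouched and *backup is NULL.
 */
int
backup_then_resize(struct selection **list, size_t old_count,
    size_t new_count, struct selection **backup)
{
	struct selection *copy, *resized;

	assert(list != NULL && *list != NULL && backup != NULL);
	assert(old_count > 0 && new_count > 0);

	*backup = NULL;
	copy = malloc(old_count * sizeof(*copy));
	if (copy == NULL)
		return (-1);

	/* *list still holds old_count entries here, so this read is in bounds. */
	memcpy(copy, *list, old_count * sizeof(*copy));

	resized = realloc(*list, new_count * sizeof(*resized));
	if (resized == NULL) {
		free(copy);
		return (-1);
	}
	*list = resized;
	*backup = copy;
	return (0);
}
```

With the numbers from the crash (27 old entries, 25 new ones), clamp_count(27, 25) limits a copy to 25 entries, while backup_then_resize() keeps all 27 old entries in the backup because the copy happens before the array shrinks. The trade-off is that clamping silently drops the last two old entries from the backup.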
Re: unkillable process consuming 100% cpu
On 2019-11-13 01:30, Steve Kargl wrote:
On Tue, Nov 12, 2019 at 06:48:22PM +0100, Hans Petter Selasky wrote:
On 2019-11-12 18:31, Steve Kargl wrote:

Can you open the radeonkms.ko in gdb83 from ports and type:

l *(radeon_gem_busy_ioctl+0x30)

% /boot/modules/radeonkms.ko
(gdb) l *(radeon_gem_busy_ioctl+0x30)
0xa12b0 is in radeon_gem_busy_ioctl (/usr/ports/graphics/drm-current-kmod/work/kms-drm-2d2852e/drivers/gpu/drm/radeon/radeon_gem.c:453).
448     /usr/ports/graphics/drm-current-kmod/work/kms-drm-2d2852e/drivers/gpu/drm/radeon/radeon_gem.c: No such file or directory.
(gdb)

Like expected.

I installed the 2nd seqlock.diff, rebuilt
drm-current-kmod-4.16.g20191023, rebooted, and have been pounding on
the system with workloads that are similar to what the system was
doing during the lockups. So far, I cannot get the system to lock up.
Looks like your patch fixes (or at least helps). Thanks for taking a
look at the problem.

Can you apply the kdb.diff on top and check dmesg for prints?

--HPS