Re: unkillable process consuming 100% cpu

2019-11-13 Thread Steve Kargl
On Mon, Nov 11, 2019 at 01:22:09PM +0100, Hans Petter Selasky wrote:

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index a6e0a16ae..0697d70f4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c

Are you using ports/graphics/drm-devel-kmod?
This file does not exist in drm-current-kmod.

> @@ -236,6 +238,12 @@ static int amdgpu_amdkfd_remove_eviction_fence(struct 
> amdgpu_bo *bo,

Using 'nm *.ko | grep eviction_fence' in /boot/modules shows
that none of the modules contain amdgpu_amdkfd_remove_eviction_fence().

-- 
Steve
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: unkillable process consuming 100% cpu

2019-11-13 Thread Steve Kargl
On Wed, Nov 13, 2019 at 04:22:19PM +0100, Hans Petter Selasky wrote:
> On 2019-11-13 15:52, Steve Kargl wrote:
> >  at /usr/src/sys/amd64/amd64/trap.c:743
> > #7  0x808b0468 in trap (frame=0xfe00b460e0c0)
> >  at /usr/src/sys/amd64/amd64/trap.c:407
> > #8  
> > #9  0x in ?? ()
> > #10 0x817d2c0f in radeon_ttm_tt_to_gtt (ttm=0xf80061eeb248)
> >  at 
> > /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_ttm.c:720
> > #11 radeon_ttm_tt_set_userptr (ttm=0xf80061eeb248, addr=1,
> >  flags=2147483647)
> 
> Hi,
> 
> I don't see any function call here. Can you try to double check the 
> backtrace?
> 
> Which version of FreeBSD is this?
> 

% uname -a (trimmed)
FreeBSD 13.0-CURRENT r353571

% kgdb /usr/lib/debug/boot/kernel/kernel.debug vmcore.2
% bt
...
#7  0x808b0468 in trap (frame=0xfe00b460e0c0)
at /usr/src/sys/amd64/amd64/trap.c:407
#8  
#9  0x in ?? ()
#10 0x817d2c0f in radeon_ttm_tt_to_gtt (ttm=0xf80061eeb248)
at 
/usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_ttm.c:720
#11 radeon_ttm_tt_set_userptr (ttm=0xf80061eeb248, addr=1, 
flags=2147483647)
at 
/usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_ttm.c:804
#12 0x817adc9b in radeon_is_px (dev=0xf8017fe84e00)
at 
/usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_device.c:156

Looking at radeon_ttm.c, line 720 is the if-stmt in this function

static struct radeon_ttm_tt *radeon_ttm_tt_to_gtt(struct ttm_tt *ttm)
{
 if (!ttm || ttm->func != &radeon_backend_func)
  return NULL;
 return (struct radeon_ttm_tt *)ttm;
}

(kgdb) p ttm->func
$2 = (struct ttm_backend_func *) 0x231
(kgdb) p &radeon_backend_func
$4 = (struct ttm_backend_func *) 0x8186d870 

AFAIK, 0x231 is not a valid address.

(kgdb) p *ttm
$5 = {bdev = 0x819021ef, func = 0x231, dummy_read_page = 0x0, 
  pages = 0xf800612c, page_flags = 2173789980, num_pages = 0, 
  sg = 0x0, glob = 0x2a, swap_storage = 0xf8017fe84e00, 
  caching_state = (unknown: 145613312), 
  state = (tt_unbound | tt_unpopulated | unknown: 4294965248)}

Moving to frame 12 suggests that the stack is corrupt (whether
by the dump or the crash I don't know)

(kgdb) frame 12
#12 0x817adc9b in radeon_is_px (dev=0xf8017fe84e00)
at 
/usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_device.c:156
156 if (rdev->flags & RADEON_IS_PX)
(kgdb) p *dev
Cannot access memory at address 0xf8017fe84e00
(kgdb) p rdev
$25 = (struct radeon_device *) 0x0


-- 
Steve


Re: unkillable process consuming 100% cpu

2019-11-13 Thread Hans Petter Selasky

On 2019-11-13 15:52, Steve Kargl wrote:

 at /usr/src/sys/amd64/amd64/trap.c:743
#7  0x808b0468 in trap (frame=0xfe00b460e0c0)
 at /usr/src/sys/amd64/amd64/trap.c:407
#8  
#9  0x in ?? ()
#10 0x817d2c0f in radeon_ttm_tt_to_gtt (ttm=0xf80061eeb248)
 at 
/usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_ttm.c:720
#11 radeon_ttm_tt_set_userptr (ttm=0xf80061eeb248, addr=1,
 flags=2147483647)


Hi,

I don't see any function call here. Can you try to double check the 
backtrace?


Which version of FreeBSD is this?

--HPS


Re: unkillable process consuming 100% cpu

2019-11-13 Thread Steve Kargl
On Wed, Nov 13, 2019 at 09:10:06AM +0100, Hans Petter Selasky wrote:
> On 2019-11-13 01:30, Steve Kargl wrote:
> > 
> > I installed the 2nd seqlock.diff, rebuilt drm-current-kmod-4.16.g20191023,
> > rebooted, and have been pounding on the system with workloads that are
> > similar to what the system was doing during the lockups.  So far, I
> > cannot get the system to lock up.  Looks like your patch fixes it (or at
> > least helps).  Thanks for taking a look at the problem.
> > 
> 
> Can you apply the kdb.diff on top and check dmesg for prints?
> 

I could not find the amdgpu_amdkfd_gpuvm.c file when I went looking.
Is it autogenerated?

I also spoke too soon. I got a panic after my reply above.

Fatal trap 12: page fault while in kernel mode
cpuid = 5; apic id = 15
fault virtual address   = 0x0
fault code  = supervisor read instruction, page not present
instruction pointer = 0x20:0x0
stack pointer   = 0x28:0xfe00b460e188
frame pointer   = 0x28:0xfe00b460e1c0
code segment= base 0x0, limit 0xf, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags= interrupt enabled, resume, IOPL = 0
current process = 877 (X:rcs0)
trap number = 12
panic: page fault
cpuid = 5

db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe00b460dde0
vpanic() at vpanic+0x17e/frame 0xfe00b460de40
panic() at panic+0x43/frame 0xfe00b460dea0
trap_fatal() at trap_fatal+0x388/frame 0xfe00b460df10
trap_pfault() at trap_pfault+0x4f/frame 0xfe00b460df80
trap() at trap+0x288/frame 0xfe00b460e0b0
calltrap() at calltrap+0x8/frame 0xfe00b460e0b0
--- trap 0xc, rip = 0, rsp = 0xfe00b460e188, rbp = 0xfe00b460e1c0 ---
??() at 0/frame 0xfe00b460e1c0
radeon_cs_ioctl() at radeon_cs_ioctl+0xa0b/frame 0xfe00b460e640
drm_ioctl_kernel() at drm_ioctl_kernel+0xf1/frame 0xfe00b460e680
drm_ioctl() at drm_ioctl+0x279/frame 0xfe00b460e770
linux_file_ioctl() at linux_file_ioctl+0x298/frame 0xfe00b460e7d0
kern_ioctl() at kern_ioctl+0x284/frame 0xfe00b460e840
sys_ioctl() at sys_ioctl+0x157/frame 0xfe00b460e910
amd64_syscall() at amd64_syscall+0x273/frame 0xfe00b460ea30
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfe00b460ea30
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x200cc6bfa, rsp = 
0x7fffbfffde98, rbp = 0x7fffbfffdec0 ---
Uptime: 5h9m5s
Dumping 1472 out of 16327 MB:..2%..11%..21%..31%..41%..52%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
warning: Source file is more recent than executable.
55  __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct 
pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:392
#2  0x805de452 in kern_reboot (howto=260)
at /usr/src/sys/kern/kern_shutdown.c:479
#3  0x805de8a6 in vpanic (fmt=, ap=)
at /usr/src/sys/kern/kern_shutdown.c:908
#4  0x805de6c3 in panic (fmt=)
at /usr/src/sys/kern/kern_shutdown.c:835
#5  0x808b0d58 in trap_fatal (frame=0xfe00b460e0c0, eva=0)
at /usr/src/sys/amd64/amd64/trap.c:925
#6  0x808b0daf in trap_pfault (frame=0xfe00b460e0c0, 
usermode=, signo=, ucode=)
at /usr/src/sys/amd64/amd64/trap.c:743
#7  0x808b0468 in trap (frame=0xfe00b460e0c0)
at /usr/src/sys/amd64/amd64/trap.c:407
#8  
#9  0x in ?? ()
#10 0x817d2c0f in radeon_ttm_tt_to_gtt (ttm=0xf80061eeb248)
at 
/usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_ttm.c:720
#11 radeon_ttm_tt_set_userptr (ttm=0xf80061eeb248, addr=1, 
flags=2147483647)
at 
/usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_ttm.c:804
#12 0x817adc9b in radeon_is_px (dev=0xf8017fe84e00)
at 
/usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/radeon/radeon_device.c:156
#13 0x818a9e81 in drm_ioctl_kernel (linux_file=, 
func=0xfe00b460e428, kdata=0xfe00b31eb000, flags=1521620552)
at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/drm_ioctl.c:760
#14 0x818aa129 in drm_ioctl (filp=0xf80061198e00, 
cmd=, arg=65536)
at /usr/local/sys/modules/drm-current-kmod/drivers/gpu/drm/drm_ioctl.c:856
#15 0x807c8098 in linux_file_ioctl_sub (fp=, 
filp=, fop=, cmd=, 
data=, td=)
at /usr/src/sys/compat/linuxkpi/common/src/linux_compat.c:965
#16 linux_file_ioctl (fp=, cmd=, 
data=, cred=, td=0xf800612c)
at /usr/src/sys/compat/linuxkpi/common/src/linux_compat.c:1558
#17 0x8063ed34 in fo_ioctl (fp=, com=3223348326, 
data=0x7fff, active_cred=0xfe001f7e6250, td=0xf800612c)
at /usr/src/sys/sys/file.h:340
#18 kern_ioctl (td=, fd=9, com=3223348326, 
data=0x7fff )
at /usr/src/sys/kern/sys_generic.c:801
#19 0x8063ea37 in sys_ioctl 

possible bug in devstat_selectdevs()

2019-11-13 Thread Andriy Gapon


I wonder if anyone remembers devstat code enough to help me or, at least, to
sanity check my line of thinking.

I am looking at a crash that happened in devstat_selectdevs(num_selections=27,
numdevs=25).  At the time of the crash there was some reconfiguration of logical
volumes on a RAID controller, so "disks" were coming and going.

The first relevant block of code in the function is:
    /*
     * In this case, we have selected devices before, but the device
     * list has changed since we last selected devices, so we need to
     * either enlarge or reduce the size of the device selection list.
     */
    } else if (*num_selections != numdevs) {
        *dev_select = (struct device_selection *)reallocf(*dev_select,
            numdevs * sizeof(struct device_selection));
        *select_generation = current_generation;
        init_selections = 1;
So, dev_select array is realloc-ed to have space for numdevs elements.
Then we have this:
    if (((init_selected_var != 0) || (init_selections != 0)
     || (perf_select != 0)) && (changed == 0)){
        old_dev_select = (struct device_selection *)malloc(
            *num_selections * sizeof(struct device_selection));
        if (old_dev_select == NULL) {
            snprintf(devstat_errbuf, sizeof(devstat_errbuf),
                 "%s: Cannot allocate memory for selection list backup",
                 __func__);
            return(-1);
        }
        old_num_selections = *num_selections;
==>     bcopy(*dev_select, old_dev_select,
            sizeof(struct device_selection) * *num_selections);
    }

The crash happened in the bcopy() call.
So, we are trying to copy num_selections (I omit pointer dereferencing) elements
from the dev_select array.  But in the previous block we resized the array to
numdevs, and in this case numdevs (25) is less than num_selections (27), so
bcopy() reads two elements past the end of the reallocated array.

The code is quite unfamiliar to me.  My first instinct is to just clamp the copy
size, but I am not sure if that would be the right thing.
Maybe realloc of dev_select should be done after bcopy-ing out of the array?
Or maybe it's okay to realloc only if the size is going up?
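
For illustration, the "copy out before resizing" option can be sketched in
isolation as below.  This is a minimal sketch under stated assumptions, not the
libdevstat code: struct selection, resize_with_backup() and clamped_count() are
hypothetical names, plain realloc() stands in for reallocf(), and memcpy()
stands in for bcopy().  Whether this ordering matches what the rest of
devstat_selectdevs() expects is exactly the open question above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct device_selection; the real layout differs. */
struct selection {
    int device_number;
    int selected;
};

/* The "clamp" option: never copy more elements than either array holds. */
static size_t
clamped_count(size_t old_n, size_t new_n)
{
    return (old_n < new_n ? old_n : new_n);
}

/*
 * The "copy first, then resize" option.  Saves all old_n elements of *sel
 * into a freshly allocated *backup (caller frees), then resizes *sel to
 * new_n elements.  Returns 0 on success, -1 on allocation failure.
 */
static int
resize_with_backup(struct selection **sel, size_t old_n, size_t new_n,
    struct selection **backup)
{
    struct selection *copy, *resized;

    *backup = NULL;

    /* Snapshot the old contents while the array still has old_n elements. */
    copy = malloc((old_n ? old_n : 1) * sizeof(*copy));
    if (copy == NULL)
        return (-1);
    if (old_n > 0)
        memcpy(copy, *sel, old_n * sizeof(*copy));

    /* Only now shrink or grow; on failure the original array is untouched. */
    resized = realloc(*sel, (new_n ? new_n : 1) * sizeof(*resized));
    if (resized == NULL) {
        free(copy);
        return (-1);
    }

    *sel = resized;
    *backup = copy;
    return (0);
}
```

With num_selections=27 and numdevs=25, the backup would hold all 27 old
entries and the resized array the first 25, so no read goes past the end of
either allocation.  The clamp variant would instead back up only 25 entries,
which loses the last two old selections.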

Any help is appreciated.
Thank you very much in advance!
-- 
Andriy Gapon


Re: unkillable process consuming 100% cpu

2019-11-13 Thread Hans Petter Selasky

On 2019-11-13 01:30, Steve Kargl wrote:

On Tue, Nov 12, 2019 at 06:48:22PM +0100, Hans Petter Selasky wrote:

On 2019-11-12 18:31, Steve Kargl wrote:

Can you open the radeonkms.ko in gdb83 from ports and type:

l *(radeon_gem_busy_ioctl+0x30)


% /boot/modules/radeonkms.ko
(gdb) l  *(radeon_gem_busy_ioctl+0x30)
0xa12b0 is in radeon_gem_busy_ioctl 
(/usr/ports/graphics/drm-current-kmod/work/kms-drm-2d2852e/drivers/gpu/drm/radeon/radeon_gem.c:453).
448 
/usr/ports/graphics/drm-current-kmod/work/kms-drm-2d2852e/drivers/gpu/drm/radeon/radeon_gem.c:
 No such file or directory.
(gdb)


As expected.



I installed the 2nd seqlock.diff, rebuilt drm-current-kmod-4.16.g20191023,
rebooted, and have been pounding on the system with workloads that are
similar to what the system was doing during the lockups.  So far, I
cannot get the system to lock up.  Looks like your patch fixes it (or at
least helps).  Thanks for taking a look at the problem.



Can you apply the kdb.diff on top and check dmesg for prints?

--HPS