Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)
On 7/24/23 23:08, Taylor R Campbell wrote:
> > Date: Mon, 24 Jul 2023 12:05:34 +0900
> > From: PHO
> >
> > I realized the cause of this: [...] There were cases where the
> > function was destroying a condvar that it didn't initialize! Ugh,
> > this is the very reason why I dislike C...
>
> Oops... Sorry, was blowing through the vmwgfx code a little too
> hastily there.
>
> Here's a patch to simplify it and make it easier to read; might work
> better.

No worries. Your patch worked fine. Thank you!
Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)
> Date: Mon, 24 Jul 2023 12:05:34 +0900
> From: PHO
>
> I realized the cause of this:
> [...]
> There were cases where the function was destroying a condvar that it
> didn't initialize! Ugh, this is the very reason why I dislike C...

Oops... Sorry, was blowing through the vmwgfx code a little too hastily
there.

Here's a patch to simplify it and make it easier to read; might work
better.

diff --git a/sys/external/bsd/drm2/dist/drm/vmwgfx/vmwgfx_fence.c b/sys/external/bsd/drm2/dist/drm/vmwgfx/vmwgfx_fence.c
index 970c6ee38235..68a654ddb29a 100644
--- a/sys/external/bsd/drm2/dist/drm/vmwgfx/vmwgfx_fence.c
+++ b/sys/external/bsd/drm2/dist/drm/vmwgfx/vmwgfx_fence.c
@@ -195,87 +195,37 @@ static long vmw_fence_wait(struct dma_fence *f, bool intr, signed long timeout)
 	if (likely(vmw_fence_obj_signaled(fence)))
 		return timeout;
 
+	DRM_INIT_WAITQUEUE(&cb.wq, "vmwgfxwf");
+
 	vmw_fifo_ping_host(dev_priv, SVGA_SYNC_GENERIC);
 	vmw_seqno_waiter_add(dev_priv);
 
-	spin_lock(f->lock);
-
-	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &f->flags))
-		goto out;
-
-	if (intr && signal_pending(current)) {
-		ret = -ERESTARTSYS;
+	if (likely(vmw_fence_obj_signaled(fence))) {
+		ret = timeout;
 		goto out;
 	}
 
-#ifdef __NetBSD__
-	DRM_INIT_WAITQUEUE(&cb.wq, "vmwgfxwf");
-#else
-	cb.task = current;
-#endif
-	spin_unlock(f->lock);
 	ret = dma_fence_add_callback(f, &cb.base, vmwgfx_wait_cb);
-	spin_lock(f->lock);
-	if (ret)
+	if (ret) {
+		ret = timeout;
 		goto out;
+	}
 
-#ifdef __NetBSD__
 #define	C (__vmw_fences_update(fman), dma_fence_is_signaled_locked(f))
+	spin_lock(f->lock);
 	if (intr) {
 		DRM_SPIN_TIMED_WAIT_UNTIL(ret, &cb.wq, f->lock, timeout, C);
 	} else {
 		DRM_SPIN_TIMED_WAIT_NOINTR_UNTIL(ret, &cb.wq, f->lock, timeout, C);
 	}
-#else
-	for (;;) {
-		__vmw_fences_update(fman);
-
-		/*
-		 * We can use the barrier free __set_current_state() since
-		 * DMA_FENCE_FLAG_SIGNALED_BIT + wakeup is protected by the
-		 * fence spinlock.
-		 */
-		if (intr)
-			__set_current_state(TASK_INTERRUPTIBLE);
-		else
-			__set_current_state(TASK_UNINTERRUPTIBLE);
-
-		if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &f->flags)) {
-			if (ret == 0 && timeout > 0)
-				ret = 1;
-			break;
-		}
-
-		if (intr && signal_pending(current)) {
-			ret = -ERESTARTSYS;
-			break;
-		}
-
-		if (ret == 0)
-			break;
-
-		spin_unlock(f->lock);
-
-		ret = schedule_timeout(ret);
-
-		spin_lock(f->lock);
-	}
-	__set_current_state(TASK_RUNNING);
-	if (!list_empty(&cb.base.node))
-		list_del(&cb.base.node);
-#endif
 	spin_unlock(f->lock);
+
 	dma_fence_remove_callback(f, &cb.base);
 
-	spin_lock(f->lock);
-out:
-	spin_unlock(f->lock);
-#ifdef __NetBSD__
-	DRM_DESTROY_WAITQUEUE(&cb.wq);
-#endif
+out:	vmw_seqno_waiter_remove(dev_priv);
-	vmw_seqno_waiter_remove(dev_priv);
+	DRM_DESTROY_WAITQUEUE(&cb.wq);
 
 	return ret;
 }
Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)
On 7/23/23 17:27, PHO wrote:
> On 7/22/23 22:41, Taylor R Campbell wrote:
> > > Date: Sat, 22 Jul 2023 21:52:40 +0900
> > > From: PHO
> > >
> > > Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161]
> > > vmw_fence_wait() at netbsd:vmw_fence_wait+0xdc
> >
> > Just to confirm, what does `info line *(vmw_fence_wait+0xdc)' say in
> > gdb?  And, if you can get to the frame in gdb, what does gdb say is
> > in the vmw_fence_wait frame, and what cv is in the cv_destroy frame?
> > Let's confirm it is the cv you think it is -- I suspect it might be
> > a different one.
>
> I just encountered the crash and could obtain a crash dump. It is
> indeed the "DRM_DESTROY_WAITQUEUE(&cb.wq)" in vmw_fence_wait() but the
> contents of cb does not make sense to me:
> ...
> CV_SLEEPQ(cv) is 0x01 (wtf) and CV_WMESG(cv) is not even a string?

I realized the cause of this:

static long vmw_fence_wait(struct dma_fence *f, bool intr, signed long timeout)
{
	...
	if (likely(vmw_fence_obj_signaled(fence)))
		return timeout;
	...
	spin_lock(f->lock);

	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &f->flags))
		goto out;	// <-- THIS ONE

	if (intr && signal_pending(current)) {
		ret = -ERESTARTSYS;
		goto out;	// <-- OR THIS
	}

#ifdef __NetBSD__
	DRM_INIT_WAITQUEUE(&cb.wq, "vmwgfxwf");
#else
	cb.task = current;
#endif
	...
out:
	spin_unlock(f->lock);
#ifdef __NetBSD__
	DRM_DESTROY_WAITQUEUE(&cb.wq);
#endif
	...
}

There were cases where the function was destroying a condvar that it
didn't initialize! Ugh, this is the very reason why I dislike C...
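The bug pattern above is easy to reproduce outside the kernel. Here is a minimal userland reduction (hypothetical names; a pthread condvar stands in for the NetBSD waitqueue), showing the fixed layout with the initializer hoisted above every `goto` that can reach the cleanup label:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Fixed layout: the condvar is initialized before ANY path that can
 * reach the cleanup label.  In the buggy layout, the init sat below
 * the two early `goto done` exits, so those paths destroyed an object
 * that was never initialized -- which is how cv_destroy(9) came to see
 * stack garbage and trip KASSERT(!cv_has_waiters(cv)). */
static int
wait_for_event(bool already_signaled, bool interrupted)
{
	pthread_cond_t cv;
	int ret = 0;

	pthread_cond_init(&cv, NULL);	/* hoisted above the early exits */

	if (already_signaled)
		goto done;		/* early exit #1: cv is valid */

	if (interrupted) {
		ret = -1;
		goto done;		/* early exit #2: cv is valid */
	}

	/* ... the normal path would sleep on cv here ... */

done:
	pthread_cond_destroy(&cv);	/* now safe on every path */
	return ret;
}
```

The compiler cannot catch this class of mistake: a `goto` past an initializer is perfectly legal C, and the uninitialized object only blows up later, in whatever asserts on its contents.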
Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)
On 7/22/23 22:41, Taylor R Campbell wrote:
> > Date: Sat, 22 Jul 2023 21:52:40 +0900
> > From: PHO
> >
> > Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161]
> > vmw_fence_wait() at netbsd:vmw_fence_wait+0xdc
>
> Just to confirm, what does `info line *(vmw_fence_wait+0xdc)' say in
> gdb?  And, if you can get to the frame in gdb, what does gdb say is in
> the vmw_fence_wait frame, and what cv is in the cv_destroy frame?
> Let's confirm it is the cv you think it is -- I suspect it might be a
> different one.

I just encountered the crash and could obtain a crash dump. It is
indeed the "DRM_DESTROY_WAITQUEUE(&cb.wq)" in vmw_fence_wait() but the
contents of cb does not make sense to me:

(gdb) fr 6
#6  vmw_fence_wait (f=0xaa67d3170608, intr=<optimized out>,
    timeout=<optimized out>)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/dist/drm/vmwgfx/vmwgfx_fence.c:289
289             DRM_DESTROY_WAITQUEUE(&cb.wq);
(gdb) p cb
$11 = {base = {func = 0x814fbb80 <vmwgfx_wait_cb>, fcb_entry = {
      tqe_next = 0x80df6a0b, tqe_prev = 0xaa675e629238},
    fcb_onqueue = 90}, wq = {cv_opaque = {0x1, 0x80fa0189}}}
(gdb) fr 4
#4  0x80dcdd20 in cv_destroy (cv=cv@entry=0xbf8138b44630)
    at /home/pho/sandbox/_netbsd/src/sys/kern/kern_condvar.c:553
553             return !LIST_EMPTY(CV_SLEEPQ(cv));
(gdb) p *cv
$12 = {cv_opaque = {0x1, 0x80fa0189}}

CV_SLEEPQ(cv) is 0x01 (wtf) and CV_WMESG(cv) is not even a string?
Here's the stack trace:

#0  0x80239c85 in cpu_reboot (howto=howto@entry=260, bootstr=bootstr@entry=0x0)
    at /home/pho/sandbox/_netbsd/src/sys/arch/amd64/amd64/machdep.c:717
#1  0x80e03cee in kern_reboot (howto=howto@entry=260, bootstr=bootstr@entry=0x0)
    at /home/pho/sandbox/_netbsd/src/sys/kern/kern_reboot.c:73
#2  0x80e4c698 in vpanic (
    fmt=0x815a5cf8 "kernel %sassertion \"%s\" failed: file \"%s\", line %d",
    ap=ap@entry=0xbf8138b44578)
    at /home/pho/sandbox/_netbsd/src/sys/kern/subr_prf.c:292
#3  0x81012b8f in kern_assert (
    fmt=fmt@entry=0x815a5cf8 "kernel %sassertion \"%s\" failed: file \"%s\", line %d")
    at /home/pho/sandbox/_netbsd/src/sys/lib/libkern/kern_assert.c:51
#4  0x80dcdd20 in cv_destroy (cv=cv@entry=0xbf8138b44630)
    at /home/pho/sandbox/_netbsd/src/sys/kern/kern_condvar.c:553
#5  0x80a72824 in DRM_DESTROY_WAITQUEUE (q=0xbf8138b44630)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/include/drm/drm_wait_netbsd.h:59
#6  vmw_fence_wait (f=0xaa67d3170608, intr=<optimized out>, timeout=<optimized out>)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/dist/drm/vmwgfx/vmwgfx_fence.c:289
#7  0x80fa2184 in linux_dma_fence_wait_timeout (fence=0xaa67d3170608,
    intr=intr@entry=false, timeout=timeout@entry=1500)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/linux/linux_dma_fence.c:1005
#8  0x80fa4c96 in linux_dma_resv_wait_timeout_rcu (robj=0xaa675e629238,
    shared=shared@entry=true, intr=intr@entry=false, timeout=<optimized out>)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/linux/linux_dma_resv.c:1213
#9  0x80fb5671 in ttm_bo_wait (bo=bo@entry=0xaa675e629170,
    interruptible=interruptible@entry=false, no_wait=<optimized out>)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:1896
#10 0x80a815d0 in vmw_resource_unbind_list (vbo=vbo@entry=0xaa675e629170)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/dist/drm/vmwgfx/vmwgfx_resource.c:860
#11 0x80a6712c in vmw_bo_move_notify (bo=bo@entry=0xaa675e629170,
    mem=mem@entry=0xbf8138b44928)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/dist/drm/vmwgfx/vmwgfx_bo.c:1205
#12 0x80a8a6c7 in vmw_move_notify (bo=0xaa675e629170, evict=<optimized out>,
    mem=0xbf8138b44928)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/dist/drm/vmwgfx/vmwgfx_ttm_buffer.c:933
#13 0x80fb3505 in ttm_bo_handle_move_mem (bo=bo@entry=0xaa675e629170,
    mem=mem@entry=0xbf8138b44928, evict=evict@entry=true,
    ctx=ctx@entry=0xbf8138b44b00)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:380
#14 0x80fb4382 in ttm_bo_evict (ctx=0xbf8138b44b00, bo=0xaa675e629170)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:757
#15 ttm_mem_evict_first (bdev=bdev@entry=0xaa687f031008, mem_type=<optimized out>,
    place=place@entry=0x814fea78, ctx=ctx@entry=0xbf8138b44b00,
    ticket=ticket@entry=0xbf8138b44c98)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:913
#16 0x80fb4850 in ttm_bo_mem_force_space (ctx=0xbf8138b44b00, mem=0xbf8138b44a58,
    place=0x814fea78, bo=0xaa6752dd4b30)
    at /home/pho/sandbox/_netbsd/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:985
#17 ttm_bo_mem_space (bo=bo@entry=0xaa6752dd4b30,
Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)
On 7/22/23 22:31, Taylor R Campbell wrote:
> > > What exactly is the panic you see and the evidence when you see
> > > it?  Stack trace, gdb print cb in crash dump?
> >
> > Wait, can we use gdb for examining the kernel dump? I thought gdb
> > couldn't read it. Here's the stacktrace found in /var/log/messages:
>
> Yep, you can use gdb to examine a crash dump:
>
> $ gdb ./netbsd.gdb
> (gdb) target kvm netbsd.123.core

Wow, that actually worked. Thank you for the tip. I wish I knew this
when I was porting the driver :D
Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)
On 7/22/23 22:41, Taylor R Campbell wrote:
> > Date: Sat, 22 Jul 2023 21:52:40 +0900
> > From: PHO
> >
> > Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161]
> > vmw_fence_wait() at netbsd:vmw_fence_wait+0xdc
>
> Just to confirm, what does `info line *(vmw_fence_wait+0xdc)' say in
> gdb?  And, if you can get to the frame in gdb, what does gdb say is in
> the vmw_fence_wait frame, and what cv is in the cv_destroy frame?
> Let's confirm it is the cv you think it is -- I suspect it might be a
> different one.

Since I no longer have the exact kernel that crashed, I tried to
reproduce the issue with the latest one I have (with my workaround
backed out). But it seems like the panic doesn't occur anymore. I'll
give you the info the next time it panics.
Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)
> Date: Sat, 22 Jul 2023 21:52:40 +0900
> From: PHO
>
> Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161]
> vmw_fence_wait() at netbsd:vmw_fence_wait+0xdc

Just to confirm, what does `info line *(vmw_fence_wait+0xdc)' say in
gdb?  And, if you can get to the frame in gdb, what does gdb say is in
the vmw_fence_wait frame, and what cv is in the cv_destroy frame?
Let's confirm it is the cv you think it is -- I suspect it might be a
different one.
Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)
> Date: Sat, 22 Jul 2023 21:52:40 +0900
> From: PHO
>
> > > 	cv_destroy(&cb.wq);	// <-- Panics!
> > >
> > > It seldom panics on KASSERT(!cv_has_waiters(cv)) in cv_destroy()
> > > but not always. The panic seems to happen when cv_timedwait_sig()
> > > exits due to the timeout expiring before it gets signaled.
> >
> > Confused by `seldom panics on ... but not always' -- was that
> > supposed to be `often panics on ... but not always', or is there a
> > more frequent panic than KASSERT(!cv_has_waiters(cv))?
>
> I meant it didn't panic for most cases as if nothing wrong happened,
> but it occasionally panicked due to KASSERT(!cv_has_waiters(cv)).
> Sorry for my bad English.

OK, thanks!  No worries, just wasn't sure what you meant.  I might
phrase that as:

`It sometimes panics on KASSERT(!cv_has_waiters(cv)) in cv_destroy(),
but it doesn't always panic.'

or

`I sometimes see KASSERT(!cv_has_waiters(cv)) panics.'

> > What exactly is the panic you see and the evidence when you see it?
> > Stack trace, gdb print cb in crash dump?
>
> Wait, can we use gdb for examining the kernel dump? I thought gdb
> couldn't read it. Here's the stacktrace found in /var/log/messages:

Yep, you can use gdb to examine a crash dump:

$ gdb ./netbsd.gdb
(gdb) target kvm netbsd.123.core

You can even do this to poke around in the live kernel you're running!

# gdb ./netbsd.gdb
(gdb) target kvm /dev/mem

(Not at elevated securelevel, of course.)
Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)
On 7/22/23 20:48, Taylor R Campbell wrote:
> A note about one of the problems there:
>
> 	spin_unlock(f->lock);
> 	ret = dma_fence_add_callback(f, &cb.base, vmwgfx_wait_cb);
> 	spin_lock(f->lock);
> #if defined(__NetBSD__)
> 	/* This is probably an upstream bug: there is a time window between
> 	 * the call of vmw_fence_obj_signaled() above, and this
> 	 * dma_fence_add_callback(). If the fence gets signaled during it
> 	 * dma_fence_add_callback() returns -ENOENT, which is really not an
> 	 * error condition. By the way, why the heck does dma_fence work in
> 	 * this way? If a callback is being added but it lost the race, why
> 	 * not just call it immediately as if it were just signaled? */
>
> Not an upstream bug -- I introduced this bug when I patched the code
> that reached into the guts of what should be an opaque data structure
> for direct modification, to use drm_fence_add_callback instead.  Need
> to look at the diff from upstream, not just the #ifdefs.
>
> Usually I use #ifdef __NetBSD__ to mark NetBSDisms separately from
> Linuxisms, and just patch the code when the patched code can use a
> common API that isn't any one OSism.
>
> In this case I don't even remember why I left any #ifdefs, was
> probably just working fast to make progress on a large code base,
> might have left the #ifdefs in for visual reference while I was
> editing the code and forgot to remove them.  Could also simplify some
> of the lock/unlock cycles by doing that.

Ah okay. I used #if defined(__NetBSD__) for everything needing any
changes, and I assumed you did the same without actually checking the
original code.

> > 	cv_destroy(&cb.wq);	// <-- Panics!
> >
> > It seldom panics on KASSERT(!cv_has_waiters(cv)) in cv_destroy() but
> > not always. The panic seems to happen when cv_timedwait_sig() exits
> > due to the timeout expiring before it gets signaled.
>
> Confused by `seldom panics on ... but not always' -- was that supposed
> to be `often panics on ... but not always', or is there a more
> frequent panic than KASSERT(!cv_has_waiters(cv))?

I meant it didn't panic for most cases as if nothing wrong happened,
but it occasionally panicked due to KASSERT(!cv_has_waiters(cv)). Sorry
for my bad English.

> What exactly is the panic you see and the evidence when you see it?
> Stack trace, gdb print cb in crash dump?

Wait, can we use gdb for examining the kernel dump? I thought gdb
couldn't read it. Here's the stacktrace found in /var/log/messages:

Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.3652130] panic: kernel diagnostic assertion "!cv_has_waiters(cv)" failed: file "/home/pho/sandbox/_netbsd/src/sys/kern/kern_condvar.c", line 108
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.3782663] cpu0: Begin traceback...
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.5355447] vpanic() at netbsd:vpanic+0x173
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.5454410] kern_assert() at netbsd:kern_assert+0x4b
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.5551143] cv_destroy() at netbsd:cv_destroy+0x8a
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161] vmw_fence_wait() at netbsd:vmw_fence_wait+0xdc
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161] linux_dma_fence_wait_timeout() at netbsd:linux_dma_fence_wait_timeout+0x8b
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161] linux_dma_resv_wait_timeout_rcu() at netbsd:linux_dma_resv_wait_timeout_rcu+0xbe
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6251241] ttm_bo_wait() at netbsd:ttm_bo_wait+0x4c
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6251241] vmw_resource_unbind_list() at netbsd:vmw_resource_unbind_list+0x103
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6251241] vmw_move_notify() at netbsd:vmw_move_notify+0x16
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6351198] ttm_bo_handle_move_mem() at netbsd:ttm_bo_handle_move_mem+0xe6
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6451175] ttm_mem_evict_first() at netbsd:ttm_mem_evict_first+0x702
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6451175] ttm_bo_mem_space() at netbsd:ttm_bo_mem_space+0x21e
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6451175] ttm_bo_validate() at netbsd:ttm_bo_validate+0xe6
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6551178] vmw_validation_bo_validate_single() at netbsd:vmw_validation_bo_validate_single+0x93
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6551178] vmw_validation_bo_validate() at netbsd:vmw_validation_bo_validate+0xaa
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6551178] vmw_execbuf_process() at netbsd:vmw_execbuf_process+0x771
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6651177] vmw_execbuf_ioctl() at netbsd:vmw_execbuf_ioctl+0x97
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6651177] drm_ioctl() at netbsd:drm_ioctl+0x23d
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6751234] drm_ioctl_shim() at netbsd:drm_ioctl_shim+0x25
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6751234] sys_ioctl() at
Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)
> Date: Mon, 17 Jul 2023 12:57:42 +0900
> From: PHO
>
> While further testing my DRM/KMS vmwgfx driver patches by playing
> games/minetest from pkgsrc, I experienced inexplicable kernel panics on
> this code:
>
> https://github.com/depressed-pho/netbsd-src/blob/91daa67f17222da355d3fddd6fa849c786d9c545/sys/external/bsd/drm2/dist/drm/vmwgfx/vmwgfx_fence.c#L289

A note about one of the problems there:

	spin_unlock(f->lock);
	ret = dma_fence_add_callback(f, &cb.base, vmwgfx_wait_cb);
	spin_lock(f->lock);
#if defined(__NetBSD__)
	/* This is probably an upstream bug: there is a time window between
	 * the call of vmw_fence_obj_signaled() above, and this
	 * dma_fence_add_callback(). If the fence gets signaled during it
	 * dma_fence_add_callback() returns -ENOENT, which is really not an
	 * error condition. By the way, why the heck does dma_fence work in
	 * this way? If a callback is being added but it lost the race, why
	 * not just call it immediately as if it were just signaled? */

Not an upstream bug -- I introduced this bug when I patched the code
that reached into the guts of what should be an opaque data structure
for direct modification, to use drm_fence_add_callback instead.  Need
to look at the diff from upstream, not just the #ifdefs.

Usually I use #ifdef __NetBSD__ to mark NetBSDisms separately from
Linuxisms, and just patch the code when the patched code can use a
common API that isn't any one OSism.

In this case I don't even remember why I left any #ifdefs, was probably
just working fast to make progress on a large code base, might have
left the #ifdefs in for visual reference while I was editing the code
and forgot to remove them.  Could also simplify some of the lock/unlock
cycles by doing that.

> 	cv_destroy(&cb.wq);	// <-- Panics!
>
> It seldom panics on KASSERT(!cv_has_waiters(cv)) in cv_destroy() but
> not always. The panic seems to happen when cv_timedwait_sig() exits
> due to the timeout expiring before it gets signaled.

Confused by `seldom panics on ... but not always' -- was that supposed
to be `often panics on ... but not always', or is there a more frequent
panic than KASSERT(!cv_has_waiters(cv))?

What exactly is the panic you see and the evidence when you see it?
Stack trace, gdb print cb in crash dump?
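The -ENOENT behavior discussed in the quoted comment can be modeled in a few lines of userland C. This is only a sketch of the contract, not the kernel API: the `struct fence`, `fence_add_callback`, and `wait_on` names are invented here. The point is that registering a callback on an already-signaled fence refuses with -ENOENT, and the caller treats that as "nothing left to wait for" rather than as an error:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of a fence: just a signaled flag and a callback count. */
struct fence {
	bool signaled;
	int ncallbacks;
};

/* Models the add_callback contract: if the fence is already signaled,
 * refuse with -ENOENT instead of queueing the callback.  The caller
 * lost the race between its signaled-check and the registration. */
static int
fence_add_callback(struct fence *f)
{
	if (f->signaled)
		return -ENOENT;		/* already signaled: not an error */
	f->ncallbacks++;
	return 0;
}

/* Caller pattern: a nonzero return from registration means the fence
 * is already done, so report the full timeout as "time remaining". */
static long
wait_on(struct fence *f, long timeout)
{
	if (fence_add_callback(f))
		return timeout;		/* success: nothing to wait for */
	/* ... a real caller would block here until the callback fires ... */
	return timeout;
}
```

Folding the -ENOENT case into the success path, instead of special-casing it with a comment, is exactly what the follow-up patch does with its `if (ret) { ret = timeout; goto out; }`.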