On 7/22/23 20:48, Taylor R Campbell wrote:
A note about one of the problems there:
spin_unlock(f->lock);
ret = dma_fence_add_callback(f, &cb.base, vmwgfx_wait_cb);
spin_lock(f->lock);
#if defined(__NetBSD__)
/* This is probably an upstream bug: there is a time window between
* the call of vmw_fence_obj_signaled() above, and this
* dma_fence_add_callback(). If the fence gets signaled during it
* dma_fence_add_callback() returns -ENOENT, which is really not an
* error condition. By the way, why the heck does dma_fence work in
* this way? If a callback is being added but it lost the race, why
* not just call it immediately as if it were just signaled?
*/
Not an upstream bug -- I introduced this bug when I patched the code
that reached into the guts of what should be an opaque data structure
for direct modification, to use dma_fence_add_callback instead.
Need to look at the diff from upstream, not just the #ifdefs. Usually
I use #ifdef __NetBSD__ to mark NetBSDisms separately from Linuxisms,
and just patch the code when the patched code can use a common API
that isn't any one OSism.
In this case I don't even remember why I left any #ifdefs. I was
probably just working fast to make progress on a large code base, and
might have left the #ifdefs in for visual reference while I was
editing the code and forgot to remove them. Removing them could also
simplify some of the lock/unlock cycles.
Ah okay. I used #if defined(__NetBSD__) for everything needing any
changes, and I assumed you did the same without actually checking the
original code.
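
To make the -ENOENT case discussed above concrete, here is a minimal
user-space sketch. It is not the vmwgfx or NetBSD code: every toy_*
name is hypothetical, there is no locking, and it only models the
contract that adding a callback to an already-signaled fence returns
-ENOENT without registering anything. The waiter treats that return
value as "already done" rather than as an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins (hypothetical); these are not the real dma_fence types. */
struct toy_fence;
struct toy_fence_cb;
typedef void (*toy_fence_func_t)(struct toy_fence *, struct toy_fence_cb *);

struct toy_fence_cb {
	toy_fence_func_t func;
	struct toy_fence_cb *next;
};

struct toy_fence {
	bool signaled;
	struct toy_fence_cb *callbacks;
};

/*
 * Register cb on f. If f is already signaled, register nothing and
 * return -ENOENT, mirroring the behaviour the comment complains about.
 */
static int
toy_fence_add_callback(struct toy_fence *f, struct toy_fence_cb *cb,
    toy_fence_func_t func)
{
	assert(f != NULL && cb != NULL && func != NULL);
	if (f->signaled)
		return -ENOENT;
	cb->func = func;
	cb->next = f->callbacks;
	f->callbacks = cb;
	return 0;
}

/* Mark f signaled and run every registered callback exactly once. */
static void
toy_fence_signal(struct toy_fence *f)
{
	struct toy_fence_cb *cb = f->callbacks;

	f->signaled = true;
	f->callbacks = NULL;
	while (cb != NULL) {
		struct toy_fence_cb *next = cb->next;
		cb->func(f, cb);
		cb = next;
	}
}

/* Waiter state; base must be the first member so the cast below is valid. */
struct toy_wait_cb {
	struct toy_fence_cb base;
	bool done;
};

static void
toy_wait_done(struct toy_fence *f, struct toy_fence_cb *cb)
{
	(void)f;
	((struct toy_wait_cb *)cb)->done = true;
}

/*
 * Arm the waiter. Losing the race against the signal (-ENOENT) is not
 * an error: the fence is done, so mark the waiter done and return 0.
 */
static int
toy_wait_prepare(struct toy_fence *f, struct toy_wait_cb *w)
{
	int ret;

	w->done = false;
	ret = toy_fence_add_callback(f, &w->base, toy_wait_done);
	if (ret == -ENOENT) {
		w->done = true;
		ret = 0;
	}
	return ret;
}
```

With this shape the caller only has to check w->done after
toy_wait_prepare() returns 0, whether or not the fence was signaled
before the callback could be added.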
cv_destroy(&cv); // <-- Panics!
It seldom panics on KASSERT(!cv_has_waiters(cv)) in cv_destroy() but not
always. The panic seems to happen when cv_timedwait_sig() exits due to
the timeout expiring before it gets signaled.
Confused by `seldom panics on ... but not always' -- was that supposed
to be `often panics on ... but not always', or is there a more
frequent panic than KASSERT(!cv_has_waiters(cv))?
I meant that in most cases it didn't panic, as if nothing were wrong,
but it occasionally panicked due to KASSERT(!cv_has_waiters(cv)). Sorry
for my bad English.
What exactly is the panic you see and the evidence when you see it?
Stack trace, gdb print cb in crash dump?
Wait, can we use gdb for examining the kernel dump? I thought gdb
couldn't read it. Here's the stack trace found in /var/log/messages:
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.3652130] panic: kernel diagnostic assertion "!cv_has_waiters(cv)" failed: file "/home/pho/sandbox/_netbsd/src/sys/kern/kern_condvar.c", line 108
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.3782663] cpu0: Begin traceback...
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.5355447] vpanic() at netbsd:vpanic+0x173
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.5454410] kern_assert() at netbsd:kern_assert+0x4b
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.5551143] cv_destroy() at netbsd:cv_destroy+0x8a
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161] vmw_fence_wait() at netbsd:vmw_fence_wait+0xdc
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161] linux_dma_fence_wait_timeout() at netbsd:linux_dma_fence_wait_timeout+0x8b
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161] linux_dma_resv_wait_timeout_rcu() at netbsd:linux_dma_resv_wait_timeout_rcu+0xbe
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6251241] ttm_bo_wait() at netbsd:ttm_bo_wait+0x4c
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6251241] vmw_resource_unbind_list() at netbsd:vmw_resource_unbind_list+0x103
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6251241] vmw_move_notify() at netbsd:vmw_move_notify+0x16
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6351198] ttm_bo_handle_move_mem() at netbsd:ttm_bo_handle_move_mem+0xe6
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6451175] ttm_mem_evict_first() at netbsd:ttm_mem_evict_first+0x702
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6451175] ttm_bo_mem_space() at netbsd:ttm_bo_mem_space+0x21e
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6451175] ttm_bo_validate() at netbsd:ttm_bo_validate+0xe6
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6551178] vmw_validation_bo_validate_single() at netbsd:vmw_validation_bo_validate_single+0x93
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6551178] vmw_validation_bo_validate() at netbsd:vmw_validation_bo_validate+0xaa
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6551178] vmw_execbuf_process() at netbsd:vmw_execbuf_process+0x771
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6651177] vmw_execbuf_ioctl() at netbsd:vmw_execbuf_ioctl+0x97
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6651177] drm_ioctl() at netbsd:drm_ioctl+0x23d
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6751234] drm_ioctl_shim() at netbsd:drm_ioctl_shim+0x25
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6751234] sys_ioctl() at netbsd:sys_ioctl+0x56d
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6751234] syscall() at netbsd:syscall+0x196
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6851255] --- syscall (number 54) ---
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6851255] netbsd:syscall+0x196:
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6851255] cpu0: End traceback...