Re: zfs deadlock on r360452 relating to busy vm page

2020-05-12 Thread Bryan Drewery
Trivial repro:

dd if=/dev/zero of=blah & tail -F blah
^C
load: 0.21  cmd: tail 72381 [prev->lr_read_cv] 2.17r 0.00u 0.01s 0% 2600k
#0 0x80bce615 at mi_switch+0x155
#1 0x80c1cfea at sleepq_switch+0x11a
#2 0x80b57f0a at _cv_wait+0x15a
#3 0x829ddab6 at rangelock_enter+0x306
#4 0x829ecd3f at zfs_freebsd_getpages+0x14f
#5 0x810e3ab9 at VOP_GETPAGES_APV+0x59
#6 0x80f349e7 at vnode_pager_getpages+0x37
#7 0x80f2a93f at vm_pager_get_pages+0x4f
#8 0x80f054b0 at vm_fault+0x780
#9 0x80f04bde at vm_fault_trap+0x6e
#10 0x8106544e at trap_pfault+0x1ee
#11 0x81064a9c at trap+0x44c
#12 0x8103a978 at calltrap+0x8
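
A sketch of a loop harness around that one-liner, for anyone trying to
reproduce this on their end; the file name, the sleeps, and the procstat
check are my additions rather than anything from the report, and it
assumes a ZFS-backed working directory on the affected revision.  When
the race hits, tail wedges unkillably in rangelock_enter (as in the
stack above), which procstat -kk can spot:

  #!/bin/sh
  while :; do
      dd if=/dev/zero of=blah 2>/dev/null &
      dd_pid=$!
      tail -F blah >/dev/null &
      tail_pid=$!
      sleep 2
      kill -INT "$tail_pid"    # stands in for the interactive ^C
      sleep 2
      # a wedged tail is still alive, stuck under zfs_freebsd_getpages
      if procstat -kk "$tail_pid" 2>/dev/null | grep -q rangelock_enter; then
          echo "deadlocked: tail pid $tail_pid stuck in rangelock_enter"
          break
      fi
      kill "$dd_pid" 2>/dev/null
      wait
      rm -f blah
  done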


On 5/12/2020 3:13 PM, Bryan Drewery wrote:
>> panic: deadlres_td_sleep_q: possible deadlock detected for 
>> 0xfe25eefa2e00 (find), blocked for 1802392 ticks
> 
> 2 stuck processes from procstat -kk before panic
>> 72559 101698 tail-   mi_switch+0x155 
>> sleepq_switch+0x11a _cv_wait+0x15a rangelock_enter+0x306 
>> zfs_freebsd_getpages+0x14f VOP_GETPAGES_APV+0x59 vnode_pager_getpages+0x37 
>> vm_pager_get_pages+0x4f vm_fault+0x780 vm_fault_trap+0x6e trap_pfault+0x1ee
> 
>> 72985 107378 find-   mi_switch+0x155 
>> sleepq_switch+0x11a sleeplk+0x106 lockmgr_slock_hard+0x1f5 
>> VOP_LOCK1_APV+0x40 _vn_lock+0x54 lookup+0xdd namei+0x524 vn_open_cred+0x32b 
>> kern_openat+0x1fa filemon_wrapper_openat+0x15 amd64_syscall+0x73d
> 
> 
> The only find running was thread 107378
> 
> I couldn't record much from ddb but got locked vnodes.
>>
>> db> show lockedvnods
>> Locked vnodes
>> vnode 0xf804de66e500: type VDIR
>> usecount 3, writecount 0, refcount 2 mountedhere 0
>> flags ()
>> v_object 0xf809459cb420 ref 0 pages 0 cleanbuf 0 dirtybuf 0
>> lock type zfs: SHARED (count 1)
>> #0 0x80b94a0f at lockmgr_slock+0xdf
>> #1 0x810e2a40 at VOP_LOCK1_APV+0x40
>> #2 0x80cb14f4 at _vn_lock+0x54
>> #3 0x80c9b3ec at vget_finish+0x6c
>> #4 0x80c8051c at cache_lookup+0x57c
>> #5 0x80c84dad at vfs_cache_lookup+0x7d
>> #6 0x810df996 at VOP_LOOKUP_APV+0x56
>> #7 0x80c8ee61 at lookup+0x601
>> #8 0x80c8e374 at namei+0x524
>> #9 0x80caa83f at kern_statat+0x7f
>> #10 0x80caafff at sys_fstatat+0x2f
>> #11 0x81065c40 at amd64_syscall+0x140
>> #12 0x8103b2a0 at fast_syscall_common+0x101
>> vnode 0xf808a08f0a00: type VDIR
>> usecount 6, writecount 0, refcount 2 mountedhere 0
>> flags ()
>> v_object 0xf801eb93 ref 0 pages 0 cleanbuf 0 dirtybuf 0
>> lock type zfs: EXCL by thread 0xfe24aadb6100 (pid 72267, gmake, tid 
>> 104356)
>>  with shared waiters pending
>> #0 0x80b94a0f at lockmgr_slock+0xdf
>> #1 0x810e2a40 at VOP_LOCK1_APV+0x40
>> #2 0x80cb14f4 at _vn_lock+0x54
>> #3 0x80c8e93d at lookup+0xdd
>> #4 0x80c8e374 at namei+0x524
>> #5 0x80ca9e69 at kern_funlinkat+0xa9
>> #6 0x80ca9db8 at sys_unlink+0x28
>> #7 0x82780586 at filemon_wrapper_unlink+0x16
>> #8 0x8106623d at amd64_syscall+0x73d
>> #9 0x8103b2a0 at fast_syscall_common+0x101
>>
>> vnode 0xf80571f29500: type VREG
>> usecount 6, writecount 1, refcount 2
>> flags ()
>> v_object 0xf806cb637c60 ref 2 pages 1 cleanbuf 0 dirtybuf 0
>> lock type zfs: SHARED (count 2)
>>  with exclusive waiters pending
>> #0 0x80b94a0f at lockmgr_slock+0xdf
>> #1 0x810e2a40 at VOP_LOCK1_APV+0x40
>> #2 0x80cb14f4 at _vn_lock+0x54
>> #3 0x8243af40 at zfs_lookup+0x610
>> #4 0x8243b61e at zfs_freebsd_cachedlookup+0x8e
>> #5 0x810dfb46 at VOP_CACHEDLOOKUP_APV+0x56
>> #6 0x80c84dd8 at vfs_cache_lookup+0xa8
>> #7 0x810df996 at VOP_LOOKUP_APV+0x56
>> #8 0x80c8ee61 at lookup+0x601
>> #9 0x80c8e374 at namei+0x524
>> #10 0x80caa83f at kern_statat+0x7f
>> #11 0x80caafff at sys_fstatat+0x2f
>> #12 0x8106623d at amd64_syscall+0x73d
>> #13 0x8103b2a0 at fast_syscall_common+0x101
> 
> It's nice how recent threads are at the top in gdb...
>> (kgdb) info threads
>>   Id   Target Id Frame
>>   1    Thread 107952 (PID=79390: zfs)    sched_switch (td=0xfe26ebb36000, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>>   2    Thread 102764 (PID=73218: zfs)    sched_switch (td=0xfe2490a12300, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>>   3    Thread 107378 (PID=72985: find)   sched_switch (td=0xfe25eefa2e00, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>>   4    Thread 103940 (PID=72980: rm)     sched_switch (td=0xfe2451932500, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>>   5    Thread 101698 (PID=72559:

zfs deadlock on r360452 relating to busy vm page

2020-05-12 Thread Bryan Drewery
> panic: deadlres_td_sleep_q: possible deadlock detected for 0xfe25eefa2e00 
> (find), blocked for 1802392 ticks

2 stuck processes from procstat -kk before panic
> 72559 101698 tail-   mi_switch+0x155 
> sleepq_switch+0x11a _cv_wait+0x15a rangelock_enter+0x306 
> zfs_freebsd_getpages+0x14f VOP_GETPAGES_APV+0x59 vnode_pager_getpages+0x37 
> vm_pager_get_pages+0x4f vm_fault+0x780 vm_fault_trap+0x6e trap_pfault+0x1ee

> 72985 107378 find-   mi_switch+0x155 
> sleepq_switch+0x11a sleeplk+0x106 lockmgr_slock_hard+0x1f5 VOP_LOCK1_APV+0x40 
> _vn_lock+0x54 lookup+0xdd namei+0x524 vn_open_cred+0x32b kern_openat+0x1fa 
> filemon_wrapper_openat+0x15 amd64_syscall+0x73d


The only find running was thread 107378

I couldn't record much from ddb but got locked vnodes.
> 
> db> show lockedvnods
> Locked vnodes
> vnode 0xf804de66e500: type VDIR
> usecount 3, writecount 0, refcount 2 mountedhere 0
> flags ()
> v_object 0xf809459cb420 ref 0 pages 0 cleanbuf 0 dirtybuf 0
> lock type zfs: SHARED (count 1)
> #0 0x80b94a0f at lockmgr_slock+0xdf
> #1 0x810e2a40 at VOP_LOCK1_APV+0x40
> #2 0x80cb14f4 at _vn_lock+0x54
> #3 0x80c9b3ec at vget_finish+0x6c
> #4 0x80c8051c at cache_lookup+0x57c
> #5 0x80c84dad at vfs_cache_lookup+0x7d
> #6 0x810df996 at VOP_LOOKUP_APV+0x56
> #7 0x80c8ee61 at lookup+0x601
> #8 0x80c8e374 at namei+0x524
> #9 0x80caa83f at kern_statat+0x7f
> #10 0x80caafff at sys_fstatat+0x2f
> #11 0x81065c40 at amd64_syscall+0x140
> #12 0x8103b2a0 at fast_syscall_common+0x101
> vnode 0xf808a08f0a00: type VDIR
> usecount 6, writecount 0, refcount 2 mountedhere 0
> flags ()
> v_object 0xf801eb93 ref 0 pages 0 cleanbuf 0 dirtybuf 0
> lock type zfs: EXCL by thread 0xfe24aadb6100 (pid 72267, gmake, tid 
> 104356)
>  with shared waiters pending
> #0 0x80b94a0f at lockmgr_slock+0xdf
> #1 0x810e2a40 at VOP_LOCK1_APV+0x40
> #2 0x80cb14f4 at _vn_lock+0x54
> #3 0x80c8e93d at lookup+0xdd
> #4 0x80c8e374 at namei+0x524
> #5 0x80ca9e69 at kern_funlinkat+0xa9
> #6 0x80ca9db8 at sys_unlink+0x28
> #7 0x82780586 at filemon_wrapper_unlink+0x16
> #8 0x8106623d at amd64_syscall+0x73d
> #9 0x8103b2a0 at fast_syscall_common+0x101
> 
> vnode 0xf80571f29500: type VREG
> usecount 6, writecount 1, refcount 2
> flags ()
> v_object 0xf806cb637c60 ref 2 pages 1 cleanbuf 0 dirtybuf 0
> lock type zfs: SHARED (count 2)
>  with exclusive waiters pending
> #0 0x80b94a0f at lockmgr_slock+0xdf
> #1 0x810e2a40 at VOP_LOCK1_APV+0x40
> #2 0x80cb14f4 at _vn_lock+0x54
> #3 0x8243af40 at zfs_lookup+0x610
> #4 0x8243b61e at zfs_freebsd_cachedlookup+0x8e
> #5 0x810dfb46 at VOP_CACHEDLOOKUP_APV+0x56
> #6 0x80c84dd8 at vfs_cache_lookup+0xa8
> #7 0x810df996 at VOP_LOOKUP_APV+0x56
> #8 0x80c8ee61 at lookup+0x601
> #9 0x80c8e374 at namei+0x524
> #10 0x80caa83f at kern_statat+0x7f
> #11 0x80caafff at sys_fstatat+0x2f
> #12 0x8106623d at amd64_syscall+0x73d
> #13 0x8103b2a0 at fast_syscall_common+0x101
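
For reference, a sketch of the kind of DDB session that yields the dump
above, assuming console access at the panic prompt or after breaking
into the debugger; the lockchain query on the stuck tid is my addition,
not something from the report:

  db> ps
  db> show lockedvnods
  db> show lockchain 107378
  db> dump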

It's nice how recent threads are at the top in gdb...
> (kgdb) info threads
>   Id   Target Id Frame
>   1    Thread 107952 (PID=79390: zfs)    sched_switch (td=0xfe26ebb36000, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>   2    Thread 102764 (PID=73218: zfs)    sched_switch (td=0xfe2490a12300, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>   3    Thread 107378 (PID=72985: find)   sched_switch (td=0xfe25eefa2e00, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>   4    Thread 103940 (PID=72980: rm)     sched_switch (td=0xfe2451932500, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>   5    Thread 101698 (PID=72559: tail)   sched_switch (td=0xfe255eac, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>   6    Thread 103660 (PID=72280: timestamp)  sched_switch (td=0xfe25f948aa00, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>   7    Thread 101249 (PID=72280: timestamp/prefix_stdout)    sched_switch (td=0xfe264412a100, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>   8    Thread 101255 (PID=72280: timestamp/prefix_stderr)    sched_switch (td=0xfe25c8e9bc00, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>   9    Thread 104356 (PID=72267: gmake)  sched_switch (td=0xfe24aadb6100, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2147
>   10   Thread 108476 (PID=66957: vim)    sched_switch (td=0xfe26c8601500,

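The info threads view above comes from post-mortem kgdb against the
crash dump.  A sketch of the usual invocation, assuming dump index 0
and the stock kernel path:

  kgdb /boot/kernel/kernel /var/crash/vmcore.0
  (kgdb) info threads
  (kgdb) thread 3    # tid 107378, the wedged find
  (kgdb) bt
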
Re: lockups on lenovo p43s under current

2020-05-12 Thread Pete Wright



On 5/11/20 4:21 PM, Pete Wright wrote:



On 5/11/20 3:28 PM, Yuri Pankov wrote:

Pete Wright wrote:

hello,
i have a lenovo thinkpad P43s that exhibits lockups under CURRENT 
but behaves fine when running STABLE.  i've tried to find a fully 
reproducible way to get this system to lock up but haven't found 
anything yet.  i am starting to suspect that the changes implemented 
in this review may be the issue though:


https://reviews.freebsd.org/D23728

my reasoning is that i've observed issues:
- when removing AC power from the laptop, or inserting AC power
- when the system display has gone to sleep
- when boot randomly hangs with this as the last line:
battery0: battery initialization start

unfortunately, while the above seem to be the situations where this 
has happened, i haven't been able to reproduce it 100% of the time yet.


so my first question is: would it be possible to just revert the 
changes in that diff, or has too much time gone by to back out that 
single change?  alternatively, is there any debugging information i 
can get on my end that might help figure out what the root 
cause is?


Not really what you are asking, but it's possible to disable ACPI 
subdevices, so you could check if disabling cmbat completely helps 
and it's indeed the suspect:


debug.acpi.disabled="cmbat"
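
For anyone else wanting to try this, the tunable goes in
/boot/loader.conf (or can be set once at the loader prompt with "set");
a minimal example, assuming cmbat is the only subsystem being disabled:

  # /boot/loader.conf
  # disable ACPI battery handling; per acpi(4) the variable takes a
  # space-separated list if more subsystems need disabling
  debug.acpi.disabled="cmbat"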


Thanks Yuri,
So I was able to boot my system once via battery with this set, but 
unfortunately it crashed after I tried to suspend/resume.  Realizing 
that was a bit optimistic, I attempted to reboot the system and wasn't 
able to get it to fully boot after several attempts.


I believe the next step at this point is to check out the code right 
before this commit and see if I can get it to boot successfully.  I'll 
report back if I find anything after that test.




To follow up on this, I believe the updates in the above review may be 
the culprit.  What I have done is build a memstick.img from the commit 
right before the changes in D23728 were merged.  Running this image I 
can boot my system and disconnect and reconnect AC power without any issues.

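A rough sketch of that kind of build under the SVN-era tree; NNNNNN is
a placeholder, since I don't know the exact revision just before D23728
landed, and these are not the literal commands used:

  svnlite checkout -r NNNNNN https://svn.freebsd.org/base/head /usr/src
  cd /usr/src
  make -j"$(sysctl -n hw.ncpu)" buildworld buildkernel
  cd release && make release   # produces memstick.img; see release(7)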

i then booted from a memstick using the latest snapshot of current. i 
can disconnect AC power without issues, but reconnecting hangs the 
system immediately.


i've tested this a couple of times and it seems pretty reproducible; 
not sure what the best next step would be though.  would someone here 
be willing to help me debug this, or would it be best to file a PR 
along with a dmesg and output from acpiconf?

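If it comes to a PR, a sketch of collecting the attachments; battery
unit 0 and the extra hw.acpi sysctl dump are assumptions on my part:

  dmesg -a > dmesg.txt
  acpiconf -i 0 > acpiconf.txt
  sysctl hw.acpi >> acpiconf.txt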

cheers!
-pete

--
Pete Wright
p...@nomadlogic.org
@nomadlogicLA
