Re: svn commit: r360233 - in head: contrib/jemalloc . . . : This partially breaks a 2-socket 32-bit powerpc (old PowerMac G4) based on head -r360311
Hi Mark, On Wed, 13 May 2020 01:43:23 -0700 Mark Millard wrote: > [I'm adding a reference to an old arm64/aarch64 bug that had > pages turning to zero, in case this 32-bit powerpc issue is > somewhat analogous.] > > On 2020-May-13, at 00:29, Mark Millard wrote: > > > [stress alone is sufficient to have jemalloc asserts fail > > in stress, no need for a multi-socket G4 either. No need > > to involve nfsd, mountd, rpcbind or the like. This is not > > a claim that I know all the problems to be the same, just > > that a jemalloc reported failure in this simpler context > > happens and zeroed pages are involved.] > > > > Reminder: head -r360311 based context. > > > > > > First I show a single CPU/core PowerMac G4 context failing > > in stress. (I actually did this later, but it is the > > simpler context.) I simply moved the media from the > > 2-socket G4 to this slower, single-cpu/core one. > > > > cpu0: Motorola PowerPC 7400 revision 2.9, 466.42 MHz > > cpu0: Features 9c00 > > cpu0: HID0 8094c0a4 > > real memory = 1577857024 (1504 MB) > > avail memory = 1527508992 (1456 MB) > > > > # stress -m 1 --vm-bytes 1792M > > stress: info: [1024] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd > > : > > /usr/src/contrib/jemalloc/include/jemalloc/internal/arena_inlines_b.h:258: > > Failed assertion: "slab == extent_slab_get(extent)" stress: FAIL: > > [1024] (415) <-- worker 1025 got signal 6 stress: WARN: [1024] > > (417) now reaping child worker processes stress: FAIL: [1024] (451) > > failed run completed in 243s > > > > (Note: 1792 is the biggest it allowed with M.) > > > > The following still pages in and out and fails: > > > > # stress -m 1 --vm-bytes 1290M > > stress: info: [1163] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd > > : > > /usr/src/contrib/jemalloc/include/jemalloc/internal/arena_inlines_b.h:258: > > Failed assertion: "slab == extent_slab_get(extent)" . . . 
> > > > By contrast, the following had no problem for as > long as I let it run --and did not page in or out: > > # stress -m 1 --vm-bytes 1280M > stress: info: [1181] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd > ... > The following was a fix for a "pages magically > turn into zeros" problem on arm64/aarch64. The > original 32-bit powerpc context did not seem a > match to me --but the stress test behavior that > I've just observed seems closer from an > external-test point of view: swapping is involved. > > Maybe this will suggest something to someone who > knows what they are doing. > > (Note: dsl-only.net closed down, so the E-mail > address reference is no longer valid.) > > Author: kib > Date: Mon Apr 10 15:32:26 2017 > New Revision: 316679 > URL: > https://svnweb.freebsd.org/changeset/base/316679 > > > Log: > Do not lose dirty bits for removing PROT_WRITE on arm64. > > Arm64 pmap interprets accessed writable ptes as modified, since > ARMv8.0 does not track Dirty Bit Modifier in hardware. If writable > bit is removed, page must be marked as dirty for MI VM. > > This change is most important for COW, where fork caused losing > content of the dirty pages which were not yet scanned by pagedaemon. 
> > Reviewed by: alc, andrew > Reported and tested by: Mark Millard > PR: 217138, 217239 > Sponsored by: The FreeBSD Foundation > MFC after: 2 weeks > > Modified: > head/sys/arm64/arm64/pmap.c > > Modified: head/sys/arm64/arm64/pmap.c > == > --- head/sys/arm64/arm64/pmap.c Mon Apr 10 12:35:58 > 2017 (r316678) +++ head/sys/arm64/arm64/pmap.c Mon Apr > 10 15:32:26 2017 (r316679) @@ -2481,6 +2481,11 @@ > pmap_protect(pmap_t pmap, vm_offset_t sv sva += L3_SIZE) { > l3 = pmap_load(l3p); > if (pmap_l3_valid(l3)) { > + if ((l3 & ATTR_SW_MANAGED) && > + pmap_page_dirty(l3)) { > + > vm_page_dirty(PHYS_TO_VM_PAGE(l3 & > + ~ATTR_MASK)); > + } > pmap_set(l3p, ATTR_AP(ATTR_AP_RO)); > PTE_SYNC(l3p); > /* XXX: Use pmap_invalidate_range */ > > > === > Mark Millard > marklmi at yahoo.com > ( dsl-only.net went > away in early 2018-Mar) > Thanks for this reference. I took a quick look at the 3 pmap implementations we have (haven't checked the new radix pmap yet), and it looks like only mmu_oea.c (32-bit AIM pmap, for G3 and G4) is missing vm_page_dirty() calls in its pmap_protect() implementation, analogous to the change you posted right above. Given this, I think it's safe to say that this missing piece is necessary. We'll work on a fix for this; looking at moea64_protect(), there may be additional work needed to support this as well, so it may take a few days. - Justin
Re: zfs deadlock on r360452 relating to busy vm page
On Wed, May 13, 2020 at 10:45:24AM +0300, Andriy Gapon wrote: > On 13/05/2020 10:35, Andriy Gapon wrote: > > On 13/05/2020 01:47, Bryan Drewery wrote: > >> Trivial repro: > >> > >> dd if=/dev/zero of=blah & tail -F blah > >> ^C > >> load: 0.21 cmd: tail 72381 [prev->lr_read_cv] 2.17r 0.00u 0.01s 0% 2600k > >> #0 0x80bce615 at mi_switch+0x155 > >> #1 0x80c1cfea at sleepq_switch+0x11a > >> #2 0x80b57f0a at _cv_wait+0x15a > >> #3 0x829ddab6 at rangelock_enter+0x306 > >> #4 0x829ecd3f at zfs_freebsd_getpages+0x14f > >> #5 0x810e3ab9 at VOP_GETPAGES_APV+0x59 > >> #6 0x80f349e7 at vnode_pager_getpages+0x37 > >> #7 0x80f2a93f at vm_pager_get_pages+0x4f > >> #8 0x80f054b0 at vm_fault+0x780 > >> #9 0x80f04bde at vm_fault_trap+0x6e > >> #10 0x8106544e at trap_pfault+0x1ee > >> #11 0x81064a9c at trap+0x44c > >> #12 0x8103a978 at calltrap+0x8 > > > > In r329363 I re-worked zfs_getpages and introduced range locking to it. > > At the time I believed that it was safe and maybe it was, please see the > > commit > > message. > > There, indeed, have been many performance / concurrency improvements to the > > VM > > system and r358443 is one of them. > > Thinking more about it, it could be r352176. > I think that vm_page_grab_valid (and later vm_page_grab_valid_unlocked) are > not > equivalent to the code that they replaced. > The original code would check the valid field before any locking and it would > attempt locking / busying only if a page is invalid. The object was required to > be locked though. > The new code tries to busy the page in any case. > > > I am not sure how to resolve the problem best. Maybe someone who knows the > > latest VM code better than me can comment on my assumptions stated in the > > commit > > message. The general trend has been to use the page busy lock as the single point of synchronization for per-page state. As you noted, updates to the valid bits were previously interlocked by the object lock, but this is coarse-grained and hurts concurrency. 
I think you are right that the range locking in getpages was ok before the recent change, but it seems preferable to try and address this in ZFS. > > In illumos (and, I think, in OpenZFS/ZoL) they don't have the range locking > > in > > this corner of the code because of a similar deadlock a long time ago. Do they just not implement readahead? Can you explain exactly what the range lock accomplishes here - is it entirely to ensure that znode block size remains stable? ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: svn commit: r360233 - in head: contrib/jemalloc . . . : This partially breaks a 2-socket 32-bit powerpc (old PowerMac G4) based on head -r360311
[I'm adding a reference to an old arm64/aarch64 bug that had pages turning to zero, in case this 32-bit powerpc issue is somewhat analogous.] On 2020-May-13, at 00:29, Mark Millard wrote: > [stress alone is sufficient to have jemalloc asserts fail > in stress, no need for a multi-socket G4 either. No need > to involve nfsd, mountd, rpcbind or the like. This is not > a claim that I know all the problems to be the same, just > that a jemalloc reported failure in this simpler context > happens and zeroed pages are involved.] > > Reminder: head -r360311 based context. > > > First I show a single CPU/core PowerMac G4 context failing > in stress. (I actually did this later, but it is the > simpler context.) I simply moved the media from the > 2-socket G4 to this slower, single-cpu/core one. > > cpu0: Motorola PowerPC 7400 revision 2.9, 466.42 MHz > cpu0: Features 9c00 > cpu0: HID0 8094c0a4 > real memory = 1577857024 (1504 MB) > avail memory = 1527508992 (1456 MB) > > # stress -m 1 --vm-bytes 1792M > stress: info: [1024] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd > : > /usr/src/contrib/jemalloc/include/jemalloc/internal/arena_inlines_b.h:258: > Failed assertion: "slab == extent_slab_get(extent)" > stress: FAIL: [1024] (415) <-- worker 1025 got signal 6 > stress: WARN: [1024] (417) now reaping child worker processes > stress: FAIL: [1024] (451) failed run completed in 243s > > (Note: 1792 is the biggest it allowed with M.) > > The following still pages in and out and fails: > > # stress -m 1 --vm-bytes 1290M > stress: info: [1163] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd > : > /usr/src/contrib/jemalloc/include/jemalloc/internal/arena_inlines_b.h:258: > Failed assertion: "slab == extent_slab_get(extent)" > . . . 
> > By contrast, the following had no problem for as > long as I let it run --and did not page in or out: > > # stress -m 1 --vm-bytes 1280M > stress: info: [1181] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd > > > > > The 2 socket PowerMac G4 context with 2048 MiByte of RAM . . . > > stress -m 1 --vm-bytes 1792M > > did not (quickly?) fail or page. 1792 > is as large as it would allow with M. > > The following also did not (quickly?) fail > (and were not paging): > > stress -m 2 --vm-bytes 896M > stress -m 4 --vm-bytes 448M > stress -m 8 --vm-bytes 224M > > (Only 1 example was run at a time.) > > But the following all did quickly fail (and were > paging): > > stress -m 8 --vm-bytes 225M > stress -m 4 --vm-bytes 449M > stress -m 2 --vm-bytes 897M > > (Only 1 example was run at a time.) > > I'll note that when I exited an su process > I ended up with a: > > : /usr/src/contrib/jemalloc/include/jemalloc/internal/sz.h:200: > Failed assertion: "ret == sz_index2size_compute(index)" > Abort trap (core dumped) > > and a matching su.core file. 
It appears > that stress's activity leads to other > processes also seeing examples of the > zeroed-page(s) problem (probably su had > paged some or had been fully swapped > out): > > (gdb) bt > #0 thr_kill () at thr_kill.S:4 > #1 0x503821d0 in __raise (s=6) at /usr/src/lib/libc/gen/raise.c:52 > #2 0x502e1d20 in abort () at /usr/src/lib/libc/stdlib/abort.c:67 > #3 0x502d6144 in sz_index2size_lookup (index=) at > /usr/src/contrib/jemalloc/include/jemalloc/internal/sz.h:200 > #4 sz_index2size (index=) at > /usr/src/contrib/jemalloc/include/jemalloc/internal/sz.h:207 > #5 ifree (tsd=0x5008b018, ptr=0x50041460, tcache=0x5008b138, > slow_path=) at jemalloc_jemalloc.c:2583 > #6 0x502d5cec in __je_free_default (ptr=0x50041460) at > jemalloc_jemalloc.c:2784 > #7 0x502d62d4 in __free (ptr=0x50041460) at jemalloc_jemalloc.c:2852 > #8 0x501050cc in openpam_destroy_chain (chain=0x50041480) at > /usr/src/contrib/openpam/lib/libpam/openpam_load.c:113 > #9 0x50105094 in openpam_destroy_chain (chain=0x500413c0) at > /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111 > #10 0x50105094 in openpam_destroy_chain (chain=0x50041320) at > /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111 > #11 0x50105094 in openpam_destroy_chain (chain=0x50041220) at > /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111 > #12 0x50105094 in openpam_destroy_chain (chain=0x50041120) at > /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111 > #13 0x50105094 in openpam_destroy_chain (chain=0x50041100) at > /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111 > #14 0x50105014 in openpam_clear_chains (policy=0x5064) at > /usr/src/contrib/openpam/lib/libpam/openpam_load.c:130 > #15 0x50101230 in pam_end (pamh=0x5060, status=) at > /usr/src/contrib/openpam/lib/libpam/pam_end.c:83 > #16 0x1001225c in main (argc=, argv=0x0) at > /usr/src/usr.bin/su/su.c:477 > > (gdb) print/x __je_sz_size2index_tab > $1 = {0x0 } > > > Notes: > > Given that the original problem did not involve > paging to the swap 
partition, maybe just making > it to the Laundry list or some such is sufficient, > something that is also involved when the swap > space is
Re: svn commit: r360233 - in head: contrib/jemalloc . . . : This partially breaks a 2-socket 32-bit powerpc (old PowerMac G4) based on head -r360311
[stress alone is sufficient to have jemalloc asserts fail in stress, no need for a multi-socket G4 either. No need to involve nfsd, mountd, rpcbind or the like. This is not a claim that I know all the problems to be the same, just that a jemalloc reported failure in this simpler context happens and zeroed pages are involved.] Reminder: head -r360311 based context. First I show a single CPU/core PowerMac G4 context failing in stress. (I actually did this later, but it is the simpler context.) I simply moved the media from the 2-socket G4 to this slower, single-cpu/core one. cpu0: Motorola PowerPC 7400 revision 2.9, 466.42 MHz cpu0: Features 9c00 cpu0: HID0 8094c0a4 real memory = 1577857024 (1504 MB) avail memory = 1527508992 (1456 MB) # stress -m 1 --vm-bytes 1792M stress: info: [1024] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd : /usr/src/contrib/jemalloc/include/jemalloc/internal/arena_inlines_b.h:258: Failed assertion: "slab == extent_slab_get(extent)" stress: FAIL: [1024] (415) <-- worker 1025 got signal 6 stress: WARN: [1024] (417) now reaping child worker processes stress: FAIL: [1024] (451) failed run completed in 243s (Note: 1792 is the biggest it allowed with M.) The following still pages in and out and fails: # stress -m 1 --vm-bytes 1290M stress: info: [1163] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd : /usr/src/contrib/jemalloc/include/jemalloc/internal/arena_inlines_b.h:258: Failed assertion: "slab == extent_slab_get(extent)" . . . By contrast, the following had no problem for as long as I let it run --and did not page in or out: # stress -m 1 --vm-bytes 1280M stress: info: [1181] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd The 2 socket PowerMac G4 context with 2048 MiByte of RAM . . . stress -m 1 --vm-bytes 1792M did not (quickly?) fail or page. 1792 is as large as it would allow with M. The following also did not (quickly?) 
fail (and were not paging): stress -m 2 --vm-bytes 896M stress -m 4 --vm-bytes 448M stress -m 8 --vm-bytes 224M (Only 1 example was run at a time.) But the following all did quickly fail (and were paging): stress -m 8 --vm-bytes 225M stress -m 4 --vm-bytes 449M stress -m 2 --vm-bytes 897M (Only 1 example was run at a time.) I'll note that when I exited an su process I ended up with a: : /usr/src/contrib/jemalloc/include/jemalloc/internal/sz.h:200: Failed assertion: "ret == sz_index2size_compute(index)" Abort trap (core dumped) and a matching su.core file. It appears that stress's activity leads to other processes also seeing examples of the zeroed-page(s) problem (probably su had paged some or had been fully swapped out): (gdb) bt #0 thr_kill () at thr_kill.S:4 #1 0x503821d0 in __raise (s=6) at /usr/src/lib/libc/gen/raise.c:52 #2 0x502e1d20 in abort () at /usr/src/lib/libc/stdlib/abort.c:67 #3 0x502d6144 in sz_index2size_lookup (index=) at /usr/src/contrib/jemalloc/include/jemalloc/internal/sz.h:200 #4 sz_index2size (index=) at /usr/src/contrib/jemalloc/include/jemalloc/internal/sz.h:207 #5 ifree (tsd=0x5008b018, ptr=0x50041460, tcache=0x5008b138, slow_path=) at jemalloc_jemalloc.c:2583 #6 0x502d5cec in __je_free_default (ptr=0x50041460) at jemalloc_jemalloc.c:2784 #7 0x502d62d4 in __free (ptr=0x50041460) at jemalloc_jemalloc.c:2852 #8 0x501050cc in openpam_destroy_chain (chain=0x50041480) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:113 #9 0x50105094 in openpam_destroy_chain (chain=0x500413c0) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111 #10 0x50105094 in openpam_destroy_chain (chain=0x50041320) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111 #11 0x50105094 in openpam_destroy_chain (chain=0x50041220) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111 #12 0x50105094 in openpam_destroy_chain (chain=0x50041120) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111 #13 0x50105094 in openpam_destroy_chain (chain=0x50041100) at 
/usr/src/contrib/openpam/lib/libpam/openpam_load.c:111 #14 0x50105014 in openpam_clear_chains (policy=0x5064) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:130 #15 0x50101230 in pam_end (pamh=0x5060, status=) at /usr/src/contrib/openpam/lib/libpam/pam_end.c:83 #16 0x1001225c in main (argc=, argv=0x0) at /usr/src/usr.bin/su/su.c:477 (gdb) print/x __je_sz_size2index_tab $1 = {0x0 } Notes: Given that the original problem did not involve paging to the swap partition, maybe just making it to the Laundry list or some such is sufficient, something that is also involved when the swap space is partially in use (according to top). Or sitting in the inactive list for a long time, if that has some special status. === Mark Millard marklmi at yahoo.com ( dsl-only.net went away in early 2018-Mar)
Re: svn commit: r360233 - in head: contrib/jemalloc . . . : This partially breaks a 2-socket 32-bit powerpc (old PowerMac G4) based on head -r360311
[Yet another new kind of experiment. But this looks like I can cause problems in fairly short order on demand now. Finally! And with that I've much better evidence for kernel vs. user-space process for making the zeroed memory appear in, for example, nfsd.] I've managed to get: : /usr/src/contrib/jemalloc/include/jemalloc/internal/arena_inlines_b.h:258: Failed assertion: "slab == extent_slab_get(extent)" : /usr/src/contrib/jemalloc/include/jemalloc/internal/arena_inlines_b.h:258: Failed assertion: "slab == extent_slab_get(extent)" and eventually: [1] Segmentation fault (core dumped) stress -m 2 --vm-bytes 1700M from a user program (stress) while another machine was attempting an nfs mount during the stress activity: # mount -onoatime,soft ...:/ /mnt && umount /mnt && rpcinfo -s ... [tcp] ...:/: RPCPROG_MNT: RPC: Timed out (To get failure I may have to run the commands multiple times. Timing details against stress's activity seem to matter.) That failure led to: # ls -ldT /*.core* -rw--- 1 root wheel 3899392 May 12 19:52:26 2020 /mountd.core # ls -ldT *.core* -rw--- 1 root wheel 2682880 May 12 20:00:26 2020 stress.core (Note: which of nfsd, mountd, or rpcbind fails need not be fully repeatable. stress.core seems to be written twice, probably because of the -m 2 in use.) The context that let me do this was to first (on the 2 socket G4 with a full 2048 MiByte RAM configuration): stress -m 2 --vm-bytes 1700M & Note that the stress command keeps the memory busy and causes paging to the swap/page space. I've not tried to make it just fit without paging or just barely paging or such. The original context did not involve paging or low RAM, so I do not expect paging to be required but can be involved. 
The stress program backtrace is different: 4827 return (tls_get_addr_slow(dtvp, index, offset)); 4828 } (gdb) bt -full #0 0x41831b04 in tls_get_addr_common (dtvp=0x4186c010, index=2, offset=4294937444) at /usr/src/libexec/rtld-elf/rtld.c:4824 dtv = 0x0 #1 0x4182bfcc in __tls_get_addr (ti=) at /usr/src/libexec/rtld-elf/powerpc/reloc.c:848 tp = p = #2 0x41a83464 in __get_locale () at /usr/src/lib/libc/locale/xlocale_private.h:199 No locals. #3 fprintf (fp=0x41b355f8, fmt=0x1804cbc "%s: FAIL: [%lli] (%d) ") at /usr/src/lib/libc/stdio/fprintf.c:57 ap = {{gpr = 2 '\002', fpr = 0 '\000', reserved = 20731, overflow_arg_area = 0xdb78, reg_save_area = 0xdae8}} ret = #4 0x01802348 in main (argc=, argv=) at stress.c:415 status = ret = 6 do_dryrun = 0 retval = 0 children = 1 do_backoff = do_hdd_bytes = do_hdd = do_vm_keep = 0 do_vm_hang = -1 do_vm_stride = 4096 do_vm_bytes = 1782579200 do_vm = 108174317627375616 do_io = do_cpu = do_timeout = 108176117243859333 starttime = 1589338322 i = forks = pid = 6140 stoptime = runtime = Apparently the asserts did not stop the code and it ran until a failure occurred (via dtv=0x0). Stress uses a mutex stored on a page that gets the "turns into zeros" problem, preventing the mprotect(ADDR,1,1) type of test because stress will write on the page. (I've not tried to find a minimal form of stress run.) The same sort of globals are again zeroed, such as: (gdb) print/x __je_sz_size2index_tab $1 = {0x0 } Another attempt lost rpcbind instead of mountd: # ls -ldT /*.core -rw--- 1 root wheel 3899392 May 12 19:52:26 2020 /mountd.core -rw--- 1 root wheel 3170304 May 12 20:03:00 2020 /rpcbind.core I again find that when I use gdb 3 times to: attach ??? x/x __je_sz_size2index_tab print (int)mprotect(ADDRESS,1,1) quit for each of rpcbind, mountd, and nfsd master that those processes no longer fail during the mount/umount/rpcinfo (or are far less likely to). 
But it turns out that later when I "service nfsd stop" nfsd does get the zeroed-memory based assert and core dumps. (I'd done a bunch of the mount/umount/ rpcinfo sequences before the stop.) That the failure is during SIGUSR1 based shutdown leads me to wonder if killing off some child process(es) is involved in the problem. There was *no* evidence of a signal for an attempt to write the page from the user process. It appears that the kernel is doing something that changes what the process sees --instead of the user-space program stomping on its own memory content. I've no clue how to track down the kernel activity that changes what the process sees on some page(s) of memory. (Prior testing with a debug kernel did not report problems, despite getting an example failure. So that seems insufficient.) At least a procedure is now known that does not involve waiting hours or days. The procedure (adjusted for how much RAM is present and number of cpus/cores?) could be appropriate to run in other contexts than the 32-bit powerpc
Re: zfs deadlock on r360452 relating to busy vm page
On 13/05/2020 10:35, Andriy Gapon wrote: > On 13/05/2020 01:47, Bryan Drewery wrote: >> Trivial repro: >> >> dd if=/dev/zero of=blah & tail -F blah >> ^C >> load: 0.21 cmd: tail 72381 [prev->lr_read_cv] 2.17r 0.00u 0.01s 0% 2600k >> #0 0x80bce615 at mi_switch+0x155 >> #1 0x80c1cfea at sleepq_switch+0x11a >> #2 0x80b57f0a at _cv_wait+0x15a >> #3 0x829ddab6 at rangelock_enter+0x306 >> #4 0x829ecd3f at zfs_freebsd_getpages+0x14f >> #5 0x810e3ab9 at VOP_GETPAGES_APV+0x59 >> #6 0x80f349e7 at vnode_pager_getpages+0x37 >> #7 0x80f2a93f at vm_pager_get_pages+0x4f >> #8 0x80f054b0 at vm_fault+0x780 >> #9 0x80f04bde at vm_fault_trap+0x6e >> #10 0x8106544e at trap_pfault+0x1ee >> #11 0x81064a9c at trap+0x44c >> #12 0x8103a978 at calltrap+0x8 > > In r329363 I re-worked zfs_getpages and introduced range locking to it. > At the time I believed that it was safe and maybe it was, please see the > commit > message. > There, indeed, have been many performance / concurrency improvements to the VM > system and r358443 is one of them. Thinking more about it, it could be r352176. I think that vm_page_grab_valid (and later vm_page_grab_valid_unlocked) are not equivalent to the code that they replaced. The original code would check the valid field before any locking and it would attempt locking / busying only if a page is invalid. The object was required to be locked though. The new code tries to busy the page in any case. > I am not sure how to resolve the problem best. Maybe someone who knows the > latest VM code better than me can comment on my assumptions stated in the > commit > message. > > In illumos (and, I think, in OpenZFS/ZoL) they don't have the range locking in > this corner of the code because of a similar deadlock a long time ago. > >> On 5/12/2020 3:13 PM, Bryan Drewery wrote: panic: deadlres_td_sleep_q: possible deadlock detected for 0xfe25eefa2e00 (find), blocked for 1802392 ticks > ... 
(kgdb) backtrace #0 sched_switch (td=0xfe255eac, flags=) at /usr/src/sys/kern/sched_ule.c:2147 #1 0x80bce615 in mi_switch (flags=260) at /usr/src/sys/kern/kern_synch.c:542 #2 0x80c1cfea in sleepq_switch (wchan=0xf810fb57dd48, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:625 #3 0x80b57f0a in _cv_wait (cvp=0xf810fb57dd48, lock=0xf80049a99040) at /usr/src/sys/kern/kern_condvar.c:146 #4 0x82434ab6 in rangelock_enter_reader (rl=0xf80049a99018, new=0xf8022cadb100) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c:429 #5 rangelock_enter (rl=0xf80049a99018, off=, len=, type=) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c:477 #6 0x82443d3f in zfs_getpages (vp=, ma=0xfe259f204b18, count=, rbehind=0xfe259f204ac4, rahead=0xfe259f204ad0) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:4500 #7 zfs_freebsd_getpages (ap=) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:4567 #8 0x810e3ab9 in VOP_GETPAGES_APV (vop=0x8250a1e0 , a=0xfe259f2049f0) at vnode_if.c:2644 #9 0x80f349e7 in VOP_GETPAGES (vp=, m=>>> out>, count=, rbehind=, rahead=) at ./vnode_if.h:1171 #10 vnode_pager_getpages (object=, m=, count=, rbehind=, rahead=) at /usr/src/sys/vm/vnode_pager.c:743 #11 0x80f2a93f in vm_pager_get_pages (object=0xf806cb637c60, m=0xfe259f204b18, count=1, rbehind=, rahead=) at /usr/src/sys/vm/vm_pager.c:305 #12 0x80f054b0 in vm_fault_getpages (fs=, nera=0, behindp=, aheadp=) at /usr/src/sys/vm/vm_fault.c:1163 #13 vm_fault (map=, vaddr=, fault_type=, fault_flags=, m_hold=>>> out>) at /usr/src/sys/vm/vm_fault.c:1394 #14 0x80f04bde in vm_fault_trap (map=0xfe25653949e8, vaddr=, fault_type=, fault_flags=0, signo=0xfe259f204d04, ucode=0xfe259f204d00) at /usr/src/sys/vm/vm_fault.c:589 #15 0x8106544e in trap_pfault (frame=0xfe259f204d40, usermode=, signo=, ucode=) at /usr/src/sys/amd64/amd64/trap.c:821 #16 0x81064a9c in trap (frame=0xfe259f204d40) at /usr/src/sys/amd64/amd64/trap.c:340 #17 #18 0x002034fc in ?? 
() > ... (kgdb) thread [Current thread is 8 (Thread 101255)] (kgdb) backtrace #0 sched_switch (td=0xfe25c8e9bc00, flags=) at /usr/src/sys/kern/sched_ule.c:2147 #1 0x80bce615 in mi_switch (flags=260) at /usr/src/sys/kern/kern_synch.c:542 #2 0x80c1cfea in sleepq_switch (wchan=0xfe001cbca850, pri=84) at /usr/src/sys/kern/subr_sleepqueue.c:625 #3 0x80f1de50 in _vm_page_busy_sleep (obj=,
Re: zfs deadlock on r360452 relating to busy vm page
On 13/05/2020 01:47, Bryan Drewery wrote: > Trivial repro: > > dd if=/dev/zero of=blah & tail -F blah > ^C > load: 0.21 cmd: tail 72381 [prev->lr_read_cv] 2.17r 0.00u 0.01s 0% 2600k > #0 0x80bce615 at mi_switch+0x155 > #1 0x80c1cfea at sleepq_switch+0x11a > #2 0x80b57f0a at _cv_wait+0x15a > #3 0x829ddab6 at rangelock_enter+0x306 > #4 0x829ecd3f at zfs_freebsd_getpages+0x14f > #5 0x810e3ab9 at VOP_GETPAGES_APV+0x59 > #6 0x80f349e7 at vnode_pager_getpages+0x37 > #7 0x80f2a93f at vm_pager_get_pages+0x4f > #8 0x80f054b0 at vm_fault+0x780 > #9 0x80f04bde at vm_fault_trap+0x6e > #10 0x8106544e at trap_pfault+0x1ee > #11 0x81064a9c at trap+0x44c > #12 0x8103a978 at calltrap+0x8 In r329363 I re-worked zfs_getpages and introduced range locking to it. At the time I believed that it was safe and maybe it was, please see the commit message. There, indeed, have been many performance / concurrency improvements to the VM system and r358443 is one of them. I am not sure how to resolve the problem best. Maybe someone who knows the latest VM code better than me can comment on my assumptions stated in the commit message. In illumos (and, I think, in OpenZFS/ZoL) they don't have the range locking in this corner of the code because of a similar deadlock a long time ago. > On 5/12/2020 3:13 PM, Bryan Drewery wrote: >>> panic: deadlres_td_sleep_q: possible deadlock detected for >>> 0xfe25eefa2e00 (find), blocked for 1802392 ticks ... 
>>> (kgdb) backtrace >>> #0 sched_switch (td=0xfe255eac, flags=) at >>> /usr/src/sys/kern/sched_ule.c:2147 >>> #1 0x80bce615 in mi_switch (flags=260) at >>> /usr/src/sys/kern/kern_synch.c:542 >>> #2 0x80c1cfea in sleepq_switch (wchan=0xf810fb57dd48, pri=0) >>> at /usr/src/sys/kern/subr_sleepqueue.c:625 >>> #3 0x80b57f0a in _cv_wait (cvp=0xf810fb57dd48, >>> lock=0xf80049a99040) at /usr/src/sys/kern/kern_condvar.c:146 >>> #4 0x82434ab6 in rangelock_enter_reader (rl=0xf80049a99018, >>> new=0xf8022cadb100) at >>> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c:429 >>> #5 rangelock_enter (rl=0xf80049a99018, off=, >>> len=, type=) at >>> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c:477 >>> #6 0x82443d3f in zfs_getpages (vp=, >>> ma=0xfe259f204b18, count=, rbehind=0xfe259f204ac4, >>> rahead=0xfe259f204ad0) at >>> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:4500 >>> #7 zfs_freebsd_getpages (ap=) at >>> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:4567 >>> #8 0x810e3ab9 in VOP_GETPAGES_APV (vop=0x8250a1e0 >>> , a=0xfe259f2049f0) at vnode_if.c:2644 >>> #9 0x80f349e7 in VOP_GETPAGES (vp=, m=>> out>, count=, rbehind=, rahead=) at >>> ./vnode_if.h:1171 >>> #10 vnode_pager_getpages (object=, m=, >>> count=, rbehind=, rahead=) at >>> /usr/src/sys/vm/vnode_pager.c:743 >>> #11 0x80f2a93f in vm_pager_get_pages (object=0xf806cb637c60, >>> m=0xfe259f204b18, count=1, rbehind=, rahead=) >>> at /usr/src/sys/vm/vm_pager.c:305 >>> #12 0x80f054b0 in vm_fault_getpages (fs=, nera=0, >>> behindp=, aheadp=) at >>> /usr/src/sys/vm/vm_fault.c:1163 >>> #13 vm_fault (map=, vaddr=, >>> fault_type=, fault_flags=, m_hold=>> out>) at /usr/src/sys/vm/vm_fault.c:1394 >>> #14 0x80f04bde in vm_fault_trap (map=0xfe25653949e8, >>> vaddr=, fault_type=, fault_flags=0, >>> signo=0xfe259f204d04, ucode=0xfe259f204d00) at >>> /usr/src/sys/vm/vm_fault.c:589 >>> #15 0x8106544e in trap_pfault (frame=0xfe259f204d40, >>> 
usermode=, signo=, ucode=) at >>> /usr/src/sys/amd64/amd64/trap.c:821 >>> #16 0x81064a9c in trap (frame=0xfe259f204d40) at >>> /usr/src/sys/amd64/amd64/trap.c:340 >>> #17 >>> #18 0x002034fc in ?? () ... >>> (kgdb) thread >>> [Current thread is 8 (Thread 101255)] >>> (kgdb) backtrace >>> #0 sched_switch (td=0xfe25c8e9bc00, flags=) at >>> /usr/src/sys/kern/sched_ule.c:2147 >>> #1 0x80bce615 in mi_switch (flags=260) at >>> /usr/src/sys/kern/kern_synch.c:542 >>> #2 0x80c1cfea in sleepq_switch (wchan=0xfe001cbca850, pri=84) >>> at /usr/src/sys/kern/subr_sleepqueue.c:625 >>> #3 0x80f1de50 in _vm_page_busy_sleep (obj=, >>> m=0xfe001cbca850, pindex=, wmesg=, >>> allocflags=21504, locked=false) at /usr/src/sys/vm/vm_page.c:1094 >>> #4 0x80f241f7 in vm_page_grab_sleep (object=0xf806cb637c60, >>> m=, pindex=, wmesg=, >>> allocflags=21504, locked=>> address 0x0>) at /usr/src/sys/vm/vm_page.c:4326 >>> #5 vm_page_acquire_unlocked (object=0xf806cb637c60, pindex=1098, >>> prev=, mp=0xfe2717fc6730, allocflags=21504) at >>> /usr/src/sys/vm/vm_page.c:4469 >>> #6 0x80f24c61 in vm_page_grab_valid_unlocked >>> (mp=0xfe2717fc6730,
lkpi: print stack trace in WARN_ON ?
Just to get a bigger exposure: https://reviews.freebsd.org/D24779 I think that this is a good idea and, if I am not mistaken, it should match the Linux behavior. -- Andriy Gapon