Re: Filesystem operations slower in 13.0 than 12.2
On 2021-Mar-15, at 14:57, Kevin Oberman wrote: > Responses in-line. > > On Sun, Mar 14, 2021 at 3:09 PM Mark Millard wrote: > >> On 2021-Mar-14, at 11:09, Kevin Oberman wrote: >> >> > . . . >> > >> > Seems to only occur on large r/w operations from/to the same disk. "sp >> > big-file /other/file/on/same/disk" or tar/untar operations on large files. >> > Hit this today updating firefox. >> > >> > I/O starts at >40MB/s. Dropped to about 1.5MB/s. If I tried doing other >> > things while it was running slowly, the disk would appear to lock up. E.g. >> > pwd(1) seemed to completely lock up the system, but I could still ping it >> > and, after about 30 seconds, things came back to life. It was also not >> > instantaneous. Disc activity dropped to <1MB/s for a few seconds before >> > everything froze. >> > >> > During the untar of firefox, I saw; this several times. I also looked at my >> > console where I found these errors during : >> > swap_pager: indefinite wait buffer: bufobj: 0, blkno: 55043, size: 8192 >> > swap_pager: indefinite wait buffer: bufobj: 0, blkno: 51572, size: 4096 >> >> Does anyone know: >> Are those messages normal "reading is taking a rather long >> time" notices or is their presence more useful information >> in some way about the type of problem or context for the >> problem? >> > As for the tests: > Are these messages always present when near a time frame > when the problem occurs? Never present in a near time > frame to a period when the problem does not occur? > In a large number of test, these errors have not repeated. They baffle me for > another reason. This system has 20G or RAM. Tyically, all swap space is > unused. ATM I see 16384M free out of 16384. Not sure that I have ever seen it > used, though it might have been while building rust. I have not built rust > for a month. > > It appears that the messages are associated with reading > the disk(s), not directly with writing them, where the > reads take more than "hz * 20" time units to complete. > (I'm looking at main (14) code.) What might contribute > to the time taken for the pending read(s)? > The reference to hz * 20 woke up a few sleeping memory cells. I forgot that I > cleaned up my loader.conf. It was largely a copy of the one on my decade-old > T520. I commented out "kern.hz=100". I don't recall the details, but I think > it was actually from an even older system, my T42 from before I retired. > > In any case, restoring this setting has greatly improved the situation. I now > have really bad disk I/O performance on large disk to disk activity > (untarring the firefox distro) instead of terrible performance and the system > freezes have vanished, though I do see pauses in response to clicks or text > entry, but the display remains active and the pauses are short... 1 to 15 > seconds, I'd guess. No, I have no idea what this indicates. Interesting. > I'm still not seeing the performance I was seeing back in February when 40 > MB/s for extended intervals was common and I once untarred firefox.tar.gz2 in > under a minute and performance seldom dropped below 1.4 MB/s. > >>> /* >>> * swap_pager_getpages() - bring pages in from swap >>> * >>> * Attempt to page in the pages in array "ma" of length "count". The >>> * caller may optionally specify that additional pages preceding and >>> * succeeding the specified range be paged in. The number of such >>> pages >>> * is returned in the "rbehind" and "rahead" parameters, and they will >>> * be in the inactive queue upon return. >>> * >>> * The pages in "ma" must be busied and will remain busied upon return. >>> */ >>> static int >>> swap_pager_getpages_locked(vm_object_t object, vm_page_t *ma, int count, >>> int *rbehind, int *rahead) >>> { >>> . . . >>> /* >>> * Wait for the pages we want to complete. VPO_SWAPINPROG is always >>> * cleared on completion. If an I/O error occurs, SWAPBLK_NONE >>> * is set in the metadata for each page in the request. >>> */ >>> VM_OBJECT_WLOCK(object); >>> /* This could be implemented more efficiently with aflags */ >>> while ((ma[0]->oflags & VPO_SWAPINPROG) != 0) { >>> ma[0]->oflags |= VPO_SWAPSLEEP; >>> VM_CNT_INC(v_intrans); >>> if (VM_OBJECT_SLEEP(object, >handle, PSWP, >>> "swread", hz * 20)) { >>> printf( >>> "swap_pager: indefinite wait buffer: bufobj: %p, blkno: %jd, size: %ld\n", >>> bp->b_bufobj, (intmax_t)bp->b_blkno, >>> bp->b_bcount); >>> } >>> } >>> VM_OBJECT_WUNLOCK(object); >>> . . . >>> >>> where: >>> >>> #define VM_OBJECT_SLEEP(object, wchan, pri, wmesg, timo)\ >>> rw_sleep((wchan), &(object)->lock, (pri), (wmesg), (timo)) >>> >>> and: >>> >>> #define rw_sleep(chan, rw, pri, wmesg, timo)
Re: Filesystem operations slower in 13.0 than 12.2
Responses in-line. On Sun, Mar 14, 2021 at 3:09 PM Mark Millard wrote: > > > On 2021-Mar-14, at 11:09, Kevin Oberman wrote: > > > . . . > > > > Seems to only occur on large r/w operations from/to the same disk. "sp > > big-file /other/file/on/same/disk" or tar/untar operations on large > files. > > Hit this today updating firefox. > > > > I/O starts at >40MB/s. Dropped to about 1.5MB/s. If I tried doing other > > things while it was running slowly, the disk would appear to lock up. > E.g. > > pwd(1) seemed to completely lock up the system, but I could still ping it > > and, after about 30 seconds, things came back to life. It was also not > > instantaneous. Disc activity dropped to <1MB/s for a few seconds before > > everything froze. > > > > During the untar of firefox, I saw; this several times. I also looked at > my > > console where I found these errors during : > > swap_pager: indefinite wait buffer: bufobj: 0, blkno: 55043, size: 8192 > > swap_pager: indefinite wait buffer: bufobj: 0, blkno: 51572, size: 4096 > > Does anyone know: > Are those messages normal "reading is taking a rather long > time" notices or is their presence more useful information > in some way about the type of problem or context for the > problem? > > As for the tests: > Are these messages always present when near a time frame > when the problem occurs? Never present in a near time > frame to a period when the problem does not occur? > In a large number of test, these errors have not repeated. They baffle me for another reason. This system has 20G or RAM. Tyically, all swap space is unused. ATM I see 16384M free out of 16384. Not sure that I have ever seen it used, though it might have been while building rust. I have not built rust for a month. > > It appears that the messages are associated with reading > the disk(s), not directly with writing them, where the > reads take more than "hz * 20" time units to complete. > (I'm looking at main (14) code.) What might contribute > to the time taken for the pending read(s)? > The reference to hz * 20 woke up a few sleeping memory cells. I forgot that I cleaned up my loader.conf. It was largely a copy of the one on my decade-old T520. I commented out "kern.hz=100". I don't recall the details, but I think it was actually from an even older system, my T42 from before I retired. In any case, restoring this setting has greatly improved the situation. I now have really bad disk I/O performance on large disk to disk activity (untarring the firefox distro) instead of terrible performance and the system freezes have vanished, though I do see pauses in response to clicks or text entry, but the display remains active and the pauses are short... 1 to 15 seconds, I'd guess. No, I have no idea what this indicates. I'm still not seeing the performance I was seeing back in February when 40 MB/s for extended intervals was common and I once untarred firefox.tar.gz2 in under a minute and performance seldom dropped below 1.4 MB/s. > > /* > * swap_pager_getpages() - bring pages in from swap > * > * Attempt to page in the pages in array "ma" of length "count". The > * caller may optionally specify that additional pages preceding and > * succeeding the specified range be paged in. The number of such > pages > * is returned in the "rbehind" and "rahead" parameters, and they will > * be in the inactive queue upon return. > * > * The pages in "ma" must be busied and will remain busied upon > return. > */ > static int > swap_pager_getpages_locked(vm_object_t object, vm_page_t *ma, int count, > int *rbehind, int *rahead) > { > . . . > /* > * Wait for the pages we want to complete. VPO_SWAPINPROG is > always > * cleared on completion. If an I/O error occurs, SWAPBLK_NONE > * is set in the metadata for each page in the request. > */ > VM_OBJECT_WLOCK(object); > /* This could be implemented more efficiently with aflags */ > while ((ma[0]->oflags & VPO_SWAPINPROG) != 0) { > ma[0]->oflags |= VPO_SWAPSLEEP; > VM_CNT_INC(v_intrans); > if (VM_OBJECT_SLEEP(object, >handle, PSWP, > "swread", hz * 20)) { > printf( > "swap_pager: indefinite wait buffer: bufobj: %p, blkno: %jd, size: %ld\n", > bp->b_bufobj, (intmax_t)bp->b_blkno, > bp->b_bcount); > } > } > VM_OBJECT_WUNLOCK(object); > . . . > > where: > > #define VM_OBJECT_SLEEP(object, wchan, pri, wmesg, timo)\ > rw_sleep((wchan), &(object)->lock, (pri), (wmesg), (timo)) > > and: > > #define rw_sleep(chan, rw, pri, wmesg, timo)\ > _sleep((chan), &(rw)->lock_object, (pri), (wmesg), \ > tick_sbt * (timo), 0, C_HARDCLOCK) > > (I do not claim to be able to interpret the implications > of the code that leads to the
Re: Troubles with netdump(4)
>> Next I try netdumping to another FreeBSD-host (virtual machine on >> VMWare >> hypervisor): >> root@host-3:~ # uname -mv >> FreeBSD 12.2-STABLE r369412 GENERIC amd64 >> root@host-3:~ # ifconfig em0 >> em0: flags=8843 metric 0 mtu 1500 >> >> options=81009b >> inet 192.168.7.18 netmask 0xff00 broadcast 192.168.7.255 >> media: Ethernet autoselect (1000baseT ) >> status: active >> nd6 options=29 >> root@host-3:~ # >> >> And netdumping to this host are still slow (~the same 5KB/s). >> >> Next step another FreeBSD-host (virtual machine on VirtualBox hypevisor): >> root@host-4:~ # uname -mv >> FreeBSD 12.2-STABLE r369412 GENERIC amd64 >> root@host-4:~ # ifconfig em0 >> em0: flags=8843 metric 0 mtu 1500 >> >> options=81009b >> inet 192.168.7.19 netmask 0xff00 broadcast 192.168.7.255 >> media: Ethernet autoselect (1000baseT ) >> status: active >> nd6 options=29 >> root@host-4:~ # >> >> Both (host-3 and host-4) are installed from >> FreeBSD-12.2-STABLE-amd64-20210304-r369412-bootonly.iso >> When I caused a kernel panic by sysctl debug.kdb.panic on host-4, >> netdumping to >> host-3 show the same 5 KB/s. > netdump requires explicit acks from the other side. > Could you try dumping the exchange between the dumping host and the server to > verify that acks are sent immediately after receiving the next chunk? Sure. There is a dump: https://disk.yandex.ru/d/i27n1m-I80DaEg What should be tuned: netdumpd, netdumpd-host or something else? Now I can't completely done netdump: there frequenly error: ** DUMP FAILED (ERROR 60) ** Cannot dump: unknown error (error=60). and panic inside iflib: https://pastebin.com/1BTTjGLJ ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"