Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On 09.09.2023 12:32, Mark Millard wrote:
> On Sep 8, 2023, at 21:54, Mark Millard wrote:
>> On Sep 8, 2023, at 18:19, Mark Millard wrote:
>>> On Sep 8, 2023, at 17:03, Mark Millard wrote:
>>>> On Sep 8, 2023, at 15:30, Martin Matuska wrote:
>>>> On 9. 9. 2023 0:09, Alexander Motin wrote:
>>>>> Thank you, Martin. I was able to reproduce the issue with your script and found the cause. I first thought the issue was triggered by the `cp`, but it turned out to be triggered by `cat`. It also got copy_file_range() support, but later than `cp`. That is probably why it slipped through testing. This patch fixes it for me: https://github.com/openzfs/zfs/pull/15251 . Mark, could you please try the patch?
>
> I finally stopped it at 7473 built (a little over 13 hrs elapsed):
>
> ^C[13:08:30] Error: Signal SIGINT caught, cleaning up and exiting
> [main-amd64-bulk_a-default] [2023-09-08_19h51m52s] [sigint:] Queued: 34588 Built: 7473 Failed: 23 Skipped: 798 Ignored: 335 Fetched: 0 Tobuild: 25959 Time: 13:08:26
> [13:08:30] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-08_19h51m52s
> [13:08:31] Cleaning up
> [13:17:10] Unmounting file systems
> Exiting with status 1
>
> In part that was more evidence for deadlocks at least being fairly rare as well. None of the failed ones looked odd. (A fair portion are because the bulk -a was mostly doing WITH_DEBUG= builds. Many upstreams change library names, some other file names, or paths used for debug builds, and ports generally do not cover building the debug builds for such well. I've used these runs to extend my list of exceptions that avoid using WITH_DEBUG.) So no evidence of corruption.

Thank you, Mark. The patch was accepted upstream and merged to both master and zfs-2.2-release branches.

--
Alexander Motin
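Until a kernel with the fix is installed, the mitigation in this thread's subject line applies: leave block cloning disabled. A sketch of the relevant knobs (a configuration fragment; the pool name is taken from the thread, and the stated default is an assumption about the FreeBSD versions discussed here):

```shell
# /etc/sysctl.conf fragment: keep the VFS-level block-cloning switch off
# (0 is the shipped default on the FreeBSD versions discussed in this thread):
vfs.zfs.bclone_enabled=0

# To inspect things interactively ("zamd64" is this thread's pool name;
# substitute your own):
#   sysctl vfs.zfs.bclone_enabled
#   zpool get feature@block_cloning zamd64
```

Note that disabling the sysctl stops new clones from being created; it does not deactivate the pool feature if it has already become active.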
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 8, 2023, at 21:54, Mark Millard wrote:

> On Sep 8, 2023, at 18:19, Mark Millard wrote:
>
>> On Sep 8, 2023, at 17:03, Mark Millard wrote:
>>
>>> On Sep 8, 2023, at 15:30, Martin Matuska wrote:
>>> I can confirm that the patch fixes the panic caused by the provided script on my test systems. Mark, would it be possible to try poudriere on your system with a patched kernel?
>>>
>>> . . .
>>>
>>> On 9. 9. 2023 0:09, Alexander Motin wrote:
>>> On 08.09.2023 09:52, Martin Matuska wrote: . . .
>>> Thank you, Martin. I was able to reproduce the issue with your script and found the cause. I first thought the issue was triggered by the `cp`, but it turned out to be triggered by `cat`. It also got copy_file_range() support, but later than `cp`. That is probably why it slipped through testing. This patch fixes it for me: https://github.com/openzfs/zfs/pull/15251 . Mark, could you please try the patch?
>>>
>>> If all goes well, this will end up reporting that the poudriere bulk -a is still running but has gotten past, say, 320+ port->package builds finished (so: more than double what had been observed so far for the panic context). Later would be a report with a larger figure. A normal run I might let go for 6000+ ports and 10 hr or so.
>>>
>>> Notes as I go . . .
>>>
>>> Patch applied, built, and installed to the test media. Also, booted:
>>>
>>> # uname -apKU
>>> FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #75 main-n265228-c9315099f69e-dirty: Thu Sep 7 13:28:47 PDT 2023 root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG amd64 amd64 150 150
>>>
>>> Note that this is with a debug kernel (-dbg- in the path and -DBG in the GENERIC* name). Also, the vintage it is based on includes:
>>>
>>> git: 969071be938c - main - vfs: copy_file_range() between multiple mountpoints of the same fs type
>>>
>>> The usual sort of sequencing previously reported was used to get to this point. Media update starts with the rewind to the checkpoint in hopes of avoiding oddities from the later failure.
>>>
>>> . . . :
>>>
>>> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: 34588 Built: 414 Failed: 0 Skipped: 39 Ignored: 335 Fetched: 0 Tobuild: 33800 Time: 00:30:41
>>>
>>> So 414 built and still building.
>>>
>>> More later. (It may be a while.)
>>
>> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: 34588 Built: 2013 Failed: 2 Skipped: 179 Ignored: 335 Fetched: 0 Tobuild: 32059 Time: 01:42:47
>>
>> and still going. (FYI: the failures are expected.)
>>
>> After a while I might stop it and start over with a non-debug kernel installed instead.
>
> I did ^C after 2.5 hr (with 2447 built):
>
> ^C[02:30:05] Error: Signal SIGINT caught, cleaning up and exiting
> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [sigint:] Queued: 34588 Built: 2447 Failed: 5 Skipped: 226 Ignored: 335 Fetched: 0 Tobuild: 31575 Time: 02:29:59
> [02:30:05] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-08_16h31m51s
> [02:30:05] Cleaning up
> [02:38:04] Unmounting file systems
> Exiting with status 1
>
> I'll switch it over to a non-debug kernel and, probably, world and set up/run another test.
>
> . . . (time goes by) . . .
>
> Hmm. This did not get sent when I wrote the above. FYI, non-debug test status:
>
> [main-amd64-bulk_a-default] [2023-09-08_19h51m52s] [parallel_build:] Queued: 34588 Built: 2547 Failed: 5 Skipped: 239 Ignored: 335 Fetched: 0 Tobuild: 31462 Time: 01:59:58
>
> I may let it run overnight.

I finally stopped it at 7473 built (a little over 13 hrs elapsed):

^C[13:08:30] Error: Signal SIGINT caught, cleaning up and exiting
[main-amd64-bulk_a-default] [2023-09-08_19h51m52s] [sigint:] Queued: 34588 Built: 7473 Failed: 23 Skipped: 798 Ignored: 335 Fetched: 0 Tobuild: 25959 Time: 13:08:26
[13:08:30] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-08_19h51m52s
[13:08:31] Cleaning up
[13:17:10] Unmounting file systems
Exiting with status 1

In part that was more evidence for deadlocks at least being fairly rare as well. None of the failed ones looked odd. (A fair portion are because the bulk -a was mostly doing WITH_DEBUG= builds. Many upstreams change library names, some other file names, or paths used for debug builds, and ports generally do not cover building the debug builds for such well. I've used these runs to extend my list of exceptions that avoid using WITH_DEBUG.) So no evidence of corruption.

(I do not normally do bulk -a builds. The rare bulk -a runs are normally to check that my configuration of a builder machine still works reasonably -- beyond building just the few hundred ports that I
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 8, 2023, at 18:19, Mark Millard wrote:

> On Sep 8, 2023, at 17:03, Mark Millard wrote:
>
>> On Sep 8, 2023, at 15:30, Martin Matuska wrote:
>>> I can confirm that the patch fixes the panic caused by the provided script on my test systems. Mark, would it be possible to try poudriere on your system with a patched kernel?
>>
>> . . .
>>
>> On 9. 9. 2023 0:09, Alexander Motin wrote:
>>> On 08.09.2023 09:52, Martin Matuska wrote: . . .
>>>
>>> Thank you, Martin. I was able to reproduce the issue with your script and found the cause.
>>>
>>> I first thought the issue was triggered by the `cp`, but it turned out to be triggered by `cat`. It also got copy_file_range() support, but later than `cp`. That is probably why it slipped through testing. This patch fixes it for me: https://github.com/openzfs/zfs/pull/15251 .
>>>
>>> Mark, could you please try the patch?
>>
>> If all goes well, this will end up reporting that the poudriere bulk -a is still running but has gotten past, say, 320+ port->package builds finished (so: more than double what had been observed so far for the panic context). Later would be a report with a larger figure. A normal run I might let go for 6000+ ports and 10 hr or so.
>>
>> Notes as I go . . .
>>
>> Patch applied, built, and installed to the test media. Also, booted:
>>
>> # uname -apKU
>> FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #75 main-n265228-c9315099f69e-dirty: Thu Sep 7 13:28:47 PDT 2023 root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG amd64 amd64 150 150
>>
>> Note that this is with a debug kernel (-dbg- in the path and -DBG in the GENERIC* name). Also, the vintage it is based on includes:
>>
>> git: 969071be938c - main - vfs: copy_file_range() between multiple mountpoints of the same fs type
>>
>> The usual sort of sequencing previously reported was used to get to this point. Media update starts with the rewind to the checkpoint in hopes of avoiding oddities from the later failure.
>>
>> . . . :
>>
>> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: 34588 Built: 414 Failed: 0 Skipped: 39 Ignored: 335 Fetched: 0 Tobuild: 33800 Time: 00:30:41
>>
>> So 414 built and still building.
>>
>> More later. (It may be a while.)
>
> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: 34588 Built: 2013 Failed: 2 Skipped: 179 Ignored: 335 Fetched: 0 Tobuild: 32059 Time: 01:42:47
>
> and still going. (FYI: the failures are expected.)
>
> After a while I might stop it and start over with a non-debug kernel installed instead.

I did ^C after 2.5 hr (with 2447 built):

^C[02:30:05] Error: Signal SIGINT caught, cleaning up and exiting
[main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [sigint:] Queued: 34588 Built: 2447 Failed: 5 Skipped: 226 Ignored: 335 Fetched: 0 Tobuild: 31575 Time: 02:29:59
[02:30:05] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-08_16h31m51s
[02:30:05] Cleaning up
[02:38:04] Unmounting file systems
Exiting with status 1

I'll switch it over to a non-debug kernel and, probably, world and set up/run another test.

. . . (time goes by) . . .

Hmm. This did not get sent when I wrote the above. FYI, non-debug test status:

[main-amd64-bulk_a-default] [2023-09-08_19h51m52s] [parallel_build:] Queued: 34588 Built: 2547 Failed: 5 Skipped: 239 Ignored: 335 Fetched: 0 Tobuild: 31462 Time: 01:59:58

I may let it run overnight.

===
Mark Millard
marklmi at yahoo.com
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Fri, 8 Sep 2023 17:03:07 -0700, Mark Millard wrote:

> On Sep 8, 2023, at 15:30, Martin Matuska wrote:
>> I can confirm that the patch fixes the panic caused by the provided script on my test systems. Mark, would it be possible to try poudriere on your system with a patched kernel?
>
> . . .
>
> On 9. 9. 2023 0:09, Alexander Motin wrote:
>> On 08.09.2023 09:52, Martin Matuska wrote:
>>> . . .
>>
>> Thank you, Martin. I was able to reproduce the issue with your script and found the cause.
>>
>> I first thought the issue was triggered by the `cp`, but it turned out to be triggered by `cat`. It also got copy_file_range() support, but later than `cp`. That is probably why it slipped through testing. This patch fixes it for me: https://github.com/openzfs/zfs/pull/15251 .
>>
>> Mark, could you please try the patch?
>
> If all goes well, this will end up reporting that the poudriere bulk -a is still running but has gotten past, say, 320+ port->package builds finished (so: more than double what had been observed so far for the panic context). Later would be a report with a larger figure. A normal run I might let go for 6000+ ports and 10 hr or so.
>
> Notes as I go . . .
>
> Patch applied, built, and installed to the test media. Also, booted:
>
> # uname -apKU
> FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #75 main-n265228-c9315099f69e-dirty: Thu Sep 7 13:28:47 PDT 2023 root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG amd64 amd64 150 150
>
> Note that this is with a debug kernel (-dbg- in the path and -DBG in the GENERIC* name). Also, the vintage it is based on includes:
>
> git: 969071be938c - main - vfs: copy_file_range() between multiple mountpoints of the same fs type
>
> The usual sort of sequencing previously reported was used to get to this point. Media update starts with the rewind to the checkpoint in hopes of avoiding oddities from the later failure.
>
> . . . :
>
> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: 34588 Built: 414 Failed: 0 Skipped: 39 Ignored: 335 Fetched: 0 Tobuild: 33800 Time: 00:30:41
>
> So 414 built and still building.
>
> More later. (It may be a while.)
>
> ===
> Mark Millard
> marklmi at yahoo.com

Is the fix planned to be MFC'ed to stable/14, and then to releng/14.0, once MFV'ed to main?

Regards.

--
Tomoaki AOKI
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 8, 2023, at 17:03, Mark Millard wrote:

> On Sep 8, 2023, at 15:30, Martin Matuska wrote:
>> I can confirm that the patch fixes the panic caused by the provided script on my test systems. Mark, would it be possible to try poudriere on your system with a patched kernel?
>
> . . .
>
> On 9. 9. 2023 0:09, Alexander Motin wrote:
>> On 08.09.2023 09:52, Martin Matuska wrote:
>>> . . .
>>
>> Thank you, Martin. I was able to reproduce the issue with your script and found the cause.
>>
>> I first thought the issue was triggered by the `cp`, but it turned out to be triggered by `cat`. It also got copy_file_range() support, but later than `cp`. That is probably why it slipped through testing. This patch fixes it for me: https://github.com/openzfs/zfs/pull/15251 .
>>
>> Mark, could you please try the patch?
>
> If all goes well, this will end up reporting that the poudriere bulk -a is still running but has gotten past, say, 320+ port->package builds finished (so: more than double what had been observed so far for the panic context). Later would be a report with a larger figure. A normal run I might let go for 6000+ ports and 10 hr or so.
>
> Notes as I go . . .
>
> Patch applied, built, and installed to the test media. Also, booted:
>
> # uname -apKU
> FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #75 main-n265228-c9315099f69e-dirty: Thu Sep 7 13:28:47 PDT 2023 root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG amd64 amd64 150 150
>
> Note that this is with a debug kernel (-dbg- in the path and -DBG in the GENERIC* name). Also, the vintage it is based on includes:
>
> git: 969071be938c - main - vfs: copy_file_range() between multiple mountpoints of the same fs type
>
> The usual sort of sequencing previously reported was used to get to this point. Media update starts with the rewind to the checkpoint in hopes of avoiding oddities from the later failure.
>
> . . . :
>
> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: 34588 Built: 414 Failed: 0 Skipped: 39 Ignored: 335 Fetched: 0 Tobuild: 33800 Time: 00:30:41
>
> So 414 built and still building.
>
> More later. (It may be a while.)

[main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: 34588 Built: 2013 Failed: 2 Skipped: 179 Ignored: 335 Fetched: 0 Tobuild: 32059 Time: 01:42:47

and still going. (FYI: the failures are expected.)

After a while I might stop it and start over with a non-debug kernel installed instead.

===
Mark Millard
marklmi at yahoo.com
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On 9/8/23 15:09, Alexander Motin wrote:
> Thank you, Martin. I was able to reproduce the issue with your script and found the cause.
>
> I first thought the issue was triggered by the `cp`, but it turned out to be triggered by `cat`. It also got copy_file_range() support, but later than `cp`. That is probably why it slipped through testing. This patch fixes it for me: https://github.com/openzfs/zfs/pull/15251 .
>
> Mark, could you please try the patch?

Thank you, Alex, for the fix!

--
Pawel Jakub Dawidek
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 8, 2023, at 15:30, Martin Matuska wrote:
> I can confirm that the patch fixes the panic caused by the provided script on my test systems. Mark, would it be possible to try poudriere on your system with a patched kernel?

. . .

On 9. 9. 2023 0:09, Alexander Motin wrote:
> On 08.09.2023 09:52, Martin Matuska wrote:
>> . . .
>
> Thank you, Martin. I was able to reproduce the issue with your script and found the cause.
>
> I first thought the issue was triggered by the `cp`, but it turned out to be triggered by `cat`. It also got copy_file_range() support, but later than `cp`. That is probably why it slipped through testing. This patch fixes it for me: https://github.com/openzfs/zfs/pull/15251 .
>
> Mark, could you please try the patch?

If all goes well, this will end up reporting that the poudriere bulk -a is still running but has gotten past, say, 320+ port->package builds finished (so: more than double what had been observed so far for the panic context). Later would be a report with a larger figure. A normal run I might let go for 6000+ ports and 10 hr or so.

Notes as I go . . .

Patch applied, built, and installed to the test media. Also, booted:

# uname -apKU
FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #75 main-n265228-c9315099f69e-dirty: Thu Sep 7 13:28:47 PDT 2023 root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG amd64 amd64 150 150

Note that this is with a debug kernel (-dbg- in the path and -DBG in the GENERIC* name). Also, the vintage it is based on includes:

git: 969071be938c - main - vfs: copy_file_range() between multiple mountpoints of the same fs type

The usual sort of sequencing previously reported was used to get to this point. Media update starts with the rewind to the checkpoint in hopes of avoiding oddities from the later failure.

. . . :

[main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: 34588 Built: 414 Failed: 0 Skipped: 39 Ignored: 335 Fetched: 0 Tobuild: 33800 Time: 00:30:41

So 414 built and still building.

More later. (It may be a while.)

===
Mark Millard
marklmi at yahoo.com
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
Hi Alexander,

I can confirm that the patch fixes the panic caused by the provided script on my test systems. Mark, would it be possible to try poudriere on your system with a patched kernel?

Thanks
mm

On 9. 9. 2023 0:09, Alexander Motin wrote:
> On 08.09.2023 09:52, Martin Matuska wrote:
>> I dug a little and was able to reproduce the panic without poudriere, with a shell script.
>>
>> #!/bin/sh
>> nl='
>> '
>> sed_script=s/aaa/b/
>> for ac_i in 1 2 3 4 5 6 7; do
>>   sed_script="$sed_script$nl$sed_script"
>> done
>> echo "$sed_script" 2>/dev/null | sed 99q >conftest.sed
>>
>> repeats=8
>> count=0
>> echo -n 0123456789 >"conftest.in"
>> while :
>> do
>>   cat "conftest.in" "conftest.in" >"conftest.tmp"
>>   mv "conftest.tmp" "conftest.in"
>>   cp "conftest.in" "conftest.nl"
>>   echo '' >> "conftest.nl"
>>   sed -f conftest.sed < "conftest.nl" >"conftest.out" 2>/dev/null || break
>>   diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
>>   count=$(($count + 1))
>>   echo "count: $count"
>>   # 10*(2^10) chars as input seems more than enough
>>   test $count -gt $repeats && break
>> done
>> rm -f conftest.in conftest.tmp conftest.nl conftest.out
>
> Thank you, Martin. I was able to reproduce the issue with your script and found the cause.
>
> I first thought the issue was triggered by the `cp`, but it turned out to be triggered by `cat`. It also got copy_file_range() support, but later than `cp`. That is probably why it slipped through testing. This patch fixes it for me: https://github.com/openzfs/zfs/pull/15251 .
>
> Mark, could you please try the patch?
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On 08.09.2023 09:52, Martin Matuska wrote:
> I dug a little and was able to reproduce the panic without poudriere, with a shell script.
>
> #!/bin/sh
> nl='
> '
> sed_script=s/aaa/b/
> for ac_i in 1 2 3 4 5 6 7; do
>   sed_script="$sed_script$nl$sed_script"
> done
> echo "$sed_script" 2>/dev/null | sed 99q >conftest.sed
>
> repeats=8
> count=0
> echo -n 0123456789 >"conftest.in"
> while :
> do
>   cat "conftest.in" "conftest.in" >"conftest.tmp"
>   mv "conftest.tmp" "conftest.in"
>   cp "conftest.in" "conftest.nl"
>   echo '' >> "conftest.nl"
>   sed -f conftest.sed < "conftest.nl" >"conftest.out" 2>/dev/null || break
>   diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
>   count=$(($count + 1))
>   echo "count: $count"
>   # 10*(2^10) chars as input seems more than enough
>   test $count -gt $repeats && break
> done
> rm -f conftest.in conftest.tmp conftest.nl conftest.out

Thank you, Martin. I was able to reproduce the issue with your script and found the cause.

I first thought the issue was triggered by the `cp`, but it turned out to be triggered by `cat`. It also got copy_file_range() support, but later than `cp`. That is probably why it slipped through testing. This patch fixes it for me: https://github.com/openzfs/zfs/pull/15251 .

Mark, could you please try the patch?

--
Alexander Motin
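The reproducer's size progression is worth spelling out: the loop starts from a 10-byte file and doubles it with `cat` on each pass, so after pass k the file is 10*2^k bytes. Notably, 10*2^8 = 2560 appears to correspond to the object size in the panic Mark reports elsewhere in the thread ("size=2560 access=2560+2560"), i.e. the failure hits when `cat` consumes the 2560-byte file twice. A minimal sketch of just the doubling loop (my own distillation, not from the thread; it needs no ZFS, so by itself it will not panic anywhere):

```shell
#!/bin/sh
# Double a 10-byte seed file 8 times, mirroring the reproducer's loop.
printf 0123456789 > demo.in
i=0
while [ "$i" -lt 8 ]; do
    # On a ZFS system with vfs.zfs.bclone_enabled=1, cat(1) may service
    # this concatenation via copy_file_range(), which is what tripped the bug.
    cat demo.in demo.in > demo.tmp
    mv demo.tmp demo.in
    i=$((i + 1))
done
size=$(wc -c < demo.in)
echo "$size"                 # 10 * 2^8 = 2560
rm -f demo.in demo.tmp
```

The growth is geometric, which is why Martin's `repeats=8` already pushes the file across several record-size boundaries within a few seconds.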
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 8, 2023, at 06:52, Martin Matuska wrote:
> I dug a little and was able to reproduce the panic without poudriere, with a shell script.
>
> You may want to increase "repeats". The script causes the panic in dmu_buf_hold_array_by_dnode() on my VirtualBox, with the cat command, on the 9th iteration.
>
> Here is the script:
>
> #!/bin/sh
> nl='
> '
> sed_script=s/aaa/b/
> for ac_i in 1 2 3 4 5 6 7; do
>   sed_script="$sed_script$nl$sed_script"
> done
> echo "$sed_script" 2>/dev/null | sed 99q >conftest.sed
>
> repeats=8
> count=0
> echo -n 0123456789 >"conftest.in"
> while :
> do
>   cat "conftest.in" "conftest.in" >"conftest.tmp"
>   mv "conftest.tmp" "conftest.in"
>   cp "conftest.in" "conftest.nl"
>   echo '' >> "conftest.nl"
>   sed -f conftest.sed < "conftest.nl" >"conftest.out" 2>/dev/null || break
>   diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
>   count=$(($count + 1))
>   echo "count: $count"
>   # 10*(2^10) chars as input seems more than enough
>   test $count -gt $repeats && break
> done
> rm -f conftest.in conftest.tmp conftest.nl conftest.out

. . . (history removed) . . .

# uname -apKU
FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #0 main-n265205-03a7c36ddbc0: Thu Sep 7 03:10:34 UTC 2023 r...@releng3.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 amd64 150 150

In my test environment with yesterday's snapshot kernel in use and with vfs.zfs.bclone_enabled=1 :

# ~/bclone_panic.sh
count: 1
count: 2
count: 3
count: 4
count: 5
count: 6
count: 7
count: 8

then the panic: no count 9 was reached.

===
Mark Millard
marklmi at yahoo.com
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
I dug a little and was able to reproduce the panic without poudriere, with a shell script.

#!/bin/sh
nl='
'
sed_script=s/aaa/b/
for ac_i in 1 2 3 4 5 6 7; do
  sed_script="$sed_script$nl$sed_script"
done
echo "$sed_script" 2>/dev/null | sed 99q >conftest.sed

repeats=8
count=0
echo -n 0123456789 >"conftest.in"
while :
do
  cat "conftest.in" "conftest.in" >"conftest.tmp"
  mv "conftest.tmp" "conftest.in"
  cp "conftest.in" "conftest.nl"
  echo '' >> "conftest.nl"
  sed -f conftest.sed < "conftest.nl" >"conftest.out" 2>/dev/null || break
  diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
  count=$(($count + 1))
  echo "count: $count"
  # 10*(2^10) chars as input seems more than enough
  test $count -gt $repeats && break
done
rm -f conftest.in conftest.tmp conftest.nl conftest.out

On 8. 9. 2023 6:32, Mark Millard wrote:
> [Today's main-snapshot kernel panics as well.]
>
> On Sep 7, 2023, at 16:32, Mark Millard wrote:
>> On Sep 7, 2023, at 13:07, Alexander Motin wrote:
>>> Thanks, Mark.
>>>
>>> On 07.09.2023 15:40, Mark Millard wrote:
>>>> On Sep 7, 2023, at 11:48, Glen Barber wrote:
>>>>> On Thu, Sep 07, 2023 at 11:17:22AM -0700, Mark Millard wrote:
>>>>>> When I next have time, should I retry based on a more recent vintage of main that includes 969071be938c ?
>>>>>
>>>>> Yes, please, if you can.
>>>>
>>>> As stands, I rebooted that machine into my normal environment, so the after-crash-with-dump-info context is preserved. I'll presume lack of a need to preserve that context unless I hear otherwise. (But I'll work on this until later today.)
>>>>
>>>> Even my normal environment predates the commit in question by a few commits. So I'll end up doing a more general round of updates overall.
>>>>
>>>> Someone can let me know if there is a preference for debug over non-debug for the next test run.
>>>
>>> It is not unknown for some bugs to disappear once debugging is enabled, due to different execution timings, but generally debug may detect the problem closer to its origin instead of looking at random consequences. I am only starting to look at this report (unless Pawel or somebody beats me to it), and don't have additional requests yet, but if you can repeat the same with a debug kernel (in-base ZFS's ZFS_DEBUG setting follows the kernel's INVARIANTS), it may give us some additional information.
>>
>> So I did a zpool import, rewinding to the checkpoint. (This depends on the questionable zfs doing fully as desired for this. Notably the normal environment has vfs.zfs.bclone_enabled=0 , including when it was doing this activity.) My normal environment reported no problems.
>>
>> Note: the earlier snapshot from my first setup was still in place since it was made just before the original checkpoint used above.
>>
>> However, the rewind did remove the /var/crash/ material that had been added.
>>
>> I did the appropriate zfs mount.
>>
>> I installed a debug kernel and world to the import. Again, no problems reported.
>>
>> I did the appropriate zfs umount.
>>
>> I did the appropriate zpool export.
>>
>> I rebooted with the test media.
>>
>> # sysctl vfs.zfs.bclone_enabled
>> vfs.zfs.bclone_enabled: 1
>>
>> # zpool trim -w zamd64
>>
>> # zpool checkpoint zamd64
>>
>> # uname -apKU
>> FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #74 main-n265188-117c54a78ccd-dirty: Tue Sep 5 21:29:53 PDT 2023 root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG amd64 amd64 150 150
>>
>> (So, before the 969071be938c vintage, same sources as for my last run but a debug build.)
>>
>> # poudriere bulk -jmain-amd64-bulk_a -a
>> . . .
>> [00:03:23] Building 34214 packages using up to 32 builders
>> [00:03:23] Hit CTRL+t at any time to see build progress and stats
>> [00:03:23] [01] [00:00:00] Builder starting
>> [00:04:19] [01] [00:00:56] Builder started
>> [00:04:20] [01] [00:00:01] Building ports-mgmt/pkg | pkg-1.20.6
>> [00:05:33] [01] [00:01:14] Finished ports-mgmt/pkg | pkg-1.20.6: Success
>> [00:05:53] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
>> [00:05:53] [02] [00:00:00] Builder starting
>> . . .
>> [00:05:54] [32] [00:00:00] Builder starting
>> [00:06:11] [01] [00:00:18] Finished print/indexinfo | indexinfo-0.3.1: Success
>> [00:06:12] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22_1
>> [00:08:24] [01] [00:02:12] Finished devel/gettext-runtime | gettext-runtime-0.22_1: Success
>> [00:08:31] [01] [00:00:00] Building devel/libtextstyle | libtextstyle-0.22
>> [00:10:06] [05] [00:04:13] Builder started
>> [00:10:06] [05] [00:00:00] Building devel/autoconf-switch | autoconf-switch-20220527
>> [00:10:06] [31] [00:04:12] Builder started
>> [00:10:06] [31] [00:00:00] Building devel/libatomic_ops | libatomic_ops-7.8.0
>> . . .
>>
>> Crashed again, with 158 *.pkg files in .building/All/ after rebooting.
>>
>> The crash is similar to the non-debug one. No extra
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
[Today's main-snapshot kernel panics as well.]

On Sep 7, 2023, at 16:32, Mark Millard wrote:

> On Sep 7, 2023, at 13:07, Alexander Motin wrote:
>> Thanks, Mark.
>>
>> On 07.09.2023 15:40, Mark Millard wrote:
>>> On Sep 7, 2023, at 11:48, Glen Barber wrote:
>>>> On Thu, Sep 07, 2023 at 11:17:22AM -0700, Mark Millard wrote:
>>>>> When I next have time, should I retry based on a more recent vintage of main that includes 969071be938c ?
>>>>
>>>> Yes, please, if you can.
>>>
>>> As stands, I rebooted that machine into my normal environment, so the after-crash-with-dump-info context is preserved. I'll presume lack of a need to preserve that context unless I hear otherwise. (But I'll work on this until later today.)
>>>
>>> Even my normal environment predates the commit in question by a few commits. So I'll end up doing a more general round of updates overall.
>>>
>>> Someone can let me know if there is a preference for debug over non-debug for the next test run.
>>
>> It is not unknown for some bugs to disappear once debugging is enabled, due to different execution timings, but generally debug may detect the problem closer to its origin instead of looking at random consequences. I am only starting to look at this report (unless Pawel or somebody beats me to it), and don't have additional requests yet, but if you can repeat the same with a debug kernel (in-base ZFS's ZFS_DEBUG setting follows the kernel's INVARIANTS), it may give us some additional information.
>
> So I did a zpool import, rewinding to the checkpoint. (This depends on the questionable zfs doing fully as desired for this. Notably the normal environment has vfs.zfs.bclone_enabled=0 , including when it was doing this activity.) My normal environment reported no problems.
>
> Note: the earlier snapshot from my first setup was still in place since it was made just before the original checkpoint used above.
>
> However, the rewind did remove the /var/crash/ material that had been added.
>
> I did the appropriate zfs mount.
>
> I installed a debug kernel and world to the import. Again, no problems reported.
>
> I did the appropriate zfs umount.
>
> I did the appropriate zpool export.
>
> I rebooted with the test media.
>
> # sysctl vfs.zfs.bclone_enabled
> vfs.zfs.bclone_enabled: 1
>
> # zpool trim -w zamd64
>
> # zpool checkpoint zamd64
>
> # uname -apKU
> FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #74 main-n265188-117c54a78ccd-dirty: Tue Sep 5 21:29:53 PDT 2023 root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG amd64 amd64 150 150
>
> (So, before the 969071be938c vintage, same sources as for my last run but a debug build.)
>
> # poudriere bulk -jmain-amd64-bulk_a -a
> . . .
> [00:03:23] Building 34214 packages using up to 32 builders
> [00:03:23] Hit CTRL+t at any time to see build progress and stats
> [00:03:23] [01] [00:00:00] Builder starting
> [00:04:19] [01] [00:00:56] Builder started
> [00:04:20] [01] [00:00:01] Building ports-mgmt/pkg | pkg-1.20.6
> [00:05:33] [01] [00:01:14] Finished ports-mgmt/pkg | pkg-1.20.6: Success
> [00:05:53] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
> [00:05:53] [02] [00:00:00] Builder starting
> . . .
> [00:05:54] [32] [00:00:00] Builder starting
> [00:06:11] [01] [00:00:18] Finished print/indexinfo | indexinfo-0.3.1: Success
> [00:06:12] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22_1
> [00:08:24] [01] [00:02:12] Finished devel/gettext-runtime | gettext-runtime-0.22_1: Success
> [00:08:31] [01] [00:00:00] Building devel/libtextstyle | libtextstyle-0.22
> [00:10:06] [05] [00:04:13] Builder started
> [00:10:06] [05] [00:00:00] Building devel/autoconf-switch | autoconf-switch-20220527
> [00:10:06] [31] [00:04:12] Builder started
> [00:10:06] [31] [00:00:00] Building devel/libatomic_ops | libatomic_ops-7.8.0
> . . .
>
> Crashed again, with 158 *.pkg files in .building/All/ after rebooting.
>
> The crash is similar to the non-debug one. No extra output from the debug build.
>
> For reference:
>
> Unread portion of the kernel message buffer:
> panic: Solaris(panic): zfs: accessing past end of object 422/10b1c02 (size=2560 access=2560+2560)
> . . .

Same world with a newer snapshot main kernel that should be compatible with the world:

# uname -apKU
FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #0 main-n265205-03a7c36ddbc0: Thu Sep 7 03:10:34 UTC 2023 r...@releng3.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 amd64 150 150

Steps:

#NOTE: (re)boot to normal environment
#NOTE: login
cd ~/artifacts/ #NOTE: as needed
. . .
fetch http://ftp3.freebsd.org/pub/FreeBSD/snapshots/amd64/15.0-CURRENT/kernel.txz
fetch http://ftp3.freebsd.org/pub/FreeBSD/snapshots/amd64/15.0-CURRENT/kernel-dbg.txz
fetch
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 7, 2023, at 13:07, Alexander Motin wrote: > Thanks, Mark. > > On 07.09.2023 15:40, Mark Millard wrote: >> On Sep 7, 2023, at 11:48, Glen Barber wrote: >>> On Thu, Sep 07, 2023 at 11:17:22AM -0700, Mark Millard wrote: When I next have time, should I retry based on a more recent vintage of main that includes 969071be938c ? >>> >>> Yes, please, if you can. >> As stands, I rebooted that machine into my normal >> environment, so the after-crash-with-dump-info >> context is preserved. I'll presume lack of a need >> to preserve that context unless I hear otherwise. >> (But I'll work on this until later today.) >> Even my normal environment predates the commit in >> question by a few commits. So I'll end up doing a >> more general round of updates overall. >> Someone can let me know if there is a preference >> for debug over non-debug for the next test run. > > It is not unknown for some bugs to disappear once debugging is enabled due to > different execution timings, but generally debug may detect the problem > closer to its origin instead of looking at random consequences. I am only > starting to look at this report (unless Pawel or somebody beat me to it), and > don't have additional requests yet, but if you can repeat the same with a debug > kernel (in-base ZFS's ZFS_DEBUG setting follows the kernel's INVARIANTS), it may > give us some additional information. So I did a zpool import, rewinding to the checkpoint. (This depends on the questionable zfs doing fully as desired for this. Notably the normal environment has vfs.zfs.bclone_enabled=0 , including when it was doing this activity.) My normal environment reported no problems. Note: the earlier snapshot from my first setup was still in place since it was made just before the original checkpoint used above. However, the rewind did remove the /var/crash/ material that had been added. I did the appropriate zfs mount. I installed a debug kernel and world to the import. Again, no problems reported.
I did the appropriate zfs umount. I did the appropriate zpool export. I rebooted with the test media. # sysctl vfs.zfs.bclone_enabled vfs.zfs.bclone_enabled: 1 # zpool trim -w zamd64 # zpool checkpoint zamd64 # uname -apKU FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #74 main-n265188-117c54a78ccd-dirty: Tue Sep 5 21:29:53 PDT 2023 root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG amd64 amd64 150 150 (So, before the 969071be938c vintage, same sources as for my last run but a debug build.) # poudriere bulk -jmain-amd64-bulk_a -a . . . [00:03:23] Building 34214 packages using up to 32 builders [00:03:23] Hit CTRL+t at any time to see build progress and stats [00:03:23] [01] [00:00:00] Builder starting [00:04:19] [01] [00:00:56] Builder started [00:04:20] [01] [00:00:01] Building ports-mgmt/pkg | pkg-1.20.6 [00:05:33] [01] [00:01:14] Finished ports-mgmt/pkg | pkg-1.20.6: Success [00:05:53] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1 [00:05:53] [02] [00:00:00] Builder starting . . . [00:05:54] [32] [00:00:00] Builder starting [00:06:11] [01] [00:00:18] Finished print/indexinfo | indexinfo-0.3.1: Success [00:06:12] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22_1 [00:08:24] [01] [00:02:12] Finished devel/gettext-runtime | gettext-runtime-0.22_1: Success [00:08:31] [01] [00:00:00] Building devel/libtextstyle | libtextstyle-0.22 [00:10:06] [05] [00:04:13] Builder started [00:10:06] [05] [00:00:00] Building devel/autoconf-switch | autoconf-switch-20220527 [00:10:06] [31] [00:04:12] Builder started [00:10:06] [31] [00:00:00] Building devel/libatomic_ops | libatomic_ops-7.8.0 . . . Crashed again, with 158 *.pkg files in .building/All/ after rebooting. The crash is similar to the non-debug one. No extra output from the debug build. 
For reference: Unread portion of the kernel message buffer: panic: Solaris(panic): zfs: accessing past end of object 422/10b1c02 (size=2560 access=2560+2560) cpuid = 15 time = 1694127988 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe02e783b5a0 vpanic() at vpanic+0x132/frame 0xfe02e783b6d0 panic() at panic+0x43/frame 0xfe02e783b730 vcmn_err() at vcmn_err+0xeb/frame 0xfe02e783b860 zfs_panic_recover() at zfs_panic_recover+0x59/frame 0xfe02e783b8c0 dmu_buf_hold_array_by_dnode() at dmu_buf_hold_array_by_dnode+0xb8/frame 0xfe02e783b970 dmu_brt_clone() at dmu_brt_clone+0x61/frame 0xfe02e783b9f0 zfs_clone_range() at zfs_clone_range+0xaa3/frame 0xfe02e783bbc0 zfs_freebsd_copy_file_range() at zfs_freebsd_copy_file_range+0x18a/frame 0xfe02e783bc40 vn_copy_file_range() at vn_copy_file_range+0x114/frame 0xfe02e783bce0 kern_copy_file_range() at kern_copy_file_range+0x36c/frame 0xfe02e783bdb0 sys_copy_file_range() at sys_copy_file_range+0x78/frame 0xfe02e783be00 amd64_syscall() at amd64_syscall+0x14f/frame
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
Thanks, Mark. On 07.09.2023 15:40, Mark Millard wrote: On Sep 7, 2023, at 11:48, Glen Barber wrote: On Thu, Sep 07, 2023 at 11:17:22AM -0700, Mark Millard wrote: When I next have time, should I retry based on a more recent vintage of main that includes 969071be938c ? Yes, please, if you can. As stands, I rebooted that machine into my normal environment, so the after-crash-with-dump-info context is preserved. I'll presume lack of a need to preserve that context unless I hear otherwise. (But I'll work on this until later today.) Even my normal environment predates the commit in question by a few commits. So I'll end up doing a more general round of updates overall. Someone can let me know if there is a preference for debug over non-debug for the next test run. It is not unknown for some bugs to disappear once debugging is enabled due to different execution timings, but generally debug may detect the problem closer to its origin instead of looking at random consequences. I am only starting to look at this report (unless Pawel or somebody beat me to it), and don't have additional requests yet, but if you can repeat the same with a debug kernel (in-base ZFS's ZFS_DEBUG setting follows the kernel's INVARIANTS), it may give us some additional information. Looking at "git: 969071be938c - main", the relevant part seems to be just (white space possibly not preserved accurately):

diff --git a/sys/kern/vfs_vnops.c b/sys/kern/vfs_vnops.c
index 9fb5aee6a023..4e4161ef1a7f 100644
--- a/sys/kern/vfs_vnops.c
+++ b/sys/kern/vfs_vnops.c
@@ -3076,12 +3076,14 @@ vn_copy_file_range(struct vnode *invp, off_t *inoffp, struct vnode *outvp,
 		goto out;
 	/*
-	 * If the two vnode are for the same file system, call
+	 * If the two vnodes are for the same file system type, call
 	 * VOP_COPY_FILE_RANGE(), otherwise call vn_generic_copy_file_range()
-	 * which can handle copies across multiple file systems.
+	 * which can handle copies across multiple file system types.
 	 */
 	*lenp = len;
-	if (invp->v_mount == outvp->v_mount)
+	if (invp->v_mount == outvp->v_mount ||
+	    strcmp(invp->v_mount->mnt_vfc->vfc_name,
+	    outvp->v_mount->mnt_vfc->vfc_name) == 0)
 		error = VOP_COPY_FILE_RANGE(invp, inoffp, outvp, outoffp,
 		    lenp, flags, incred, outcred, fsize_td);
 	else

That looks to call VOP_COPY_FILE_RANGE in more contexts and vn_generic_copy_file_range in fewer. The backtrace I reported involves: VOP_COPY_FILE_RANGE. So it appears this change is unlikely to invalidate my test result, although failure might happen sooner if more VOP_COPY_FILE_RANGE calls happen with the newer code. Your logic is likely right, but if you have block cloning requests both within and between datasets, this patch may change the pattern. Though it is obviously not a fix for the issue. I responded to the commit email only because it makes no difference while vfs.zfs.bclone_enabled is 0. That in turn means that someone may come up with some other change for me to test by the time I get around to setting up another test. Let me know if so. -- Alexander Motin
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 7, 2023, at 11:48, Glen Barber wrote: > On Thu, Sep 07, 2023 at 11:17:22AM -0700, Mark Millard wrote: >> When I next have time, should I retry based on a more recent >> vintage of main that includes 969071be938c ? >> > > Yes, please, if you can. As stands, I rebooted that machine into my normal environment, so the after-crash-with-dump-info context is preserved. I'll presume lack of a need to preserve that context unless I hear otherwise. (But I'll work on this until later today.) Even my normal environment predates the commit in question by a few commits. So I'll end up doing a more general round of updates overall. Someone can let me know if there is a preference for debug over non-debug for the next test run. Looking at "git: 969071be938c - main", the relevant part seems to be just (white space possibly not preserved accurately):

diff --git a/sys/kern/vfs_vnops.c b/sys/kern/vfs_vnops.c
index 9fb5aee6a023..4e4161ef1a7f 100644
--- a/sys/kern/vfs_vnops.c
+++ b/sys/kern/vfs_vnops.c
@@ -3076,12 +3076,14 @@ vn_copy_file_range(struct vnode *invp, off_t *inoffp, struct vnode *outvp,
 		goto out;
 	/*
-	 * If the two vnode are for the same file system, call
+	 * If the two vnodes are for the same file system type, call
 	 * VOP_COPY_FILE_RANGE(), otherwise call vn_generic_copy_file_range()
-	 * which can handle copies across multiple file systems.
+	 * which can handle copies across multiple file system types.
 	 */
 	*lenp = len;
-	if (invp->v_mount == outvp->v_mount)
+	if (invp->v_mount == outvp->v_mount ||
+	    strcmp(invp->v_mount->mnt_vfc->vfc_name,
+	    outvp->v_mount->mnt_vfc->vfc_name) == 0)
 		error = VOP_COPY_FILE_RANGE(invp, inoffp, outvp, outoffp,
 		    lenp, flags, incred, outcred, fsize_td);
 	else

That looks to call VOP_COPY_FILE_RANGE in more contexts and vn_generic_copy_file_range in fewer.
The backtrace I reported involves: VOP_COPY_FILE_RANGE. So it appears this change is unlikely to invalidate my test result, although failure might happen sooner if more VOP_COPY_FILE_RANGE calls happen with the newer code. That in turn means that someone may come up with some other change for me to test by the time I get around to setting up another test. Let me know if so. === Mark Millard marklmi at yahoo.com
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Thu, Sep 07, 2023 at 11:17:22AM -0700, Mark Millard wrote: > When I next have time, should I retry based on a more recent > vintage of main that includes 969071be938c ? > Yes, please, if you can. Glen
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
[Drat, the request to rerun my tests did not mention the more recent change: vfs: copy_file_range() between multiple mountpoints of the same fs type and I'd not noticed on my own and ran the test without updating.] On Sep 7, 2023, at 11:02, Mark Millard wrote: > I was requested to do a test with vfs.zfs.bclone_enabled=1 and > the bulk -a build paniced (having stored 128 *.pkg files in > .building/ first): Unfortunately, rerunning my tests with this set was testing a context predating: Wed, 06 Sep 2023 . . . • git: 969071be938c - main - vfs: copy_file_range() between multiple mountpoints of the same fs type Martin Matuska So the information might be out of date for main and for stable/14 : I've no clue how good of a test it was. Maybe some of those I've cc'd would know. When I next have time, should I retry based on a more recent vintage of main that includes 969071be938c ? > # more /var/crash/core.txt.3 > . . . > Unread portion of the kernel message buffer: > panic: Solaris(panic): zfs: accessing past end of object 422/1108c16 > (size=2560 access=2560+2560) > cpuid = 15 > time = 1694103674 > KDB: stack backtrace: > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe0352758590 > vpanic() at vpanic+0x132/frame 0xfe03527586c0 > panic() at panic+0x43/frame 0xfe0352758720 > vcmn_err() at vcmn_err+0xeb/frame 0xfe0352758850 > zfs_panic_recover() at zfs_panic_recover+0x59/frame 0xfe03527588b0 > dmu_buf_hold_array_by_dnode() at dmu_buf_hold_array_by_dnode+0x97/frame > 0xfe0352758960 > dmu_brt_clone() at dmu_brt_clone+0x61/frame 0xfe03527589f0 > zfs_clone_range() at zfs_clone_range+0xa6a/frame 0xfe0352758bc0 > zfs_freebsd_copy_file_range() at zfs_freebsd_copy_file_range+0x1ae/frame > 0xfe0352758c40 > vn_copy_file_range() at vn_copy_file_range+0x11e/frame 0xfe0352758ce0 > kern_copy_file_range() at kern_copy_file_range+0x338/frame 0xfe0352758db0 > sys_copy_file_range() at sys_copy_file_range+0x78/frame 0xfe0352758e00 > amd64_syscall() at
amd64_syscall+0x109/frame 0xfe0352758f30 > fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfe0352758f30 > --- syscall (569, FreeBSD ELF64, copy_file_range), rip = 0x1ce4506d155a, rsp > = 0x1ce44ec71e88, rbp = 0x1ce44ec72320 --- > KDB: enter: panic > > __curthread () at /usr/main-src/sys/amd64/include/pcpu_aux.h:57 > 57 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct > pcpu, > (kgdb) #0 __curthread () at /usr/main-src/sys/amd64/include/pcpu_aux.h:57 > #1 doadump (textdump=textdump@entry=0) > at /usr/main-src/sys/kern/kern_shutdown.c:405 > #2 0x804a442a in db_dump (dummy=, > dummy2=, dummy3=, dummy4=) > at /usr/main-src/sys/ddb/db_command.c:591 > #3 0x804a422d in db_command (last_cmdp=, > cmd_table=, dopager=true) > at /usr/main-src/sys/ddb/db_command.c:504 > #4 0x804a3eed in db_command_loop () > at /usr/main-src/sys/ddb/db_command.c:551 > #5 0x804a7876 in db_trap (type=, code=) > at /usr/main-src/sys/ddb/db_main.c:268 > #6 0x80bb9e57 in kdb_trap (type=type@entry=3, code=code@entry=0, > tf=tf@entry=0xfe03527584d0) at /usr/main-src/sys/kern/subr_kdb.c:790 > #7 0x8104ad3d in trap (frame=0xfe03527584d0) > at /usr/main-src/sys/amd64/amd64/trap.c:608 > #8 > #9 kdb_enter (why=, msg=) > at /usr/main-src/sys/kern/subr_kdb.c:556 > #10 0x80b6aab3 in vpanic (fmt=0x82be52d6 "%s%s", > ap=ap@entry=0xfe0352758700) > at /usr/main-src/sys/kern/kern_shutdown.c:958 > #11 0x80b6a943 in panic ( > fmt=0x820aa2e8 "\312C$\201\377\377\377\377") > at /usr/main-src/sys/kern/kern_shutdown.c:894 > #12 0x82993c5b in vcmn_err (ce=, > fmt=0x82bfdd1f "zfs: accessing past end of object %llx/%llx (size=%u > access=%llu+%llu)", adx=0xfe0352758890) > at /usr/main-src/sys/contrib/openzfs/module/os/freebsd/spl/spl_cmn_err.c:60 > #13 0x82a84d69 in zfs_panic_recover ( > fmt=0x12 ) > at /usr/main-src/sys/contrib/openzfs/module/zfs/spa_misc.c:1594 > #14 0x829f8e27 in dmu_buf_hold_array_by_dnode (dn=0xf813dfc48978, > offset=offset@entry=2560, length=length@entry=2560, read=read@entry=0, 
>tag=0x82bd8175, numbufsp=numbufsp@entry=0xfe03527589bc, > dbpp=0xfe03527589c0, flags=0) > at /usr/main-src/sys/contrib/openzfs/module/zfs/dmu.c:543 > #15 0x829fc6a1 in dmu_buf_hold_array (os=, > object=, read=0, numbufsp=0xfe03527589bc, > dbpp=0xfe03527589c0, offset=, length=, > tag=) > at /usr/main-src/sys/contrib/openzfs/module/zfs/dmu.c:654 > #16 dmu_brt_clone (os=os@entry=0xf8010ae0e000, object=, > offset=offset@entry=2560, length=length@entry=2560, > tx=tx@entry=0xf81aaeb6e100, bps=bps@entry=0xfe0595931000, nbps=1, > replay=0) at /usr/main-src/sys/contrib/openzfs/module/zfs/dmu.c:2301 > #17 0x82b4440a in
main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
I was requested to do a test with vfs.zfs.bclone_enabled=1 and the bulk -a build paniced (having stored 128 *.pkg files in .building/ first): # more /var/crash/core.txt.3 . . . Unread portion of the kernel message buffer: panic: Solaris(panic): zfs: accessing past end of object 422/1108c16 (size=2560 access=2560+2560) cpuid = 15 time = 1694103674 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe0352758590 vpanic() at vpanic+0x132/frame 0xfe03527586c0 panic() at panic+0x43/frame 0xfe0352758720 vcmn_err() at vcmn_err+0xeb/frame 0xfe0352758850 zfs_panic_recover() at zfs_panic_recover+0x59/frame 0xfe03527588b0 dmu_buf_hold_array_by_dnode() at dmu_buf_hold_array_by_dnode+0x97/frame 0xfe0352758960 dmu_brt_clone() at dmu_brt_clone+0x61/frame 0xfe03527589f0 zfs_clone_range() at zfs_clone_range+0xa6a/frame 0xfe0352758bc0 zfs_freebsd_copy_file_range() at zfs_freebsd_copy_file_range+0x1ae/frame 0xfe0352758c40 vn_copy_file_range() at vn_copy_file_range+0x11e/frame 0xfe0352758ce0 kern_copy_file_range() at kern_copy_file_range+0x338/frame 0xfe0352758db0 sys_copy_file_range() at sys_copy_file_range+0x78/frame 0xfe0352758e00 amd64_syscall() at amd64_syscall+0x109/frame 0xfe0352758f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfe0352758f30 --- syscall (569, FreeBSD ELF64, copy_file_range), rip = 0x1ce4506d155a, rsp = 0x1ce44ec71e88, rbp = 0x1ce44ec72320 --- KDB: enter: panic __curthread () at /usr/main-src/sys/amd64/include/pcpu_aux.h:57 57 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu, (kgdb) #0 __curthread () at /usr/main-src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=textdump@entry=0) at /usr/main-src/sys/kern/kern_shutdown.c:405 #2 0x804a442a in db_dump (dummy=, dummy2=, dummy3=, dummy4=) at /usr/main-src/sys/ddb/db_command.c:591 #3 0x804a422d in db_command (last_cmdp=, cmd_table=, dopager=true) at /usr/main-src/sys/ddb/db_command.c:504 #4 0x804a3eed in db_command_loop () at 
/usr/main-src/sys/ddb/db_command.c:551 #5 0x804a7876 in db_trap (type=, code=) at /usr/main-src/sys/ddb/db_main.c:268 #6 0x80bb9e57 in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfe03527584d0) at /usr/main-src/sys/kern/subr_kdb.c:790 #7 0x8104ad3d in trap (frame=0xfe03527584d0) at /usr/main-src/sys/amd64/amd64/trap.c:608 #8 #9 kdb_enter (why=, msg=) at /usr/main-src/sys/kern/subr_kdb.c:556 #10 0x80b6aab3 in vpanic (fmt=0x82be52d6 "%s%s", ap=ap@entry=0xfe0352758700) at /usr/main-src/sys/kern/kern_shutdown.c:958 #11 0x80b6a943 in panic ( fmt=0x820aa2e8 "\312C$\201\377\377\377\377") at /usr/main-src/sys/kern/kern_shutdown.c:894 #12 0x82993c5b in vcmn_err (ce=, fmt=0x82bfdd1f "zfs: accessing past end of object %llx/%llx (size=%u access=%llu+%llu)", adx=0xfe0352758890) at /usr/main-src/sys/contrib/openzfs/module/os/freebsd/spl/spl_cmn_err.c:60 #13 0x82a84d69 in zfs_panic_recover ( fmt=0x12 ) at /usr/main-src/sys/contrib/openzfs/module/zfs/spa_misc.c:1594 #14 0x829f8e27 in dmu_buf_hold_array_by_dnode (dn=0xf813dfc48978, offset=offset@entry=2560, length=length@entry=2560, read=read@entry=0, tag=0x82bd8175, numbufsp=numbufsp@entry=0xfe03527589bc, dbpp=0xfe03527589c0, flags=0) at /usr/main-src/sys/contrib/openzfs/module/zfs/dmu.c:543 #15 0x829fc6a1 in dmu_buf_hold_array (os=, object=, read=0, numbufsp=0xfe03527589bc, dbpp=0xfe03527589c0, offset=, length=, tag=) at /usr/main-src/sys/contrib/openzfs/module/zfs/dmu.c:654 #16 dmu_brt_clone (os=os@entry=0xf8010ae0e000, object=, offset=offset@entry=2560, length=length@entry=2560, tx=tx@entry=0xf81aaeb6e100, bps=bps@entry=0xfe0595931000, nbps=1, replay=0) at /usr/main-src/sys/contrib/openzfs/module/zfs/dmu.c:2301 #17 0x82b4440a in zfs_clone_range (inzp=0xf8100054c910, inoffp=0xf81910c3c7c8, outzp=0xf80fb3233000, outoffp=0xf819860a2c78, lenp=lenp@entry=0xfe0352758c00, cr=0xf80e32335200) at /usr/main-src/sys/contrib/openzfs/module/zfs/zfs_vnops.c:1302 #18 0x829b4ece in zfs_freebsd_copy_file_range 
(ap=0xfe0352758c58) at /usr/main-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c:6294 #19 0x80c7160e in VOP_COPY_FILE_RANGE (invp=, inoffp=0x40, outvp=0xfe03527581d0, outoffp=0x811e6eb7, lenp=0x0, flags=0, incred=0xf80e32335200, outcred=0x0, fsizetd=0xfe03586c0720) at ./vnode_if.h:2381 #20 vn_copy_file_range (invp=invp@entry=0xf8095e1e8000, inoffp=0x40, inoffp@entry=0xf81910c3c7c8, outvp=0xfe03527581d0, outvp@entry=0xf805d6107380, outoffp=0x811e6eb7,