Re: nullfs and ZFS issues
Quoting Eirik Øverby (from Mon, 25 Apr 2022 18:44:19 +0200): On Mon, 2022-04-25 at 15:27 +0200, Alexander Leidinger wrote: Quoting Alexander Leidinger (from Sun, 24 Apr 2022 19:58:17 +0200):

> Quoting Alexander Leidinger (from Fri, 22 Apr 2022 09:04:39 +0200):

> > Quoting Doug Ambrisko (from Thu, 21 Apr 2022 09:38:35 -0700):

> > > I've attached mount.patch that when doing mount -v should show the vnode usage per filesystem. Note that the problem I was running into was after some operations arc_prune and arc_evict would consume 100% of 2 cores and make ZFS really slow. If you are not running into that issue then nocache etc. shouldn't be needed.

> > I don't run into this issue, but I have a huge perf difference when using nocache in the nightly periodic runs. 4h instead of 12-24h (22 jails on this system).

> > > On my laptop I set ARC to 1G since I don't use swap and in the past ARC would consume too much memory and things would die. When the nullfs holds a bunch of vnodes then ZFS couldn't release them.

> > > FYI, on my laptop with nocache and limited vnodes I haven't run into this problem. I haven't tried the patch to let ZFS free its and nullfs vnodes on my laptop. I have only tried it via

> > I have this patch and your mount patch installed now, without nocache and reduced arc reclaim settings (100, 1). I will check the runtime for the next 2 days.

> 9-10h runtime with the above settings (compared to 4h with nocache and 12-24h without any patch and without nocache). I changed the sysctls back to the defaults and will see in the next run (in 7h) what the result is with just the patches.

And again 9-10h runtime (I've seen a lot of the find processes in the periodic daily run of those 22 jails in the state "*vnode"). Seems nocache gives the best perf for me in this case.
Sorry for jumping in here - I've got a couple of questions:
- Will this also apply to nullfs read-only mounts? Or is it only in case of writing "through" a nullfs mount that these problems are seen?
- Is it a problem also in 13, or is this "new" in -CURRENT?

We're having weird and unexplained CPU spikes on several systems, even after tuning geli to not use gazillions of threads. So far our suspicion has been ZFS snapshot cleanups, but this is an interesting contender - unless the whole "read only" part makes it moot.

For me this started after creating one more jail on this system, and I don't see CPU spikes (the system is running permanently at 100%, and the distribution of the CPU time looks as I would expect). Doug's experience is a little different: he sees a high amount of CPU usage "for nothing", or even a deadlock-like situation. So I would say we see different things based on similar triggers.

The nocache option for nullfs affects the number of vnodes in use on the system no matter if the mount is ro or rw, so you can give it a try. Note that, depending on the usage pattern, the nocache option may increase lock contention, so its performance impact may be positive or negative.

Bye, Alexander.

-- 
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org : PGP 0x8F31830F9F2772BF
Re: nullfs and ZFS issues
On Mon, 2022-04-25 at 15:27 +0200, Alexander Leidinger wrote:

> Quoting Alexander Leidinger (from Sun, 24 Apr 2022 19:58:17 +0200):

> > Quoting Alexander Leidinger (from Fri, 22 Apr 2022 09:04:39 +0200):

> > > Quoting Doug Ambrisko (from Thu, 21 Apr 2022 09:38:35 -0700):

> > > > I've attached mount.patch that when doing mount -v should show the vnode usage per filesystem. Note that the problem I was running into was after some operations arc_prune and arc_evict would consume 100% of 2 cores and make ZFS really slow. If you are not running into that issue then nocache etc. shouldn't be needed.

> > > I don't run into this issue, but I have a huge perf difference when using nocache in the nightly periodic runs. 4h instead of 12-24h (22 jails on this system).

> > > > On my laptop I set ARC to 1G since I don't use swap and in the past ARC would consume too much memory and things would die. When the nullfs holds a bunch of vnodes then ZFS couldn't release them.

> > > > FYI, on my laptop with nocache and limited vnodes I haven't run into this problem. I haven't tried the patch to let ZFS free its and nullfs vnodes on my laptop. I have only tried it via

> > > I have this patch and your mount patch installed now, without nocache and reduced arc reclaim settings (100, 1). I will check the runtime for the next 2 days.

> > 9-10h runtime with the above settings (compared to 4h with nocache and 12-24h without any patch and without nocache). I changed the sysctls back to the defaults and will see in the next run (in 7h) what the result is with just the patches.

> And again 9-10h runtime (I've seen a lot of the find processes in the periodic daily run of those 22 jails in the state "*vnode"). Seems nocache gives the best perf for me in this case.
Sorry for jumping in here - I've got a couple of questions: - Will this also apply to nullfs read-only mounts? Or is it only in case of writing "through" a nullfs mount that these problems are seen? - Is it a problem also in 13, or is this "new" in -CURRENT? We're having weird and unexplained CPU spikes on several systems, even after tuning geli to not use gazillions of threads. So far our suspicion has been ZFS snapshot cleanups but this is an interesting contender - unless the whole "read only" part makes it moot. /Eirik
Re: nullfs and ZFS issues
Quoting Alexander Leidinger (from Sun, 24 Apr 2022 19:58:17 +0200):

Quoting Alexander Leidinger (from Fri, 22 Apr 2022 09:04:39 +0200):

Quoting Doug Ambrisko (from Thu, 21 Apr 2022 09:38:35 -0700):

I've attached mount.patch that when doing mount -v should show the vnode usage per filesystem. Note that the problem I was running into was after some operations arc_prune and arc_evict would consume 100% of 2 cores and make ZFS really slow. If you are not running into that issue then nocache etc. shouldn't be needed.

I don't run into this issue, but I have a huge perf difference when using nocache in the nightly periodic runs. 4h instead of 12-24h (22 jails on this system).

On my laptop I set ARC to 1G since I don't use swap and in the past ARC would consume too much memory and things would die. When the nullfs holds a bunch of vnodes then ZFS couldn't release them.

FYI, on my laptop with nocache and limited vnodes I haven't run into this problem. I haven't tried the patch to let ZFS free its and nullfs vnodes on my laptop. I have only tried it via

I have this patch and your mount patch installed now, without nocache and reduced arc reclaim settings (100, 1). I will check the runtime for the next 2 days.

9-10h runtime with the above settings (compared to 4h with nocache and 12-24h without any patch and without nocache). I changed the sysctls back to the defaults and will see in the next run (in 7h) what the result is with just the patches.

And again 9-10h runtime (I've seen a lot of the find processes in the periodic daily run of those 22 jails in the state "*vnode"). Seems nocache gives the best perf for me in this case.

Bye, Alexander.
Re: nullfs and ZFS issues
Quoting Alexander Leidinger (from Fri, 22 Apr 2022 09:04:39 +0200):

Quoting Doug Ambrisko (from Thu, 21 Apr 2022 09:38:35 -0700):

I've attached mount.patch that when doing mount -v should show the vnode usage per filesystem. Note that the problem I was running into was after some operations arc_prune and arc_evict would consume 100% of 2 cores and make ZFS really slow. If you are not running into that issue then nocache etc. shouldn't be needed.

I don't run into this issue, but I have a huge perf difference when using nocache in the nightly periodic runs. 4h instead of 12-24h (22 jails on this system).

On my laptop I set ARC to 1G since I don't use swap and in the past ARC would consume too much memory and things would die. When the nullfs holds a bunch of vnodes then ZFS couldn't release them.

FYI, on my laptop with nocache and limited vnodes I haven't run into this problem. I haven't tried the patch to let ZFS free its and nullfs vnodes on my laptop. I have only tried it via

I have this patch and your mount patch installed now, without nocache and reduced arc reclaim settings (100, 1). I will check the runtime for the next 2 days.

9-10h runtime with the above settings (compared to 4h with nocache and 12-24h without any patch and without nocache). I changed the sysctls back to the defaults and will see in the next run (in 7h) what the result is with just the patches.

Bye, Alexander.
Re: nullfs and ZFS issues
On Fri, Apr 22, 2022 at 09:04:39AM +0200, Alexander Leidinger wrote:
| Quoting Doug Ambrisko (from Thu, 21 Apr 2022 09:38:35 -0700):
|
| > On Thu, Apr 21, 2022 at 03:44:02PM +0200, Alexander Leidinger wrote:
| > | Quoting Mateusz Guzik (from Thu, 21 Apr 2022 14:50:42 +0200):
| > |
| > | > On 4/21/22, Alexander Leidinger wrote:
| > | >> I tried nocache on a system with a lot of jails which use nullfs,
| > | >> which showed very slow behavior in the daily periodic runs (12h runs
| > | >> in the night after boot, 24h or more in subsequent nights). Now the
| > | >> first nightly run after boot was finished after 4h.
| > | >>
| > | >> What is the benefit of not disabling the cache in nullfs? I would
| > | >> expect zfs (or ufs) to cache the (meta)data anyway.
| > | >>
| > | >
| > | > does the poor performance show up with
| > | > https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?
| > |
| > | I would like to have all the 22 jails run the periodic scripts a
| > | second night in a row before trying this.
| > |
| > | > if the long runs are still there, can you get some profiling from it?
| > | > sysctl -a before and after would be a start.
| > | >
| > | > My guess is that you are at the vnode limit and bumping into the 1 second sleep.
| > |
| > | That would explain the behavior I see since I added the last jail
| > | which seems to have crossed a threshold which triggers the slow
| > | behavior.
| > |
| > | Current status (with the 112 nullfs mounts with nocache):
| > | kern.maxvnodes: 10485760
| > | kern.numvnodes: 3791064
| > | kern.freevnodes: 3613694
| > | kern.cache.stats.heldvnodes: 151707
| > | kern.vnodes_created: 260288639
| > |
| > | The maxvnodes value is already increased by 10 times compared to the
| > | default value on this system.
| >
| > I've attached mount.patch that when doing mount -v should
| > show the vnode usage per filesystem. Note that the problem I was
| > running into was after some operations arc_prune and arc_evict would
| > consume 100% of 2 cores and make ZFS really slow. If you are not
| > running into that issue then nocache etc. shouldn't be needed.
|
| I don't run into this issue, but I have a huge perf difference when
| using nocache in the nightly periodic runs. 4h instead of 12-24h (22
| jails on this system).

I wouldn't do the nocache then! It would be good to see what Mateusz's patch does without nocache for your env.

| > On my laptop I set ARC to 1G since I don't use swap and in the past
| > ARC would consume too much memory and things would die. When the
| > nullfs holds a bunch of vnodes then ZFS couldn't release them.
| >
| > FYI, on my laptop with nocache and limited vnodes I haven't run
| > into this problem. I haven't tried the patch to let ZFS free
| > its and nullfs vnodes on my laptop. I have only tried it via
|
| I have this patch and your mount patch installed now, without nocache
| and reduced arc reclaim settings (100, 1). I will check the runtime
| for the next 2 days.
|
| Your mount patch to show the per mount vnodes count looks useful, not
| only for this particular case. Do you intend to commit it?

I should, since it doesn't change the size of the structure etc. I need to put it up for review.

Thanks,

Doug A.
Re: nullfs and ZFS issues
Quoting Doug Ambrisko (from Thu, 21 Apr 2022 09:38:35 -0700):

On Thu, Apr 21, 2022 at 03:44:02PM +0200, Alexander Leidinger wrote:
| Quoting Mateusz Guzik (from Thu, 21 Apr 2022 14:50:42 +0200):
|
| > On 4/21/22, Alexander Leidinger wrote:
| >> I tried nocache on a system with a lot of jails which use nullfs,
| >> which showed very slow behavior in the daily periodic runs (12h runs
| >> in the night after boot, 24h or more in subsequent nights). Now the
| >> first nightly run after boot was finished after 4h.
| >>
| >> What is the benefit of not disabling the cache in nullfs? I would
| >> expect zfs (or ufs) to cache the (meta)data anyway.
| >>
| >
| > does the poor performance show up with
| > https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?
|
| I would like to have all the 22 jails run the periodic scripts a
| second night in a row before trying this.
|
| > if the long runs are still there, can you get some profiling from it?
| > sysctl -a before and after would be a start.
| >
| > My guess is that you are at the vnode limit and bumping into the 1 second sleep.
|
| That would explain the behavior I see since I added the last jail
| which seems to have crossed a threshold which triggers the slow
| behavior.
|
| Current status (with the 112 nullfs mounts with nocache):
| kern.maxvnodes: 10485760
| kern.numvnodes: 3791064
| kern.freevnodes: 3613694
| kern.cache.stats.heldvnodes: 151707
| kern.vnodes_created: 260288639
|
| The maxvnodes value is already increased by 10 times compared to the
| default value on this system.

I've attached mount.patch that when doing mount -v should show the vnode usage per filesystem. Note that the problem I was running into was after some operations arc_prune and arc_evict would consume 100% of 2 cores and make ZFS really slow. If you are not running into that issue then nocache etc. shouldn't be needed.

I don't run into this issue, but I have a huge perf difference when using nocache in the nightly periodic runs. 4h instead of 12-24h (22 jails on this system).

On my laptop I set ARC to 1G since I don't use swap and in the past ARC would consume too much memory and things would die. When the nullfs holds a bunch of vnodes then ZFS couldn't release them.

FYI, on my laptop with nocache and limited vnodes I haven't run into this problem. I haven't tried the patch to let ZFS free its and nullfs vnodes on my laptop. I have only tried it via

I have this patch and your mount patch installed now, without nocache and reduced arc reclaim settings (100, 1). I will check the runtime for the next 2 days.

Your mount patch to show the per mount vnodes count looks useful, not only for this particular case. Do you intend to commit it?

Bye, Alexander.
Re: nullfs and ZFS issues
On Thu, Apr 21, 2022 at 03:44:02PM +0200, Alexander Leidinger wrote:
| Quoting Mateusz Guzik (from Thu, 21 Apr 2022 14:50:42 +0200):
|
| > On 4/21/22, Alexander Leidinger wrote:
| >> I tried nocache on a system with a lot of jails which use nullfs,
| >> which showed very slow behavior in the daily periodic runs (12h runs
| >> in the night after boot, 24h or more in subsequent nights). Now the
| >> first nightly run after boot was finished after 4h.
| >>
| >> What is the benefit of not disabling the cache in nullfs? I would
| >> expect zfs (or ufs) to cache the (meta)data anyway.
| >>
| >
| > does the poor performance show up with
| > https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?
|
| I would like to have all the 22 jails run the periodic scripts a
| second night in a row before trying this.
|
| > if the long runs are still there, can you get some profiling from it?
| > sysctl -a before and after would be a start.
| >
| > My guess is that you are at the vnode limit and bumping into the 1 second sleep.
|
| That would explain the behavior I see since I added the last jail
| which seems to have crossed a threshold which triggers the slow
| behavior.
|
| Current status (with the 112 nullfs mounts with nocache):
| kern.maxvnodes: 10485760
| kern.numvnodes: 3791064
| kern.freevnodes: 3613694
| kern.cache.stats.heldvnodes: 151707
| kern.vnodes_created: 260288639
|
| The maxvnodes value is already increased by 10 times compared to the
| default value on this system.

With the patch, you shouldn't mount with nocache! However, you might want to tune:

vfs.zfs.arc.meta_prune
vfs.zfs.arc.meta_adjust_restarts

On each restart the code increments the prune amount by vfs.zfs.arc.meta_prune and submits that accumulated amount to the vnode reclaim code, so it ends up reclaiming a lot of vnodes. With the defaults (1 * 4096, submitted each loop) most of the cache can end up being freed. With relatively small values for them, the cache didn't shrink as much.

Doug A.
Re: nullfs and ZFS issues
On Thu, Apr 21, 2022 at 03:44:02PM +0200, Alexander Leidinger wrote:
| Quoting Mateusz Guzik (from Thu, 21 Apr 2022 14:50:42 +0200):
|
| > On 4/21/22, Alexander Leidinger wrote:
| >> I tried nocache on a system with a lot of jails which use nullfs,
| >> which showed very slow behavior in the daily periodic runs (12h runs
| >> in the night after boot, 24h or more in subsequent nights). Now the
| >> first nightly run after boot was finished after 4h.
| >>
| >> What is the benefit of not disabling the cache in nullfs? I would
| >> expect zfs (or ufs) to cache the (meta)data anyway.
| >>
| >
| > does the poor performance show up with
| > https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?
|
| I would like to have all the 22 jails run the periodic scripts a
| second night in a row before trying this.
|
| > if the long runs are still there, can you get some profiling from it?
| > sysctl -a before and after would be a start.
| >
| > My guess is that you are at the vnode limit and bumping into the 1 second sleep.
|
| That would explain the behavior I see since I added the last jail
| which seems to have crossed a threshold which triggers the slow
| behavior.
|
| Current status (with the 112 nullfs mounts with nocache):
| kern.maxvnodes: 10485760
| kern.numvnodes: 3791064
| kern.freevnodes: 3613694
| kern.cache.stats.heldvnodes: 151707
| kern.vnodes_created: 260288639
|
| The maxvnodes value is already increased by 10 times compared to the
| default value on this system.

I've attached mount.patch that when doing mount -v should show the vnode usage per filesystem. Note that the problem I was running into was after some operations arc_prune and arc_evict would consume 100% of 2 cores and make ZFS really slow. If you are not running into that issue then nocache etc. shouldn't be needed.

On my laptop I set ARC to 1G since I don't use swap and in the past ARC would consume too much memory and things would die. When the nullfs holds a bunch of vnodes then ZFS couldn't release them.

FYI, on my laptop with nocache and limited vnodes I haven't run into this problem. I haven't tried the patch to let ZFS free its and nullfs vnodes on my laptop. I have only tried it via a bhyve test. I use bhyve and a md drive to avoid wearing out my SSD, and it's faster to test. I have found that git, tar, make world etc. could trigger the issue before, but haven't had any issues with nocache and capping vnodes.

Thanks,

Doug A.

diff --git a/sbin/mount/mount.c b/sbin/mount/mount.c
index 79d9d6cb0ca..00eefb3a5e0 100644
--- a/sbin/mount/mount.c
+++ b/sbin/mount/mount.c
@@ -692,6 +692,13 @@ prmount(struct statfs *sfp)
 		xo_emit("{D:, }{Lw:fsid}{:fsid}", fsidbuf);
 		free(fsidbuf);
 	}
+	if (sfp->f_nvnodelistsize != 0 || sfp->f_lazyvnodelistsize != 0) {
+		xo_open_container("vnodes");
+		xo_emit("{D:, }{Lwc:vnodes}{Lw:count}{w:count/%ju}{Lw:lazy}{:lazy/%ju}",
+		    (uintmax_t)sfp->f_nvnodelistsize,
+		    (uintmax_t)sfp->f_lazyvnodelistsize);
+		xo_close_container("vnodes");
+	}
 	}
 	xo_emit("{D:)}\n");
 }
diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
index a495ad86ac4..3648ef8d080 100644
--- a/sys/kern/vfs_mount.c
+++ b/sys/kern/vfs_mount.c
@@ -2625,6 +2626,8 @@ __vfs_statfs(struct mount *mp, struct statfs *sbp)
 	sbp->f_version = STATFS_VERSION;
 	sbp->f_namemax = NAME_MAX;
 	sbp->f_flags = mp->mnt_flag & MNT_VISFLAGMASK;
+	sbp->f_nvnodelistsize = mp->mnt_nvnodelistsize;
+	sbp->f_lazyvnodelistsize = mp->mnt_lazyvnodelistsize;
 
 	return (mp->mnt_op->vfs_statfs(mp, sbp));
 }
diff --git a/sys/sys/mount.h b/sys/sys/mount.h
index 3383bfe8f43..95dd3c76ae5 100644
--- a/sys/sys/mount.h
+++ b/sys/sys/mount.h
@@ -91,7 +91,9 @@ struct statfs {
 	uint64_t f_asyncwrites;		/* count of async writes since mount */
 	uint64_t f_syncreads;		/* count of sync reads since mount */
 	uint64_t f_asyncreads;		/* count of async reads since mount */
-	uint64_t f_spare[10];		/* unused spare */
+	uint32_t f_nvnodelistsize;	/* (i) # of vnodes */
+	uint32_t f_lazyvnodelistsize;	/* (l) # of lazy vnodes */
+	uint64_t f_spare[9];		/* unused spare */
 	uint32_t f_namemax;		/* maximum filename length */
 	uid_t	  f_owner;		/* user that mounted the filesystem */
 	fsid_t	  f_fsid;		/* filesystem id */
Re: nullfs and ZFS issues
Quoting Mateusz Guzik (from Thu, 21 Apr 2022 14:50:42 +0200):

On 4/21/22, Alexander Leidinger wrote:

I tried nocache on a system with a lot of jails which use nullfs, which showed very slow behavior in the daily periodic runs (12h runs in the night after boot, 24h or more in subsequent nights). Now the first nightly run after boot was finished after 4h.

What is the benefit of not disabling the cache in nullfs? I would expect zfs (or ufs) to cache the (meta)data anyway.

does the poor performance show up with https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?

I would like to have all the 22 jails run the periodic scripts a second night in a row before trying this.

if the long runs are still there, can you get some profiling from it? sysctl -a before and after would be a start.

My guess is that you are at the vnode limit and bumping into the 1 second sleep.

That would explain the behavior I see since I added the last jail which seems to have crossed a threshold which triggers the slow behavior.

Current status (with the 112 nullfs mounts with nocache):
kern.maxvnodes: 10485760
kern.numvnodes: 3791064
kern.freevnodes: 3613694
kern.cache.stats.heldvnodes: 151707
kern.vnodes_created: 260288639

The maxvnodes value is already increased by 10 times compared to the default value on this system.

Bye, Alexander.
Re: nullfs and ZFS issues
On 4/21/22, Alexander Leidinger wrote:
> Quoting Doug Ambrisko (from Wed, 20 Apr 2022 09:20:33 -0700):
>
>> On Wed, Apr 20, 2022 at 11:39:44AM +0200, Alexander Leidinger wrote:
>> | Quoting Doug Ambrisko (from Mon, 18 Apr 2022 16:32:38 -0700):
>> |
>> | > With nullfs, nocache and settings max vnodes to a low number I can
>> |
>> | Where is nocache documented? I don't see it in mount_nullfs(8),
>> | mount(8) or nullfs(5).
>>
>> I didn't find it but it is in:
>> src/sys/fs/nullfs/null_vfsops.c: if (vfs_getopt(mp->mnt_optnew, "nocache", NULL, NULL) == 0 ||
>>
>> Also some file systems disable it via MNTK_NULL_NOCACHE
>
> Does the attached diff look ok?
>
>> | I tried a nullfs mount with nocache and it doesn't show up in the
>> | output of "mount".
>>
>> Yep, I saw that as well. I could tell by dropping into ddb and then
>> do a show mount on the FS and look at the count. That is why I added
>> the vnode count to mount -v so I could see the usage without dropping
>> into ddb.
>
> I tried nocache on a system with a lot of jails which use nullfs,
> which showed very slow behavior in the daily periodic runs (12h runs
> in the night after boot, 24h or more in subsequent nights). Now the
> first nightly run after boot was finished after 4h.
>
> What is the benefit of not disabling the cache in nullfs? I would
> expect zfs (or ufs) to cache the (meta)data anyway.
>

does the poor performance show up with https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?

if the long runs are still there, can you get some profiling from it? sysctl -a before and after would be a start.

My guess is that you are at the vnode limit and bumping into the 1 second sleep.

-- 
Mateusz Guzik
Re: nullfs and ZFS issues
Quoting Doug Ambrisko (from Wed, 20 Apr 2022 09:20:33 -0700):

On Wed, Apr 20, 2022 at 11:39:44AM +0200, Alexander Leidinger wrote:
| Quoting Doug Ambrisko (from Mon, 18 Apr 2022 16:32:38 -0700):
|
| > With nullfs, nocache and settings max vnodes to a low number I can
|
| Where is nocache documented? I don't see it in mount_nullfs(8),
| mount(8) or nullfs(5).

I didn't find it but it is in:
src/sys/fs/nullfs/null_vfsops.c: if (vfs_getopt(mp->mnt_optnew, "nocache", NULL, NULL) == 0 ||

Also some file systems disable it via MNTK_NULL_NOCACHE

Does the attached diff look ok?

| I tried a nullfs mount with nocache and it doesn't show up in the
| output of "mount".

Yep, I saw that as well. I could tell by dropping into ddb and then doing a show mount on the FS and looking at the count. That is why I added the vnode count to mount -v so I could see the usage without dropping into ddb.

I tried nocache on a system with a lot of jails which use nullfs, which showed very slow behavior in the daily periodic runs (12h runs in the night after boot, 24h or more in subsequent nights). Now the first nightly run after boot was finished after 4h.

What is the benefit of not disabling the cache in nullfs? I would expect zfs (or ufs) to cache the (meta)data anyway.

Bye, Alexander.

diff --git a/sbin/mount/mount.8 b/sbin/mount/mount.8
index 2a877c04c07..823df63953d 100644
--- a/sbin/mount/mount.8
+++ b/sbin/mount/mount.8
@@ -28,7 +28,7 @@
 .\" @(#)mount.8	8.8 (Berkeley) 6/16/94
 .\" $FreeBSD$
 .\"
-.Dd March 17, 2022
+.Dd April 21, 2022
 .Dt MOUNT 8
 .Os
 .Sh NAME
@@ -245,6 +245,9 @@ This file system should be skipped when
 is run with the
 .Fl a
 flag.
+.It Cm nocache
+Disable caching.
+Some filesystems may not support this.
 .It Cm noclusterr
 Disable read clustering.
 .It Cm noclusterw
Re: nullfs and ZFS issues
On Wed, Apr 20, 2022 at 11:39:44AM +0200, Alexander Leidinger wrote:
| Quoting Doug Ambrisko (from Mon, 18 Apr 2022 16:32:38 -0700):
|
| > With nullfs, nocache and settings max vnodes to a low number I can
|
| Where is nocache documented? I don't see it in mount_nullfs(8),
| mount(8) or nullfs(5).

I didn't find it but it is in:
src/sys/fs/nullfs/null_vfsops.c: if (vfs_getopt(mp->mnt_optnew, "nocache", NULL, NULL) == 0 ||

Also some file systems disable it via MNTK_NULL_NOCACHE

| I tried a nullfs mount with nocache and it doesn't show up in the
| output of "mount".

Yep, I saw that as well. I could tell by dropping into ddb and then doing a show mount on the FS and looking at the count. That is why I added the vnode count to mount -v so I could see the usage without dropping into ddb.

Doug A.
Re: nullfs and ZFS issues
On Wed, Apr 20, 2022 at 11:43:10AM +0200, Mateusz Guzik wrote:
| On 4/19/22, Doug Ambrisko wrote:
| > On Tue, Apr 19, 2022 at 11:47:22AM +0200, Mateusz Guzik wrote:
| > | Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff
| > |
| > | this is not committable but should validate whether it works fine
| >
| > As a POC it's working. I see the vnode count for the nullfs and
| > ZFS go up. The ARC cache also goes up until it exceeds the ARC max.
| > size, then the vnodes for nullfs and ZFS go down. The ARC cache goes
| > down as well. This all repeats over and over. The system seems
| > healthy. No excessive running of arc_prune or arc_evict.
| >
| > My only comment is that the vnode freeing seems a bit aggressive.
| > Going from ~15,000 to ~200 vnodes for nullfs and the same for ZFS.
| > The ARC drops from 70M to 7M (max is set at 64M) for this unit
| > test.
|
| Can you check what kind of shrinking is requested by arc to begin
| with? I imagine encountering a nullfs vnode may end up recycling 2
| instead of 1, but even repeated a lot it does not explain the above.

I dug into it a bit more and think there could be a bug in module/zfs/arc.c:

arc_evict_meta_balanced(uint64_t meta_used)
	prune += zfs_arc_meta_prune;
	//arc_prune_async(prune);
	arc_prune_async(zfs_arc_meta_prune);

Since arc_prune_async queues up a run of arc_prune_task for each call, it is actually already accumulating the zfs_arc_meta_prune amount. It makes the count passed to vnlru_free_impl get really big quickly since it is looping via restart.
1 HELLO arc_prune_task 164 ticks 2147465958 count 2048

dmesg | grep arc_prune_task | uniq -c
  14 HELLO arc_prune_task 164 ticks -2147343772 count 100
  50 HELLO arc_prune_task 164 ticks -2147343771 count 100
  46 HELLO arc_prune_task 164 ticks -2147343770 count 100
  49 HELLO arc_prune_task 164 ticks -2147343769 count 100
  44 HELLO arc_prune_task 164 ticks -2147343768 count 100
 116 HELLO arc_prune_task 164 ticks -2147343767 count 100
1541 HELLO arc_prune_task 164 ticks -2147343766 count 100
  53 HELLO arc_prune_task 164 ticks -2147343101 count 100
 100 HELLO arc_prune_task 164 ticks -2147343100 count 100
  75 HELLO arc_prune_task 164 ticks -2147343099 count 100
  52 HELLO arc_prune_task 164 ticks -2147343098 count 100
  50 HELLO arc_prune_task 164 ticks -2147343097 count 100
  51 HELLO arc_prune_task 164 ticks -2147343096 count 100
 783 HELLO arc_prune_task 164 ticks -2147343095 count 100
 884 HELLO arc_prune_task 164 ticks -2147343094 count 100

Note I shrunk vfs.zfs.arc.meta_prune=100 to see how that might help. Changing it to 1 helps more! I see less aggressive swings.

I added
printf("HELLO %s %d ticks %d count %ld\n", __FUNCTION__, __LINE__, ticks, nr_scan);
to arc_prune_task.

Adjusting both
sysctl vfs.zfs.arc.meta_adjust_restarts=1
sysctl vfs.zfs.arc.meta_prune=100
without changing arc_prune_async(prune) helps avoid excessive swings.

Thanks,

Doug A.

| > | On 4/19/22, Mateusz Guzik wrote:
| > | > On 4/19/22, Mateusz Guzik wrote:
| > | >> On 4/19/22, Doug Ambrisko wrote:
| > | >>> I've switched my laptop to use nullfs and ZFS. Previously, I used
| > | >>> localhost NFS mounts instead of nullfs when nullfs would complain
| > | >>> that it couldn't mount. Since that check has been removed, I've
| > | >>> switched to nullfs only. However, every so often my laptop would
| > | >>> get slow and the ARC evict and prune threads would consume two
| > | >>> cores 100% until I rebooted. I had a 1G max. ARC and have increased
| > | >>> it to 2G now. Looking into this has uncovered some issues:
| > | >>> - nullfs would prevent vnlru_free_vfsops from doing anything
| > | >>>   when called from ZFS arc_prune_task
| > | >>> - nullfs would hang onto a bunch of vnodes unless mounted with
| > | >>>   nocache
| > | >>> - nullfs and nocache would break untar. This has been fixed now.
| > | >>>
| > | >>> With nullfs, nocache and setting max vnodes to a low number I can
| > | >>> keep the ARC around the max. without evict and prune consuming
| > | >>> 100% of 2 cores. This doesn't seem like the best solution but it's
| > | >>> better than when the ARC starts spinning.
| > | >>>
| > | >>> Looking into this issue with bhyve and a md drive for testing I create
| > | >>> a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
| > | >>> I loop through untaring the Linux kernel into the nullfs mount, rm -rf it
| > | >>> and repeat. I set the ARC to the smallest value I can. Untarring the
| > | >>> Linux kernel was enough to get the ARC evict and prune to spin since
| > | >>> they couldn't evict/prune anything.
| > | >>>
| > | >>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it
| > | >>> static
Re: nullfs and ZFS issues
On 4/19/22, Doug Ambrisko wrote:
> On Tue, Apr 19, 2022 at 11:47:22AM +0200, Mateusz Guzik wrote:
> | Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff
> |
> | this is not committable but should validate whether it works fine
>
> As a POC it's working. I see the vnode count for the nullfs and
> ZFS go up. The ARC cache also goes up until it exceeds the ARC max.
> size, then the vnodes for nullfs and ZFS go down. The ARC cache goes
> down as well. This all repeats over and over. The system seems
> healthy. No excessive running of arc_prune or arc_evict.
>
> My only comment is that the vnode freeing seems a bit aggressive.
> Going from ~15,000 to ~200 vnodes for nullfs and the same for ZFS.
> The ARC drops from 70M to 7M (max is set at 64M) for this unit
> test.

Can you check what kind of shrinking is requested by arc to begin
with? I imagine encountering a nullfs vnode may end up recycling 2
instead of 1, but even repeated a lot it does not explain the above.

> | On 4/19/22, Mateusz Guzik wrote:
> | > On 4/19/22, Mateusz Guzik wrote:
> | >> On 4/19/22, Doug Ambrisko wrote:
> | >>> I've switched my laptop to use nullfs and ZFS. Previously, I used
> | >>> localhost NFS mounts instead of nullfs when nullfs would complain
> | >>> that it couldn't mount. Since that check has been removed, I've
> | >>> switched to nullfs only. However, every so often my laptop would
> | >>> get slow and the ARC evict and prune threads would consume two
> | >>> cores at 100% until I rebooted. I had a 1G max. ARC and have increased
> | >>> it to 2G now. Looking into this has uncovered some issues:
> | >>> - nullfs would prevent vnlru_free_vfsops from doing anything
> | >>>   when called from ZFS arc_prune_task
> | >>> - nullfs would hang onto a bunch of vnodes unless mounted with
> | >>>   nocache
> | >>> - nullfs and nocache would break untar. This has been fixed now.
> | >>> > | >>> With nullfs, nocache and settings max vnodes to a low number I can > | >>> keep the ARC around the max. without evict and prune consuming > | >>> 100% of 2 cores. This doesn't seem like the best solution but it > | >>> better then when the ARC starts spinning. > | >>> > | >>> Looking into this issue with bhyve and a md drive for testing I > create > | >>> a brand new zpool mounted as /test and then nullfs mount /test to > /mnt. > | >>> I loop through untaring the Linux kernel into the nullfs mount, rm > -rf > | >>> it > | >>> and repeat. I set the ARC to the smallest value I can. Untarring > the > | >>> Linux kernel was enough to get the ARC evict and prune to spin since > | >>> they couldn't evict/prune anything. > | >>> > | >>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it > | >>> static int > | >>> vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode > *mvp) > | >>> { > | >>> ... > | >>> > | >>> for (;;) { > | >>> ... > | >>> vp = TAILQ_NEXT(vp, v_vnodelist); > | >>> ... > | >>> > | >>> /* > | >>> * Don't recycle if our vnode is from different type > | >>> * of mount point. Note that mp is type-safe, the > | >>> * check does not reach unmapped address even if > | >>> * vnode is reclaimed. > | >>> */ > | >>> if (mnt_op != NULL && (mp = vp->v_mount) != NULL && > | >>> mp->mnt_op != mnt_op) { > | >>> continue; > | >>> } > | >>> ... > | >>> > | >>> The vp ends up being the nulfs mount and then hits the continue > | >>> even though the passed in mvp is on ZFS. If I do a hack to > | >>> comment out the continue then I see the ARC, nullfs vnodes and > | >>> ZFS vnodes grow. When the ARC calls arc_prune_task that calls > | >>> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS. > | >>> The ARC cache usage also goes down. Then they increase again until > | >>> the ARC gets full and then they go down again. 
So with this hack > | >>> I don't need nocache passed to nullfs and I don't need to limit > | >>> the max vnodes. Doing multiple untars in parallel over and over > | >>> doesn't seem to cause any issues for this test. I'm not saying > | >>> commenting out continue is the fix but a simple POC test. > | >>> > | >> > | >> I don't see an easy way to say "this is a nullfs vnode holding onto a > | >> zfs vnode". Perhaps the routine can be extrended with issuing a nullfs > | >> callback, if the module is loaded. > | >> > | >> In the meantime I think a good enough(tm) fix would be to check that > | >> nothing was freed and fallback to good old regular clean up without > | >> filtering by vfsops. This would be very similar to what you are doing > | >> with your hack. > | >> > | > > | > Now that I wrote this perhaps an acceptable hack would be to extend > | > struct mount with a pointer to "lower layer" mount
Re: nullfs and ZFS issues
Quoting Doug Ambrisko (from Mon, 18 Apr 2022 16:32:38 -0700):

> With nullfs, nocache and setting max vnodes to a low number I can

Where is nocache documented? I don't see it in mount_nullfs(8),
mount(8) or nullfs(5). I tried a nullfs mount with nocache and it
doesn't show up in the output of "mount".

Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org: PGP 0x8F31830F9F2772BF
Re: nullfs and ZFS issues
On Tue, Apr 19, 2022 at 11:47:22AM +0200, Mateusz Guzik wrote:
| Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff
|
| this is not committable but should validate whether it works fine

As a POC it's working. I see the vnode count for the nullfs and
ZFS go up. The ARC cache also goes up until it exceeds the ARC max.
size, then the vnodes for nullfs and ZFS go down. The ARC cache goes
down as well. This all repeats over and over. The system seems
healthy. No excessive running of arc_prune or arc_evict.

My only comment is that the vnode freeing seems a bit aggressive.
Going from ~15,000 to ~200 vnodes for nullfs and the same for ZFS.
The ARC drops from 70M to 7M (max is set at 64M) for this unit
test.

Thanks,

Doug A.

| On 4/19/22, Mateusz Guzik wrote:
| > On 4/19/22, Mateusz Guzik wrote:
| >> On 4/19/22, Doug Ambrisko wrote:
| >>> I've switched my laptop to use nullfs and ZFS. Previously, I used
| >>> localhost NFS mounts instead of nullfs when nullfs would complain
| >>> that it couldn't mount. Since that check has been removed, I've
| >>> switched to nullfs only. However, every so often my laptop would
| >>> get slow and the ARC evict and prune threads would consume two
| >>> cores at 100% until I rebooted. I had a 1G max. ARC and have increased
| >>> it to 2G now. Looking into this has uncovered some issues:
| >>> - nullfs would prevent vnlru_free_vfsops from doing anything
| >>>   when called from ZFS arc_prune_task
| >>> - nullfs would hang onto a bunch of vnodes unless mounted with
| >>>   nocache
| >>> - nullfs and nocache would break untar. This has been fixed now.
| >>>
| >>> With nullfs, nocache and setting max vnodes to a low number I can
| >>> keep the ARC around the max. without evict and prune consuming
| >>> 100% of 2 cores. This doesn't seem like the best solution but it's
| >>> better than when the ARC starts spinning.
| >>> | >>> Looking into this issue with bhyve and a md drive for testing I create | >>> a brand new zpool mounted as /test and then nullfs mount /test to /mnt. | >>> I loop through untaring the Linux kernel into the nullfs mount, rm -rf | >>> it | >>> and repeat. I set the ARC to the smallest value I can. Untarring the | >>> Linux kernel was enough to get the ARC evict and prune to spin since | >>> they couldn't evict/prune anything. | >>> | >>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it | >>> static int | >>> vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp) | >>> { | >>> ... | >>> | >>> for (;;) { | >>> ... | >>> vp = TAILQ_NEXT(vp, v_vnodelist); | >>> ... | >>> | >>> /* | >>> * Don't recycle if our vnode is from different type | >>> * of mount point. Note that mp is type-safe, the | >>> * check does not reach unmapped address even if | >>> * vnode is reclaimed. | >>> */ | >>> if (mnt_op != NULL && (mp = vp->v_mount) != NULL && | >>> mp->mnt_op != mnt_op) { | >>> continue; | >>> } | >>> ... | >>> | >>> The vp ends up being the nulfs mount and then hits the continue | >>> even though the passed in mvp is on ZFS. If I do a hack to | >>> comment out the continue then I see the ARC, nullfs vnodes and | >>> ZFS vnodes grow. When the ARC calls arc_prune_task that calls | >>> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS. | >>> The ARC cache usage also goes down. Then they increase again until | >>> the ARC gets full and then they go down again. So with this hack | >>> I don't need nocache passed to nullfs and I don't need to limit | >>> the max vnodes. Doing multiple untars in parallel over and over | >>> doesn't seem to cause any issues for this test. I'm not saying | >>> commenting out continue is the fix but a simple POC test. | >>> | >> | >> I don't see an easy way to say "this is a nullfs vnode holding onto a | >> zfs vnode". 
Perhaps the routine can be extrended with issuing a nullfs | >> callback, if the module is loaded. | >> | >> In the meantime I think a good enough(tm) fix would be to check that | >> nothing was freed and fallback to good old regular clean up without | >> filtering by vfsops. This would be very similar to what you are doing | >> with your hack. | >> | > | > Now that I wrote this perhaps an acceptable hack would be to extend | > struct mount with a pointer to "lower layer" mount (if any) and patch | > the vfsops check to also look there. | > | >> | >>> It appears that when ZFS is asking for cached vnodes to be | >>> free'd nullfs also needs to free some up as well so that | >>> they are free'd on the VFS level. It seems that vnlru_free_impl | >>> should allow some of the related nullfs vnodes to be free'd so | >>> the ZFS ones can be free'd and reduce the size of the ARC. | >>> | >>> BTW, I also hacked the kernel
Re: nullfs and ZFS issues
Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff this is not committable but should validate whether it works fine On 4/19/22, Mateusz Guzik wrote: > On 4/19/22, Mateusz Guzik wrote: >> On 4/19/22, Doug Ambrisko wrote: >>> I've switched my laptop to use nullfs and ZFS. Previously, I used >>> localhost NFS mounts instead of nullfs when nullfs would complain >>> that it couldn't mount. Since that check has been removed, I've >>> switched to nullfs only. However, every so often my laptop would >>> get slow and the the ARC evict and prune thread would consume two >>> cores 100% until I rebooted. I had a 1G max. ARC and have increased >>> it to 2G now. Looking into this has uncovered some issues: >>> - nullfs would prevent vnlru_free_vfsops from doing anything >>> when called from ZFS arc_prune_task >>> - nullfs would hang onto a bunch of vnodes unless mounted with >>> nocache >>> - nullfs and nocache would break untar. This has been fixed now. >>> >>> With nullfs, nocache and settings max vnodes to a low number I can >>> keep the ARC around the max. without evict and prune consuming >>> 100% of 2 cores. This doesn't seem like the best solution but it >>> better then when the ARC starts spinning. >>> >>> Looking into this issue with bhyve and a md drive for testing I create >>> a brand new zpool mounted as /test and then nullfs mount /test to /mnt. >>> I loop through untaring the Linux kernel into the nullfs mount, rm -rf >>> it >>> and repeat. I set the ARC to the smallest value I can. Untarring the >>> Linux kernel was enough to get the ARC evict and prune to spin since >>> they couldn't evict/prune anything. >>> >>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it >>> static int >>> vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp) >>> { >>> ... >>> >>> for (;;) { >>> ... >>> vp = TAILQ_NEXT(vp, v_vnodelist); >>> ... >>> >>> /* >>> * Don't recycle if our vnode is from different type >>> * of mount point. 
Note that mp is type-safe, the >>> * check does not reach unmapped address even if >>> * vnode is reclaimed. >>> */ >>> if (mnt_op != NULL && (mp = vp->v_mount) != NULL && >>> mp->mnt_op != mnt_op) { >>> continue; >>> } >>> ... >>> >>> The vp ends up being the nulfs mount and then hits the continue >>> even though the passed in mvp is on ZFS. If I do a hack to >>> comment out the continue then I see the ARC, nullfs vnodes and >>> ZFS vnodes grow. When the ARC calls arc_prune_task that calls >>> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS. >>> The ARC cache usage also goes down. Then they increase again until >>> the ARC gets full and then they go down again. So with this hack >>> I don't need nocache passed to nullfs and I don't need to limit >>> the max vnodes. Doing multiple untars in parallel over and over >>> doesn't seem to cause any issues for this test. I'm not saying >>> commenting out continue is the fix but a simple POC test. >>> >> >> I don't see an easy way to say "this is a nullfs vnode holding onto a >> zfs vnode". Perhaps the routine can be extrended with issuing a nullfs >> callback, if the module is loaded. >> >> In the meantime I think a good enough(tm) fix would be to check that >> nothing was freed and fallback to good old regular clean up without >> filtering by vfsops. This would be very similar to what you are doing >> with your hack. >> > > Now that I wrote this perhaps an acceptable hack would be to extend > struct mount with a pointer to "lower layer" mount (if any) and patch > the vfsops check to also look there. > >> >>> It appears that when ZFS is asking for cached vnodes to be >>> free'd nullfs also needs to free some up as well so that >>> they are free'd on the VFS level. It seems that vnlru_free_impl >>> should allow some of the related nullfs vnodes to be free'd so >>> the ZFS ones can be free'd and reduce the size of the ARC. 
>>> >>> BTW, I also hacked the kernel and mount to show the vnodes used >>> per mount ie. mount -v: >>> test on /test (zfs, NFS exported, local, nfsv4acls, fsid >>> 2b23b2a1de21ed66, >>> vnodes: count 13846 lazy 0) >>> /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid >>> 11ff00292900, vnodes: count 13846 lazy 0) >>> >>> Now I can easily see how the vnodes are used without going into ddb. >>> On my laptop I have various vnet jails and nullfs mount my homedir into >>> them so pretty much everything goes through nullfs to ZFS. I'm limping >>> along with the nullfs nocache and small number of vnodes but it would be >>> nice to not need that. >>> >>> Thanks, >>> >>> Doug A. >>> >>> >> >> >> -- >> Mateusz Guzik >> > > > -- > Mateusz Guzik > -- Mateusz Guzik
Re: nullfs and ZFS issues
On 4/19/22, Mateusz Guzik wrote: > On 4/19/22, Doug Ambrisko wrote: >> I've switched my laptop to use nullfs and ZFS. Previously, I used >> localhost NFS mounts instead of nullfs when nullfs would complain >> that it couldn't mount. Since that check has been removed, I've >> switched to nullfs only. However, every so often my laptop would >> get slow and the the ARC evict and prune thread would consume two >> cores 100% until I rebooted. I had a 1G max. ARC and have increased >> it to 2G now. Looking into this has uncovered some issues: >> - nullfs would prevent vnlru_free_vfsops from doing anything >> when called from ZFS arc_prune_task >> - nullfs would hang onto a bunch of vnodes unless mounted with >> nocache >> - nullfs and nocache would break untar. This has been fixed now. >> >> With nullfs, nocache and settings max vnodes to a low number I can >> keep the ARC around the max. without evict and prune consuming >> 100% of 2 cores. This doesn't seem like the best solution but it >> better then when the ARC starts spinning. >> >> Looking into this issue with bhyve and a md drive for testing I create >> a brand new zpool mounted as /test and then nullfs mount /test to /mnt. >> I loop through untaring the Linux kernel into the nullfs mount, rm -rf it >> and repeat. I set the ARC to the smallest value I can. Untarring the >> Linux kernel was enough to get the ARC evict and prune to spin since >> they couldn't evict/prune anything. >> >> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it >> static int >> vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp) >> { >> ... >> >> for (;;) { >> ... >> vp = TAILQ_NEXT(vp, v_vnodelist); >> ... >> >> /* >> * Don't recycle if our vnode is from different type >> * of mount point. Note that mp is type-safe, the >> * check does not reach unmapped address even if >> * vnode is reclaimed. 
>> */ >> if (mnt_op != NULL && (mp = vp->v_mount) != NULL && >> mp->mnt_op != mnt_op) { >> continue; >> } >> ... >> >> The vp ends up being the nulfs mount and then hits the continue >> even though the passed in mvp is on ZFS. If I do a hack to >> comment out the continue then I see the ARC, nullfs vnodes and >> ZFS vnodes grow. When the ARC calls arc_prune_task that calls >> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS. >> The ARC cache usage also goes down. Then they increase again until >> the ARC gets full and then they go down again. So with this hack >> I don't need nocache passed to nullfs and I don't need to limit >> the max vnodes. Doing multiple untars in parallel over and over >> doesn't seem to cause any issues for this test. I'm not saying >> commenting out continue is the fix but a simple POC test. >> > > I don't see an easy way to say "this is a nullfs vnode holding onto a > zfs vnode". Perhaps the routine can be extrended with issuing a nullfs > callback, if the module is loaded. > > In the meantime I think a good enough(tm) fix would be to check that > nothing was freed and fallback to good old regular clean up without > filtering by vfsops. This would be very similar to what you are doing > with your hack. > Now that I wrote this perhaps an acceptable hack would be to extend struct mount with a pointer to "lower layer" mount (if any) and patch the vfsops check to also look there. > >> It appears that when ZFS is asking for cached vnodes to be >> free'd nullfs also needs to free some up as well so that >> they are free'd on the VFS level. It seems that vnlru_free_impl >> should allow some of the related nullfs vnodes to be free'd so >> the ZFS ones can be free'd and reduce the size of the ARC. >> >> BTW, I also hacked the kernel and mount to show the vnodes used >> per mount ie. 
mount -v: >> test on /test (zfs, NFS exported, local, nfsv4acls, fsid >> 2b23b2a1de21ed66, >> vnodes: count 13846 lazy 0) >> /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid >> 11ff00292900, vnodes: count 13846 lazy 0) >> >> Now I can easily see how the vnodes are used without going into ddb. >> On my laptop I have various vnet jails and nullfs mount my homedir into >> them so pretty much everything goes through nullfs to ZFS. I'm limping >> along with the nullfs nocache and small number of vnodes but it would be >> nice to not need that. >> >> Thanks, >> >> Doug A. >> >> > > > -- > Mateusz Guzik > -- Mateusz Guzik
Re: nullfs and ZFS issues
On 4/19/22, Doug Ambrisko wrote: > I've switched my laptop to use nullfs and ZFS. Previously, I used > localhost NFS mounts instead of nullfs when nullfs would complain > that it couldn't mount. Since that check has been removed, I've > switched to nullfs only. However, every so often my laptop would > get slow and the the ARC evict and prune thread would consume two > cores 100% until I rebooted. I had a 1G max. ARC and have increased > it to 2G now. Looking into this has uncovered some issues: > -nullfs would prevent vnlru_free_vfsops from doing anything > when called from ZFS arc_prune_task > -nullfs would hang onto a bunch of vnodes unless mounted with > nocache > -nullfs and nocache would break untar. This has been fixed now. > > With nullfs, nocache and settings max vnodes to a low number I can > keep the ARC around the max. without evict and prune consuming > 100% of 2 cores. This doesn't seem like the best solution but it > better then when the ARC starts spinning. > > Looking into this issue with bhyve and a md drive for testing I create > a brand new zpool mounted as /test and then nullfs mount /test to /mnt. > I loop through untaring the Linux kernel into the nullfs mount, rm -rf it > and repeat. I set the ARC to the smallest value I can. Untarring the > Linux kernel was enough to get the ARC evict and prune to spin since > they couldn't evict/prune anything. > > Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it > static int > vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp) > { > ... > > for (;;) { > ... > vp = TAILQ_NEXT(vp, v_vnodelist); > ... > > /* > * Don't recycle if our vnode is from different type > * of mount point. Note that mp is type-safe, the > * check does not reach unmapped address even if > * vnode is reclaimed. > */ > if (mnt_op != NULL && (mp = vp->v_mount) != NULL && > mp->mnt_op != mnt_op) { > continue; > } > ... 
> The vp ends up being the nullfs mount and then hits the continue
> even though the passed in mvp is on ZFS. If I do a hack to
> comment out the continue then I see the ARC, nullfs vnodes and
> ZFS vnodes grow. When the ARC calls arc_prune_task that calls
> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS.
> The ARC cache usage also goes down. Then they increase again until
> the ARC gets full and then they go down again. So with this hack
> I don't need nocache passed to nullfs and I don't need to limit
> the max vnodes. Doing multiple untars in parallel over and over
> doesn't seem to cause any issues for this test. I'm not saying
> commenting out continue is the fix but a simple POC test.

I don't see an easy way to say "this is a nullfs vnode holding onto a
zfs vnode". Perhaps the routine can be extended with issuing a nullfs
callback, if the module is loaded.

In the meantime I think a good enough(tm) fix would be to check that
nothing was freed and fall back to good old regular clean up without
filtering by vfsops. This would be very similar to what you are doing
with your hack.

Now that I wrote this perhaps an acceptable hack would be to extend
struct mount with a pointer to a "lower layer" mount (if any) and patch
the vfsops check to also look there.

> It appears that when ZFS is asking for cached vnodes to be
> free'd nullfs also needs to free some up as well so that
> they are free'd on the VFS level. It seems that vnlru_free_impl
> should allow some of the related nullfs vnodes to be free'd so
> the ZFS ones can be free'd and reduce the size of the ARC.
>
> BTW, I also hacked the kernel and mount to show the vnodes used
> per mount, i.e. mount -v:
> test on /test (zfs, NFS exported, local, nfsv4acls, fsid 2b23b2a1de21ed66,
>     vnodes: count 13846 lazy 0)
> /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid 11ff00292900,
>     vnodes: count 13846 lazy 0)
>
> Now I can easily see how the vnodes are used without going into ddb.
> On my laptop I have various vnet jails and nullfs mount my homedir into
> them so pretty much everything goes through nullfs to ZFS.
I'm limping > along with the nullfs nocache and small number of vnodes but it would be > nice to not need that. > > Thanks, > > Doug A. > > -- Mateusz Guzik
nullfs and ZFS issues
I've switched my laptop to use nullfs and ZFS. Previously, I used
localhost NFS mounts instead of nullfs when nullfs would complain
that it couldn't mount. Since that check has been removed, I've
switched to nullfs only. However, every so often my laptop would
get slow and the ARC evict and prune threads would consume two
cores at 100% until I rebooted. I had a 1G max. ARC and have increased
it to 2G now. Looking into this has uncovered some issues:
- nullfs would prevent vnlru_free_vfsops from doing anything
  when called from ZFS arc_prune_task
- nullfs would hang onto a bunch of vnodes unless mounted with
  nocache
- nullfs and nocache would break untar. This has been fixed now.

With nullfs, nocache and setting max vnodes to a low number I can
keep the ARC around the max. without evict and prune consuming
100% of 2 cores. This doesn't seem like the best solution but it's
better than when the ARC starts spinning.

Looking into this issue with bhyve and a md drive for testing I create
a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
I loop through untarring the Linux kernel into the nullfs mount, rm -rf it
and repeat. I set the ARC to the smallest value I can. Untarring the
Linux kernel was enough to get the ARC evict and prune to spin since
they couldn't evict/prune anything.

Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it:

static int
vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
{
	...
	for (;;) {
		...
		vp = TAILQ_NEXT(vp, v_vnodelist);
		...

		/*
		 * Don't recycle if our vnode is from different type
		 * of mount point.  Note that mp is type-safe, the
		 * check does not reach unmapped address even if
		 * vnode is reclaimed.
		 */
		if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
		    mp->mnt_op != mnt_op) {
			continue;
		}
		...

The vp ends up being the nullfs mount and then hits the continue
even though the passed in mvp is on ZFS. If I do a hack to
comment out the continue then I see the ARC, nullfs vnodes and
ZFS vnodes grow.
When the ARC calls arc_prune_task, that calls
vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS.
The ARC cache usage also goes down. Then they increase again until
the ARC gets full and then they go down again. So with this hack
I don't need nocache passed to nullfs and I don't need to limit
the max vnodes. Doing multiple untars in parallel over and over
doesn't seem to cause any issues for this test. I'm not saying
commenting out continue is the fix but a simple POC test.

It appears that when ZFS is asking for cached vnodes to be
free'd nullfs also needs to free some up as well so that
they are free'd on the VFS level. It seems that vnlru_free_impl
should allow some of the related nullfs vnodes to be free'd so
the ZFS ones can be free'd and reduce the size of the ARC.

BTW, I also hacked the kernel and mount to show the vnodes used
per mount, i.e. mount -v:

test on /test (zfs, NFS exported, local, nfsv4acls, fsid 2b23b2a1de21ed66,
    vnodes: count 13846 lazy 0)
/test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid 11ff00292900,
    vnodes: count 13846 lazy 0)

Now I can easily see how the vnodes are used without going into ddb.
On my laptop I have various vnet jails and nullfs mount my homedir into
them so pretty much everything goes through nullfs to ZFS. I'm limping
along with the nullfs nocache and small number of vnodes but it would be
nice to not need that.

Thanks,

Doug A.