Re: nullfs and ZFS issues

2022-04-26 Thread Alexander Leidinger
Quoting Eirik Øverby  (from Mon, 25 Apr 2022  
18:44:19 +0200):



On Mon, 2022-04-25 at 15:27 +0200, Alexander Leidinger wrote:

Quoting Alexander Leidinger  (from Sun, 24
Apr 2022 19:58:17 +0200):

> Quoting Alexander Leidinger  (from Fri, 22
> Apr 2022 09:04:39 +0200):
>
> > Quoting Doug Ambrisko  (from Thu, 21 Apr
> > 2022 09:38:35 -0700):
>
> > > I've attached mount.patch that when doing mount -v should
> > > show the vnode usage per filesystem.  Note that the problem I was
> > > running into was after some operations arc_prune and arc_evict would
> > > consume 100% of 2 cores and make ZFS really slow.  If you are not
> > > running into that issue then nocache etc. shouldn't be needed.
> >
> > I don't run into this issue, but I have a huge perf difference when
> > using nocache in the nightly periodic runs. 4h instead of 12-24h
> > (22 jails on this system).
> >
> > > On my laptop I set ARC to 1G since I don't use swap and in the past
> > > ARC would consume too much memory and things would die.  When the
> > > nullfs holds a bunch of vnodes then ZFS couldn't release them.
> > >
> > > FYI, on my laptop with nocache and limited vnodes I haven't run
> > > into this problem.  I haven't tried the patch to let ZFS free
> > > its and nullfs vnodes on my laptop.  I have only tried it via
> >
> > I have this patch and your mount patch installed now, without
> > nocache and reduced arc reclaim settings (100, 1). I will check the
> > runtime for the next 2 days.
>
> 9-10h runtime with the above settings (compared to 4h with nocache
> and 12-24h without any patch and without nocache).
> I changed the sysctls back to the defaults and will see in the next
> run (in 7h) what the result is with just the patches.

And again 9-10h runtime (I've seen a lot of the find processes in the
periodic daily run of those 22 jails in the state "*vnode"). Seems
nocache gives the best perf for me in this case.


Sorry for jumping in here - I've got a couple of questions:
- Will this also apply to nullfs read-only mounts? Or is it only in
case of writing "through" a nullfs mount that these problems are seen?
- Is it a problem also in 13, or is this "new" in -CURRENT?

We're having weird and unexplained CPU spikes on several systems, even
after tuning geli to not use gazillions of threads. So far our
suspicion has been ZFS snapshot cleanups but this is an interesting
contender - unless the whole "read only" part makes it moot.


For me this started after creating one more jail on this system, and I  
don't see CPU spikes (the system is running permanently at 100% and the  
distribution of the CPU looks as I would expect). Doug's experience is  
a little different, as he sees a high amount of CPU usage "for nothing"  
or even a deadlock-like situation. So I would say we see different  
things based on similar triggers.


The nocache option for nullfs affects the number of vnodes in use on  
the system regardless of whether the mount is ro or rw, so you can give  
it a try. Note that, depending on the usage pattern, nocache may  
increase lock contention, so the performance impact can be positive or  
negative.
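
As a rough sketch (paths invented for the example; nocache applies to
ro and rw nullfs mounts alike), something like this lets you compare
vnode usage with and without the option:

  mount -t nullfs -o ro,nocache /usr/local/jails/base /usr/local/jails/j1/base
  sysctl kern.numvnodes kern.freevnodes kern.cache.stats.heldvnodes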


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org : PGP 0x8F31830F9F2772BF




Re: nullfs and ZFS issues

2022-04-25 Thread Eirik Øverby
On Mon, 2022-04-25 at 15:27 +0200, Alexander Leidinger wrote:
> Quoting Alexander Leidinger  (from Sun, 24  
> Apr 2022 19:58:17 +0200):
> 
> > Quoting Alexander Leidinger  (from Fri, 22  
> > Apr 2022 09:04:39 +0200):
> > 
> > > Quoting Doug Ambrisko  (from Thu, 21 Apr  
> > > 2022 09:38:35 -0700):
> > 
> > > > I've attached mount.patch that when doing mount -v should
> > > > show the vnode usage per filesystem.  Note that the problem I was
> > > > running into was after some operations arc_prune and arc_evict would
> > > > consume 100% of 2 cores and make ZFS really slow.  If you are not
> > > > running into that issue then nocache etc. shouldn't be needed.
> > > 
> > > I don't run into this issue, but I have a huge perf difference when  
> > > using nocache in the nightly periodic runs. 4h instead of 12-24h  
> > > (22 jails on this system).
> > > 
> > > > On my laptop I set ARC to 1G since I don't use swap and in the past
> > > > ARC would consume too much memory and things would die.  When the
> > > > nullfs holds a bunch of vnodes then ZFS couldn't release them.
> > > > 
> > > > FYI, on my laptop with nocache and limited vnodes I haven't run
> > > > into this problem.  I haven't tried the patch to let ZFS free
> > > > its and nullfs vnodes on my laptop.  I have only tried it via
> > > 
> > > I have this patch and your mount patch installed now, without  
> > > nocache and reduced arc reclaim settings (100, 1). I will check the  
> > > runtime for the next 2 days.
> > 
> > 9-10h runtime with the above settings (compared to 4h with nocache  
> > and 12-24h without any patch and without nocache).
> > I changed the sysctls back to the defaults and will see in the next  
> > run (in 7h) what the result is with just the patches.
> 
> And again 9-10h runtime (I've seen a lot of the find processes in the  
> periodic daily run of those 22 jails in the state "*vnode"). Seems  
> nocache gives the best perf for me in this case.

Sorry for jumping in here - I've got a couple of questions:
- Will this also apply to nullfs read-only mounts? Or is it only in
case of writing "through" a nullfs mount that these problems are seen?
- Is it a problem also in 13, or is this "new" in -CURRENT?

We're having weird and unexplained CPU spikes on several systems, even
after tuning geli to not use gazillions of threads. So far our
suspicion has been ZFS snapshot cleanups but this is an interesting
contender - unless the whole "read only" part makes it moot.

/Eirik




Re: nullfs and ZFS issues

2022-04-25 Thread Alexander Leidinger
Quoting Alexander Leidinger  (from Sun, 24  
Apr 2022 19:58:17 +0200):


Quoting Alexander Leidinger  (from Fri, 22  
Apr 2022 09:04:39 +0200):


Quoting Doug Ambrisko  (from Thu, 21 Apr  
2022 09:38:35 -0700):



I've attached mount.patch that when doing mount -v should
show the vnode usage per filesystem.  Note that the problem I was
running into was after some operations arc_prune and arc_evict would
consume 100% of 2 cores and make ZFS really slow.  If you are not
running into that issue then nocache etc. shouldn't be needed.


I don't run into this issue, but I have a huge perf difference when  
using nocache in the nightly periodic runs. 4h instead of 12-24h  
(22 jails on this system).



On my laptop I set ARC to 1G since I don't use swap and in the past
ARC would consume too much memory and things would die.  When the
nullfs holds a bunch of vnodes then ZFS couldn't release them.

FYI, on my laptop with nocache and limited vnodes I haven't run
into this problem.  I haven't tried the patch to let ZFS free
its and nullfs vnodes on my laptop.  I have only tried it via


I have this patch and your mount patch installed now, without  
nocache and reduced arc reclaim settings (100, 1). I will check the  
runtime for the next 2 days.


9-10h runtime with the above settings (compared to 4h with nocache  
and 12-24h without any patch and without nocache).
I changed the sysctls back to the defaults and will see in the next  
run (in 7h) what the result is with just the patches.


And again 9-10h runtime (I've seen a lot of the find processes in the  
periodic daily run of those 22 jails in the state "*vnode"). Seems  
nocache gives the best perf for me in this case.


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org : PGP 0x8F31830F9F2772BF




Re: nullfs and ZFS issues

2022-04-24 Thread Alexander Leidinger
Quoting Alexander Leidinger  (from Fri, 22  
Apr 2022 09:04:39 +0200):


Quoting Doug Ambrisko  (from Thu, 21 Apr 2022  
09:38:35 -0700):



I've attached mount.patch that when doing mount -v should
show the vnode usage per filesystem.  Note that the problem I was
running into was after some operations arc_prune and arc_evict would
consume 100% of 2 cores and make ZFS really slow.  If you are not
running into that issue then nocache etc. shouldn't be needed.


I don't run into this issue, but I have a huge perf difference when  
using nocache in the nightly periodic runs. 4h instead of 12-24h (22  
jails on this system).



On my laptop I set ARC to 1G since I don't use swap and in the past
ARC would consume too much memory and things would die.  When the
nullfs holds a bunch of vnodes then ZFS couldn't release them.

FYI, on my laptop with nocache and limited vnodes I haven't run
into this problem.  I haven't tried the patch to let ZFS free
its and nullfs vnodes on my laptop.  I have only tried it via


I have this patch and your mount patch installed now, without  
nocache and reduced arc reclaim settings (100, 1). I will check the  
runtime for the next 2 days.


9-10h runtime with the above settings (compared to 4h with nocache and  
12-24h without any patch and without nocache).
I changed the sysctls back to the defaults and will see in the next  
run (in 7h) what the result is with just the patches.


Bye,
Alexander.
--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org : PGP 0x8F31830F9F2772BF




Re: nullfs and ZFS issues

2022-04-22 Thread Doug Ambrisko
On Fri, Apr 22, 2022 at 09:04:39AM +0200, Alexander Leidinger wrote:
| Quoting Doug Ambrisko  (from Thu, 21 Apr 2022  
| 09:38:35 -0700):
| 
| > On Thu, Apr 21, 2022 at 03:44:02PM +0200, Alexander Leidinger wrote:
| > | Quoting Mateusz Guzik  (from Thu, 21 Apr 2022
| > | 14:50:42 +0200):
| > |
| > | > On 4/21/22, Alexander Leidinger  wrote:
| > | >> I tried nocache on a system with a lot of jails which use nullfs,
| > | >> which showed very slow behavior in the daily periodic runs (12h runs
| > | >> in the night after boot, 24h or more in subsequent nights). Now the
| > | >> first nightly run after boot was finished after 4h.
| > | >>
| > | >> What is the benefit of not disabling the cache in nullfs? I would
| > | >> expect zfs (or ufs) to cache the (meta)data anyway.
| > | >>
| > | >
| > | > does the poor performance show up with
| > | > https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?
| > |
| > | I would like to have all the 22 jails run the periodic scripts a
| > | second night in a row before trying this.
| > |
| > | > if the long runs are still there, can you get some profiling from it?
| > | > sysctl -a before and after would be a start.
| > | >
| > | > My guess is that you are at the vnode limit and bumping into the 1 second sleep.
| > |
| > | That would explain the behavior I see since I added the last jail
| > | which seems to have crossed a threshold which triggers the slow
| > | behavior.
| > |
| > | Current status (with the 112 nullfs mounts with nocache):
| > | kern.maxvnodes:   10485760
| > | kern.numvnodes:3791064
| > | kern.freevnodes:   3613694
| > | kern.cache.stats.heldvnodes:151707
| > | kern.vnodes_created: 260288639
| > |
| > | The maxvnodes value is already increased by 10 times compared to the
| > | default value on this system.
| >
| > I've attached mount.patch that when doing mount -v should
| > show the vnode usage per filesystem.  Note that the problem I was
| > running into was after some operations arc_prune and arc_evict would
| > consume 100% of 2 cores and make ZFS really slow.  If you are not
| > running into that issue then nocache etc. shouldn't be needed.
| 
| I don't run into this issue, but I have a huge perf difference when  
| using nocache in the nightly periodic runs. 4h instead of 12-24h (22  
| jails on this system).

I wouldn't use nocache then!  It would be good to see what
Mateusz's patch does without nocache for your env.
 
| > On my laptop I set ARC to 1G since I don't use swap and in the past
| > ARC would consume too much memory and things would die.  When the
| > nullfs holds a bunch of vnodes then ZFS couldn't release them.
| >
| > FYI, on my laptop with nocache and limited vnodes I haven't run
| > into this problem.  I haven't tried the patch to let ZFS free
| > its and nullfs vnodes on my laptop.  I have only tried it via
| 
| I have this patch and your mount patch installed now, without nocache  
| and reduced arc reclaim settings (100, 1). I will check the runtime  
| for the next 2 days.
| 
| Your mount patch to show the per mount vnodes count looks useful, not  
| only for this particular case. Do you intend to commit it?

I should since it doesn't change the size of the structure etc.  I need
to put it up for review.

Thanks,

Doug A.



Re: nullfs and ZFS issues

2022-04-22 Thread Alexander Leidinger
Quoting Doug Ambrisko  (from Thu, 21 Apr 2022  
09:38:35 -0700):



On Thu, Apr 21, 2022 at 03:44:02PM +0200, Alexander Leidinger wrote:
| Quoting Mateusz Guzik  (from Thu, 21 Apr 2022
| 14:50:42 +0200):
|
| > On 4/21/22, Alexander Leidinger  wrote:
| >> I tried nocache on a system with a lot of jails which use nullfs,
| >> which showed very slow behavior in the daily periodic runs (12h runs
| >> in the night after boot, 24h or more in subsequent nights). Now the
| >> first nightly run after boot was finished after 4h.
| >>
| >> What is the benefit of not disabling the cache in nullfs? I would
| >> expect zfs (or ufs) to cache the (meta)data anyway.
| >>
| >
| > does the poor performance show up with
| > https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?
|
| I would like to have all the 22 jails run the periodic scripts a
| second night in a row before trying this.
|
| > if the long runs are still there, can you get some profiling from it?
| > sysctl -a before and after would be a start.
| >
| > My guess is that you are at the vnode limit and bumping into the 1 second sleep.

|
| That would explain the behavior I see since I added the last jail
| which seems to have crossed a threshold which triggers the slow
| behavior.
|
| Current status (with the 112 nullfs mounts with nocache):
| kern.maxvnodes:   10485760
| kern.numvnodes:3791064
| kern.freevnodes:   3613694
| kern.cache.stats.heldvnodes:151707
| kern.vnodes_created: 260288639
|
| The maxvnodes value is already increased by 10 times compared to the
| default value on this system.

I've attached mount.patch that when doing mount -v should
show the vnode usage per filesystem.  Note that the problem I was
running into was after some operations arc_prune and arc_evict would
consume 100% of 2 cores and make ZFS really slow.  If you are not
running into that issue then nocache etc. shouldn't be needed.


I don't run into this issue, but I have a huge perf difference when  
using nocache in the nightly periodic runs. 4h instead of 12-24h (22  
jails on this system).



On my laptop I set ARC to 1G since I don't use swap and in the past
ARC would consume too much memory and things would die.  When the
nullfs holds a bunch of vnodes then ZFS couldn't release them.

FYI, on my laptop with nocache and limited vnodes I haven't run
into this problem.  I haven't tried the patch to let ZFS free
its and nullfs vnodes on my laptop.  I have only tried it via


I have this patch and your mount patch installed now, without nocache  
and reduced arc reclaim settings (100, 1). I will check the runtime  
for the next 2 days.


Your mount patch to show the per-mount vnode count looks useful, not  
only for this particular case. Do you intend to commit it?


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org : PGP 0x8F31830F9F2772BF




Re: nullfs and ZFS issues

2022-04-21 Thread Doug Ambrisko
On Thu, Apr 21, 2022 at 03:44:02PM +0200, Alexander Leidinger wrote:
| Quoting Mateusz Guzik  (from Thu, 21 Apr 2022  
| 14:50:42 +0200):
| 
| > On 4/21/22, Alexander Leidinger  wrote:
| >> I tried nocache on a system with a lot of jails which use nullfs,
| >> which showed very slow behavior in the daily periodic runs (12h runs
| >> in the night after boot, 24h or more in subsequent nights). Now the
| >> first nightly run after boot was finished after 4h.
| >>
| >> What is the benefit of not disabling the cache in nullfs? I would
| >> expect zfs (or ufs) to cache the (meta)data anyway.
| >>
| >
| > does the poor performance show up with
| > https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?
| 
| I would like to have all the 22 jails run the periodic scripts a  
| second night in a row before trying this.
| 
| > if the long runs are still there, can you get some profiling from it?
| > sysctl -a before and after would be a start.
| >
| > My guess is that you are at the vnode limit and bumping into the 1 second sleep.
| 
| That would explain the behavior I see since I added the last jail  
| which seems to have crossed a threshold which triggers the slow  
| behavior.
| 
| Current status (with the 112 nullfs mounts with nocache):
| kern.maxvnodes:   10485760
| kern.numvnodes:3791064
| kern.freevnodes:   3613694
| kern.cache.stats.heldvnodes:151707
| kern.vnodes_created: 260288639
| 
| The maxvnodes value is already increased by 10 times compared to the  
| default value on this system.

With the patch, you shouldn't mount with nocache!  However, you might
want to tune:
vfs.zfs.arc.meta_prune
vfs.zfs.arc.meta_adjust_restarts

On each restart the code increments the prune amount by
vfs.zfs.arc.meta_prune and submits that amount to the vnode reclaim
code, so it ends up reclaiming a lot of vnodes.  The defaults of
1 * 4096, submitted on each loop, can cause most of the cache to be
freed.  With relatively small values for these, the cache didn't shrink
too much.
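
A sketch of that tuning (the values are the ones tried later in this
thread, not a recommendation):

  sysctl vfs.zfs.arc.meta_prune=100
  sysctl vfs.zfs.arc.meta_adjust_restarts=1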

Doug A.



Re: nullfs and ZFS issues

2022-04-21 Thread Doug Ambrisko
On Thu, Apr 21, 2022 at 03:44:02PM +0200, Alexander Leidinger wrote:
| Quoting Mateusz Guzik  (from Thu, 21 Apr 2022  
| 14:50:42 +0200):
| 
| > On 4/21/22, Alexander Leidinger  wrote:
| >> I tried nocache on a system with a lot of jails which use nullfs,
| >> which showed very slow behavior in the daily periodic runs (12h runs
| >> in the night after boot, 24h or more in subsequent nights). Now the
| >> first nightly run after boot was finished after 4h.
| >>
| >> What is the benefit of not disabling the cache in nullfs? I would
| >> expect zfs (or ufs) to cache the (meta)data anyway.
| >>
| >
| > does the poor performance show up with
| > https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?
| 
| I would like to have all the 22 jails run the periodic scripts a  
| second night in a row before trying this.
| 
| > if the long runs are still there, can you get some profiling from it?
| > sysctl -a before and after would be a start.
| >
| > My guess is that you are at the vnode limit and bumping into the 1 second sleep.
| 
| That would explain the behavior I see since I added the last jail  
| which seems to have crossed a threshold which triggers the slow  
| behavior.
| 
| Current status (with the 112 nullfs mounts with nocache):
| kern.maxvnodes:   10485760
| kern.numvnodes:3791064
| kern.freevnodes:   3613694
| kern.cache.stats.heldvnodes:151707
| kern.vnodes_created: 260288639
| 
| The maxvnodes value is already increased by 10 times compared to the  
| default value on this system.

I've attached mount.patch; with it, doing mount -v will show the
vnode usage per filesystem.  Note that the problem I was
running into was that after some operations arc_prune and arc_evict would
consume 100% of 2 cores and make ZFS really slow.  If you are not
running into that issue then nocache etc. shouldn't be needed.
On my laptop I set ARC to 1G since I don't use swap and in the past
ARC would consume too much memory and things would die.  When the
nullfs holds a bunch of vnodes then ZFS couldn't release them.

FYI, on my laptop with nocache and limited vnodes I haven't run
into this problem.  I haven't tried the patch to let ZFS free
its and nullfs vnodes on my laptop.  I have only tried it via
a bhyve test.  I use bhyve and an md drive to avoid wearing
out my SSD, and it's faster to test.  I have found that git, tar,
make world, etc. could trigger the issue before, but I haven't had
any issues with nocache and capping vnodes.
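
Roughly, that workaround amounts to something like the following (the
ARC limit is the 1G mentioned above; the vnode cap is an arbitrary low
value, and tunable names may differ between releases):

  # /boot/loader.conf
  vfs.zfs.arc_max="1G"

  # at runtime, pick a deliberately low vnode cap
  sysctl kern.maxvnodes=100000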

Thanks,

Doug A.
diff --git a/sbin/mount/mount.c b/sbin/mount/mount.c
index 79d9d6cb0ca..00eefb3a5e0 100644
--- a/sbin/mount/mount.c
+++ b/sbin/mount/mount.c
@@ -692,6 +692,13 @@ prmount(struct statfs *sfp)
 			xo_emit("{D:, }{Lw:fsid}{:fsid}", fsidbuf);
 			free(fsidbuf);
 		}
+		if (sfp->f_nvnodelistsize != 0 || sfp->f_lazyvnodelistsize != 0) {
+			xo_open_container("vnodes");
+			xo_emit("{D:, }{Lwc:vnodes}{Lw:count}{w:count/%ju}{Lw:lazy}{:lazy/%ju}",
+			    (uintmax_t)sfp->f_nvnodelistsize,
+			    (uintmax_t)sfp->f_lazyvnodelistsize);
+			xo_close_container("vnodes");
+		}
 	}
 	xo_emit("{D:)}\n");
 }
diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
index a495ad86ac4..3648ef8d080 100644
--- a/sys/kern/vfs_mount.c
+++ b/sys/kern/vfs_mount.c
@@ -2625,6 +2626,8 @@ __vfs_statfs(struct mount *mp, struct statfs *sbp)
 	sbp->f_version = STATFS_VERSION;
 	sbp->f_namemax = NAME_MAX;
 	sbp->f_flags = mp->mnt_flag & MNT_VISFLAGMASK;
+	sbp->f_nvnodelistsize = mp->mnt_nvnodelistsize;
+	sbp->f_lazyvnodelistsize = mp->mnt_lazyvnodelistsize;
 
 	return (mp->mnt_op->vfs_statfs(mp, sbp));
 }
diff --git a/sys/sys/mount.h b/sys/sys/mount.h
index 3383bfe8f43..95dd3c76ae5 100644
--- a/sys/sys/mount.h
+++ b/sys/sys/mount.h
@@ -91,7 +91,9 @@ struct statfs {
 	uint64_t f_asyncwrites;		/* count of async writes since mount */
 	uint64_t f_syncreads;		/* count of sync reads since mount */
 	uint64_t f_asyncreads;		/* count of async reads since mount */
-	uint64_t f_spare[10];		/* unused spare */
+	uint32_t f_nvnodelistsize;	/* (i) # of vnodes */
+	uint32_t f_lazyvnodelistsize;/* (l) # of lazy vnodes */
+	uint64_t f_spare[9];		/* unused spare */
 	uint32_t f_namemax;		/* maximum filename length */
 	uid_t	  f_owner;		/* user that mounted the filesystem */
 	fsid_t	  f_fsid;		/* filesystem id */
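
With the patch applied, the per-filesystem counts can then be pulled
out along these lines (the sample line shows the format quoted earlier
in this thread):

  mount -v | grep vnodes
  # e.g. /test on /mnt (nullfs, ..., vnodes: count 13846 lazy 0)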


Re: nullfs and ZFS issues

2022-04-21 Thread Alexander Leidinger
Quoting Mateusz Guzik  (from Thu, 21 Apr 2022  
14:50:42 +0200):



On 4/21/22, Alexander Leidinger  wrote:

I tried nocache on a system with a lot of jails which use nullfs,
which showed very slow behavior in the daily periodic runs (12h runs
in the night after boot, 24h or more in subsequent nights). Now the
first nightly run after boot was finished after 4h.

What is the benefit of not disabling the cache in nullfs? I would
expect zfs (or ufs) to cache the (meta)data anyway.



does the poor performance show up with
https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?


I would like to have all the 22 jails run the periodic scripts a  
second night in a row before trying this.



if the long runs are still there, can you get some profiling from it?
sysctl -a before and after would be a start.

My guess is that you are at the vnode limit and bumping into the 1 second sleep.


That would explain the behavior I see since I added the last jail  
which seems to have crossed a threshold which triggers the slow  
behavior.


Current status (with the 112 nullfs mounts with nocache):
kern.maxvnodes:   10485760
kern.numvnodes:3791064
kern.freevnodes:   3613694
kern.cache.stats.heldvnodes:151707
kern.vnodes_created: 260288639

The maxvnodes value is already increased by 10 times compared to the  
default value on this system.
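
For reference, the counters above can be queried, and maxvnodes raised,
with plain sysctl; the value shown is just the one already in use here:

  sysctl kern.maxvnodes kern.numvnodes kern.freevnodes \
      kern.cache.stats.heldvnodes kern.vnodes_created
  sysctl kern.maxvnodes=10485760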


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org : PGP 0x8F31830F9F2772BF




Re: nullfs and ZFS issues

2022-04-21 Thread Mateusz Guzik
On 4/21/22, Alexander Leidinger  wrote:
> Quoting Doug Ambrisko  (from Wed, 20 Apr 2022
> 09:20:33 -0700):
>
>> On Wed, Apr 20, 2022 at 11:39:44AM +0200, Alexander Leidinger wrote:
>> | Quoting Doug Ambrisko  (from Mon, 18 Apr 2022
>> | 16:32:38 -0700):
>> |
>> | > With nullfs, nocache and setting max vnodes to a low number I can
>> |
>> | Where is nocache documented? I don't see it in mount_nullfs(8),
>> | mount(8) or nullfs(5).
>>
>> I didn't find it but it is in:
>>  src/sys/fs/nullfs/null_vfsops.c:  if (vfs_getopt(mp->mnt_optnew,
>> "nocache", NULL, NULL) == 0 ||
>>
>> Also some file systems disable it via MNTK_NULL_NOCACHE
>
> Does the attached diff look ok?
>
>> | I tried a nullfs mount with nocache and it doesn't show up in the
>> | output of "mount".
>>
>> Yep, I saw that as well.  I could tell by dropping into ddb and then
>> do a show mount on the FS and look at the count.  That is why I added
>> the vnode count to mount -v so I could see the usage without dropping
>> into ddb.
>
> I tried nocache on a system with a lot of jails which use nullfs,
> which showed very slow behavior in the daily periodic runs (12h runs
> in the night after boot, 24h or more in subsequent nights). Now the
> first nightly run after boot was finished after 4h.
>
> What is the benefit of not disabling the cache in nullfs? I would
> expect zfs (or ufs) to cache the (meta)data anyway.
>

does the poor performance show up with
https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?

if the long runs are still there, can you get some profiling from it?
sysctl -a before and after would be a start.

My guess is that you are at the vnode limit and bumping into the 1 second sleep.
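
A minimal way to capture that (file names arbitrary) would be:

  sysctl -a > /var/tmp/sysctl.before
  # ... run the slow periodic jobs ...
  sysctl -a > /var/tmp/sysctl.after
  diff -u /var/tmp/sysctl.before /var/tmp/sysctl.after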

-- 
Mateusz Guzik 



Re: nullfs and ZFS issues

2022-04-21 Thread Alexander Leidinger
Quoting Doug Ambrisko  (from Wed, 20 Apr 2022  
09:20:33 -0700):



On Wed, Apr 20, 2022 at 11:39:44AM +0200, Alexander Leidinger wrote:
| Quoting Doug Ambrisko  (from Mon, 18 Apr 2022
| 16:32:38 -0700):
|
| > With nullfs, nocache and setting max vnodes to a low number I can
|
| Where is nocache documented? I don't see it in mount_nullfs(8),
| mount(8) or nullfs(5).

I didn't find it but it is in:
	src/sys/fs/nullfs/null_vfsops.c:  if (vfs_getopt(mp->mnt_optnew,  
"nocache", NULL, NULL) == 0 ||


Also some file systems disable it via MNTK_NULL_NOCACHE


Does the attached diff look ok?


| I tried a nullfs mount with nocache and it doesn't show up in the
| output of "mount".

Yep, I saw that as well.  I could tell by dropping into ddb and then
do a show mount on the FS and look at the count.  That is why I added
the vnode count to mount -v so I could see the usage without dropping
into ddb.


I tried nocache on a system with a lot of jails which use nullfs,  
which showed very slow behavior in the daily periodic runs (12h runs  
in the night after boot, 24h or more in subsequent nights). Now the  
first nightly run after boot was finished after 4h.


What is the benefit of not disabling the cache in nullfs? I would  
expect zfs (or ufs) to cache the (meta)data anyway.


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org : PGP 0x8F31830F9F2772BF
diff --git a/sbin/mount/mount.8 b/sbin/mount/mount.8
index 2a877c04c07..823df63953d 100644
--- a/sbin/mount/mount.8
+++ b/sbin/mount/mount.8
@@ -28,7 +28,7 @@
 .\" @(#)mount.8	8.8 (Berkeley) 6/16/94
 .\" $FreeBSD$
 .\"
-.Dd March 17, 2022
+.Dd April 21, 2022
 .Dt MOUNT 8
 .Os
 .Sh NAME
@@ -245,6 +245,9 @@ This file system should be skipped when
 is run with the
 .Fl a
 flag.
+.It Cm nocache
+Disable caching.
+Some filesystems may not support this.
 .It Cm noclusterr
 Disable read clustering.
 .It Cm noclusterw




Re: nullfs and ZFS issues

2022-04-20 Thread Doug Ambrisko
On Wed, Apr 20, 2022 at 11:39:44AM +0200, Alexander Leidinger wrote:
| Quoting Doug Ambrisko  (from Mon, 18 Apr 2022  
| 16:32:38 -0700):
| 
| > With nullfs, nocache and setting max vnodes to a low number I can
| 
| Where is nocache documented? I don't see it in mount_nullfs(8),  
| mount(8) or nullfs(5).

I didn't find it but it is in:
src/sys/fs/nullfs/null_vfsops.c:  if (vfs_getopt(mp->mnt_optnew, 
"nocache", NULL, NULL) == 0 ||

Also some file systems disable it via MNTK_NULL_NOCACHE

| I tried a nullfs mount with nocache and it doesn't show up in the  
| output of "mount".

Yep, I saw that as well.  I could tell by dropping into ddb and then
doing a show mount on the FS and looking at the count.  That is why I added
the vnode count to mount -v so I could see the usage without dropping
into ddb.

Doug A.



Re: nullfs and ZFS issues

2022-04-20 Thread Doug Ambrisko
On Wed, Apr 20, 2022 at 11:43:10AM +0200, Mateusz Guzik wrote:
| On 4/19/22, Doug Ambrisko  wrote:
| > On Tue, Apr 19, 2022 at 11:47:22AM +0200, Mateusz Guzik wrote:
| > | Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff
| > |
| > | this is not committable but should validate whether it works fine
| >
| > As a POC it's working.  I see the vnode count for the nullfs and
| > ZFS go up.  The ARC cache also goes up until it exceeds the ARC max.
| > size, then the vnodes for nullfs and ZFS go down.  The ARC cache goes
| > down as well.  This all repeats over and over.  The system seems
| > healthy.  No excessive running of arc_prune or arc_evict.
| >
| > My only comment is that the vnode freeing seems a bit aggressive.
| > Going from ~15,000 to ~200 vnodes for nullfs and the same for ZFS.
| > The ARC drops from 70M to 7M (max is set at 64M) for this unit
| > test.
| >
| 
| Can you check what kind of shrinking is requested by arc to begin
| with? I imagine encountering a nullfs vnode may end up recycling 2
| instead of 1, but even repeated a lot it does not explain the above.

I dug into it a bit more and think there could be a bug in
module/zfs/arc.c, in arc_evict_meta_balanced(uint64_t meta_used):

	prune += zfs_arc_meta_prune;
	// arc_prune_async(prune);
	arc_prune_async(zfs_arc_meta_prune);

Since arc_prune_async queues up a run of arc_prune_task for each
call, it is actually already accumulating the zfs_arc_meta_prune
amount.  That makes the count passed to vnlru_free_impl get really big
quickly, since it is looping via restart.

   1 HELLO arc_prune_task 164   ticks 2147465958 count 2048

dmesg | grep arc_prune_task | uniq -c
  14 HELLO arc_prune_task 164   ticks -2147343772 count 100
  50 HELLO arc_prune_task 164   ticks -2147343771 count 100
  46 HELLO arc_prune_task 164   ticks -2147343770 count 100
  49 HELLO arc_prune_task 164   ticks -2147343769 count 100
  44 HELLO arc_prune_task 164   ticks -2147343768 count 100
 116 HELLO arc_prune_task 164   ticks -2147343767 count 100
1541 HELLO arc_prune_task 164   ticks -2147343766 count 100
  53 HELLO arc_prune_task 164   ticks -2147343101 count 100
 100 HELLO arc_prune_task 164   ticks -2147343100 count 100
  75 HELLO arc_prune_task 164   ticks -2147343099 count 100
  52 HELLO arc_prune_task 164   ticks -2147343098 count 100
  50 HELLO arc_prune_task 164   ticks -2147343097 count 100
  51 HELLO arc_prune_task 164   ticks -2147343096 count 100
 783 HELLO arc_prune_task 164   ticks -2147343095 count 100
 884 HELLO arc_prune_task 164   ticks -2147343094 count 100

Note I shrank vfs.zfs.arc.meta_prune to 100 to see how that might
help.  Changing it to 1 helps more!  I see less aggressive
swings.

I added
printf("HELLO %s %d   ticks %d count 
%ld\n",__FUNCTION__,__LINE__,ticks,nr_scan);

to arc_prune_task.

Adjusting both
sysctl vfs.zfs.arc.meta_adjust_restarts=1
sysctl vfs.zfs.arc.meta_prune=100

without changing arc_prune_async(prune) helps avoid excessive swings.

Thanks,

Doug A.

| > | On 4/19/22, Mateusz Guzik  wrote:
| > | > On 4/19/22, Mateusz Guzik  wrote:
| > | >> On 4/19/22, Doug Ambrisko  wrote:
| > | >>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
| > | >>> localhost NFS mounts instead of nullfs when nullfs would complain
| > | >>> that it couldn't mount.  Since that check has been removed, I've
| > | >>> switched to nullfs only.  However, every so often my laptop would
| > | >>> get slow and the ARC evict and prune threads would consume two
| > | >>> cores at 100% until I rebooted.  I had a 1G max. ARC and have increased
| > | >>> it to 2G now.  Looking into this has uncovered some issues:
| > | >>>  - nullfs would prevent vnlru_free_vfsops from doing anything
| > | >>>    when called from ZFS arc_prune_task
| > | >>>  - nullfs would hang onto a bunch of vnodes unless mounted with
| > | >>>    nocache
| > | >>>  - nullfs and nocache would break untar.  This has been fixed now.
| > | >>>
| > | >>> With nullfs, nocache and setting max vnodes to a low number I can
| > | >>> keep the ARC around the max. without evict and prune consuming
| > | >>> 100% of 2 cores.  This doesn't seem like the best solution but it's
| > | >>> better than when the ARC starts spinning.
| > | >>>
| > | >>> Looking into this issue with bhyve and an md drive for testing I create
| > | >>> a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
| > | >>> I loop through untarring the Linux kernel into the nullfs mount, rm -rf it
| > | >>> and repeat.  I set the ARC to the smallest value I can.  Untarring the
| > | >>> Linux kernel was enough to get the ARC evict and prune to spin since
| > | >>> they couldn't evict/prune anything.
| > | >>>
| > | >>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it
| > | >>>   static 

Re: nullfs and ZFS issues

2022-04-20 Thread Mateusz Guzik
On 4/19/22, Doug Ambrisko  wrote:
> On Tue, Apr 19, 2022 at 11:47:22AM +0200, Mateusz Guzik wrote:
> | Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff
> |
> | this is not committable but should validate whether it works fine
>
> As a POC it's working.  I see the vnode count for the nullfs and
> ZFS go up.  The ARC cache also goes up until it exceeds the ARC max.
> size, then the vnodes for nullfs and ZFS go down.  The ARC cache goes
> down as well.  This all repeats over and over.  The system seems
> healthy.  No excessive running of arc_prune or arc_evict.
>
> My only comment is that the vnode freeing seems a bit aggressive.
> Going from ~15,000 to ~200 vnodes for nullfs and the same for ZFS.
> The ARC drops from 70M to 7M (max is set at 64M) for this unit
> test.
>

Can you check what kind of shrinking is requested by arc to begin
with? I imagine encountering a nullfs vnode may end up recycling 2
instead of 1, but even repeated a lot it does not explain the above.

>
> | On 4/19/22, Mateusz Guzik  wrote:
> | > On 4/19/22, Mateusz Guzik  wrote:
> | >> On 4/19/22, Doug Ambrisko  wrote:
> | >>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
> | >>> localhost NFS mounts instead of nullfs when nullfs would complain
> | >>> that it couldn't mount.  Since that check has been removed, I've
> | >>> switched to nullfs only.  However, every so often my laptop would
> | >>> get slow and the ARC evict and prune threads would consume two
> | >>> cores at 100% until I rebooted.  I had a 1G max. ARC and have increased
> | >>> it to 2G now.  Looking into this has uncovered some issues:
> | >>>  -  nullfs would prevent vnlru_free_vfsops from doing anything
> | >>> when called from ZFS arc_prune_task
> | >>>  -  nullfs would hang onto a bunch of vnodes unless mounted with
> | >>> nocache
> | >>>  -  nullfs and nocache would break untar.  This has been fixed now.
> | >>>
> | >>> With nullfs, nocache and setting max vnodes to a low number I can
> | >>> keep the ARC around the max. without evict and prune consuming
> | >>> 100% of 2 cores.  This doesn't seem like the best solution but it's
> | >>> better than when the ARC starts spinning.
> | >>>
> | >>> Looking into this issue with bhyve and an md drive for testing I create
> | >>> a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
> | >>> I loop through untarring the Linux kernel into the nullfs mount, rm -rf it
> | >>> and repeat.  I set the ARC to the smallest value I can.  Untarring the
> | >>> Linux kernel was enough to get the ARC evict and prune to spin since
> | >>> they couldn't evict/prune anything.
> | >>>
> | >>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it
> | >>>   static int
> | >>>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
> | >>>   {
> | >>> ...
> | >>>
> | >>> for (;;) {
> | >>> ...
> | >>> vp = TAILQ_NEXT(vp, v_vnodelist);
> | >>> ...
> | >>>
> | >>> /*
> | >>>  * Don't recycle if our vnode is from different type
> | >>>  * of mount point.  Note that mp is type-safe, the
> | >>>  * check does not reach unmapped address even if
> | >>>  * vnode is reclaimed.
> | >>>  */
> | >>> if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
> | >>> mp->mnt_op != mnt_op) {
> | >>> continue;
> | >>> }
> | >>> ...
> | >>>
| > | >>> The vp ends up being the nullfs mount and then hits the continue
> | >>> even though the passed in mvp is on ZFS.  If I do a hack to
> | >>> comment out the continue then I see the ARC, nullfs vnodes and
> | >>> ZFS vnodes grow.  When the ARC calls arc_prune_task that calls
> | >>> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS.
> | >>> The ARC cache usage also goes down.  Then they increase again until
> | >>> the ARC gets full and then they go down again.  So with this hack
> | >>> I don't need nocache passed to nullfs and I don't need to limit
> | >>> the max vnodes.  Doing multiple untars in parallel over and over
> | >>> doesn't seem to cause any issues for this test.  I'm not saying
> | >>> commenting out continue is the fix but a simple POC test.
> | >>>
> | >>
> | >> I don't see an easy way to say "this is a nullfs vnode holding onto a
| > | >> zfs vnode". Perhaps the routine can be extended with issuing a nullfs
> | >> callback, if the module is loaded.
> | >>
> | >> In the meantime I think a good enough(tm) fix would be to check that
> | >> nothing was freed and fallback to good old regular clean up without
> | >> filtering by vfsops. This would be very similar to what you are doing
> | >> with your hack.
> | >>
> | >
> | > Now that I wrote this perhaps an acceptable hack would be to extend
> | > struct mount with a pointer to "lower layer" mount 

Re: nullfs and ZFS issues

2022-04-20 Thread Alexander Leidinger
Quoting Doug Ambrisko  (from Mon, 18 Apr 2022  
16:32:38 -0700):



With nullfs, nocache and setting max vnodes to a low number I can


Where is nocache documented? I don't see it in mount_nullfs(8),  
mount(8) or nullfs(5).


I tried a nullfs mount with nocache and it doesn't show up in the  
output of "mount".


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org : PGP 0x8F31830F9F2772BF




Re: nullfs and ZFS issues

2022-04-19 Thread Doug Ambrisko
On Tue, Apr 19, 2022 at 11:47:22AM +0200, Mateusz Guzik wrote:
| Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff
| 
| this is not committable but should validate whether it works fine

As a POC it's working.  I see the vnode count for the nullfs and
ZFS go up.  The ARC cache also goes up until it exceeds the ARC max.
size, then the vnodes for nullfs and ZFS go down.  The ARC cache goes
down as well.  This all repeats over and over.  The system seems
healthy.  No excessive running of arc_prune or arc_evict.

My only comment is that the vnode freeing seems a bit aggressive.
Going from ~15,000 to ~200 vnodes for nullfs and the same for ZFS.
The ARC drops from 70M to 7M (max is set at 64M) for this unit
test.

Thanks,

Doug A.
 
| On 4/19/22, Mateusz Guzik  wrote:
| > On 4/19/22, Mateusz Guzik  wrote:
| >> On 4/19/22, Doug Ambrisko  wrote:
| >>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
| >>> localhost NFS mounts instead of nullfs when nullfs would complain
| >>> that it couldn't mount.  Since that check has been removed, I've
| >>> switched to nullfs only.  However, every so often my laptop would
| >>> get slow and the ARC evict and prune threads would consume two
| >>> cores at 100% until I rebooted.  I had a 1G max. ARC and have increased
| >>> it to 2G now.  Looking into this has uncovered some issues:
| >>>  - nullfs would prevent vnlru_free_vfsops from doing anything
| >>>    when called from ZFS arc_prune_task
| >>>  - nullfs would hang onto a bunch of vnodes unless mounted with
| >>>    nocache
| >>>  - nullfs and nocache would break untar.  This has been fixed now.
| >>>
| >>> With nullfs, nocache and setting max vnodes to a low number I can
| >>> keep the ARC around the max. without evict and prune consuming
| >>> 100% of 2 cores.  This doesn't seem like the best solution but it's
| >>> better than when the ARC starts spinning.
| >>>
| >>> Looking into this issue with bhyve and a md drive for testing I create
| >>> a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
| >>> I loop through untarring the Linux kernel into the nullfs mount, rm -rf
| >>> it
| >>> and repeat.  I set the ARC to the smallest value I can.  Untarring the
| >>> Linux kernel was enough to get the ARC evict and prune to spin since
| >>> they couldn't evict/prune anything.
| >>>
| >>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it
| >>>   static int
| >>>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
| >>>   {
| >>>   ...
| >>>
| >>> for (;;) {
| >>>   ...
| >>> vp = TAILQ_NEXT(vp, v_vnodelist);
| >>>   ...
| >>>
| >>> /*
| >>>  * Don't recycle if our vnode is from different type
| >>>  * of mount point.  Note that mp is type-safe, the
| >>>  * check does not reach unmapped address even if
| >>>  * vnode is reclaimed.
| >>>  */
| >>> if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
| >>> mp->mnt_op != mnt_op) {
| >>> continue;
| >>> }
| >>>   ...
| >>>
| >>> The vp ends up being the nullfs mount and then hits the continue
| >>> even though the passed in mvp is on ZFS.  If I do a hack to
| >>> comment out the continue then I see the ARC, nullfs vnodes and
| >>> ZFS vnodes grow.  When the ARC calls arc_prune_task that calls
| >>> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS.
| >>> The ARC cache usage also goes down.  Then they increase again until
| >>> the ARC gets full and then they go down again.  So with this hack
| >>> I don't need nocache passed to nullfs and I don't need to limit
| >>> the max vnodes.  Doing multiple untars in parallel over and over
| >>> doesn't seem to cause any issues for this test.  I'm not saying
| >>> commenting out continue is the fix but a simple POC test.
| >>>
| >>
| >> I don't see an easy way to say "this is a nullfs vnode holding onto a
| >> zfs vnode". Perhaps the routine can be extended with issuing a nullfs
| >> callback, if the module is loaded.
| >>
| >> In the meantime I think a good enough(tm) fix would be to check that
| >> nothing was freed and fallback to good old regular clean up without
| >> filtering by vfsops. This would be very similar to what you are doing
| >> with your hack.
| >>
| >
| > Now that I wrote this perhaps an acceptable hack would be to extend
| > struct mount with a pointer to "lower layer" mount (if any) and patch
| > the vfsops check to also look there.
| >
| >>
| >>> It appears that when ZFS is asking for cached vnodes to be
| >>> free'd nullfs also needs to free some up as well so that
| >>> they are free'd on the VFS level.  It seems that vnlru_free_impl
| >>> should allow some of the related nullfs vnodes to be free'd so
| >>> the ZFS ones can be free'd and reduce the size of the ARC.
| >>>
| >>> BTW, I also hacked the kernel 

Re: nullfs and ZFS issues

2022-04-19 Thread Mateusz Guzik
Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff

this is not committable but should validate whether it works fine

On 4/19/22, Mateusz Guzik  wrote:
> On 4/19/22, Mateusz Guzik  wrote:
>> On 4/19/22, Doug Ambrisko  wrote:
>>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
>>> localhost NFS mounts instead of nullfs when nullfs would complain
>>> that it couldn't mount.  Since that check has been removed, I've
>>> switched to nullfs only.  However, every so often my laptop would
>>> get slow and the ARC evict and prune threads would consume two
>>> cores at 100% until I rebooted.  I had a 1G max. ARC and have increased
>>> it to 2G now.  Looking into this has uncovered some issues:
>>>  -  nullfs would prevent vnlru_free_vfsops from doing anything
>>> when called from ZFS arc_prune_task
>>>  -  nullfs would hang onto a bunch of vnodes unless mounted with
>>> nocache
>>>  -  nullfs and nocache would break untar.  This has been fixed now.
>>>
>>> With nullfs, nocache and setting max vnodes to a low number I can
>>> keep the ARC around the max. without evict and prune consuming
>>> 100% of 2 cores.  This doesn't seem like the best solution but it's
>>> better than when the ARC starts spinning.
>>>
>>> Looking into this issue with bhyve and a md drive for testing I create
>>> a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
>>> I loop through untarring the Linux kernel into the nullfs mount, rm -rf
>>> it
>>> and repeat.  I set the ARC to the smallest value I can.  Untarring the
>>> Linux kernel was enough to get the ARC evict and prune to spin since
>>> they couldn't evict/prune anything.
>>>
>>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it
>>>   static int
>>>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
>>>   {
>>> ...
>>>
>>> for (;;) {
>>> ...
>>> vp = TAILQ_NEXT(vp, v_vnodelist);
>>> ...
>>>
>>> /*
>>>  * Don't recycle if our vnode is from different type
>>>  * of mount point.  Note that mp is type-safe, the
>>>  * check does not reach unmapped address even if
>>>  * vnode is reclaimed.
>>>  */
>>> if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
>>> mp->mnt_op != mnt_op) {
>>> continue;
>>> }
>>> ...
>>>
>>> The vp ends up being the nullfs mount and then hits the continue
>>> even though the passed in mvp is on ZFS.  If I do a hack to
>>> comment out the continue then I see the ARC, nullfs vnodes and
>>> ZFS vnodes grow.  When the ARC calls arc_prune_task that calls
>>> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS.
>>> The ARC cache usage also goes down.  Then they increase again until
>>> the ARC gets full and then they go down again.  So with this hack
>>> I don't need nocache passed to nullfs and I don't need to limit
>>> the max vnodes.  Doing multiple untars in parallel over and over
>>> doesn't seem to cause any issues for this test.  I'm not saying
>>> commenting out continue is the fix but a simple POC test.
>>>
>>
>> I don't see an easy way to say "this is a nullfs vnode holding onto a
>> zfs vnode". Perhaps the routine can be extended with issuing a nullfs
>> callback, if the module is loaded.
>>
>> In the meantime I think a good enough(tm) fix would be to check that
>> nothing was freed and fallback to good old regular clean up without
>> filtering by vfsops. This would be very similar to what you are doing
>> with your hack.
>>
>
> Now that I wrote this perhaps an acceptable hack would be to extend
> struct mount with a pointer to "lower layer" mount (if any) and patch
> the vfsops check to also look there.
>
>>
>>> It appears that when ZFS is asking for cached vnodes to be
>>> free'd nullfs also needs to free some up as well so that
>>> they are free'd on the VFS level.  It seems that vnlru_free_impl
>>> should allow some of the related nullfs vnodes to be free'd so
>>> the ZFS ones can be free'd and reduce the size of the ARC.
>>>
>>> BTW, I also hacked the kernel and mount to show the vnodes used
>>> per mount ie. mount -v:
>>>   test on /test (zfs, NFS exported, local, nfsv4acls, fsid
>>> 2b23b2a1de21ed66,
>>> vnodes: count 13846 lazy 0)
>>>   /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid
>>> 11ff00292900, vnodes: count 13846 lazy 0)
>>>
>>> Now I can easily see how the vnodes are used without going into ddb.
>>> On my laptop I have various vnet jails and nullfs mount my homedir into
>>> them so pretty much everything goes through nullfs to ZFS.  I'm limping
>>> along with the nullfs nocache and small number of vnodes but it would be
>>> nice to not need that.
>>>
>>> Thanks,
>>>
>>> Doug A.
>>>
>>>
>>
>>
>> --
>> Mateusz Guzik 
>>
>
>
> --
> Mateusz Guzik 
>


-- 
Mateusz Guzik 



Re: nullfs and ZFS issues

2022-04-19 Thread Mateusz Guzik
On 4/19/22, Mateusz Guzik  wrote:
> On 4/19/22, Doug Ambrisko  wrote:
>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
>> localhost NFS mounts instead of nullfs when nullfs would complain
>> that it couldn't mount.  Since that check has been removed, I've
>> switched to nullfs only.  However, every so often my laptop would
>> get slow and the ARC evict and prune threads would consume two
>> cores at 100% until I rebooted.  I had a 1G max. ARC and have increased
>> it to 2G now.  Looking into this has uncovered some issues:
>>  -   nullfs would prevent vnlru_free_vfsops from doing anything
>>  when called from ZFS arc_prune_task
>>  -   nullfs would hang onto a bunch of vnodes unless mounted with
>>  nocache
>>  -   nullfs and nocache would break untar.  This has been fixed now.
>>
>> With nullfs, nocache and setting max vnodes to a low number I can
>> keep the ARC around the max. without evict and prune consuming
>> 100% of 2 cores.  This doesn't seem like the best solution but it's
>> better than when the ARC starts spinning.
>>
>> Looking into this issue with bhyve and a md drive for testing I create
>> a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
>> I loop through untarring the Linux kernel into the nullfs mount, rm -rf it
>> and repeat.  I set the ARC to the smallest value I can.  Untarring the
>> Linux kernel was enough to get the ARC evict and prune to spin since
>> they couldn't evict/prune anything.
>>
>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it
>>   static int
>>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
>>   {
>>  ...
>>
>> for (;;) {
>>  ...
>> vp = TAILQ_NEXT(vp, v_vnodelist);
>>  ...
>>
>> /*
>>  * Don't recycle if our vnode is from different type
>>  * of mount point.  Note that mp is type-safe, the
>>  * check does not reach unmapped address even if
>>  * vnode is reclaimed.
>>  */
>> if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
>> mp->mnt_op != mnt_op) {
>> continue;
>> }
>>  ...
>>
>> The vp ends up being the nullfs mount and then hits the continue
>> even though the passed in mvp is on ZFS.  If I do a hack to
>> comment out the continue then I see the ARC, nullfs vnodes and
>> ZFS vnodes grow.  When the ARC calls arc_prune_task that calls
>> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS.
>> The ARC cache usage also goes down.  Then they increase again until
>> the ARC gets full and then they go down again.  So with this hack
>> I don't need nocache passed to nullfs and I don't need to limit
>> the max vnodes.  Doing multiple untars in parallel over and over
>> doesn't seem to cause any issues for this test.  I'm not saying
>> commenting out continue is the fix but a simple POC test.
>>
>
> I don't see an easy way to say "this is a nullfs vnode holding onto a
> zfs vnode". Perhaps the routine can be extended with issuing a nullfs
> callback, if the module is loaded.
>
> In the meantime I think a good enough(tm) fix would be to check that
> nothing was freed and fallback to good old regular clean up without
> filtering by vfsops. This would be very similar to what you are doing
> with your hack.
>

Now that I wrote this perhaps an acceptable hack would be to extend
struct mount with a pointer to "lower layer" mount (if any) and patch
the vfsops check to also look there.

>
>> It appears that when ZFS is asking for cached vnodes to be
>> free'd nullfs also needs to free some up as well so that
>> they are free'd on the VFS level.  It seems that vnlru_free_impl
>> should allow some of the related nullfs vnodes to be free'd so
>> the ZFS ones can be free'd and reduce the size of the ARC.
>>
>> BTW, I also hacked the kernel and mount to show the vnodes used
>> per mount ie. mount -v:
>>   test on /test (zfs, NFS exported, local, nfsv4acls, fsid
>> 2b23b2a1de21ed66,
>> vnodes: count 13846 lazy 0)
>>   /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid
>> 11ff00292900, vnodes: count 13846 lazy 0)
>>
>> Now I can easily see how the vnodes are used without going into ddb.
>> On my laptop I have various vnet jails and nullfs mount my homedir into
>> them so pretty much everything goes through nullfs to ZFS.  I'm limping
>> along with the nullfs nocache and small number of vnodes but it would be
>> nice to not need that.
>>
>> Thanks,
>>
>> Doug A.
>>
>>
>
>
> --
> Mateusz Guzik 
>


-- 
Mateusz Guzik 



Re: nullfs and ZFS issues

2022-04-19 Thread Mateusz Guzik
On 4/19/22, Doug Ambrisko  wrote:
> I've switched my laptop to use nullfs and ZFS.  Previously, I used
> localhost NFS mounts instead of nullfs when nullfs would complain
> that it couldn't mount.  Since that check has been removed, I've
> switched to nullfs only.  However, every so often my laptop would
> get slow and the ARC evict and prune threads would consume two
> cores at 100% until I rebooted.  I had a 1G max. ARC and have increased
> it to 2G now.  Looking into this has uncovered some issues:
>  - nullfs would prevent vnlru_free_vfsops from doing anything
>    when called from ZFS arc_prune_task
>  - nullfs would hang onto a bunch of vnodes unless mounted with
>    nocache
>  - nullfs and nocache would break untar.  This has been fixed now.
>
> With nullfs, nocache and setting max vnodes to a low number I can
> keep the ARC around the max. without evict and prune consuming
> 100% of 2 cores.  This doesn't seem like the best solution but it's
> better than when the ARC starts spinning.
>
> Looking into this issue with bhyve and an md drive for testing I create
> a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
> I loop through untarring the Linux kernel into the nullfs mount, rm -rf it
> and repeat.  I set the ARC to the smallest value I can.  Untarring the
> Linux kernel was enough to get the ARC evict and prune to spin since
> they couldn't evict/prune anything.
>
> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it
>   static int
>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
>   {
>   ...
>
> for (;;) {
>   ...
> vp = TAILQ_NEXT(vp, v_vnodelist);
>   ...
>
> /*
>  * Don't recycle if our vnode is from different type
>  * of mount point.  Note that mp is type-safe, the
>  * check does not reach unmapped address even if
>  * vnode is reclaimed.
>  */
> if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
> mp->mnt_op != mnt_op) {
> continue;
> }
>   ...
>
> The vp ends up being the nullfs mount and then hits the continue
> even though the passed in mvp is on ZFS.  If I do a hack to
> comment out the continue then I see the ARC, nullfs vnodes and
> ZFS vnodes grow.  When the ARC calls arc_prune_task that calls
> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS.
> The ARC cache usage also goes down.  Then they increase again until
> the ARC gets full and then they go down again.  So with this hack
> I don't need nocache passed to nullfs and I don't need to limit
> the max vnodes.  Doing multiple untars in parallel over and over
> doesn't seem to cause any issues for this test.  I'm not saying
> commenting out continue is the fix but a simple POC test.
>

I don't see an easy way to say "this is a nullfs vnode holding onto a
zfs vnode". Perhaps the routine can be extended with issuing a nullfs
callback, if the module is loaded.

In the meantime I think a good enough(tm) fix would be to check that
nothing was freed and fallback to good old regular clean up without
filtering by vfsops. This would be very similar to what you are doing
with your hack.


> It appears that when ZFS is asking for cached vnodes to be
> free'd nullfs also needs to free some up as well so that
> they are free'd on the VFS level.  It seems that vnlru_free_impl
> should allow some of the related nullfs vnodes to be free'd so
> the ZFS ones can be free'd and reduce the size of the ARC.
>
> BTW, I also hacked the kernel and mount to show the vnodes used
> per mount ie. mount -v:
>   test on /test (zfs, NFS exported, local, nfsv4acls, fsid 2b23b2a1de21ed66,
> vnodes: count 13846 lazy 0)
>   /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid
> 11ff00292900, vnodes: count 13846 lazy 0)
>
> Now I can easily see how the vnodes are used without going into ddb.
> On my laptop I have various vnet jails and nullfs mount my homedir into
> them so pretty much everything goes through nullfs to ZFS.  I'm limping
> along with the nullfs nocache and small number of vnodes but it would be
> nice to not need that.
>
> Thanks,
>
> Doug A.
>
>


-- 
Mateusz Guzik 



nullfs and ZFS issues

2022-04-18 Thread Doug Ambrisko
I've switched my laptop to use nullfs and ZFS.  Previously, I used
localhost NFS mounts instead of nullfs when nullfs would complain
that it couldn't mount.  Since that check has been removed, I've
switched to nullfs only.  However, every so often my laptop would
get slow and the ARC evict and prune threads would consume two
cores at 100% until I rebooted.  I had a 1G max. ARC and have increased
it to 2G now.  Looking into this has uncovered some issues:
 -  nullfs would prevent vnlru_free_vfsops from doing anything
    when called from ZFS arc_prune_task
 -  nullfs would hang onto a bunch of vnodes unless mounted with
    nocache
 -  nullfs and nocache would break untar.  This has been fixed now.

With nullfs, nocache and setting max vnodes to a low number I can
keep the ARC around the max. without evict and prune consuming
100% of 2 cores.  This doesn't seem like the best solution but it's
better than when the ARC starts spinning.

Looking into this issue with bhyve and an md drive for testing I create
a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
I loop through untarring the Linux kernel into the nullfs mount, rm -rf it
and repeat.  I set the ARC to the smallest value I can.  Untarring the
Linux kernel was enough to get the ARC evict and prune to spin since
they couldn't evict/prune anything.
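
Roughly, the test loop looks like this (size, md unit and tarball path
are illustrative):

  mdconfig -a -t swap -s 8g -u 1
  zpool create test /dev/md1
  mount -t nullfs /test /mnt
  while true; do
          tar -C /mnt -xf /path/to/linux-kernel.tar.xz
          rm -rf /mnt/linux-*
  done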

Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it
  static int
  vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
  {
...

for (;;) {
...
vp = TAILQ_NEXT(vp, v_vnodelist);
...

/*
 * Don't recycle if our vnode is from different type
 * of mount point.  Note that mp is type-safe, the
 * check does not reach unmapped address even if
 * vnode is reclaimed.
 */
if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
mp->mnt_op != mnt_op) {
continue;
}
...

The vp ends up being the nullfs mount and then hits the continue
even though the passed in mvp is on ZFS.  If I do a hack to
comment out the continue then I see the ARC, nullfs vnodes and
ZFS vnodes grow.  When the ARC calls arc_prune_task that calls
vnlru_free_vfsops, the vnodes now go down for nullfs and ZFS.
The ARC cache usage also goes down.  Then they increase again until
the ARC gets full and then they go down again.  So with this hack
I don't need nocache passed to nullfs and I don't need to limit
the max vnodes.  Doing multiple untars in parallel over and over
doesn't seem to cause any issues for this test.  I'm not saying
commenting out the continue is the fix, but it is a simple POC test.

It appears that when ZFS is asking for cached vnodes to be
free'd nullfs also needs to free some up as well so that
they are free'd on the VFS level.  It seems that vnlru_free_impl
should allow some of the related nullfs vnodes to be free'd so
the ZFS ones can be free'd and reduce the size of the ARC.

BTW, I also hacked the kernel and mount to show the vnodes used
per mount ie. mount -v:
  test on /test (zfs, NFS exported, local, nfsv4acls, fsid 2b23b2a1de21ed66, 
vnodes: count 13846 lazy 0)
  /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid 11ff00292900, 
vnodes: count 13846 lazy 0)

Now I can easily see how the vnodes are used without going into ddb.
On my laptop I have various vnet jails and nullfs mount my homedir into
them so pretty much everything goes through nullfs to ZFS.  I'm limping
along with the nullfs nocache and small number of vnodes but it would be
nice to not need that.

Thanks,

Doug A.