Re: Speed improvements in ZFS

2023-09-15 Thread Alexander Leidinger

Am 2023-09-15 13:40, schrieb George Michaelson:

Not wanting to hijack the thread: I am interested in whether any of this can 
translate back up the tree and make ZFS on Linux faster.


And whether there is simple sysctl tuning worth trying on large-memory (TB) 
pre-14 FreeBSD systems with slow ZFS. Older FreeBSD, alas.


The current part of the discussion is not really about ZFS (I use a lot 
of nullfs on top of ZFS). So no to the first part.


The tuning I did (maxvnodes) doesn't really depend on the FreeBSD 
version, but on the number of files touched/contained in the FS. The 
only other change I made is updating the OS itself, so this part doesn't 
apply to pre-14 systems.
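
For reference, a rough sketch of what that tuning looks like (the value is 
just an example; pick something that matches the number of files your jails 
touch):

---snip---
# show the current limit and how many vnodes are actually in use
sysctl kern.maxvnodes vfs.numvnodes vfs.freevnodes

# raise the limit at runtime
sysctl kern.maxvnodes=10485760

# make it persistent across reboots
echo 'kern.maxvnodes=10485760' >> /etc/sysctl.conf
---snip---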


If you think your ZFS (with a large ARC) is slow, you need to review 
your primarycache settings per dataset, check the arcstats, and maybe 
think about a second-level ARC on fast storage (a cache device on NVMe or 
SSD). If you have a read-once workload, none of this will help. It all 
depends on your workload.
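
A minimal sketch of that kind of review (pool and dataset names are 
placeholders, and the cache device is only an example):

---snip---
# per-dataset cache policy (all, metadata, or none)
zfs get primarycache,secondarycache pool/dataset
zfs set primarycache=metadata pool/dataset

# ARC hit/miss and size counters
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max

# add a 2nd level ARC (L2ARC) on fast storage
zpool add pool cache nvd0p4
---snip---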


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF



Re: Speed improvements in ZFS

2023-09-15 Thread George Michaelson
Not wanting to hijack the thread: I am interested in whether any of this can
translate back up the tree and make ZFS on Linux faster.

And whether there is simple sysctl tuning worth trying on large-memory (TB)
pre-14 FreeBSD systems with slow ZFS. Older FreeBSD, alas.


Re: Speed improvements in ZFS

2023-09-15 Thread Alexander Leidinger

Am 2023-09-04 14:26, schrieb Mateusz Guzik:

On 9/4/23, Alexander Leidinger  wrote:

Am 2023-08-28 22:33, schrieb Alexander Leidinger:

Am 2023-08-22 18:59, schrieb Mateusz Guzik:

On 8/22/23, Alexander Leidinger  wrote:

Am 2023-08-21 10:53, schrieb Konstantin Belousov:
On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger 
wrote:

Am 2023-08-20 23:17, schrieb Konstantin Belousov:
> On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
> > On 8/20/23, Alexander Leidinger  wrote:
> > > Am 2023-08-20 22:02, schrieb Mateusz Guzik:
> > >> On 8/20/23, Alexander Leidinger 
> > >> wrote:
> > >>> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
> >  On 8/18/23, Alexander Leidinger 
> >  wrote:
> > >>>
> > > I have a 51MB text file, compressed to about 1MB. Are you
> > > interested
> > > to
> > > get it?
> > >
> > 
> >  Your problem is not the vnode limit, but nullfs.
> > 
> >  https://people.freebsd.org/~mjg/netchild-periodic-find.svg
> > >>>
> > >>> 122 nullfs mounts on this system. And every jail I setup has
> > >>> several
> > >>> null mounts. One basesystem mounted into every jail, and then
> > >>> shared
> > >>> ports (packages/distfiles/ccache) across all of them.
> > >>>
> >  First, some of the contention is notorious VI_LOCK in order
> >  to
> >  do
> >  anything.
> > 
> >  But more importantly the mind-boggling off-cpu time comes
> >  from
> >  exclusive locking which should not be there to begin with --
> >  as
> >  in
> >  that xlock in stat should be a slock.
> > 
> >  Maybe I'm going to look into it later.
> > >>>
> > >>> That would be fantastic.
> > >>>
> > >>
> > >> I did a quick test, things are shared locked as expected.
> > >>
> > >> However, I found the following:
> > >> if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
> > >> mp->mnt_kern_flag |=
> > >> lowerrootvp->v_mount->mnt_kern_flag &
> > >> (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
> > >> MNTK_EXTENDED_SHARED);
> > >> }
> > >>
> > >> are you using the "nocache" option? it has a side effect of
> > >> xlocking
> > >
> > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
> > >
> >
> > If you don't have "nocache" on null mounts, then I don't see how
> > this
> > could happen.
>
> There is also MNTK_NULL_NOCACHE on lower fs, which is currently set
> for
> fuse and nfs at least.

11 of those 122 nullfs mounts are ZFS datasets which are also NFS
exported.
6 of those nullfs mounts are also exported via Samba. The NFS
exports
shouldn't be needed anymore, I will remove them.

By nfs I meant nfs client, not nfs exports.


No NFS client mounts anywhere on this system. So where is this
exclusive
lock coming from then...
This is a ZFS system. 2 pools: one for the root, one for anything I
need
space for. Both pools reside on the same disks. The root pool is a
3-way
mirror, the "space-pool" is a 5-disk raidz2. All jails are on the
space-pool. The jails are all basejail-style jails.



While I don't see why xlocking happens, you should be able to dtrace
or printf your way into finding out.


dtrace looks to me like a faster approach to get to the root than
printf... my first naive try is to detect exclusive locks. I'm not 100%
sure I got it right, but at least dtrace doesn't complain about it:
---snip---
#pragma D option dynvarsize=32m

fbt:nullfs:null_lock:entry
/args[0]->a_flags & 0x08 != 0/
{
stack();
}
---snip---

In which direction should I look with dtrace if this works in tonight's
run of periodic? I don't have enough knowledge about VFS to come up
with some immediate ideas.


After your sysctl fix for maxvnodes I increased the amount of vnodes 10
times compared to the initial report. This has increased the speed of
the operation; the find runs in all those jails finished today after ~5h
(@~8am) instead of in the afternoon as before. Could this suggest that
in parallel some null_reclaim() is running which does the exclusive
locks and slows down the entire operation?



That may be a slowdown to some extent, but the primary problem is
exclusive vnode locking for stat lookup, which should not be
happening.


With -current as of 2023-09-03 (and right now 2023-09-11), the periodic 
daily runs are down to less than an hour... and this didn't happen 
directly after switching to 2023-09-13. First it went down to 4h, then 
down to 1h without any update of the OS. The only thing I did was 
modify the number of maxfiles: first to some huge amount after your 
commit in the sysctl-affecting part, then, after noticing way more 
freevnodes than configured, down to 5.


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF




Re: Speed improvements in ZFS

2023-09-04 Thread Mateusz Guzik
On 9/4/23, Alexander Leidinger  wrote:
> Am 2023-08-28 22:33, schrieb Alexander Leidinger:
>> Am 2023-08-22 18:59, schrieb Mateusz Guzik:
>>> On 8/22/23, Alexander Leidinger  wrote:
 Am 2023-08-21 10:53, schrieb Konstantin Belousov:
> On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger wrote:
>> Am 2023-08-20 23:17, schrieb Konstantin Belousov:
>> > On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
>> > > On 8/20/23, Alexander Leidinger  wrote:
>> > > > Am 2023-08-20 22:02, schrieb Mateusz Guzik:
>> > > >> On 8/20/23, Alexander Leidinger 
>> > > >> wrote:
>> > > >>> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
>> > >  On 8/18/23, Alexander Leidinger 
>> > >  wrote:
>> > > >>>
>> > > > I have a 51MB text file, compressed to about 1MB. Are you
>> > > > interested
>> > > > to
>> > > > get it?
>> > > >
>> > > 
>> > >  Your problem is not the vnode limit, but nullfs.
>> > > 
>> > >  https://people.freebsd.org/~mjg/netchild-periodic-find.svg
>> > > >>>
>> > > >>> 122 nullfs mounts on this system. And every jail I setup has
>> > > >>> several
>> > > >>> null mounts. One basesystem mounted into every jail, and then
>> > > >>> shared
>> > > >>> ports (packages/distfiles/ccache) across all of them.
>> > > >>>
>> > >  First, some of the contention is notorious VI_LOCK in order
>> > >  to
>> > >  do
>> > >  anything.
>> > > 
>> > >  But more importantly the mind-boggling off-cpu time comes
>> > >  from
>> > >  exclusive locking which should not be there to begin with --
>> > >  as
>> > >  in
>> > >  that xlock in stat should be a slock.
>> > > 
>> > >  Maybe I'm going to look into it later.
>> > > >>>
>> > > >>> That would be fantastic.
>> > > >>>
>> > > >>
>> > > >> I did a quick test, things are shared locked as expected.
>> > > >>
>> > > >> However, I found the following:
>> > > >> if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
>> > > >> mp->mnt_kern_flag |=
>> > > >> lowerrootvp->v_mount->mnt_kern_flag &
>> > > >> (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
>> > > >> MNTK_EXTENDED_SHARED);
>> > > >> }
>> > > >>
>> > > >> are you using the "nocache" option? it has a side effect of
>> > > >> xlocking
>> > > >
>> > > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
>> > > >
>> > >
>> > > If you don't have "nocache" on null mounts, then I don't see how
>> > > this
>> > > could happen.
>> >
>> > There is also MNTK_NULL_NOCACHE on lower fs, which is currently set
>> > for
>> > fuse and nfs at least.
>>
>> 11 of those 122 nullfs mounts are ZFS datasets which are also NFS
>> exported.
>> 6 of those nullfs mounts are also exported via Samba. The NFS
>> exports
>> shouldn't be needed anymore, I will remove them.
> By nfs I meant nfs client, not nfs exports.

 No NFS client mounts anywhere on this system. So where is this
 exclusive
 lock coming from then...
 This is a ZFS system. 2 pools: one for the root, one for anything I
 need
 space for. Both pools reside on the same disks. The root pool is a
 3-way
 mirror, the "space-pool" is a 5-disk raidz2. All jails are on the
 space-pool. The jails are all basejail-style jails.

>>>
>>> While I don't see why xlocking happens, you should be able to dtrace
>>> or printf your way into finding out.
>>
>> dtrace looks to me like a faster approach to get to the root than
>> printf... my first naive try is to detect exclusive locks. I'm not 100%
>> sure I got it right, but at least dtrace doesn't complain about it:
>> ---snip---
>> #pragma D option dynvarsize=32m
>>
>> fbt:nullfs:null_lock:entry
>> /args[0]->a_flags & 0x08 != 0/
>> {
>> stack();
>> }
>> ---snip---
>>
>> In which direction should I look with dtrace if this works in tonights
>> run of periodic? I don't have enough knowledge about VFS to come up
>> with some immediate ideas.
>
> After your sysctl fix for maxvnodes I increased the amount of vnodes 10
> times compared to the initial report. This has increased the speed of
> the operation, the find runs in all those jails finished today after ~5h
> (@~8am) instead of in the afternoon as before. Could this suggest that
> in parallel some null_reclaim() is running which does the exclusive
> locks and slows down the entire operation?
>

That may be a slowdown to some extent, but the primary problem is
exclusive vnode locking for stat lookup, which should not be
happening.

-- 
Mateusz Guzik 



Re: Speed improvements in ZFS

2023-09-04 Thread Alexander Leidinger

Am 2023-08-28 22:33, schrieb Alexander Leidinger:

Am 2023-08-22 18:59, schrieb Mateusz Guzik:

On 8/22/23, Alexander Leidinger  wrote:

Am 2023-08-21 10:53, schrieb Konstantin Belousov:

On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger wrote:

Am 2023-08-20 23:17, schrieb Konstantin Belousov:
> On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
> > On 8/20/23, Alexander Leidinger  wrote:
> > > Am 2023-08-20 22:02, schrieb Mateusz Guzik:
> > >> On 8/20/23, Alexander Leidinger  wrote:
> > >>> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
> >  On 8/18/23, Alexander Leidinger 
> >  wrote:
> > >>>
> > > I have a 51MB text file, compressed to about 1MB. Are you
> > > interested
> > > to
> > > get it?
> > >
> > 
> >  Your problem is not the vnode limit, but nullfs.
> > 
> >  https://people.freebsd.org/~mjg/netchild-periodic-find.svg
> > >>>
> > >>> 122 nullfs mounts on this system. And every jail I setup has
> > >>> several
> > >>> null mounts. One basesystem mounted into every jail, and then
> > >>> shared
> > >>> ports (packages/distfiles/ccache) across all of them.
> > >>>
> >  First, some of the contention is notorious VI_LOCK in order to
> >  do
> >  anything.
> > 
> >  But more importantly the mind-boggling off-cpu time comes from
> >  exclusive locking which should not be there to begin with -- as
> >  in
> >  that xlock in stat should be a slock.
> > 
> >  Maybe I'm going to look into it later.
> > >>>
> > >>> That would be fantastic.
> > >>>
> > >>
> > >> I did a quick test, things are shared locked as expected.
> > >>
> > >> However, I found the following:
> > >> if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
> > >> mp->mnt_kern_flag |=
> > >> lowerrootvp->v_mount->mnt_kern_flag &
> > >> (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
> > >> MNTK_EXTENDED_SHARED);
> > >> }
> > >>
> > >> are you using the "nocache" option? it has a side effect of
> > >> xlocking
> > >
> > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
> > >
> >
> > If you don't have "nocache" on null mounts, then I don't see how
> > this
> > could happen.
>
> There is also MNTK_NULL_NOCACHE on lower fs, which is currently set
> for
> fuse and nfs at least.

11 of those 122 nullfs mounts are ZFS datasets which are also NFS
exported.
6 of those nullfs mounts are also exported via Samba. The NFS exports
shouldn't be needed anymore, I will remove them.

By nfs I meant nfs client, not nfs exports.


No NFS client mounts anywhere on this system. So where is this exclusive
lock coming from then...
This is a ZFS system. 2 pools: one for the root, one for anything I need
space for. Both pools reside on the same disks. The root pool is a 3-way
mirror, the "space-pool" is a 5-disk raidz2. All jails are on the
space-pool. The jails are all basejail-style jails.



While I don't see why xlocking happens, you should be able to dtrace
or printf your way into finding out.


dtrace looks to me like a faster approach to get to the root than 
printf... my first naive try is to detect exclusive locks. I'm not 100% 
sure I got it right, but at least dtrace doesn't complain about it:

---snip---
#pragma D option dynvarsize=32m

fbt:nullfs:null_lock:entry
/args[0]->a_flags & 0x08 != 0/
{
stack();
}
---snip---

In which direction should I look with dtrace if this works in tonight's 
run of periodic? I don't have enough knowledge about VFS to come up 
with some immediate ideas.


After your sysctl fix for maxvnodes I increased the amount of vnodes 10 
times compared to the initial report. This has increased the speed of 
the operation, the find runs in all those jails finished today after ~5h 
(@~8am) instead of in the afternoon as before. Could this suggest that 
in parallel some null_reclaim() is running which does the exclusive 
locks and slows down the entire operation?


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF



Re: Speed improvements in ZFS

2023-08-28 Thread Alexander Leidinger

Am 2023-08-22 18:59, schrieb Mateusz Guzik:

On 8/22/23, Alexander Leidinger  wrote:

Am 2023-08-21 10:53, schrieb Konstantin Belousov:

On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger wrote:

Am 2023-08-20 23:17, schrieb Konstantin Belousov:
> On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
> > On 8/20/23, Alexander Leidinger  wrote:
> > > Am 2023-08-20 22:02, schrieb Mateusz Guzik:
> > >> On 8/20/23, Alexander Leidinger  wrote:
> > >>> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
> >  On 8/18/23, Alexander Leidinger 
> >  wrote:
> > >>>
> > > I have a 51MB text file, compressed to about 1MB. Are you
> > > interested
> > > to
> > > get it?
> > >
> > 
> >  Your problem is not the vnode limit, but nullfs.
> > 
> >  https://people.freebsd.org/~mjg/netchild-periodic-find.svg
> > >>>
> > >>> 122 nullfs mounts on this system. And every jail I setup has
> > >>> several
> > >>> null mounts. One basesystem mounted into every jail, and then
> > >>> shared
> > >>> ports (packages/distfiles/ccache) across all of them.
> > >>>
> >  First, some of the contention is notorious VI_LOCK in order to
> >  do
> >  anything.
> > 
> >  But more importantly the mind-boggling off-cpu time comes from
> >  exclusive locking which should not be there to begin with -- as
> >  in
> >  that xlock in stat should be a slock.
> > 
> >  Maybe I'm going to look into it later.
> > >>>
> > >>> That would be fantastic.
> > >>>
> > >>
> > >> I did a quick test, things are shared locked as expected.
> > >>
> > >> However, I found the following:
> > >> if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
> > >> mp->mnt_kern_flag |=
> > >> lowerrootvp->v_mount->mnt_kern_flag &
> > >> (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
> > >> MNTK_EXTENDED_SHARED);
> > >> }
> > >>
> > >> are you using the "nocache" option? it has a side effect of
> > >> xlocking
> > >
> > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
> > >
> >
> > If you don't have "nocache" on null mounts, then I don't see how
> > this
> > could happen.
>
> There is also MNTK_NULL_NOCACHE on lower fs, which is currently set
> for
> fuse and nfs at least.

11 of those 122 nullfs mounts are ZFS datasets which are also NFS
exported.
6 of those nullfs mounts are also exported via Samba. The NFS exports
shouldn't be needed anymore, I will remove them.

By nfs I meant nfs client, not nfs exports.


No NFS client mounts anywhere on this system. So where is this exclusive
lock coming from then...
This is a ZFS system. 2 pools: one for the root, one for anything I need
space for. Both pools reside on the same disks. The root pool is a 3-way
mirror, the "space-pool" is a 5-disk raidz2. All jails are on the
space-pool. The jails are all basejail-style jails.



While I don't see why xlocking happens, you should be able to dtrace
or printf your way into finding out.


dtrace looks to me like a faster approach to get to the root than 
printf... my first naive try is to detect exclusive locks. I'm not 100% 
sure I got it right, but at least dtrace doesn't complain about it:

---snip---
#pragma D option dynvarsize=32m

fbt:nullfs:null_lock:entry
/args[0]->a_flags & 0x08 != 0/
{
stack();
}
---snip---

In which direction should I look with dtrace if this works in tonight's 
run of periodic? I don't have enough knowledge about VFS to come up with 
some immediate ideas.
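
A possible refinement of the predicate (just a sketch, not something verified 
in this thread): a_flags for VOP_LOCK carries the lockmgr flags, where 
LK_EXCLUSIVE is 0x080000 (see sys/sys/lockmgr.h), and the test needs 
parentheses so that the '&' is evaluated before the '!='. Aggregating the 
stacks also keeps the output manageable:

---snip---
dtrace -n 'fbt:nullfs:null_lock:entry
    /(args[0]->a_flags & 0x080000) != 0/
    { @[stack()] = count(); }'
---snip---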


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF



Re: Speed improvements in ZFS

2023-08-22 Thread Mateusz Guzik
On 8/22/23, Alexander Leidinger  wrote:
> Am 2023-08-21 10:53, schrieb Konstantin Belousov:
>> On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger wrote:
>>> Am 2023-08-20 23:17, schrieb Konstantin Belousov:
>>> > On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
>>> > > On 8/20/23, Alexander Leidinger  wrote:
>>> > > > Am 2023-08-20 22:02, schrieb Mateusz Guzik:
>>> > > >> On 8/20/23, Alexander Leidinger  wrote:
>>> > > >>> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
>>> > >  On 8/18/23, Alexander Leidinger 
>>> > >  wrote:
>>> > > >>>
>>> > > > I have a 51MB text file, compressed to about 1MB. Are you
>>> > > > interested
>>> > > > to
>>> > > > get it?
>>> > > >
>>> > > 
>>> > >  Your problem is not the vnode limit, but nullfs.
>>> > > 
>>> > >  https://people.freebsd.org/~mjg/netchild-periodic-find.svg
>>> > > >>>
>>> > > >>> 122 nullfs mounts on this system. And every jail I setup has
>>> > > >>> several
>>> > > >>> null mounts. One basesystem mounted into every jail, and then
>>> > > >>> shared
>>> > > >>> ports (packages/distfiles/ccache) across all of them.
>>> > > >>>
>>> > >  First, some of the contention is notorious VI_LOCK in order to
>>> > >  do
>>> > >  anything.
>>> > > 
>>> > >  But more importantly the mind-boggling off-cpu time comes from
>>> > >  exclusive locking which should not be there to begin with -- as
>>> > >  in
>>> > >  that xlock in stat should be a slock.
>>> > > 
>>> > >  Maybe I'm going to look into it later.
>>> > > >>>
>>> > > >>> That would be fantastic.
>>> > > >>>
>>> > > >>
>>> > > >> I did a quick test, things are shared locked as expected.
>>> > > >>
>>> > > >> However, I found the following:
>>> > > >> if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
>>> > > >> mp->mnt_kern_flag |=
>>> > > >> lowerrootvp->v_mount->mnt_kern_flag &
>>> > > >> (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
>>> > > >> MNTK_EXTENDED_SHARED);
>>> > > >> }
>>> > > >>
>>> > > >> are you using the "nocache" option? it has a side effect of
>>> > > >> xlocking
>>> > > >
>>> > > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
>>> > > >
>>> > >
>>> > > If you don't have "nocache" on null mounts, then I don't see how
>>> > > this
>>> > > could happen.
>>> >
>>> > There is also MNTK_NULL_NOCACHE on lower fs, which is currently set
>>> > for
>>> > fuse and nfs at least.
>>>
>>> 11 of those 122 nullfs mounts are ZFS datasets which are also NFS
>>> exported.
>>> 6 of those nullfs mounts are also exported via Samba. The NFS exports
>>> shouldn't be needed anymore, I will remove them.
>> By nfs I meant nfs client, not nfs exports.
>
> No NFS client mounts anywhere on this system. So where is this exclusive
> lock coming from then...
> This is a ZFS system. 2 pools: one for the root, one for anything I need
> space for. Both pools reside on the same disks. The root pool is a 3-way
> mirror, the "space-pool" is a 5-disk raidz2. All jails are on the
> space-pool. The jails are all basejail-style jails.
>

While I don't see why xlocking happens, you should be able to dtrace
or printf your way into finding out.

-- 
Mateusz Guzik 



Re: Speed improvements in ZFS

2023-08-21 Thread Alexander Leidinger

Am 2023-08-21 10:53, schrieb Konstantin Belousov:

On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger wrote:

Am 2023-08-20 23:17, schrieb Konstantin Belousov:
> On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
> > On 8/20/23, Alexander Leidinger  wrote:
> > > Am 2023-08-20 22:02, schrieb Mateusz Guzik:
> > >> On 8/20/23, Alexander Leidinger  wrote:
> > >>> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
> >  On 8/18/23, Alexander Leidinger  wrote:
> > >>>
> > > I have a 51MB text file, compressed to about 1MB. Are you interested
> > > to
> > > get it?
> > >
> > 
> >  Your problem is not the vnode limit, but nullfs.
> > 
> >  https://people.freebsd.org/~mjg/netchild-periodic-find.svg
> > >>>
> > >>> 122 nullfs mounts on this system. And every jail I setup has several
> > >>> null mounts. One basesystem mounted into every jail, and then shared
> > >>> ports (packages/distfiles/ccache) across all of them.
> > >>>
> >  First, some of the contention is notorious VI_LOCK in order to do
> >  anything.
> > 
> >  But more importantly the mind-boggling off-cpu time comes from
> >  exclusive locking which should not be there to begin with -- as in
> >  that xlock in stat should be a slock.
> > 
> >  Maybe I'm going to look into it later.
> > >>>
> > >>> That would be fantastic.
> > >>>
> > >>
> > >> I did a quick test, things are shared locked as expected.
> > >>
> > >> However, I found the following:
> > >> if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
> > >> mp->mnt_kern_flag |=
> > >> lowerrootvp->v_mount->mnt_kern_flag &
> > >> (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
> > >> MNTK_EXTENDED_SHARED);
> > >> }
> > >>
> > >> are you using the "nocache" option? it has a side effect of xlocking
> > >
> > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
> > >
> >
> > If you don't have "nocache" on null mounts, then I don't see how this
> > could happen.
>
> There is also MNTK_NULL_NOCACHE on lower fs, which is currently set for
> fuse and nfs at least.

11 of those 122 nullfs mounts are ZFS datasets which are also NFS exported.
6 of those nullfs mounts are also exported via Samba. The NFS exports
shouldn't be needed anymore, I will remove them.

By nfs I meant nfs client, not nfs exports.


No NFS client mounts anywhere on this system. So where is this exclusive 
lock coming from then...
This is a ZFS system. 2 pools: one for the root, one for anything I need 
space for. Both pools reside on the same disks. The root pool is a 3-way 
mirror, the "space-pool" is a 5-disk raidz2. All jails are on the 
space-pool. The jails are all basejail-style jails.


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF



Re: Speed improvements in ZFS

2023-08-21 Thread Konstantin Belousov
On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger wrote:
> Am 2023-08-20 23:17, schrieb Konstantin Belousov:
> > On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
> > > On 8/20/23, Alexander Leidinger  wrote:
> > > > Am 2023-08-20 22:02, schrieb Mateusz Guzik:
> > > >> On 8/20/23, Alexander Leidinger  wrote:
> > > >>> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
> > >  On 8/18/23, Alexander Leidinger  wrote:
> > > >>>
> > > > I have a 51MB text file, compressed to about 1MB. Are you interested
> > > > to
> > > > get it?
> > > >
> > > 
> > >  Your problem is not the vnode limit, but nullfs.
> > > 
> > >  https://people.freebsd.org/~mjg/netchild-periodic-find.svg
> > > >>>
> > > >>> 122 nullfs mounts on this system. And every jail I setup has several
> > > >>> null mounts. One basesystem mounted into every jail, and then shared
> > > >>> ports (packages/distfiles/ccache) across all of them.
> > > >>>
> > >  First, some of the contention is notorious VI_LOCK in order to do
> > >  anything.
> > > 
> > >  But more importantly the mind-boggling off-cpu time comes from
> > >  exclusive locking which should not be there to begin with -- as in
> > >  that xlock in stat should be a slock.
> > > 
> > >  Maybe I'm going to look into it later.
> > > >>>
> > > >>> That would be fantastic.
> > > >>>
> > > >>
> > > >> I did a quick test, things are shared locked as expected.
> > > >>
> > > >> However, I found the following:
> > > >> if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
> > > >> mp->mnt_kern_flag |=
> > > >> lowerrootvp->v_mount->mnt_kern_flag &
> > > >> (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
> > > >> MNTK_EXTENDED_SHARED);
> > > >> }
> > > >>
> > > >> are you using the "nocache" option? it has a side effect of xlocking
> > > >
> > > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
> > > >
> > > 
> > > If you don't have "nocache" on null mounts, then I don't see how this
> > > could happen.
> > 
> > There is also MNTK_NULL_NOCACHE on lower fs, which is currently set for
> > fuse and nfs at least.
> 
> 11 of those 122 nullfs mounts are ZFS datasets which are also NFS exported.
> 6 of those nullfs mounts are also exported via Samba. The NFS exports
> shouldn't be needed anymore, I will remove them.
By nfs I meant nfs client, not nfs exports.

> 
> Shouldn't this implicit nocache propagate to the mount of the upper fs to
> give the user feedback about the effective state?
> 
> Bye,
> Alexander.
> 
> -- 
> http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
> http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF



Re: Speed improvements in ZFS

2023-08-21 Thread Alexander Leidinger

Am 2023-08-20 23:17, schrieb Konstantin Belousov:

On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:

On 8/20/23, Alexander Leidinger  wrote:
> Am 2023-08-20 22:02, schrieb Mateusz Guzik:
>> On 8/20/23, Alexander Leidinger  wrote:
>>> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
 On 8/18/23, Alexander Leidinger  wrote:
>>>
> I have a 51MB text file, compressed to about 1MB. Are you interested
> to
> get it?
>

 Your problem is not the vnode limit, but nullfs.

 https://people.freebsd.org/~mjg/netchild-periodic-find.svg
>>>
>>> 122 nullfs mounts on this system. And every jail I setup has several
>>> null mounts. One basesystem mounted into every jail, and then shared
>>> ports (packages/distfiles/ccache) across all of them.
>>>
 First, some of the contention is notorious VI_LOCK in order to do
 anything.

 But more importantly the mind-boggling off-cpu time comes from
 exclusive locking which should not be there to begin with -- as in
 that xlock in stat should be a slock.

 Maybe I'm going to look into it later.
>>>
>>> That would be fantastic.
>>>
>>
>> I did a quick test, things are shared locked as expected.
>>
>> However, I found the following:
>> if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
>> mp->mnt_kern_flag |=
>> lowerrootvp->v_mount->mnt_kern_flag &
>> (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
>> MNTK_EXTENDED_SHARED);
>> }
>>
>> are you using the "nocache" option? it has a side effect of xlocking
>
> I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
>

If you don't have "nocache" on null mounts, then I don't see how this
could happen.


There is also MNTK_NULL_NOCACHE on lower fs, which is currently set for
fuse and nfs at least.


11 of those 122 nullfs mounts are ZFS datasets which are also NFS 
exported. 6 of those nullfs mounts are also exported via Samba. The NFS 
exports shouldn't be needed anymore, I will remove them.
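
A sketch of that cleanup, assuming the exports were set via the ZFS sharenfs 
property rather than /etc/exports (dataset names are placeholders):

---snip---
# see which datasets currently have NFS sharing enabled
zfs get -t filesystem -o name,value sharenfs | grep -vw off

# stop exporting a dataset
zfs set sharenfs=off pool/dataset
---snip---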


Shouldn't this implicit nocache propagate to the mount of the upper fs 
to give the user feedback about the effective state?


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF



Re: Speed improvements in ZFS

2023-08-20 Thread Konstantin Belousov
On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
> On 8/20/23, Alexander Leidinger  wrote:
> > Am 2023-08-20 22:02, schrieb Mateusz Guzik:
> >> On 8/20/23, Alexander Leidinger  wrote:
> >>> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
>  On 8/18/23, Alexander Leidinger  wrote:
> >>>
> > I have a 51MB text file, compressed to about 1MB. Are you interested
> > to
> > get it?
> >
> 
>  Your problem is not the vnode limit, but nullfs.
> 
>  https://people.freebsd.org/~mjg/netchild-periodic-find.svg
> >>>
> >>> 122 nullfs mounts on this system. And every jail I setup has several
> >>> null mounts. One basesystem mounted into every jail, and then shared
> >>> ports (packages/distfiles/ccache) across all of them.
> >>>
>  First, some of the contention is notorious VI_LOCK in order to do
>  anything.
> 
>  But more importantly the mind-boggling off-cpu time comes from
>  exclusive locking which should not be there to begin with -- as in
>  that xlock in stat should be a slock.
> 
>  Maybe I'm going to look into it later.
> >>>
> >>> That would be fantastic.
> >>>
> >>
> >> I did a quick test, things are shared locked as expected.
> >>
> >> However, I found the following:
> >> if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
> >> mp->mnt_kern_flag |=
> >> lowerrootvp->v_mount->mnt_kern_flag &
> >> (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
> >> MNTK_EXTENDED_SHARED);
> >> }
> >>
> >> are you using the "nocache" option? it has a side effect of xlocking
> >
> > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
> >
> 
> If you don't have "nocache" on null mounts, then I don't see how this
> could happen.

There is also MNTK_NULL_NOCACHE on lower fs, which is currently set for
fuse and nfs at least.



Re: Speed improvements in ZFS

2023-08-20 Thread Mateusz Guzik
On 8/20/23, Alexander Leidinger  wrote:
> Am 2023-08-20 22:02, schrieb Mateusz Guzik:
>> On 8/20/23, Alexander Leidinger  wrote:
>>> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
 On 8/18/23, Alexander Leidinger  wrote:
>>>
> I have a 51MB text file, compressed to about 1MB. Are you interested
> to
> get it?
>

 Your problem is not the vnode limit, but nullfs.

 https://people.freebsd.org/~mjg/netchild-periodic-find.svg
>>>
>>> 122 nullfs mounts on this system. And every jail I setup has several
>>> null mounts. One basesystem mounted into every jail, and then shared
>>> ports (packages/distfiles/ccache) across all of them.
>>>
 First, some of the contention is notorious VI_LOCK in order to do
 anything.

 But more importantly the mind-boggling off-cpu time comes from
 exclusive locking which should not be there to begin with -- as in
 that xlock in stat should be a slock.

 Maybe I'm going to look into it later.
>>>
>>> That would be fantastic.
>>>
>>
>> I did a quick test, things are shared locked as expected.
>>
>> However, I found the following:
>> if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
>> mp->mnt_kern_flag |=
>> lowerrootvp->v_mount->mnt_kern_flag &
>> (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
>> MNTK_EXTENDED_SHARED);
>> }
>>
>> are you using the "nocache" option? it has a side effect of xlocking
>
> I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
>

If you don't have "nocache" on null mounts, then I don't see how this
could happen.

-- 
Mateusz Guzik 



Re: Speed improvements in ZFS

2023-08-20 Thread Alexander Leidinger

Am 2023-08-20 22:02, schrieb Mateusz Guzik:

On 8/20/23, Alexander Leidinger  wrote:

Am 2023-08-20 19:10, schrieb Mateusz Guzik:

On 8/18/23, Alexander Leidinger  wrote:



I have a 51MB text file, compressed to about 1MB. Are you interested
to
get it?



Your problem is not the vnode limit, but nullfs.

https://people.freebsd.org/~mjg/netchild-periodic-find.svg


122 nullfs mounts on this system. And every jail I setup has several
null mounts. One basesystem mounted into every jail, and then shared
ports (packages/distfiles/ccache) across all of them.


First, some of the contention is notorious VI_LOCK in order to do
anything.

But more importantly the mind-boggling off-cpu time comes from
exclusive locking which should not be there to begin with -- as in
that xlock in stat should be a slock.

Maybe I'm going to look into it later.


That would be fantastic.



I did a quick test, things are shared locked as expected.

However, I found the following:
if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
	mp->mnt_kern_flag |= lowerrootvp->v_mount->mnt_kern_flag &
	    (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
	    MNTK_EXTENDED_SHARED);
}

are you using the "nocache" option? it has a side effect of xlocking


I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.

Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF



Re: Speed improvements in ZFS

2023-08-20 Thread Mateusz Guzik
On 8/20/23, Alexander Leidinger  wrote:
> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
>> On 8/18/23, Alexander Leidinger  wrote:
>
>>> I have a 51MB text file, compressed to about 1MB. Are you interested
>>> to
>>> get it?
>>>
>>
>> Your problem is not the vnode limit, but nullfs.
>>
>> https://people.freebsd.org/~mjg/netchild-periodic-find.svg
>
> 122 nullfs mounts on this system. And every jail I setup has several
> null mounts. One basesystem mounted into every jail, and then shared
> ports (packages/distfiles/ccache) across all of them.
>
>> First, some of the contention is notorious VI_LOCK in order to do
>> anything.
>>
>> But more importantly the mind-boggling off-cpu time comes from
>> exclusive locking which should not be there to begin with -- as in
>> that xlock in stat should be a slock.
>>
>> Maybe I'm going to look into it later.
>
> That would be fantastic.
>

I did a quick test, things are shared locked as expected.

However, I found the following:
if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
	mp->mnt_kern_flag |= lowerrootvp->v_mount->mnt_kern_flag &
	    (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
	    MNTK_EXTENDED_SHARED);
}

are you using the "nocache" option? it has a side effect of xlocking
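
A quick way to check is to grep wherever the null mounts are declared, e.g. 
fstab, per-jail fstabs, or jail.conf (the paths below are just guesses about 
your setup):

---snip---
grep nullfs /etc/fstab /etc/fstab.* /etc/jail.conf 2>/dev/null | grep nocache
---snip---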

-- 
Mateusz Guzik 



Re: Speed improvements in ZFS

2023-08-20 Thread Alexander Leidinger

Am 2023-08-20 19:10, schrieb Mateusz Guzik:

On 8/18/23, Alexander Leidinger  wrote:


I have a 51MB text file, compressed to about 1MB. Are you interested to
get it?



Your problem is not the vnode limit, but nullfs.

https://people.freebsd.org/~mjg/netchild-periodic-find.svg


122 nullfs mounts on this system. And every jail I setup has several 
null mounts. One basesystem mounted into every jail, and then shared 
ports (packages/distfiles/ccache) across all of them.


First, some of the contention is notorious VI_LOCK in order to do 
anything.


But more importantly the mind-boggling off-cpu time comes from
exclusive locking which should not be there to begin with -- as in
that xlock in stat should be a slock.

Maybe I'm going to look into it later.


That would be fantastic.

Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF



Re: Speed improvements in ZFS

2023-08-20 Thread Mateusz Guzik
On 8/18/23, Alexander Leidinger  wrote:
> Am 2023-08-16 18:48, schrieb Alexander Leidinger:
>> Am 2023-08-15 23:29, schrieb Mateusz Guzik:
>>> On 8/15/23, Alexander Leidinger  wrote:
 Am 2023-08-15 14:41, schrieb Mateusz Guzik:

> With this in mind can you provide: sysctl kern.maxvnodes
> vfs.wantfreevnodes vfs.freevnodes vfs.vnodes_created vfs.numvnodes
> vfs.recycles_free vfs.recycles

 After a reboot:
 kern.maxvnodes: 10485760
 vfs.wantfreevnodes: 2621440
 vfs.freevnodes: 24696
 vfs.vnodes_created: 1658162
 vfs.numvnodes: 173937
 vfs.recycles_free: 0
 vfs.recycles: 0
>>
>> New values after one rund of periodic:
>> kern.maxvnodes: 10485760
>> vfs.wantfreevnodes: 2621440
>> vfs.freevnodes: 356202
>> vfs.vnodes_created: 427696288
>> vfs.numvnodes: 532620
>> vfs.recycles_free: 20213257
>> vfs.recycles: 0
>
> And after the second round which only took 7h this night:
> kern.maxvnodes: 10485760
> vfs.wantfreevnodes: 2621440
> vfs.freevnodes: 3071754
> vfs.vnodes_created: 1275963316
> vfs.numvnodes: 3414906
> vfs.recycles_free: 58411371
> vfs.recycles: 0
>
> Meanwhile if there is tons of recycles, you can damage control by
> bumping kern.maxvnodes.
>>
>> What's the difference between recycles and recycles_free? Does the
>> above count as bumping the maxvnodes?
>
> ^
>
 Looks like there are not much free directly after the reboot. I will
 check the values tomorrow after the periodic run again and maybe
 increase by 10 or 100 so see if it makes a difference.

> If this is not the problem you can use dtrace to figure it out.

 dtrace-count on vnlru_read_freevnodes() and vnlru_free_locked()? Or
 something else?

>>>
>>> I mean checking where find is spending time instead of speculating.
>>>
>>> There is no productized way to do it so to speak, but the following
>>> crapper should be good enough:
>> [script]
>>
>> I will let it run this night.
>
> I have a 51MB text file, compressed to about 1MB. Are you interested to
> get it?
>

Your problem is not the vnode limit, but nullfs.

https://people.freebsd.org/~mjg/netchild-periodic-find.svg

First, some of the contention is notorious VI_LOCK in order to do anything.

But more importantly the mind-boggling off-cpu time comes from
exclusive locking which should not be there to begin with -- as in
that xlock in stat should be a slock.

Maybe I'm going to look into it later.

-- 
Mateusz Guzik 



Re: Speed improvements in ZFS

2023-08-18 Thread Mateusz Guzik
On 8/18/23, Alexander Leidinger  wrote:
> Am 2023-08-16 18:48, schrieb Alexander Leidinger:
>> Am 2023-08-15 23:29, schrieb Mateusz Guzik:
>>> On 8/15/23, Alexander Leidinger  wrote:
 Am 2023-08-15 14:41, schrieb Mateusz Guzik:

> With this in mind can you provide: sysctl kern.maxvnodes
> vfs.wantfreevnodes vfs.freevnodes vfs.vnodes_created vfs.numvnodes
> vfs.recycles_free vfs.recycles

 After a reboot:
 kern.maxvnodes: 10485760
 vfs.wantfreevnodes: 2621440
 vfs.freevnodes: 24696
 vfs.vnodes_created: 1658162
 vfs.numvnodes: 173937
 vfs.recycles_free: 0
 vfs.recycles: 0
>>
>> New values after one rund of periodic:
>> kern.maxvnodes: 10485760
>> vfs.wantfreevnodes: 2621440
>> vfs.freevnodes: 356202
>> vfs.vnodes_created: 427696288
>> vfs.numvnodes: 532620
>> vfs.recycles_free: 20213257
>> vfs.recycles: 0
>
> And after the second round which only took 7h this night:
> kern.maxvnodes: 10485760
> vfs.wantfreevnodes: 2621440
> vfs.freevnodes: 3071754
> vfs.vnodes_created: 1275963316
> vfs.numvnodes: 3414906
> vfs.recycles_free: 58411371
> vfs.recycles: 0
>

so your setup has a vastly higher number of vnodes to inspect than the
number of vnodes it allows to exist at the same time, which further
suggests it easily may be about that msleep.

> Meanwhile if there is tons of recycles, you can damage control by
> bumping kern.maxvnodes.
>>
>> What's the difference between recycles and recycles_free? Does the
>> above count as bumping the maxvnodes?
>
> ^
>

"free" vnodes are just hanging around and can be directly whacked, the
others are used but *maybe* freeable (say a directory with a bunch of
vnodes already established).

 Looks like there are not much free directly after the reboot. I will
 check the values tomorrow after the periodic run again and maybe
 increase by 10 or 100 so see if it makes a difference.

> If this is not the problem you can use dtrace to figure it out.

 dtrace-count on vnlru_read_freevnodes() and vnlru_free_locked()? Or
 something else?

>>>
>>> I mean checking where find is spending time instead of speculating.
>>>
>>> There is no productized way to do it so to speak, but the following
>>> crapper should be good enough:
>> [script]
>>
>> I will let it run this night.
>
> I have a 51MB text file, compressed to about 1MB. Are you interested to
> get it?
>

Yea, put it on freefall for example.

or feed it directly to flamegraph: cat file | ./stackcollapse.pl |
./flamegraph.pl > out.svg

see this repo https://github.com/brendangregg/FlameGraph.git
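
End to end that would be roughly (assuming the dtrace output from the script 
landed in a file named "output"):

---snip---
git clone https://github.com/brendangregg/FlameGraph.git
cat output | ./FlameGraph/stackcollapse.pl | ./FlameGraph/flamegraph.pl > out.svg
---snip---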


-- 
Mateusz Guzik 



Re: Speed improvements in ZFS

2023-08-18 Thread Alexander Leidinger

Am 2023-08-16 18:48, schrieb Alexander Leidinger:

Am 2023-08-15 23:29, schrieb Mateusz Guzik:

On 8/15/23, Alexander Leidinger  wrote:

Am 2023-08-15 14:41, schrieb Mateusz Guzik:


With this in mind can you provide: sysctl kern.maxvnodes
vfs.wantfreevnodes vfs.freevnodes vfs.vnodes_created vfs.numvnodes
vfs.recycles_free vfs.recycles


After a reboot:
kern.maxvnodes: 10485760
vfs.wantfreevnodes: 2621440
vfs.freevnodes: 24696
vfs.vnodes_created: 1658162
vfs.numvnodes: 173937
vfs.recycles_free: 0
vfs.recycles: 0


New values after one round of periodic:
kern.maxvnodes: 10485760
vfs.wantfreevnodes: 2621440
vfs.freevnodes: 356202
vfs.vnodes_created: 427696288
vfs.numvnodes: 532620
vfs.recycles_free: 20213257
vfs.recycles: 0


And after the second round which only took 7h this night:
kern.maxvnodes: 10485760
vfs.wantfreevnodes: 2621440
vfs.freevnodes: 3071754
vfs.vnodes_created: 1275963316
vfs.numvnodes: 3414906
vfs.recycles_free: 58411371
vfs.recycles: 0


Meanwhile if there is tons of recycles, you can damage control by
bumping kern.maxvnodes.


What's the difference between recycles and recycles_free? Does the 
above count as bumping the maxvnodes?


^


Looks like there are not much free directly after the reboot. I will
check the values tomorrow after the periodic run again and maybe
increase by 10 or 100 so see if it makes a difference.


If this is not the problem you can use dtrace to figure it out.


dtrace-count on vnlru_read_freevnodes() and vnlru_free_locked()? Or
something else?



I mean checking where find is spending time instead of speculating.

There is no productized way to do it so to speak, but the following
crapper should be good enough:

[script]

I will let it run this night.


I have a 51MB text file, compressed to about 1MB. Are you interested to 
get it?


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF



Re: Speed improvements in ZFS

2023-08-16 Thread Alexander Leidinger

Am 2023-08-15 23:29, schrieb Mateusz Guzik:

On 8/15/23, Alexander Leidinger  wrote:

Am 2023-08-15 14:41, schrieb Mateusz Guzik:


With this in mind can you provide: sysctl kern.maxvnodes
vfs.wantfreevnodes vfs.freevnodes vfs.vnodes_created vfs.numvnodes
vfs.recycles_free vfs.recycles


After a reboot:
kern.maxvnodes: 10485760
vfs.wantfreevnodes: 2621440
vfs.freevnodes: 24696
vfs.vnodes_created: 1658162
vfs.numvnodes: 173937
vfs.recycles_free: 0
vfs.recycles: 0


New values after one round of periodic:
kern.maxvnodes: 10485760
vfs.wantfreevnodes: 2621440
vfs.freevnodes: 356202
vfs.vnodes_created: 427696288
vfs.numvnodes: 532620
vfs.recycles_free: 20213257
vfs.recycles: 0


Meanwhile if there is tons of recycles, you can damage control by
bumping kern.maxvnodes.


What's the difference between recycles and recycles_free? Does the above 
count as bumping the maxvnodes?



Looks like there are not much free directly after the reboot. I will
check the values tomorrow after the periodic run again and maybe
increase by 10 or 100 so see if it makes a difference.


If this is not the problem you can use dtrace to figure it out.


dtrace-count on vnlru_read_freevnodes() and vnlru_free_locked()? Or
something else?



I mean checking where find is spending time instead of speculating.

There is no productized way to do it so to speak, but the following
crapper should be good enough:

[script]

I will let it run this night.

Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF



Re: Speed improvements in ZFS

2023-08-15 Thread Mateusz Guzik
On 8/15/23, Alexander Leidinger  wrote:
> Am 2023-08-15 14:41, schrieb Mateusz Guzik:
>
>> With this in mind can you provide: sysctl kern.maxvnodes
>> vfs.wantfreevnodes vfs.freevnodes vfs.vnodes_created vfs.numvnodes
>> vfs.recycles_free vfs.recycles
>
> After a reboot:
> kern.maxvnodes: 10485760
> vfs.wantfreevnodes: 2621440
> vfs.freevnodes: 24696
> vfs.vnodes_created: 1658162
> vfs.numvnodes: 173937
> vfs.recycles_free: 0
> vfs.recycles: 0
>
>> Meanwhile if there is tons of recycles, you can damage control by
>> bumping kern.maxvnodes.
>
> Looks like there are not much free directly after the reboot. I will
> check the values tomorrow after the periodic run again and maybe
> increase by 10 or 100 so see if it makes a difference.
>
>> If this is not the problem you can use dtrace to figure it out.
>
> dtrace-count on vnlru_read_freevnodes() and vnlru_free_locked()? Or
> something else?
>

I mean checking where find is spending time instead of speculating.

There is no productized way to do it so to speak, but the following
crapper should be good enough:
#pragma D option dynvarsize=32m

profile:::profile-997
/execname == "find"/
{
	@oncpu[stack(), "oncpu"] = count();
}

/*
 * The p_flag & 0x4 test filters out kernel threads.
 */

sched:::off-cpu
/execname == "find"/
{
	self->ts = timestamp;
}

sched:::on-cpu
/self->ts/
{
	@offcpu[stack(30), "offcpu"] = sum(timestamp - self->ts);
	self->ts = 0;
}

dtrace:::END
{
	normalize(@offcpu, 100);
	printa("%k\n%s\n%@d\n\n", @offcpu);
	printa("%k\n%s\n%@d\n\n", @oncpu);
}

just leave it running as: dtrace -s script.d -o output

kill it after periodic finishes. it blindly assumes there will be no
other processes named "find" messing around.


> Bye,
> Alexander.
>
> --
> http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
> http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF
>


-- 
Mateusz Guzik 



Re: Speed improvements in ZFS

2023-08-15 Thread joe mcguckin
When I watch the activity leds of the disks on one of our ZFS servers, I notice 
there will be a burst of activity for 3 or 4 seconds, then no activity for a 
couple of seconds, then it repeats.
Is that normal?

Thanks,

joe


Joe McGuckin
ViaNet Communications

j...@via.net
650-207-0372 cell
650-213-1302 office
650-969-2124 fax



> On Aug 15, 2023, at 12:33 PM, Alexander Leidinger  
> wrote:
> 
> Am 2023-08-15 14:41, schrieb Mateusz Guzik:
> 
>> With this in mind can you provide: sysctl kern.maxvnodes
>> vfs.wantfreevnodes vfs.freevnodes vfs.vnodes_created vfs.numvnodes
>> vfs.recycles_free vfs.recycles
> 
> After a reboot:
> kern.maxvnodes: 10485760
> vfs.wantfreevnodes: 2621440
> vfs.freevnodes: 24696
> vfs.vnodes_created: 1658162
> vfs.numvnodes: 173937
> vfs.recycles_free: 0
> vfs.recycles: 0
> 
>> Meanwhile if there is tons of recycles, you can damage control by
>> bumping kern.maxvnodes.
> 
> Looks like there are not much free directly after the reboot. I will check 
> the values tomorrow after the periodic run again and maybe increase by 10 or 
> 100 so see if it makes a difference.
> 
>> If this is not the problem you can use dtrace to figure it out.
> 
> dtrace-count on vnlru_read_freevnodes() and vnlru_free_locked()? Or something 
> else?
> 
> Bye,
> Alexander.
> 
> -- 
> http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
> http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF



Re: Speed improvements in ZFS

2023-08-15 Thread Alexander Leidinger

Am 2023-08-15 14:41, schrieb Mateusz Guzik:


With this in mind can you provide: sysctl kern.maxvnodes
vfs.wantfreevnodes vfs.freevnodes vfs.vnodes_created vfs.numvnodes
vfs.recycles_free vfs.recycles


After a reboot:
kern.maxvnodes: 10485760
vfs.wantfreevnodes: 2621440
vfs.freevnodes: 24696
vfs.vnodes_created: 1658162
vfs.numvnodes: 173937
vfs.recycles_free: 0
vfs.recycles: 0


Meanwhile if there is tons of recycles, you can damage control by
bumping kern.maxvnodes.


Looks like there are not many free vnodes directly after the reboot. I will 
check the values tomorrow after the periodic run again and maybe 
increase by 10x or 100x to see if it makes a difference.



If this is not the problem you can use dtrace to figure it out.


dtrace-count on vnlru_read_freevnodes() and vnlru_free_locked()? Or 
something else?
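
Concretely, that would be roughly this one-liner (only a sketch, and it 
assumes fbt can attach to both functions, i.e. they are not inlined):

---snip---
dtrace -n 'fbt::vnlru_read_freevnodes:entry, fbt::vnlru_free_locked:entry
    { @[probefunc] = count(); }'
---snip---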


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF



Re: Speed improvements in ZFS

2023-08-15 Thread Mateusz Guzik
On 8/15/23, Alexander Leidinger  wrote:
> Hi,
>
> just a report that I noticed a very high speed improvement in ZFS in
> -current. Since a looong time (at least since last year), for a
> jail-host of mine with about >20 jails on it which each runs periodic
> daily, the periodic daily runs of the jails take from about 3 am to 5pm
> or longer. I don't remember when this started, and I thought at that
> time that the problem may be data related. It's the long runs of "find"
> in one of the periodic daily jobs which takes that long, and the number
> of jails together with null-mounted basesystem inside the jail and a
> null-mounted package repository inside each jail the number of files and
> congruent access to the spining rust with first SSD and now NVME based
> cache may have reached some tipping point. I have all the periodic daily
> mails around, so theoretically I may be able to find when this started,
> but as can be seen in another mail to this mailinglist, the system which
> has all the periodic mails has some issues which have higher priority
> for me to track down...
>
> Since I updated to a src from 2023-07-20, this is not the case anymore.
> The data is the same (maybe even a bit more, as I have added 2 more
> jails since then and the periodic daily runs which run more or less in
> parallel, are not taking considerably longer). The speed increase with
> the July-build are in the area of 3-4 hours for 23 parallel periodic
> daily runs. So instead of finishing the periodic runs around 5pm, they
> finish already around 1pm/2pm.
>
> So whatever was done inside ZFS or VFS or nullfs between 2023-06-19 and
> 2023-07-20 has given a huge speed improvement. From my memory I would
> say there is still room for improvement, as I think it may be the case
> that the periodic daily runs ended in the morning instead of the
> afteroon, but my memory may be flaky in this regard...
>
> Great work to whoever was involved.
>

Several hours to run periodic is still unusably slow.

Have you tried figuring out where the time is spent?

I don't know what caused the change here, but do know of one major
bottleneck which you are almost guaranteed to run into if you inspect
all files everywhere -- namely bumping over a vnode limit.

In vn_alloc_hard you can find:
	msleep(&vnlruproc_sig, &vnode_list_mtx, PVFS, "vlruwk", hz);
	if (atomic_load_long(&numvnodes) + 1 > desiredvnodes &&
	    vnlru_read_freevnodes() > 1)
		vnlru_free_locked(1);

that is, the allocating thread will sleep up to 1 second if there are
no vnodes up for grabs and then go ahead and allocate one anyway.
Going over the numvnodes is partially rate-limited, but in a manner
which is not very usable.

The entire mechanism is mostly borked and in desperate need of a rewrite.

With this in mind can you provide: sysctl kern.maxvnodes
vfs.wantfreevnodes vfs.freevnodes vfs.vnodes_created vfs.numvnodes
vfs.recycles_free vfs.recycles
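
If it helps, a simple way to capture these over time while periodic runs,
plus a check for threads stuck in the "vlruwk" sleep (just a sketch):

---snip---
while :; do
	date
	sysctl kern.maxvnodes vfs.wantfreevnodes vfs.freevnodes \
	    vfs.vnodes_created vfs.numvnodes vfs.recycles_free vfs.recycles
	ps -axo pid,state,wchan,comm | grep vlruwk
	sleep 60
done
---snip---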

Meanwhile if there is tons of recycles, you can damage control by
bumping kern.maxvnodes.

If this is not the problem you can use dtrace to figure it out.

-- 
Mateusz Guzik 



Re: Speed improvements in ZFS

2023-08-15 Thread Graham Perrin

On 15/08/2023 13:05, Alexander Leidinger wrote:

… periodic runs …


Here, I get a sense that these benefit greatly from L2ARC.

… So whatever was done inside ZFS or VFS or nullfs between 2023-06-19 
and 2023-07-20 has given a huge speed improvement. …


(Link elided) Not within the time frame; the most recent commit was 2023-06-16.

(Link elided) The two most recent commits might be of interest. (Related links elided.)



Speed improvements in ZFS

2023-08-15 Thread Alexander Leidinger

Hi,

just a report that I noticed a very large speed improvement in ZFS in 
-current. For a long time (at least since last year), on a jail-host of 
mine with more than 20 jails, each of which runs periodic daily, the 
periodic daily runs of the jails took from about 3 am until 5 pm or 
longer. I don't remember when this started, and I thought at the time 
that the problem might be data related. It's the long runs of "find" 
in one of the periodic daily jobs which take that long, and the number 
of jails, together with the null-mounted basesystem and a null-mounted 
package repository inside each jail, means the number of files and the 
concurrent access to the spinning rust (with first SSD- and now NVMe-based 
cache) may have reached some tipping point. I have all the periodic daily 
mails around, so theoretically I may be able to find out when this started, 
but as can be seen in another mail to this mailing list, the system which 
has all the periodic mails has some issues which have higher priority 
for me to track down...


Since I updated to a src from 2023-07-20, this is not the case anymore. 
The data is the same (maybe even a bit more, as I have added 2 more 
jails since then, and the periodic daily runs, which run more or less in 
parallel, are not taking considerably longer). The speed increase with 
the July build is in the area of 3-4 hours for 23 parallel periodic 
daily runs. So instead of finishing the periodic runs around 5pm, they 
finish already around 1pm/2pm.


So whatever was done inside ZFS or VFS or nullfs between 2023-06-19 and 
2023-07-20 has given a huge speed improvement. From my memory I would 
say there is still room for improvement, as I think it may be the case 
that the periodic daily runs ended in the morning instead of the 
afternoon, but my memory may be flaky in this regard...


Great work to whoever was involved.

Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.orgnetch...@freebsd.org  : PGP 0x8F31830F9F2772BF