Re: zfs pool behavior - is it ever freed?

2023-07-31 Thread Greg Troxel
I took your patch and have been adding comments to help me understand
things, as well as debug logging.  I also switched how it works, to have
an ifdef for the NetBSD approach vs. others, to make it less confusing -- but
it amounts to the same thing.  (I understand your intent was to touch as
few lines as possible and agree that your approach is also sensible.)

I conclude:

  The default behavior is to set the ARC size to all memory except 1
  GB.

  Even on a high-memory machine, without memory pressure mechanisms, the
  current code is dangerous -- even if in practice it is usually ok.

  If the ARC size is more moderate, things are ok.

  The ARC tends to fill with metadata, and I believe this is because the
  vnode cache holds references into the ARC, so that memory is not
  evictable.

  We don't have any real evidence that a huge ARC is much better than a
  biggish ARC on real workloads.

I attach my patch, which I am not really proposing for committing this
minute.  But I suggest anyone trying to run zfs on 8G and below try it.
I think it would be interesting to hear how it affects systems with lots
of memory.

On a system with 6000 MB (yes, that's not a power of 2 - xen config), I
end up with 750 MB of arc_c.   There is often 2+ GB of pool usage, but
that's of course also non-zfs.  The system is stable, so far.  (I have
pkgsrc, distfiles, binary packages in zfs; the OS is in UFS2.)

ARCI 002 arc_abs_min 16777216
ARCI 002 arc_c_min 196608000
ARCI 005 arc_c_max 786432000
ARCI 010 arc_c_min 196608000
ARCI 010 arc_p 393216000
ARCI 010 arc_c 786432000
ARCI 010 arc_c_max 786432000
ARCI 011 arc_meta_limit 196608000
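
For reference, those values work out to simple ratios of the 6000 MB dom0
memory: arc_c_max and arc_c are 1/8, arc_c_min is 1/32, arc_p is arc_c/2,
and arc_meta_limit is arc_c_max/4.  A few lines of C just to sanity-check
that arithmetic (this is only a check of the numbers above, not code taken
from the patch):

  /* check the ratios the ARCI output above appears to encode */
  #include <stdio.h>
  #include <stdint.h>

  int
  main(void)
  {
  	uint64_t mem = 6000ULL * 1024 * 1024;		/* dom0 memory: 6000 MB */
  	uint64_t arc_c_max = mem / 8;			/* 786432000 */
  	uint64_t arc_c_min = mem / 32;			/* 196608000 */
  	uint64_t arc_p = arc_c_max / 2;			/* 393216000 */
  	uint64_t arc_meta_limit = arc_c_max / 4;	/* 196608000 */

  	printf("max %llu min %llu p %llu meta %llu\n",
  	    (unsigned long long)arc_c_max, (unsigned long long)arc_c_min,
  	    (unsigned long long)arc_p, (unsigned long long)arc_meta_limit);
  	return 0;
  }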





Re: zfs pool behavior - is it ever freed?

2023-07-29 Thread Michael van Elst
g...@lexort.com (Greg Troxel) writes:

>mlel...@serpens.de (Michael van Elst) writes:

>> t...@netbsd.org (Tobias Nygren) writes:
>>
>>>There exists ZFS code which hooks into UVM to drain memory -- but part
>>>of it is ifdef __i386 for some reason. See arc_kmem_reap_now().
>>
>> That's an extra for 32bit systems (later code replaced __i386 with
>> the proper macro) where kernel address space is much smaller.

>Sure, but I don't see why it shouldn't always be hooked up.

It might have negative effects, as it's used to reduce non-zfs-related
pools. The zfs-related pools are freed in any case.

I think our problem is that the vcache doesn't get enough pressure.
For FreeBSD you will find the dnlc_reduce_cache() function in arc.c
that (in a separate thread) reduces metadata held by the kernel so that
buffers actually get unreferenced and can be freed afterwards.

We do have the vdrain_thread() that tries to keep cached vnodes
(and associated metadata) below the 'desiredvnodes' value. That
value however doesn't change under memory pressure. You can manually
change the value (as kern.maxvnodes) to get an immediate effect
(for e.g. UFS).
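
As a concrete illustration of that knob, the same thing can be done
programmatically; a minimal, untested sketch that halves kern.maxvnodes
(equivalent to running sysctl -w by hand):

  /* shrink-maxvnodes.c: read kern.maxvnodes and cut it in half */
  #include <sys/param.h>
  #include <sys/sysctl.h>
  #include <err.h>
  #include <stdio.h>

  int
  main(void)
  {
  	int maxvnodes, newval;
  	size_t len = sizeof(maxvnodes);

  	if (sysctlbyname("kern.maxvnodes", &maxvnodes, &len, NULL, 0) == -1)
  		err(1, "read kern.maxvnodes");
  	newval = maxvnodes / 2;
  	/* lowering the limit should make vdrain shrink the vnode cache */
  	if (sysctlbyname("kern.maxvnodes", NULL, NULL, &newval,
  	    sizeof(newval)) == -1)
  		err(1, "write kern.maxvnodes");
  	printf("kern.maxvnodes: %d -> %d\n", maxvnodes, newval);
  	return 0;
  }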

For ZFS however that apparently isn't enough. I can shrink the
vcache, but the ZFS buffers seem to be unaffected.



Re: zfs pool behavior - is it ever freed?

2023-07-29 Thread Greg Troxel
mlel...@serpens.de (Michael van Elst) writes:

> t...@netbsd.org (Tobias Nygren) writes:
>
>>There exists ZFS code which hooks into UVM to drain memory -- but part
>>of it is ifdef __i386 for some reason. See arc_kmem_reap_now().
>
> That's an extra for 32bit systems (later code replaced __i386 with
> the proper macro) where kernel address space is much smaller.

Sure, but I don't see why it shouldn't always be hooked up.

The upstream code ends up setting

  arc_c_min: 1/32 of memory.  I think this is "amount below which arc
  will not be pushed, even under memory pressure".

  arc_c: 1/8 of memory.  I think this is "target size of ARC"

  arc_c_max: all memory except 1GB.  I think this is "size above which
  ARC will be hard prohibited from growing".

The idea that arc_c_max is so big is only reasonable if there is an
effective, known-working mechanism to free memory under pressure.
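
To make that concrete, here is a small userland rendering of those rules as
I understand them (the all-but-1GB clause and the MAX(arc_c * 5, ...) step
match the arc_init() fragment quoted elsewhere in this thread; the rest is
my paraphrase of the list above, not the authoritative code):

  #include <stdio.h>
  #include <stdint.h>

  #define MiB	(1024ULL * 1024)
  #define GiB	(1024 * MiB)
  #define MAX(a, b)	((a) > (b) ? (a) : (b))

  int
  main(void)
  {
  	uint64_t allmem = 8 * GiB;		/* pretend we have 8 GB of RAM */
  	uint64_t arc_abs_min = 16 * MiB;

  	uint64_t arc_c = allmem / 8;		/* initial target */
  	uint64_t arc_c_min = MAX(allmem / 32, arc_abs_min);
  	uint64_t arc_c_max;

  	if (allmem >= 1 * GiB)			/* "all memory except 1 GB" */
  		arc_c_max = allmem - 1 * GiB;
  	else
  		arc_c_max = arc_c_min;
  	arc_c_max = MAX(arc_c * 5, arc_c_max);

  	printf("arc_c_min %llu  arc_c %llu  arc_c_max %llu\n",
  	    (unsigned long long)arc_c_min, (unsigned long long)arc_c,
  	    (unsigned long long)arc_c_max);
  	return 0;
  }

For 8 GB that comes out to roughly 256 MB / 1 GB / 7 GB, which is where the
"only reasonable with a working pressure mechanism" concern comes from.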

I'm in the middle of trying to document what the variables mean and
rationalize the size calculation, and will post a diff fairly soon.

With tnn@'s patch, I am seeing

  $ sysctl kstat.zfs.misc.arcstats|egrep size
  kstat.zfs.misc.arcstats.size = 48786184
  kstat.zfs.misc.arcstats.compressed_size = 36969472
  kstat.zfs.misc.arcstats.uncompressed_size = 185070080
  kstat.zfs.misc.arcstats.overhead_size = 1348096
  kstat.zfs.misc.arcstats.hdr_size = 10160672
  kstat.zfs.misc.arcstats.data_size = 0
  kstat.zfs.misc.arcstats.metadata_size = 38317568
  kstat.zfs.misc.arcstats.other_size = 307944
  kstat.zfs.misc.arcstats.anon_size = 16384
  kstat.zfs.misc.arcstats.mru_size = 28901888
  kstat.zfs.misc.arcstats.mru_ghost_size = 87785472
  kstat.zfs.misc.arcstats.mfu_size = 9399296
  kstat.zfs.misc.arcstats.mfu_ghost_size = 97047552
  kstat.zfs.misc.arcstats.l2_size = 0
  kstat.zfs.misc.arcstats.l2_asize = 0
  kstat.zfs.misc.arcstats.l2_hdr_size = 0

but also vmstat -m

  In use 1445870K, total allocated 3323556K; utilization 43.5%

so I think either ARC use is not being accounted for correctly or there
is substantial non-ARC use.   But, my system is stable with 6G of RAM
under Xen. (Which is good because nvmm requires newish CPUs and I can't
run it.)


Also, I don't understand "ghost" yet.


Re: zfs pool behavior - is it ever freed?

2023-07-29 Thread Greg Troxel
tlaro...@polynum.com writes:

> On Sat, Jul 29, 2023 at 12:42:13PM +0200, Tobias Nygren wrote:
>> On Fri, 28 Jul 2023 20:04:56 -0400
>> Greg Troxel  wrote:
>> 
>> > The upstream code tries to find a min/target/max under the assumption
>> > that there is a mechanism to free memory under pressure -- which there
>> > is not.
>> 
>> There exists ZFS code which hooks into UVM to drain memory -- but part
>> of it is ifdef __i386 for some reason. See arc_kmem_reap_now().
>
> FWIW, with jemalloc, there is the possibility to configure to give back
> memory to the system.
>
> Since jemalloc is incorporated in NetBSD, one(TM) should perhaps look if
> the feature is available and what has to be done in this area to use
> it.

It's unlikely to be reasonable to rototill the way zfs allocates memory.  The
issue is just that things aren't hooked up.


Re: zfs pool behavior - is it ever freed?

2023-07-29 Thread Michael van Elst
t...@netbsd.org (Tobias Nygren) writes:

>There exists ZFS code which hooks into UVM to drain memory -- but part
>of it is ifdef __i386 for some reason. See arc_kmem_reap_now().

That's an extra for 32bit systems (later code replaced __i386 with
the proper macro) where kernel address space is much smaller.



Re: zfs pool behavior - is it ever freed?

2023-07-29 Thread tlaronde
On Sat, Jul 29, 2023 at 12:42:13PM +0200, Tobias Nygren wrote:
> On Fri, 28 Jul 2023 20:04:56 -0400
> Greg Troxel  wrote:
> 
> > The upstream code tries to find a min/target/max under the assumption
> > that there is a mechanism to free memory under pressure -- which there
> > is not.
> 
> There exists ZFS code which hooks into UVM to drain memory -- but part
> of it is ifdef __i386 for some reason. See arc_kmem_reap_now().

FWIW, with jemalloc, there is the possibility to configure to give back
memory to the system.

Since jemalloc is incorporated in NetBSD, one(TM) should perhaps look if
the feature is available and what has to be done in this area to use
it. (If you skim through the documentation, you will end up knowing at
least as much as I do; i.e. this is based on superficial knowledge.)
-- 
Thierry Laronde 
 http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


Re: zfs pool behavior - is it ever freed?

2023-07-29 Thread Tobias Nygren
On Fri, 28 Jul 2023 20:04:56 -0400
Greg Troxel  wrote:

> The upstream code tries to find a min/target/max under the assumption
> that there is a mechanism to free memory under pressure -- which there
> is not.

There exists ZFS code which hooks into UVM to drain memory -- but part
of it is ifdef __i386 for some reason. See arc_kmem_reap_now().


Re: zfs pool behavior - is it ever freed?

2023-07-28 Thread Greg Troxel


Tobias Nygren  writes:

> On Thu, 27 Jul 2023 06:43:45 -0400
> Greg Troxel  wrote:
>
>>   Thus it seems there is a limit for zfs usage, but it is simply
>>   sometimes too high depending on available RAM.
>
> I use this patch on my RPi4, which I feel improves things.
> People might find it helpful.
> There ought to be writable sysctl knobs for some of the ZFS
> tuneables, but looks like it isn't implemented in NetBSD yet.

It definitely helps.

The upstream code tries to find a min/target/max under the assumption
that there is a mechanism to free memory under pressure -- which there
is not.

Reading the code, the cache is supposed to free things if size >
target.  I am not sure that works.   It makes sense to have a target/max
gap so that the freeing can be async.  That's often lost.

The code to set min/target/max does not seem clearly sensible to me.
It seems to set arc_max to all RAM except 1 GB.  No wonder we have trouble.

Looking at kstats, I see excessive meta usage, way above the limit.  In this
case min/max are historical records, not controls.

Here, the meta limit is 200M, which is 1/4 of the 800M target for the whole
cache (1/8 of 6GB, my allocation to dom0).  That seems reasonable.
But there is 1.2G of metadata.  Perhaps that is uncompressed size.


kstat.zfs.misc.arcstats.arc_meta_used = 1225255744
kstat.zfs.misc.arcstats.arc_meta_limit = 201326592
kstat.zfs.misc.arcstats.arc_meta_max = 1407252032
kstat.zfs.misc.arcstats.arc_meta_min = 100663296



Re: zfs pool behavior - is it ever freed?

2023-07-28 Thread Michael van Elst
On Fri, Jul 28, 2023 at 12:26:57PM -0400, Greg Troxel wrote:
> mlel...@serpens.de (Michael van Elst) writes:
> 
> > g...@lexort.com (Greg Troxel) writes:
> >
> >>I'm not either, but if there is a precise description/code of what they
> >>did, that lowers the barrier to us stealing* it.  (* There is of course
> >>a long tradition of improvements from various *BSD being applied to
> >>others.)
> >
> > The FreeBSD code is already there and I have exposed a few settings:
> >
> > vfs.zfs_arc.meta_limit = 0
> > vfs.zfs_arc.meta_min = 0
> > vfs.zfs_arc.shrink_shift = 0
> > vfs.zfs_arc.max = 5292193280
> > vfs.zfs_arc.min = 661524160
> > vfs.zfs_arc.compressed = 1
> >
> > but that's not enough to control the behaviour.
> 
> Is that in current only?   I don't see that in netbsd-10.


Only in my local tree.


-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: zfs pool behavior - is it ever freed?

2023-07-28 Thread Greg Troxel
mlel...@serpens.de (Michael van Elst) writes:

> g...@lexort.com (Greg Troxel) writes:
>
>>I'm not either, but if there is a precise description/code of what they
>>did, that lowers the barrier to us stealing* it.  (* There is of course
>>a long tradition of improvements from various *BSD being applied to
>>others.)
>
> The FreeBSD code is already there and I have exposed a few settings:
>
> vfs.zfs_arc.meta_limit = 0
> vfs.zfs_arc.meta_min = 0
> vfs.zfs_arc.shrink_shift = 0
> vfs.zfs_arc.max = 5292193280
> vfs.zfs_arc.min = 661524160
> vfs.zfs_arc.compressed = 1
>
> but that's not enough to control the behaviour.

Is that in current only?   I don't see that in netbsd-10.

I did some code reading and it looks like the arc parameters are
computed at module load time, from the sysctl/whatever values as they
exist at that moment, and then not adjusted.  But I didn't read that far
and am still trying to understand.  The ARC sizing rules are pretty
complicated.


Re: zfs pool behavior - is it ever freed?

2023-07-28 Thread Michael van Elst
g...@lexort.com (Greg Troxel) writes:

>I'm not either, but if there is a precise description/code of what they
>did, that lowers the barrier to us stealing* it.  (* There is of course
>a long tradition of improvements from various *BSD being applied to
>others.)

The FreeBSD code is already there and I have exposed a few settings:

vfs.zfs_arc.meta_limit = 0
vfs.zfs_arc.meta_min = 0
vfs.zfs_arc.shrink_shift = 0
vfs.zfs_arc.max = 5292193280
vfs.zfs_arc.min = 661524160
vfs.zfs_arc.compressed = 1

but that's not enough to control the behaviour.



Re: zfs pool behavior - is it ever freed?

2023-07-28 Thread Greg Troxel
Mr Roooster  writes:

> I'm not sure they did a lot more than expose the ARC limit as a sysctl.

I'm not either, but if there is a precise description/code of what they
did, that lowers the barrier to us stealing* it.  (* There is of course
a long tradition of improvements from various *BSD being applied to
others.)

> I moved to FreeBSD from Net a few years ago (mainly to get ZFS), and
> have had similar issues under heavy load with a large ARC. It wouldn't
> crash or hang, but it would always favour killing something over
> flushing the ARC under pressure. I did a little bit of digging and got
> the impression this was the way it was intended to work. (Although
> reading this thread it may be a little more complex than that. :) )

Somebody may intend that, but it seems obviously buggy to kill processes
rather than drop data from a cache.

> Once I limited my ARC my problems went away. I limited mine to 16 gig
> on a 96 gig system, but I was running some processes with high memory
> usage. I've not had cause to increase it though, and the system runs
> reliably. It has a few zpools, and I'm running a VM of an iSCSI
> exposed ZVOL, so it get a decent amount of use.

Did I hear that right -- you had problems on a 96 GB system with the
default settings?  What was the default limit?

Did you -- or could you -- characterize the performance impact on ZFS of
having ARC limited to say 8/16/24G?  And is this with spinning disks or
SSD, with or without L2ARC?

> (This is my home system, not a production system, however it does have
> something like 10 HDDs in, so is often quite I/O loaded).

Wow, that's a lot of disks!


Re: zfs pool behavior - is it ever freed?

2023-07-28 Thread Mr Roooster
On Thu, 27 Jul 2023 at 19:28, Greg Troxel  wrote:
>
> Mike Pumford  writes:
>
[snip]
>
> > If I've read it right there needs to be a mechanism for memory
> > pressure to force ZFS to release memory. Doing it after all the
> > processes have been swapped to disk is way too late as the chances are
> > the system will become non-responsive by then. From memory this was a
> > problem FreeBSD had to solve as well.
>
> It would be interesting to read a description of what they did.  That
> seems easier than figuring it out from scratch.
>
I'm not sure they did a lot more than expose the ARC limit as a sysctl.

I moved to FreeBSD from Net a few years ago (mainly to get ZFS), and
have had similar issues under heavy load with a large ARC. It wouldn't
crash or hang, but it would always favour killing something over
flushing the ARC under pressure. I did a little bit of digging and got
the impression this was the way it was intended to work. (Although
reading this thread it may be a little more complex than that. :) )

Once I limited my ARC my problems went away. I limited mine to 16 gig
on a 96 gig system, but I was running some processes with high memory
usage. I've not had cause to increase it though, and the system runs
reliably. It has a few zpools, and I'm running a VM off an iSCSI
exposed ZVOL, so it gets a decent amount of use.

(This is my home system, not a production system, however it does have
something like 10 HDDs in, so is often quite I/O loaded).

Cheers,

Ian


Re: zfs pool behavior - is it ever freed?

2023-07-27 Thread Michael van Elst
On Thu, Jul 27, 2023 at 06:42:02PM +0100, Mike Pumford wrote:
> 
> Now I might be reading it wrong but that suggest to me that it would be an
> awful idea to run ZFS on a system that needs memory for things other than
> filesystem caching as there is no way for those memory needs to force ZFS to
> give up its pool usage.

At least not in the current incarnation. There are lots of tunables
though that are supposed to limit ZFS memory usage, but so far we do
not expose these (FreeBSD does).


-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: zfs pool behavior - is it ever freed?

2023-07-27 Thread Brett Lymn
On Thu, Jul 27, 2023 at 06:43:45AM -0400, Greg Troxel wrote:
> 
>   Our howto should say:
> 
> 32G is pretty clearly enough.  Nobody thinks there will be trouble.
> 16G is highly likely enough; we have no reports of trouble.
> 8G will probably work but ill advised for production use.
> 4G will not work; we have no reports of succesful long-term operation
> 
> When you run out, it's ugly.  External tickle after sync(8) works to
> reboot.  Other wdog approaches unclear.
> 
> 
> Additional data welcome of course.

 From my experience with ZFS on Solaris, these numbers are not unique to
NetBSD.  ZFS is a notorious memory pig regardless of the OS :)

-- 
Brett Lymn
--
Sent from my NetBSD device.

"We are were wolves",
"You mean werewolves?",
"No we were wolves, now we are something else entirely",
"Oh"


Re: zfs pool behavior - is it ever freed?

2023-07-27 Thread Greg Troxel
Mike Pumford  writes:

> Now I might be reading it wrong but that suggest to me that it would
> be an awful idea to run ZFS on a system that needs memory for things
> other than filesystem caching as there is no way for those memory
> needs to force ZFS to give up its pool usage.

As I infer the kernel behavior from reading tnn@'s patch, there is a
limit on the amount of ARC storage.  On my 8G system, it seems ARC ends
up around 2-2.5G and doesn't grow.  One can debate what the limit should
be -- and clearly that's too big for a 4G system, but it does seem to be
bounded.

> If I've read it right there needs to be a mechanism for memory
> pressure to force ZFS to release memory. Doing it after all the
> processes have been swapped to disk is way too late as the chances are
> the system will become non-responsive by then. From memory this was a
> problem FreeBSD had to solve as well.

It would be interesting to read a description of what they did.  That
seems easier than figuring it out from scratch.

> Even with the conventional BSD FFS I have to set vm.filemin and
> vm.filemax to quite low values to stop the kernel prioritizing file
> system cache over process memory and thats on a system with 16GB of
> RAM. Without that tuning I'd regularly have processes effectively
> rendered unresponsive as they were completely swapped out in favor of
> FS cache.

Yes, but the FS cache is allowed to grow to most of memory.  The ARC
size has a limit that, if you have as much memory as the people who
wrote the code contemplated, is not nearly "most of memory".

Another thing I don't understand is how ARC relates to the vnode cache
and the buffer cache that stores file contents, and in particular if
there are two copies of things.

> What's the equivalent lever for ZFS?

Some variable not hooked up to a sysctl!


Re: zfs pool behavior - is it ever freed?

2023-07-27 Thread Mike Pumford




On 27/07/2023 13:47, Michael van Elst wrote:


Swapping out userland pages is done much earlier, so with high ZFS
utilization you end with a system that has a huge part of real memory
allocated to the kernel. When you run out of swap (and processes
already get killed), then you see some effects on kernel data.

Now I might be reading it wrong but that suggests to me that it would be
an awful idea to run ZFS on a system that needs memory for things other
than filesystem caching, as there is no way for those memory needs to
force ZFS to give up its pool usage.


If I've read it right there needs to be a mechanism for memory pressure 
to force ZFS to release memory. Doing it after all the processes have 
been swapped to disk is way too late as the chances are the system will 
become non-responsive by then. From memory this was a problem FreeBSD 
had to solve as well.


Even with the conventional BSD FFS I have to set vm.filemin and
vm.filemax to quite low values to stop the kernel prioritizing file
system cache over process memory, and that's on a system with 16GB of
RAM.  Without that tuning I'd regularly have processes effectively
rendered unresponsive as they were completely swapped out in favor of
FS cache.


What's the equivalent lever for ZFS?

Mike



Re: zfs pool behavior - is it ever freed?

2023-07-27 Thread Greg Troxel
David Brownlee  writes:

> I would definitely like to see something like this in-tree soonest for
> low memory (<6GB?) machines, but I'd prefer not to affect machines
> with large amounts of memory used as dedicated ZFS fileservers (at
> least not until its easily tunable)

Can you apply this locally and spiff it up so that for say >= 8GB or >=
16GB the new rule doesn't fire?  That seems the fastest path to fixing
that is clearly very broken.  (IMHO avoiding pathological behavior is
more important than what is likely a minor efficiency issue, but it's
easier to avoid that discussion.)
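
Something like the following is the shape I have in mind -- an untested
sketch on top of tnn@'s change, assuming the total-memory value used
earlier in arc_init() (allmem in other ports) is in scope, and with the
16 GB threshold picked out of the air just to show the idea:

  #if defined(__NetBSD__) && defined(_KERNEL)
  	/*
  	 * Only cap the ARC via kmem on smaller machines, so dedicated
  	 * fileservers with lots of RAM keep the upstream sizing.
  	 */
  	if (allmem < 16ULL * 1024 * 1024 * 1024)
  		arc_c_max = MIN(arc_c,
  		    vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 8);
  #endif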

(We don't have any data on the table that says this would hurt, either,
assuming that anyone using zfs is either using ssd only or has l2arc on
ssd.  Actual data would be interesting!)


Re: zfs pool behavior - is it ever freed?

2023-07-27 Thread David Brownlee
On Thu, 27 Jul 2023 at 13:24, Greg Troxel  wrote:
>
> Tobias Nygren  writes:
>
> > I use this patch on my RPi4, which I feel improves things.
> > People might find it helpful.
>
> That looks very helpful; I'll try it.
>
> > There ought to be writable sysctl knobs for some of the ZFS
> > tuneables, but looks like it isn't implemented in NetBSD yet.
>
> That seems not that hard -- it would be great if someone(tm) did that
> and mailed a patch.
>
> > --- external/cddl/osnet/dist/uts/common/fs/zfs/arc.c  3 Aug 2022 01:53:06 
> > -   1.22
> > +++ external/cddl/osnet/dist/uts/common/fs/zfs/arc.c  27 Jul 2023 11:10:40 
> > -
> > @@ -6100,6 +6100,10 @@ arc_init(void)
> >   else
> >   arc_c_max = arc_c_min;
> >   arc_c_max = MAX(arc_c * 5, arc_c_max);
> > +#if defined(__NetBSD__) && defined(_KERNEL)
> > +/* XXX prevent ARC from eating more than 12% of kmem */
> > + arc_c_max = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) 
> > / 8);
> > +#endif
> >
> >   /*
> >* In userland, there's only the memory pressure that we artificially
>
> That seems eminently sensible and is sort of what I was thinking of
> heading to.  Interesting q about /8 vs /16, but it's a reasonable enough
> value to avoid lockups and that's 90% of the benefit.
>
> I wonder if we should commit that as obviously better than where we are
> now, where machines of <= 4G fail badly.
>
> It would be interesting for people with 8G and 16G machines to try this
> patch.  That will be somewhat less and maybe not less respectively.
>
> Also perhaps a dmesg printout of what arc_c_max is set to, to help in
> figuring things out.
>
> (I suppose one can gdb it, too, for testing.)

I would definitely like to see something like this in-tree soonest for
low memory (<6GB?) machines, but I'd prefer not to affect machines
with large amounts of memory used as dedicated ZFS fileservers (at
least not until its easily tunable)

Thanks

David


Re: zfs pool behavior - is it ever freed?

2023-07-27 Thread Michael van Elst
g...@lexort.com (Greg Troxel) writes:

>  RAM and/or responds to pressure.  That's why we see almost no reports
>  of trouble expect for zfs.


There is almost no pressure on pools and several effects prevent
pressure from actually draining pool caches.

There is almost no pressure on vcache and the ZFS equivalents. Impact
by ZFS is much higher, because of the amount of memory locked up
this way. These data structures are significant as these actually
reference other data structures and buffers.

Swapping out userland pages is done much earlier, so with high ZFS
utilization you end with a system that has a huge part of real memory
allocated to the kernel. When you run out of swap (and processes
already get killed), then you see some effects on kernel data.




Re: zfs pool behavior - is it ever freed?

2023-07-27 Thread Greg Troxel
Tobias Nygren  writes:

> I use this patch on my RPi4, which I feel improves things.
> People might find it helpful.

That looks very helpful; I'll try it.

> There ought to be writable sysctl knobs for some of the ZFS
> tuneables, but looks like it isn't implemented in NetBSD yet.

That seems not that hard -- it would be great if someone(tm) did that
and mailed a patch.

> --- external/cddl/osnet/dist/uts/common/fs/zfs/arc.c  3 Aug 2022 01:53:06 
> -   1.22
> +++ external/cddl/osnet/dist/uts/common/fs/zfs/arc.c  27 Jul 2023 11:10:40 
> -
> @@ -6100,6 +6100,10 @@ arc_init(void)
>   else
>   arc_c_max = arc_c_min;
>   arc_c_max = MAX(arc_c * 5, arc_c_max);
> +#if defined(__NetBSD__) && defined(_KERNEL)
> +/* XXX prevent ARC from eating more than 12% of kmem */
> + arc_c_max = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 
> 8);
> +#endif
>
>   /*
>* In userland, there's only the memory pressure that we artificially

That seems eminently sensible and is sort of what I was thinking of
heading to.  Interesting q about /8 vs /16, but it's a reasonable enough
value to avoid lockups and that's 90% of the benefit.

I wonder if we should commit that as obviously better than where we are
now, where machines of <= 4G fail badly.

It would be interesting for people with 8G and 16G machines to try this
patch.  That will be somewhat less and maybe not less respectively.

Also perhaps a dmesg printout of what arc_c_max is set to, to help in
figuring things out.

(I suppose one can gdb it, too, for testing.)
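
For the dmesg idea, something as small as this at the end of arc_init()
would do (untested sketch; plain kernel printf, with casts so no special
format support is assumed):

  #if defined(__NetBSD__) && defined(_KERNEL)
  	/* report the computed ARC sizes at boot to help field debugging */
  	printf("zfs: arc_c_min %llu arc_c %llu arc_c_max %llu\n",
  	    (unsigned long long)arc_c_min, (unsigned long long)arc_c,
  	    (unsigned long long)arc_c_max);
  #endif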



Re: zfs pool behavior - is it ever freed?

2023-07-27 Thread Tobias Nygren
On Thu, 27 Jul 2023 06:43:45 -0400
Greg Troxel  wrote:

>   Thus it seems there is a limit for zfs usage, but it is simply
>   sometimes too high depending on available RAM.

I use this patch on my RPi4, which I feel improves things.
People might find it helpful.
There ought to be writable sysctl knobs for some of the ZFS
tuneables, but looks like it isn't implemented in NetBSD yet.

--- external/cddl/osnet/dist/uts/common/fs/zfs/arc.c	3 Aug 2022 01:53:06 -0000	1.22
+++ external/cddl/osnet/dist/uts/common/fs/zfs/arc.c	27 Jul 2023 11:10:40 -0000
@@ -6100,6 +6100,10 @@ arc_init(void)
 	else
 		arc_c_max = arc_c_min;
 	arc_c_max = MAX(arc_c * 5, arc_c_max);
+#if defined(__NetBSD__) && defined(_KERNEL)
+/* XXX prevent ARC from eating more than 12% of kmem */
+	arc_c_max = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 8);
+#endif
 
 	/*
 	 * In userland, there's only the memory pressure that we artificially



Re: zfs pool behavior - is it ever freed?

2023-07-27 Thread David Brownlee
Potentially supporting datapoint:

I've found issues with netbsd-9 with ZFS on 4GB. Memory pressure was
incredibly high and the system went away every few months.

Currently running fine on -9 & -10 machines with between 8GB and 192GB

The three 8GB ZFS machines (netbsd-9+raidz1, netbsd-10+raidz0,
netbsd-10+raidz1 Frankenstein with an 8TB and two pairs of 6TB & 2TB)
all seem to run fine, but they only have around 24TB of ZFS storage.

David


Re: zfs pool behavior - is it ever freed?

2023-07-27 Thread Greg Troxel
I have a bit of data, perhaps merged with some off list comments:

  People say that a 16G machine is ok with zfs, and I have seen no
  reports of real trouble.

  When I run my box with 4G, it locks up.

  When I run my box with 8G, I end up with pool usage in the 3 G to 3.5
  G range.  It feels like there's a limit as I've never seen it above
  3.5G.  vmstat -m says (after a lot of things happening):
In use 1975994K, total allocated 3110132K; utilization 63.5%

  On machines I have handy to check without zfs (amd64 if not labeled):
In use 198214K, total allocated 217912K; utilization 91.0%
   (1G, n9 rpi3, operating near RAM capacity)
In use 67140K, total allocated 71664K; utilization 93.7%
   (1G, n9 rpi3, doing very little)
In use 813025K, total allocated 864324K; utilization 94.1%
   (4G, n9, operates a backup disk (ufs2) and little else)
In use 901729K, total allocated 975280K; utilization 92.5%
   (4G, n9, router and various home servers)
In use 574035K, total allocated 652188K; utilization 88.0%
   (5G, n9, no building, mail+everything_else server)
In use 2841803K, total allocated 3120148K; utilization 91.1%
   (24G, n9, 14G tmpfs, has built a lot of packages)
  
  On the zfs box, the big users are:
zio_buf_512 dnode_t dmu_buf_impl zio_buf_16384 zfs_znode_cache


My conclusions:

  Generally in NetBSD pool usage for caching scales appropriately with
  RAM and/or responds to pressure.  That's why we see almost no reports
  of trouble except for zfs.

  A machine without zfs that is in the 4G class will use 0.5-1G for pools.

  A 4G machine with zfs, and an 8G machine, tend to end up around 3.5G
  for pools.  It seems that zfs uses 2.5-3G, regardless of what's
  available.

  Thus it seems there is a limit for zfs usage, but it is simply
  sometimes too high depending on available RAM.

  Utilization is particularly poor on the zfs machine, 64% vs 88-94% for
  the rest.

  Our howto should say:

32G is pretty clearly enough.  Nobody thinks there will be trouble.
16G is highly likely enough; we have no reports of trouble.
8G will probably work but is ill advised for production use.
4G will not work; we have no reports of successful long-term operation.

When you run out, it's ugly.  External tickle after sync(8) works to
reboot.  Other wdog approaches unclear.


Additional data welcome of course.


Re: zfs pool behavior - is it ever freed?

2023-07-22 Thread Hauke Fath
On Sat, 22 Jul 2023 14:13:06 +0200, Hauke Fath wrote:
> It has a pair of SSDs (older intel SLC sata) for system partitions and 
> L2ARC, [...]

Got my acronyms wrong, I meant SLOG*. I understand that L2ARC is 
largely pointless, and a waste of good RAM.

Cheerio,
Hauke

* 

-- 
 The ASCII Ribbon Campaign      Hauke Fath
() No HTML/RTF in email         Institut für Nachrichtentechnik
/\ No Word docs in email        TU Darmstadt
 Respect for open standards     Ruf +49-6151-16-21344


Re: zfs pool behavior - is it ever freed?

2023-07-22 Thread Hauke Fath
On Sat, 22 Jul 2023 07:55:41 -0400, Greg Troxel wrote:
> Using half the ram for pools feels like perhaps a bug, depending -- even
> if you are getting away with it.
> 
> I am curious:
> 
>   What VM approach?

An nvmm accelerated qemu

>   How much ram in the domU (generic term even if not xen)?

800 MB

>   Are you using NFS from the domU to dom0?  domU running zfs?  Something
>   else?

The machine is not running Xen - I found the Dom0 too limiting for the 
purpose.

It has a pair of SSDs (older intel SLC sata) for system partitions and 
L2ARC, and a pair of 4TB Seagate IronWolf disks for zfs storage, 
exported through nfs, afp, smb.

The VM nfs-mounts pkgsrc and distfiles from the base machine.   

>   Is the 16G for pools the sum of the dom0 and domU pools?  Or ?

That is for the base machine, so yes.

>> My guess would be that your 8 GB are simply not enough for sustaining 
>> both zfs and builds.
> 
> I think that's how it is, but it seems obviously buggy for that to be
> the case.  It is dysfunctional to run the system to lockup caching
> things that don't need to be cached.  The ffs vnode cache for example
> does not do this.

Agreed. But zfs has had a reputation of being memory-hungry for ~ever. 
And to my understanding NetBSD's zfs version is not exactly state of 
the art.

> The zfs howto currently talks about zfs taking 1G plus 1G per 1T of
> disk.  For me that would be 1.8G, which would be ok.  But that's not
> what happens.

I wonder: How much memory are you assigning to the Dom0? That alone 
would increase memory pressure compared to a native kernel.
 
> Thanks for the data point; I'll probably edit the zfs HOWTO.  As it is
> we should probably be recommending against zfs unless you have 64G of
> RAM :-( as even your system doesn't seem healthy memory usage wise.

That sounds overly pessimistic to me. I have run a department 
fileserver with dozens of clients (home on NFS) on 32 GB RAM for years 
- the same hardware that now serves as file and build server under 
NetBSD - without ever running into zfs related memory issues.

Your 8 GB machine would probably be just fine with zfs as a pure 
fileserver.

Cheerio,
Hauke

-- 
 The ASCII Ribbon Campaign      Hauke Fath
() No HTML/RTF in email         Institut für Nachrichtentechnik
/\ No Word docs in email        TU Darmstadt
 Respect for open standards     Ruf +49-6151-16-21344


Re: zfs pool behavior - is it ever freed?

2023-07-22 Thread Greg Troxel
Hauke Fath  writes:

> On Fri, 21 Jul 2023 08:31:46 -0400, Greg Troxel wrote:
> [zfs memory pressure]
>
>>   Are others having this problem?
>
> I have two machines, one at home (-10) and one at work (-9), in a 
> similar role as yours (fileserver and builds). While both have had 
> their moments, those have never been zfs related.
>
> They both have 32 GB RAM. The home machine, currently running a 
> netbsd-9 build natively and pkg_rr in a VM, is using 16 GB for pools as 
> we speak. 

Using half the ram for pools feels like perhaps a bug, depending -- even
if you are getting away with it.

I am curious:

  What VM approach?

  How much ram in the domU (generic term even if not xen)?

  Are you using NFS from the domU to dom0?  domU running zfs?  Something
  else?

  Is the 16G for pools the sum of the dom0 and domU pools?  Or ?

> My guess would be that your 8 GB are simply not enough for sustaining 
> both zfs and builds.

I think that's how it is, but it seems obviously buggy for that to be
the case.  It is dysfunctional to run the system to lockup caching
things that don't need to be cached.  The ffs vnode cache for example
does not do this.

The zfs howto currently talks about zfs taking 1G plus 1G per 1T of
disk.  For me that would be 1.8G, which would be ok.  But that's not
what happens.

Thanks for the data point; I'll probably edit the zfs HOWTO.  As it is
we should probably be recommending against zfs unless you have 64G of
RAM :-( as even your system doesn't seem healthy memory usage wise.


Re: zfs pool behavior - is it ever freed?

2023-07-22 Thread Hauke Fath
On Fri, 21 Jul 2023 08:31:46 -0400, Greg Troxel wrote:
[zfs memory pressure]

>   Are others having this problem?

I have two machines, one at home (-10) and one at work (-9), in a 
similar role as yours (fileserver and builds). While both have had 
their moments, those have never been zfs related.

They both have 32 GB RAM. The home machine, currently running a 
netbsd-9 build natively and pkg_rr in a VM, is using 16 GB for pools as 
we speak. 

My guess would be that your 8 GB are simply not enough for sustaining 
both zfs and builds.

Cheerio,
Hauke

-- 
 The ASCII Ribbon Campaign      Hauke Fath
() No HTML/RTF in email         Institut für Nachrichtentechnik
/\ No Word docs in email        TU Darmstadt
 Respect for open standards     Ruf +49-6151-16-21344


Re: zfs pool behavior - is it ever freed?

2023-07-21 Thread Greg Troxel
This script worked to reboot after a wedge.  Assuming one has a
watchdog of course.

  #!/bin/sh
  # Arm the hardware watchdog, then tickle it once a minute -- but only
  # after a write to the zfs pool and a sync have completed, so a wedged
  # system stops tickling and the watchdog resets it.

  if [ `id -u` != 0 ]; then
  echo run as root
  exit 1
  fi

  # enable watchdog tco0 with a 360 second timeout
  wdogctl -e -p 360 tco0

  while true; do
  echo -n "LOOP: "; date
  # write to the pool; if zfs hangs, this (or the sync) blocks and the
  # tickle below never happens
  date > /tank0/n0/do-wdog
  sync
  wdogctl -t
  sleep 60
  done


zfs pool behavior - is it ever freed?

2023-07-21 Thread Greg Troxel
I'm having trouble with zfs causing a system to run out of memory, when
I think it should work ok.  I have tried to err on the side of TMI.

I have a semi-old computer (2010) that is:
  netbsd-10
  amd64
  8GB RAM
  1T SSD
  cpu0: "Pentium(R) Dual-Core  CPU  E5700  @ 3.00GHz"
  cpu1: "Pentium(R) Dual-Core  CPU  E5700  @ 3.00GHz"

and it basically works fine, besides being a bit slow by today's
standards.  I am using it as a build and fileserver, heading to
eventually running pbulk, either in domUs or chroots.  I have recently
moved 2 physical machines (netbsd-9 i386 and amd64) to domUs; I use
these to build packages for production use.  (The machines are 2006 and
2008 mac notebooks, with painfully slow spinning disks and 4G of RAM
each -- but they work.)

wd0 has a disklabel, with / and /usr as normal FFSv2 (a and e), normal
swap on wd0b.  wd0f is defined as most of the disk, and is the sole
component of tank0:

  #> zpool status
pool: tank0
   state: ONLINE
scan: scrub repaired 0 in 0h8m with 0 errors on Tue Jul  4 20:31:03 2023
  config:

  NAME   STATE READ WRITE CKSUM
  tank0  ONLINE   0 0 0
/etc/zfs/tank0/wd0f  ONLINE   0 0 0

  errors: No known data errors

I have a bunch of filesystems, for various pkgsrc branches (created from
snapshots), etc:

  NAME                   USED  AVAIL  REFER  MOUNTPOINT
  tank0                  138G   699G    26K  /tank0
  tank0/b0              6.16G   699G  6.16G  /tank0/b0
  tank0/ccache          24.1G   699G  24.1G  /tank0/ccache
  tank0/distfiles       35.1G   699G  35.1G  /tank0/distfiles
  tank0/n0              31.5K   699G  31.5K  /tank0/n0
  tank0/obj             3.48G   699G  3.48G  /tank0/obj
  tank0/packages        7.27G   699G  7.27G  /tank0/packages
  tank0/pkgsrc-2022Q1    130M   699G   567M  /tank0/pkgsrc-2022Q1
  tank0/pkgsrc-2022Q2    145M   699G   569M  /tank0/pkgsrc-2022Q2
  tank0/pkgsrc-2022Q3    194M   699G   566M  /tank0/pkgsrc-2022Q3
  tank0/pkgsrc-2022Q4    130M   699G   573M  /tank0/pkgsrc-2022Q4
  tank0/pkgsrc-2023Q1    147M   699G   582M  /tank0/pkgsrc-2023Q1
  tank0/pkgsrc-2023Q2    148M   699G   583M  /tank0/pkgsrc-2023Q2
  tank0/pkgsrc-current  10.3G   699G  1.14G  /tank0/pkgsrc-current
  tank0/pkgsrc-wip       623M   699G   623M  /tank0/pkgsrc-wip
  tank0/u0              1.91M   699G  1.91M  /tank0/u0
  tank0/vm              49.5G   699G    23K  /tank0/vm
  tank0/vm/n9-amd64     33.0G   722G  10.1G  -
  tank0/vm/n9-i386      16.5G   711G  4.38G  -
  tank0/ztmp             121M   699G   121M  /tank0/ztmp

which all feels normal to me.


I used to usually boot this as GENERIC.  Now I'm booting xen with 4G:

  menu=GENERIC:rndseed /var/db/entropy-file;boot netbsd
  menu=GENERIC single user:rndseed /var/db/entropy-file;boot netbsd -s
  menu=Xen:load /netbsd-XEN3_DOM0.gz root=wd0a rndseed=/var/db/entropy-file console=pc;multiboot /xen.gz dom0_mem=4096M
  menu=Xen single user:load /netbsd-XEN3_DOM0.gz root=wd0a rndseed=/var/db/entropy-file console=pc -s;multiboot /xen.gz dom0_mem=4096M
  menu=GENERIC.ok:rndseed /var/db/entropy-file;boot netbsd
  menu=Drop to boot prompt:prompt
  default=3
  timeout=5
  clear=1

I find that after doing things like cvs update in pkgsrc, I have a vast
amount of memory in pools:

  Memory: 629M Act, 341M Inact, 16M Wired, 43M Exec, 739M File, 66M Free
  Swap: 16G Total, 16G Free / Pools: 3372M Used

vmstat -m, sorted by Npage and showing > 1E4:

  zio_buf_16384 16384 57643153341 33786 22341 11445 30831 0   inf 7143
  zio_buf_2560 2560   18636017890 15244  2467 12777 12777 0   inf 12031
  ffsdino2 264   5406070   348374 28691 15875 12816 13522 0   inf    0
  zfs_znode_cache 248 245152   0   206469 1301518 12997 13015 0   inf  665
  ffsino   280   5402490   348016 30887 17156 13731 14488 0   inf    0
  zio_buf_2048 2048   36944036004 15617   599 15018 15026 0   inf 14259
  zio_buf_1536 2048   41491040737 18313 6 18307 18313 0   inf 17657
  zio_buf_1024 1536   55808054191 22942   357 22585 22942 0   inf 21442
  dmu_buf_impl_t 216 5388280   440673 2301611 23005 23016 0   inf  380
  arc_buf_hdr_t_f 208 657474   0   556468 25273   638 24635 25096 0   inf 7913
  zio_data_buf_51 1024 187177  0   157005 45575 14127 31448 45575 0   inf 10220
  vcachepl 640   266639056918 34959 2 34957 34958 0   inf    1
  dnode_t  640   5761980   485522 70645  9470 61175 70645 0   inf 11511
  zio_buf_512 1024   8482400   798838 141743 15535 126208 128224  0   inf 96759
  Memory resource pool statistics
  Name    Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle

sysctl:

  kstat.zfs.misc.arcstats.size = 283598992

If I continue to do things, the system locks up and needs to have the
reset button pushed.  I'm now trying an external tickle watchdog