Re: bsdinstall partition error when installing to nvme

2022-04-22 Thread Alfonso S. Siciliano

On 4/22/22 23:31, tech-lists wrote:

Hi,

Attempting to install from: 
FreeBSD-14.0-CURRENT-amd64-20220421-b91a48693a5-254961-memstick.img


to brand-new hardware, the installer failed at the stage after the one where
you select the partition scheme (so, GPT in this case) and the UFS filesystem,
with an error like (sorry to be paraphrasing this from memory as the
hardware is no longer available)


"autopart failed -5"

I thought it might be down to it being an nvme stick. The hardware
in question is a Crucial CT1000P5PSSD8 1TB. Is this a known issue
with nvme? On that machine, this was the only "disk". Can
nvme be used as the primary disk on FreeBSD?



Thank you for the report,


Reading "autopart failed -5", probably you chose "Auto UFS Guided Disk",

I found a problem with Auto Partitioning in CURRENT.
This should solve: .

However, to be sure, you could reinstall choosing "Manual" Partitioning
to understand if the cause is "nvme" or "bsdinstall".
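
If the Manual (menu-driven) editor fails too, the "Shell" option lets you
partition by hand, roughly like this (just a sketch; the device name and
sizes are placeholders, the NVMe disk may show up as nda0 instead of nvd0,
and boot loader setup is omitted):

  gpart create -s gpt nvd0
  gpart add -t efi -s 260m -l efiboot nvd0
  gpart add -t freebsd-swap -s 2g -l swap nvd0
  gpart add -t freebsd-ufs -l rootfs nvd0
  newfs -U /dev/nvd0p3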


Alfonso



Re: IPv6 TCP: first two SYN packets to local v6 unicast addresses ignored

2022-04-22 Thread tuexen
> On 23. Apr 2022, at 02:24, Gleb Smirnoff  wrote:
> 
>  Michael,
> 
> On Sat, Apr 23, 2022 at 01:54:25AM +0200, Michael Tuexen wrote:
> M> > here is a patch that should help with the IPv6 problem. I'm not
> M> > yet committing it, it might be not final.
> M> 
> M> when I was looking at the code, I was also wondering if it would make
> M> more sense to check for M_LOOP.
> M> 
> M> However, isn't the rcvif wrong for the first two received packets? I
> M> would expect it always to be the loopback interface. Is that expectation
> M> wrong?
> 
> IPv6 has a special feature of calling (ifp->if_output)(origifp, ...
> 
> I don't fully understand it, but Alexander does.
> 
> What I can observe is that it works differently for the original packet,
> its first retransmit and second retransmit. Still unclear to me why.
I consider this also strange. The three packets are identical,
so I would expect that all of them are handled the same way.
> 
> Here is how to observe it:
> 
> dtrace
>    -n 'fbt::ip6_output:entry { printf("ro %p ifp %p\n", args[2], args[2]->ro_nh ? args[2]->ro_nh->nh_ifp : 0); }'
>    -n 'fbt::ip6_output_send:entry { printf("ifp %p origifp %p\n", args[1], args[2]); }'
> 
> And you will see this:
> 
>  1  45625 ip6_output:entry ro f800122c19a0 ifp 0
>  1  22539 ip6_output_send:entry ifp f800027cb800 origifp f800020db000
> 
>  0  45625 ip6_output:entry ro f800122c19a0 ifp f800027cb800
>  0  22539 ip6_output_send:entry ifp f800027cb800 origifp f800020db000
> 
>  0  45625 ip6_output:entry ro f800122c19a0 ifp f800027cb800
>  0  22539 ip6_output_send:entry ifp f800027cb800 origifp f800027cb800
> 
> So, on packet three (second retransmit) the origifp is equal to ifp (is lo0) and now
> the packet passes validation. However, the more I read it, the more it seems to me
> that actually packet three is incorrect and the first two are correct :)
> 
> To cope with this self-inflicted damage of (ifp->if_output)(origifp, IPv6 introduced
> M_LOOP and uses it internally. Looks like a quick solution for IPv6 is to use it.
> However, I will commit it only once we get an understanding of why the hell a second
> retransmit is different.
> 
> M> I also have an additional question:
> M> Why is this check protected by an (ia != NULL) condition? It does not make
> M> any use of ia?
> 
> It is a host protection feature, so it checks only packets that are destined to us.
> This allows basic antispoof checks for a host not equipped with any firewall.
Understood. I was confused, since all other code protected by (ia != NULL)
actually depends on ia not being the NULL pointer.

Best regards
Michael
> 
> For a machine acting as a router, the better behavior is not to drop anything
> routed through unless explicitly told to by a filtering policy.
> 
> -- 
> Gleb Smirnoff




Re: Chasing OOM Issues - good sysctl metrics to use?

2022-04-22 Thread Mark Millard
On 2022-Apr-22, at 16:42, Pete Wright  wrote:

> On 4/21/22 21:18, Mark Millard wrote:
>> 
>> Messages in the console out would be appropriate
>> to report. Messages might also be available via
>> the following at appropriate times:
> 
> that is what is frustrating.  i will get notification that the processes are killed:
> Apr 22 09:55:15 topanga kernel: pid 76242 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
> Apr 22 09:55:19 topanga kernel: pid 76288 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
> Apr 22 09:55:20 topanga kernel: pid 76259 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
> Apr 22 09:55:22 topanga kernel: pid 76252 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
> Apr 22 09:55:23 topanga kernel: pid 76267 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
> Apr 22 09:55:24 topanga kernel: pid 76234 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
> Apr 22 09:55:26 topanga kernel: pid 76275 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory

Those messages are not reporting being out of swap
as such. They are reporting sustained low free RAM
despite a number of less drastic attempts to gain
back free RAM (to above some threshold).

FreeBSD does not swap out the kernel stacks for
processes that stay in a runnable state: it just
continues to page. Thus just one large process
that has a huge working set of active pages can
lead to OOM kills in a context where no other set
of processes would be enough to gain the free
RAM required. Such contexts are not really a
swap issue.

Based on there being only 1 "killed:" reason,
I have a suggestion that should allow delaying
such kills for a long time. That in turn may
help with investigating without actually
suffering the kills during the activity: more
time with low free RAM to observe.

Increase:

# sysctl -d vm.pageout_oom_seq
vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM

The default value was 12, last I checked.

My /boot/loader.conf contains the following relative to
that and another type of kill context (just comments
currently for that other type):

#
# Delay when persistent low free RAM leads to
# Out Of Memory killing of processes:
vm.pageout_oom_seq=120
#
# For plenty of swap/paging space (will not
# run out), avoid pageout delays leading to
# Out Of Memory killing of processes:
#vm.pfault_oom_attempts=-1
#
# For possibly insufficient swap/paging space
# (might run out), increase the pageout delay
# that leads to Out Of Memory killing of
# processes (showing defaults at the time):
#vm.pfault_oom_attempts= 3
#vm.pfault_oom_wait= 10
# (The multiplication is the total but there
# are other potential tradeoffs in the factors
# multiplied, even for nearly the same total.)

There is no value of vm.pageout_oom_seq that
disables the mechanism. But you can set large
values, like I did --or even larger-- to
wait for more attempts to free some RAM before
the kills. Some notes about that follow.

The 120 I use allows even low end arm Small
Board Computers to manage buildworld buildkernel
without such kills. The buildworld buildkernel
completion is sufficient that the low-free-RAM
status is no longer true and the OOM attempts
stop --so the count goes back to 0.

But those are large but finite activities. If
you want to leave something running for days,
weeks, months, or whatever that produces the
sustained low free RAM conditions, the problem
will eventually happen. Ultimately one may have
to exit and restart such processes once in a
while, exiting enough of them to give a little
time with sufficient free RAM.
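
For reference, the same knobs can be inspected
and changed at runtime as well (a sketch; 120 is
just the value I happen to use, not a default):

# sysctl -d vm.pageout_oom_seq vm.pfault_oom_attempts
# sysctl vm.pageout_oom_seq vm.pfault_oom_attempts
# sysctl vm.pageout_oom_seq=120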


> the system in this case had killed both firefox and chrome while i was afk.
> i logged back in and started them up to do more work, then the next logline
> is from this morning when i had to force power off/on the system as the
> keyboard and network were both unresponsive:
> 
> Apr 22 09:58:20 topanga syslogd: kernel boot file is /boot/kernel/kernel
> 
>> Do you have any swap partitions set up and in use? The
>> details could be relevant. Do you have swap set up
>> some other way than via swap partition use? No swap?
> yes i have 2GB of swap that resides on an nvme device.

I assume a partition style. Otherwise there are other
issues involved --that likely should be avoided by
switching to partition style.

>> ZFS (so with ARC)? UFS? Both?
> 
> i am using ZFS and am setting my vfs.zfs.arc.max to 10G.  i have also 
> experienced this crash with that set to the default unlimited value as well.

I use ZFS on systems with at least 8 GiBytes of RAM,
but I've never tuned ZFS. So I'm not much help for
that side of things.

For systems with under 8 GiBytes of RAM, I use UFS
unless doing an odd experiment.

>> The first block of lines from a top display could be
>> relevant, particularly when it is clearly progressing
>> towards having the problem. (After the problem is too
>> late.) (I 

Re: IPv6 TCP: first two SYN packets to local v6 unicast addresses ignored

2022-04-22 Thread tuexen
> On 23. Apr 2022, at 01:38, Gleb Smirnoff  wrote:
> 
> Hi Florian,
> 
> here is a patch that should help with the IPv6 problem. I'm not
> yet committing it, it might be not final.
Hi Gleb,

when I was looking at the code, I was also wondering if it would make
more sense to check for M_LOOP.

However, isn't the rcvif wrong for the first two received packets? I
would expect it always to be the loopback interface. Is that expectation
wrong?

I also have an additional question:
Why is this check protected by an (ia != NULL) condition? It does not make
any use of ia?

Best regards
Michael
> -- 
> Gleb Smirnoff
> 




Re: IPv6 TCP: first two SYN packets to local v6 unicast addresses ignored

2022-04-22 Thread Gleb Smirnoff
  Michael,

On Sat, Apr 23, 2022 at 01:54:25AM +0200, Michael Tuexen wrote:
M> > here is a patch that should help with the IPv6 problem. I'm not
M> > yet committing it, it might be not final.
M> 
M> when I was looking at the code, I was also wondering if it would make
M> more sense to check for M_LOOP.
M> 
M> However, isn't the rcvif wrong for the first two received packets? I
M> would expect it always to be the loopback interface. Is that expectation
M> wrong?

IPv6 has a special feature of calling (ifp->if_output)(origifp, ...

I don't fully understand it, but Alexander does.

What I can observe is that it works differently for the original packet,
its first retransmit and second retransmit. Still unclear to me why.

Here is how to observe it:

dtrace
    -n 'fbt::ip6_output:entry { printf("ro %p ifp %p\n", args[2], args[2]->ro_nh ? args[2]->ro_nh->nh_ifp : 0); }'
    -n 'fbt::ip6_output_send:entry { printf("ifp %p origifp %p\n", args[1], args[2]); }'

And you will see this:

  1  45625 ip6_output:entry ro f800122c19a0 ifp 0
  1  22539 ip6_output_send:entry ifp f800027cb800 origifp f800020db000

  0  45625 ip6_output:entry ro f800122c19a0 ifp f800027cb800
  0  22539 ip6_output_send:entry ifp f800027cb800 origifp f800020db000

  0  45625 ip6_output:entry ro f800122c19a0 ifp f800027cb800
  0  22539 ip6_output_send:entry ifp f800027cb800 origifp f800027cb800

So, on packet three (second retransmit) the origifp is equal to ifp (is lo0) and now
the packet passes validation. However, the more I read it, the more it seems to me
that actually packet three is incorrect and the first two are correct :)

To cope with this self-inflicted damage of (ifp->if_output)(origifp, IPv6 introduced
M_LOOP and uses it internally. Looks like a quick solution for IPv6 is to use it.
However, I will commit it only once we get an understanding of why the hell a second
retransmit is different.

M> I also have an additional question:
M> Why is this check protected by an (ia != NULL) condition? It does not make
M> any use of ia?

It is a host protection feature, so it checks only packets that are destined to us.
This allows basic antispoof checks for a host not equipped with any firewall.

For a machine acting as a router, the better behavior is not to drop anything routed
through unless explicitly told to by a filtering policy.
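
If someone wants to experiment, the check can be toggled via the sysctl behind
V_ip6_sav (a sketch; verify the exact knob name on your tree before relying on it):

sysctl net.inet6.ip6.source_address_validation=0   # assumed name; disables the IPv6 check
sysctl net.inet.ip.source_address_validation=0     # assumed name; IPv4 counterpart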

-- 
Gleb Smirnoff



Re: Chasing OOM Issues - good sysctl metrics to use?

2022-04-22 Thread Pete Wright




On 4/22/22 13:39, tech-lists wrote:

Hi,

On Thu, Apr 21, 2022 at 07:16:42PM -0700, Pete Wright wrote:

hello -

on my workstation running CURRENT (amd64/32g of ram) i've been running
into a scenario where after 4 or 5 days of daily use I get an OOM event
and both chromium and firefox are killed.  then in the next day or so
the system will become very unresponsive when i unlock my
screensaver in the morning, forcing a manual power cycle.


I have the following set in /etc/sysctl.conf on a stable/13 
workstation. Am using zfs with 32GB RAM.


vm.pageout_oom_seq=120
vm.pfault_oom_attempts=-1
vm.pageout_update_period=0

Since setting these here, OOM is a rarity. I don't profess to exactly know
what they do in detail though. But my experience since these were set
is hardly any OOM and big users of memory like firefox don't crash.


nice, i will give those a test next time i crash which will be by next 
thurs if the pattern continues.


looking at the sysctl descriptions:
vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM
vm.pfault_oom_attempts: Number of page allocation attempts in page fault 
handler before it triggers OOM handling

vm.pageout_update_period: Maximum active LRU update period

i could certainly see how those could be helpful.  in an ideal world i'd 
find the root cause of the system lock-ups, but it would be nice to just 
move on from this :)


cheers,
-p

--
Pete Wright
p...@nomadlogic.org
@nomadlogicLA




Re: Chasing OOM Issues - good sysctl metrics to use?

2022-04-22 Thread Pete Wright




On 4/21/22 21:18, Mark Millard wrote:


Messages in the console out would be appropriate
to report. Messages might also be available via
the following at appropriate times:


that is what is frustrating.  i will get notification that the processes are killed:
Apr 22 09:55:15 topanga kernel: pid 76242 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
Apr 22 09:55:19 topanga kernel: pid 76288 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
Apr 22 09:55:20 topanga kernel: pid 76259 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
Apr 22 09:55:22 topanga kernel: pid 76252 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
Apr 22 09:55:23 topanga kernel: pid 76267 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
Apr 22 09:55:24 topanga kernel: pid 76234 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
Apr 22 09:55:26 topanga kernel: pid 76275 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory


the system in this case had killed both firefox and chrome while i was
afk.  i logged back in and started them up to do more work, then the
next logline is from this morning when i had to force power off/on the
system as the keyboard and network were both unresponsive:


Apr 22 09:58:20 topanga syslogd: kernel boot file is /boot/kernel/kernel


Do you have any swap partitions set up and in use? The
details could be relevant. Do you have swap set up
some other way than via swap partition use? No swap?

yes i have 2GB of swap that resides on an nvme device.

ZFS (so with ARC)? UFS? Both?


i am using ZFS and am setting my vfs.zfs.arc.max to 10G.  i have also 
experienced this crash with that set to the default unlimited value as well.
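
fwiw, the way i set it looks roughly like this (a sketch; that's 10G expressed
in bytes, and the same line can go in /etc/sysctl.conf or /boot/loader.conf to
persist across reboots):

sysctl vfs.zfs.arc.max=10737418240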




The first block of lines from a top display could be
relevant, particularly when it is clearly progressing
towards having the problem. (After the problem is too
late.) (I just picked top as a way to get a bunch of
the information all together automatically.)


since the initial OOM events happen when i am AFK it is difficult to get 
relevant stats out of top.


this is why i've started collecting more detailed metrics in 
prometheus.  my hope is i'll be able to do a better job observing how my 
system is behaving over time, in the run up to the OOM event as well as 
right before and after.  there are heaps of metrics collected though so 
hoping someone can point me in the right direction :)


-pete


--
Pete Wright
p...@nomadlogic.org
@nomadlogicLA




Re: IPv6 TCP: first two SYN packets to local v6 unicast addresses ignored

2022-04-22 Thread Gleb Smirnoff
On Sat, Apr 16, 2022 at 09:19:57AM -0400, Michael Butler wrote:
M> > Michael, can you please confirm or decline that you see the packets
M> > that are dropped when you tcpdump on lo0?
M> 
M> All the jails are aliased to share a single bridge interface. That 
M> results in the route to each jail being on lo0 so .. probably :-)

This probably is somehow related to bridge. Can you please help me by
providing a minimal configuration of bridge/jails where the problem
shows up?

-- 
Gleb Smirnoff



Re: IPv6 TCP: first two SYN packets to local v6 unicast addresses ignored

2022-04-22 Thread Gleb Smirnoff
  Hi Florian,

here is a patch that should help with the IPv6 problem. I'm not
yet committing it, it might be not final.

-- 
Gleb Smirnoff
diff --git a/sys/netinet6/ip6_input.c b/sys/netinet6/ip6_input.c
index 3a13d2a9dc7..625de6d3657 100644
--- a/sys/netinet6/ip6_input.c
+++ b/sys/netinet6/ip6_input.c
@@ -825,7 +825,7 @@ ip6_input(struct mbuf *m)
 			    ip6_sprintf(ip6bufd, &ip6->ip6_dst)));
 			goto bad;
 		}
-		if (V_ip6_sav && !(rcvif->if_flags & IFF_LOOPBACK) &&
+		if (V_ip6_sav && !(m->m_flags & M_LOOP) &&
 		    __predict_false(in6_localip_fib(&ip6->ip6_src,
 			rcvif->if_fib))) {
 			IP6STAT_INC(ip6s_badscope); /* XXX */


Re: bsdinstall partition error when installing to nvme

2022-04-22 Thread tech-lists

On Fri, Apr 22, 2022 at 10:16:07PM +, Glen Barber wrote:


Yes, please do, in case it is indeed related.


done!

thanks,
--
J.


signature.asc
Description: PGP signature


Re: bsdinstall partition error when installing to nvme

2022-04-22 Thread Glen Barber
On Fri, Apr 22, 2022 at 11:04:29PM +0100, tech-lists wrote:
> Hi Glen,
> 
> On Fri, Apr 22, 2022 at 09:36:45PM +, Glen Barber wrote:
> > 
> > This may be related to:
> > 
> > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263473
> 
> Do you think it'd be useful to the ticket if I added my
> error?
> 
> My context was a little different (current/14 and bsdinstall)
> 

Yes, please do, in case it is indeed related.

Glen



signature.asc
Description: PGP signature


Re: bsdinstall partition error when installing to nvme

2022-04-22 Thread tech-lists

Hi Glen,

On Fri, Apr 22, 2022 at 09:36:45PM +, Glen Barber wrote:


This may be related to:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263473


Do you think it'd be useful to the ticket if I added my
error?

My context was a little different (current/14 and bsdinstall)

thanks,
--
J.


signature.asc
Description: PGP signature


Re: bsdinstall partition error when installing to nvme

2022-04-22 Thread Glen Barber
On Fri, Apr 22, 2022 at 10:31:31PM +0100, tech-lists wrote:
> Hi,
> 
> Attempting to install from:
> FreeBSD-14.0-CURRENT-amd64-20220421-b91a48693a5-254961-memstick.img
> 
> to brand-new hardware, the installer failed at the stage after the one where
> you select the partition scheme (so, GPT in this case) and the UFS filesystem,
> with an error like (sorry to be paraphrasing this from memory as the
> hardware is no longer available)
> 
> "autopart failed -5"
> 
> I thought it might be down to it being an nvme stick. The hardware
> in question is a Crucial CT1000P5PSSD8 1TB. Is this a known issue
> with nvme? On that machine, this was the only "disk". Can
> nvme be used as the primary disk on FreeBSD?
> 

This may be related to:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263473

Glen



signature.asc
Description: PGP signature


bsdinstall partition error when installing to nvme

2022-04-22 Thread tech-lists

Hi,

Attempting to install from: 
FreeBSD-14.0-CURRENT-amd64-20220421-b91a48693a5-254961-memstick.img


to brand-new hardware, the installer failed at the stage after the one where
you select the partition scheme (so, GPT in this case) and the UFS filesystem,
with an error like (sorry to be paraphrasing this from memory as the
hardware is no longer available)


"autopart failed -5"

I thought it might be down to it being an nvme stick. The hardware
in question is a Crucial CT1000P5PSSD8 1TB. Is this a known issue
with nvme? On that machine, this was the only "disk". Can
nvme be used as the primary disk on FreeBSD?

thanks,
--
J.


signature.asc
Description: PGP signature


Re: Chasing OOM Issues - good sysctl metrics to use?

2022-04-22 Thread tech-lists

Hi,

On Thu, Apr 21, 2022 at 07:16:42PM -0700, Pete Wright wrote:

hello -

on my workstation running CURRENT (amd64/32g of ram) i've been running
into a scenario where after 4 or 5 days of daily use I get an OOM event
and both chromium and firefox are killed.  then in the next day or so
the system will become very unresponsive when i unlock my
screensaver in the morning, forcing a manual power cycle.


I have the following set in /etc/sysctl.conf on a stable/13 workstation. 
Am using zfs with 32GB RAM.


vm.pageout_oom_seq=120
vm.pfault_oom_attempts=-1
vm.pageout_update_period=0

Since setting these here, OOM is a rarity. I don't profess to exactly know
what they do in detail though. But my experience since these were set
is hardly any OOM and big users of memory like firefox don't crash.
--
J.


signature.asc
Description: PGP signature


Re: nullfs and ZFS issues

2022-04-22 Thread Doug Ambrisko
On Fri, Apr 22, 2022 at 09:04:39AM +0200, Alexander Leidinger wrote:
| Quoting Doug Ambrisko  (from Thu, 21 Apr 2022  
| 09:38:35 -0700):
| 
| > On Thu, Apr 21, 2022 at 03:44:02PM +0200, Alexander Leidinger wrote:
| > | Quoting Mateusz Guzik  (from Thu, 21 Apr 2022
| > | 14:50:42 +0200):
| > |
| > | > On 4/21/22, Alexander Leidinger  wrote:
| > | >> I tried nocache on a system with a lot of jails which use nullfs,
| > | >> which showed very slow behavior in the daily periodic runs (12h runs
| > | >> in the night after boot, 24h or more in subsequent nights). Now the
| > | >> first nightly run after boot was finished after 4h.
| > | >>
| > | >> What is the benefit of not disabling the cache in nullfs? I would
| > | >> expect zfs (or ufs) to cache the (meta)data anyway.
| > | >>
| > | >
| > | > does the poor performance show up with
| > | > https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?
| > |
| > | I would like to have all the 22 jails run the periodic scripts a
| > | second night in a row before trying this.
| > |
| > | > if the long runs are still there, can you get some profiling from it?
| > | > sysctl -a before and after would be a start.
| > | >
| > | > My guess is that you are at the vnode limit and bumping into the 1
| > | > second sleep.
| > |
| > | That would explain the behavior I see since I added the last jail
| > | which seems to have crossed a threshold which triggers the slow
| > | behavior.
| > |
| > | Current status (with the 112 nullfs mounts with nocache):
| > | kern.maxvnodes:   10485760
| > | kern.numvnodes:3791064
| > | kern.freevnodes:   3613694
| > | kern.cache.stats.heldvnodes:151707
| > | kern.vnodes_created: 260288639
| > |
| > | The maxvnodes value is already increased by 10 times compared to the
| > | default value on this system.
| >
| > I've attached mount.patch; with it, doing mount -v should
| > show the vnode usage per filesystem.  Note that the problem I was
| > running into was after some operations arc_prune and arc_evict would
| > consume 100% of 2 cores and make ZFS really slow.  If you are not
| > running into that issue then nocache etc. shouldn't be needed.
| 
| I don't run into this issue, but I have a huge perf difference when  
| using nocache in the nightly periodic runs. 4h instead of 12-24h (22  
| jails on this system).

I wouldn't do the nocache then!  It would be good to see what
Mateusz's patch does without nocache for your env.
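
For reference, toggling it per mount looks roughly like this (a sketch; the
paths are placeholders for your jail layout):

mount -t nullfs -o nocache /jails/base /jails/j1/base
# or, persistently, an /etc/fstab line such as:
/jails/base  /jails/j1/base  nullfs  rw,nocache  0  0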
 
| > On my laptop I set ARC to 1G since I don't use swap and in the past
| > ARC would consume too much memory and things would die.  When the
| > nullfs holds a bunch of vnodes then ZFS couldn't release them.
| >
| > FYI, on my laptop with nocache and limited vnodes I haven't run
| > into this problem.  I haven't tried the patch to let ZFS free
| > its and nullfs vnodes on my laptop.  I have only tried it via
| 
| I have this patch and your mount patch installed now, without nocache  
| and reduced arc reclaim settings (100, 1). I will check the runtime  
| for the next 2 days.
| 
| Your mount patch to show the per mount vnodes count looks useful, not  
| only for this particular case. Do you intend to commit it?

I should since it doesn't change the size of the structure etc.  I need
to put it up for review.

Thanks,

Doug A.



Re: nullfs and ZFS issues

2022-04-22 Thread Alexander Leidinger
Quoting Doug Ambrisko  (from Thu, 21 Apr 2022  
09:38:35 -0700):



On Thu, Apr 21, 2022 at 03:44:02PM +0200, Alexander Leidinger wrote:
| Quoting Mateusz Guzik  (from Thu, 21 Apr 2022
| 14:50:42 +0200):
|
| > On 4/21/22, Alexander Leidinger  wrote:
| >> I tried nocache on a system with a lot of jails which use nullfs,
| >> which showed very slow behavior in the daily periodic runs (12h runs
| >> in the night after boot, 24h or more in subsequent nights). Now the
| >> first nightly run after boot was finished after 4h.
| >>
| >> What is the benefit of not disabling the cache in nullfs? I would
| >> expect zfs (or ufs) to cache the (meta)data anyway.
| >>
| >
| > does the poor performance show up with
| > https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?
|
| I would like to have all the 22 jails run the periodic scripts a
| second night in a row before trying this.
|
| > if the long runs are still there, can you get some profiling from it?
| > sysctl -a before and after would be a start.
| >
| > My guess is that you are at the vnode limit and bumping into the 1
| > second sleep.

|
| That would explain the behavior I see since I added the last jail
| which seems to have crossed a threshold which triggers the slow
| behavior.
|
| Current status (with the 112 nullfs mounts with nocache):
| kern.maxvnodes:   10485760
| kern.numvnodes:3791064
| kern.freevnodes:   3613694
| kern.cache.stats.heldvnodes:151707
| kern.vnodes_created: 260288639
|
| The maxvnodes value is already increased by 10 times compared to the
| default value on this system.

I've attached mount.patch; with it, doing mount -v should
show the vnode usage per filesystem.  Note that the problem I was
running into was after some operations arc_prune and arc_evict would
consume 100% of 2 cores and make ZFS really slow.  If you are not
running into that issue then nocache etc. shouldn't be needed.


I don't run into this issue, but I have a huge perf difference when  
using nocache in the nightly periodic runs. 4h instead of 12-24h (22  
jails on this system).



On my laptop I set ARC to 1G since I don't use swap and in the past
ARC would consume too much memory and things would die.  When the
nullfs holds a bunch of vnodes then ZFS couldn't release them.

FYI, on my laptop with nocache and limited vnodes I haven't run
into this problem.  I haven't tried the patch to let ZFS free
its and nullfs vnodes on my laptop.  I have only tried it via


I have this patch and your mount patch installed now, without nocache  
and reduced arc reclaim settings (100, 1). I will check the runtime  
for the next 2 days.


Your mount patch to show the per mount vnodes count looks useful, not  
only for this particular case. Do you intend to commit it?


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org   netch...@freebsd.org  : PGP 0x8F31830F9F2772BF


pgpaRSyTU_E11.pgp
Description: Digitale PGP-Signatur