Re: Chasing OOM Issues - good sysctl metrics to use?

2022-04-21 Thread Mark Millard
Pete Wright  wrote on
Date: Thu, 21 Apr 2022 19:16:42 -0700 :

> on my workstation running CURRENT (amd64/32g of ram) i've been running 
> into a scenario where after 4 or 5 days of daily use I get an OOM event 
> and both chromium and firefox are killed.  then in the next day or so 
> the system will become very unresponsive when i unlock my screensaver 
> in the morning, forcing a manual power cycle.
> 
> one thing i've noticed is growing swap usage but plenty of free and 
> inactive memory, as well as a GB or so of memory in the Laundry state 
> according to top.  my understanding is that seeing swap usage grow over 
> time is expected and doesn't necessarily indicate a problem.  but what 
> concerns me is the system locking up while seeing quite a bit of disk 
> i/o (maybe from paging back in?).
> 
> in order to help chase this down i've set up 
> prometheus_sysctl_exporter(8) to send data to a local prometheus 
> instance.  the goal is to examine memory utilization over time to help 
> detect any issues.  so my question is this:
> 
> what OIDs would be useful to help diagnose weird memory 
> issues like this?
> 
> i'm currently looking at:
> sysctl_vm_domain_0_stats_laundry
> sysctl_vm_domain_0_stats_active
> sysctl_vm_domain_0_stats_free_count
> sysctl_vm_domain_0_stats_inactive_pps
> 
> 
> thanks in advance - and i'd be happy to share my data if anyone is 
> interested :)

Messages in the console output would be appropriate
to report. Messages might also be available via
the following at appropriate times:

# dmesg -a
. . .

or:

# more /var/log/messages
. . .

Generally messages from after the boot is complete
are more relevant.


Messages like the following are some examples
that would be of interest:

pid . . .(c++), jid . . ., uid . . ., was killed: failed to reclaim memory
pid . . .(c++), jid . . ., uid . . ., was killed: a thread waited too long to 
allocate a page
pid . . .(c++), jid . . ., uid . . ., was killed: out of swap space

(That last is somewhat of a misnomer for the internal
issue that leads to it.)

I'm hoping you got message(s) of one or more of the above
kinds. But others are also relevant:

. . . kernel: swap_pager: out of swap space
. . . kernel: swp_pager_getswapspace(7): failed

. . . kernel: swap_pager: indefinite wait buffer: bufobj: . . ., blkno: . . ., 
size: . . .

(Those messages do not announce a process kill but
give some evidence about context.)

Some messages with partially matching text actually
identify somewhat different contexts, so each message
type is relevant.

There may be other types of messages that are relevant.

The sequencing of the messages could be relevant.
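A hedged way to pull those lines out, with timestamps so the
ordering is preserved, is something like the following (the patterns
are just the strings quoted above):

# collect the OOM-kill and swap_pager messages, in order
grep -E 'was killed:|swap_pager|swp_pager_getswapspace' /var/log/messages
dmesg -a | grep -E 'was killed:|swap_pager|swp_pager_getswapspace'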

Do you have any swap partitions set up and in use? The
details could be relevant. Do you have swap set up
some other way than via swap partition use? No swap?

If 1+ swap partitions are in use, things that suggest
the speeds/latency characteristics of the I/O to the
drive could be relevant.
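For the swap questions above, a couple of stock tools give a quick
picture (a minimal sketch; run them while the machine is under load):

# list swap devices and their usage
swapinfo -h
# watch per-device latency (ms/r, ms/w columns) to judge paging I/O speed
gstat -p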

ZFS (so with ARC)? UFS? Both?

The first block of lines from a top display could be
relevant, particularly when it is clearly progressing
towards having the problem. (After the problem is too
late.) (I just picked top as a way to get a bunch of
the information all together automatically.)
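A minimal sketch for capturing that automatically, assuming a
5-minute interval and a scratch log path of your choosing:

# append the first top screen every 5 minutes so the lead-up is captured
while :; do
        date
        top -b -d 1 | head -n 12
        sleep 300
done >> /var/tmp/top-headers.log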

These sorts of things might help folks help you.

===
Mark Millard
marklmi at yahoo.com




Chasing OOM Issues - good sysctl metrics to use?

2022-04-21 Thread Pete Wright

hello -

on my workstation running CURRENT (amd64/32g of ram) i've been running 
into a scenario where after 4 or 5 days of daily use I get an OOM event 
and both chromium and firefox are killed.  then in the next day or so 
the system will become very unresponsive when i unlock my screensaver 
in the morning, forcing a manual power cycle.


one thing i've noticed is growing swap usage but plenty of free and 
inactive memory, as well as a GB or so of memory in the Laundry state 
according to top.  my understanding is that seeing swap usage grow over 
time is expected and doesn't necessarily indicate a problem.  but what 
concerns me is the system locking up while seeing quite a bit of disk 
i/o (maybe from paging back in?).


in order to help chase this down i've set up 
prometheus_sysctl_exporter(8) to send data to a local prometheus 
instance.  the goal is to examine memory utilization over time to help 
detect any issues.  so my question is this:


what OIDs would be useful to help diagnose weird memory 
issues like this?


i'm currently looking at:
sysctl_vm_domain_0_stats_laundry
sysctl_vm_domain_0_stats_active
sysctl_vm_domain_0_stats_free_count
sysctl_vm_domain_0_stats_inactive_pps
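for reference, those exporter metric names map back to sysctl OIDs
under vm.domain.<n>.stats, so the same numbers can be eyeballed outside
prometheus.  a hedged sketch (the exact OID set may differ by version):

# dump all per-domain VM stats, or just the ones tracked above
sysctl vm.domain.0.stats
sysctl vm.domain.0.stats.laundry vm.domain.0.stats.active \
       vm.domain.0.stats.free_count vm.domain.0.stats.inactive_pps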


thanks in advance - and i'd be happy to share my data if anyone is 
interested :)

-pete

--
Pete Wright
p...@nomadlogic.org
@nomadlogicLA




Re: nullfs and ZFS issues

2022-04-21 Thread Doug Ambrisko
On Thu, Apr 21, 2022 at 03:44:02PM +0200, Alexander Leidinger wrote:
| Quoting Mateusz Guzik  (from Thu, 21 Apr 2022  
| 14:50:42 +0200):
| 
| > On 4/21/22, Alexander Leidinger  wrote:
| >> I tried nocache on a system with a lot of jails which use nullfs,
| >> which showed very slow behavior in the daily periodic runs (12h runs
| >> in the night after boot, 24h or more in subsequent nights). Now the
| >> first nightly run after boot was finished after 4h.
| >>
| >> What is the benefit of not disabling the cache in nullfs? I would
| >> expect zfs (or ufs) to cache the (meta)data anyway.
| >>
| >
| > does the poor performance show up with
| > https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?
| 
| I would like to have all the 22 jails run the periodic scripts a  
| second night in a row before trying this.
| 
| > if the long runs are still there, can you get some profiling from it?
| > sysctl -a before and after would be a start.
| >
| > My guess is that you are hitting the vnode limit and bumping into the 1 second sleep.
| 
| That would explain the behavior I see since I added the last jail  
| which seems to have crossed a threshold which triggers the slow  
| behavior.
| 
| Current status (with the 112 nullfs mounts with nocache):
| kern.maxvnodes:   10485760
| kern.numvnodes:3791064
| kern.freevnodes:   3613694
| kern.cache.stats.heldvnodes:151707
| kern.vnodes_created: 260288639
| 
| The maxvnodes value is already increased by 10 times compared to the  
| default value on this system.

With the patch, you shouldn't mount with nocache!  However, you might
want to tune:
vfs.zfs.arc.meta_prune
vfs.zfs.arc.meta_adjust_restarts

On each restart the code increments the prune amount by
vfs.zfs.arc.meta_prune and submits that amount to the vnode reclaim
code, so it can end up reclaiming a lot of vnodes.  With the defaults
(1 * 4096), submitting that amount on each loop can cause most of the
cache to be freed.  With relatively small values for both, the cache
didn't shrink too much.
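A hedged sketch of how that tuning might look; the numbers below are
illustrative placeholders, not recommendations:

# inspect the current values
sysctl vfs.zfs.arc.meta_prune vfs.zfs.arc.meta_adjust_restarts
# try smaller values (placeholders) so each restart prunes less
sysctl vfs.zfs.arc.meta_prune=512
sysctl vfs.zfs.arc.meta_adjust_restarts=16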

Doug A.



Re: nullfs and ZFS issues

2022-04-21 Thread Doug Ambrisko
On Thu, Apr 21, 2022 at 03:44:02PM +0200, Alexander Leidinger wrote:
| Quoting Mateusz Guzik  (from Thu, 21 Apr 2022  
| 14:50:42 +0200):
| 
| > On 4/21/22, Alexander Leidinger  wrote:
| >> I tried nocache on a system with a lot of jails which use nullfs,
| >> which showed very slow behavior in the daily periodic runs (12h runs
| >> in the night after boot, 24h or more in subsequent nights). Now the
| >> first nightly run after boot was finished after 4h.
| >>
| >> What is the benefit of not disabling the cache in nullfs? I would
| >> expect zfs (or ufs) to cache the (meta)data anyway.
| >>
| >
| > does the poor performance show up with
| > https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?
| 
| I would like to have all the 22 jails run the periodic scripts a  
| second night in a row before trying this.
| 
| > if the long runs are still there, can you get some profiling from it?
| > sysctl -a before and after would be a start.
| >
| > My guess is that you are hitting the vnode limit and bumping into the 1 second sleep.
| 
| That would explain the behavior I see since I added the last jail  
| which seems to have crossed a threshold which triggers the slow  
| behavior.
| 
| Current status (with the 112 nullfs mounts with nocache):
| kern.maxvnodes:   10485760
| kern.numvnodes:3791064
| kern.freevnodes:   3613694
| kern.cache.stats.heldvnodes:151707
| kern.vnodes_created: 260288639
| 
| The maxvnodes value is already increased by 10 times compared to the  
| default value on this system.

I've attached mount.patch, which makes mount -v show the vnode usage
per filesystem.  Note that the problem I was running into was that,
after some operations, arc_prune and arc_evict would consume 100% of 2
cores and make ZFS really slow.  If you are not running into that
issue then nocache etc. shouldn't be needed.
On my laptop I set ARC to 1G since I don't use swap, and in the past
ARC would consume too much memory and things would die.  When nullfs
holds a bunch of vnodes, ZFS couldn't release them.

FYI, on my laptop with nocache and limited vnodes I haven't run
into this problem.  I haven't tried the patch to let ZFS free
its and nullfs's vnodes on my laptop; I have only tried it in a
bhyve test.  I use bhyve and an md drive to avoid wearing
out my SSD, and it's faster to test.  I have found that git, tar,
make world, etc. could trigger the issue before, but I haven't had
any issues with nocache and capped vnodes.
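A hedged sketch of the workaround described above; the values and
paths are placeholders, not recommendations:

# cap ARC at 1G (can also be set in loader.conf)
sysctl vfs.zfs.arc.max=1073741824
# lower the vnode ceiling
sysctl kern.maxvnodes=250000
# nullfs mount with caching disabled
mount -t nullfs -o nocache /usr/local/jail-base /jails/j1/usr/local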

Thanks,

Doug A.
diff --git a/sbin/mount/mount.c b/sbin/mount/mount.c
index 79d9d6cb0ca..00eefb3a5e0 100644
--- a/sbin/mount/mount.c
+++ b/sbin/mount/mount.c
@@ -692,6 +692,13 @@ prmount(struct statfs *sfp)
 			xo_emit("{D:, }{Lw:fsid}{:fsid}", fsidbuf);
 			free(fsidbuf);
 		}
+		if (sfp->f_nvnodelistsize != 0 || sfp->f_lazyvnodelistsize != 0) {
+			xo_open_container("vnodes");
+xo_emit("{D:, }{Lwc:vnodes}{Lw:count}{w:count/%ju}{Lw:lazy}{:lazy/%ju}",
+(uintmax_t)sfp->f_nvnodelistsize,
+(uintmax_t)sfp->f_lazyvnodelistsize);
+			xo_close_container("vnodes");
+		}
 	}
 	xo_emit("{D:)}\n");
 }
diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
index a495ad86ac4..3648ef8d080 100644
--- a/sys/kern/vfs_mount.c
+++ b/sys/kern/vfs_mount.c
@@ -2625,6 +2626,8 @@ __vfs_statfs(struct mount *mp, struct statfs *sbp)
 	sbp->f_version = STATFS_VERSION;
 	sbp->f_namemax = NAME_MAX;
 	sbp->f_flags = mp->mnt_flag & MNT_VISFLAGMASK;
+	sbp->f_nvnodelistsize = mp->mnt_nvnodelistsize;
+	sbp->f_lazyvnodelistsize = mp->mnt_lazyvnodelistsize;
 
 	return (mp->mnt_op->vfs_statfs(mp, sbp));
 }
diff --git a/sys/sys/mount.h b/sys/sys/mount.h
index 3383bfe8f43..95dd3c76ae5 100644
--- a/sys/sys/mount.h
+++ b/sys/sys/mount.h
@@ -91,7 +91,9 @@ struct statfs {
 	uint64_t f_asyncwrites;		/* count of async writes since mount */
 	uint64_t f_syncreads;		/* count of sync reads since mount */
 	uint64_t f_asyncreads;		/* count of async reads since mount */
-	uint64_t f_spare[10];		/* unused spare */
+	uint32_t f_nvnodelistsize;	/* (i) # of vnodes */
+	uint32_t f_lazyvnodelistsize;/* (l) # of lazy vnodes */
+	uint64_t f_spare[9];		/* unused spare */
 	uint32_t f_namemax;		/* maximum filename length */
 	uid_t	  f_owner;		/* user that mounted the filesystem */
 	fsid_t	  f_fsid;		/* filesystem id */


Re: Daily black screen of death

2022-04-21 Thread Steve Kargl
On Thu, Apr 21, 2022 at 09:44:04AM +0200, Emmanuel Vadot wrote:
> 
>  Hello Steve,
> 
> On Tue, 19 Apr 2022 11:32:32 -0700
> Steve Kargl  wrote:
> 
> > FYI,
> > 
> > I'm experiencing an almost daily black screen of death panic.
> > Kernel, world, drm-current-kmod, and gpu-firmware-kmod were
> > all rebuilt and installed at the same time.  Uname shows
> > 
> > FreeBSD 14.0-CURRENT #0 main-n254360-eb9d205fa69: Tue Apr 5 13:49:47 PDT 
> > 2022
> > 
> > So, April 5th sources.
> > 
> > The panic results in a keyboard lock and no dump.  The system
> > does not have a serial console.  Only recourse is a hard reset.
> > 
> > Hand transcribed from photo
> > 
> > _sleep() at _sleep+0x38a/frame 0xfe012b7c0680
> > buf_daemon_shutdown() at buf_daemon_shutdown+0x6b/frame 0xfe012b7c06a0
> > kern_reboot() at kern_reboot+0x2ae/frame 0xfe012b7c06e0
> > vpanic() at vpanic+0x1ee/frame 0xfe012b7c0730
> > panic() at panic+0x43/frame 0xfe012b7c0790
> > 
> > Above repeats 100s of times, scrolling off the screen with an ever
> > increasing frame pointer.
> > 
> > Final message,
> > 
> > mi_switch() at mi_switch+0x18e/frame 0xfe012b7c14b0
> > __mtx_lock_sleep() at __mtx_lock_sleep+0x173/frame 0xfe012b7c1510
> > __mtx_lock_flags() at __mtx_lock_flags+0xc0/frame 0xfe012b7c1550
> > linux_wake_up() at linux_wake_up+0x38/frame 0xfe012b7c15a0
> > radeon_fence_is_signaled() at radeon_fence_is_signaled+0x99/frame 
> > 0xfe012b7c15f0
> > dma_resv_add_shared_fence() at dma_resv_add_shared_fence+0x99/frame 
> > 0xfe012b7c1640
> > ttm_eu_fence_buffer_objects() at ttm_eu_fence_buffer_objects+0x79/frame 
> > 0xfe012b7c1680
> > radeon_cs_parser_fini() at radeon_cs_parser_fini+0x53/frame 
> > 0xfe012b7c16b0
> > radeaon_cs_ioctl() at radeaon_cs_ioctl+0x75e/frame 0xfe012b7c1b30
> > drm_ioctl_kernel() at drm_ioctl_kernel+0xc7/frame 0xfe012b7c1b80
> > drm_ioctl() at drm_ioctl+0x2c3/frame 0xfe012b7c1c70
> > linux_file_ioctl() at linux_file_ioctl+0x309/frame 0xfe012b7c1cd0
> > kern_ioctl() at kern_ioctl+0x1dc/frame 0xfe012b7c1d40
> > sys_ioctl() at sys_ioctl+0x121/frame 0xfe012b7c1e10
> > amd64_syscall() at amd64_syscall+0x108/frame 0xfe012b7c1f30
> > fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfe012b7c1f30
> > --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x36a096c34ea, rsp = 
> > 0x3fa11e623eb8, \
> > rbp = 0x3fa11e623ee0 ---
> > panic: _sleep: curthread not running
> > cpuid = 4
> > time = 1650389478
> > KDB: stack backtrace:
> > 
> > One common trigger appears to be the use of firefox-99.0,2 from
> > the ports collection.  
> > 
> > -- 
> > Steve
> > 
> 
>  What version of drm are you using ?
>  Since when do you experience this ?
>  drm has not changed much for a long time now except adapting a few
> files for new linuxkpi additions.
> 

drm-current-kmod-5.4.144.g20220223
gpu-firmware-kmod-g20210330

I upgraded a Jan 2022 kernel+world+drm+gpu 2 to 3 weeks ago.
The Jan 2022 system just worked.  I've had the problem since
the upgrade.  I've also rebuilt firefox, libdrm, the X server,
and the X11 libraries.  I still see the panic.

As the panic messages scroll off the screen, I'm not sure the
last bit above is the actual cause or simply a side effect.

Some additional info from a dmesg after the reboot.


WARNING: / was not properly dismounted
[drm] radeon kernel modesetting enabled.
drmn0:  on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
[drm] initializing kernel modesetting (CAICOS 0x1002:0x6779 0x1092:0x6450 0x00).
[drm ERROR :radeon_atombios_init] Unable to find PCI I/O BAR; using MMIO for 
ATOM IIO
ATOM BIOS: C26401
drmn0: VRAM: 1024M 0x - 0x3FFF (1024M used)
drmn0: GTT: 1024M 0x4000 - 0x7FFF
[drm] Detected VRAM RAM=1024M, BAR=256M
[drm] RAM width 64bits DDR
[TTM] Zone  kernel: Available graphics memory: 8359708 KiB
[TTM] Zone   dma32: Available graphics memory: 2097152 KiB
[TTM] Initializing pool allocator
[drm] radeon: 1024M of VRAM memory ready
[drm] radeon: 1024M of GTT memory ready.
[drm] Loading CAICOS Microcode
drmn0: successfully loaded firmware image 'radeon/CAICOS_pfp.bin'
drmn0: successfully loaded firmware image 'radeon/CAICOS_me.bin'
drmn0: successfully loaded firmware image 'radeon/BTC_rlc.bin'
drmn0: successfully loaded firmware image 'radeon/CAICOS_mc.bin'
drmn0: successfully loaded firmware image 'radeon/CAICOS_smc.bin'
[drm] Internal thermal controller with fan control
[drm] radeon: dpm initialized
drmn0: successfully loaded firmware image 'radeon/SUMO_uvd.bin'
[drm] GART: num cpu pages 262144, num gpu pages 262144
[drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[drm] PCIE GART of 1024M enabled (table at 0x00162000).
drmn0: WB enabled
drmn0: fence driver on ring 0 use gpu addr 0x4c00 and cpu addr 
0x0xf8000be96c00
drmn0: fence driver 

Re: 'set but unused' breaks drm-*-kmod

2022-04-21 Thread Emmanuel Vadot
On Thu, 21 Apr 2022 08:51:26 -0400
Michael Butler  wrote:

> On 4/21/22 03:42, Emmanuel Vadot wrote:
> > 
> >   Hello Michael,
> > 
> > On Wed, 20 Apr 2022 23:39:12 -0400
> > Michael Butler  wrote:
> > 
> >> Seems this new requirement breaks kmod builds too ..
> >>
> >> The first of many errors was (I stopped chasing them all for lack of
> >> time) ..
> >>
> >> --- amdgpu_cs.o ---
> >> /usr/ports/graphics/drm-devel-kmod/work/drm-kmod-drm_v5.7.19_3/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1210:26:
> >> error: variable 'priority' set but not used
> >> [-Werror,-Wunused-but-set-variable]
> >>   enum drm_sched_priority priority;
> >>   ^
> >> 1 error generated.
> >> *** [amdgpu_cs.o] Error code 1
> >>
> > 
> >   How are you building the port, directly or with PORTS_MODULES ?
> >   I do make passes over the warnings for drm, and I did for the
> > set-but-not-used case, but unfortunately this option doesn't exist in
> > 13.0, so I couldn't apply those in every branch.
> 
> I build this directly on -current. I'm guessing that these are what 
> triggered this behaviour:
> 
> commit 8b83d7e0ee54416b0ee58bd85f9c0ae7fb3357a1
> Author: John Baldwin 
> Date:   Mon Apr 18 16:06:27 2022 -0700
> 
>  Make -Wunused-but-set-variable a fatal error for clang 13+ for 
> kernel builds.
> 
>  Reviewed by:imp, emaste
>  Differential Revision:  https://reviews.freebsd.org/D34949
> 
> commit 615d289ffefe2b175f80caa9b1e113c975576472
> Author: John Baldwin 
> Date:   Mon Apr 18 16:06:14 2022 -0700
> 
>  Re-enable set but not used warnings for kernel builds.
> 
>  make tinderbox now passes with this warning enabled as a fatal error,
>  so revert the change to hide it in preparation for making it fatal.
> 
>  This reverts commit e8e691983bb75e80153b802f47733f1531615fa2.
> 
>  Reviewed by:imp, emaste
>  Differential Revision:  https://reviews.freebsd.org/D34948
> 
> 

 Ok, I see.

 I won't have time until Monday (maybe Tuesday) to fix this, but if
someone wants to beat me to it, we should add some new CWARNFLAGS for
each problematic file in the 5.4-lts and 5.7-stable branches of
drm-kmod (master, which is following 5.10, is already good), only
if ${COMPILER_VERSION} >= 13.
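A hypothetical sketch of what that could look like in the drm-kmod
module Makefile; the file name and flag below are assumptions, not a
tested change, and COMPILER_VERSION is numeric (e.g. 130000 for
clang 13):

# silence the new fatal warning per problematic file, clang 13+ only
.if ${COMPILER_TYPE} == "clang" && ${COMPILER_VERSION} >= 130000
CWARNFLAGS.amdgpu_cs.c+=	-Wno-unused-but-set-variable
.endif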

 Cheers,

-- 
Emmanuel Vadot  



Re: nullfs and ZFS issues

2022-04-21 Thread Alexander Leidinger
Quoting Mateusz Guzik  (from Thu, 21 Apr 2022  
14:50:42 +0200):



On 4/21/22, Alexander Leidinger  wrote:

I tried nocache on a system with a lot of jails which use nullfs,
which showed very slow behavior in the daily periodic runs (12h runs
in the night after boot, 24h or more in subsequent nights). Now the
first nightly run after boot was finished after 4h.

What is the benefit of not disabling the cache in nullfs? I would
expect zfs (or ufs) to cache the (meta)data anyway.



does the poor performance show up with
https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?


I would like to have all the 22 jails run the periodic scripts a  
second night in a row before trying this.



if the long runs are still there, can you get some profiling from it?
sysctl -a before and after would be a start.

My guess is that you are hitting the vnode limit and bumping into the 1 second sleep.


That would explain the behavior I see since I added the last jail  
which seems to have crossed a threshold which triggers the slow  
behavior.


Current status (with the 112 nullfs mounts with nocache):
kern.maxvnodes:   10485760
kern.numvnodes:3791064
kern.freevnodes:   3613694
kern.cache.stats.heldvnodes:151707
kern.vnodes_created: 260288639

The maxvnodes value is already increased by 10 times compared to the  
default value on this system.


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF




Re: 'set but unused' breaks drm-*-kmod

2022-04-21 Thread Michael Butler

On 4/21/22 03:42, Emmanuel Vadot wrote:


  Hello Michael,

On Wed, 20 Apr 2022 23:39:12 -0400
Michael Butler  wrote:


Seems this new requirement breaks kmod builds too ..

The first of many errors was (I stopped chasing them all for lack of
time) ..

--- amdgpu_cs.o ---
/usr/ports/graphics/drm-devel-kmod/work/drm-kmod-drm_v5.7.19_3/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1210:26:
error: variable 'priority' set but not used
[-Werror,-Wunused-but-set-variable]
  enum drm_sched_priority priority;
  ^
1 error generated.
*** [amdgpu_cs.o] Error code 1



  How are you building the port, directly or with PORTS_MODULES ?
  I do make passes over the warnings for drm, and I did for the
set-but-not-used case, but unfortunately this option doesn't exist in
13.0, so I couldn't apply those in every branch.


I build this directly on -current. I'm guessing that these are what 
triggered this behaviour:


commit 8b83d7e0ee54416b0ee58bd85f9c0ae7fb3357a1
Author: John Baldwin 
Date:   Mon Apr 18 16:06:27 2022 -0700

Make -Wunused-but-set-variable a fatal error for clang 13+ for 
kernel builds.


Reviewed by:imp, emaste
Differential Revision:  https://reviews.freebsd.org/D34949

commit 615d289ffefe2b175f80caa9b1e113c975576472
Author: John Baldwin 
Date:   Mon Apr 18 16:06:14 2022 -0700

Re-enable set but not used warnings for kernel builds.

make tinderbox now passes with this warning enabled as a fatal error,
so revert the change to hide it in preparation for making it fatal.

This reverts commit e8e691983bb75e80153b802f47733f1531615fa2.

Reviewed by:imp, emaste
Differential Revision:  https://reviews.freebsd.org/D34948




Re: nullfs and ZFS issues

2022-04-21 Thread Mateusz Guzik
On 4/21/22, Alexander Leidinger  wrote:
> Quoting Doug Ambrisko  (from Wed, 20 Apr 2022
> 09:20:33 -0700):
>
>> On Wed, Apr 20, 2022 at 11:39:44AM +0200, Alexander Leidinger wrote:
>> | Quoting Doug Ambrisko  (from Mon, 18 Apr 2022
>> | 16:32:38 -0700):
>> |
>> | > With nullfs, nocache and settings max vnodes to a low number I can
>> |
>> | Where is nocache documented? I don't see it in mount_nullfs(8),
>> | mount(8) or nullfs(5).
>>
>> I didn't find it but it is in:
>>  src/sys/fs/nullfs/null_vfsops.c:  if (vfs_getopt(mp->mnt_optnew,
>> "nocache", NULL, NULL) == 0 ||
>>
>> Also some file systems disable it via MNTK_NULL_NOCACHE
>
> Does the attached diff look ok?
>
>> | I tried a nullfs mount with nocache and it doesn't show up in the
>> | output of "mount".
>>
>> Yep, I saw that as well.  I could tell by dropping into ddb and then
>> do a show mount on the FS and look at the count.  That is why I added
>> the vnode count to mount -v so I could see the usage without dropping
>> into ddb.
>
> I tried nocache on a system with a lot of jails which use nullfs,
> which showed very slow behavior in the daily periodic runs (12h runs
> in the night after boot, 24h or more in subsequent nights). Now the
> first nightly run after boot was finished after 4h.
>
> What is the benefit of not disabling the cache in nullfs? I would
> expect zfs (or ufs) to cache the (meta)data anyway.
>

does the poor performance show up with
https://people.freebsd.org/~mjg/vnlru_free_pick.diff ?

if the long runs are still there, can you get some profiling from it?
sysctl -a before and after would be a start.

My guess is that you are hitting the vnode limit and bumping into the 1 second sleep.
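A minimal sketch of the before/after capture suggested above, assuming
/var/tmp as a scratch location:

# snapshot sysctl state around the slow periodic run, then compare
sysctl -a > /var/tmp/sysctl.before
# ... let the nightly periodic run complete ...
sysctl -a > /var/tmp/sysctl.after
diff -u /var/tmp/sysctl.before /var/tmp/sysctl.after | less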

-- 
Mateusz Guzik 



Re: 'set but unused' breaks drm-*-kmod

2022-04-21 Thread Evilham



On dj., abr. 21 2022, Emmanuel Vadot wrote:


 Hello Michael,

On Wed, 20 Apr 2022 23:39:12 -0400
Michael Butler  wrote:


Seems this new requirement breaks kmod builds too ..

The first of many errors was (I stopped chasing them all for lack of
time) ..

--- amdgpu_cs.o ---
/usr/ports/graphics/drm-devel-kmod/work/drm-kmod-drm_v5.7.19_3/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1210:26:
error: variable 'priority' set but not used
[-Werror,-Wunused-but-set-variable]
 enum drm_sched_priority priority;
 ^
1 error generated.
*** [amdgpu_cs.o] Error code 1



 How are you building the port, directly or with PORTS_MODULES ?
 I do make passes over the warnings for drm, and I did for the
set-but-not-used case, but unfortunately this option doesn't exist in
13.0, so I couldn't apply those in every branch.

 Cheers,


Can confirm the breakage on 14-CURRENT building 
graphics/drm-devel-kmod in poudriere with matching sources and 
kernel.

Probably due to 8b83d7e0ee54416b0ee58bd85f9c0ae7fb3357a1

--
Evilham



Re: Daily black screen of death

2022-04-21 Thread Emmanuel Vadot


 Hello Steve,

On Tue, 19 Apr 2022 11:32:32 -0700
Steve Kargl  wrote:

> FYI,
> 
> I'm experiencing an almost daily black screen of death panic.
> Kernel, world, drm-current-kmod, and gpu-firmware-kmod were
> all rebuilt and installed at the same time.  Uname shows
> 
> FreeBSD 14.0-CURRENT #0 main-n254360-eb9d205fa69: Tue Apr 5 13:49:47 PDT 2022
> 
> So, April 5th sources.
> 
> The panic results in a keyboard lock and no dump.  The system
> does not have a serial console.  Only recourse is a hard reset.
> 
> Hand transcribed from photo
> 
> _sleep() at _sleep+0x38a/frame 0xfe012b7c0680
> buf_daemon_shutdown() at buf_daemon_shutdown+0x6b/frame 0xfe012b7c06a0
> kern_reboot() at kern_reboot+0x2ae/frame 0xfe012b7c06e0
> vpanic() at vpanic+0x1ee/frame 0xfe012b7c0730
> panic() at panic+0x43/frame 0xfe012b7c0790
> 
> Above repeats 100s of times, scrolling off the screen with an ever
> increasing frame pointer.
> 
> Final message,
> 
> mi_switch() at mi_switch+0x18e/frame 0xfe012b7c14b0
> __mtx_lock_sleep() at __mtx_lock_sleep+0x173/frame 0xfe012b7c1510
> __mtx_lock_flags() at __mtx_lock_flags+0xc0/frame 0xfe012b7c1550
> linux_wake_up() at linux_wake_up+0x38/frame 0xfe012b7c15a0
> radeon_fence_is_signaled() at radeon_fence_is_signaled+0x99/frame 
> 0xfe012b7c15f0
> dma_resv_add_shared_fence() at dma_resv_add_shared_fence+0x99/frame 
> 0xfe012b7c1640
> ttm_eu_fence_buffer_objects() at ttm_eu_fence_buffer_objects+0x79/frame 
> 0xfe012b7c1680
> radeon_cs_parser_fini() at radeon_cs_parser_fini+0x53/frame 0xfe012b7c16b0
> radeaon_cs_ioctl() at radeaon_cs_ioctl+0x75e/frame 0xfe012b7c1b30
> drm_ioctl_kernel() at drm_ioctl_kernel+0xc7/frame 0xfe012b7c1b80
> drm_ioctl() at drm_ioctl+0x2c3/frame 0xfe012b7c1c70
> linux_file_ioctl() at linux_file_ioctl+0x309/frame 0xfe012b7c1cd0
> kern_ioctl() at kern_ioctl+0x1dc/frame 0xfe012b7c1d40
> sys_ioctl() at sys_ioctl+0x121/frame 0xfe012b7c1e10
> amd64_syscall() at amd64_syscall+0x108/frame 0xfe012b7c1f30
> fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfe012b7c1f30
> --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x36a096c34ea, rsp = 
> 0x3fa11e623eb8, \
> rbp = 0x3fa11e623ee0 ---
> panic: _sleep: curthread not running
> cpuid = 4
> time = 1650389478
> KDB: stack backtrace:
> 
> One common trigger appears to be the use of firefox-99.0,2 from
> the ports collection.  
> 
> -- 
> Steve
> 

 What version of drm are you using ?
 Since when do you experience this ?
 drm has not changed much for a long time now except adapting a few
files for new linuxkpi additions.

 Cheers,

-- 
Emmanuel Vadot  



Re: 'set but unused' breaks drm-*-kmod

2022-04-21 Thread Emmanuel Vadot


 Hello Michael,

On Wed, 20 Apr 2022 23:39:12 -0400
Michael Butler  wrote:

> Seems this new requirement breaks kmod builds too ..
> 
> The first of many errors was (I stopped chasing them all for lack of 
> time) ..
> 
> --- amdgpu_cs.o ---
> /usr/ports/graphics/drm-devel-kmod/work/drm-kmod-drm_v5.7.19_3/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1210:26:
>  
> error: variable 'priority' set but not used 
> [-Werror,-Wunused-but-set-variable]
>  enum drm_sched_priority priority;
>  ^
> 1 error generated.
> *** [amdgpu_cs.o] Error code 1
> 

 How are you building the port, directly or with PORTS_MODULES ?
 I do make passes over the warnings for drm, and I did for the
set-but-not-used case, but unfortunately this option doesn't exist in
13.0, so I couldn't apply those in every branch.

 Cheers,

-- 
Emmanuel Vadot  



Re: nullfs and ZFS issues

2022-04-21 Thread Alexander Leidinger
Quoting Doug Ambrisko  (from Wed, 20 Apr 2022  
09:20:33 -0700):



On Wed, Apr 20, 2022 at 11:39:44AM +0200, Alexander Leidinger wrote:
| Quoting Doug Ambrisko  (from Mon, 18 Apr 2022
| 16:32:38 -0700):
|
| > With nullfs, nocache and settings max vnodes to a low number I can
|
| Where is nocache documented? I don't see it in mount_nullfs(8),
| mount(8) or nullfs(5).

I didn't find it but it is in:
	src/sys/fs/nullfs/null_vfsops.c:  if (vfs_getopt(mp->mnt_optnew,  
"nocache", NULL, NULL) == 0 ||


Also some file systems disable it via MNTK_NULL_NOCACHE


Does the attached diff look ok?


| I tried a nullfs mount with nocache and it doesn't show up in the
| output of "mount".

Yep, I saw that as well.  I could tell by dropping into ddb and then
do a show mount on the FS and look at the count.  That is why I added
the vnode count to mount -v so I could see the usage without dropping
into ddb.


I tried nocache on a system with a lot of jails which use nullfs,  
which showed very slow behavior in the daily periodic runs (12h runs  
in the night after boot, 24h or more in subsequent nights). Now the  
first nightly run after boot was finished after 4h.


What is the benefit of not disabling the cache in nullfs? I would  
expect zfs (or ufs) to cache the (meta)data anyway.


Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF
diff --git a/sbin/mount/mount.8 b/sbin/mount/mount.8
index 2a877c04c07..823df63953d 100644
--- a/sbin/mount/mount.8
+++ b/sbin/mount/mount.8
@@ -28,7 +28,7 @@
 .\" @(#)mount.8	8.8 (Berkeley) 6/16/94
 .\" $FreeBSD$
 .\"
-.Dd March 17, 2022
+.Dd April 21, 2022
 .Dt MOUNT 8
 .Os
 .Sh NAME
@@ -245,6 +245,9 @@ This file system should be skipped when
 is run with the
 .Fl a
 flag.
+.It Cm nocache
+Disable caching.
+Some filesystems may not support this.
 .It Cm noclusterr
 Disable read clustering.
 .It Cm noclusterw

