[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-19 Thread Ole Holm Nielsen via slurm-users
It turns out that the Slurm job limits are *not* controlled by the normal /etc/security/limits.conf configuration. Any service running under systemd (such as slurmd) has its limits defined by systemd; see [1] and [2]. The limits of processes started by slurmd are defined by LimitXXX in
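Whichever mechanism sets the limit, the value a running process actually received can be read from /proc/<pid>/limits. A minimal sketch, assuming Linux; the helper name nofile_limits is made up for illustration and is not part of Slurm or systemd:

```shell
# Print the soft and hard "Max open files" limits from a /proc/<pid>/limits file.
# nofile_limits is a hypothetical helper name, not a Slurm or systemd tool.
nofile_limits() {
    limits_file="${1:-/proc/self/limits}"
    # The relevant line looks like: "Max open files  1024  1048576  files"
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }' "$limits_file"
}
```

For example, `nofile_limits "/proc/$$/limits"` reports the limits of the current shell, and something like `nofile_limits "/proc/$(pgrep -o slurmd)/limits"` should do the same for the slurmd daemon, assuming pgrep is available on the node.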

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-18 Thread Ole Holm Nielsen via slurm-users
I looked at some of our busy 96-core nodes where users are currently running the STAR-CCM+ CFD software. One job runs on 4 96-core nodes. I'm amazed that each STAR-CCM+ process has almost 1000 open files, for example:

$ lsof -p 440938 | wc -l
950

and that on this node the user has
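Note that the lsof output also contains a header line and entries that are not file descriptors (current directory, program text, memory-mapped libraries), so the line count is somewhat higher than the number of descriptors that count against the limits. A minimal sketch for counting descriptors only, assuming Linux /proc; count_fds is a hypothetical helper name:

```shell
# Count the open file descriptors of one process by listing /proc/<pid>/fd (Linux only).
# count_fds is a hypothetical helper name used for illustration.
count_fds() {
    pid="$1"
    # Each entry in /proc/<pid>/fd is one open descriptor. A missing or unreadable
    # directory yields 0. tr strips any padding that wc may add.
    ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' '
}
```

Usage would be `count_fds 440938` for the process above. Reading the descriptors of another user's process normally requires root.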

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-17 Thread Bjørn-Helge Mevik via slurm-users
Jeffrey T Frey via slurm-users writes:

>> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
>> is per user.
>
> The ulimit is a frontend to rusage limits, which are per-process restrictions
> (not per-user).

You are right; I sit corrected. :) (Except for number of procs
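The two kinds of limit discussed here can be inspected side by side: `ulimit -Sn` and `ulimit -Hn` print the soft and hard per-process limits of the calling shell, while /proc/sys/fs/file-nr holds the kernel-wide count of allocated file handles together with the fs.file-max value. A minimal sketch, assuming Linux; file_handle_usage is a hypothetical helper name:

```shell
# Summarize kernel-wide file handle usage from a file-nr style file.
# file_handle_usage is a hypothetical helper name used for illustration.
file_handle_usage() {
    # Fields: allocated handles, allocated-but-unused handles, system-wide maximum (fs.file-max)
    read -r allocated unused max < "${1:-/proc/sys/fs/file-nr}"
    echo "allocated=$allocated max=$max"
}
```

Calling `file_handle_usage` without arguments reads the live kernel values. If allocated approaches max, the node as a whole is running out of file handles, regardless of what any single process's `ulimit -n` says.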

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Jason Simms via slurm-users
As a related point, for this reason I mount /var/log separately from /. Ask me how I learned that lesson...

Jason

On Tue, Apr 16, 2024 at 8:43 AM Jeffrey T Frey via slurm-users <slurm-users@lists.schedmd.com> wrote:

> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Jeffrey T Frey via slurm-users
> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.

The ulimit is a frontend to rusage limits, which are per-process restrictions (not per-user). The fs.file-max is the kernel's limit on how many file descriptors can be open in aggregate. You'd have to edit

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Bjørn-Helge Mevik via slurm-users
Ole Holm Nielsen writes:

> Hi Bjørn-Helge,
>
> That sounds interesting, but which limit might affect the kernel's
> fs.file-max? For example, a user already has a narrow limit:
>
> ulimit -n
> 1024

AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n" is per user. Now that I

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Ole Holm Nielsen via slurm-users
Hi Bjørn-Helge,

On 4/16/24 12:08, Bjørn-Helge Mevik via slurm-users wrote:
> Ole Holm Nielsen via slurm-users writes:
>> Therefore I believe that the root cause of the present issue is user
>> applications opening a lot of files on our 96-core nodes, and we need
>> to increase fs.file-max.
>
> You could

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Bjørn-Helge Mevik via slurm-users
Ole Holm Nielsen via slurm-users writes:

> Therefore I believe that the root cause of the present issue is user
> applications opening a lot of files on our 96-core nodes, and we need
> to increase fs.file-max.

You could also set a limit per user, for instance in /etc/security/limits.d/. Then

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Ole Holm Nielsen via slurm-users
Hi Jeffrey,

Thanks a lot for the information:

On 4/15/24 15:40, Jeffrey T Frey wrote:
> https://github.com/dun/munge/issues/94

I hadn't seen issue #94 before, and it seems to be relevant to our problem. It's probably a good idea to upgrade munge beyond what's supplied by EL8/EL9. We can

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-15 Thread Jeffrey T Frey via slurm-users
https://github.com/dun/munge/issues/94

The NEWS file claims this was fixed in 0.5.15. Since your log doesn't show the additional strerror() output, you're definitely running an older version, correct? If you go on one of the affected nodes and do an `lsof -p ` I'm betting you'll find a long
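Where lsof is not installed, a similar picture can be had from /proc. The following is a minimal sketch, assuming Linux; fd_targets is a hypothetical helper name, and it simply groups the descriptors of one process by what they point to:

```shell
# List what the open descriptors of a process point to, most frequent targets first.
# fd_targets is a hypothetical helper name; Linux /proc is assumed.
fd_targets() {
    pid="$1"
    # Every entry in /proc/<pid>/fd is a symlink to the file, socket or pipe behind it.
    for fd in "/proc/$pid/fd"/*; do
        readlink "$fd" 2>/dev/null
    done | sort | uniq -c | sort -rn | head -n "${2:-20}"
}
```

Run as root with the PID of the process under suspicion, for example `fd_targets "$(pgrep -o munged)"`, assuming the munge daemon runs under the name munged and pgrep is available. The optional second argument limits the number of lines shown.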