Hi Jeffrey,

Thanks a lot for the information:

On 4/15/24 15:40, Jeffrey T Frey wrote:
https://github.com/dun/munge/issues/94

I hadn't seen issue #94 before, but it seems relevant to our problem. It's probably a good idea to upgrade Munge beyond what's supplied by EL8/EL9. We can build the latest 0.5.16 RPMs with:

wget https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz
rpmbuild -ta munge-0.5.16.tar.xz
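
On EL systems the built packages should land under ~/rpmbuild/RPMS/<arch>/ by default. A minimal sketch of installing them (the exact subpackage names depend on the spec file, so treat the glob as an assumption):

# assuming the default rpmbuild topdir and x86_64 nodes
dnf install ~/rpmbuild/RPMS/x86_64/munge*.rpm
systemctl restart munge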

I've now updated my Slurm Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#munge-authentication-service accordingly.

The NEWS file claims this was fixed in 0.5.15.  Since your log doesn't show
the additional strerror() output, you're definitely running an older
version, correct?

Correct, we run munge 0.5.13 as supplied by EL8 (RockyLinux 8.9).

If you go on one of the affected nodes and do an `lsof -p <munged-pid>`,
I'm betting you'll find a long list of open file descriptors. That would
explain the "Too many open files" situation _and_ indicate that this is
something other than external memory pressure or open file limits on the
process.

Actually, munged normally works without holding many open files: running "lsof -p `pidof munged`" across the entire partition shows an open file count of only 29 per munged process. I currently don't have any broken nodes with a full file system that I can examine.
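
For reference, spot-checking the munged file-descriptor count on all nodes could look like this (a sketch assuming ClusterShell is installed; node[001-096] is a placeholder node list, and the count includes lsof's header line):

# -b collates identical output across nodes
clush -bw node[001-096] 'lsof -p $(pidof munged) | wc -l'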

Therefore I believe the root cause of the present issue is user applications opening a lot of files on our 96-core nodes, so we need to increase fs.file-max, and also upgrade Munge so that the log file doesn't grow without bounds.
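
If the situation recurs, a per-process file-descriptor count should reveal which application is hoarding them. A rough sketch using /proc (run as root, since other users' fd directories aren't readable otherwise):

# print "fd-count command" for every process, highest counts first
for p in /proc/[0-9]*; do
  printf '%s %s\n' "$(ls "$p/fd" 2>/dev/null | wc -l)" "$(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head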

I'd still like to hear whether anyone has good recommendations for setting the fs.file-max parameter on Slurm compute nodes.

Thanks,
Ole

On Apr 15, 2024, at 08:14, Ole Holm Nielsen via slurm-users 
<slurm-users@lists.schedmd.com> wrote:

We have some new AMD EPYC compute nodes with 96 cores/node running
RockyLinux 8.9.  We've had a number of incidents where the Munge log file
/var/log/munge/munged.log suddenly fills up the root file system,
eventually reaching 100% (tens of GB), and the node comes to a grinding
halt!  Wiping munged.log and restarting the node works around the issue.
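
Until the root cause is fixed, capping the log size with logrotate would at least keep the disk from filling. A sketch (the file name /etc/logrotate.d/munge is just an example; copytruncate is used on the assumption that munged keeps the log file open):

# hypothetical /etc/logrotate.d/munge - caps munged.log at ~100 MB
/var/log/munge/munged.log {
    size 100M
    rotate 2
    compress
    missingok
    copytruncate
}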

I've tried to track down the symptoms and this is what I found:

1. munged.log fills up the disk with an endless stream of lines like:

   2024-04-11 09:59:29 +0200 Info:      Suspended new connections while processing backlog

2. slurmd is not getting any responses from munged, even though we run
   "munged --num-threads 10".  The slurmd.log displays errors like:

   [2024-04-12T02:05:45.001] error: If munged is up, restart with --num-threads=10
   [2024-04-12T02:05:45.001] error: Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable
   [2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: auth_g_create: RESPONSE_ACCT_GATHER_UPDATE has authentication error

3. The /var/log/messages displays the errors from slurmd as well as
   NetworkManager saying "Too many open files in system".
   The telltale syslog entry seems to be:

   Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached

   where the limit is confirmed in /proc/sys/fs/file-max.
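
Current usage can be compared against that limit via /proc/sys/fs/file-nr, which reports the number of allocated file handles, the number allocated but unused, and the limit:

cat /proc/sys/fs/file-nr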

We have never before seen such errors from Munge.  The errors may be
triggered by certain user codes (possibly STAR-CCM+) that open far more
files on the 96-core nodes than on nodes with a lower core count.

My workaround has been to add this line to /etc/sysctl.conf:

fs.file-max = 131072

and reload the settings with "sysctl -p".  We haven't seen any of the
Munge errors since!
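
The same setting could also live in a drop-in file, which survives package updates of /etc/sysctl.conf (the file name here is just an example):

# example drop-in; the name is arbitrary
echo 'fs.file-max = 131072' > /etc/sysctl.d/90-file-max.conf
sysctl --system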

The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer
version at https://github.com/dun/munge/releases/tag/munge-0.5.16
I can't figure out whether 0.5.16 has a fix for the issue seen here.

Questions: Have other sites seen the present Munge issue as well?  Are there 
any good recommendations for setting the fs.file-max parameter on Slurm compute 
nodes?

Thanks for sharing your insights,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
