[slurm-users] Fwd: sreport cluster UserUtilizationByaccount Used result versus sreport job SizesByAccount or sacct: inconsistencies
---------- Forwarded message ---------
From: KK
Date: Mon, Apr 15, 2024, 13:25
Subject: sreport cluster UserUtilizationByAccount Used result versus sreport job SizesByAccount or sacct: inconsistencies
To:

I wish to ascertain the CPU core hours used by the users dj and dj1. I have tested with sreport cluster UserUtilizationByAccount, sreport job SizesByAccount, and sacct. It appears that sreport cluster UserUtilizationByAccount displays the total core hours used by the entire account, rather than the individual user's CPU time.

Here are the specifics: users dj and dj1 are both under the account mehpc. Between 2024-04-12 and 2024-04-15, dj1 used approximately 10 minutes of core time, while dj used about 4 minutes. However, "sreport Cluster UserUtilizationByAccount user=dj1 start=2024-04-12 end=2024-04-15" shows 14 minutes of usage, and "sreport Cluster UserUtilizationByAccount user=dj start=2024-04-12 end=2024-04-15" likewise shows 14 minutes (see the outputs below). Using "sreport job SizesByAccount Users=dj1 start=2024-04-12 end=2024-04-15" or

    sacct -u dj1 -S 2024-04-12 -E 2024-04-15 -o "jobid,partition,account,user,alloccpus,cputimeraw,state,workdir%60" -X | awk 'BEGIN{total=0}{total+=$6}END{print total}'

yields the accurate values, which are around 10 minutes for dj1.

Here are the details:

[root@ood-master ~]# sacctmgr list assoc format=cluster,user,account,qos
   Cluster       User    Account        QOS
---------- ---------- ---------- ----------
     mehpc                  root     normal
     mehpc       root       root     normal
     mehpc                 mehpc     normal
     mehpc         dj      mehpc     normal
     mehpc        dj1      mehpc     normal

[root@ood-master ~]# sacct -X -u dj1 -S 2024-04-12 -E 2024-04-15 -o jobid,ncpus,elapsedraw,cputimeraw
JobID         NCPUS ElapsedRaw CPUTimeRAW
------------ ------ ---------- ----------
4                 1         60         60
5                 2        120        240
6                 1         61         61
8                 2        120        240
9                 0          0          0

[root@ood-master ~]# sacct -X -u dj -S 2024-04-12 -E 2024-04-15 -o jobid,ncpus,elapsedraw,cputimeraw
JobID         NCPUS ElapsedRaw CPUTimeRAW
------------ ------ ---------- ----------
7                 2        120        240

[root@ood-master ~]# sreport job SizesByAccount Users=dj1 start=2024-04-12 end=2024-04-15
Job Sizes 2024-04-12T00:00:00 - 2024-04-14T23:59:59 (259200 secs)
Time reported in Minutes
  Cluster  Account 0-49 CPUs 50-249 CPUs 250-499 CPUs 500-999 CPUs >= 1000 CPUs % of cluster
--------- -------- --------- ----------- ------------ ------------ ------------ ------------
    mehpc     root        10           0            0            0            0      100.00%

[root@ood-master ~]# sreport job SizesByAccount Users=dj start=2024-04-12 end=2024-04-15
Job Sizes 2024-04-12T00:00:00 - 2024-04-14T23:59:59 (259200 secs)
Time reported in Minutes
  Cluster  Account 0-49 CPUs 50-249 CPUs 250-499 CPUs 500-999 CPUs >= 1000 CPUs % of cluster
--------- -------- --------- ----------- ------------ ------------ ------------ ------------
    mehpc     root         4           0            0            0            0      100.00%

[root@ood-master ~]# sreport Cluster UserUtilizationByAccount user=dj1 start=2024-04-12 end=2024-04-15
Cluster/User/Account Utilization 2024-04-12T00:00:00 - 2024-04-14T23:59:59 (259200 secs)
Usage reported in CPU Minutes
  Cluster     Login  Proper Name   Account      Used    Energy
--------- --------- ------------ --------- --------- ---------
    mehpc       dj1          dj1     mehpc        14         0

[root@ood-master ~]# sreport Cluster UserUtilizationByAccount user=dj start=2024-04-12 end=2024-04-15
Cluster/User/Account Utilization 2024-04-12T00:00:00 - 2024-04-14T23:59:59 (259200 secs)
Usage reported in CPU Minutes
  Cluster     Login  Proper Name   Account      Used    Energy
--------- --------- ------------ --------- --------- ---------
    mehpc        dj           dj     mehpc        14         0

[root@ood-master ~]# sacct -u dj1 -S 2024-04-12 -E 2024-04-15 -o "jobid,partition,account,user,allocc
[slurm-users] Re: Slurm.conf and workers
Xaver,

If you look at your slurmctld log, you will likely see messages about each node's slurm.conf not being the same as the one on the master. So yes, it can work temporarily, but unless some very specific settings are in place, issues will arise.

In the state you are in now, you will want to sync the config across all nodes and then run 'scontrol reconfigure'.

You may also want to look into configless mode (sketched below) if you can set DNS entries and your config is basically monolithic, or all of its parts are in /etc/slurm/.

Brian Andrus

On 4/15/2024 2:55 AM, Xaver Stiensmeier via slurm-users wrote:
> [...]
[slurm-users] Re: Munge log-file fills up the file system to 100%
https://github.com/dun/munge/issues/94

The NEWS file claims this was fixed in 0.5.15. Since your log doesn't show the additional strerror() output, you're definitely running an older version, correct?

If you go onto one of the affected nodes and run `lsof -p <pid of munged>`, I'm betting you'll find a long list of open file descriptors. That would explain the "Too many open files" situation _and_ indicate that this is something other than external memory pressure or open-file limits on the process.

> On Apr 15, 2024, at 08:14, Ole Holm Nielsen via slurm-users wrote:
> [...]
[slurm-users] Munge log-file fills up the file system to 100%
We have some new AMD EPYC compute nodes with 96 cores/node running RockyLinux 8.9. We've had a number of incidents where the Munge log-file /var/log/munge/munged.log suddenly fills up the root file system, after a while to 100% (tens of GBs), and the node eventually comes to a grinding halt! Wiping munged.log and restarting the node works around the issue.

I've tried to track down the symptoms, and this is what I found:

1. In munged.log there are infinitely many lines filling up the disk:

   2024-04-11 09:59:29 +0200 Info: Suspended new connections while processing backlog

2. The slurmd is not getting any responses from munged, even though we run "munged --num-threads 10". The slurmd.log displays errors like:

   [2024-04-12T02:05:45.001] error: If munged is up, restart with --num-threads=10
   [2024-04-12T02:05:45.001] error: Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable
   [2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: auth_g_create: RESPONSE_ACCT_GATHER_UPDATE has authentication error

3. /var/log/messages displays the errors from slurmd, as well as NetworkManager saying "Too many open files in system". The telltale syslog entry seems to be:

   Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached

   where the limit is confirmed in /proc/sys/fs/file-max.

We have never before seen any such errors from Munge. The error may be triggered by certain user codes (possibly Star-CCM+) that open far more files on the 96-core nodes than on nodes with a lower core count.

My workaround has been to edit this line in /etc/sysctl.conf:

   fs.file-max = 131072

and update the settings with "sysctl -p". We haven't seen any of the Munge errors since!

The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer version at https://github.com/dun/munge/releases/tag/munge-0.5.16 and I can't figure out whether 0.5.16 contains a fix for the issue seen here.

Questions: Have other sites seen this Munge issue as well? Are there any good recommendations for setting the fs.file-max parameter on Slurm compute nodes?

Thanks for sharing your insights,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
[slurm-users] Re: Interfaces of topology/tree and Topology Awareness
I know this isn't a developer forum, but I don't really know where else to ask, and I've had no luck on Stack Overflow. Is there no input on this?
[slurm-users] Slurm.conf and workers
Dear slurm-user list,

As far as I understand it, slurm.conf needs to be present on the master and on the workers (if no other path is set via SLURM_CONF). However, I noticed that after adding a partition only in the master's slurm.conf, all workers were able to "correctly" show the added partition when calling sinfo on them.

Is the stored slurm.conf on every instance just a fallback for when the connection is down, or what is its purpose? The documentation only says: "This file should be consistent across all nodes in the cluster."

Best regards,
Xaver