[slurm-users] Re: Redirect jobs submitted to old partition to new
For jobs already in default_queue:

  squeue -t pd -h --Format=jobID | xargs -L1 -I{} scontrol update jobID={} partition=queue1

What version of Slurm are you running? In Slurm 23.02.5, man slurm.conf under PARTITION CONFIGURATION says:

  Alternate
    Partition name of alternate partition to be used if the state of this partition is "DRAIN" or "INACTIVE".

-----Original Message-----
From: wdennis--- via slurm-users
Sent: Tuesday, April 16, 2024 9:48 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Redirect jobs submitted to old partition to new

Hi all,

I have a single-partition Slurm cluster (the single partition being named "default_queue") on which I now want to implement multiple queues to subdivide the resources. Say the new default queue is "queue1": should I set the old "default_queue" to `State=INACTIVE` and then use `Alternate=queue1` on it, so that jobs sent to "default_queue" end up on "queue1"?

I was thinking it would be nice to have an AltPartitionName= construct to handle this... (there must be a reason this doesn't exist).
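For illustration, a minimal slurm.conf sketch of the State=INACTIVE plus Alternate= approach discussed above (the node list and the Default=YES line are assumptions, not the poster's actual configuration):

  # Old partition: no new submissions accepted; they are redirected to queue1
  PartitionName=default_queue Nodes=node[001-010] State=INACTIVE Alternate=queue1
  # New default partition
  PartitionName=queue1 Nodes=node[001-010] Default=YES State=UP

Jobs already pending in default_queue still need to be moved explicitly, e.g. with the scontrol loop shown above.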
[slurm-users] Redirect jobs submitted to old partition to new
Hi all,

I have a single-partition Slurm cluster (the single partition being named "default_queue") on which I now want to implement multiple queues to subdivide the resources. Say the new default queue is "queue1": should I set the old "default_queue" to `State=INACTIVE` and then use `Alternate=queue1` on it, so that jobs sent to "default_queue" end up on "queue1"?

I was thinking it would be nice to have an AltPartitionName= construct to handle this... (there must be a reason this doesn't exist).
[slurm-users] Slurm version 23.11.6 is now available
We are pleased to announce the availability of Slurm version 23.11.6.

The 23.11.6 release includes fixes for two different problems with the priority/multifactor plugin: a crash, and a miscalculation of AssocGrpCPURunMinutes after a slurmctld reconfiguration/restart. The wsrep_on errors seen by sites running MySQL or older MariaDB should now happen much less frequently, and a clarifying statement is logged when the error is innocuous.

Slurm can be downloaded from https://www.schedmd.com/downloads.php .

-Marshall

* Changes in Slurm 23.11.6
==========================
 -- Avoid limiting sockets per node to one when using gres enforce-binding.
 -- slurmrestd - Avoid permission denied errors when attempting to listen on the same port multiple times.
 -- Fix GRES reservations where the GRES has no topology (no cores= in gres.conf).
 -- Ensure that thread_id_rpc is gone before priority_g_fini().
 -- Fix scontrol reboot timeout removing drain state from nodes.
 -- squeue - Print header on empty response to `--only-job-state`.
 -- Fix slurmrestd not ending job properly when xauth is not present and an x11 job is sent.
 -- Add experimental job state caching with SchedulerParameters=enable_job_state_cache to speed up querying job states with squeue --only-job-state.
 -- slurmrestd - Correct dumping of invalid ArrayJobIds returned from 'GET /slurm/v0.0.40/jobs/state'.
 -- squeue - Correct dumping of invalid ArrayJobIds returned from `squeue --only-job-state --{json|yaml}`.
 -- If scancel --ctld is not used with --interactive, --sibling, or specific step ids, then this option issues a single request to the slurmctld to signal all jobs matching the specified filters. This greatly improves the performance of slurmctld and scancel. The updated --ctld option also fixes issues with the --partition or --reservation scancel options for jobs that requested multiple partitions or reservations.
 -- slurmrestd - Give EINVAL error when failing to parse signal name to numeric signal.
 -- slurmrestd - Allow ContentBody for all methods per RFC7230 even if ignored.
 -- slurmrestd - Add 'DELETE /slurm/v0.0.40/jobs' endpoint to allow bulk job signaling via slurmctld.
 -- Fix combination of --nodelist and --exclude not always respecting the excluded node list.
 -- Fix jobs incorrectly allocating nodes exclusively when started on a partition that doesn't enforce it. This could happen if a multi-partition job doesn't specify --exclusive and is evaluated first on a partition configured with OverSubscribe=EXCLUSIVE but ends up starting in a partition configured with OverSubscribe!=EXCLUSIVE evaluated afterwards.
 -- Setting GLOB_SILENCE flag no longer exposes old bugged behavior.
 -- Fix associations AssocGrpCPURunMinutes being incorrectly computed for running jobs after a controller reconfiguration/restart.
 -- Fix scheduling jobs that request --gpus and nodes have different node weights and different numbers of gpus.
 -- slurmrestd - Add "NO_CRON_JOBS" as possible flag value to the following:
    'DELETE /slurm/v0.0.40/jobs' flags field.
    'DELETE /slurm/v0.0.40/job/{job_id}?flags=' flags query parameter.
 -- Fix scontrol segfault/assert failure if the TRESPerNode parameter is used when creating reservations.
 -- Avoid checking for wsrep_on when restoring streaming replication settings.
 -- Clarify in the logs that error "1193 Unknown system variable 'wsrep_on'" is innocuous.
 -- accounting_storage/mysql - Fix problem when loading reservations from an archive dump.
 -- slurmdbd - Fix minor race condition when sending updates to a shutdown slurmctld.
 -- slurmctld - Fix invalid refusal of a reservation update.
 -- openapi - Fix memory leak of /meta/slurm/cluster response field.
 -- Fix memory leak when using auth/slurm and AuthInfo=use_client_ids.

-- 
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
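For illustration, a brief sketch of two of the changes listed above, the bulk-signal path in scancel --ctld and the experimental job state cache (the user and partition names here are illustrative assumptions):

  # Signal all of one user's pending jobs in a partition with a single request to slurmctld
  scancel --ctld --user=alice --partition=queue1 --state=PENDING

  # slurm.conf: enable the experimental cache, then query job states cheaply
  SchedulerParameters=enable_job_state_cache
  squeue --only-job-state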
[slurm-users] Re: Munge log-file fills up the file system to 100%
As a related point, for this reason I mount /var/log separately from /. Ask me how I learned that lesson...

Jason

On Tue, Apr 16, 2024 at 8:43 AM Jeffrey T Frey via slurm-users <slurm-users@lists.schedmd.com> wrote:

> > AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> > is per user.
>
> The ulimit is a frontend to rusage limits, which are per-process
> restrictions (not per-user).
>
> The fs.file-max is the kernel's limit on how many file descriptors can be
> open in aggregate. You'd have to edit that with sysctl:
>
> $ sysctl fs.file-max
> fs.file-max = 26161449
>
> Check in e.g. /etc/sysctl.conf or /etc/sysctl.d if you have an alternative
> limit versus the default.
>
> > But if you have ulimit -n == 1024, then no user should be able to hit
> > the fs.file-max limit, even if it is 65536. (Technically, 96 jobs from
> > 96 users each trying to open 1024 files would do it, though.)
>
> Naturally, since the ulimit is per-process the equating of core count with
> the multiplier isn't valid. It also assumes Slurm isn't set up to
> oversubscribe CPU resources :-)
>
> > > I'm not sure how the number 3092846 got set, since it's not defined in
> > > /etc/security/limits.conf. The "ulimit -u" varies quite a bit among
> > > our compute nodes, so which dynamic service might affect the limits?
>
> If the 1024 is a soft limit, you may have users who are raising it to
> arbitrary values themselves, for example. Especially as 1024 is somewhat
> low for the more naively-written data science Python code I see on our
> systems. If Slurm is configured to propagate submission shell ulimits to
> the runtime environment and you allow submission from a variety of
> nodes/systems you could be seeing myriad limits reconstituted on the
> compute node despite the /etc/security/limits.conf settings.
>
> The main question needing an answer is _what_ process(es) are opening all
> the files on your systems that are faltering. It's very likely to be user
> jobs' opening all of them, I was just hoping to also rule out any bug in
> munged. Since you're upgrading munged, you'll now get the errno associated
> with the backlog and can confirm EMFILE vs. ENFILE vs. ENOMEM.

-- 
Jason L. Simms, Ph.D., M.P.H.
Instructor, Department of Languages & Literary Studies
Lafayette College
Pardee Hall | One Pardee Dr, 4th Fl | Easton, PA 18042
Office: Pardee 405
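For illustration, a minimal /etc/fstab sketch of the separate /var/log mount mentioned above (the device path, filesystem type, and mount options are assumptions):

  /dev/mapper/vg_system-varlog   /var/log   xfs   defaults,nodev,nosuid,noexec   0 2

With /var/log on its own filesystem, a runaway log file can fill only that volume instead of the root filesystem.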
[slurm-users] Re: Munge log-file fills up the file system to 100%
> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.

The ulimit is a frontend to rusage limits, which are per-process restrictions (not per-user).

The fs.file-max is the kernel's limit on how many file descriptors can be open in aggregate. You'd have to edit that with sysctl:

$ sysctl fs.file-max
fs.file-max = 26161449

Check in e.g. /etc/sysctl.conf or /etc/sysctl.d if you have an alternative limit versus the default.

> But if you have ulimit -n == 1024, then no user should be able to hit
> the fs.file-max limit, even if it is 65536. (Technically, 96 jobs from
> 96 users each trying to open 1024 files would do it, though.)

Naturally, since the ulimit is per-process the equating of core count with the multiplier isn't valid. It also assumes Slurm isn't set up to oversubscribe CPU resources :-)

>> I'm not sure how the number 3092846 got set, since it's not defined in
>> /etc/security/limits.conf. The "ulimit -u" varies quite a bit among
>> our compute nodes, so which dynamic service might affect the limits?

If the 1024 is a soft limit, you may have users who are raising it to arbitrary values themselves, for example. Especially as 1024 is somewhat low for the more naively-written data science Python code I see on our systems. If Slurm is configured to propagate submission shell ulimits to the runtime environment and you allow submission from a variety of nodes/systems you could be seeing myriad limits reconstituted on the compute node despite the /etc/security/limits.conf settings.

The main question needing an answer is _what_ process(es) are opening all the files on your systems that are faltering. It's very likely to be user jobs' opening all of them, I was just hoping to also rule out any bug in munged. Since you're upgrading munged, you'll now get the errno associated with the backlog and can confirm EMFILE vs. ENFILE vs. ENOMEM.
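For reference, a quick way to compare the per-process and system-wide limits discussed above and to see current aggregate usage (a small sketch; on Linux, /proc/sys/fs/file-nr reports allocated, unused, and maximum file handles):

  ulimit -Sn; ulimit -Hn     # per-process soft and hard open-file limits
  sysctl fs.file-max         # system-wide maximum number of file handles
  cat /proc/sys/fs/file-nr   # currently allocated, unused, and maximum handles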
[slurm-users] Re: Munge log-file fills up the file system to 100%
Ole Holm Nielsen writes:

> Hi Bjørn-Helge,
>
> That sounds interesting, but which limit might affect the kernel's
> fs.file-max? For example, a user already has a narrow limit:
>
> ulimit -n
> 1024

AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n" is per user.

Now that I think of it, fs.file-max of 65536 seems *very* low. On our CentOS-7-based clusters, we have in the order of tens of millions, and on our Rocky 9 based clusters, we have 9223372036854775807(!)

Also a per-user limit of 1024 seems low to me; I think we have in the order of 200K files per user on most clusters.

But if you have ulimit -n == 1024, then no user should be able to hit the fs.file-max limit, even if it is 65536. (Technically, 96 jobs from 96 users each trying to open 1024 files would do it, though.)

> whereas the permitted number of user processes is a lot higher:
>
> ulimit -u
> 3092846

I guess any process will have a few open files, which I believe count against the ulimit -n for each user (and fs.file-max).

> I'm not sure how the number 3092846 got set, since it's not defined in
> /etc/security/limits.conf. The "ulimit -u" varies quite a bit among
> our compute nodes, so which dynamic service might affect the limits?

There is a vague thing in my head saying that I've looked for this before, and found that the default value depended on the size of the RAM of the machine. But the vague thing might of course be lying to me. :)

-- 
Bjørn-Helge
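If the RAM-dependent kernel default mentioned above is suspected, a small check sketch (the exact relationship between these values depends on the kernel and distribution, so treat this as a starting point to verify rather than a rule):

  sysctl kernel.threads-max   # kernel-wide thread limit, sized from available RAM at boot
  ulimit -u                   # maximum number of user processes reported by the shell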
[slurm-users] Re: Munge log-file fills up the file system to 100%
Hi Bjørn-Helge,

On 4/16/24 12:08, Bjørn-Helge Mevik via slurm-users wrote:
> Ole Holm Nielsen via slurm-users writes:
>> Therefore I believe that the root cause of the present issue is user
>> applications opening a lot of files on our 96-core nodes, and we need
>> to increase fs.file-max.
>
> You could also set a limit per user, for instance in
> /etc/security/limits.d/. Then users would be blocked from opening
> unreasonably many files. One could use this to find which applications
> are responsible, and try to get them fixed.

That sounds interesting, but which limit might affect the kernel's fs.file-max? For example, a user already has a narrow limit:

ulimit -n
1024

whereas the permitted number of user processes is a lot higher:

ulimit -u
3092846

I'm not sure how the number 3092846 got set, since it's not defined in /etc/security/limits.conf. The "ulimit -u" varies quite a bit among our compute nodes, so which dynamic service might affect the limits?

Perhaps there is a recommendation for defining nproc in /etc/security/limits.conf on compute nodes?

Thanks,
Ole
[slurm-users] Re: Munge log-file fills up the file system to 100%
Ole Holm Nielsen via slurm-users writes:

> Therefore I believe that the root cause of the present issue is user
> applications opening a lot of files on our 96-core nodes, and we need
> to increase fs.file-max.

You could also set a limit per user, for instance in /etc/security/limits.d/. Then users would be blocked from opening unreasonably many files. One could use this to find which applications are responsible, and try to get them fixed.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
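For illustration, a minimal sketch of such a per-user limit as a drop-in file, e.g. /etc/security/limits.d/90-nofile.conf (the file name and the values are assumptions to be adapted to local workloads):

  # <domain>  <type>  <item>   <value>
  *           soft    nofile   4096
  *           hard    nofile   65536

Whether batch jobs actually inherit these values depends on how slurmd launches tasks and on Slurm's PropagateResourceLimits settings, so it is worth verifying inside a test job with "ulimit -n".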
[slurm-users] Re: Munge log-file fills up the file system to 100%
Hi Jeffrey,

Thanks a lot for the information:

On 4/15/24 15:40, Jeffrey T Frey wrote:
> https://github.com/dun/munge/issues/94

I hadn't seen issue #94 before, and it seems to be relevant to our problem. It's probably a good idea to upgrade munge beyond what's supplied by EL8/EL9. We can build the latest 0.5.16 RPMs by:

wget https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz
rpmbuild -ta munge-0.5.16.tar.xz

I've updated my Slurm Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#munge-authentication-service accordingly now.

> The NEWS file claims this was fixed in 0.5.15. Since your log doesn't show
> the additional strerror() output you're definitely running an older
> version, correct?

Correct, we run munge 0.5.13 as supplied by EL8 (RockyLinux 8.9).

> If you go on one of the affected nodes and do an `lsof -p ` I'm betting
> you'll find a long list of open file descriptors — that would explain the
> "Too many open files" situation _and_ indicate that this is something
> other than external memory pressure or open file limits on the process.

Actually, munged is normally working without too many open files as seen by "lsof -p `pidof munged`" over the entire partition, where the munged open file count is only 29. I currently don't have any broken nodes with a full file system that I can examine.

Therefore I believe that the root cause of the present issue is user applications opening a lot of files on our 96-core nodes, and we need to increase fs.file-max. And upgrade munge as well to avoid the log file growing without bounds.

I'd still like to know if anyone has good recommendations for setting the fs.file-max parameter on Slurm compute nodes?

Thanks,
Ole

On Apr 15, 2024, at 08:14, Ole Holm Nielsen via slurm-users wrote:
> We have some new AMD EPYC compute nodes with 96 cores/node running
> RockyLinux 8.9. We've had a number of incidents where the Munge log-file
> /var/log/munge/munged.log suddenly fills up the root file system, after a
> while to 100% (tens of GBs), and the node eventually comes to a grinding
> halt! Wiping munged.log and restarting the node works around the issue.
>
> I've tried to track down the symptoms and this is what I found:
>
> 1. In munged.log there are infinitely many lines filling up the disk:
>
> 2024-04-11 09:59:29 +0200 Info: Suspended new connections while processing backlog
>
> 2. The slurmd is not getting any responses from munged, even though we run
> "munged --num-threads 10". The slurmd.log displays errors like:
>
> [2024-04-12T02:05:45.001] error: If munged is up, restart with --num-threads=10
> [2024-04-12T02:05:45.001] error: Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable
> [2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: auth_g_create: RESPONSE_ACCT_GATHER_UPDATE has authentication error
>
> 3. The /var/log/messages displays the errors from slurmd as well as
> NetworkManager saying "Too many open files in system". The telltale syslog
> entry seems to be:
>
> Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached
>
> where the limit is confirmed in /proc/sys/fs/file-max.
>
> We have never before seen any such errors from Munge. The error may
> perhaps be triggered by certain user codes (possibly star-ccm+) that might
> be opening a lot more files on the 96-core nodes than on nodes with a
> lower core count.
>
> My workaround has been to edit the line in /etc/sysctl.conf:
>
> fs.file-max = 131072
>
> and update settings by "sysctl -p". We haven't seen any of the Munge
> errors since!
>
> The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer
> version in https://github.com/dun/munge/releases/tag/munge-0.5.16
> I can't figure out if 0.5.16 has a fix for the issue seen here?
>
> Questions:
>
> Have other sites seen the present Munge issue as well?
>
> Are there any good recommendations for setting the fs.file-max parameter
> on Slurm compute nodes?
>
> Thanks for sharing your insights,
> Ole
>
> --
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark

-- 
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark, Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620
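For reference, a small sketch of making the workaround above persistent as a sysctl drop-in, e.g. /etc/sysctl.d/90-file-max.conf (the file name is an assumption; the value is the one quoted above, and busy 96-core nodes may warrant a larger one):

  # Raise the kernel-wide limit on open file handles
  fs.file-max = 131072

Load it with "sysctl --system" (or "sysctl -p /etc/sysctl.d/90-file-max.conf") and compare current usage against the limit with "cat /proc/sys/fs/file-nr".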