[slurm-users] Re: need to set From: address for slurm
There is no way to do it in slurm. You have to do it in the mail program you are using to send mail. In our case we use postfix and we set smtp_generic_maps to accomplish this.

-Paul Edmon-

On 6/7/2024 3:33 PM, Vanhorn, Mike via slurm-users wrote:

All, When the slurm daemon is sending out emails, they are coming from “sl...@servername.subdomain.domain.edu”. This has worked okay in the past, but due to a recent mail server change (over which I have no control whatsoever) this will no longer work. Now, the From: address is going to have to be something like “slurm-servern...@domain.edu”, or at least something that ends in “@domain.edu” (the subdomain being present will cause it to get rejected by the mail server). I am not seeing in the documentation how to change the “From:” address that slurm uses. Is there a way to do this and I’m just missing it?

--- Mike VanHorn Senior Computer Systems Administrator College of Engineering and Computer Science Wright State University 265 Russ Engineering Center 937-775-5157 michael.vanh...@wright.edu
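(A minimal sketch of the postfix approach Paul describes, assuming postfix is the MTA on the host that sends slurm mail; the map path and addresses below are placeholders for your own:)

# /etc/postfix/main.cf
smtp_generic_maps = hash:/etc/postfix/generic

# /etc/postfix/generic: rewrite the address slurm sends from
slurm@servername.subdomain.domain.edu    slurm-servername@domain.edu

# rebuild the lookup table and reload postfix
postmap /etc/postfix/generic
systemctl reload postfix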
[slurm-users] Re: dynamical configuration || meta configuration mgmt
Many parameters in slurm can be changed via scontrol and sacctmgr commands without updating the conf itself. The thing is that scontrol changes are not durable across restarts; sacctmgr, though, updates the slurmdb and thus will be sticky. So if you are using a QoS to manage this (which I am assuming you are), I would use sacctmgr. As for a framework that does the state inspection, I'm not aware of one. You could do it via cron and batch scripts that do the state inspection. I don't know if someone has something more sophisticated though.

-Paul Edmon-

On 5/29/2024 11:05 AM, Heckes, Frank via slurm-users wrote:

Hello all, I’m sorry if this has been asked and answered before, but I couldn’t find anything related. Does anyone know whether a framework of sorts exists that allows changing certain SLURM configuration parameters when certain conditions in the batch system’s state are detected, and of course reverts them once the state goes back to what it was before? (To be more concrete: we would like to raise or unset MaxJobsPU to run as many small jobs as possible and allocate all nodes as soon as a certain threshold of free nodes is available, and of course some other scenarios.) Many thanks in advance.

Cheers, -Frank

Max-Planck-Institut für Sonnensystemforschung Justus-von-Liebig-Weg 3 D-37077 Göttingen Phone: [+49] 551 – 384 979 320 E-Mail: hec...@mps.mpg.de
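(As a hedged example of the sacctmgr route Paul suggests, assuming the limit lives on a QOS named "normal", which is a placeholder; changes made this way are stored in slurmdbd and survive restarts:)

# raise the per-user job limit when lots of nodes are free
sacctmgr -i modify qos normal set MaxJobsPerUser=500

# clear the limit entirely (-1 removes it)
sacctmgr -i modify qos normal set MaxJobsPerUser=-1

A cron job could run a check like "sinfo -h -t idle -o %D" and apply one of the above depending on the free-node count.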
[slurm-users] HPC Principal System Engineer at the Broad
A friend asked me to pass this along. Figured some folks on this list might be interested. https://broadinstitute.avature.net/en_US/careers/JobDetail/HPC-Principal-System-Engineer/17773

-Paul Edmon-
[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them
Usually to clear jobs like this you have to reboot the node they are on. That will then force the scheduler to clear them.

-Paul Edmon-

On 4/10/2024 2:56 AM, archisman.pathak--- via slurm-users wrote:

We are running a slurm cluster with version `slurm 22.05.8`. One of our users has reported that their jobs have been stuck in the completing stage for a long time. Referring to the Slurm Troubleshooting Guide, we found that the batch host for the job had indeed been removed from the cluster, perhaps without draining it first. How do we cancel/delete the jobs? We tried scancel on the batch and individual job ids, both as the user and as SlurmUser.
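(If the batch host no longer exists in the cluster, a common way to force the cleanup, sketched here with a placeholder node name, is to mark the node down so the controller stops waiting on it:)

scontrol update NodeName=node123 State=DOWN Reason="clearing stuck completing jobs"
# once the jobs have cleared, resume the node, or leave it down if it is really gone
scontrol update NodeName=node123 State=RESUME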
[slurm-users] Re: Avoiding fragmentation
I wrote a little blog post on this topic a few years back: https://www.rc.fas.harvard.edu/blog/cluster-fragmentation/ It's a vexing problem, but as noted by the other responders it is something that depends on your cluster policy and job performance needs. Well written MPI code should be able to scale well even when given non-optimal topologies. You might also look at Node Weights (https://slurm.schedmd.com/slurm.conf.html#OPT_Weight). We use them on mosaic partitions so that the latest hardware is left available for larger jobs needing more performance. You can also use it to force jobs to one side of the partition, though generally the scheduler does this automatically. -Paul Edmon- On 4/9/24 6:45 AM, Cutts, Tim via slurm-users wrote: Agree with that. Plus, of course, even if the jobs run a bit slower by not having all the cores on a single node, they will be scheduled sooner, so the overall turnaround time for the user will be better, and ultimately that's what they care about. I've always been of the view, for any scheduler, that the less you try to constrain it the better. It really depends on what you're trying to optimise for, but generally speaking I try to optimise for maximum utilisation and throughput, unless I have a specific business case that needs to prioritise particular workloads, and then I'll compromise on throughput to get the urgent workload through sooner. Tun *From:* Loris Bennett via slurm-users *Sent:* 09 April 2024 06:51 *To:* slurm-users@lists.schedmd.com *Cc:* Gerhard Strangar *Subject:* [slurm-users] Re: Avoiding fragmentation Hi Gerhard, Gerhard Strangar via slurm-users writes: > Hi, > > I'm trying to figure out how to deal with a mix of few- and many-cpu > jobs. By that I mean most jobs use 128 cpus, but sometimes there are > jobs with only 16. As soon as that job with only 16 is running, the > scheduler splits the next 128 cpu jobs into 96+16 each, instead of > assigning a full 128 cpu node to them. Is there a way for the > administrator to achieve preferring full nodes? > The existence of pack_serial_at_end makes me believe there is not, > because that basically is what I needed, apart from my serial jobs using > 16 cpus instead of 1. > > Gerhard This may well not be relevant for your case, but we actively discourage the use of full nodes for the following reasons: - When the cluster is full, which is most of the time, MPI jobs in general will start much faster if they don't specify the number of nodes and certainly don't request full nodes. The overhead due to the jobs being scattered across nodes is often much lower than the additional waiting time incurred by requesting whole nodes. - When all the cores of a node are requested, all the memory of the node becomes unavailable to other jobs, regardless of how much memory is requested or indeed how much is actually used. This holds up jobs with low CPU but high memory requirements and thus reduces the total throughput of the system. These factors are important for us because we have a large number of single core jobs and almost all the users, whether doing MPI or not, significantly overestimate the memory requirements of their jobs. Cheers, Loris -- Dr. 
Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin
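(A sketch of the node-weight idea Paul mentions above; node names, counts, and weights are made up. Lower Weight values are allocated first, so older hardware soaks up the small jobs and the newest nodes stay free for large ones:)

# slurm.conf
NodeName=old[01-16] CPUs=64  RealMemory=256000 Weight=10
NodeName=new[01-16] CPUs=128 RealMemory=512000 Weight=100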
[slurm-users] Re: FairShare priority questions
For this use case you probably want to go with Classic Fairshare (https://slurm.schedmd.com/classic_fair_share.html) rather than Fair Tree. Classic Fairshare behaves in a way similar to what you describe: you can set up different bins for fairshare and then the users can pull from them. So that would be my recommendation. This is how we handle fairshare at FASRC, as we use Classic Fairshare: https://docs.rc.fas.harvard.edu/kb/fairshare/ You will need to enable this, as Fair Tree is on by default: https://slurm.schedmd.com/slurm.conf.html#OPT_NO_FAIR_TREE

-Paul Edmon-

On 3/27/2024 9:22 AM, Long, Daniel S. via slurm-users wrote:

Hi, I’m trying to set up multifactor priority on our cluster and am having some trouble getting it to behave the way I’d like. My main issues seem to revolve around FairShare. We have multiple projects on our cluster and multiple users in those projects (and some users are in multiple projects, of course). I would like the FairShare to be based only on the project associated with the job; if user A and user B both submit jobs on project C, the FairShare should be identical. However, it looks like the FairShare is based on both the project and the user. Is there a way to get the behavior I’m looking for? Thanks for any help you can provide.
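(A rough sketch of the pieces involved, with placeholder account names and share values; note that giving users Fairshare=parent so they all draw on the account's shares is a general association setting, not something confirmed in this thread, so treat it as an assumption to test:)

# slurm.conf
PriorityType=priority/multifactor
PriorityFlags=NO_FAIR_TREE

# give the shares to the project account, and have its users inherit them
sacctmgr modify account where name=projC set fairshare=100
sacctmgr modify user where account=projC set fairshare=parent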
[slurm-users] Slurm Utilities
Just wanted to share some slurm utilities that we've written at Harvard FASRC that may be useful to the community.

seff-account (https://github.com/fasrc/seff-account): creates job statistics summaries for users and accounts, similar to what seff and seff-array do.
showq (https://github.com/fasrc/slurm_showq): a slurm version of the Moab showq command.
lsload (https://github.com/fasrc/lsload): a slurm version of the LSF lsload command.
scalc (https://github.com/fasrc/scalc): a calculator for various fairshare-related things.
spart (https://github.com/fasrc/spart): simplified output for slurm partition information.
stdg (https://github.com/fasrc/stdg): Slurm test deck generator.
prometheus-slurm-exporter (https://github.com/fasrc/prometheus-slurm-exporter): Slurm exporters for prometheus.

Hopefully people find these useful. Pull requests are always appreciated.

-Paul Edmon-
[slurm-users] Re: salloc+srun vs just srun
He's talking about recent versions of Slurm which now have this option: https://slurm.schedmd.com/slurm.conf.html#OPT_use_interactive_step -Paul Edmon- On 2/28/2024 10:46 AM, Paul Raines wrote: What do you mean "operate via the normal command line"? When you salloc, you are still on the login node. $ salloc -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G --time=1-10:00:00 --gpus=2 --cpus-per-task=2 /bin/bash salloc: Pending job allocation 3798364 salloc: job 3798364 queued and waiting for resources salloc: job 3798364 has been allocated resources salloc: Granted job allocation 3798364 salloc: Waiting for resource configuration salloc: Nodes rtx-02 are ready for job mesg: cannot open /dev/pts/91: Permission denied mlsc-login[0]:~$ hostname mlsc-login.nmr.mgh.harvard.edu mlsc-login[0]:~$ printenv | grep SLURM_JOB_NODELIST SLURM_JOB_NODELIST=rtx-02 Seems you MUST use srun -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Wed, 28 Feb 2024 10:25am, Paul Edmon via slurm-users wrote: External Email - Use Caution salloc is the currently recommended way for interactive sessions. srun is now intended for launching steps or MPI applications. So properly you would salloc and then srun inside the salloc. As you've noticed with srun you tend lose control of your shell as it takes over so you have background the process unless it is the main process. We've hit this before when people use srun to subschedule in a salloc. You can also just launch the salloc and then operate via the normal command line reserving srun for things like launching MPI. The reason they changed from srun to salloc is that you can't srun inside a srun. So if you were a user who started a srun interactive session and then you tried to invoke MPI it would get weird as you would be invoking another srun. By using salloc you avoid this issue. We used to use srun for interactive sessions as well but swapped to salloc a few years back and haven't had any issues. -Paul Edmon- On 2/28/2024 10:17 AM, wdennis--- via slurm-users wrote: Hi list, In our institution, our instructions to users who want to spawn an interactive job (for us, a bash shell) have always been to do "srun ..." from the login node, which has always been working well for us. But when we had a recent Slurm training, the SchedMD folks advised us to use "salloc" and then "srun" to do interactive jobs. I tried this today, "salloc" gave me a shell on a server, the same as srun does, but then when I tried to "srun [programname]" it hung there with no output. Of course when I tried "srun [programname] &" it spawned the background job, and gave me back a prompt. Either time I had to Ctrl-C the running srun job, and got no output other than the srun/slurmstepd termination output. I think I read somewhere that directly invoking srun creates an allocation; why then would I want to do an initial salloc, and then srun? (i the case that I want a foreground program, such as a bash shell) I have surveyed some other institution's Slurm interactive jobs documentation for users, I see both examples of advice to run srun directly, or salloc and then srun. Please help me to understand how this is intended to work, and if we are "doing it wrong" :) Thanks, Will -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com The information in this e-mail is intended only for the person to whom it is addressed. 
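(For reference, the option Paul points to is a LaunchParameters flag; a hedged slurm.conf sketch, where the InteractiveStepOptions value shown is, as I recall, the documented default:)

# slurm.conf
LaunchParameters=use_interactive_step
InteractiveStepOptions="--interactive --preserve-env --pty $SHELL"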
[slurm-users] Re: salloc+srun vs just srun
salloc is the currently recommended way for interactive sessions. srun is now intended for launching steps or MPI applications. So properly you would salloc and then srun inside the salloc. As you've noticed, with srun you tend to lose control of your shell as it takes over, so you have to background the process unless it is the main process. We've hit this before when people use srun to subschedule in a salloc. You can also just launch the salloc and then operate via the normal command line, reserving srun for things like launching MPI. The reason they changed from srun to salloc is that you can't srun inside a srun. So if you were a user who started a srun interactive session and then you tried to invoke MPI, it would get weird as you would be invoking another srun. By using salloc you avoid this issue. We used to use srun for interactive sessions as well but swapped to salloc a few years back and haven't had any issues.

-Paul Edmon-

On 2/28/2024 10:17 AM, wdennis--- via slurm-users wrote:

Hi list, In our institution, our instructions to users who want to spawn an interactive job (for us, a bash shell) have always been to do "srun ..." from the login node, which has always worked well for us. But when we had a recent Slurm training, the SchedMD folks advised us to use "salloc" and then "srun" to do interactive jobs. I tried this today: "salloc" gave me a shell on a server, the same as srun does, but then when I tried to "srun [programname]" it hung there with no output. Of course when I tried "srun [programname] &" it spawned the background job and gave me back a prompt. Either time I had to Ctrl-C the running srun job, and got no output other than the srun/slurmstepd termination output. I think I read somewhere that directly invoking srun creates an allocation; why then would I want to do an initial salloc, and then srun? (in the case that I want a foreground program, such as a bash shell) I have surveyed some other institutions' Slurm interactive jobs documentation for users, and I see both examples of advice to run srun directly, or salloc and then srun. Please help me to understand how this is intended to work, and if we are "doing it wrong" :) Thanks, Will
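(A small usage sketch of the salloc-then-srun pattern described above; partition, resources, and the binary name are placeholders:)

# get the allocation (with use_interactive_step this drops you into a shell on the first node)
salloc -p compute -N 2 --ntasks-per-node=4 --time=1:00:00

# inside the allocation: parallel steps / MPI go through srun
srun ./hello_mpi

# ordinary commands (editing, building, etc.) can just run in the shell without srun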
[slurm-users] Re: Question about IB and Ethernet networks
I concur with what folks have written so far, it really depends on your use case. For instance if you are looking at a cluster with GPU's and intend to do some serious computing there you are going to need RDMA of some sort. But it all depends on what you end up needing for your workflows. For us we put most of our network traffic over the IB using IPoIB combined with aliasing all the nodes to their IB address. Thus all the internode network traffic spans the IB fabric rather than the ethernet. We then have 1GbE for our ethernet backend which we mainly use for management purposes. So we haven't heavily invested in a high speed ethernet backbone but instead invested in IB. To invest in both seems to me to be overkill, you should focus on one or the other unless you have the cash to spend and a good use case. -Paul Edmon- On 2/26/24 7:07 AM, Dan Healy via slurm-users wrote: I’m very appreciative for each person who’s provided some feedback, especially the lengthy replies. Sounds like RoCE capable Ethernet backbone may be the default way to go /unless/ the end users have some specific requirements that might need IB. At this point, we wouldn’t be interested in anything slower than 200Gbps. So perhaps Eth and IB are equivalent in terms of latency and RDMA capabilities, except one is an open standard. Thanks, Daniel Healy On Mon, Feb 26, 2024 at 3:40 AM Cutts, Tim wrote: My view is that it depends entirely on the workload, and the systems with which your compute needs to interact. A few things I’ve experienced before. 1. Modern ethernet networks have pretty good latency these days, and so MPI codes can run over them. Whether IB is worth the money is a cost/benefit calculation for the codes you want to run. The ethernet network we put in at Sanger in 2016 or so we measured as having similar latency, in practice, as FDR infiniband, if I remember correctly. So it wasn’t as good as state-of-the-art IB at the time, but not bad. Certainly good enough for our purposes, and we gained a lot of flexibility through software-defined networking, important if you have workloads which require better security boundaries than just a big shared network. 2. If your workload is predominantly single node, embarrassingly parallel, you might do better to go with ethernet and invest the saved money in more compute nodes. 3. If you only have ethernet, your cluster will be simpler, and require less specialised expertise to run 4. If your parallel filesystem is Lustre, IB seems to be the more well-worn path than ethernet. We encountered a few Lustre bugs early on because of that. 5. On the other hand, if you need to talk to Weka, ethernet is the well-worn path. Weka’s IB implementation requires the dedication of some cores on every client node, so you lose some compute capacity, which you don’t need to do if you’re using ethernet. So, as any lawyer would say “it depends”. Most of my career has been in genomics, where IB definitely wasn’t necessary. Now that I’m in pharma, there’s more MPI code, so there’s more of a case for it. Ultimately, I think you need to run the real benchmarks with real code, and as Jason says, work out whether the additional complexity and cost of the IB network is worth it for your particular workload. I don’t think the mantra “It’s HPC so it has to be Infiniband” is a given. 
Tim -- *Tim Cutts* Scientific Computing Platform Lead AstraZeneca Find out more about R IT Data, Analytics & AI and how we can support you by visiting ourService Catalogue <https://azcollaboration.sharepoint.com/sites/CMU993>| *From: *Jason Simms via slurm-users *Date: *Monday, 26 February 2024 at 01:13 *To: *Dan Healy *Cc: *slurm-users@lists.schedmd.com *Subject: *[slurm-users] Re: Question about IB and Ethernet networks Hello Daniel, In my experience, if you have a high-speed interconnect such as IB, you would do IPoIB. You would likely still have a "regular" Ethernet connection for management purposes, and yes that means both an IB switch and an Ethernet switch, but that switch doesn't have to be anything special. Any "real" traffic is routed over IB, everything is mounted via IB, etc. That's how the last two clusters I've worked with have been configured, and the next one will be the same (but will use Omnipath rather than IB). We likewise use BeeGFS. These next comments are perhaps more likely to encounter differences of opinion, but I would say that sufficiently fast Ethernet is often "good enough" for most workloads (e.g., MPI). I'd wager that for all but the most demanding of workloads, it's entirely acc
[slurm-users] Re: Recover Batch Script Error
Are you using the job_script storage option? If so then you should be able to get at it by doing:

sacct -B -j JOBID

https://slurm.schedmd.com/sacct.html#OPT_batch-script

-Paul Edmon-

On 2/16/2024 2:41 PM, Jason Simms via slurm-users wrote:

Hello all, I've used the "scontrol write batch_script" command to output the job submission script from completed jobs in the past, but for some reason, no matter which job I specify, it tells me it is invalid. Any way to troubleshoot this? Alternatively, is there another way - even if a manual database query - to recover the job script, assuming it exists in the database?

sacct --jobs=38960
JobID         JobName     Partition   Account     AllocCPUS  State      ExitCode
38960         amr_run_v+  tsmith2lab  tsmith2lab  72         COMPLETED  0:0
38960.batch   batch                   tsmith2lab  40         COMPLETED  0:0
38960.extern  extern                  tsmith2lab  72         COMPLETED  0:0
38960.0       hydra_pmi+              tsmith2lab  72         COMPLETED  0:0

scontrol write batch_script 38960
job script retrieval failed: Invalid job id specified

Warmest regards, Jason

*Jason L. Simms, Ph.D., M.P.H.* Manager of Research Computing Swarthmore College Information Technology Services (610) 328-8102 Schedule a meeting: https://calendly.com/jlsimms
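(For completeness: retrieval only works if the scripts are actually being stored in slurmdbd, which recent Slurm versions control with AccountingStoreFlags; a hedged sketch:)

# slurm.conf
AccountingStoreFlags=job_script

# then, for jobs submitted after the flag was enabled:
sacct -j 38960 --batch-script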
[slurm-users] Re: Naive SLURM question: equivalent to LSF pre-exec
You probably want the Prolog option: https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog along with: https://slurm.schedmd.com/slurm.conf.html#OPT_ForceRequeueOnFail

-Paul Edmon-

On 2/14/2024 8:38 AM, Cutts, Tim via slurm-users wrote:

Hi, I apologise if I’ve failed to find this in the documentation (and am happy to be told to RTFM) but a recent issue for one of my users resulted in a question I couldn’t answer. LSF has a feature called a Pre-Exec where a script executes to check whether a node is ready to run a task. So, you can run arbitrary checks and go back to the queue if they fail. For example, if I have some automounted filesystems, and I want to be able to check for failure of the automounter, in an LSF world, I can do: bsub -E “test -f /nfs/someplace/file_I_know_exists” my_job.sh What’s the equivalent in SLURM? Thanks, Tim

*Tim Cutts* Scientific Computing Platform Lead AstraZeneca
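(A minimal prolog sketch along those lines; the script path and the file being tested are placeholders. By default a non-zero Prolog exit drains the node and requeues the batch job, held unless nohold_on_prolog_fail is set in SchedulerParameters, which is roughly the pre-exec behaviour:)

# slurm.conf
Prolog=/etc/slurm/prolog.sh

# /etc/slurm/prolog.sh
#!/bin/bash
if [ ! -e /nfs/someplace/file_I_know_exists ]; then
    echo "automount check failed on $(hostname)" >&2
    exit 1   # non-zero: drain this node and requeue the job elsewhere
fi
exit 0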
Re: [slurm-users] Two jobs each with a different partition running on same node?
That certainly isn't the case in our configuration. We have multiple overlapping partitions and our nodes have a mix of jobs from all different partitions. So the default behavior is to have a mixing of partitions on a node, governed by the PriorityTier of the partition: the highest priority tier always goes first, but jobs from the lower tiers can fill in the gaps on a node. Having multiple partitions and then having only one of them own a node if it happens to have a job running isn't a standard option to my knowledge. You can accomplish something like this with MCS, which I know can lock down nodes to specific users and groups. But what you describe sounds more like locking down based on partition, not on user or group, which I'm not sure how to accomplish in the current version of slurm. That doesn't mean it's not possible, I just don't know how unless it is some obscure option.

-Paul Edmon-

On 1/29/2024 9:25 AM, Loris Bennett wrote:

Hi, I seem to remember that in the past, if a node was configured to be in two partitions, the actual partition of the node was determined by the partition associated with the jobs running on it. Moreover, at any instance where the node was running one or more jobs, the node could only actually be in a single partition. Was this indeed the case, and is it still the case with version Slurm 23.02.7? Cheers, Loris
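(For reference, the default mixing behaviour Paul describes comes from overlapping partitions with different PriorityTier values; a hedged sketch with placeholder names:)

# slurm.conf: both partitions cover the same nodes; the higher tier is scheduled first,
# and the lower tier fills in the remaining gaps on the same nodes
PartitionName=urgent  Nodes=node[01-32] PriorityTier=200 State=UP
PartitionName=general Nodes=node[01-32] PriorityTier=100 State=UP Default=YES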
Re: [slurm-users] preemptable queue
My concern was your config inadvertently having that line commented out and then seeing problems. If it wasn't, then no worries at this point. We run using preempt/partition_prio on our cluster and have a mix of partitions using PreemptMode=OFF and PreemptMode=REQUEUE, so I know that combination works. I would be surprised if PreemptMode=CANCEL did not work, as that's a valid option. Something we do have set though is the default mode. We have set:

### Governs default preemption behavior
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

So you might try setting that default to PreemptMode=CANCEL and then set specific PreemptModes for all your partitions. That's what we do and it works for us.

-Paul Edmon-

On 1/12/2024 10:33 AM, Davide DelVento wrote:

Thanks Paul, I don't understand what you mean by having a typo somewhere. I mean, that configuration works just fine right now, whereas if I add the commented-out line any slurm command will just abort with the error "PreemptType and PreemptMode values incompatible". So, assuming there is a typo, it should be in the commented line, right? Or are you saying that having that line makes slurm sensitive to a typo somewhere else that would be otherwise ignored? Obviously I can't exclude that option, but it seems unlikely to me, also because it does say these two things are incompatible. It would obviously be much better if the error said what EXACTLY is incompatible with what, but in the documentation at https://slurm.schedmd.com/preempt.html I see many clues of what that could be, and hence I am asking people here who may have deployed preemption already on their system. Some excerpts from that URL:

PreemptType: Specifies the plugin used to identify which jobs can be preempted in order to start a pending job.
* preempt/none: Job preemption is disabled (default).
* preempt/partition_prio: Job preemption is based upon partition PriorityTier. Jobs in higher PriorityTier partitions may preempt jobs from lower PriorityTier partitions. This is not compatible with PreemptMode=OFF.

which somewhat makes it sound like all partitions should have preemption set and not only some? I obviously have some "off" partitions. However, elsewhere in that document it says

PreemptMode: Mechanism used to preempt jobs or enable gang scheduling. When the PreemptType parameter is set to enable preemption, the PreemptMode in the main section of slurm.conf selects the default mechanism used to preempt the preemptable jobs for the cluster. PreemptMode may be specified on a per-partition basis to override this default value if PreemptType=preempt/partition_prio.

which kind of sounds like it should be okay (unless it means **everything** must be different than OFF). Yet still elsewhere in that same page it says

On the other hand, if you want to use PreemptType=preempt/partition_prio to allow jobs from higher PriorityTier partitions to Suspend jobs from lower PriorityTier partitions, then you will need overlapping partitions, and PreemptMode=SUSPEND,GANG to use the Gang scheduler to resume the suspended job(s). In either case, time-slicing won't happen between jobs on different partitions.

which somewhat sounds like only suspend and gang can be used as preemption modes, and not cancel (my preference) or requeue (perhaps acceptable, if I jump through some hoops).
So to me the documentation is highly confusing about what can or cannot be used together with what else, and the examples at the bottom of the page are nice, but they do not specify the full settings. Particularly this one https://slurm.schedmd.com/preempt.html#example2 is close enough to mine, but it does not tell what PreemptType has been chosen (nor if "cancel" would be allowed or not in that setup). Thanks again! On Fri, Jan 12, 2024 at 7:22 AM Paul Edmon wrote: At least in the example you are showing you have PreemptType commented out, which means it will return the default. PreemptMode Cancel should work, I don't see anything in the documentation that indicates it wouldn't. So I suspect you have a typo somewhere in your conf. -Paul Edmon- On 1/11/2024 6:01 PM, Davide DelVento wrote: I would like to add a preemptable queue to our cluster. Actually I already have. We simply want jobs submitted to that queue be preempted if there are no resources available for jobs in other (high priority) queues. Conceptually very simple, no conditionals, no choices, just what I wrote. However it does not work as desired. This is the relevant part: grep -i Preemp /opt/slurm/slurm.conf #PreemptType = preempt/partition_prio PartitionName=regular DefMemPerCPU=4580 Default=True Nodes=node[01-12] State=UP PreemptMode=off PriorityTier=200 PartitionName=All DefMemPerCPU=4580 Nodes=node[01-36] State=UP PreemptMode=off Prio
Re: [slurm-users] preemptable queue
At least in the example you are showing, you have PreemptType commented out, which means it will return the default. PreemptMode=CANCEL should work; I don't see anything in the documentation that indicates it wouldn't. So I suspect you have a typo somewhere in your conf.

-Paul Edmon-

On 1/11/2024 6:01 PM, Davide DelVento wrote:

I would like to add a preemptable queue to our cluster. Actually I already have. We simply want jobs submitted to that queue to be preempted if there are no resources available for jobs in other (high priority) queues. Conceptually very simple, no conditionals, no choices, just what I wrote. However it does not work as desired. This is the relevant part:

grep -i Preemp /opt/slurm/slurm.conf
#PreemptType = preempt/partition_prio
PartitionName=regular DefMemPerCPU=4580 Default=True Nodes=node[01-12] State=UP PreemptMode=off PriorityTier=200
PartitionName=All DefMemPerCPU=4580 Nodes=node[01-36] State=UP PreemptMode=off PriorityTier=500
PartitionName=lowpriority DefMemPerCPU=4580 Nodes=node[01-36] State=UP PreemptMode=cancel PriorityTier=100

That PreemptType setting (now commented) fully breaks slurm; everything refuses to run with errors like

$ squeue
squeue: error: PreemptType and PreemptMode values incompatible
squeue: fatal: Unable to process configuration file

If I understand the documentation at https://slurm.schedmd.com/preempt.html correctly, that is because preemption cannot cancel jobs based on partition priority, which (if true) is really unfortunate. I understand that allowing cross-partition time-slicing could be tricky and so I understand why that isn't allowed, but cancelling? Anyway, I have a few questions: 1) is that correct, and so should I avoid using either partition priority or cancelling? 2) is there an easy way to trick slurm into requeueing and then have those jobs cancelled instead? 3) I guess the cleanest option would be to implement QoS, but I've never done it and we don't really need it for anything else other than this. The documentation looks complicated, but is it? The great Ole's website is unavailable at the moment... Thanks!!
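(Pulling the thread together, a hedged version of the partition_prio setup Paul describes in his later reply, reusing the node ranges and tiers from Davide's snippet; the key point is that the global PreemptMode cannot be OFF once PreemptType is set:)

# slurm.conf
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE   # cluster-wide default; CANCEL should also be accepted here

PartitionName=regular     Default=True Nodes=node[01-12] State=UP PreemptMode=off    PriorityTier=200
PartitionName=All         Nodes=node[01-36] State=UP PreemptMode=off    PriorityTier=500
PartitionName=lowpriority Nodes=node[01-36] State=UP PreemptMode=cancel PriorityTier=100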
Re: [slurm-users] Beginner admin question: Prioritization within a partition based on time limit
Yeah, that's sort of the job of the backfill scheduler, as smaller jobs will fit better into the gaps. There are several options within the priority framework that you can use to dial in which jobs get which priority. I recommend reading through all of those and finding the options that will work best for the policy you want to implement.

-Paul Edmon-

On 1/9/2024 10:43 AM, Kenneth Chiu wrote:

I'm just learning about slurm. I understand that different partitions can be prioritized separately, and can have different max time limits. I was wondering whether or not there was a way to have a finer-grained prioritization based on the time limit specified by a job, within a single partition. Or perhaps this is already happening by default? Would the backfill scheduler be best for this?
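(If you do want the time limit itself to feed into priority, one knob worth reading up on, hedged since I have not tuned it myself, is the job-size factor computed relative to the time limit:)

# slurm.conf
PriorityType=priority/multifactor
PriorityFavorSmall=YES
PriorityWeightJobSize=10000
PriorityFlags=SMALL_RELATIVE_TO_TIME   # size factor becomes job size divided by time limit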
Re: [slurm-users] GPU Card Reservation?
I believe the 23.11 version of slurm will allow you to reserve specific cards as part of a Reservation. That won't do preemption though as a reservation just takes the card and dedicates it to the user. I don't know if a QoS could pull that off, I haven't experimented with it. A partition would be all or nothing for a node so that would not work. -Paul Edmon- On 12/15/23 12:16 PM, Jason Simms wrote: Hello all, At least at one point, I understood that it was not particularly possible, or at least not elegant, to provide priority preempt access to a specific GPU card. So, if a node has 4 GPUs, a researcher can preempt as needed one or more of them. Is this still the case? Or is there a reasonable way to facilitate this? Warmest regards, Jason
Re: [slurm-users] Disabling SWAP space will it effect SLURM working
We've been running for years without swap on, with no issues. You may want to set MemSpecLimit in your config to reserve memory for the OS, so that you don't OOM the system with user jobs: https://slurm.schedmd.com/slurm.conf.html#OPT_MemSpecLimit

-Paul Edmon-

On 12/11/2023 11:19 AM, Davide DelVento wrote:

A little late here, but yes, everything Hans said is correct, and if you are worried about slurm (or other critical system software) getting killed by OOM, you can work around it by properly configuring cgroups.

On Wed, Dec 6, 2023 at 2:06 AM Hans van Schoot wrote:

Hi Joseph, This might depend on the rest of your configuration, but in general swap should not be needed for anything on Linux. BUT: you might get OOM killer messages in your system logs, and SLURM might fall victim to the OOM killer (OOM = Out Of Memory) if you run applications on the compute node that eat up all your RAM. Swap does not prevent this, but makes it less likely to happen. I've seen OOM kill slurm daemon processes on compute nodes with swap; usually slurm recovers just fine after the application that ate up all the RAM ends up getting killed by the OOM killer. My compute nodes are not configured to monitor memory usage of jobs. If you have memory configured as a managed resource in your SLURM setup, and you leave a bit of headroom for the OS itself (e.g. only hand out a maximum of 250GB RAM to jobs on your 256GB RAM nodes), you should be fine. cheers, Hans ps. I'm just a happy slurm user/admin, not an expert, so I might be wrong about everything :-)

On 06-12-2023 05:57, John Joseph wrote:

Dear All, Good morning. We have a 4-node [256 GB RAM in each node] SLURM instance which we installed and which is working fine. We have 2 GB of swap space on each node; to make full use of the system we want to disable the swap memory. I would like to know whether disabling the swap partition will affect SLURM functionality. Advice requested. Thanks, Joseph John
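(A node-definition sketch with MemSpecLimit; the numbers are placeholders. The reserved memory, in MB, is kept back for the OS and slurmd rather than handed out to jobs:)

# slurm.conf: reserve ~4 GB of each 256 GB node for the system
NodeName=node[01-04] CPUs=64 RealMemory=257000 MemSpecLimit=4096 State=UNKNOWN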
Re: [slurm-users] enabling job script archival
You will probably need to. The way we handle it is that we add users when the first submit a job via the job_submit.lua script. This way the database autopopulates with active users. -Paul Edmon- On 10/3/23 9:01 AM, Davide DelVento wrote: By increasing the slurmdbd verbosity level, I got additional information, namely the following: slurmdbd: error: couldn't get information for this user (null)(xx) slurmdbd: debug: accounting_storage/as_mysql: as_mysql_jobacct_process_get_jobs: User xx has no associations, and is not admin, so not returning any jobs. again where x is the posix ID of the user who's running the query in the slurmdbd logs. I suspect this is due to the fact that our userbase is small enough (we are a department HPC) that we don't need to use allocation and the like, so I have not configured any association (and not even studied its configuration, since when I was at another place which did use associations, someone else took care of slurm administration). Anyway, I read the fantastic document by our own member at https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#associations and in fact I have not even configured slurm users: # sacctmgr show user User Def Acct Admin -- -- - root root Administ+ # So is that the issue? Should I just add all users? Any suggestions on the minimal (but robust) way to do that? Thanks! On Mon, Oct 2, 2023 at 9:20 AM Davide DelVento wrote: Thanks Paul, this helps. I don't have any PrivateData line in either config file. According to the docs, "By default, all information is visible to all users" so this should not be an issue. I tried to add a line with "PrivateData=jobs" to the conf files, just in case, but that didn't change the behavior. On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon wrote: At least in our setup, users can see their own scripts by doing sacct -B -j JOBID I would make sure that the scripts are being stored and how you have PrivateData set. -Paul Edmon- On 10/2/2023 10:57 AM, Davide DelVento wrote: I deployed the job_script archival and it is working, however it can be queried only by root. A regular user can run sacct -lj towards any jobs (even those by other users, and that's okay in our setup) with no problem. However if they run sacct -j job_id --batch-script even against a job they own themselves, nothing is returned and I get a slurmdbd: error: couldn't get information for this user (null)(xx) where x is the posix ID of the user who's running the query in the slurmdbd logs. Both configure files slurmdbd.conf and slurm.conf do not have any "permission" setting. FWIW, we use LDAP. Is that the expected behavior, in that by default only root can see the job scripts? I was assuming the users themselves should be able to debug their own jobs... Any hint on what could be changed to achieve this? Thanks! On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento wrote: Fantastic, this is really helpful, thanks! On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon wrote: Yes it was later than that. If you are 23.02 you are good. We've been running with storing job_scripts on for years at this point and that part of the database only uses up 8.4G. Our entire database takes up 29G on disk. So its about 1/3 of the database. We also have database compression which helps with the on disk size. Raw uncompressed our database is about 90G. We keep 6 months of data in our active database. -Paul Edmon- On 9/28/2023 1:57 PM, Ryan Novosielski wrote: Sorry for the duplicate e-mail in a short time: do you know (or anyone) when the hashing was added? 
Was planning to enable this on 21.08, but we then had to delay our upgrade to it. I’m assuming later than that, as I believe that’s when the feature was added. On Sep 28, 2023, at 13:55, Ryan Novosielski <mailto:novos...@rutgers.edu> wrote: Thank you; we’ll put in a feature request for improvements in that area, and also thanks for the warning? I thought of that in passing, but the real world experience is really useful. I could easily see wanting that stuff to be retained less often than the main records, which is what I’d ask for. I assume that archiving, in general, would also
Re: [slurm-users] enabling job script archival
At least in our setup, users can see their own scripts by doing sacct -B -j JOBID I would make sure that the scripts are being stored and how you have PrivateData set. -Paul Edmon- On 10/2/2023 10:57 AM, Davide DelVento wrote: I deployed the job_script archival and it is working, however it can be queried only by root. A regular user can run sacct -lj towards any jobs (even those by other users, and that's okay in our setup) with no problem. However if they run sacct -j job_id --batch-script even against a job they own themselves, nothing is returned and I get a slurmdbd: error: couldn't get information for this user (null)(xx) where x is the posix ID of the user who's running the query in the slurmdbd logs. Both configure files slurmdbd.conf and slurm.conf do not have any "permission" setting. FWIW, we use LDAP. Is that the expected behavior, in that by default only root can see the job scripts? I was assuming the users themselves should be able to debug their own jobs... Any hint on what could be changed to achieve this? Thanks! On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento wrote: Fantastic, this is really helpful, thanks! On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon wrote: Yes it was later than that. If you are 23.02 you are good. We've been running with storing job_scripts on for years at this point and that part of the database only uses up 8.4G. Our entire database takes up 29G on disk. So its about 1/3 of the database. We also have database compression which helps with the on disk size. Raw uncompressed our database is about 90G. We keep 6 months of data in our active database. -Paul Edmon- On 9/28/2023 1:57 PM, Ryan Novosielski wrote: Sorry for the duplicate e-mail in a short time: do you know (or anyone) when the hashing was added? Was planning to enable this on 21.08, but we then had to delay our upgrade to it. I’m assuming later than that, as I believe that’s when the feature was added. On Sep 28, 2023, at 13:55, Ryan Novosielski <mailto:novos...@rutgers.edu> wrote: Thank you; we’ll put in a feature request for improvements in that area, and also thanks for the warning? I thought of that in passing, but the real world experience is really useful. I could easily see wanting that stuff to be retained less often than the main records, which is what I’d ask for. I assume that archiving, in general, would also remove this stuff, since old jobs themselves will be removed? -- #BlackLivesMatter || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB A555B, Newark `' On Sep 28, 2023, at 13:48, Paul Edmon <mailto:ped...@cfa.harvard.edu> wrote: Slurm should take care of it when you add it. So far as horror stories, under previous versions our database size ballooned to be so massive that it actually prevented us from upgrading and we had to drop the columns containing the job_script and job_env. This was back before slurm started hashing the scripts so that it would only store one copy of duplicate scripts. After this point we found that the job_script database stayed at a fairly reasonable size as most users use functionally the same script each time. However the job_env continued to grow like crazy as there are variables in our environment that change fairly consistently depending on where the user is. Thus job_envs ended up being too massive to keep around and so we had to drop them. 
Frankly we never really used them for debugging. The job_scripts though are super useful and not that much overhead. In summary my recommendation is to only store job_scripts. job_envs add too much storage for little gain, unless your job_envs are basically the same for each user in each location. Also it should be noted that there is no way to prune out job_scripts or job_envs right now. So the only way to get rid of them if they get large is to 0 out the column in the table. You can ask SchedMD for the mysql command to do this as we had to do it here to our job_envs. -Paul Edmon- On 9/28/2023 1:40 PM, Davide DelVento wrote: In my current slurm installation, (recently upgraded to slurm v23.02.3), I only have AccountingStoreFlags=job_comment I
Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?
This is one of the reasons we stick with using RPMs rather than the symlink process. It's just cleaner and avoids the issue of having the install on shared storage that may get overwhelmed with traffic or suffer outages. Also, the package manager automatically removes the previous versions and installs things locally. I've never been a fan of the symlink method, as it runs counter to the entire point and design of Linux and package managers, which are supposed to do this heavy lifting for you. Rant aside :). Generally for minor upgrades the process is less touchy. For our setup we follow the following process, which works well for us but does create an outage for the period of the upgrade.

1. Set all partitions to down: this makes sure no new jobs are scheduled.
2. Suspend all jobs: this makes sure jobs aren't running while we upgrade.
3. Stop slurmctld and slurmdbd.
4. Upgrade the slurmdbd. Restart slurmdbd.
5. Upgrade the slurmd and slurmctld across the cluster.
6. Restart slurmd and slurmctld simultaneously using choria.
7. Unsuspend all jobs.
8. Reopen all partitions.

For major upgrades we always take a mysqldump and back up the spool for the slurmctld before upgrading, just in case something goes wrong. We've had this happen before when the slurmdbd upgrade cut out early (note: always run the slurmdbd and slurmctld upgrades in -D mode and not via systemctl, as systemctl can time out and kill the upgrade midway for large upgrades). That said, I've also skipped steps 1, 2, 7, and 8 before for minor upgrades and it works fine. The slurmd, slurmctld, and slurmdbd can all run on different versions so long as slurmdbd > slurmctld > slurmd. So if you want to do a live upgrade you can do it. However, out of paranoia we generally stop everything. The entire process takes about an hour start to finish, with the longest part being the pausing of all the jobs.

-Paul Edmon-

On 9/29/2023 9:48 AM, Groner, Rob wrote:

I did already see the upgrade section of Jason's talk, but it wasn't much about the mechanics of the actual upgrade process, more of a big picture it seemed. It dealt a lot with different parts of slurm at different versions, which is something we don't have. One little wrinkle here is that while, yes, we're using a symlink to point to what version of slurm is the current one...it's all on a shared filesystem. So ALL nodes, slurmdb, and slurmctld are using that same symlink. There is no means to upgrade one component at a time. That means to upgrade, EVERYTHING has to come down before it could come back up. Jason's slides seemed to indicate that, if there were separate symlinks, then I could focus on just the slurmdb first and upgrade it...then focus on slurmctld and upgrade it, and then finally the nodes (take down their slurmd, upgrade the link, bring up slurmd). So maybe that's what I'm missing. Otherwise, I think what I'm saying is that I see references to a "rolling upgrade", but I don't see any guide to a rolling upgrade. I just see the 14 steps in https://slurm.schedmd.com/quickstart_admin.html#upgrade, and I guess I'd always thought of that as the full octane, high fat upgrade. I've only ever done upgrades during one of our many scheduled downtimes, because the upgrades were always to a new major version, and because I'm a scared little chicken, so I figured there were maybe some smaller subset of steps if only upgrading a patchlevel change. Smaller change, less risk, less precautionary steps...? I'm seeing now that's not the case.
Thank you all for the suggestions! Rob *From:* slurm-users on behalf of Ryan Novosielski *Sent:* Friday, September 29, 2023 2:48 AM *To:* Slurm User Community List *Subject:* Re: [slurm-users] Steps to upgrade slurm for a patchlevel change? You don't often get email from novos...@rutgers.edu. Learn why this is important <https://aka.ms/LearnAboutSenderIdentification> I started off writing there’s really no particular process for these/just do your changes and start the new software (be mindful of any PATH that might contain data that’s under your software tree, if you have that setup), and that you might need to watch the timeouts, but I figured I’d have a look at the upgrade guide to be sure. There’s really nothing onerous in there. I’d personally back up my database and state save directories just because I’d rather be safe than sorry, or for if have to go backwards and want to be sure. You can run SlurmCtld for a good while with no database (note that -M on the command line will be broken during that time), just being mindful of the RAM on the SlurmCtld machine/don’t restart it before the DB is back up, and backing up our fairly large database doesn’t take all that long. Whether or not 5 is require
Re: [slurm-users] enabling job script archival
Yes it was later than that. If you are 23.02 you are good. We've been running with storing job_scripts on for years at this point and that part of the database only uses up 8.4G. Our entire database takes up 29G on disk. So its about 1/3 of the database. We also have database compression which helps with the on disk size. Raw uncompressed our database is about 90G. We keep 6 months of data in our active database. -Paul Edmon- On 9/28/2023 1:57 PM, Ryan Novosielski wrote: Sorry for the duplicate e-mail in a short time: do you know (or anyone) when the hashing was added? Was planning to enable this on 21.08, but we then had to delay our upgrade to it. I’m assuming later than that, as I believe that’s when the feature was added. On Sep 28, 2023, at 13:55, Ryan Novosielski wrote: Thank you; we’ll put in a feature request for improvements in that area, and also thanks for the warning? I thought of that in passing, but the real world experience is really useful. I could easily see wanting that stuff to be retained less often than the main records, which is what I’d ask for. I assume that archiving, in general, would also remove this stuff, since old jobs themselves will be removed? -- #BlackLivesMatter || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB A555B, Newark `' On Sep 28, 2023, at 13:48, Paul Edmon wrote: Slurm should take care of it when you add it. So far as horror stories, under previous versions our database size ballooned to be so massive that it actually prevented us from upgrading and we had to drop the columns containing the job_script and job_env. This was back before slurm started hashing the scripts so that it would only store one copy of duplicate scripts. After this point we found that the job_script database stayed at a fairly reasonable size as most users use functionally the same script each time. However the job_env continued to grow like crazy as there are variables in our environment that change fairly consistently depending on where the user is. Thus job_envs ended up being too massive to keep around and so we had to drop them. Frankly we never really used them for debugging. The job_scripts though are super useful and not that much overhead. In summary my recommendation is to only store job_scripts. job_envs add too much storage for little gain, unless your job_envs are basically the same for each user in each location. Also it should be noted that there is no way to prune out job_scripts or job_envs right now. So the only way to get rid of them if they get large is to 0 out the column in the table. You can ask SchedMD for the mysql command to do this as we had to do it here to our job_envs. -Paul Edmon- On 9/28/2023 1:40 PM, Davide DelVento wrote: In my current slurm installation, (recently upgraded to slurm v23.02.3), I only have AccountingStoreFlags=job_comment I now intend to add both AccountingStoreFlags=job_script AccountingStoreFlags=job_env leaving the default 4MB value for max_script_size Do I need to do anything on the DB myself, or will slurm take care of the additional tables if needed? Any comments/suggestions/gotcha/pitfalls/horror_stories to share? I know about the additional diskspace and potentially load needed, and with our resources and typical workload I should be okay with that. Thanks!
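(On the "6 months of data" point: retention of the live tables is normally handled with slurmdbd.conf purge/archive options; a hedged sketch with a placeholder archive path:)

# slurmdbd.conf: keep six months in the active database, archive records as they are purged
ArchiveDir=/var/spool/slurm/archive
ArchiveJobs=yes
ArchiveSteps=yes
PurgeJobAfter=6months
PurgeStepAfter=6months
PurgeEventAfter=6months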
Re: [slurm-users] enabling job script archival
No, all the archiving does is remove the pointer. What slurm does right now is that it creates a hash of the job_script/job_env and then checks and sees if that hash matches one on record. If not then it adds it to the record, if it does match then it adds a pointer to the appropriate record. So you can think of the job_script/job_env as an internal database of all the various scripts and envs that slurm has ever seen and then what ends up in the Job record is a pointer to that database. This way slurm can deduplicate scripts/envs that are the same. This works great for job_scripts as they are functionally the same and thus you have many jobs pointed to the same script, but less so for job_envs. -Paul Edmon- On 9/28/2023 1:55 PM, Ryan Novosielski wrote: Thank you; we’ll put in a feature request for improvements in that area, and also thanks for the warning? I thought of that in passing, but the real world experience is really useful. I could easily see wanting that stuff to be retained less often than the main records, which is what I’d ask for. I assume that archiving, in general, would also remove this stuff, since old jobs themselves will be removed? -- #BlackLivesMatter || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB A555B, Newark `' On Sep 28, 2023, at 13:48, Paul Edmon wrote: Slurm should take care of it when you add it. So far as horror stories, under previous versions our database size ballooned to be so massive that it actually prevented us from upgrading and we had to drop the columns containing the job_script and job_env. This was back before slurm started hashing the scripts so that it would only store one copy of duplicate scripts. After this point we found that the job_script database stayed at a fairly reasonable size as most users use functionally the same script each time. However the job_env continued to grow like crazy as there are variables in our environment that change fairly consistently depending on where the user is. Thus job_envs ended up being too massive to keep around and so we had to drop them. Frankly we never really used them for debugging. The job_scripts though are super useful and not that much overhead. In summary my recommendation is to only store job_scripts. job_envs add too much storage for little gain, unless your job_envs are basically the same for each user in each location. Also it should be noted that there is no way to prune out job_scripts or job_envs right now. So the only way to get rid of them if they get large is to 0 out the column in the table. You can ask SchedMD for the mysql command to do this as we had to do it here to our job_envs. -Paul Edmon- On 9/28/2023 1:40 PM, Davide DelVento wrote: In my current slurm installation, (recently upgraded to slurm v23.02.3), I only have AccountingStoreFlags=job_comment I now intend to add both AccountingStoreFlags=job_script AccountingStoreFlags=job_env leaving the default 4MB value for max_script_size Do I need to do anything on the DB myself, or will slurm take care of the additional tables if needed? Any comments/suggestions/gotcha/pitfalls/horror_stories to share? I know about the additional diskspace and potentially load needed, and with our resources and typical workload I should be okay with that. Thanks!
Re: [slurm-users] enabling job script archival
Slurm should take care of it when you add it. So far as horror stories, under previous versions our database size ballooned to be so massive that it actually prevented us from upgrading and we had to drop the columns containing the job_script and job_env. This was back before slurm started hashing the scripts so that it would only store one copy of duplicate scripts. After this point we found that the job_script database stayed at a fairly reasonable size as most users use functionally the same script each time. However the job_env continued to grow like crazy as there are variables in our environment that change fairly consistently depending on where the user is. Thus job_envs ended up being too massive to keep around and so we had to drop them. Frankly we never really used them for debugging. The job_scripts though are super useful and not that much overhead. In summary my recommendation is to only store job_scripts. job_envs add too much storage for little gain, unless your job_envs are basically the same for each user in each location. Also it should be noted that there is no way to prune out job_scripts or job_envs right now. So the only way to get rid of them if they get large is to 0 out the column in the table. You can ask SchedMD for the mysql command to do this as we had to do it here to our job_envs. -Paul Edmon- On 9/28/2023 1:40 PM, Davide DelVento wrote: In my current slurm installation, (recently upgraded to slurm v23.02.3), I only have AccountingStoreFlags=job_comment I now intend to add both AccountingStoreFlags=job_script AccountingStoreFlags=job_env leaving the default 4MB value for max_script_size Do I need to do anything on the DB myself, or will slurm take care of the additional tables if needed? Any comments/suggestions/gotcha/pitfalls/horror_stories to share? I know about the additional diskspace and potentially load needed, and with our resources and typical workload I should be okay with that. Thanks!
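For reference, enabling job script storage is a one-line slurm.conf change, and the stored script can later be pulled back out with sacct. A minimal sketch, assuming a recent Slurm release where the sacct --batch-script / --env-vars options exist (the job ID is illustrative):

    # slurm.conf
    AccountingStoreFlags=job_comment,job_script   # add job_env only if you accept the extra storage

    # retrieve the stored batch script of a finished job
    sacct -j 1234567 --batch-script
    # and, only if job_env is also stored:
    sacct -j 1234567 --env-vars

Note that scripts are only recorded for jobs submitted after the flag is enabled.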
Re: [slurm-users] Submitting hybrid OpenMPI and OpenMP Jobs
You might also try swapping to use srun instead of mpiexec, as that way slurm can give more direction as to which cores have been allocated to what. I've found in the past that mpiexec will ignore what Slurm tells it. -Paul Edmon- On 9/22/23 8:24 AM, Lambers, Martin wrote: Hello, for this setup it typically helps to disable MPI process binding with "mpirun --bind-to none ..." (or similar) so that OpenMP can use all cores. Best, Martin On 22/09/2023 13:57, Selch, Brigitte (FIDD) wrote: Hello, one of our applications needs a hybrid OpenMPI and OpenMP job submission. Only one task is allowed on one node, but this task should use all cores of the node. So, for example, I made: #!/bin/bash #SBATCH --nodes=5 #SBATCH --ntasks=5 #SBATCH --cpus-per-task=44 #SBATCH --export=ALL export OMP_NUM_THREADS=44 mpiexec PreonNode test.prscene But the job does not take more than one thread: … Thread binding will be disabled because the full machine is not available for the process. Detected 44 CPU threads, 2 l3 caches and 2 packages on the machine. Number of CPU processors reported by OpenMP: 1 Maximum number of CPU threads reported by OpenMP: 44 Warning: OMP_NUM_THREADS was set to 44, which is higher than the number of available processors of 1. Will use 1 threads now. … What did I do wrong? Does anyone have any idea why OpenMP thinks it can only use one thread per node? Thanks! Best regards, Brigitte Selch MAN Truck & Bus SE IT Produktentwicklung Simulation (FIDD) Vogelweiher Str. 33 90441 Nürnberg
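A minimal sketch of the srun-based variant of the submission script above (the application command is taken from the original post; the explicit --cpus-per-task on srun is there because some Slurm versions do not propagate it from sbatch automatically):

    #!/bin/bash
    #SBATCH --nodes=5
    #SBATCH --ntasks-per-node=1      # one MPI rank per node
    #SBATCH --cpus-per-task=44       # all cores of the node for the OpenMP threads

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    srun --cpus-per-task=$SLURM_CPUS_PER_TASK PreonNode test.prscene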
Re: [slurm-users] Best way to accurately calculate the CPU usage of an account when using fairshare?
I would recommend standing up an instance of XDMod as it handles most of this for you in its summary reports. https://open.xdmod.org/10.0/index.html -Paul Edmon- On 5/3/23 2:05 PM, Joseph Francisco Guzman wrote: Good morning, We have at least one billed account right now, where the associated researchers are able to submit jobs that run against our normal queue with fairshare, but not for an academic research purpose. So we'd like to accurately calculate their CPU hours. We are currently using a script to query the db with sacct and sum up the value of ElapsedRaw * AllocCPUS for all jobs. But this seems limited, because requeueing will create what the sacct man page calls duplicates. By default jobs normally get requeued only if there's something outside of the user's control like a NODE_FAIL or an scontrol command to requeue it manually, though I think users can requeue things themselves, it's not a feature we've seen our researchers use. However with the new scrontab feature, whenever the cron is executed more than once, sacct reports that the previous jobs are "requeued" and are only visible by looking up duplicates. I haven't seen any billed account use requeueing or scrontab yet, but it's clear to me that it could be significant once researchers start using scrontab more. Scrontab has existed since one of the releases from 2020 I believe, but we enabled it this year and see it as much more powerful than the traditional linux crontab. What would be the best way to more thoroughly calculate ElapsedRaw * AllocCPUS, to account for duplicates, but optionally ignore unintentional requeueing like from a NODE_FAIL? Here's the main loop of the simple bash script I have now: while IFS='|' read -r end elapsed cpus; do # if a job crosses the month barrier # the entire bill will be put under the 2nd month year_month="${end:0:7}" if [[ ! "$elapsed" =~ ^[0-9]+$ ]] || [[ ! "$cpus" =~ ^[0-9]+$ ]]; then continue fi core_seconds["$year_month"]=$(( core_seconds["$year_month"] + (elapsed * cpus) )) done < <(sacct -a -A "$SLURM_ACCOUNT" \ -S "$START_DATE" \ -E "$END_DATE" \ -o End,ElapsedRaw,AllocCPUS -X -P --noheader) Our slurmdbd is configured to keep 6 months of data. It make senses to loop through the jobids instead, using sacct's -D/--duplicates option each time to reveal the hidden duplicates in the REQUEUED state, but I'm interested if there are alternatives or if I'm missing anything here. Thanks, Joseph -- Joseph F. Guzman - ITS (Advanced Research Computing) Northern Arizona University joseph.f.guz...@nau.edu
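As a hedged sketch of how the sacct call in the script above could be extended: the -D/--duplicates flag is documented and exposes the requeued attempts, while the State-based filter is an assumption about what "unintentional requeueing" (e.g. NODE_FAIL) looks like in the output:

    sacct -a -A "$SLURM_ACCOUNT" \
          -S "$START_DATE" -E "$END_DATE" \
          -D -X -P --noheader \
          -o End,ElapsedRaw,AllocCPUS,State |
      awk -F'|' '$4 !~ /NODE_FAIL/ { total += $2 * $3 }
                 END { print total " core-seconds" }'   # skip attempts killed by node failure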
Re: [slurm-users] changing the operational network in slurm setup
We do this for our Infiniband set up. What we do is that we populate /etc/hosts with the hostname mapped to the IP we want Slurm to use. This way you get IP traffic traversing the address you want between nodes while not having to mess with DNS. -Paul Edmon- On 3/14/2023 12:19 AM, Purvesh Parmar wrote: Thank you. It would be helpful if you can elaborate on this. We had hostnames given according to interfaces. Now that also needs to be changed, Thanks, P. parmar On Tue, 14 Mar 2023 at 07:58, Steven Hood wrote: Set dns server to use the ip address of the 10g Sent from my T-Mobile 4G LTE Device Original message From: Purvesh Parmar Date: 3/13/23 7:05 PM (GMT-08:00) To: Slurm User Community List Subject: Re: [slurm-users] changing the operational network in slurm setup CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe. Hi, No, its an additional network enabled on all the nodes and now slurm services we want to migrate from 1 GbE network to 10 GbE network. Yes, we have assigned different ip addresses on the 10 GbE network On Tue, 14 Mar 2023 at 07:22, Steven Hood wrote: Have you changed the IP assignment to use the 10GB interface? -Original Message- From: Purvesh Parmar Reply-To: Slurm User Community List To: Slurm User Community List Subject: [slurm-users] changing the operational network in slurm setup Date: 03/13/2023 06:19:13 PM CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe. hi, We have slurm 22.08 running on ethernet (1 GbE) network (slurmdbd, slurmctld and slurmd on compute nodes) on ubuntu 20.04. We want to migrate the slurm services on the 10 gbe network, which is present on all the nodes and on the master server as well. How to proceed for this? Thanks, P. Parmar
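A sketch of the /etc/hosts approach described above, with made-up addresses; the point is simply that the short hostname Slurm knows each node by resolves to the interface you want the traffic on:

    # /etc/hosts (identical on every node)
    10.10.0.10   ctld01     # slurmctld host, address on the fast network
    10.10.0.101  node101
    10.10.0.102  node102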
Re: [slurm-users] linting slurm.conf files
We have a gitlab runner that fires up a docker container that basically starts up a mini scheduler (slurmdbd and slurmctld) to confirm that both can start. It covers most bases but we would like to see an official syntax checker (https://bugs.schedmd.com/show_bug.cgi?id=3435). -Paul Edmon- On 1/27/23 2:36 PM, Kevin Broch wrote: I'm wondering what others use to lint their slurm.conf files to give more confidence that the changes are valid. I came across https://github.com/appeltel/slurmlint which was somewhat functional but since it hasn't been updated since 2019, when I ran it against a valid slurm.conf file based on a later slurm rev. it flagged a bunch of false positives that were simply new valid options. On the plus side it was able to flag an example of a misconfigured node/partition. Any ideas would be greatly appreciated. Best, /
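In the absence of an official syntax checker, the smoke test Paul describes can be approximated by starting the daemons in the foreground against the candidate config and treating "still running after N seconds" as a pass. A rough sketch only, assuming munge and a throwaway MariaDB are already up inside the test container:

    #!/bin/bash
    # CI smoke test for a candidate slurm.conf (illustrative, not a complete pipeline)
    cp slurm.conf.candidate /etc/slurm/slurm.conf
    slurmdbd && sleep 5                 # slurmdbd daemonizes itself
    timeout 30 slurmctld -D -vv
    rc=$?
    # a healthy slurmctld is killed by timeout (exit 124); an early exit means a broken config
    if [ "$rc" -eq 124 ]; then
        echo "slurm.conf parsed, slurmctld stayed up"
    else
        echo "slurmctld exited early (rc=$rc), check the config" >&2
        exit 1
    fi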
Re: [slurm-users] Maintaining slurm config files for test and production clusters
The symlink method for slurm.conf is what we do as well. We have a NFS mount from the slurm master that we host the slurm.conf on that we then symlink slurm.conf to that NFS share. -Paul Edmon- On 1/4/2023 1:53 PM, Brian Andrus wrote: One of the simple ways I have dealt with different configs is to symlink /etc/slurm/slurm.conf to the appropriate file (eg: slurm-dev.conf and slurm-prod.conf) In fact, I use the symlink for my dev and nothing (configless) for prod. Then I can change a running node to/from dev/prod by merely creating/deleting the symlink and restarting slurmd. Just an option that may work for you. I also use separate repos for prod/dev when I am working on packages/testing. I rather prefer that separation so I don't have someone accidentally update to a package that is not production-ready. Brian Andrus On 1/4/2023 9:22 AM, Groner, Rob wrote: We currently have a test cluster and a production cluster, all on the same network. We try things on the test cluster, and then we gather those changes and make a change to the production cluster. We're doing that through two different repos, but we'd like to have a single repo to make the transition from testing configs to publishing them more seamless. The problem is, of course, that the test cluster and production clusters have different cluster names, as well as different nodes within them. Using the include directive, I can pull all of the NodeName lines out of slurm.conf and put them into %c-nodes.conf files, one for production, one for test. That still leaves me with two problems: * The clustername itself will still be a problem. I WANT the same slurm.conf file between test and production...but the clustername line will be different for them both. Can I use an env var in that cluster name, because on production there could be a different env var value than on test? * The gres.conf file. I tried using the same "include" trick that works on slurm.conf, but it failed because it did not know what the "ClusterName" was. I think that means that either it doesn't work for anything other than slurm.conf, or that the clustername will have to be defined in gres.conf as well? Any other suggestions of how to keep our slurm files in a single source control repo, but still have the flexibility to have them run elegantly on either test or production systems? Thanks.
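A sketch of the symlink switch Brian describes, with hypothetical file names, and assuming the node is already set up for configless operation against the prod controller when the symlink is absent:

    # point a node at the dev config and restart slurmd
    ln -sfn /etc/slurm/slurm-dev.conf /etc/slurm/slurm.conf
    systemctl restart slurmd

    # drop back to configless (prod) by removing the symlink
    rm -f /etc/slurm/slurm.conf
    systemctl restart slurmd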
Re: [slurm-users] How to read job accounting data long output? `sacct -l`
The seff utility (in slurm-contribs) also gives good summary info. You can also use --parsable to make things more manageable. -Paul Edmon- On 12/14/22 3:41 PM, Ross Dickson wrote: I wrote a simple Python script to transpose the output of sacct from a row into a column. See if it meets your needs. https://github.com/ComputeCanada/slurm_utils/blob/master/sacct-all.py - Ross Dickson Dalhousie University / ACENET / Digital Research Alliance of Canada On Wed, Dec 14, 2022 at 1:16 PM Davide DelVento wrote: It would be very useful if there were a way (perhaps a custom script parsing the sacct output) to provide the information in the same format as "scontrol show job" Has anybody attempted to do that?
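For example (job ID and field list are illustrative):

    # one record per line, pipe-delimited, easy to post-process
    sacct -j 1234567 -X --parsable2 -o JobID,JobName,Elapsed,TotalCPU,MaxRSS,State

    # CPU and memory efficiency summary from slurm-contribs
    seff 1234567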
Re: [slurm-users] Slurm v22 for Alma 8
Yeah, our spec is based off of their spec with our own additional features plugged in. -Paul Edmon- On 12/2/22 2:12 PM, David Thompson wrote: Hi Paul, thanks for passing that along. The error I saw was coming from the rpmbuild %check stage in the el9/fc38 builds, which your .spec file doesn’t run (likewise the spec file included in the schedmd tarball). Certainly one way to avoid failing a check is to not run it. Regardless, I appreciate the help. David Thompson University of Wisconsin – Madison Social Science Computing Cooperative *From:* slurm-users *On Behalf Of *Paul Edmon *Sent:* Friday, December 2, 2022 11:26 AM *To:* slurm-users@lists.schedmd.com *Subject:* Re: [slurm-users] Slurm v22 for Alma 8 Yup, here is the spec we use that works for CentOS 7, Rocky 8, and Alma 8. -Paul Edmon- On 12/2/22 12:21 PM, David Thompson wrote: Hi folks, I’m working on getting Slurm v22 RPMs built for our Alma 8 Slurm cluster. We would like to be able to use the sbatch –prefer option, which isn’t present in the current EPEL el8 rpms (version 20.11.9). Rebuilding from either the el9 or fc38 SRPM or fails on a protocol test in testsuite/slurm_unit/common/slurm_protocol_defs: FAIL: slurm_addto_id_char_list-test Before I start digging in, I thought I would check here and see if anyone has a successful RHEL/Alma/Rocky 8 slurm v22 SRPM they’d be willing to share. Thanks much! David Thompson University of Wisconsin – Madison Social Science Computing Cooperative
Re: [slurm-users] Slurm v22 for Alma 8
Yup, here is the spec we use that works for CentOS 7, Rocky 8, and Alma 8. -Paul Edmon- On 12/2/22 12:21 PM, David Thompson wrote: Hi folks, I’m working on getting Slurm v22 RPMs built for our Alma 8 Slurm cluster. We would like to be able to use the sbatch –prefer option, which isn’t present in the current EPEL el8 rpms (version 20.11.9). Rebuilding from either the el9 or fc38 SRPM or fails on a protocol test in testsuite/slurm_unit/common/slurm_protocol_defs: FAIL: slurm_addto_id_char_list-test Before I start digging in, I thought I would check here and see if anyone has a successful RHEL/Alma/Rocky 8 slurm v22 SRPM they’d be willing to share. Thanks much! David Thompson University of Wisconsin – Madison Social Science Computing Cooperative Name: slurm Version: 22.05.6 %define rel 1 Release: %{rel}fasrc01%{?dist} Summary: Slurm Workload Manager Group: System Environment/Base License: GPLv2+ URL: https://slurm.schedmd.com/ # when the rel number is one, the directory name does not include it %if "%{rel}" == "1" %global slurm_source_dir %{name}-%{version} %else %global slurm_source_dir %{name}-%{version}-%{rel} %endif Source: %{slurm_source_dir}.tar.bz2 # build options .rpmmacros options change to default action # # --prefix %_prefix path install path for commands, libraries, etc. # --with cray %_with_cray 1 build for a Cray Aries system # --with cray_network %_with_cray_network 1 build for a non-Cray system with a Cray network # --with cray_shasta %_with_cray_shasta 1 build for a Cray Shasta system # --with slurmrestd %_with_slurmrestd 1 build slurmrestd # --with slurmsmwd %_with_slurmsmwd 1 build slurmsmwd # --without debug %_without_debug 1 don't compile with debugging symbols # --with hdf5 %_with_hdf5 path require hdf5 support # --with hwloc %_with_hwloc 1 require hwloc support # --with lua %_with_lua path build Slurm lua bindings # --with mysql %_with_mysql 1 require mysql/mariadb support # --with numa %_with_numa 1 require NUMA support # --without pam %_without_pam 1 don't require pam-devel RPM to be installed # --without x11 %_without_x11 1 disable internal X11 support # --with ucx %_with_ucx path require ucx support # --with pmix %_with_pmix path require pmix support # --with nvml %_with_nvml path require nvml support # %define _with_slurmrestd 1 # Options that are off by default (enable with --with ) %bcond_with cray %bcond_with cray_network %bcond_with cray_shasta %bcond_with slurmrestd %bcond_with slurmsmwd %bcond_with multiple_slurmd %bcond_with ucx # These options are only here to force there to be these on the build. # If they are not set they will still be compiled if the packages exist. %bcond_with hwloc %bcond_with mysql %bcond_with hdf5 %bcond_with lua %bcond_with numa %bcond_with pmix %bcond_with nvml # Use debug by default on all systems %bcond_without debug # Options enabled by default %bcond_without pam %bcond_without x11 # Disable hardened builds. 
-z,now or -z,relro breaks the plugin stack %undefine _hardened_build %global _hardened_cflags "-Wl,-z,lazy" %global _hardened_ldflags "-Wl,-z,lazy" # Disable Link Time Optimization (LTO) %define _lto_cflags %{nil} Requires: munge %{?systemd_requires} BuildRequires: systemd BuildRequires: munge-devel munge-libs BuildRequires: python3 BuildRequires: readline-devel Obsoletes: slurm-lua <= %{version} Obsoletes: slurm-munge <= %{version} Obsoletes: slurm-plugins <= %{version} # fake systemd support when building rpms on other platforms %{!?_unitdir: %global _unitdir /lib/systemd/systemd} %define use_mysql_devel %(perl -e '`rpm -q mariadb-devel`; print $?;') %if %{with mysql} %if %{use_mysql_devel} BuildRequires: mysql-devel >= 5.0.0 %else BuildRequires: mariadb-devel >= 5.0.0 %endif %endif %if %{with cray} BuildRequires: cray-libalpscomm_cn-devel BuildRequires: cray-libalpscomm_sn-devel BuildRequires: libnuma-devel BuildRequires: libhwloc-devel BuildRequires: cray-libjob-devel BuildRequires: gtk2-devel BuildRequires: glib2-devel BuildRequires: pkg-config %endif %if %{with cray_network} %if %{use_mysql_devel} BuildRequires: mysql-devel %else BuildRequires: mariadb-devel %endif BuildRequires: cray-libalpscomm_cn-devel BuildRequires: cray-libalpscomm_sn-devel BuildRequires: hwloc-devel BuildRequires: gtk2-devel BuildRequires: glib2-devel BuildRequires: pkgconfig %endif BuildRequires: perl(ExtUtils::MakeMaker) BuildRequires: libcurl-devel BuildRequires: numactl-devel BuildRequires: json-c-devel BuildRequires: infiniband-diags-devel BuildRequires: rdma-core-devel BuildRequires: lz4-devel BuildRequires: man2html BuildRequires: http-parser-devel BuildRequires: libyaml-devel BuildRequires: hdf5-devel BuildRequires: freeipmi-devel BuildRequires: rrdtool-devel BuildRequires: hwloc-devel BuildRequires: lua-devel BuildRequires: mysql-devel BuildRequires: gtk2-dev
Re: [slurm-users] slurm 22.05 "hash_k12" related upgrade issue
It only happens for versions on the 22.05 series prior to the latest release (22.05.5). So the 21 version isn't impacted and you should be fine to upgrade from 21 to 22.05.5 and not see the hash_k12 issue. If you upgrade to any prior minor version though you will hit this issue. -Paul Edmon- On 10/24/2022 3:13 PM, Marko Markoc wrote: Hi All, Regarding https://lists.schedmd.com/pipermail/slurm-users/2022-September/009222.html . Question for all of you that might have done this upgrade recently, does this happen during the major version ( 21->22 in my case ) upgrade also ? All of the discussion I found online about it only mentions minor version upgrades. Thanks, Marko
Re: [slurm-users] Ideal NFS exported StateSaveLocation size.
HA for slurmctld is not multidatacenter HA but rather a traditional HA setup where you have two server heads off of one storage brick (connected by SAS cables or other fast interconnect). Multidatacenter HA has issues with keeping things in sync due to latency and IOPs (as noted below). So the HA setup for slurmctld will protect you from the server hosting the slurmctld getting hosed, not the entire rack going down or the datacenter going down. -Paul Edmon- On 10/24/2022 4:14 AM, Ole Holm Nielsen wrote: On 10/24/22 09:57, Diego Zuccato wrote: Il 24/10/2022 09:32, Ole Holm Nielsen ha scritto: > It is definitely a BAD idea to store Slurm StateSaveLocation on a slow > NFS directory! SchedMD recommends to use local NVME or SSD disks > because there will be many IOPS to this file system! IIUC it does have to be shared between controllers, right? Possibly use NVME-backed (or even better NVDIMM-backed) NFS share. Or replica-3 Gluster volume with NVDIMMs for the bricks, for the paranoid :) IOPS is the key parameter! Local NVME or SSD should beat any networked storage. The original question refers to having StateSaveLocation on a standard (slow) NFS drive, AFAICT. I don't know how many people prefer using 2 slurmctld hosts (primary and backup)? I certainly don't do that. Slurm does have a configurable SlurmctldTimeout parameter so that you can reboot the server quickly when needed. It would be nice if people with experience in HA storage for slurmctld could comment. /Ole
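In slurm.conf terms, the traditional primary/backup setup Paul describes looks roughly like this (hostnames and path are placeholders; the key point is that both controller heads see the same StateSaveLocation on the shared, fast storage):

    SlurmctldHost=ctld-primary
    SlurmctldHost=ctld-backup            # takes over if the primary stops responding
    StateSaveLocation=/slurm/statesave   # shared storage reachable from both heads
    SlurmctldTimeout=120                 # how long the backup waits before assuming control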
Re: [slurm-users] Check consistency
The slurmctld log will print out if hosts are out of sync with the slurmctld slurm.conf. That said, it doesn't report on cgroup consistency changes like that. It's possible that dialing up the verbosity on the slurmd logs may give that info, but I haven't seen it in normal operation. -Paul Edmon- On 10/6/22 5:47 PM, Davide DelVento wrote: Is there a simple way to check that what slurm is running is what the config says it should be? For example, my understanding is that changing cgroup.conf should be followed by 'systemctl stop slurmd' on all compute nodes, then 'systemctl restart slurmctld' on the head node, then 'systemctl start slurmd' on the compute nodes. Assuming this is correct, is there a way to query the nodes and ask if they are indeed running what the config is saying (or alternatively have them dump their config files somewhere for me to manually run a diff on)? Thanks, Davide
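There is no built-in report for this, but a crude check is possible from the outside: compare the config files across nodes, and flag any slurmd that started before the current file was written (and therefore cannot have loaded it). A sketch, assuming clush (or pdsh) and systemd are available; node ranges and paths are placeholders:

    # are the files themselves identical everywhere?
    clush -w node[001-100] 'md5sum /etc/slurm/slurm.conf /etc/slurm/cgroup.conf' | sort -k2

    # which running slurmd daemons predate the current cgroup.conf?
    clush -w node[001-100] 'echo "slurmd up since: $(systemctl show -p ActiveEnterTimestamp --value slurmd), cgroup.conf mtime: $(stat -c %y /etc/slurm/cgroup.conf)"'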
Re: [slurm-users] Recommended amount of memory for the database server
It should generally be as much as you need to hold the full database in memory. That said, if you are storing Job Envs and Scripts that will be a lot of data, even with the deduping they are doing. We've generally done about a 90 GB buffer size here without much of any issue, even though our database is bigger than that. -Paul Edmon- On 9/25/22 5:18 PM, byron wrote: Hi Does anyone know what is the recommended amount of memory to give slurm's mariadb database server? I seem to remember reading a simple estimate based on the size of certain tables (or something along those lines) but I can't find it now. Thanks
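As a concrete example, the buffer being discussed is the InnoDB buffer pool. A sketch of the relevant MariaDB settings (the 90 GB figure is the one quoted above; the other two values follow the general recommendations in the Slurm accounting documentation, and the file path varies by distro):

    # e.g. /etc/my.cnf.d/slurmdbd-innodb.cnf
    [mysqld]
    innodb_buffer_pool_size  = 90G
    innodb_log_file_size     = 64M
    innodb_lock_wait_timeout = 900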
Re: [slurm-users] Providing users with info on wait time vs. run time
We also call scontrol in our scripts (a little as we can manage) and we run at the scale of 1500 nodes. It hasn't really caused many issues, but we try to limit it as much as we possibly can. -Paul Edmon- On 9/16/22 9:41 AM, Sebastian Potthoff wrote: Hi Hermann, So you both are happily(?) ignoring this warning the "Prolog and Epilog Guide", right? :-) "Prolog and Epilog scripts [...] should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc)." We have probably been doing this since before the warning was added to the documentation. So we are "ignorantly ignoring" the advice :-/ Same here :) But if $SLURM_JOB_STDOUT is not defined as documented … what can you do. May I ask how big your clusters are (number of nodes) and how heavily they are used (submitted jobs per hour)? We have around 500 nodes (mostly 2x18 cores). Jobs ending (i.e. calling the epilog script) varies quite a lot between 1000 and 15k a day, so something in between 40 and 625 Jobs/hour. During those peaks Slurm can become noticeably slower, however usually it runs fine. Sebastian Am 16.09.2022 um 15:15 schrieb Loris Bennett : Hi Hermann, Hermann Schwärzler writes: Hi Loris, hi Sebastian, thanks for the information on how you are doing this. So you both are happily(?) ignoring this warning the "Prolog and Epilog Guide", right? :-) "Prolog and Epilog scripts [...] should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc)." We have probably been doing this since before the warning was added to the documentation. So we are "ignorantly ignoring" the advice :-/ May I ask how big your clusters are (number of nodes) and how heavily they are used (submitted jobs per hour)? We have around 190 32-core nodes. I don't know how I would easily find out the average number of jobs per hour. The only problems we have had with submission have been when people have written their own mechanisms for submitting thousands of jobs. Once we get them to use job array, such problems generally disappear. Cheers, Loris Regards, Hermann On 9/16/22 9:09 AM, Loris Bennett wrote: Hi Hermann, Sebastian Potthoff writes: Hi Hermann, I happened to read along this conversation and was just solving this issue today. I added this part to the epilog script to make it work: # Add job report to stdout StdOut=$(/usr/bin/scontrol show job=$SLURM_JOB_ID | /usr/bin/grep StdOut | /usr/bin/xargs | /usr/bin/awk 'BEGIN { FS = "=" } ; { print $2 }') NODELIST=($(/usr/bin/scontrol show hostnames)) # Only add to StdOut file if it exists and if we are the first node if [ "$(/usr/bin/hostname -s)" = "${NODELIST[0]}" -a ! -z "${StdOut}" ] then echo "# JOB REPORT ##" >> $StdOut /usr/bin/seff $SLURM_JOB_ID >> $StdOut echo "###" >> $StdOut fi We do something similar. At the end of our script pointed to by EpilogSlurmctld we have OUT=`scontrol show jobid ${job_id} | awk -F= '/ StdOut/{print $2}'` if [ ! -f "$OUT" ]; then exit fi printf "\n== Epilog Slurmctld ==\n\n" >> ${OUT} seff ${SLURM_JOB_ID} >> ${OUT} printf "\n==\n" ${OUT} chown ${user} ${OUT} Cheers, Loris Contrary to what it says in the slurm docs https://slurm.schedmd.com/prolog_epilog.html I was not able to use the env var SLURM_JOB_STDOUT, so I had to fetch it via scontrol. In addition I had to make sure it is only called by the „leading“ node as the epilog script will be called by ALL nodes of a multinode job and they would all call seff and clutter up the output. Last thing was to check if StdOut is not of length zero (i.e. it exists). Interactive jobs would otherwise cause the node to drain. 
Maybe this helps. Kind regards Sebastian PS: goslmailer looks quite nice with its recommendations! Will definitely look into it. -- Westfälische Wilhelms-Universität (WWU) Münster WWU IT Sebastian Potthoff (eScience / HPC) Am 15.09.2022 um 18:07 schrieb Hermann Schwärzler : Hi Ole, On 9/15/22 5:21 PM, Ole Holm Nielsen wrote: On 15-09-2022 16:08, Hermann Schwärzler wrote: Just out of curiosity: how do you insert the output of seff into the out-file of a job? Use the "smail" tool from the slurm-contribs RPM and set this in slurm.conf: MailProg=/usr/bin/smail Maybe I am missing something but from what I can tell smail sends an email and does *not* change or append to the .out file of a job... Regards, Hermann -- Dr. Loris Bennett (Herr/Mr) ZEDAT, Freie Universität Berlin emailloris.benn...@fu-berlin.de
Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9
But not any 20. There are 20 versions, 20.02 and 20.11, and there was a previous 19.05. So two versions for 18.08 would be 20.02 not 20.11 -Paul Edmon- On 9/8/22 12:14 PM, Wadud Miah wrote: The previous version was 18 and now I am trying to upgrade to 20, so I am well within 2 major versions. Regards, *From:* slurm-users on behalf of Paul Edmon *Sent:* Thursday, September 8, 2022 4:44:36 PM *To:* slurm-users@lists.schedmd.com *Subject:* Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9 *CAUTION:* This e-mail originated outside the University of Southampton. Typically slurm only supports upgrading between 2 major versions ahead. If you are on 18.08 you likely can only go to 20.02. Then after you upgrade to 20.02 you can go to 20.11 or 21.08. -Paul Edmon- On 9/8/22 11:38 AM, Wadud Miah wrote: hi Mick, I have checked that all the compute nodes and controllers all have the same version of SLURM (20.11.9). I am indeed trying to upgrade SlurmDB first, and am getting the errors in the slurmdbd.log: [2022-09-08T15:45:11.115] slurmdbd version 20.11.9 started [2022-09-08T15:45:23.001] error: unpack_header: protocol_version 8448 not supported [2022-09-08T15:33:57.001] unpacking header [2022-09-08T15:33:57.001] error: destroy_forward: no init [2022-09-08T15:33:57.001] error: slurm_unpack_received_msg: Message receive failure [2022-09-08T15:33:57.011] error: CONN:11 Failed to unpack SLURM_PERSIST_INIT message Regards, Wadud. *From:* slurm-users <mailto:slurm-users-boun...@lists.schedmd.com> on behalf of Timony, Mick <mailto:michael_tim...@hms.harvard.edu> *Sent:* 08 September 2022 16:24 *To:* Slurm User Community List <mailto:slurm-users@lists.schedmd.com> *Subject:* Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9 *CAUTION:* This e-mail originated outside the University of Southampton. This thread on the forums may help: https://groups.google.com/g/slurm-users/c/YB55Ru9rvD4 <https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgroups.google.com%2Fg%2Fslurm-users%2Fc%2FYB55Ru9rvD4=05%7C01%7Cw.miah%40soton.ac.uk%7Cfd25248a7e6a4fa729d308da91b20c1a%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637982491141437024%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C=THo3JUObIzF6EWcIlQ1OsJwUwxEAGUFeMdLuvlEKhzA%3D=0> It looks like you have something on your network with an older version of slurm installed. I'd check the Slurm version installed on your compute nodes and controllers. The recommended approach to upgrading is to upgrade the SlurmDB first, then the controllers, then the compute nodes. 
More info here: https://slurm.schedmd.com/quickstart_admin.html#upgrade Regards -- Mick Timony Senior DevOps Engineer Harvard Medical School -- *From:* slurm-users <mailto:slurm-users-boun...@lists.schedmd.com> on behalf of Wadud Miah <mailto:w.m...@soton.ac.uk> *Sent:* Thursday, September 8, 2022 10:47 AM *To:* slurm-users@lists.schedmd.com <mailto:slurm-users@lists.schedmd.com> *Subject:* [slurm-users] Upgrading SLURM from 18 to 20.11.9 Hi, I am attempting to upgrade from SLURM 18 to 20.11.9 and when I attempt to start slurmdbd (version 20.11.9), I get the following error messages in /var/log/slurm/slurmdbd.log: [2022-09-08T15:45:11.115] slurmdbd version 20.11.9 started [2022-09-08T15:45:23.001] error: unpack_header: protocol_version 8448 not supported [2022-09-08T15:33:57.001] unpacking header [2022-09-08T15:33:57.001] error: destroy_forward: no init [2022-09-08T15:33:57.001] error: slurm_unpack_received_msg: Message receive failure [2022-09-08T15:33:57.011] error: CONN:11 Failed to unpack SLURM_PERSIST_INIT message Any help will be greatly appreciated. Regards, -- Wadud Miah Research Computing Support University of Southampton
Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9
Typically slurm only supports upgrading between 2 major versions ahead. If you are on 18.08 you likely can only go to 20.02. Then after you upgrade to 20.02 you can go to 20.11 or 21.08. -Paul Edmon- On 9/8/22 11:38 AM, Wadud Miah wrote: hi Mick, I have checked that all the compute nodes and controllers all have the same version of SLURM (20.11.9). I am indeed trying to upgrade SlurmDB first, and am getting the errors in the slurmdbd.log: [2022-09-08T15:45:11.115] slurmdbd version 20.11.9 started [2022-09-08T15:45:23.001] error: unpack_header: protocol_version 8448 not supported [2022-09-08T15:33:57.001] unpacking header [2022-09-08T15:33:57.001] error: destroy_forward: no init [2022-09-08T15:33:57.001] error: slurm_unpack_received_msg: Message receive failure [2022-09-08T15:33:57.011] error: CONN:11 Failed to unpack SLURM_PERSIST_INIT message Regards, Wadud. *From:* slurm-users on behalf of Timony, Mick *Sent:* 08 September 2022 16:24 *To:* Slurm User Community List *Subject:* Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9 *CAUTION:* This e-mail originated outside the University of Southampton. This thread on the forums may help: https://groups.google.com/g/slurm-users/c/YB55Ru9rvD4 <https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgroups.google.com%2Fg%2Fslurm-users%2Fc%2FYB55Ru9rvD4=05%7C01%7Cw.miah%40soton.ac.uk%7C13f4b2b736764041dc9d08da91af4672%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637982479244856364%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C=cQGagihxp%2BD2JTZZY%2BMKVH5I%2B386oZIXbCZT9eyfTlg%3D=0> It looks like you have something on your network with an older version of slurm installed. I'd check the Slurm version installed on your compute nodes and controllers. The recommended approach to upgrading is to upgrade the SlurmDB first, then the controllers, then the compute nodes. More info here: https://slurm.schedmd.com/quickstart_admin.html#upgrade <https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fslurm.schedmd.com%2Fquickstart_admin.html%23upgrade=05%7C01%7Cw.miah%40soton.ac.uk%7C13f4b2b736764041dc9d08da91af4672%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637982479244856364%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C=BvJQSt4tfJY616T%2BTzfbGzw4nrTFCuZTbjyuThpssnQ%3D=0> Regards -- Mick Timony Senior DevOps Engineer Harvard Medical School -- *From:* slurm-users on behalf of Wadud Miah *Sent:* Thursday, September 8, 2022 10:47 AM *To:* slurm-users@lists.schedmd.com *Subject:* [slurm-users] Upgrading SLURM from 18 to 20.11.9 Hi, I am attempting to upgrade from SLURM 18 to 20.11.9 and when I attempt to start slurmdbd (version 20.11.9), I get the following error messages in /var/log/slurm/slurmdbd.log: [2022-09-08T15:45:11.115] slurmdbd version 20.11.9 started [2022-09-08T15:45:23.001] error: unpack_header: protocol_version 8448 not supported [2022-09-08T15:33:57.001] unpacking header [2022-09-08T15:33:57.001] error: destroy_forward: no init [2022-09-08T15:33:57.001] error: slurm_unpack_received_msg: Message receive failure [2022-09-08T15:33:57.011] error: CONN:11 Failed to unpack SLURM_PERSIST_INIT message Any help will be greatly appreciated. Regards, -- Wadud Miah Research Computing Support University of Southampton
Re: [slurm-users] maridb version compatibility with Slurm version
I've regularly upgraded the mariadb version without upgrading the slurm version, with no issue. We are currently running 10.6.7 for MariaDB on CentOS 7.9 with Slurm 22.05.2. So long as you do the mysql_upgrade after the upgrade and have a backup just in case, you should be fine. -Paul Edmon- On 8/24/22 1:58 AM, navin srivastava wrote: Hi, I have a question related to the mariadb vs slurm version compatibility. Is there any matrix available? We are running with slurm version 20.02 in our environment on SLES15SP3 and with mariadb 10.5.x. We are upgrading the OS from SLES15SP3 to SP4 and with this we see the mariadb version is 10.6.x, and we are not upgrading the Slurm version. What is the best way to deal with this, as we patch the server quarterly and keep the slurm version unchanged (I locked this at the OS level) but the mariadb version update happens, and as far as I see it has no impact. Is it a good idea to keep the mariadb version also intact with the slurm version? Regards Navin.
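A rough outline of that sequence (database name and backup path are placeholders; slurm_acct_db is only the default name):

    # take a backup while slurmdbd is stopped
    systemctl stop slurmdbd
    mysqldump --single-transaction slurm_acct_db > /root/slurm_acct_db-$(date +%F).sql

    # ... upgrade the MariaDB packages (e.g. as part of the OS service pack) ...

    systemctl restart mariadb
    mysql_upgrade                # rebuilds system tables for the new server version
    systemctl start slurmdbd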
Re: [slurm-users] Does the slurmctld node need access to Parallel File system and Runtime libraries of the SW in the Compute nodes.
True. Though be aware that Slurm will by default map the environment from login nodes to compute. That's the real thing that matters. So as long as the environment is setup properly, any filesystems excluding the home directory do not need to be mounted on login. -Paul Edmon- On 8/2/2022 9:56 AM, Brian Andrus wrote: A quick nuance: We only have home directories on the login node. Our software installations are not accessible from there to prevent users from running things there (you must have a running job to access the software packages). So your login node does not necessarily need everything the compute nodes do. Brian Andrus On 8/2/2022 6:45 AM, Paul Edmon wrote: No, the node running the slurmctld does not need access to any of the customer facing filesystems or home directories. While all the login and client nodes do, the slurmctld does not. -Paul Edmon- On 8/2/2022 9:30 AM, Richard Chang wrote: Hi, I am new to SLURM, so please bear with me. I need to understand whether the Server/Node running the slurmctld daemon will need access to the Parallel file system, and if it will need all the SW run time libraries installed, as in the compute nodes. The users will login to the Login/submission nodes with their home mounted from say PFS1 and change directory to the PFS2 mount point and then submit/run their jobs. Does it mean the Server/node running the slurmctld daemon will also need access to both the PFS1 and PFS2 mount points ? I am not sure. The server running the slurmctld daemon will be exclusively for that and is not a login node. Thanks & regards, Richard.
Re: [slurm-users] SlurmDB Archive settings?
Sure. Here are our settings: ArchiveJobs=yes ArchiveDir="/slurm/archive" ArchiveSteps=yes ArchiveResvs=yes ArchiveEvents=yes ArchiveSuspend=yes ArchiveTXN=yes ArchiveUsage=yes PurgeEventAfter=6month PurgeJobAfter=6month PurgeResvAfter=6month PurgeStepAfter=6month PurgeSuspendAfter=6month PurgeTXNAfter=6month PurgeUsageAfter=6month -Paul Edmon- On 7/15/2022 2:08 AM, Ole Holm Nielsen wrote: Hi Paul, On 7/14/22 15:10, Paul Edmon wrote: We just use the Archive function built into slurm. That has worked fine for us for the past 6 years. We keep 6 months of data in the active archive. Could you kindly share your Archive* settings in slurmdbd.conf? I've never tried to use this, but it sounds like a good idea. Thanks, Ole
Re: [slurm-users] SlurmDB Archive settings?
Yeah, a word of warning about going from 21.08 to 22.05, make sure you have enough storage on the database host you are doing the work on and budget a long enough time for the upgrade. We just converted our 198 GB (compressed, 534 GB raw) database this week. The initial attempt failed (after running for 8 hours) because we ran out of disk space (part of the reason we had to compress is that the server we use for our slurm master only has 800 GB of SSD on it). That meant we had to reimport our DB, which took 8 hours, plus then we had to drop the job scripts and job envs, which took another 5 hours, to then attempt the upgrade which took 2 hours. Moral of the story, make sure you have enough space and budget sufficient time. You may want to consider nulling out the job scripts and envs for the upgrade as they complete redo the way those are stored in the database in 22.05 so that it is more efficient but getting from here to there is the trick. For details see the bug report we filed: https://bugs.schedmd.com/show_bug.cgi?id=14514 -Paul Edmon- On 7/14/2022 2:34 PM, Timony, Mick wrote: What I can tell you is that we have never had a problem reimporting the data back in that was dumped from older versions into a current version database. So the import using sacctmgr must do the conversion from the older formats to the newer formats and handle the schema changes. That's the bit of info I was missing, I didn't realise that it outputs the data in a format that sacctmgr can read. I will note that if you are storing job_scripts and envs those can eat up a ton of space in 21.08. It looks like they've solved that problem in 22.05 but the archive steps on 21.08 took forever due to those scripts and envs. Yes, we are storing job_scripts with: AccountingStoreFlags=job_script I think when we made that decision, we decided that also saving the job_env would take up too much room as our DB is pretty big at the moment, at approx. 300GB with the o2_step_table and the o2_job_table taking up the most space for obvious reasons: ++---+ | Table | Size (GB) | ++---+ | o2_step_table | 183.83 | | o2_job_table | 128.18 | That's good advice Paul, much appreciated. >took forever and actually caused issues with the archive process I think that should be highlighted for other users! For those interested, to find the tables sizes I did this: SELECT table_name AS "Table", ROUND(((data_length + index_length) / 1024 / 1024 / 1024), 2) AS "Size (GB)" FROM information_schema.TABLES WHERE table_schema = "slurmdbd" ORDER BY (data_length + index_length) DESC; Replace slurmdbdwith the name of your database. Cheers --Mick
Re: [slurm-users] SlurmDB Archive settings?
We just use the Archive function built into slurm. That has worked fine for us for the past 6 years. We keep 6 months of data in the active archive. If you have 6 years worth of data and you want to prune down to 2 years, I recommend going month by month rather than doing it in one go. When we initially started archiving data several years back our first pass at archiving (which at that time had 2 years of data in it) took forever and actually caused issues with the archive process. We worked with SchedMD, improved the archive script built into Slurm but also decided to only archive one month at a time which allowed it to get done in a reasonable amount of time. The archived data can be pulled into a different slurm database, which is what we do for importing historic data into our XDMod instance. -Paul Edmon- On 7/13/2022 4:55 PM, Timony, Mick wrote: Hi Slurm Users, Currently we don't archive our SlurmDB and have 6 years' worth of data in our SlurmDB. We are looking to start archiving our database as it starting to get rather large, and we have decided to keep 2 years' worth of data. I'm wondering what approaches or scripts other groups use. The docs refer to the ArchiveScript setting at: https://slurm.schedmd.com/slurmdbd.conf.html#OPT_ArchiveScript I've seen suggestions to import into another database that will require keeping the schema up-to-date which seems like a possible maintenance issue or nightmare if one forgets to update the schema after updating Slurmdb. We also have most of the information in an Elasticsearch <https://slurm.schedmd.com/elasticsearch.html> instance, which will likely suite our needs for long term historical information. What do you use to archive this information? CSV files, SQL dumps or something else? Regards -- Mick Timony Senior DevOps Engineer Harvard Medical School --
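For the month-by-month pruning Paul describes, one way (a sketch only; check the sacctmgr man page for the exact option names in your version) is to drive the archive/purge by hand with sacctmgr rather than waiting for slurmdbd's own purge cycle, stepping the purge window down one month at a time, e.g. from 60 months towards 24:

    # archive and purge everything older than 60 months, then repeat with 59, 58, ... down to 24
    sacctmgr archive dump Directory=/slurm/archive \
        Jobs Steps Events Suspend \
        PurgeJobAfter=60month PurgeStepAfter=60month \
        PurgeEventAfter=60month PurgeSuspendAfter=60month

    # archived data can later be pulled into a separate slurmdbd instance, e.g. for XDMoD
    sacctmgr archive load file=/slurm/archive/<archive_file>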
Re: [slurm-users] upgrading slurm to 20.11
Database upgrades can also take a while if your database is large. Definitely recommend backing up prior to upgrade as well as running slurmdbd -Dv and not the systemd daemon as if the upgrade takes a long time it will kill it preemptively due to unresponsiveness which will create all sorts of problems. -Paul Edmon- On 5/17/22 2:50 PM, Ole Holm Nielsen wrote: Hi, You can upgrade from 19.05 to 20.11 in one step (2 major releases), skipping 20.02. When that is completed, it is recommended to upgrade again from 20.11 to 21.08.8 in order to get the current major version. The 22.05 will be out very soon, but you may want to wait a couple of minor releases before upgrading to 22.05. I have collected much detailed information about Slurm upgrades in my Wiki page: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm It is strongly recommended to make the dry-run test of the database upgrade, just to be sure your database won't cause problems. /Ole On 17-05-2022 18:13, byron wrote: Sorry, I should have been clearer. I understand that with regards to slurmd / slurmctld you can skip a major release without impacting running jobs etc. My questions was about upgrading slurmdbd and whether it was necessary to upgrade through the intermediate major releases (which I know understand is necessary). Thanks On Tue, May 17, 2022 at 4:49 PM Paul Edmon <mailto:ped...@cfa.harvard.edu>> wrote: The slurm docs say you can do two major releases at a time (https://slurm.schedmd.com/quickstart_admin.html <https://slurm.schedmd.com/quickstart_admin.html>): "Almost every new major release of Slurm (e.g. 20.02.x to 20.11.x) involves changes to the state files with new data structures, new options, etc. Slurm permits upgrades to a new major release from the past two major releases, which happen every nine months (e.g. 20.02.x or 20.11.x to 21.08.x) without loss of jobs or other state information." As for old versions of slurm I think at this point you would need to contact SchedMD. I'm sure they have past releases they can hand out if you are bootstrapping to a newer release. -Paul Edmon- On 5/17/22 11:42 AM, byron wrote: Thanks Brian for the speedy responce. Am I not correct in thinking that if I just go from 19.05 to 20.11 then there is the advantage that I can upgrade slurmd and slurmctld in one go and it won't affect the running jobs since upgrading to a new major release from the past two major releases doesn't affect the state information. Or are you saying that in this case (19.05 direct to 21.08) there isn't any impact to running jobs either. Or did you step through all the versions when upgrading slurmd and slurmctld also? Also where do I get a copy of 20.2 from if schedMD aren't providing it as a download. Thanks On Tue, May 17, 2022 at 4:05 PM Brian Andrus mailto:toomuc...@gmail.com>> wrote: You need to step upgrade through major versions (not minor). So 19.05=>20.x I would highly recommend going to 21.08 while you are at it. I just did the same migration (although they started at 18.x) with no issues. Running jobs were not impacted and users didn't even notice. Brian Andrus On 5/17/2022 7:35 AM, byron wrote: > Hi > > I'm looking at upgrading our install of slurm from 19.05 to 20.11 in > responce to the recenty announced security vulnerabilities. 
> > I've been through the documentation / forums and have managed to find > the answers to most of my questions but am still unclear about the > following > > - In upgrading the slurmdbd from 19.05 to 20.11 do I need to go > through all the versions (19.05 => 20.2 => 20.11)? From reading the > forums it look as though it is necesary > https://groups.google.com/g/slurm-users/c/fftVPaHvTzQ/m/YTWo1mRjAwAJ <https://groups.google.com/g/slurm-users/c/fftVPaHvTzQ/m/YTWo1mRjAwAJ> > https://groups.google.com/g/slurm-users/c/kXtepX8-L7I/m/udwySA3bBQAJ <https://groups.google.com/g/slurm-users/c/kXtepX8-L7I/m/udwySA3bBQAJ> > However if that is the case it would seem strange that SchedMD have > removed 20.2 from the downloads page (I understand the reason is that > it contains the exploit) if it is still required for the upgrade. > > - We are running version 5.5.68 of the MariaDB, the version that comes > with centos7.9. I've seen a few references to upgrading v5.5 but > they were in the context of upgrading from slurm 17 to 18. I'm > wondering if its ok to stick with this version since we're already on > slurm 19.05. > > Any help much appreciated.
Re: [slurm-users] upgrading slurm to 20.11
I think it should be, but you should be able to run a test and find out. -Paul Edmon- On 5/17/22 12:13 PM, byron wrote: Sorry, I should have been clearer. I understand that with regards to slurmd / slurmctld you can skip a major release without impacting running jobs etc. My questions was about upgrading slurmdbd and whether it was necessary to upgrade through the intermediate major releases (which I know understand is necessary). Thanks On Tue, May 17, 2022 at 4:49 PM Paul Edmon wrote: The slurm docs say you can do two major releases at a time (https://slurm.schedmd.com/quickstart_admin.html): "Almost every new major release of Slurm (e.g. 20.02.x to 20.11.x) involves changes to the state files with new data structures, new options, etc. Slurm permits upgrades to a new major release from the past two major releases, which happen every nine months (e.g. 20.02.x or 20.11.x to 21.08.x) without loss of jobs or other state information." As for old versions of slurm I think at this point you would need to contact SchedMD. I'm sure they have past releases they can hand out if you are bootstrapping to a newer release. -Paul Edmon- On 5/17/22 11:42 AM, byron wrote: Thanks Brian for the speedy responce. Am I not correct in thinking that if I just go from 19.05 to 20.11 then there is the advantage that I can upgrade slurmd and slurmctld in one go and it won't affect the running jobs since upgrading to a new major release from the past two major releases doesn't affect the state information. Or are you saying that in this case (19.05 direct to 21.08) there isn't any impact to running jobs either. Or did you step through all the versions when upgrading slurmd and slurmctld also? Also where do I get a copy of 20.2 from if schedMD aren't providing it as a download. Thanks On Tue, May 17, 2022 at 4:05 PM Brian Andrus wrote: You need to step upgrade through major versions (not minor). So 19.05=>20.x I would highly recommend going to 21.08 while you are at it. I just did the same migration (although they started at 18.x) with no issues. Running jobs were not impacted and users didn't even notice. Brian Andrus On 5/17/2022 7:35 AM, byron wrote: > Hi > > I'm looking at upgrading our install of slurm from 19.05 to 20.11 in > responce to the recenty announced security vulnerabilities. > > I've been through the documentation / forums and have managed to find > the answers to most of my questions but am still unclear about the > following > > - In upgrading the slurmdbd from 19.05 to 20.11 do I need to go > through all the versions (19.05 => 20.2 => 20.11)? From reading the > forums it look as though it is necesary > https://groups.google.com/g/slurm-users/c/fftVPaHvTzQ/m/YTWo1mRjAwAJ > https://groups.google.com/g/slurm-users/c/kXtepX8-L7I/m/udwySA3bBQAJ > However if that is the case it would seem strange that SchedMD have > removed 20.2 from the downloads page (I understand the reason is that > it contains the exploit) if it is still required for the upgrade. > > - We are running version 5.5.68 of the MariaDB, the version that comes > with centos7.9. I've seen a few references to upgrading v5.5 but > they were in the context of upgrading from slurm 17 to 18. I'm > wondering if its ok to stick with this version since we're already on > slurm 19.05. > > Any help much appreciated. > > > >
Re: [slurm-users] upgrading slurm to 20.11
The slurm docs say you can do two major releases at a time (https://slurm.schedmd.com/quickstart_admin.html): "Almost every new major release of Slurm (e.g. 20.02.x to 20.11.x) involves changes to the state files with new data structures, new options, etc. Slurm permits upgrades to a new major release from the past two major releases, which happen every nine months (e.g. 20.02.x or 20.11.x to 21.08.x) without loss of jobs or other state information." As for old versions of slurm I think at this point you would need to contact SchedMD. I'm sure they have past releases they can hand out if you are bootstrapping to a newer release. -Paul Edmon- On 5/17/22 11:42 AM, byron wrote: Thanks Brian for the speedy responce. Am I not correct in thinking that if I just go from 19.05 to 20.11 then there is the advantage that I can upgrade slurmd and slurmctld in one go and it won't affect the running jobs since upgrading to a new major release from the past two major releases doesn't affect the state information. Or are you saying that in this case (19.05 direct to 21.08) there isn't any impact to running jobs either. Or did you step through all the versions when upgrading slurmd and slurmctld also? Also where do I get a copy of 20.2 from if schedMD aren't providing it as a download. Thanks On Tue, May 17, 2022 at 4:05 PM Brian Andrus wrote: You need to step upgrade through major versions (not minor). So 19.05=>20.x I would highly recommend going to 21.08 while you are at it. I just did the same migration (although they started at 18.x) with no issues. Running jobs were not impacted and users didn't even notice. Brian Andrus On 5/17/2022 7:35 AM, byron wrote: > Hi > > I'm looking at upgrading our install of slurm from 19.05 to 20.11 in > responce to the recenty announced security vulnerabilities. > > I've been through the documentation / forums and have managed to find > the answers to most of my questions but am still unclear about the > following > > - In upgrading the slurmdbd from 19.05 to 20.11 do I need to go > through all the versions (19.05 => 20.2 => 20.11)? From reading the > forums it look as though it is necesary > https://groups.google.com/g/slurm-users/c/fftVPaHvTzQ/m/YTWo1mRjAwAJ > https://groups.google.com/g/slurm-users/c/kXtepX8-L7I/m/udwySA3bBQAJ > However if that is the case it would seem strange that SchedMD have > removed 20.2 from the downloads page (I understand the reason is that > it contains the exploit) if it is still required for the upgrade. > > - We are running version 5.5.68 of the MariaDB, the version that comes > with centos7.9. I've seen a few references to upgrading v5.5 but > they were in the context of upgrading from slurm 17 to 18. I'm > wondering if its ok to stick with this version since we're already on > slurm 19.05. > > Any help much appreciated. > > > >
Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"
They fixed this in newer versions of Slurm. We had the same issue with older versions, so we had to run with the config_override option on to keep the logs quiet. They changed the way logging was done in the more recent releases and it's not as chatty. -Paul Edmon- On 5/12/22 7:35 AM, Per Lönnborg wrote: Greetings, is there a way to lower the log rate on error messages in slurmctld for nodes with hardware errors? We see for example this for a node that has DIMM errors: [2022-05-12T07:07:34.757] error: Node node37 has low real_memory size (257642 < 257660) [2022-05-12T07:07:35.760] error: Node node37 has low real_memory size (257642 < 257660) [2022-05-12T07:07:36.763] error: Node node37 has low real_memory size (257642 < 257660) [2022-05-12T07:07:37.766] error: Node node37 has low real_memory size (257642 < 257660) [2022-05-12T07:07:38.769] error: Node node37 has low real_memory size (257642 < 257660) [2022-05-12T07:07:39.773] error: Node node37 has low real_memory size (257642 < 257660) [2022-05-12T07:07:40.776] error: Node node37 has low real_memory size (257642 < 257660) [2022-05-12T07:07:41.779] error: Node node37 has low real_memory size (257642 < 257660) [2022-05-12T07:07:42.781] error: Node node37 has low real_memory size (257642 < 257660) [2022-05-12T07:07:45.143] error: Node node37 has low real_memory size (257642 < 257660) The log warning is correct, the node has DIMM errors, but that's one log entry per second. That doesn't seem right with such a high log rate? Thanks, / Per Lonnborg
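For reference, the option referred to above is, as far as I recall, a SlurmdParameters flag in slurm.conf (double-check the exact spelling against your version's slurm.conf man page); it tells the controller to trust the node definitions in the config rather than the hardware each slurmd actually reports:

    # slurm.conf
    SlurmdParameters=config_overrides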
Re: [slurm-users] Slurm 21.08.8-2 upgrade
We upgraded from 21.08.6 to 21.08.8-1 yesterday morning but overnight we saw the communications issues described by Tim W. We upgraded to 21.08.8-2 this morning and that did the trick to resolve all the communications problems we were having. -Paul Edmon- On 5/6/2022 4:38 AM, Ole Holm Nielsen wrote: Hi Juergen, My upgrade report: We upgraded from 21.08.7 to 21.08.8-1 yesterday for the entire cluster, and we didn't have any issues. I built RPMs from the tar-ball and simply did "yum update" on the nodes (one partition at a time) while the cluster was running in full production mode. All slurmd get restarted during the yum update, and this happens within 1-2 minutes per partition. Today I upgraded from 21.08.1-1 to 21.08.8-2 for the entire cluster, and again we have not seen any issues. We also do *not* setting CommunicationParameters=block_null_hash until a later date when there are no more old versions of slurmstepd running. We did however see RPC errors with "Protocol authentication error" while block_null_hash was enabled briefly, see https://bugs.schedmd.com/show_bug.cgi?id=14002, and so we turned it off again. It hasn't happened since. Best regards, Ole On 5/6/22 01:57, Juergen Salk wrote: Hi John, this is really bad news. We have stopped our rolling update from Slurm 21.08.6 to Slurm 21.08.8-1 today for exactly that reason: State of compute nodes already running slurmd 21.08.8-1 suddenly started flapping between responding and not responding but all other nodes that were still running version 21.08.6 slurmd were not affected. For the affected nodes we did not see any obvious reason in slurmd.log even with SlurmdDebug set to debug3 but we noticed the following in slurmctld.log with SlurmctldDebug=debug and DebugFlags=route enabled. [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1423 RPC:REQUEST_PING : Protocol authentication error [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1424 RPC:REQUEST_PING : Protocol authentication error [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1425 RPC:REQUEST_PING : Protocol authentication error [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1426 RPC:REQUEST_PING : Protocol authentication error [2022-05-05T20:37:40.449] agent/is_node_resp: node:n1811 RPC:REQUEST_PING : Protocol authentication error [2022-05-05T20:37:41.397] error: Nodes n[1423-1426,1811] not responding So you seen this as well with 21.08.8-2? We didn't have CommunicationParameters=block_null_hash set, btw. Actually, after Tim's last announcement, I was hoping that we can start over tomorrow morning with 21.08.8-2 to resolve this issue. Therefore, I would also be highly interested what others can say about rolling updates from Slurm 21.08.6 to Slurm 21.08.8-2 which, at least temporarily, entails a mix of patched and unpatched slurmd versions on the compute nodes. If 21.08.8-2 slurmd still does not work together with 21.08.6 slurmd we may have to drain the whole cluster for updating Slurm, which is something that I'd actually wished to avoid. Best regards Jürgen * Legato, John (NIH/NHLBI) [E] [220505 22:30]: Hello, We are in the process of upgrading from Slurm 21.08.6 to Slurm 21.08.8-2. We’ve upgraded the controller and a few partitions worth of nodes. We notice the nodes are losing contact with the controller but slurmd is still up. We thought that this issue was fixed in -2 based on this bug report: https://bugs.schedmd.com/show_bug.cgi?id=14011 However we are still seeing the same behavior. I note that nodes running 21.08.6 are having no issues with communication. 
I could upgrade the remaining 21.08.6 nodes but hesitate to do that as it seems like it would completely kill the functioning nodes. Is anyone else still seeing this in -2?
Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?
We've invoked scontrol in our epilog script for years to close off nodes without any issue. What the docs are really referring to is gratuitous use of those commands. If you have those commands well circumscribed (i.e. only invoked when you have to actually close a node) and only use them when you absolutely have no other workaround then you should be fine. -Paul Edmon- On 5/3/2022 3:46 AM, taleinterve...@sjtu.edu.cn wrote: Hi, all: We need to detect some problems at job end time, so we wrote a detection script into the Slurm epilog, which should drain the node if the check does not pass. I know that exiting the epilog with a non-zero code will make Slurm automatically drain the node. But in that case, the drain reason will always be marked as *“Epilog error”*. Then our auto-repair program will have trouble determining how to repair the node. Another way is to call *scontrol* directly from the epilog to drain the node, but the official doc https://slurm.schedmd.com/prolog_epilog.html says: /Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). … Slurm commands in these scripts can potentially lead to performance issues and should not be used./ So what is the best way to drain a node from the epilog with a self-defined reason, or to tell Slurm to add a more verbose message besides the *“Epilog error”* reason?
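For illustration of the approach described above, a minimal sketch of such an epilog fragment (the health-check command and reason string are made up; SLURMD_NODENAME and SLURM_JOB_ID are provided to epilog scripts by slurmd):

#!/bin/bash
# Epilog fragment: drain this node with a descriptive reason if a site-specific check fails
if ! /usr/local/sbin/site_health_check; then   # hypothetical check script
    scontrol update NodeName="$SLURMD_NODENAME" \
        State=DRAIN Reason="epilog: site_health_check failed after job $SLURM_JOB_ID"
fi
exit 0   # exit 0 so Slurm does not additionally mark the node with "Epilog error"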
Re: [slurm-users] non-historical scheduling
So you want a purely fractional usage of the cluster. That's hard to do via fairshare or without fairshare as the scheduler will usually fill up all the nodes with the top priority job. If you don't have fairshare running or any historical data it will revert to FIFO. So whichever user got in first will go first, no matter how many jobs there are. Fairshare can accomplish what you want above but it takes time for it to settle into a steady state due to the behavior above. If you chart the usage over time with fairshare you will see it even out, but at any given immediate time you will have one user dominating over another one. You could probably achieve a pure fractional usage model by utilizing hard limits for each user in terms of number of cores. The problem is that you will leave parts of the cluster open and idle. If that is fine then I recommend setting hard limits for each user. -Paul Edmon- On 4/12/2022 8:55 AM, Chagai Nota wrote: Hi Loris Thanks for your answer. I tried to configure it and I didn't get the desired results. This is my configuration: PriorityType=priority/multifactor PriorityDecayHalfLife=0 PriorityUsageResetPeriod=DAILY PriorityFavorSmall=yes PriorityWeightFairshare=10 PriorityWeightAge=0 PriorityWeightPartition=0 PriorityWeightJobSize=10 PriorityMaxAge=1-0 PriorityCalcPeriod=1 The desired result is that when 2 users A and B send jobs, they will each get an equal number of jobs. Let's say the whole grid has 12 slots, so users A and B should each get 6, but what happens is that user A gets 12 and after some time user B gets 12 -Original Message- From: slurm-users On Behalf Of Loris Bennett Sent: Tuesday, April 12, 2022 12:06 PM To: Slurm User Community List Subject: Re: [slurm-users] non-historical scheduling Hi Chagai, Chagai Nota writes: Hi I would like to ask if there is any option that the slurm scheduler will consider only running jobs and not historical data. We don't care about how many jobs users were running in the past but only the current usage. Look at https://slurm.schedmd.com/priority_multifactor.html You probably need to set PriorityDecayHalfLife=0 and then, say, PriorityUsageResetPeriod=DAILY Cheers, Loris Thanks Chagai Nota
-- Dr. Loris Bennett (Herr/Mr) ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
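As a rough illustration of the hard-limit approach described above (user and account names are invented, and the cpu counts just mirror the 12-slot example):

sacctmgr modify user where name=usera account=lab set GrpTRES=cpu=6
sacctmgr modify user where name=userb account=lab set GrpTRES=cpu=6

With GrpTRES=cpu=6 on each association, neither user can occupy more than 6 of the 12 slots at once, so the split stays even whenever both have work queued, at the cost of leaving cores idle when only one of them is submitting, which is the trade-off mentioned above.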
Re: [slurm-users] Limit partition to 1 job at a time
I think you could do this by clever use of a partition level QoS but I don't have an obvious way of doing this. -Paul Edmon- On 3/22/2022 11:40 AM, Russell Jones wrote: Hi all, For various reasons, we need to limit a partition to being able to run max 1 job at a time. Not 1 job per user, but 1 job total at a time, while queuing any other jobs to run after this one is complete. I am struggling to figure out how to do this. Any tips? Thanks!
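One possible shape for that idea, as an untested sketch (QoS and partition names are invented):

sacctmgr add qos serialonly set GrpJobs=1

# slurm.conf: attach it as the partition QoS
PartitionName=single Nodes=node[01-04] QOS=serialonly State=UP

GrpJobs=1 on the QoS attached to the partition caps the whole partition at one running job at a time, and everything else queues behind it.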
Re: [slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5
We also noticed the same thing with 21.08.5. In the 21.08 series SchedMD changed the way they handle cgroups to set the stage for cgroups v2 (see: https://slurm.schedmd.com/SLUG21/Roadmap.pdf). 21.08.5 introduced a bug fix which then caused mpirun to not pin properly (particularly for older versions of MPI): https://github.com/SchedMD/slurm/blob/slurm-21-08-5-1/NEWS What we've recommended to users who have hit this is to swap over to using srun instead of mpirun, and the situation clears up. -Paul Edmon- On 2/10/2022 8:59 AM, Ward Poelmans wrote: Hi Paul, On 10/02/2022 14:33, Paul Brunk wrote: Now we see a problem in which the OOM killer is in some cases predictably killing job steps that don't seem to deserve it. In some cases these are job scripts and input files which ran fine before our Slurm upgrade. More details follow, but that's the issue in a nutshell. I'm not sure if this is the case but it might help to keep in mind the difference between mpirun and srun. With srun you let slurm create tasks with the appropriate mem/cpu etc limits and the mpi ranks will run directly in a task. With mpirun you usually let your MPI distribution start one task per node which will spawn the mpi manager which will start the actual mpi program. You might very well end up with different memory limits per process which could be the cause of your OOM issue. Especially if not all MPI ranks use the same amount of memory. Ward
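For anyone wanting to try the srun route, a minimal sketch of the change in a batch script (application name, task counts, and memory are placeholders):

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=2G

# old: mpirun ./my_mpi_app
srun ./my_mpi_app   # srun launches one task per MPI rank inside Slurm's own cgroup limits

Depending on how your MPI library was built you may also need an --mpi= option (e.g. pmi2 or pmix) on the srun line.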
Re: [slurm-users] How to limit # of execution slots for a given node
Also I recommend setting: *CoreSpecCount* Number of cores reserved for system use. These cores will not be available for allocation to user jobs. Depending upon the *TaskPluginParam* option of *SlurmdOffSpec*, Slurm daemons (i.e. slurmd and slurmstepd) may either be confined to these resources (the default) or prevented from using these resources. Isolation of the Slurm daemons from user jobs may improve application performance. If this option and *CpuSpecList* are both designated for a node, an error is generated. For information on the algorithm used by Slurm to select the cores refer to the core specialization documentation ( https://slurm.schedmd.com/core_spec.html ). and *MemSpecLimit* Amount of memory, in megabytes, reserved for system use and not available for user allocations. If the task/cgroup plugin is configured and that plugin constrains memory allocations (i.e. *TaskPlugin=task/cgroup* in slurm.conf, plus *ConstrainRAMSpace=yes* in cgroup.conf), then Slurm compute node daemons (slurmd plus slurmstepd) will be allocated the specified memory limit. Note that having the Memory set in *SelectTypeParameters* as any of the options that has it as a consumable resource is needed for this option to work. The daemons will not be killed if they exhaust the memory allocation (ie. the Out-Of-Memory Killer is disabled for the daemon's memory cgroup). If the task/cgroup plugin is not configured, the specified memory will only be unavailable for user allocations. These will restrict specific memory and cores for system use. This is probably the best way to go rather than spoofing your config. -Paul Edmon- On 1/7/2022 2:36 AM, Rémi Palancher wrote: Le jeudi 6 janvier 2022 à 22:39, David Henkemeyer a écrit : All, When my team used PBS, we had several nodes that had a TON of CPUs, so many, in fact, that we ended up setting np to a smaller value, in order to not starve the system of memory. What is the best way to do this with Slurm? I tried modifying # of CPUs in the slurm.conf file, but I noticed that Slurm enforces that "CPUs" is equal to Boards * SocketsPerBoard * CoresPerSocket * ThreadsPerCore. This left me with having to "fool" Slurm into thinking there were either fewer ThreadsPerCore, fewer CoresPerSocket, or fewer SocketsPerBoard. This is a less than ideal solution, it seems to me. At least, it left me feeling like there has to be a better way. I'm not sure you can lie to Slurm about the real number of CPUs on the nodes. If you want to prevent Slurm from allocating more than n CPUs below the total number of CPUs of these nodes, I guess one solution is to use MaxCPUsPerNode=n at the partition level. You can also mask "system" CPUs with CpuSpecList at node level. The later is better if you need fine grain control over the exact list of reserved CPUs regarding NUMA topology or whatever. -- Rémi Palancher Rackslab: Open Source Solutions for HPC Operations https://rackslab.io
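As a sketch of what those two options might look like in slurm.conf (node names and sizes are made up; as the description above says, memory enforcement also needs TaskPlugin=task/cgroup plus ConstrainRAMSpace=yes in cgroup.conf):

NodeName=bigmem[01-04] CPUs=128 RealMemory=1031000 CoreSpecCount=2 MemSpecLimit=8192 State=UNKNOWN

That reserves 2 cores and 8 GB on each of those nodes for the OS and the Slurm daemons, so user jobs can never allocate the machine completely.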
Re: [slurm-users] How to limit # of execution slots for a given node
You can actually spoof the number of cores and RAM on a node by using the config_override option. I've used that before for testing purposes. Mind you core binding and other features like that will not work if you start spoofing the number of cores and ram, so use with caution. -Paul Edmon- On 1/7/2022 2:36 AM, Rémi Palancher wrote: Le jeudi 6 janvier 2022 à 22:39, David Henkemeyer a écrit : All, When my team used PBS, we had several nodes that had a TON of CPUs, so many, in fact, that we ended up setting np to a smaller value, in order to not starve the system of memory. What is the best way to do this with Slurm? I tried modifying # of CPUs in the slurm.conf file, but I noticed that Slurm enforces that "CPUs" is equal to Boards * SocketsPerBoard * CoresPerSocket * ThreadsPerCore. This left me with having to "fool" Slurm into thinking there were either fewer ThreadsPerCore, fewer CoresPerSocket, or fewer SocketsPerBoard. This is a less than ideal solution, it seems to me. At least, it left me feeling like there has to be a better way. I'm not sure you can lie to Slurm about the real number of CPUs on the nodes. If you want to prevent Slurm from allocating more than n CPUs below the total number of CPUs of these nodes, I guess one solution is to use MaxCPUsPerNode=n at the partition level. You can also mask "system" CPUs with CpuSpecList at node level. The later is better if you need fine grain control over the exact list of reserved CPUs regarding NUMA topology or whatever. -- Rémi Palancher Rackslab: Open Source Solutions for HPC Operations https://rackslab.io
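If you do want to go the spoofing route for testing, a rough sketch (node name and counts are invented; on recent releases the knob is SlurmdParameters=config_overrides in slurm.conf, while older releases used FastSchedule=2 for the same effect):

SlurmdParameters=config_overrides
NodeName=test01 CPUs=16 RealMemory=64000 State=UNKNOWN   # fewer CPUs than the hardware really has

With that set, the controller schedules against the values written in slurm.conf rather than what slurmd detects on the node, which is exactly why core binding stops lining up with the real hardware.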
Re: [slurm-users] export qos
Just out of curiosity, is there a reason you aren't just doing a mysqldump of the extant DB and then reimporting it? I'm not aware of a way to dump just the qos settings for import other than: sacctmgr show qos -Paul Edmon- On 12/17/2021 10:24 AM, Williams, Jenny Avis wrote: Sacctmgr dump gets the user listings, but I do not see how to dump qos settings. Does anyone know of a quick way to export qos settings for import to a new sched box? Jenny
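A couple of possibilities along those lines (the database name is the default slurm_acct_db; check your own AccountingStorageLoc before relying on it):

# whole accounting DB, for re-import on the new box
mysqldump slurm_acct_db > slurm_acct_db.sql

# or just the QoS definitions in an easily parsed form
sacctmgr show qos --parsable2 > qos_settings.txt

The sacctmgr output can then be replayed on the new server as a short series of 'sacctmgr add qos ...' commands, since sacctmgr dump/load covers associations rather than QoS definitions, as noted above.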
Re: [slurm-users] slurmdbd full backup so the primary can be purged
I haven't tested with super ancient versions of Slurm but I know we have uploaded past versions before so we could scrape the data for XDMod. So as far as I'm aware there is no version limitation, but your mileage may vary with very old versions of Slurm. To make sure I would probably ping SchedMD as to any limitations they are aware of. Usually they are pretty good about being comprehensive in their docs so they would have probably mentioned it if there was one. -Paul Edmon- On 12/13/2021 5:07 AM, Loris Bennett wrote: Hi Paul, Am I right in assuming that there are going to be some limitations to loading archived data w.r.t. version of slurmdbd used to create the archive and that used to read it? Cheers, Loris Paul Edmon writes: Files generated by the slurmdbd archive are read back into the live database by sacctmgr. See: archive load Load in to the database previously archived data. The archive file will not be loaded if the records already exist in the database - therefore, trying to load an archive file more than once will result in an error. When this data is again archived and purged from the database, if the old archive file is still in the directory ArchiveDir, a new archive file will be created (see ArchiveDir in the slurmdbd.conf man page), so the old file will not be overwritten and these files will have duplicate records. File= File to load into database. The specified file must exist on the slurmdbd host, which is not necessarily the machine running the command. Insert= SQL to insert directly into the database. This should be used very cautiously since this is writing your sql into the database. So you could set up a full mirror and then read the old archives into that. You just want to make sure that mirror has archiving/purging turned off so it won't rearchive the data you restored. -Paul Edmon- On 12/10/2021 1:28 PM, Ransom, Geoffrey M. wrote: Hello Our slurmdbd database is getting rather large and affecting performance, but we want to keep usage data around for a few years for metric purposes in order to figure out how our users work. I read a suggestion to have a backup DB which has all the usage data synced to it for metric purposes and a main slurmdbd setup for the cluster to use that cleans out old data based on your user working needs. Is there any documentation suggesting how to set up a second slurmdbd server that will receive a copy of all the main slurmdbd entries without purging so we can start purging on the in use slurmdbd service to keep short term performance snappy? Presumably the upgrade process will be complicated by this as well since we have to keep the archive slurmdbd setup in sync with the cluster slurmdbd. Thanks. *EDIT before hitting send* I was re-reading the slurmdbd.conf man page and just saw the Archive* options and this sounds like it would work to implement something like this. Are archive files readable by sacct and sreport, or easily manually parseable? I am going to turn these on in my test cluster, but hearing about other peoples experiences with this would probably be helpful.
Re: [slurm-users] slurmdbd full backup so the primary can be purged
Files generated by the slurmdbd archive are read back into the live database by sacctmgr. See: archive load Load in to the database previously archived data. The archive file will not be loaded if the records already exist in the database - therefore, trying to load an archive file more than once will result in an error. When this data is again archived and purged from the database, if the old archive file is still in the directory ArchiveDir, a new archive file will be created (see ArchiveDir in the slurmdbd.conf man page), so the old file will not be overwritten and these files will have duplicate records. /File=/ File to load into database. The specified file must exist on the slurmdbd host, which is not necessarily the machine running the command. /Insert=/ SQL to insert directly into the database. This should be used very cautiously since this is writing your sql into the database. So you could set up a full mirror and then read the old archives into that. You just want to make sure that mirror has archiving/purging turned off so it won't rearchive the data you restored. -Paul Edmon- On 12/10/2021 1:28 PM, Ransom, Geoffrey M. wrote: Hello Our slurmdbd database is getting rather large and affecting performance, but we want to keep usage data around for a few years for metric purposes in order to figure out how our users work. I read a suggestion to have a backup DB which has all the usage data synced to it for metric purposes and a main slurmdbd setup for the cluster to use that cleans out old data based on your user working needs. Is there any documentation suggesting how to set up a second slurmdbd server that will receive a copy of all the main slurmdbd entries without purging so we can start purging on the in use slurmdbd service to keep short term performance snappy? Presumably the upgrade process will be complicated by this as well since we have to keep the archive slurmdbd setup in sync with the cluster slurmdbd. Thanks. **EDIT* before hitting send* I was re-reading the slurmdbd.conf man page and just saw the Archive* options and this sounds like it would work to implement something like this. Are archive files readable by sacct and sreport, or easily manually parseable? I am going to turn these on in my test cluster, but hearing about other peoples experiences with this would probably be helpful.
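For reference, a minimal sketch of pulling one of those archive files back in (the path is a made-up example; the file has to exist on the slurmdbd host):

sacctmgr archive load File=/var/spool/slurmdbd/archive/cluster_job_archive_2020-01-01_2020-06-30

Once loaded, the old records are queryable again with sacct and sreport against whichever database that slurmdbd instance is serving.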
Re: [slurm-users] Database Compression
Just to put a resolution on this. I did some testing and compression does work, but to get extant tables to compress you have to reimport your database. So the procedure would be to: 1. Turn on compression in my.cnf following the doc. 2. mysqldump the database you want to compress 3. recreate that database (drop and remake it) 4. reimport the database This can take a bit if your database is large. However when I tested this with our production database it went from 130G on disk to 29G, a factor of 4.5 improvement (this is using the default settings and zlib). I haven't had time to actually do it for real on our live system and see if there is a performance hit in terms of scheduling, but we keep a sizable buffer in memory so I'm not anticipating anything. My verdict then is that if you are going to do it, do it before your database grows too big as doing the dump and reimport will take a while (for me it was about 4 hours start to finish on my test system). -Paul Edmon- On 12/2/2021 1:06 PM, Baer, Troy wrote: My site has just updated to Slurm 21.08 and we are looking at moving to the built-in job script capture capability, so I'm curious about this as well. --Troy -Original Message- From: slurm-users On Behalf Of Paul Edmon Sent: Thursday, December 2, 2021 10:30 AM To: slurm-users@lists.schedmd.com Subject: [slurm-users] Database Compression With the advent of the ability to store jobscripts in the slurmdb, our db is growing at a fairly impressive rate (which is expected). That said I've noticed that our database backups are highly compressible (factor of 24). Not being a mysql expert I hunted around to see if it could do native compression and it can: https://mariadb.com/kb/en/innodb-page-compression/ My question is if anyone has had any experience with using page compression for mariadb and if there are any hitches I should be aware of? -Paul Edmon-
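A rough sketch of those four steps as commands (paths and database name are assumptions for a default setup, and the my.cnf variables are taken from the MariaDB page-compression docs; stop slurmdbd for the duration):

# 1. in my.cnf under [mysqld], then restart mariadb:
#    innodb_compression_algorithm = zlib
#    innodb_compression_default   = ON

# 2. dump
mysqldump --single-transaction slurm_acct_db > /backup/slurm_acct_db.sql

# 3. drop and recreate
mysql -e 'DROP DATABASE slurm_acct_db; CREATE DATABASE slurm_acct_db;'

# 4. reimport; tables created during the import pick up page compression
mysql slurm_acct_db < /backup/slurm_acct_db.sql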
Re: [slurm-users] A Slurm topological scheduling question
This should be fine assuming you don't mind the mismatch in CPU speeds. Unless the codes are super sensitive to topology things should be okay as modern IB is wicked fast. In our environment here we have a variety of different hardware types all networked together on the same IB fabric. That said we create partitions for different hardware types and we don't have a queue that schedules across both, though we do have a backfill serial queue that underlies everything. All of that though is scheduled via a single scheduler with a single topology.conf governing it all. We also have all our internode IP comms going over our IB fabric and it works fine. -Paul Edmon- On 12/7/2021 11:05 AM, David Baker wrote: Hello, These days we have now enabled topology aware scheduling on our Slurm cluster. One part of the cluster consists of two racks of AMD compute nodes. These racks are, now, treated as separate entities by Slurm. Soon, we may add another set of AMD nodes with slightly different CPU specs to the existing nodes. We'll aim to balance the new nodes across the racks re cooling/heating requirements. The new nodes will be controlled by a new partition. Does anyone know if it is possible to regard the two racks as a single entity (by connecting the InfiniBand switches together), and so schedule jobs across the resources in the racks with no loss of efficiency. I would be grateful for your comments and ideas, please. The alternative is to put all the new nodes in a completely new rack, but that does mean that we'll have to purchase some new Ethernet and IB switches. We are not happy, by the way, to have node/switch connections across racks. Best regards, David
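For what it's worth, a sketch of how two racks hanging off a common core switch could be expressed in topology.conf (switch and node names are invented; assumes TopologyPlugin=topology/tree in slurm.conf):

SwitchName=rack1-leaf Nodes=amd[001-064]
SwitchName=rack2-leaf Nodes=amd[065-128]
SwitchName=ib-core Switches=rack1-leaf,rack2-leaf

With both leaves joined under one core switch, the scheduler still prefers to keep a job inside a single rack but can span both when it needs to.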
Re: [slurm-users] [EXT] Re: slurmdbd does not work
I would check that you have MariaDB-shared installed too on the host you build on prior to your build. They changed the way the packaging is done in MariaDB and Slurm needs to detect the files in MariaDB-shared to actually trigger the configure to build the mysql libs. -Paul Edmon- On 12/3/2021 7:40 PM, Giuseppe G. A. Celano wrote: 10.4.22 On Sat, Dec 4, 2021 at 1:35 AM Brian Andrus wrote: Which version of Mariadb are you using? Brian Andrus On 12/3/2021 4:20 PM, Giuseppe G. A. Celano wrote: After installation of libmariadb-dev, I have reinstalled the entire slurm with ./configure + options, make, and make install. Still, accounting_storage_mysql.so is missing. On Sat, Dec 4, 2021 at 12:24 AM Sean Crosby wrote: Did you run ./configure (with any other options you normally use) make make install on your DBD server after you installed the mariadb-devel package? *From:* slurm-users on behalf of Giuseppe G. A. Celano *Sent:* Saturday, 4 December 2021 10:07 *To:* Slurm User Community List *Subject:* [EXT] Re: [slurm-users] slurmdbd does not work The problem is the lack of /usr/lib/slurm/accounting_storage_mysql.so I have installed many mariadb-related packages, but that file is not created by slurm after installation: is there a point in the documentation where the installation procedure for the database is made explicit? On Fri, Dec 3, 2021 at 5:15 PM Brian Andrus wrote: You will need to also reinstall/restart slurmdbd with the updated binary. Look in the slurmdbd logs to see what is happening there. I suspect it had errors updating/creating the database and tables. If you have no data in it yet, you can just DROP the database and restart slurmdbd. Brian Andrus On 12/3/2021 6:42 AM, Giuseppe G. A. Celano wrote: Thanks for the answer, Brian.
I now added --with-mysql_config=/etc/mysql/my.cnf, but the problem is still there and now also slurmctld does not work, with the error: [2021-12-03T15:36:41.018] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd [2021-12-03T15:36:41.019] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.019] error: _slurm_persist_recv_msg: only read 150 of 2613 bytes [2021-12-03T15:36:41.019] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.020] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.020] error: _slurm_persist_recv_msg: only read 150 of 2613 bytes [2021-12-03T15:36:41.020] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.020] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.020] error: _slurm_persist_recv_msg: only read 150 of 2613 bytes [2021-12-03T15:36:41.020] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.020] error: DBD_GET_TRES failure: No error [2021-12-03T15:36:41.021] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.021] error: _slurm_persist_recv_msg: only read 0 of 2613 bytes [2021-12-03T15:36:41.021] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.021] error: DBD_GET_QOS failure: No error [2021-12-03T15:36:41.021] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.021] error: _slurm_persist_recv_msg: only read 150 of 2613 bytes [2021-12-03T15:36:41.021] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.021] error: DBD_GET_USERS failure: No error [2021-12-03T15:36:41.022] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.022] error
Re: [slurm-users] Preferential scheduling on a subset of nodes
If you set up a higher priority partition with Preemption OFF on the lower priority partition you should be able to accomplish this. If you have preemption turned off for the specific partitions in question Slurm will not preempt but will schedule jobs from the higher priority partition first regardless of current fairshare scores. See: *PreemptMode* Mechanism used to preempt jobs or enable gang scheduling for this partition when *PreemptType=preempt/partition_prio* is configured. This partition-specific *PreemptMode* configuration parameter will override the cluster-wide *PreemptMode* for this partition. It can be set to OFF to disable preemption and gang scheduling for this partition. See also *PriorityTier* and the above description of the cluster-wide *PreemptMode* parameter for further details. This is at least how we manage that. -Paul Edmon- On 12/1/2021 11:32 AM, Sean McGrath wrote: Hi, Apologies for having to ask such a basic question. We want to be able to give some users preferential access to some nodes. They bought the nodes which are currently in a 'long' partition as their jobs need a longer walltime. When the purchasing users group is not using the nodes I would like other users to be able to run jobs on those nodes but when the owners group submit jobs I want those jobs to be queued as soon as currently running jobs on those nodes are finished. My understanding is that preemption won't work in these circumstances as it will either cancel or suspend currently running jobs, I want the currently running jobs to finish before the preferential ones start. I'm wondering if QOS could do what we need here. Can the following be sanity checked please. Put the specific nodes in both the long and the compute (standard) partition. Then restrict access to the long partition to specified users so that all users can access them in the compute queue but only a subset of users can use the longer wall time queue. $ scontrol update PartitionName=long Users=user1,user2 We currently don't have QOS enabled so change that in slurm.conf and restart the slurmctld. -PriorityWeightQOS=0 +PriorityWeightQOS=1 Then create a qos and modify its priority $ sacctmgr add qos boost $ sacctmgr modify qos boost set priority=10 $ sacctmgr modify user user1 set qos=boost Will that do what I expect please? Many thanks and again apologies for the basic question. Sean
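A sketch of the partition arrangement being described (node, partition, and account names are invented; assumes PreemptType=preempt/partition_prio is configured cluster-wide):

PartitionName=owners  Nodes=node[01-08] PriorityTier=10 PreemptMode=OFF AllowAccounts=ownerlab State=UP
PartitionName=general Nodes=node[01-08] PriorityTier=1  PreemptMode=OFF State=UP

Because both partitions carry PreemptMode=OFF, a pending job in owners never kills or suspends a running general job; it simply goes to the front of the line for those nodes as soon as they free up.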
Re: [slurm-users] Suspending jobs for file system maintenance
I think it depends on the filesystem type. Lustre generally fails over nicely and handles reconnections with out much of a problem. We've done this before with out any hitches, even with the jobs being live. Generally the jobs just hang and then resolve once the filesystem comes back. On a live system you will end up with a completion storm as jobs are always exiting and thus while the filesystem is gone the jobs dependent on it will just hang and if they are completing they will just stall on the completion step. Once it returns then all that traffic flushes. This can create issues where a bunch of nodes get closed due to Kill task fail or other completion flags. Generally these are harmless though I have seen stuck processes on nodes and have had to reboot them to clear, so you should check any node before putting it back in action. That said if you are pausing all the jobs and scheduling this is some what mitigated, though jobs will still exit due to timeout. -Paul Edmon- On 10/25/2021 4:47 AM, Alan Orth wrote: Dear Jurgen and Paul, This is an interesting strategy, thanks for sharing. So if I read the scontrol man page correctly, `scontrol suspend` sends a SIGSTOP to all job processes. The processes remain in memory, but are paused. What happens to open file handles, since the underlying filesystem goes away and comes back? Thank you, On Sat, Oct 23, 2021 at 1:10 AM Juergen Salk wrote: Thanks, Paul, for confirming our planned approach. We did it that way and it worked very well. I have to admit that my fingers were a bit wet when suspending thousands of running jobs, but it worked without any problems. I just didn't dare to resume all suspended jobs at once, but did that in a staggered manner. Best regards Jürgen * Paul Edmon [211019 15:15]: > Yup, we follow the same process for when we do Slurm upgrades, this looks > analogous to our process. > > -Paul Edmon- > > On 10/19/2021 3:06 PM, Juergen Salk wrote: > > Dear all, > > > > we are planning to perform some maintenance work on our Lustre file system > > which may or may not harm running jobs. Although failover functionality is > > enabled on the Lustre servers we'd like to minimize risk for running jobs > > in case something goes wrong. > > > > Therefore, we thought about suspending all running jobs and resume > > them as soon as file systems are back again. > > > > The idea would be to stop Slurm from scheduling new jobs as a first step: > > > > # for p in foo bar baz; do scontrol update PartitionName=$p State=DOWN; done > > > > with foo, bar and baz being the configured partitions. > > > > Then suspend all running jobs (taking job arrays into account): > > > > # squeue -ho %A -t R | xargs -n 1 scontrol suspend > > > > Then perform the failover of OSTs to another OSS server. > > Once done, verify that file system is fully back and all > > OSTs are in place again on the client nodes. > > > > Then resume all suspended jobs: > > > > # squeue -ho %A -t S | xargs -n 1 scontrol resume > > > > Finally bring back the partitions: > > > > # for p in foo bar baz; do scontrol update PartitionName=$p State=UP; done > > > > Does that make sense? Is that common practice? Are there any caveats that > > we must think about? > > > > Thank you in advance for your thoughts. > > > > Best regards > > Jürgen > > -- Alan Orth alan.o...@gmail.com https://picturingjordan.com https://englishbulgaria.net https://mjanja.ch
Re: [slurm-users] Suspending jobs for file system maintenance
Yup, we follow the same process for when we do Slurm upgrades, this looks analogous to our process. -Paul Edmon- On 10/19/2021 3:06 PM, Juergen Salk wrote: Dear all, we are planning to perform some maintenance work on our Lustre file system which may or may not harm running jobs. Although failover functionality is enabled on the Lustre servers we'd like to minimize risk for running jobs in case something goes wrong. Therefore, we thought about suspending all running jobs and resume them as soon as file systems are back again. The idea would be to stop Slurm from scheduling new jobs as a first step: # for p in foo bar baz; do scontrol update PartitionName=$p State=DOWN; done with foo, bar and baz being the configured partitions. Then suspend all running jobs (taking job arrays into account): # squeue -ho %A -t R | xargs -n 1 scontrol suspend Then perform the failover of OSTs to another OSS server. Once done, verify that file system is fully back and all OSTs are in place again on the client nodes. Then resume all suspended jobs: # squeue -ho %A -t S | xargs -n 1 scontrol resume Finally bring back the partitions: # for p in foo bar baz; do scontrol update PartitionName=$p State=UP; done Does that make sense? Is that common practice? Are there any caveats that we must think about? Thank you in advance for your thoughts. Best regards Jürgen
Re: [slurm-users] slurm.conf syntax checker?
Sadly no. There is a feature request for one though: https://bugs.schedmd.com/show_bug.cgi?id=3435 What we've done in the meantime is put together a gitlab runner which basically starts up a mini instance of the scheduler and runs slurmctld on the slurm.conf we want to put in place. We then have it reject any changes that cause failure. It's not perfect but it works. A real syntax checker would be better. -Paul Edmon- On 10/12/2021 4:08 PM, bbenede...@goodyear.com wrote: Is there any sort of syntax checker that we could run our slurm.conf file through before committing it? (And sometimes crashing slurmctld in the process...) Thanks!
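Not a real syntax checker, but the flavor of that runner can be sketched roughly like this (paths, timeout, and the grep pattern are arbitrary; it assumes a throwaway host where it is safe to start a test slurmctld in the foreground):

#!/bin/bash
# crude slurm.conf smoke test: run a disposable slurmctld on the candidate config for a bit
CANDIDATE=/tmp/ci/slurm.conf
timeout 20 slurmctld -D -f "$CANDIDATE" > /tmp/ci/ctld.out 2>&1
if grep -qiE 'fatal|error' /tmp/ci/ctld.out; then
    echo "candidate slurm.conf rejected"
    exit 1
fi

If slurmctld survives the window without complaining, the change is allowed through; it is not perfect, but it catches the misspelled-keyword and unresolvable-hostname class of mistakes.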
[slurm-users] Using Nice to Break Ties
We use the classic fairshare algorithm here with users having their shares set to parent and pulling from the group pool rather than having each user have their own fairshare (you can see our doc here: https://docs.rc.fas.harvard.edu/kb/fairshare/). This has worked very well for us for many years. However, there is a use case where this doesn't work, namely breaking ties internal to a group. We have a lot of private partitions owned by a specific group and when you have a bunch of users in that group the queue turns into FIFO instead of letting lower usage users go first due to the parent flag on the fairshare. Now this is obviously solved by giving every user their own fairshare but this has the downside of impacting the user's priority back on the shared partitions with other groups where they will not be able to use their group's full fairshare but instead are stuck with their own. Thus their total group fairshare may be something like 0.4 but their personal is stuck at 0 because they are one of the heaviest users in the lab. Now I get the feeling that Fair Tree might solve this but I can't move to it as it's taken years for our users to even understand and accept the classic fairshare model. As such I'm trying to come up with solutions that work within the model. One option I have been considering is using the job_submit.lua script to set a Nice value for all the jobs based on that user's usage. Basically the nice value would break the internal ties of the group and allow non-FIFO scheduling internal to accounts without impacting their overall fairshare relative to other groups. Before I start messing around with this though I wanted to ping the wisdom of the group and see how others handle tie breaking internal to an account/group/lab. What solutions have people used for this? -Paul Edmon-
Re: [slurm-users] User CPU limit across partitions?
I think you can accomplish this by setting a Partition QoS and defining it to hook into the same QoS for all three. I believe that would force it to share the same pool. That said I don't know if that would work properly, it's worth a test. That is my first guess though. -Paul Edmon- On 8/3/2021 2:35 PM, bbenede...@goodyear.com wrote: Good day. Is it possible to have user limits ACROSS partitions? Say I have three partitions, large, medium, and small. I would like my users to have a 1000 cpu limit across all three partitions. So that they could use up to 1000 cpus in any combination of large, medium, and small. But I don't want to limit them to 333 in each partition, but rather have the total across all of the partitions be no more than 1000 cpus. Is that possible? Thanks!
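A sketch of that idea (the QoS name is invented, and as said above it is worth testing before trusting):

sacctmgr add qos sharedcap set MaxTRESPerUser=cpu=1000

# slurm.conf: point every partition at the same QoS
PartitionName=large  Nodes=... QOS=sharedcap State=UP
PartitionName=medium Nodes=... QOS=sharedcap State=UP
PartitionName=small  Nodes=... QOS=sharedcap State=UP

Since all three partitions reference one QoS, the MaxTRESPerUser counter should be shared across them, giving each user 1000 cores in total however they spread the jobs around.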
Re: [slurm-users] declare availability of up to 8 cores//job
From that page: *GrpTRES=* The total count of TRES able to be used at any given time from jobs running from an association and its children or QOS. If this limit is reached new jobs will be queued but only allowed to run after resources have been relinquished from this group. So basically its the sum total of all the TRES a Group could run in a partition at one time. -Paul Edmon- On 8/2/2021 12:05 PM, Adrian Sevcenco wrote: On 8/2/21 6:26 PM, Paul Edmon wrote: Probably more like MaxTRESPERJob=cpu=8 i see, thanks!! i'm still searching for the definition of GrpTRES :) Thanks a lot! Adrian You would need to specify how much TRES you need for each job in the normal tres format. -Paul Edmon- On 8/2/2021 11:24 AM, Adrian Sevcenco wrote: On 8/2/21 5:44 PM, Paul Edmon wrote: You can set up a Partition based QoS that can set this limit: https://slurm.schedmd.com/resource_limits.html See the MaxTRESPerJob limit. oh, thanks a lot!! would something like this work/be in line with your indication? : add qos 8cpu GrpTRES=cpu=1 MaxTRESPerJob=8 modify account blah DefaultQOS=8cpu Thanks a lot! Adrian -Paul Edmon- On 8/2/2021 10:40 AM, Adrian Sevcenco wrote: Hi! Is there a way to declare that jobs can request up to 8 cores? Or is it allowed by default (as i see no limit regarding this .. ) .. i just have MaxNodes=1 this is CR_CPU alocator Thank you! Adrian
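Spelled out, the corrected recipe would look something like this (names reused from the example above; an untested sketch):

sacctmgr add qos 8cpu set MaxTRESPerJob=cpu=8
sacctmgr modify account blah set QOS=8cpu DefaultQOS=8cpu

With MaxTRESPerJob=cpu=8 on the QoS, a single job asking for more than 8 CPUs stays pending on the limit (or is rejected outright if the QoS carries the DenyOnLimit flag), while GrpTRES would instead cap the running total across all of the account's jobs.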
Re: [slurm-users] declare availability of up to 8 cores//job
Probably more like MaxTRESPERJob=cpu=8 You would need to specify how much TRES you need for each job in the normal tres format. -Paul Edmon- On 8/2/2021 11:24 AM, Adrian Sevcenco wrote: On 8/2/21 5:44 PM, Paul Edmon wrote: You can set up a Partition based QoS that can set this limit: https://slurm.schedmd.com/resource_limits.html See the MaxTRESPerJob limit. oh, thanks a lot!! would something like this work/be in line with your indication? : add qos 8cpu GrpTRES=cpu=1 MaxTRESPerJob=8 modify account blah DefaultQOS=8cpu Thanks a lot! Adrian -Paul Edmon- On 8/2/2021 10:40 AM, Adrian Sevcenco wrote: Hi! Is there a way to declare that jobs can request up to 8 cores? Or is it allowed by default (as i see no limit regarding this .. ) .. i just have MaxNodes=1 this is CR_CPU alocator Thank you! Adrian
Re: [slurm-users] declare availability of up to 8 cores//job
You can set up a Partition based QoS that can set this limit: https://slurm.schedmd.com/resource_limits.html See the MaxTRESPerJob limit. -Paul Edmon- On 8/2/2021 10:40 AM, Adrian Sevcenco wrote: Hi! Is there a way to declare that jobs can request up to 8 cores? Or is it allowed by default (as i see no limit regarding this .. ) .. i just have MaxNodes=1 this is CR_CPU alocator Thank you! Adrian
Re: [slurm-users] Can I get the original sbatch command, after the fact?
Not in the current version of Slurm. In the next major version long term storage of job scripts will be available. -Paul Edmon- On 7/16/2021 2:16 PM, David Henkemeyer wrote: If I execute a bunch of sbatch commands, can I use sacct (or something else) to show me the original sbatch command line for a given job ID? Thanks David
Re: [slurm-users] MinJobAge
The documentation indicates that's what should happen with MinJobAge: *MinJobAge* The minimum age of a completed job before its record is purged from Slurm's active database. Set the values of *MaxJobCount* and *MinJobAge* to ensure the slurmctld daemon does not exhaust its memory or other resources. The default value is 300 seconds. A value of zero prevents any job record purging. Jobs are not purged during a backfill cycle, so it can take longer than MinJobAge seconds to purge a job if using the backfill scheduling plugin. In order to eliminate some possible race conditions, the minimum non-zero value for *MinJobAge* recommended is 2. From my experience this does work. We've been running with MinJobAge=600 for years without any problems to my knowledge. -Paul Edmon- On 7/6/2021 8:59 AM, Emre Brookes wrote: Brian Andrus Nov 23, 2020, 1:55:54 PM to slurm...@lists.schedmd.com All, I always thought that MinJobAge affected how long a job will show up when doing 'squeue' That does not seem to be the case for me. I have MinJobAge=900, but if I do 'squeue --me' as soon as I finish an interactive job, there is nothing in the queue. I swear I used to see jobs in a completed state for a period of time, but they are not showing up at all on our cluster. How does one have jobs show up that are completed? I'm using slurm 20.02.7 & have the same issue (except I am running batch jobs). Does MinJobAge work to keep completed jobs around for the specified duration in squeue output? Thanks, Emre
Re: [slurm-users] Long term archiving
We keep 6 months in our active database and then we archive and purge anything older than that. The archive data itself is available for reimport and historical investigation. We've done this when importing historical data into XDMod. -Paul Edmon- On 6/28/2021 10:43 AM, Yair Yarom wrote: Hi list, I was wondering if you could share your long term archiving practices. We currently purge and archive the jobs after 31 days, and keep the usage data without purging. This gives us a reasonable history, and a downtime of "only" a few hours on database upgrade. We currently don't load the archives into a secondary db. We now have a use-case which might require us to save job information for more than that, and we're considering how to do that. Thanks in advance, -- /|| \/|Yair Yarom | System Group (DevOps) []|The Rachel and Selim Benin School [] /\ |of Computer Science and Engineering []//\\/ |The Hebrew University of Jerusalem [// \\ |T +972-2-5494522 | F +972-2-5494522 // \ |ir...@cs.huji.ac.il <mailto:ir...@cs.huji.ac.il> // |
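The slurmdbd.conf side of that policy looks roughly like this (the archive directory is a placeholder, and the retention values are just the six-month example above):

ArchiveDir=/var/spool/slurmdbd/archive
ArchiveJobs=yes
ArchiveSteps=yes
ArchiveEvents=yes
PurgeJobAfter=6months
PurgeStepAfter=6months
PurgeEventAfter=6months

Anything older than six months gets written out to files under ArchiveDir and removed from the live tables; those files are what can later be pulled back in with 'sacctmgr archive load' for historical work.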
Re: [slurm-users] Upgrading slurm - can I do it while jobs running?
We generally pause scheduling during upgrades out of paranoia more than anything. What that means is that we set all our partitions to DOWN and suspend all the jobs. Then we do the upgrade. That said I know of people who do it live with out much trouble. The risk is more substantial for major version upgrades than minors. So if you are doing a minor version upgrade its likely fine to do live. For major version I would recommend at least pausing all the jobs. -Paul Edmon- On 5/26/2021 2:48 PM, Ole Holm Nielsen wrote: On 26-05-2021 20:23, Will Dennis wrote: About to embark on my first Slurm upgrade (building from source now, into a versioned path /opt/slurm// which is then symlinked to /opt/slurm/current/ for the “in-use” one…) This is a new cluster, running 20.11.5 (which we now know has a CVE that was fixed in 20.11.7) but I have researchers running jobs on it currently. As I’m still building out the cluster, I found today that all Slurm source tarballs before 20.11.7 were withdrawn by SchedMD. So, need to upgrade at least the -ctld and -dbd nodes before I can roll any new nodes out on 20.11.7… As I have at least one researcher that is running some long multi-day jobs, can I down the -dbd and -ctld nodes and upgrade them, then put them back online running the new (latest) release, without munging the jobs on the running worker nodes? I recommend strongly to read the SchedMD presentations in the https://slurm.schedmd.com/publications.html page, especially the "Field notes" documents. The latest one is "Field Notes 4: From The Frontlines of Slurm Support", Jason Booth, SchedMD. We upgrade Slurm continuously while the nodes are in production mode. There's a required order of upgrading: first slurmdbd, then slurmctld, then slurmd nodes, and finally login nodes, see https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm The detailed upgrading commands for CentOS are in https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-on-centos-7 We don't have any problems with running jobs across upgrades, but perhaps others can share their experiences? /Ole
Re: [slurm-users] Determining Cluster Usage Rate
XDMod can give these sorts of stats. I also have some diamond collectors we use in concert with grafana to pull data and plot it which is useful for seeing large scale usage trends: https://github.com/fasrc/slurm-diamond-collector -Paul Edmon- On 5/13/2021 6:08 PM, Sid Young wrote: Hi All, Is there a way to define an effective "usage rate" of a HPC Cluster using the data captured in the slurm database. Primarily I want to see if it can be helpful in presenting to the business a case for buying more hardware for the HPC :) Sid Young
Re: [slurm-users] Cluster usage, filtered by partition
Yup, we use XDMod for this sort of data as well. -Paul Edmon- On 5/11/2021 8:52 AM, Renfro, Michael wrote: XDMoD [1] is useful for this, but it’s not a simple script. It does have some user-accessible APIs if you want some report automation. I’m using that to create a lightning-talk-style slide at [2]. [1] https://open.xdmod.org/ <https://open.xdmod.org/> [2] https://github.com/mikerenfro/one-page-presentation-hpc <https://github.com/mikerenfro/one-page-presentation-hpc> On May 11, 2021, at 5:18 AM, Diego Zuccato wrote: On 11/05/21 11:21, Ole Holm Nielsen wrote: Tks for the very fast answer. I have written some accounting tools which are in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct Maybe you can use the "topreports" tool? Testing it just now. I'll probably have to do some changes (re field width: our usernames are quite long, being from AD), but first I have to check if it extracts the info our users want to see :) -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786
Re: [slurm-users] Testing Lua job submit plugins
We go the route of having a test cluster and vetting our lua scripts there before putting them in the production environment. -Paul Edmon- On 5/6/2021 1:23 PM, Renfro, Michael wrote: I’ve used the structure at https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5 <https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5> to handle basic test/production branching. I can isolate the new behavior down to just a specific set of UIDs that way. Factoring out code into separate functions helps, too. I’ve seen others go so far as to put the functions into separate files, but I haven’t needed that yet. On May 6, 2021, at 12:11 PM, Michael Robbert wrote: *External Email Warning* *This email originated from outside the university. Please use caution when opening attachments, clicking links, or responding to requests.* I’m wondering if others in the Slurm community have any tips or best practices for the development and testing of Lua job submit plugins. Is there anything that can be done prior to deployment on a production cluster that will help to ensure the code is going to do what you think it does or at the very least not prevent any jobs from being submitted? I realize that any configuration change in slurm.conf could break everything, but I feel like adding Lua code adds enough complexity that I’m a little more hesitant to just throw it in. Any way to run some kind of linting or sanity tests on the Lua script? Additionally, does the script get read in one time at startup or reconfig or can it be changed on the fly just by editing the file? Maybe a separate issue, but does anybody have an recipes to build a local test cluster in Docker that could be used to test this? I was working on one, but broke my local Docker install and thought I’d send this note out while I was working on rebuilding it. Thanks in advance, Mike Robbert
[slurm-users] Replacement for diamond
Python diamond has historically been really useful for shipping data to graphite. We have a bunch of diamond collectors we wrote for slurm as a result: https://github.com/fasrc/slurm-diamond-collector However with python 2 being end of life and diamond being unavailable for python 3 we need a new option. So what do people use for shipping various slurm stats to graphite? -Paul Edmon-
Re: [slurm-users] Draining hosts because of failing jobs
Since you can run an arbitrary script as a node health checker I might add a script that counts failures and then closes if it hits a threshold. The script shouldn't need to talk to the slurmctld or slurmdbd as it should be able to watch the log on the node and see the fail. -Paul Edmon- On 5/4/2021 12:09 PM, Gerhard Strangar wrote: Hello, how do you implement something like "drain host after 10 consecutive failed jobs"? Unlike a host check script, that checks for known errors, I'd like to stop killing jobs just because one node is faulty. Gerhard
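In that spirit, a very rough sketch of such a health-check script (the log path, grep pattern, and threshold are all site-specific placeholders, and this counts recent failures rather than strictly consecutive ones):

#!/bin/bash
LOG=/var/log/slurm/slurmd.log
THRESHOLD=10
# count failed-job lines among the last 200 log entries; adjust the pattern to your site's log format
fails=$(tail -n 200 "$LOG" | grep -c -i 'job.*failed')
if [ "$fails" -ge "$THRESHOLD" ]; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="health check: $fails recent job failures"
fi
exit 0

Pointed to by HealthCheckProgram in slurm.conf, it runs on the node itself, so its only contact with the controller is the single scontrol call when it decides to drain.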
Re: [slurm-users] Fairshare config change affect on running/queued jobs?
It shouldn't impact running jobs, all it should really do is impact pending jobs as it will order them by their relative priority scores. -Paul Edmon- On 4/30/2021 12:39 PM, Walsh, Kevin wrote: Hello everyone, We wish to deploy "fair share" scheduling configuration and would like to inquire if we should be aware of effects this might have on jobs already running or already queued when the config is changed. The proposed changes are from the example at https://slurm.schedmd.com/archive/slurm-18.08.9/priority_multifactor.html#config <https://slurm.schedmd.com/archive/slurm-18.08.9/priority_multifactor.html#config> : # Activate the Multi-factor Job Priority Plugin with decay PriorityType=priority/multifactor # 2 week half-life PriorityDecayHalfLife=14-0 # The larger the job, the greater its job size priority. PriorityFavorSmall=NO # The job's age factor reaches 1.0 after waiting in the # queue for 2 weeks. PriorityMaxAge=14-0 # This next group determines the weighting of each of the # components of the Multi-factor Job Priority Plugin. # The default value for each of the following is 1. PriorityWeightAge=1000 PriorityWeightFairshare=1 PriorityWeightJobSize=1000 PriorityWeightPartition=1000 PriorityWeightQOS=0 # don't use the qos factor We're running SLURM 18.08.8 on CentOS Linux 7.8.2003. The current slurm.conf is defaults as far as fair share is concerned: EnforcePartLimits=ALL GresTypes=gpu MpiDefault=pmix ProctrackType=proctrack/cgroup PrologFlags=x11,contain PropagateResourceLimitsExcept=MEMLOCK,STACK RebootProgram=/sbin/reboot ReturnToService=1 SlurmctldPidFile=/var/run/slurmctld.pid SlurmctldPort=6817 SlurmdPidFile=/var/run/slurmd.pid SlurmdPort=6818 SlurmdSpoolDir=/var/spool/slurmd SlurmUser=slurm SlurmdSyslogDebug=verbose StateSaveLocation=/var/spool/slurm/ctld SwitchType=switch/none TaskPlugin=task/cgroup,task/affinity TaskPluginParam=Sched HealthCheckInterval=300 HealthCheckProgram=/usr/sbin/nhc InactiveLimit=0 KillWait=30 MinJobAge=300 SlurmctldTimeout=120 SlurmdTimeout=300 Waittime=0 DefMemPerCPU=1024 FastSchedule=1 SchedulerType=sched/backfill SelectType=select/cons_res SelectTypeParameters=CR_Core_Memory AccountingStorageHost=sched-db.lan AccountingStorageLoc=slurm_acct_db AccountingStoragePass=/var/run/munge/munge.socket.2 AccountingStoragePort=6819 AccountingStorageType=accounting_storage/slurmdbd AccountingStorageUser=slurm AccountingStoreJobComment=YES AccountingStorageTRES=gres/gpu JobAcctGatherFrequency=30 JobAcctGatherType=jobacct_gather/linux SlurmctldDebug=info SlurmdDebug=info SlurmSchedLogFile=/var/log/slurm/slurmsched.log SlurmSchedLogLevel=1 Node and partition configs are omitted above. Any and all advice will be greatly appreciated. Best wishes, ~Kevin Kevin Walsh Senior Systems Administration Specialist New Jersey Institute of Technology Academic & Research Computing Systems
Re: [slurm-users] OpenMPI interactive change in behavior?
I haven't experienced this issue here. Then again we've been using PMIx for launching MPI for a while now, thus we may have circumvented this particular issue. -Paul Edmon- On 4/28/2021 9:41 AM, John DeSantis wrote: Hello all, Just an update, the following URL almost mirrors the issue we're seeing: https://github.com/open-mpi/ompi/issues/8378 But, SLURM 20.11.3 was shipped with the fix. I've verified that the changes are in the source code. We don't want to have to downgrade SLURM to 20.02.x, but it seems that this behaviour still exists. Are no other sites on fresh installs of >= SLURM 20.11.3 experiencing this problem? I was aware of the changes in 20.11.{0..2} which received a lot of scrunity, which is why 20.11.3 was selected. Thanks, John DeSantis On 4/26/21 5:12 PM, John DeSantis wrote: Hello all, We've recently (don't laugh!) updated two of our SLURM installations from 16.05.10-2 to 20.11.3 and 17.11.9, respectively. Now, OpenMPI doesn't seem to function in interactive mode across multiple nodes as it did previously on the latest version 20.11.3; using `srun` and `mpirun` on a single node gives desired results, while using multiple nodes causes a hang. Jobs submitted via `sbatch` do _work as expected_. [desantis@sclogin0 ~]$ scontrol show config |grep VERSION; srun -n 2 -N 2-2 -t 00:05:00 --pty /bin/bash SLURM_VERSION = 17.11.9 [desantis@sccompute0 ~]$ for OPENMPI in mpi/openmpi/1.8.5 mpi/openmpi/2.0.4 mpi/openmpi/2.0.4-psm2 mpi/openmpi/2.1.6 mpi/openmpi/3.1.6 compilers/intel/2020_cluster_xe; do module load $OPENMPI ; which mpirun; mpirun hostname; module purge; echo; done /apps/openmpi/1.8.5/bin/mpirun sccompute0 sccompute1 /apps/openmpi/2.0.4/bin/mpirun sccompute1 sccompute0 /apps/openmpi/2.0.4-psm2/bin/mpirun sccompute1 sccompute0 /apps/openmpi/2.1.6/bin/mpirun sccompute0 sccompute1 /apps/openmpi/3.1.6/bin/mpirun sccompute0 sccompute1 /apps/intel/2020_u2/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun sccompute1 sccompute0 15:58:28 Mon Apr 26 <0> desantis@itn0 [~] $ scontrol show config|grep VERSION; srun -n 2 -N 2-2 --qos=devel --partition=devel -t 00:05:00 --pty /bin/bash SLURM_VERSION = 20.11.3 srun: job 1019599 queued and waiting for resources srun: job 1019599 has been allocated resources 15:58:46 Mon Apr 26 <0> desantis@mdc-1057-30-1 [~] $ for OPENMPI in mpi/openmpi/1.8.5 mpi/openmpi/2.0.4 mpi/openmpi/2.0.4-psm2 mpi/openmpi/2.1.6 mpi/openmpi/3.1.6 compilers/intel/2020_cluster_xe; do module load $OPENMPI ; which mpirun; mpirun hostname; module purge; echo; done /apps/openmpi/1.8.5/bin/mpirun ^C /apps/openmpi/2.0.4/bin/mpirun ^C /apps/openmpi/2.0.4-psm2/bin/mpirun ^C /apps/openmpi/2.1.6/bin/mpirun ^C /apps/openmpi/3.1.6/bin/mpirun ^C /apps/intel/2020_u2/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun ^C[mpiexec@mdc-1057-30-1] Sending Ctrl-C to processes as requested [mpiexec@mdc-1057-30-1] Press Ctrl-C again to force abort ^C Our SLURM installations are fairly straight forward. We `rpmbuild` directly from the bzip2 files without any additional arguments. We've done this since we first started using SLURM with version 14.03.3-2 and through all upgrades. Due to SLURM's awesomeness(!), we've simply used the same configuration files between version changes, with the only changes being made to parameters which have been deprecated/renamed. Our "Mpi{Default,Params}" have always been sent to "none". The only real difference we're able to ascertain is that the MPI plugin for openmpi has been removed. 
svc-3024-5-2: SLURM_VERSION = 16.05.10-2 svc-3024-5-2: srun: MPI types are... svc-3024-5-2: srun: mpi/openmpi svc-3024-5-2: srun: mpi/mpich1_shmem svc-3024-5-2: srun: mpi/mpichgm svc-3024-5-2: srun: mpi/mvapich svc-3024-5-2: srun: mpi/mpich1_p4 svc-3024-5-2: srun: mpi/lam svc-3024-5-2: srun: mpi/none svc-3024-5-2: srun: mpi/mpichmx svc-3024-5-2: srun: mpi/pmi2 viking: SLURM_VERSION = 20.11.3 viking: srun: MPI types are... viking: srun: cray_shasta viking: srun: pmi2 viking: srun: none sclogin0: SLURM_VERSION = 17.11.9 sclogin0: srun: MPI types are... sclogin0: srun: openmpi sclogin0: srun: none sclogin0: srun: pmi2 sclogin0: As far as building OpenMPI, we've always withheld any SLURM specific flags, i.e. "--with-slurm", although during the build process SLURM is detected. Because OpenMPI was always built using this method, we never had to recompile OpenMPI after subsequent SLURM upgrades, and no cluster ready applications had to be rebuilt. The only time OpenMPI had to be rebuilt was due to OPA hardware which was a simple addition of the "--with-psm2" flag. It is my understanding that the openmpi plugin "never really did anything" (per perusing the mailing list), which is why it was removed. Furthermore, searching the mailing list suggests that the appropriate method is t
Re: [slurm-users] Questions about adding new nodes to Slurm
1. Part of the communications for slurm is hierarchical. Thus nodes need to know about other nodes so they can talk to each other and forward messages to the slurmctld.
2. Yes, this is what we do. We have our slurm.conf shared via NFS from our slurm master and then we just update that single conf. After that update we then use salt to issue a global restart to all the slurmd's and slurmctld to pick up the new config. scontrol reconfigure is not enough when adding new nodes; you have to issue a global restart.
3. It's pretty straightforward all told. You just need to update the slurm.conf and do a restart. You need to be careful that the names you enter into the slurm.conf are resolvable by DNS, else slurmctld may barf on restart. Sadly no built-in sanity checker exists that I am aware of aside from actually running slurmctld. We got around this by putting together a gitlab runner which screens our slurm.conf's by running a synthetic slurmctld to sanity-check them.
-Paul Edmon- On 4/27/2021 2:35 PM, David Henkemeyer wrote: Hello, I'm new to Slurm (coming from PBS), and so I will likely have a few questions over the next several weeks, as I work to transition my infrastructure from PBS to Slurm. My first question has to do with adding nodes to Slurm. According to the FAQ (and other articles I've read), you need to basically shut down slurm, update the slurm.conf file on all nodes in the cluster, then restart slurm.
- Why do all nodes need to know about all other nodes? From what I have read, it's that Slurm does a checksum comparison of the slurm.conf file across all nodes. Is this the only reason all nodes need to know about all other nodes?
- Can I create a symlink that points slurm.conf to a slurm.conf file on an NFS mount point, which is mounted on all the nodes? This way, I would only need to update a single file, then restart Slurm across the entire cluster.
- Any additional help/resources for adding/removing nodes to Slurm would be much appreciated. Perhaps there is a "toolkit" out there to automate some of these operations (which is what I already have for PBS, and will create for Slurm, if something doesn't already exist).
Thank you all, David
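As a concrete sketch of the shared-conf workflow described in point 2 above (untested; the hostnames, NFS path, and the pdsh/salt layer are all placeholders for whatever your site uses):

    # edit the single shared copy, e.g. on the NFS export every node mounts
    vi /nfs/slurm/etc/slurm.conf
    # restart the controller and every slurmd so the new node list is read
    systemctl restart slurmctld
    pdsh -w node[001-100] systemctl restart slurmd   # or the salt/ansible equivalent
    # confirm the new nodes registered
    sinfo -N -l

As noted above, "scontrol reconfigure" alone is not sufficient when nodes are being added.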
Re: [slurm-users] Slurm version 20.11.5 is now available
So just a heads up here are the two tickets I filed. The first: https://bugs.schedmd.com/show_bug.cgi?id=11183 Has more details as to how their plugin works. The second is the clearing house for improvements: https://bugs.schedmd.com/show_bug.cgi?id=11135 -Paul Edmon- On 3/19/2021 9:25 AM, Paul Edmon wrote: I was about to ask this as well as we use /scratch as our tmp space not /tmp. I haven't kicked the tires on this to know how it works but after I take a look at it I will probably file a feature request to make the name of the tmp dir flexible. -Paul Edmon- On 3/19/2021 7:19 AM, Tina Friedrich wrote: That's excellent; I've been using the 'auto_tmpdir' plugin for this; having that functionality within SLURM will be good. Have a question though - we have a need to also create a per-job /scratch/ (on a shared fast file system) in much the same way. I don't see a way that the currentl tmpfs plugin can be used to do that, as it would seem that it's hard-coded to mount things into /tmp/ (i.e. where to mount a file system can not be changed). Or am I misreading this? Tina On 16/03/2021 22:26, Tim Wickberg wrote: One errant backspace snuck into that announcement: the job_container.conf man page (with an 'r') serves as the initial documentation for this new job_container/tmpfs plugin. The link to the HTML version of the man page has been corrected in the text below: On 3/16/21 4:16 PM, Tim Wickberg wrote: We are pleased to announce the availability of Slurm version 20.11.5. This includes a number of moderate severity bug fixes, alongside a new job_container/tmpfs plugin developed by NERSC that can be used to create per-job filesystem namespaces. Initial documentation for this plugin is available at: https://slurm.schedmd.com/job_container.conf.html Slurm can be downloaded from https://www.schedmd.com/downloads.php . - Tim
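For anyone who wants to kick the tires on the new plugin, the configuration is roughly as follows. This is an untested sketch based on the man page linked above; the BasePath location is a placeholder, and (per Tina's point) in this initial release the mount point presented to the job appears to be fixed at /tmp:

    # slurm.conf
    JobContainerType=job_container/tmpfs
    PrologFlags=contain

    # job_container.conf
    # per-job directories are created under BasePath and bind-mounted over the job's /tmp
    BasePath=/local/slurm_tmp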
Re: [slurm-users] Set Fairshare by Hand
No, there is no way to my knowledge to do this. You can zero out someone's fairshare (by removing and re-adding them) or a group's fairshare, but you can't set it to an arbitrary value. You can always adjust their RawShares for a somewhat similar effect, but that will have all the normal consequences of changing their RawShares. -Paul Edmon- On 3/22/2021 5:12 AM, Michael Müller wrote: Dear Slurm users and admins, can we set the fairshare values manually, i.e., so they are not (re)calculated by Slurm? With kind regards Michael
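For completeness, the RawShares knob mentioned above is just the fairshare= field on the association and can be changed with sacctmgr; the account and user names below are made up:

    # give user jdoe a larger slice of account lab_a's share
    sacctmgr modify user where name=jdoe account=lab_a set fairshare=10
    # or adjust the whole account relative to its sibling accounts
    sacctmgr modify account where name=lab_a set fairshare=50
    # inspect the result
    sshare -a -A lab_a

This changes the weighting going forward; it does not let you pin the computed FairShare factor to a chosen value.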
Re: [slurm-users] Slurm version 20.11.5 is now available
I was about to ask this as well as we use /scratch as our tmp space not /tmp. I haven't kicked the tires on this to know how it works but after I take a look at it I will probably file a feature request to make the name of the tmp dir flexible. -Paul Edmon- On 3/19/2021 7:19 AM, Tina Friedrich wrote: That's excellent; I've been using the 'auto_tmpdir' plugin for this; having that functionality within SLURM will be good. Have a question though - we have a need to also create a per-job /scratch/ (on a shared fast file system) in much the same way. I don't see a way that the currentl tmpfs plugin can be used to do that, as it would seem that it's hard-coded to mount things into /tmp/ (i.e. where to mount a file system can not be changed). Or am I misreading this? Tina On 16/03/2021 22:26, Tim Wickberg wrote: One errant backspace snuck into that announcement: the job_container.conf man page (with an 'r') serves as the initial documentation for this new job_container/tmpfs plugin. The link to the HTML version of the man page has been corrected in the text below: On 3/16/21 4:16 PM, Tim Wickberg wrote: We are pleased to announce the availability of Slurm version 20.11.5. This includes a number of moderate severity bug fixes, alongside a new job_container/tmpfs plugin developed by NERSC that can be used to create per-job filesystem namespaces. Initial documentation for this plugin is available at: https://slurm.schedmd.com/job_container.conf.html Slurm can be downloaded from https://www.schedmd.com/downloads.php . - Tim
Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
One should keep in mind that sacct results for memory usage are not accurate for Out Of Memory (OoM) jobs. This is due to the fact that the job is typically terminated prior to the next sacct polling period, and also terminated prior to reaching its full memory allocation. Thus I wouldn't trust any of the results with regards to memory usage if the job is terminated by OoM. sacct just can't pick up a sudden memory spike like that, and even if it did it would not correctly record the peak memory because the job was terminated prior to that point. -Paul Edmon- On 3/15/2021 1:52 PM, Chin,David wrote: Hi, all: I'm trying to understand why a job exited with an error condition. I think it was actually terminated by Slurm: job was a Matlab script, and its output was incomplete. Here's sacct output:

JobID         JobName     User  Partition  NodeList  Elapsed   State       ExitCode  ReqMem  MaxRSS    MaxVMSize  AllocTRES                 AllocGRE
83387         ProdEmisI+  foob  def        node001   03:34:26  OUT_OF_ME+  0:125     128Gn                        billing=16,cpu=16,node=1
83387.batch   batch                        node001   03:34:26  OUT_OF_ME+  0:125     128Gn   1617705K  7880672K   cpu=16,mem=0,node=1
83387.extern  extern                       node001   03:34:26  COMPLETED   0:0       128Gn   460K      153196K    billing=16,cpu=16,node=1

Thanks in advance, Dave -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki github:prehensilecode Drexel Internal Data
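For reference, the relevant fields can be pulled directly (the job ID is taken from David's output; --units just makes the numbers easier to read), with the caveat above that for an OoM-killed job the sampled MaxRSS is a lower bound on the real peak:

    sacct -j 83387 --units=G -o JobID,State,ExitCode,ReqMem,MaxRSS,MaxVMSize,Elapsed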
Re: [slurm-users] SLURM submit policy
You might try looking at a partition QoS using the GrpTRESMins or GrpTRESRunMins: https://slurm.schedmd.com/resource_limits.html There are a bunch of options which may do what you want. -Paul Edmon- On 3/10/2021 9:13 AM, Marcel Breyer wrote: Greetings, we know about the SLURM configuration option MaxSubmitJobsPerUser to limit the number of jobs a user can submit at a given time. We would like to have a similar policy that says that the total time for all jobs of a user cannot exceed a certain time limit. For example (normal MaxSubmitJobsPerUser = 2):

srun --time 10 ...
srun --time 20 ...
srun --time 10 ...   <- fails since only 2 jobs are allowed per user

However, we want something like (for a maximum aggregate time of e.g. 40mins):

srun --time 10 ...
srun --time 20 ...
srun --time 10 ...
srun --time 5 ...    <- fails since the total job times exceed 40mins

However, another allocation pattern could be:

srun --time 5 ...
srun --time 5 ...
srun --time 5 ...
srun --time 5 ...
srun --time 5 ...
srun --time 5 ...
srun --time 5 ...
srun --time 5 ...
srun --time 5 ...    <- fails since the total job times exceed 40mins (however, after the first job completed, the new job can be submitted normally)

In essence we would like to have a policy using the FIFO scheduler (such that we don't have to specify another complex scheduler) such that we can guarantee that another user has the chance to get access to a machine after at most X time units (40mins in the example above). With the MaxSubmitJobsPerUser option we would have to allow only a really small number of jobs (penalizing users that divide their computation into small sub jobs) or X would be rather large (num_jobs * max_wall_time). Is there an option in SLURM that mimics such a behavior? With best regards, Marcel Breyer
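A rough sketch of how the 40-minute example could look with a partition QoS, following the limits referenced above (the QoS and partition names are placeholders; for single-CPU jobs the TRES run-minutes map one-to-one onto wall-clock minutes, otherwise scale the figure accordingly):

    # accounting side: cap the aggregate remaining run time of each user's running jobs
    sacctmgr add qos timecap
    sacctmgr modify qos timecap set MaxTRESRunMinsPerUser=cpu=40
    # slurm.conf: attach the QoS to the partition
    PartitionName=batch Nodes=node[01-10] QOS=timecap State=UP

This needs accounting enabled, with AccountingStorageEnforce including "qos" and "limits", for the cap to actually be enforced.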
Re: [slurm-users] qos on partition
For the first does MaxJobs not do that? For the second you can set MaxJobsPerUser. That's what we do here for our test partition, we set a limit of 5 jobs per user running at any given time. You can then tie the QoS to a specific partition using the QoS option in the partition config in slurm.conf -Paul Edmon- On 3/9/2021 5:10 AM, LEROY Christine 208562 wrote: Hello, I’d like to reproduce a configuration we had with torque on queues/partitions : • how to set a maximum number of running jobs on a queue ? • and a maximum number of running jobs per user for all the users (whatever is the user)? There is a qos with slurm but it seems always attached to a user or an account, not to a partition ? What would be the best thing to do here ? Thanks in advance, Christine Leroy
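As a concrete sketch of that setup (names and numbers are placeholders, and accounting with AccountingStorageEnforce=limits,qos has to be in place for the limits to bite):

    # QoS giving both a cap on total running jobs in the partition and a per-user cap
    sacctmgr add qos testq
    sacctmgr modify qos testq set GrpJobs=50 MaxJobsPerUser=5
    # slurm.conf: bind the QoS to the partition
    PartitionName=test Nodes=tnode[01-04] QOS=testq State=UP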
Re: [slurm-users] Rate Limiting of RPC calls
We've hit this before several times. The tricks we've used to deal with this are: 1. Being on the latest release: A lot of work has gone into improving RPC throughput, if you aren't running the latest 20.11 release I highly recommend upgrading. 20.02 also was pretty good at this. 2. max_rpc_cnt/defer: I would recommend using either of these settings for SchedulerParameters as it will allow the scheduler more time to breathe. 3. I would make sure that your mysql settings are set such that your DB is fully cached in memory and not hitting disk. I also recommend running your DB on the same server as you run your ctld. We've found that this can improve throughput. 4. We put a caching version of squeue in place which gives almost live data to the users rather than live data. This additional buffer layer helps cut down traffic. This is something we rolled in house with a database that updates every 30 seconds. 5. Recommend to users to submit jobs that last for more than 10 minutes and to use Job arrays instead of looping sbatch. This will reduce thrashing. Those are my recommendations for how to deal with this. -Paul Edmon- On 2/9/2021 7:59 PM, Kota Tsuyuzaki wrote: Hello guys, In our cluster, sometimes new incoming member accidentally creates too many slurm RPC calls (sbatch, sacct, etc), then slurmctld, slurmdbd, and mysql may be overloaded. To prevent such a situation, I'm looking for something like RPC Rate Limit for users. Does Slurm supports such a RateLimit feature? If not, is there way to save Slurm server-side resources? Best, Kota 露崎 浩太 (Kota Tsuyuzaki) kota.tsuyuzaki...@hco.ntt.co.jp NTTソフトウェアイノベーションセンタ 分散処理基盤技術プロジェクト 0422-59-2837 -
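To make item 2 concrete, either setting is a one-line slurm.conf change (the threshold below is only an example; tune it to your controller):

    SchedulerParameters=defer
    # or
    SchedulerParameters=max_rpc_cnt=150

With max_rpc_cnt, the scheduler backs off while the number of active slurmctld threads is above the threshold; defer skips the per-job scheduling attempt at submit time. Both are described in the slurm.conf man page.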
Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?
That is correct. I think NVML has some additional features but in terms of actually scheduling them what you have should work. They will just be treated as normal gres resources. -Paul Edmon- On 1/26/2021 3:55 PM, Ole Holm Nielsen wrote: On 26-01-2021 21:36, Paul Edmon wrote: You can include gpu's as gres in slurm with out compiling specifically against nvml. You only really need to do that if you want to use the autodetection features that have been built into the slurm. We don't really use any of those features at our site, we only started building against nvml to future proof ourselves for when/if those features become relevant to us. Thanks for this clarification about not actually *requiring* the NVIDIA NVML library in the Slurm build! Now I'm seeing this description in https://slurm.schedmd.com/gres.html about automatic GPU configuration by Slurm: If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library (NVML) is installed on the node and was found during Slurm configuration, configuration details will automatically be filled in for any system-detected NVIDIA GPU. This removes the need to explicitly configure GPUs in gres.conf, though the Gres= line in slurm.conf is still required in order to tell slurmctld how many GRES to expect. I have defined our GPUs manually in gres.conf with File=/dev/nvidia? lines, so it would seem that this obviates the need for NVML. Is this the correct conclusion? /Ole To me at least it would be nicer if there was a less hacky way of getting it to do that. Arguably Slurm should dynamically link against the libs it needs or not depending on the node. We hit this issue with Lustre/IB as well where you have to roll a separate slurm for each type of node you have if you want these which is hardly ideal. -Paul Edmon- On 1/26/2021 3:24 PM, Robert Kudyba wrote: You all might be interested in a patch to the SPEC file, to not make the slurm RPMs depend on libnvidia-ml.so, even if it's been enabled at configure time. See https://bugs.schedmd.com/show_bug.cgi?id=7919#c3 <https://bugs.schedmd.com/show_bug.cgi?id=7919#c3> On Tue, Jan 26, 2021 at 3:17 PM Paul Raines mailto:rai...@nmr.mgh.harvard.edu>> wrote: You should check your jobs that allocated GPUs and make sure CUDA_VISIBLE_DEVICES is being set in the environment. This is a sign you GPU support is not really there but SLURM is just doing "generic" resource assignment. I have both GPU and non-GPU nodes. I build SLURM rpms twice. Once on a non-GPU node and use those RPMs to install on the non-GPU nodes. Then build again on the GPU node where CUDA is installed via the NVIDIA CUDA YUM repo rpms so the NVML lib is at /lib64/libnvidia-ml.so.1 (from rpm nvidia-driver-NVML-455.45.01-1.el8.x86_64) and no special mods to the default RPM SPEC is needed. I just run rpmbuild --tb slurm-20.11.3.tar.bz2 You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml' and see that /usr/lib64/slurm/gpu_nvml.so only exists on the one built on the GPU node. 
-- Paul Raines (http://help.nmr.mgh.harvard.edu) On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote: > In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote: >> Personally, I think it's good that Slurm RPMs are now available through >> EPEL, although I won't be able to use them, and I'm sure many people on >> the list won't be able to either, since licensing issues prevent them from >> providing support for NVIDIA drivers, so those of us with GPUs on our >> clusters will still have to compile Slurm from source to include NVIDIA >> GPU support. > > We're running Slurm 20.02.6 and recently added some NVIDIA GPU nodes. > The Slurm GPU documentation seems to be > https://slurm.schedmd.com/gres.html
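To make the manual (no-NVML) configuration Ole describes concrete, a minimal GPU setup looks roughly like the following; the node names, GPU type, device count, CPU and memory figures are all placeholders:

    # slurm.conf
    GresTypes=gpu
    NodeName=gpu[01-04] Gres=gpu:v100:4 CPUs=32 RealMemory=192000 State=UNKNOWN

    # gres.conf on the GPU nodes -- explicit definition, no NVML required
    NodeName=gpu[01-04] Name=gpu Type=v100 File=/dev/nvidia[0-3]

    # alternatively, with a Slurm built against NVML, gres.conf can be reduced to
    # AutoDetect=nvml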
Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?
You can include gpu's as gres in slurm without compiling specifically against nvml. You only really need to do that if you want to use the autodetection features that have been built into the slurm. We don't really use any of those features at our site, we only started building against nvml to future proof ourselves for when/if those features become relevant to us. To me at least it would be nicer if there was a less hacky way of getting it to do that. Arguably Slurm should dynamically link against the libs it needs or not depending on the node. We hit this issue with Lustre/IB as well where you have to roll a separate slurm for each type of node you have if you want these which is hardly ideal. -Paul Edmon- On 1/26/2021 3:24 PM, Robert Kudyba wrote: You all might be interested in a patch to the SPEC file, to not make the slurm RPMs depend on libnvidia-ml.so, even if it's been enabled at configure time. See https://bugs.schedmd.com/show_bug.cgi?id=7919#c3 On Tue, Jan 26, 2021 at 3:17 PM Paul Raines <rai...@nmr.mgh.harvard.edu> wrote: You should check your jobs that allocated GPUs and make sure CUDA_VISIBLE_DEVICES is being set in the environment. This is a sign you GPU support is not really there but SLURM is just doing "generic" resource assignment. I have both GPU and non-GPU nodes. I build SLURM rpms twice. Once on a non-GPU node and use those RPMs to install on the non-GPU nodes. Then build again on the GPU node where CUDA is installed via the NVIDIA CUDA YUM repo rpms so the NVML lib is at /lib64/libnvidia-ml.so.1 (from rpm nvidia-driver-NVML-455.45.01-1.el8.x86_64) and no special mods to the default RPM SPEC is needed. I just run rpmbuild --tb slurm-20.11.3.tar.bz2 You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml' and see that /usr/lib64/slurm/gpu_nvml.so only exists on the one built on the GPU node. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote: > In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote: >> Personally, I think it's good that Slurm RPMs are now available through >> EPEL, although I won't be able to use them, and I'm sure many people on >> the list won't be able to either, since licensing issues prevent them from >> providing support for NVIDIA drivers, so those of us with GPUs on our >> clusters will still have to compile Slurm from source to include NVIDIA >> GPU support. > > We're running Slurm 20.02.6 and recently added some NVIDIA GPU nodes. > The Slurm GPU documentation seems to be > https://slurm.schedmd.com/gres.html > We don't seem to have any problems scheduling jobs on GPUs, even though our > Slurm RPM build host doesn't have any NVIDIA software installed, as shown by > the command: > $ ldconfig -p | grep libnvidia-ml > > I'm curious about Prentice's statement about needing NVIDIA libraries to be > installed when building Slurm RPMs, and I read the discussion in bug 9525, > https://bugs.schedmd.com/show_bug.cgi?id=9525 > from which it seems that the problem was fix
Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?
In our RPM spec we use to build slurm we do the following additional things for GPUs: BuildRequires: cuda-nvml-devel-11-1 Then in the %build section we do: export CFLAGS="$CFLAGS -L/usr/local/cuda-11.1/targets/x86_64-linux/lib/stubs/ -I/usr/local/cuda-11.1/targets/x86_64-linux/include/" That ensures the cuda libs are installed and it directs slurm to where they are. After that configure should detect the nvml libs and link against them. I've attached our full spec that we use to build. -Paul Edmon- On 1/26/2021 2:29 PM, Ole Holm Nielsen wrote: In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote: Personally, I think it's good that Slurm RPMs are now available through EPEL, although I won't be able to use them, and I'm sure many people on the list won't be able to either, since licensing issues prevent them from providing support for NVIDIA drivers, so those of us with GPUs on our clusters will still have to compile Slurm from source to include NVIDIA GPU support. We're running Slurm 20.02.6 and recently added some NVIDIA GPU nodes. The Slurm GPU documentation seems to be https://slurm.schedmd.com/gres.html We don't seem to have any problems scheduling jobs on GPUs, even though our Slurm RPM build host doesn't have any NVIDIA software installed, as shown by the command: $ ldconfig -p | grep libnvidia-ml I'm curious about Prentice's statement about needing NVIDIA libraries to be installed when building Slurm RPMs, and I read the discussion in bug 9525, https://bugs.schedmd.com/show_bug.cgi?id=9525 from which it seems that the problem was fixed in 20.02.6 and 20.11. Question: Is there anything special that needs to be done when building Slurm RPMs with NVIDIA GPU support? Thanks, Ole

Name: slurm
Version: 20.11.3
%define rel 1
Release: %{rel}fasrc01%{?dist}
Summary: Slurm Workload Manager
Group: System Environment/Base
License: GPLv2+
URL: https://slurm.schedmd.com/

# when the rel number is one, the directory name does not include it
%if "%{rel}" == "1"
%global slurm_source_dir %{name}-%{version}
%else
%global slurm_source_dir %{name}-%{version}-%{rel}
%endif

Source: %{slurm_source_dir}.tar.bz2

# build options           .rpmmacros options       change to default action
#
# --prefix                %_prefix             path   install path for commands, libraries, etc.
# --with cray             %_with_cray          1      build for a Cray Aries system
# --with cray_network     %_with_cray_network  1      build for a non-Cray system with a Cray network
# --with cray_shasta      %_with_cray_shasta   1      build for a Cray Shasta system
# --with slurmrestd       %_with_slurmrestd    1      build slurmrestd
# --with slurmsmwd        %_with_slurmsmwd     1      build slurmsmwd
# --without debug         %_without_debug      1      don't compile with debugging symbols
# --with hdf5             %_with_hdf5          path   require hdf5 support
# --with hwloc            %_with_hwloc         1      require hwloc support
# --with lua              %_with_lua           path   build Slurm lua bindings
# --with mysql            %_with_mysql         1      require mysql/mariadb support
# --with numa             %_with_numa          1      require NUMA support
# --without pam           %_without_pam        1      don't require pam-devel RPM to be installed
# --without x11           %_without_x11        1      disable internal X11 support
# --with ucx              %_with_ucx           path   require ucx support
# --with pmix             %_with_pmix          path   require pmix support

# Options that are off by default (enable with --with )
%bcond_with cray
%bcond_with cray_network
%bcond_with cray_shasta
%bcond_with slurmrestd
%bcond_with slurmsmwd
%bcond_with multiple_slurmd
%bcond_with ucx

# These options are only here to force there to be these on the build.
# If they are not set they will still be compiled if the packages exist.
%bcond_with hwloc
%bcond_with mysql
%bcond_with hdf5
%bcond_with lua
%bcond_with numa
%bcond_with pmix

# Use debug by default on all systems
%bcond_without debug

# Options enabled by default
%bcond_without pam
%bcond_without x11

# Disable hardened builds. -z,now or -z,relro breaks the plugin stack
%undefine _hardened_build
%global _hardened_cflags "-Wl,-z,lazy"
%global _hardened_ldflags "-Wl,-z,lazy"

Requires: munge
%{?systemd_requires}
BuildRequires: systemd
BuildRequires: munge-devel munge-libs
BuildRequires: python3
BuildRequires: readline-devel
Obsoletes: slurm-lua slurm-munge slurm-plugins

# fake systemd support when building rpms on other platforms
%{!?_unitdir: %global _unitdir /lib/systemd/systemd}

%define use_mysql_devel %(perl -e '`rpm -q mariadb-devel`; print $?;')

%if %{with mysql}
%if %{use_mysql_devel}
BuildRequires: mysql-devel >= 5.0.0
%else
BuildRequires: mariadb-devel >= 5.0.0
%endif
%endif

%if %{with cray}
Buil
Re: [slurm-users] Slurm Upgrade Philosophy?
We are the same way, though we tend to keep pace with minor releases. We typically wait until the .1 release of a new major release before considering upgrade so that many of the bugs are worked out. We then have a test cluster that we install the release on a run a few test jobs to make sure things are working, usually MPI jobs as they tend to hit most of the features of the scheduler. We also like to stay current with releases as there are new features we want, or features we didn't know we wanted but our users find and start using. So our general methodology is to upgrade to the latest minor release at our next monthly maintenance. For major releases we will upgrade at our next monthly maintenance after the .1 release is out unless there is a show stopping bug that we run into in our own testing. At which point we file a bug with SchedMD and get a patch. -Paul Edmon- On 12/24/2020 1:57 AM, Chris Samuel wrote: On Friday, 18 December 2020 10:10:19 AM PST Jason Simms wrote: Thanks to several helpful members on this list, I think I have a much better handle on how to upgrade Slurm. Now my question is, do most of you upgrade with each major release? We do, though not immediately and not without a degree of testing on our test systems. One of the big reasons for us upgrading is that we've usually paid for features in Slurm for our needs (for example in 20.11 that includes scrontab so users won't be tied to favourite login nodes, as well as the experimental RPC queue code due to the large numbers of RPCs our systems need to cope with). I also keep an eye out for discussions of what other sites find with new releases too, so I'm following the current concerns about 20.11 and the change in behaviour for job steps that do (expanding NVIDIA's example slightly): #SBATCH --exclusive #SBATCH -N2 srun --ntasks-per-node=1 python multi_node_launch.py which (if I'm reading the bugs correctly) fails in 20.11 as that srun no longer gets all the allocated resources, instead just gets the default of --cpus-per-task=1 instead, which also affects things like mpirun in OpenMPI built with Slurm support (as it effectively calls "srun orted" and that "orted" launches the MPI ranks, so in 20.11 it only has access to a single core for them all to fight over). Again - if I'm interpreting the bugs correctly! I don't currently have a test system that's free to try 20.11 on, but hopefully early in the new year I'll be able to test this out to see how much of an impact this is going to have and how we will manage it. https://bugs.schedmd.com/show_bug.cgi?id=10383 https://bugs.schedmd.com/show_bug.cgi?id=10489 All the best, Chris
Re: [slurm-users] getting fairshare
You can use the -o option to select which field you want it to print. The last column is the FairShare score. The equation is part of the slurm documentation: https://slurm.schedmd.com/priority_multifactor.html If you are using the Classic Fairshare you can look at our documentation: https://docs.rc.fas.harvard.edu/kb/fairshare/ -Paul Edmon- On 12/16/2020 12:30 PM, Erik Bryer wrote: $ sshare -a Account User RawShares NormShares RawUsage EffectvUsage FairShare -- -- --- --- - -- root 0.00 158 1.00 root root 1 0.25 0 0.00 1.00 borrowed 1 0.25 157 0.994905 borrowed ebryer 6 0.020979 157 1.00 0.08 borrowed napierski 7 0.024476 0 0.00 0.33 borrowed sagatest01 259 0.905594 0 0.00 0.33 borrowed sagatest02 14 0.048951 0 0.00 0.33 gaia 1 0.25 0 0.005095 gaia ebryer 3 0.272727 0 1.00 0.416667 gaia napiersk 2 0.181818 0 0.00 0.67 gaia sagatest01 1 0.090909 0 0.00 0.67 gaia sagatest02 5 0.454545 0 0.00 0.67 saral 1 0.25 0 0.00 saral ebryer 20 0.869565 0 0.00 1.00 saral napierski 1 0.043478 0 0.00 1.00 saral sagatest01 2 0.086957 0 0.00 1.00 Is there a way to take output from sshare and get FairShare? I'm looking for a simple equation or some indication why that's not possible. I've ready everything I can find on this topic. Thanks, Erik
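For reference, the field selection mentioned above looks like this with sshare's -o/--format option:

    sshare -a -o Account,User,RawShares,NormShares,RawUsage,EffectvUsage,FairShare

The last column is the computed FairShare factor that feeds into the priority calculation.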
Re: [slurm-users] Query for minimum memory required in partition
We do this here using the job_submit.lua script. Here is an example:

if part == "bigmem" then
    if (job_desc.pn_min_memory ~= 0) then
        if (job_desc.pn_min_memory < 19 or job_desc.pn_min_memory > 2147483646) then
            slurm.log_user("You must request more than 190GB for jobs in bigmem partition")
            return 2052
        end
    end
end

-Paul Edmon- On 12/16/2020 11:06 AM, Sistemas NLHPC wrote: Hello, good afternoon. I have a query: currently in our cluster we have different partitions:

1 partition called slims with 48 GB of RAM
1 partition called general with 192 GB of RAM
1 partition called largemem with 768 GB of RAM

Is it possible to restrict access to the largemem partition so that tasks are only accepted if they request a minimum of 193 GB, either via slurm.conf or another method? This is because we have users who use the largemem partition reserving less than 192 GB. Thanks for help. -- Mirko Pizarro Pizarro <mpiza...@nlhpc.cl> Systems Engineer National Laboratory for High Performance Computing (NLHPC) www.nlhpc.cl CMM - Centro de Modelamiento Matemático Facultad de Ciencias Físicas y Matemáticas (FCFM) Universidad de Chile Beauchef 851 Edificio Norte - Piso 6, of. 601 Santiago – Chile tel +56 2 2978 4603
Re: [slurm-users] Novice Slurm Upgrade Questions
It won't figure it out automatically no. You will need to ensure that the spec is installing to the same locale as your vendor installed it if they didn't put it in the default location (/opt isn't the default). -Paul Edmon- On 12/4/2020 3:39 PM, Jason Simms wrote: Dear Ole, Thanks. I've read through your docs many times. The relevant upgrade section begins with the assumption that you have properly configured RPMs, so all I'm trying to do is ensure I get to that point. As I noted, a vendor installed Slurm initially through a proprietary script, though they did base it off of created RPMs. I've reached out to them to see whether they used a modified slurm.spec file, which I suspect they did, given that Slurm is installed in /opt/slurm (which seems like a modified prefix, if nothing else). The fundamental question is, if I am performing a yum update, and I don't adjust any settings in the default slurm.spec file, will it upgrade everything properly where they currently "live," or will it install new files in standard locations? It's a question of whether "yum update" is "smart enough" to figure out what was done before and go with that, or whether I must specify all relevant information in the slurm.spec file each time? Based on Paul's reply, it seems we do need an updated slurm.spec file that reflects our environment, each time we upgrade. Jason On Fri, Dec 4, 2020 at 3:13 PM Ole Holm Nielsen mailto:ole.h.niel...@fysik.dtu.dk>> wrote: Hi Jason, Slurm upgrading should be pretty simple, IMHO. I've been through this multiple times, and my Slurm Wiki has detailed upgrade documentation: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm <https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm> Building RPMs is described in this page as well: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms <https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms> I hope this helps. /Ole On 04-12-2020 20:36, Jason Simms wrote: > Thank you for being such a helpful resource for All Things Slurm; I > sincerely appreciate the helpful feedback. Right now, we are running > 20.02 and considering upgrading to 20.11 during our next maintenance > window in January. This will be the first time we have upgraded Slurm, > so understandably we are somewhat nervous and have some questions. > > I am able to download the source and build RPMs successfully. What is > unclear to me is whether I have to adjust anything in the slurm.spec > file or use a .rpmmacros file to control certain aspects of the > installation. Since this would be an upgrade, rather than a new install, > do I have to adjust, e.g., the --prefix value, and all other settings > (X11 support, etc.)? Or, will a yum update "correctly" put the files > where they are on my system, using settings from the existing 20.02 version? > > We purchased the system from a vendor, and of course they use custom > scripts to build and install Slurm, and those are tailored for an > initial installation, not an upgrade. Their advice to us was, don't > upgrade if you don't need to, which seems reasonable, except that many > of you respond to initial requests for help by recommending an upgrade. > And in any case, Slurm doesn't upgrade nicely from more than two major > versions back, so I'm hesitant to go too long without patching. > > I'm terribly sorry for my ignorance of all this. But I really lament how > terrible most resources are about all this. 
They assume that you have > built the RPMs already, without offering any real guidance as to how to > adjust relevant options, or even whether that is a requirement for an > upgrade vs. a fresh installation. > > Any guidance would be most welcome. -- *Jason L. Simms, Ph.D., M.P.H.* Manager of Research and High-Performance Computing XSEDE Campus Champion Lafayette College Information Technology Services 710 Sullivan Rd | Easton, PA 18042 Office: 112 Skillman Library p: (610) 330-5632
Re: [slurm-users] Novice Slurm Upgrade Questions
Usually the slurm.spec file provided doesn't change that much between versions. What we do here is that we maintain a git repository of our slurm.spec that we use with our modifications. Then each time Slurm is released we compare ours against what is provided, and simply modify the provided one with our changes. Unless you make specific tweaks to the slurm.spec, you should be able to just use it out of the box no problem. As always read the changelog to see if there are any major changes between the versions in case a feature you were using was deprecated. This can happen during major version upgrades. At least from my experience if you follow the directions on the Slurm documentation regarding upgrades, you should be fine. The only real hitch is that by default the RPM's do restart the slurmdbd and slurmctld services, which you don't want when upgrading. You should either neuter this or have those both stopped during the upgrade. After the upgrade you should run slurmdbd and slurmctld in commandline mode for the initial run. Once it is done and running normally you can kill these and restart the relevant services. -Paul Edmon- On 12/4/2020 2:36 PM, Jason Simms wrote: Hello all, Thank you for being such a helpful resource for All Things Slurm; I sincerely appreciate the helpful feedback. Right now, we are running 20.02 and considering upgrading to 20.11 during our next maintenance window in January. This will be the first time we have upgraded Slurm, so understandably we are somewhat nervous and have some questions. I am able to download the source and build RPMs successfully. What is unclear to me is whether I have to adjust anything in the slurm.spec file or use a .rpmmacros file to control certain aspects of the installation. Since this would be an upgrade, rather than a new install, do I have to adjust, e.g., the --prefix value, and all other settings (X11 support, etc.)? Or, will a yum update "correctly" put the files where they are on my system, using settings from the existing 20.02 version? We purchased the system from a vendor, and of course they use custom scripts to build and install Slurm, and those are tailored for an initial installation, not an upgrade. Their advice to us was, don't upgrade if you don't need to, which seems reasonable, except that many of you respond to initial requests for help by recommending an upgrade. And in any case, Slurm doesn't upgrade nicely from more than two major versions back, so I'm hesitant to go too long without patching. I'm terribly sorry for my ignorance of all this. But I really lament how terrible most resources are about all this. They assume that you have built the RPMs already, without offering any real guidance as to how to adjust relevant options, or even whether that is a requirement for an upgrade vs. a fresh installation. Any guidance would be most welcome. Warmest regards, Jason -- *Jason L. Simms, Ph.D., M.P.H.* Manager of Research and High-Performance Computing XSEDE Campus Champion Lafayette College Information Technology Services 710 Sullivan Rd | Easton, PA 18042 Office: 112 Skillman Library p: (610) 330-5632
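Condensing that procedure into a sketch (RPM-based install assumed; the package names and commands are placeholders to adapt to your site, and -D simply keeps each daemon in the foreground so you can watch the conversion):

    systemctl stop slurmctld slurmdbd   # stop daemons first so the RPM scriptlets don't restart them mid-upgrade
    yum upgrade slurm-*.rpm             # install the freshly built packages
    slurmdbd -D -vvv                    # first start in the foreground; wait for the database conversion to finish, then Ctrl-C
    systemctl start slurmdbd
    slurmctld -D -vvv                   # same for the controller, then Ctrl-C
    systemctl start slurmctld
    # finally restart slurmd on the compute nodes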
Re: [slurm-users] FairShare
Yup, our doc is for the classic fairshare not for fairtree. Thanks for the kudos on the doc by the way. We are glad it is useful. -Paul Edmon- On 12/2/2020 12:45 PM, Ryan Cox wrote: That is not for Fair Tree, which is what Micheal asked about. Ryan On 12/2/20 10:32 AM, Renfro, Michael wrote: Yesterday, I posted https://docs.rc.fas.harvard.edu/kb/fairshare/ <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rc.fas.harvard.edu%2Fkb%2Ffairshare%2F=04%7C01%7Crenfro%40tntech.edu%7Cc23f89dcb97743ee5eda08d8960679ed%7C66fecaf83dc04d2cb8b8eff0ddea46f0%7C1%7C1%7C637424301864169250%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000=%2FnB4ivZeDNrVZiaeupFnAj86oQLOhMu1%2FK6YiuBxTB8%3D=0>in response to a similar question. If you want the simplest general explanation for FairShare values, it's that they range from 0.0 to 1.0, values above 0.5 indicate that account or user has used less than their share of the resource, and values below 0.5 indicate that that account or user has used more than their share of the resource. Since all your users have the same RawShares value and are entitled to the same share of the resource, you can see that bdehaven has the most RawUsage and the lowest FairShare value, followed by ajoel and xtsao with almost identical RawUsage and FairShare, and finally ahantau with very little usage and the highest FairShare value. We use FairShare here as the dominant factor in priorities for queued jobs: if you're a light user, we bump up your priority over heavier users, and your job starts quicker than those for heavier users, assuming all other job attributes are equal. All these values are relative: in our setup, we'd bump ahantau's pending jobs ahead of the others, and put bdehaven's at the end. But if root needed to run a job outside the sray account, they'd get an enormous bump ahead since the sray account has used far more than its fair share of the resource. *From: *slurm-users *Date: *Wednesday, December 2, 2020 at 11:23 AM *To: *slurm-users@lists.schedmd.com *Subject: *Re: [slurm-users] FairShare *External Email Warning* *This email originated from outside the university. Please use caution when opening attachments, clicking links, or responding to requests.* I've read the manual and I re-read the other link. What they boil down to is Fair Share is calculated based on a recondite "rooted plane tree", which I do not have the background in discrete math to understand. I'm hoping someone can explain it so my little kernel can understand. *From:*slurm-users on behalf of Micheal Krombopulous *Sent:* Wednesday, December 2, 2020 9:32 AM *To:* slurm-users@lists.schedmd.com *Subject:* [slurm-users] FairShare Can someone tell me how to calculate fairshare (under fairtree)? I can't figure it out. I would have thought it would be the same score for all users in an account. 
E.g., here is one of my accounts: Account User RawShares NormShares RawUsage NormUsage EffectvUsage LevelFS FairShare -- -- --- --- --- - -- -- root 0.00 611349 1.00 root root 1 0.076923 0 0.00 0.00 inf 1.00 sray 1 0.076923 30921 0.505582 0.505582 0.152147 sray phedge 1 0.05 0 0.00 0.00 inf 0.181818 sray raab 1 0.05 0 0.00 0.00 inf 0.181818 sray benequist 1 0.05 0 0.00 0.00 inf 0.181818 sray bosch 1 0.05 0 0.00 0.00 inf 0.181818 sray rjenkins 1 0.05 0 0.00 0.00 inf 0.181818 sray esmith 1 0.05 0 0.00 0.00 1.7226e+07 0.054545 sray gheinz 1 0.05 0 0.00 0.00 1.9074e+14 0.072727 sray jfitz 1 0.05 0 0.00 0.00 8.0640e+20 0.081818 sray ajoel 1 0.05 42449 0.069465 0.137396 0.363913 0.018182 sray jmay 1 0.05 0 0.00 0.00 inf 0.181818 sray aferrier 1 0.05 0 0.00 0.00 inf 0.181818 sray bdehaven 1 0.05 225002 0.367771 0.727420 0.068736
Re: [slurm-users] job restart :: how to find the reason
You can dig through the slurmctld log and search for the JobID. That should tell you what Slurm was doing at the time. -Paul Edmon- On 12/2/2020 6:27 AM, Adrian Sevcenco wrote: Hi! I encountered a situation when a bunch of jobs were restarted and this is seen from Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0 So, i would like to know, how i can i find why there is a Requeue (when there is only one partition defined) and why there is a restart .. Thanks a lot!!! Adrian
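Something along these lines usually turns up the reason; the log path is whatever SlurmctldLogFile points to on your system, and the job ID below is a placeholder:

    grep 'JobId=1234567' /var/log/slurm/slurmctld.log
    # while the job record is still in memory, scontrol also shows Restarts/Requeue and the last Reason
    scontrol show job 1234567

Requeues caused by node failures also show up in 'sacct -j 1234567 -D' (the -D/--duplicates flag keeps the requeued incarnations of the job visible).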
Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120
That can help. Usually this happens due to laggy storage the job is using taking time flushing the job's data. So making sure that your storage is up, responsive, and stable will also cut these down. -Paul Edmon- On 11/30/2020 12:52 PM, Robert Kudyba wrote: I've seen where this was a bug that was fixed https://bugs.schedmd.com/show_bug.cgi?id=3941 <https://bugs.schedmd.com/show_bug.cgi?id=3941> but this happens occasionally still. A user cancels his/her job and a node gets drained. UnkillableStepTimeout=120 is set in slurm.conf Slurm 20.02.3 on Centos 7.9 running on Bright Cluster 8.2 Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED, ExitCode 0 Resending TERMINATE_JOB request JobId=6908 Nodelist=node001 update_node: node node001 reason set to: Kill task failed update_node: node node001 state set to DRAINING error: slurmd error running JobId=6908 on node(s)=node001: Kill task failed update_node: node node001 reason set to: hung update_node: node node001 state set to DOWN update_node: node node001 state set to IDLE error: Nodes node001 not responding scontrol show config | grep kill UnkillableStepProgram = (null) UnkillableStepTimeout = 120 sec Do we just increase the timeout value?
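If you do raise it, it is just a slurm.conf setting; the value below is only an example, and UnkillableStepProgram (shown here with a hypothetical script path) can be pointed at a script that dumps process and mount state before the node gets drained, which helps pin down the laggy storage mentioned above:

    UnkillableStepTimeout=180
    UnkillableStepProgram=/usr/local/sbin/unkillable_debug.sh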