[slurm-users] HPC Principal System Engineer at the Broad

2024-04-25 Thread Paul Edmon via slurm-users
A friend asked me to pass this along. Figured some folks on this list might be interested. https://broadinstitute.avature.net/en_US/careers/JobDetail/HPC-Principal-System-Engineer/17773 -Paul Edmon- -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Paul Edmon via slurm-users
Usually to clear jobs like this you have to reboot the node they are on. That will then force the scheduler to clear them. -Paul Edmon- On 4/10/2024 2:56 AM, archisman.pathak--- via slurm-users wrote: We are running a slurm cluster with version `slurm 22.05.8`. One of our users has reported

[slurm-users] Re: Avoiding fragmentation

2024-04-09 Thread Paul Edmon via slurm-users
it to force jobs to one side of the partition, though generally the scheduler does this automatically. -Paul Edmon- On 4/9/24 6:45 AM, Cutts, Tim via slurm-users wrote: Agree with that.   Plus, of course, even if the jobs run a bit slower by not having all the cores on a single node

[slurm-users] Re: FairShare priority questions

2024-03-27 Thread Paul Edmon via slurm-users
that would be my recommendation. This is how we handle fairshare at FASRC, as we use Classic Fairshare: https://docs.rc.fas.harvard.edu/kb/fairshare/ You will need to enable this: https://slurm.schedmd.com/slurm.conf.html#OPT_NO_FAIR_TREE as Fair Tree is on by default. -Paul Edmon- On 3/27/2024 9
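A minimal slurm.conf sketch of that classic-fairshare setup (values are illustrative, not FASRC's actual settings):

    PriorityType=priority/multifactor
    PriorityFlags=NO_FAIR_TREE          # fall back to classic fairshare; Fair Tree is the default
    PriorityDecayHalfLife=14-0          # example half-life
    PriorityWeightFairshare=10000       # example weight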

[slurm-users] Slurm Utilities

2024-03-13 Thread Paul Edmon via slurm-users
for slurm partition information stdg: https://github.com/fasrc/stdg Slurm test deck generator prometheus-slurm-exporter: https://github.com/fasrc/prometheus-slurm-exporter  Slurm exporters for prometheus Hopefully people find these useful. Pull requests are always appreciated. -Paul Edmon

[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Edmon via slurm-users
He's talking about recent versions of Slurm which now have this option: https://slurm.schedmd.com/slurm.conf.html#OPT_use_interactive_step -Paul Edmon- On 2/28/2024 10:46 AM, Paul Raines wrote: What do you mean "operate via the normal command line"?  When you salloc, you
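The option is a LaunchParameters flag; a one-line slurm.conf sketch (assuming Slurm 20.11 or later):

    LaunchParameters=use_interactive_step

With it set, a plain salloc drops the user into a shell on the first allocated node rather than back on the login node.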

[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Edmon via slurm-users
to salloc a few years back and haven't had any issues. -Paul Edmon- On 2/28/2024 10:17 AM, wdennis--- via slurm-users wrote: Hi list, In our institution, our instructions to users who want to spawn an interactive job (for us, a bash shell) have always been to do "srun ..." from the login n

[slurm-users] Re: Question about IB and Ethernet networks

2024-02-26 Thread Paul Edmon via slurm-users
. So we haven't heavily invested in a high speed ethernet backbone but instead invested in IB. To invest in both seems to me to be overkill, you should focus on one or the other unless you have the cash to spend and a good use case. -Paul Edmon- On 2/26/24 7:07 AM, Dan Healy via slurm-users

[slurm-users] Re: Recover Batch Script Error

2024-02-16 Thread Paul Edmon via slurm-users
Are you using the job_script storage option? If so then you should be able to get at it by doing: sacct -B -j JOBID https://slurm.schedmd.com/sacct.html#OPT_batch-script -Paul Edmon- On 2/16/2024 2:41 PM, Jason Simms via slurm-users wrote: Hello all, I've used the "scontrol
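A sketch of the two pieces involved, assuming Slurm 21.08 or later (the job ID is made up):

    # slurm.conf: store submitted batch scripts in the accounting database
    AccountingStoreFlags=job_script

    # later, retrieve the script for a given job
    sacct -B -j 12345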

[slurm-users] Re: Naive SLURM question: equivalent to LSF pre-exec

2024-02-14 Thread Paul Edmon via slurm-users
You probably want the Prolog option: https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog along with: https://slurm.schedmd.com/slurm.conf.html#OPT_ForceRequeueOnFail -Paul Edmon- On 2/14/2024 8:38 AM, Cutts, Tim via slurm-users wrote: Hi, I apologise if I’ve failed to find

Re: [slurm-users] Two jobs each with a different partition running on same node?

2024-01-29 Thread Paul Edmon
. -Paul Edmon- On 1/29/2024 9:25 AM, Loris Bennett wrote: Hi, I seem to remember that in the past, if a node was configured to be in two partitions, the actual partition of the node was determined by the partition associated with the jobs running on it. Moreover, at any instance where the node

Re: [slurm-users] preemptable queue

2024-01-12 Thread Paul Edmon
that default of PreemptMode=CANCEL and then set specific PreemptModes for all your partitions. That's what we do and it works for us. -Paul Edmon- On 1/12/2024 10:33 AM, Davide DelVento wrote: Thanks Paul, I don't understand what you mean by having a typo somewhere. I mean, that configuration works
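A minimal sketch of that pattern (partition and node names are invented):

    PreemptType=preempt/partition_prio
    PreemptMode=CANCEL                  # cluster-wide default
    PartitionName=scavenge Nodes=node[01-10] PriorityTier=1  PreemptMode=CANCEL
    PartitionName=owners   Nodes=node[01-10] PriorityTier=10 PreemptMode=OFF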

Re: [slurm-users] preemptable queue

2024-01-12 Thread Paul Edmon
At least in the example you are showing you have PreemptType commented out, which means it will return the default. PreemptMode Cancel should work, I don't see anything in the documentation that indicates it wouldn't.  So I suspect you have a typo somewhere in your conf. -Paul Edmon- On 1/11

Re: [slurm-users] Beginner admin question: Prioritization within a partition based on time limit

2024-01-09 Thread Paul Edmon
that will work best for the policy you want to implement. -Paul Edmon- On 1/9/2024 10:43 AM, Kenneth Chiu wrote: I'm just learning about slurm. I understand that different partitions can be prioritized separately, and can have different max time limits. I was wondering whether

Re: [slurm-users] GPU Card Reservation?

2023-12-15 Thread Paul Edmon
would be all or nothing for a node so that would not work. -Paul Edmon- On 12/15/23 12:16 PM, Jason Simms wrote: Hello all, At least at one point, I understood that it was not particularly possible, or at least not elegant, to provide priority preempt access to a specific GPU card. So

Re: [slurm-users] Disabling SWAP space will it effect SLURM working

2023-12-11 Thread Paul Edmon
We've been running for years without swap with no issues. You may want to set MemSpecLimit in your config to reserve memory for the OS, so that way you don't OOM the system with user jobs: https://slurm.schedmd.com/slurm.conf.html#OPT_MemSpecLimit -Paul Edmon- On 12/11/2023 11:19 AM
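A hypothetical node definition using MemSpecLimit (sizes are examples only, in MB):

    # reserve 8 GB for the OS and daemons; jobs can only allocate the remainder
    NodeName=node[01-32] CPUs=64 RealMemory=256000 MemSpecLimit=8192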

Re: [slurm-users] enabling job script archival

2023-10-03 Thread Paul Edmon
You will probably need to. The way we handle it is that we add users when the first submit a job via the job_submit.lua script. This way the database autopopulates with active users. -Paul Edmon- On 10/3/23 9:01 AM, Davide DelVento wrote: By increasing the slurmdbd verbosity level, I got

Re: [slurm-users] enabling job script archival

2023-10-02 Thread Paul Edmon
At least in our setup, users can see their own scripts by doing sacct -B -j JOBID I would make sure that the scripts are being stored and how you have PrivateData set. -Paul Edmon- On 10/2/2023 10:57 AM, Davide DelVento wrote: I deployed the job_script archival and it is working, however

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-29 Thread Paul Edmon
ything. The entire process takes about an hour start to finish, with the longest part being the pausing of all the jobs. -Paul Edmon- On 9/29/2023 9:48 AM, Groner, Rob wrote: I did already see the upgrade section of Jason's talk, but it wasn't much about the mechanics of the actual upgrade p

Re: [slurm-users] enabling job script archival

2023-09-28 Thread Paul Edmon
which helps with the on disk size. Raw uncompressed our database is about 90G.  We keep 6 months of data in our active database. -Paul Edmon- On 9/28/2023 1:57 PM, Ryan Novosielski wrote: Sorry for the duplicate e-mail in a short time: do you know (or anyone) when the hashing was added

Re: [slurm-users] enabling job script archival

2023-09-28 Thread Paul Edmon
for job_scripts as they are functionally the same and thus you have many jobs pointed to the same script, but less so for job_envs. -Paul Edmon- On 9/28/2023 1:55 PM, Ryan Novosielski wrote: Thank you; we’ll put in a feature request for improvements in that area, and also thanks for the warning? I thought

Re: [slurm-users] enabling job script archival

2023-09-28 Thread Paul Edmon
of them if they get large is to 0 out the column in the table. You can ask SchedMD for the mysql command to do this as we had to do it here to our job_envs. -Paul Edmon- On 9/28/2023 1:40 PM, Davide DelVento wrote: In my current slurm installation, (recently upgraded to slurm v23.02.3), I only

Re: [slurm-users] Submitting hybrid OpenMPI and OpenMP Jobs

2023-09-22 Thread Paul Edmon
You might also try swapping to use srun instead of mpiexec as that way slurm can give more direction as to what cores have been allocated to what. I've found in the past that mpiexec will ignore what Slurm tells it. -Paul Edmon- On 9/22/23 8:24 AM, Lambers, Martin wrote: Hello
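For a hybrid job the launch line in the batch script would become something like this (binary name is illustrative; --mpi=pmix assumes Slurm was built with PMIx support):

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    srun --mpi=pmix ./hybrid_app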

Re: [slurm-users] Best way to accurately calculate the CPU usage of an account when using fairshare?

2023-05-08 Thread Paul Edmon
I would recommend standing up an instance of XDMod as it handles most of this for you in its summary reports. https://open.xdmod.org/10.0/index.html -Paul Edmon- On 5/3/23 2:05 PM, Joseph Francisco Guzman wrote: Good morning, We have at least one billed account right now, where

Re: [slurm-users] changing the operational network in slurm setup

2023-03-14 Thread Paul Edmon
We do this for our Infiniband set up.  What we do is that we populate /etc/hosts with the hostname mapped to the IP we want Slurm to use.  This way you get IP traffic traversing the address you want between nodes while not having to mess with DNS. -Paul Edmon- On 3/14/2023 12:19 AM, Purvesh
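Illustrative /etc/hosts entries (hostnames and addresses are made up); each short hostname resolves to the node's IPoIB address, so inter-node traffic rides the IB fabric without touching DNS:

    # /etc/hosts on every node
    10.20.0.101  node101
    10.20.0.102  node102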

Re: [slurm-users] linting slurm.conf files

2023-01-27 Thread Paul Edmon
We have a gitlab runner that fires up a docker container that basically starts up a mini scheduler (slurmdbd and slurmctld) to confirm that both can start. It covers most bases but we would like to see an official syntax checker (https://bugs.schedmd.com/show_bug.cgi?id=3435). -Paul Edmon

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-04 Thread Paul Edmon
The symlink method for slurm.conf is what we do as well. We have an NFS mount from the slurm master that hosts the slurm.conf, and we symlink slurm.conf to that share. -Paul Edmon- On 1/4/2023 1:53 PM, Brian Andrus wrote: One of the simple ways I have dealt with different

Re: [slurm-users] How to read job accounting data long output? `sacct -l`

2022-12-14 Thread Paul Edmon
The seff utility (in slurm-contribs) also gives good summary info. You can also use --parsable to make things more manageable. -Paul Edmon- On 12/14/22 3:41 PM, Ross Dickson wrote: I wrote a simple Python script to transpose the output of sacct from a row into a column.  See if it meets your
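A couple of hedged usage examples (the job ID and field list are arbitrary):

    # machine-readable accounting output, one pipe-delimited field per column
    sacct -j 12345 --parsable2 --format=JobID,JobName,Elapsed,MaxRSS,State

    # per-job efficiency summary from slurm-contribs
    seff 12345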

Re: [slurm-users] Slurm v22 for Alma 8

2022-12-02 Thread Paul Edmon
Yeah, our spec is based off of their spec with our own additional features plugged in. -Paul Edmon- On 12/2/22 2:12 PM, David Thompson wrote: Hi Paul, thanks for passing that along. The error I saw was coming from the rpmbuild %check stage in the el9/fc38 builds, which your .spec file

Re: [slurm-users] Slurm v22 for Alma 8

2022-12-02 Thread Paul Edmon
Yup, here is the spec we use that works for CentOS 7, Rocky 8, and Alma 8. -Paul Edmon- On 12/2/22 12:21 PM, David Thompson wrote: Hi folks, I’m working on getting Slurm v22 RPMs built for our Alma 8 Slurm cluster. We would like to be able to use the sbatch –prefer option, which isn’t

Re: [slurm-users] slurm 22.05 "hash_k12" related upgrade issue

2022-10-24 Thread Paul Edmon
It only happens for versions on the 22.05 series prior to the latest release (22.05.5).  So the 21 version isn't impacted and you should be fine to upgrade from 21 to 22.05.5 and not see the hash_k12 issue.  If you upgrade to any prior minor version though you will hit this issue. -Paul Edmon

Re: [slurm-users] Ideal NFS exported StateSaveLocation size.

2022-10-24 Thread Paul Edmon
the HA setup for slurmctld will protect you from the server hosting the slurmctld getting hosed, not the entire rack going down or the datacenter going down. -Paul Edmon- On 10/24/2022 4:14 AM, Ole Holm Nielsen wrote: On 10/24/22 09:57, Diego Zuccato wrote: Il 24/10/2022 09:32, Ole Holm

Re: [slurm-users] Check consistency

2022-10-07 Thread Paul Edmon
The slurmctld log will print out if hosts are out of sync with the slurmctld slurm.conf.  That said it doesn't report on cgroup consistency changes like that.  It's possible that dialing up the verbosity on the slurmd logs may give that info but I haven't seen it in normal operation. -Paul

Re: [slurm-users] Recommended amount of memory for the database server

2022-09-26 Thread Paul Edmon
database is bigger than that. -Paul Edmon- On 9/25/22 5:18 PM, byron wrote: Hi Does anyone know what is the recommended amount of memory to give slurms mariadb database server? I seem to remember reading a simple estimate based on the size of certain tables (or something along those lines

Re: [slurm-users] Providing users with info on wait time vs. run time

2022-09-16 Thread Paul Edmon
We also call scontrol in our scripts (as little as we can manage) and we run at the scale of 1500 nodes.  It hasn't really caused many issues, but we try to limit it as much as we possibly can. -Paul Edmon- On 9/16/22 9:41 AM, Sebastian Potthoff wrote: Hi Hermann, So you both are happily

Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9

2022-09-08 Thread Paul Edmon
But not any 20.  There are two 20.x versions, 20.02 and 20.11, and there was a previous 19.05.  So two versions ahead of 18.08 would be 20.02, not 20.11 -Paul Edmon- On 9/8/22 12:14 PM, Wadud Miah wrote: The previous version was 18 and now I am trying to upgrade to 20, so I am well within 2 major

Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9

2022-09-08 Thread Paul Edmon
Typically Slurm only supports upgrading across at most two major versions at a time.  If you are on 18.08 you likely can only go to 20.02. Then after you upgrade to 20.02 you can go to 20.11 or 21.08. -Paul Edmon- On 9/8/22 11:38 AM, Wadud Miah wrote: hi Mick, I have checked that all the compute nodes

Re: [slurm-users] maridb version compatibility with Slurm version

2022-08-24 Thread Paul Edmon
I've regularly upgraded the mariadb version without upgrading the slurm version with no issue. We are currently running 10.6.7 for MariaDB on CentOS 7.9 with Slurm 22.05.2.  So long as you do the mysql_upgrade after the upgrade and have a backup just in case you should be fine. -Paul Edmon
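The usual sequence looks roughly like this (a sketch; adjust package manager, service names, and credentials to your site):

    systemctl stop slurmdbd
    # ...upgrade the MariaDB packages...
    systemctl restart mariadb
    mysql_upgrade -u root -p      # rebuild system tables for the new server version
    systemctl start slurmdbd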

Re: [slurm-users] Does the slurmctld node need access to Parallel File system and Runtime libraries of the SW in the Compute nodes.

2022-08-02 Thread Paul Edmon
True.  Though be aware that Slurm will by default map the environment from login nodes to compute.  That's the real thing that matters.  So as long as the environment is setup properly, any filesystems excluding the home directory do not need to be mounted on login. -Paul Edmon- On 8/2/2022

Re: [slurm-users] SlurmDB Archive settings?

2022-07-18 Thread Paul Edmon
ter=6month PurgeTXNAfter=6month PurgeUsageAfter=6month -Paul Edmon- On 7/15/2022 2:08 AM, Ole Holm Nielsen wrote: Hi Paul, On 7/14/22 15:10, Paul Edmon wrote: We just use the Archive function built into slurm.  That has worked fine for us for the past 6 years. We keep 6 months of data in the acti
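The truncated settings above are slurmdbd.conf purge parameters; a sketch of a six-month archive/purge policy (the directory path is illustrative):

    # slurmdbd.conf
    ArchiveDir=/var/spool/slurmdbd/archive
    ArchiveJobs=yes
    ArchiveSteps=yes
    PurgeJobAfter=6month
    PurgeStepAfter=6month
    PurgeEventAfter=6month
    PurgeTXNAfter=6month
    PurgeUsageAfter=6month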

Re: [slurm-users] SlurmDB Archive settings?

2022-07-14 Thread Paul Edmon
in 22.05 so that it is more efficient but getting from here to there is the trick. For details see the bug report we filed: https://bugs.schedmd.com/show_bug.cgi?id=14514 -Paul Edmon- On 7/14/2022 2:34 PM, Timony, Mick wrote: What I can tell you is that we have never had a problem

Re: [slurm-users] SlurmDB Archive settings?

2022-07-14 Thread Paul Edmon
archive one month at a time which allowed it to get done in a reasonable amount of time. The archived data can be pulled into a different slurm database, which is what we do for importing historic data into our XDMod instance. -Paul Edmon- On 7/13/2022 4:55 PM, Timony, Mick wrote: Hi Slurm

Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Paul Edmon
sorts of problems. -Paul Edmon- On 5/17/22 2:50 PM, Ole Holm Nielsen wrote: Hi, You can upgrade from 19.05 to 20.11 in one step (2 major releases), skipping 20.02.  When that is completed, it is recommended to upgrade again from 20.11 to 21.08.8 in order to get the current major version

Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Paul Edmon
I think it should be, but you should be able to run a test and find out. -Paul Edmon- On 5/17/22 12:13 PM, byron wrote: Sorry, I should have been clearer.   I understand that with regards to slurmd / slurmctld you can skip a major release without impacting running jobs etc.  My questions

Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Paul Edmon
y can hand out if you are bootstrapping to a newer release. -Paul Edmon- On 5/17/22 11:42 AM, byron wrote: Thanks Brian for the speedy responce. Am I not correct in thinking that if I just go from 19.05 to 20.11 then there is the advantage that I can upgrade slurmd and slurmctld in one go an

Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Paul Edmon
They fixed this in newer versions of Slurm.  We had the same issue with older versions so we had to run with the config_override option on to keep the logs quiet.  They changed the way logging was done in the more recent releases and it's not as chatty. -Paul Edmon- On 5/12/22 7:35 AM, Per

Re: [slurm-users] Slurm 21.08.8-2 upgrade

2022-05-06 Thread Paul Edmon
We upgraded from 21.08.6 to 21.08.8-1 yesterday morning but overnight we saw the communications issues described by Tim W.  We upgraded to 21.08.8-2 this morning and that did the trick to resolve all the communications problems we were having. -Paul Edmon- On 5/6/2022 4:38 AM, Ole Holm

Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread Paul Edmon
when you absolutely have no other work around then you should be fine. -Paul Edmon- On 5/3/2022 3:46 AM, taleinterve...@sjtu.edu.cn wrote: Hi, all: We need to detect some problem at job end timepoint, so we write some detection script in slurm epilog, which should drain the node if check

Re: [slurm-users] non-historical scheduling

2022-04-12 Thread Paul Edmon
limits for each user. -Paul Edmon- On 4/12/2022 8:55 AM, Chagai Nota wrote: Hi Loris Thanks for your answer. I tried to configure it and I didn't get the desired results. This is my configuration: PriorityType=priority/multifactor PriorityDecayHalfLife=0 PriorityUsageResetPeriod=DAILY

Re: [slurm-users] Limit partition to 1 job at a time

2022-03-22 Thread Paul Edmon
I think you could do this by clever use of a partition level QoS but I don't have an obvious way of doing this. -Paul Edmon- On 3/22/2022 11:40 AM, Russell Jones wrote: Hi all, For various reasons, we need to limit a partition to being able to run max 1 job at a time. Not 1 job per user
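One hedged way to wire up the partition-QoS idea (names are made up, and as noted above it would need testing):

    # create a QoS that allows only one running job in aggregate
    sacctmgr add qos serialonly
    sacctmgr modify qos serialonly set GrpJobs=1

    # attach it to the partition in slurm.conf
    PartitionName=serial Nodes=node[01-04] QOS=serialonly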

Re: [slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-10 Thread Paul Edmon
for older versions of MPI): https://github.com/SchedMD/slurm/blob/slurm-21-08-5-1/NEWS  What we've recommended to users who have hit this was to swap over to using srun instead of mpirun and the situation clears up. -Paul Edmon- On 2/10/2022 8:59 AM, Ward Poelmans wrote: Hi Paul, On 10/02/2022 14

Re: [slurm-users] How to limit # of execution slots for a given node

2022-01-07 Thread Paul Edmon
, the specified memory will only be unavailable for user allocations. These will restrict specific memory and cores for system use. This is probably the best way to go rather than spoofing your config. -Paul Edmon- On 1/7/2022 2:36 AM, Rémi Palancher wrote: Le jeudi 6 janvier 2022 à 22:39, David

Re: [slurm-users] How to limit # of execution slots for a given node

2022-01-07 Thread Paul Edmon
You can actually spoof the number of cores and RAM on a node by using the config_override option.  I've used that before for testing purposes.  Mind you core binding and other features like that will not work if you start spoofing the number of cores and ram, so use with caution. -Paul Edmon

Re: [slurm-users] export qos

2021-12-17 Thread Paul Edmon
Just out of curiosity, is there a reason you aren't just doing a mysqldump of the extant DB and then reimporting it? I'm not aware of a way to dump just the qos settings for import other than: sacctmgr show qos -Paul Edmon- On 12/17/2021 10:24 AM, Williams, Jenny Avis wrote: Sacctmgr dump

Re: [slurm-users] slurmdbd full backup so the primary can be purged

2021-12-13 Thread Paul Edmon
SchedMD as to any limitations they are aware of.  Usually they are pretty good about being comprehensive in their docs so they would have probably mentioned it if there was one. -Paul Edmon- On 12/13/2021 5:07 AM, Loris Bennett wrote: Hi Paul, Am I right in assuming that there are going

Re: [slurm-users] slurmdbd full backup so the primary can be purged

2021-12-10 Thread Paul Edmon
this is writing your sql into the database. So you could set up a full mirror and then read the old archives into that.  You just want to make sure that mirror has archiving/purging turned off so it won't rearchive the data you restored. -Paul Edmon- On 12/10/2021 1:28 PM, Ransom, Geoffrey M

Re: [slurm-users] Database Compression

2021-12-09 Thread Paul Edmon
and reimport will take a while (for me it was about 4 hours start to finish on my test system). -Paul Edmon- On 12/2/2021 1:06 PM, Baer, Troy wrote: My site has just updated to Slurm 21.08 and we are looking at moving to the built-in job script capture capability, so I'm curious about

Re: [slurm-users] A Slurm topological scheduling question

2021-12-07 Thread Paul Edmon
all our internode IP comms going over our IB fabric and it works fine. -Paul Edmon- On 12/7/2021 11:05 AM, David Baker wrote: Hello, These days we have now enabled topology aware scheduling on our Slurm cluster. One part of the cluster consists of two racks of AMD compute nodes

Re: [slurm-users] [EXT] Re: slurmdbd does not work

2021-12-03 Thread Paul Edmon
I would check that you have MariaDB-shared installed too on the host you build on prior to your build.  They changed the way the packaging is done in MariaDB and Slurm needs to detect the files in MariaDB-shared to actually trigger the configure to build the mysql libs. -Paul Edmon- On 12/3

Re: [slurm-users] Preferential scheduling on a subset of nodes

2021-12-01 Thread Paul Edmon
*PreemptMode* for this partition. It can be set to OFF to disable preemption and gang scheduling for this partition. See also *PriorityTier* and the above description of the cluster-wide *PreemptMode* parameter for further details. This is at least how we manage that. -Paul Edmon
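A sketch of overlapping partitions using PriorityTier and per-partition PreemptMode (partition and node names are invented; assumes PreemptType=preempt/partition_prio):

    # jobs in "owner" schedule first and are never preempted;
    # "shared" jobs backfill onto the same nodes and can be requeued
    PartitionName=owner  Nodes=node[01-16] PriorityTier=10 PreemptMode=OFF
    PartitionName=shared Nodes=node[01-16] PriorityTier=1  PreemptMode=REQUEUE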

Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-25 Thread Paul Edmon
the jobs and scheduling this is somewhat mitigated, though jobs will still exit due to timeout. -Paul Edmon- On 10/25/2021 4:47 AM, Alan Orth wrote: Dear Jurgen and Paul, This is an interesting strategy, thanks for sharing. So if I read the scontrol man page correctly, `scontrol suspend

Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-19 Thread Paul Edmon
Yup, we follow the same process for when we do Slurm upgrades, this looks analogous to our process. -Paul Edmon- On 10/19/2021 3:06 PM, Juergen Salk wrote: Dear all, we are planning to perform some maintenance work on our Lustre file system which may or may not harm running jobs. Although

Re: [slurm-users] slurm.conf syntax checker?

2021-10-13 Thread Paul Edmon
then have it reject any changes that cause failure.  It's not perfect but it works.  A real syntax checker would be better. -Paul Edmon- On 10/12/2021 4:08 PM, bbenede...@goodyear.com wrote: Is there any sort of syntax checker that we could run our slurm.conf file through before committing

[slurm-users] Using Nice to Break Ties

2021-09-14 Thread Paul Edmon
/group/lab?  What solutions have people used for this? -Paul Edmon-

Re: [slurm-users] User CPU limit across partitions?

2021-08-03 Thread Paul Edmon
I think you can accomplish this by setting Partition QoS and defining it to hook into the same QoS for all there.  I believe that would force it to share the same pool. That said I don't know if that would work properly, it's worth a test.  That is my first guess though. -Paul Edmon- On 8/3

Re: [slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Paul Edmon
it's the sum total of all the TRES a Group could run in a partition at one time. -Paul Edmon- On 8/2/2021 12:05 PM, Adrian Sevcenco wrote: On 8/2/21 6:26 PM, Paul Edmon wrote: Probably more like MaxTRESPerJob=cpu=8 i see, thanks!! i'm still searching for the definition of GrpTRES :) Thanks

Re: [slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Paul Edmon
Probably more like MaxTRESPerJob=cpu=8 You would need to specify how much TRES you need for each job in the normal tres format. -Paul Edmon- On 8/2/2021 11:24 AM, Adrian Sevcenco wrote: On 8/2/21 5:44 PM, Paul Edmon wrote: You can set up a Partition based QoS that can set this limit
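A sketch of attaching that limit through a partition QoS (QoS and partition names are illustrative):

    sacctmgr add qos cap8
    sacctmgr modify qos cap8 set MaxTRESPerJob=cpu=8

    # slurm.conf
    PartitionName=normal Nodes=node[01-08] QOS=cap8 Default=YES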

Re: [slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Paul Edmon
You can set up a Partition based QoS that can set this limit: https://slurm.schedmd.com/resource_limits.html  See the MaxTRESPerJob limit. -Paul Edmon- On 8/2/2021 10:40 AM, Adrian Sevcenco wrote: Hi! Is there a way to declare that jobs can request up to 8 cores? Or is it allowed by default

Re: [slurm-users] Can I get the original sbatch command, after the fact?

2021-07-16 Thread Paul Edmon
Not in the current version of Slurm.  In the next major version long term storage of job scripts will be available. -Paul Edmon- On 7/16/2021 2:16 PM, David Henkemeyer wrote: If I execute a bunch of sbatch commands, can I use sacct (or something else) to show me the original sbatch command

Re: [slurm-users] MinJobAge

2021-07-06 Thread Paul Edmon
, the minimum non-zero value for *MinJobAge* recommended is 2. From my experience this does work.  We've been running with MinJobAge=600 for years without any problems to my knowledge -Paul Edmon- On 7/6/2021 8:59 AM, Emre Brookes wrote:   Brian Andrus Nov 23, 2020, 1:55:54 PM

Re: [slurm-users] Long term archiving

2021-06-28 Thread Paul Edmon
We keep 6 months in our active database and then we archive and purge anything older than that.  The archive data itself is available for reimport and historical investigation.  We've done this when importing historical data into XDMod. -Paul Edmon- On 6/28/2021 10:43 AM, Yair Yarom wrote

Re: [slurm-users] Upgrading slurm - can I do it while jobs running?

2021-05-26 Thread Paul Edmon
for major version upgrades than minors. So if you are doing a minor version upgrade it's likely fine to do live.  For major version I would recommend at least pausing all the jobs. -Paul Edmon- On 5/26/2021 2:48 PM, Ole Holm Nielsen wrote: On 26-05-2021 20:23, Will Dennis wrote: About to embark on my

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Paul Edmon
XDMod can give these sorts of stats.  I also have some diamond collectors we use in concert with grafana to pull data and plot it which is useful for seeing large scale usage trends: https://github.com/fasrc/slurm-diamond-collector -Paul Edmon- On 5/13/2021 6:08 PM, Sid Young wrote: Hi All

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Paul Edmon
Yup, we use XDMod for this sort of data as well. -Paul Edmon- On 5/11/2021 8:52 AM, Renfro, Michael wrote: XDMoD [1] is useful for this, but it’s not a simple script. It does have some user-accessible APIs if you want some report automation. I’m using that to create a lightning-talk-style

Re: [slurm-users] Testing Lua job submit plugins

2021-05-06 Thread Paul Edmon
We go the route of having a test cluster and vetting our lua scripts there before putting them in the production environment. -Paul Edmon- On 5/6/2021 1:23 PM, Renfro, Michael wrote: I’ve used the structure at https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5 <ht

[slurm-users] Replacement for diamond

2021-05-04 Thread Paul Edmon
a new option. So what do people use for shipping various slurm stats to graphite? -Paul Edmon-

Re: [slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Paul Edmon
Since you can run an arbitrary script as a node health checker I might add a script that counts failures and then closes if it hits a threshold.  The script shouldn't need to talk to the slurmctld or slurmdbd as it should be able to watch the log on the node and see the fail. -Paul Edmon

Re: [slurm-users] Fairshare config change affect on running/queued jobs?

2021-04-30 Thread Paul Edmon
It shouldn't impact running jobs, all it should really do is impact pending jobs as it will order them by their relative priority scores. -Paul Edmon- On 4/30/2021 12:39 PM, Walsh, Kevin wrote: Hello everyone, We wish to deploy "fair share" scheduling configuration and would like

Re: [slurm-users] OpenMPI interactive change in behavior?

2021-04-28 Thread Paul Edmon
I haven't experienced this issue here.  Then again we've been using PMIx for launching MPI for a while now, thus we may have circumvented this particular issue. -Paul Edmon- On 4/28/2021 9:41 AM, John DeSantis wrote: Hello all, Just an update, the following URL almost mirrors the issue

Re: [slurm-users] Questions about adding new nodes to Slurm

2021-04-27 Thread Paul Edmon
together a gitlab runner which screens our slurm.conf's by running synthetic slurmctld to sanity check. -Paul Edmon- On 4/27/2021 2:35 PM, David Henkemeyer wrote: Hello, I'm new to Slurm (coming from PBS), and so I will likely have a few questions over the next several weeks, as I work

Re: [slurm-users] Slurm version 20.11.5 is now available

2021-03-25 Thread Paul Edmon
So just a heads up here are the two tickets I filed.  The first: https://bugs.schedmd.com/show_bug.cgi?id=11183  Has more details as to how their plugin works.  The second is the clearing house for improvements: https://bugs.schedmd.com/show_bug.cgi?id=11135 -Paul Edmon- On 3/19/2021 9:25 AM

Re: [slurm-users] Set Fairshare by Hand

2021-03-22 Thread Paul Edmon
consequences of changing their RawShares. -Paul Edmon- On 3/22/2021 5:12 AM, Michael Müller wrote: Dear Slurm users and admins, can we set the faireshare values manually, i.e., they are not (re)calculated be Slurm? With kind regards Michael

Re: [slurm-users] Slurm version 20.11.5 is now available

2021-03-19 Thread Paul Edmon
I was about to ask this as well as we use /scratch as our tmp space not /tmp.  I haven't kicked the tires on this to know how it works but after I take a look at it I will probably file a feature request to make the name of the tmp dir flexible. -Paul Edmon- On 3/19/2021 7:19 AM, Tina

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Paul Edmon
any of the results with regards to memory usage if the job is terminated by OoM.  sacct just can't pick up a sudden memory spike like that and even if it did  it would not correctly record the peak memory because the job was terminated prior to that point. -Paul Edmon- On 3/15/2021 1:52 PM

Re: [slurm-users] SLURM submit policy

2021-03-10 Thread Paul Edmon
You might try looking at a partition QoS using the GrpTRESMins or GrpTRESRunMins: https://slurm.schedmd.com/resource_limits.html There are a bunch of options which may do what you want. -Paul Edmon- On 3/10/2021 9:13 AM, Marcel Breyer wrote: Greetings, we know about the SLURM configuration
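A hedged example of setting one of those limits on a partition QoS (QoS name and value are illustrative):

    # cap the total CPU-minutes that can be running at once under this QoS
    sacctmgr modify qos partqos set GrpTRESRunMins=cpu=1000000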

Re: [slurm-users] qos on partition

2021-03-09 Thread Paul Edmon
in slurm.conf -Paul Edmon- On 3/9/2021 5:10 AM, LEROY Christine 208562 wrote: Hello, I’d like to reproduce a configuration we had with torque on queues/partitions : • how to set a maximum number of running jobs on a queue ? • and a maximum number of running jobs per user for all the users

Re: [slurm-users] Rate Limiting of RPC calls

2021-02-09 Thread Paul Edmon
with a database that updates every 30 seconds. 5. Recommend to users to submit jobs that last for more than 10 minutes and to use Job arrays instead of looping sbatch.  This will reduce thrashing. Those are my recommendations for how to deal with this. -Paul Edmon- On 2/9/2021 7:59 PM, Kota
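For instance, instead of a shell loop calling sbatch hundreds of times, a single array submission (script name and range are examples):

    # one submission, many tasks; each task sees $SLURM_ARRAY_TASK_ID
    sbatch --array=1-500 process_chunk.sh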

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Paul Edmon
That is correct.  I think NVML has some additional features but in terms of actually scheduling them what you have should work. They will just be treated as normal gres resources. -Paul Edmon- On 1/26/2021 3:55 PM, Ole Holm Nielsen wrote: On 26-01-2021 21:36, Paul Edmon wrote: You can

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Paul Edmon
/IB as well where you have to roll a separate slurm for each type of node you have if you want these which is hardly ideal. -Paul Edmon- On 1/26/2021 3:24 PM, Robert Kudyba wrote: You all might be interested in a patch to the SPEC file, to not make the slurm RPMs depend on libnvidia-ml.so, even

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Paul Edmon
clude/" That ensures the cuda libs are installed and it directs slurm to where they are.  After that configure should detect the nvml libs and link against them. I've attached our full spec that we use to build. -Paul Edmon- On 1/26/2021 2:29 PM, Ole Holm Nielsen wrote: In another thread

Re: [slurm-users] Slurm Upgrade Philosophy?

2020-12-24 Thread Paul Edmon
to the latest minor release at our next monthly maintenance.  For major releases we will upgrade at our next monthly maintenance after the .1 release is out unless there is a show stopping bug that we run into in our own testing.  At which point we file a bug with SchedMD and get a patch. -Paul

Re: [slurm-users] getting fairshare

2020-12-16 Thread Paul Edmon
://docs.rc.fas.harvard.edu/kb/fairshare/ -Paul Edmon- On 12/16/2020 12:30 PM, Erik Bryer wrote: $ sshare -a              Account       User  RawShares  NormShares  RawUsage  EffectvUsage  FairShare

Re: [slurm-users] Query for minimum memory required in partition

2020-12-16 Thread Paul Edmon
> 2147483646) then     slurm.log_user("You must request more than 190GB for jobs in bigmem partition")     return 2052     end     end     end -Paul Edmon- On 12/16/202
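The quoted fragment is cut off; a self-contained sketch of the same idea in job_submit.lua (partition name, threshold, and error handling are assumptions, not the poster's exact script):

    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.partition == "bigmem" then
            -- pn_min_memory is in MB; per-CPU memory requests are not handled here
            local mem = job_desc.pn_min_memory
            if mem ~= nil and mem < 190 * 1024 then
                slurm.log_user("You must request more than 190GB for jobs in the bigmem partition")
                return slurm.ERROR
            end
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end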

Re: [slurm-users] Novice Slurm Upgrade Questions

2020-12-04 Thread Paul Edmon
It won't figure it out automatically, no.  You will need to ensure that the spec is installing to the same location as your vendor installed it if they didn't put it in the default location (/opt isn't the default). -Paul Edmon- On 12/4/2020 3:39 PM, Jason Simms wrote: Dear Ole, Thanks. I've

Re: [slurm-users] Novice Slurm Upgrade Questions

2020-12-04 Thread Paul Edmon
should either neuter this or have those both stopped during the upgrade.  After the upgrade you should run slurmdbd and slurmctld in commandline mode for the initial run. Once it is done and running normally you can kill these and restart the relevant services. -Paul Edmon- On 12/4/2020 2:36 PM

Re: [slurm-users] FairShare

2020-12-02 Thread Paul Edmon
Yup, our doc is for the classic fairshare not for fairtree. Thanks for the kudos on the doc by the way.  We are glad it is useful. -Paul Edmon- On 12/2/2020 12:45 PM, Ryan Cox wrote: That is not for Fair Tree, which is what Micheal asked about. Ryan On 12/2/20 10:32 AM, Renfro, Michael

Re: [slurm-users] job restart :: how to find the reason

2020-12-02 Thread Paul Edmon
You can dig through the slurmctld log and search for the JobID. That should tell you what Slurm was doing at the time. -Paul Edmon- On 12/2/2020 6:27 AM, Adrian Sevcenco wrote: Hi! I encountered a situation when a bunch of jobs were restarted and this is seen from Requeue=1 Restarts=1

Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Paul Edmon
That can help.  Usually this happens due to laggy storage the job is using taking time flushing the job's data.  So making sure that your storage is up, responsive, and stable will also cut these down. -Paul Edmon- On 11/30/2020 12:52 PM, Robert Kudyba wrote: I've seen where this was a bug
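The relevant knob, if you do decide to raise it (the value is just an example):

    # slurm.conf: give slow storage more time to flush before the node is drained
    UnkillableStepTimeout=300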

Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Paul Edmon
take roughly 1-2 hours for us. -Paul Edmon- On 11/2/2020 11:15 AM, Chris Samuel wrote: On 11/2/20 7:31 am, Paul Edmon wrote: e. Run slurmdbd -Dv to do the database upgrade. Depending on the upgrade this can take a while because of database schema changes. I'd like to emphasis

Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Paul Edmon
We haven't really had MPI ugliness with the latest versions. Plus we've been rolling our own PMIx and building against that which seems to have solved most of the cross compatibility issues. -Paul Edmon- On 11/2/2020 10:38 AM, Fulcomer, Samuel wrote: Our strategy is a bit simpler. We're
