[slurm-users] Is there a way create reservations w/o being Operator or Admin?

2022-07-11 Thread David Henkemeyer
I would like to remove the restriction that users must be at least operator level to do "scontrol create reservation". So, either I could: - Change the default AdminLevel to operator. Is that possible? - Remove the restriction that a user has to be operator to create a reservation. Is

Re: [slurm-users] Question about having 2 partitions that are mutually exclusive, but have unexpected interactions

2022-05-12 Thread David Henkemeyer
t) 5000 jobs being considered, the > remaining aren't even looked at. > > Brian Andrus > On 5/12/2022 7:34 AM, David Henkemeyer wrote: > > Question for the braintrust: > > I have 3 partitions: > >- Partition A_highpri: 80 nodes >- Partition A_lowpri: same 80 nodes

[slurm-users] Question about having 2 partitions that are mutually exclusive, but have unexpected interactions

2022-05-12 Thread David Henkemeyer
Question for the braintrust: I have 3 partitions: - Partition A_highpri: 80 nodes - Partition A_lowpri: same 80 nodes - Partition B_lowpri: 10 different nodes There is no overlap between A and B partitions. Here is what I'm observing. If I fill the queue with ~20-30k jobs for

[slurm-users] How to run a job at the end of a set of jobs

2022-05-09 Thread David Henkemeyer
Epilog is a feature whereby I can run something after a single job. Is there a best practice for running a job after a set of jobs? We submit a bunch of jobs to a bunch of nodes, and after all the jobs are done, we would like to submit a "utility job" on each node, but it has to be the last
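
One possible approach to the question above, not taken from the thread itself, is a job dependency: sbatch accepts --dependency=afterany:<id>[:<id>...], which holds a job until every listed job has ended. The sketch below assumes the job IDs were captured at submission time (for example with sbatch --parsable). The helper name build_dependency and the script names work.sh and utility.sh are hypothetical, and the sketch only orders the jobs; it does not place the utility job on each node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join job IDs into an "afterany" dependency string.
build_dependency() {
    local IFS=:            # "$*" joins arguments with the first character of IFS
    echo "afterany:$*"
}

# Usage sketch (needs a Slurm cluster; work.sh and utility.sh are placeholders):
#   ids=()
#   for i in 1 2 3; do ids+=("$(sbatch --parsable work.sh)"); done
#   sbatch --dependency="$(build_dependency "${ids[@]}")" utility.sh
```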

Re: [slurm-users] Is sacct not handling quotes properly?

2022-05-04 Thread David Henkemeyer
-- sbatch --export=NONE --wrap=uname -a --exclusive So, it's storing properly; now I need to see if I can figure out how to preserve/add the quotes on the way out of the DB... David On Wed, May 4, 2022 at 11:15 AM Michael Jennings wrote: > On Wednesday, 04 May 2022, at 10:00:57 (-0700), > Davi

[slurm-users] Is sacct not handling quotes properly?

2022-05-04 Thread David Henkemeyer
I am seeing what I think might be a bug with sacct. When I do the following: *> sbatch --export=NONE --wrap='uname -a' --exclusive* *Submitted batch job 2869585* Then, I ask sacct for the SubmitLine, as such: *> sacct -j 2869586 -o

Re: [slurm-users] gres/gpu count lower than reported

2022-05-03 Thread David Henkemeyer
I have found that the "reason" field doesn't get updated after you correct the issue. For me, it's only when I move the node back to the idle state that the reason field is reset. So, assuming /dev/nvidia[0-3] is correct (I've never seen otherwise with nvidia GPUs), then try taking them

[slurm-users] Looking for examples of daily job reports

2022-04-15 Thread David Henkemeyer
All, I want to improve our daily Slurm job reports. Can anyone point me to some good examples? Currently we are reporting on several things, such as # of jobs that failed to schedule, # of jobs that failed during execution, node utilization, etc., but the report itself is pretty basic and

[slurm-users] Can I define and use custom env vars in slurm.conf?

2022-04-04 Thread David Henkemeyer
If I have a large number of heterogeneously named nodes in my cluster, and several partitions that include the same large subset of those nodes, I would love to be able to define an env var, and reference that in each partition specification. For instance, say we have the following:

[slurm-users] Why is --cpu_bind not an option for sbatch? Why only srun?

2022-03-31 Thread David Henkemeyer
We noticed that we can pass --cpu_bind into an srun commandline, but not sbatch. Why is that? Thanks David

Re: [slurm-users] Question about sbatch options: -n, and --cpus-per-task

2022-03-24 Thread David Henkemeyer
> will likely bite you in the end. E.g., the 64 thread case should do > "--cpus-per-task 64", and the launching processes in the loop should > _probably_ do "-n 64" (assuming it can handle the tasks being assigned to > different nodes). > > On Thu, Mar 24, 2022 at

Re: [slurm-users] Question about sbatch options: -n, and --cpus-per-task

2022-03-24 Thread David Henkemeyer
nce - and it is significant. > > > On Mar 24, 2022, at 12:32 PM, David Henkemeyer < > david.henkeme...@gmail.com> wrote: > > > > Assuming -N is 1 (meaning, this job needs only one node), then is there > a difference between any of these 3 flag combinations: > > > > -n

[slurm-users] Question about sbatch options: -n, and --cpus-per-task

2022-03-24 Thread David Henkemeyer
Assuming -N is 1 (meaning, this job needs only one node), then is there a difference between any of these 3 flag combinations: -n 64 (leaving cpus-per-task to be the default of 1) --cpus-per-task 64 (leaving -n to be the default of 1) --cpus-per-task 32 -n 2 As far as I can tell, there is no
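
All three combinations in the question request the same number of CPUs; what changes is how many tasks those CPUs are divided into (64 single-CPU tasks, one 64-CPU task, or two 32-CPU tasks). A small arithmetic sketch, where total_cpus is a hypothetical helper and not a Slurm command:

```shell
#!/usr/bin/env bash
# Hypothetical helper: CPUs requested = number of tasks * CPUs per task.
total_cpus() {
    echo $(( $1 * $2 ))
}

total_cpus 64 1    # -n 64 (cpus-per-task defaults to 1)
total_cpus 1 64    # --cpus-per-task 64 (-n defaults to 1)
total_cpus 2 32    # --cpus-per-task 32 -n 2
```

Each call prints 64, so the request size is identical and only the task layout differs.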

[slurm-users] Questions about default_queue_depth

2022-01-12 Thread David Henkemeyer
Hello, A few weeks ago, we tested Slurm against about 50K jobs, and observed at least one instance where a node went idle, while there were jobs on the queue that could have run on the idle node. The best guess as to why this occurred, at this point, is that the default_queue_depth was set to

[slurm-users] How to limit # of execution slots for a given node

2022-01-06 Thread David Henkemeyer
All, When my team used PBS, we had several nodes that had a TON of CPUs, so many, in fact, that we ended up setting np to a smaller value, in order to not starve the system of memory. What is the best way to do this with Slurm? I tried modifying # of CPUs in the slurm.conf file, but I noticed

[slurm-users] Bug when I run "sinfo --states=idle"

2021-10-28 Thread David Henkemeyer
Hello, I just noticed today that when I run "sinfo --states=idle", I get all the idle nodes, plus an additional node that is in the "DRAIN" state (notice how xavier6 is showing up below, even though it's not in the idle state): (! 807)-> sinfo --states=idle PARTITION AVAIL TIMELIMIT NODES

[slurm-users] Can I get the original sbatch command, after the fact?

2021-07-16 Thread David Henkemeyer
If I execute a bunch of sbatch commands, can I use sacct (or something else) to show me the original sbatch command line for a given job ID? Thanks David
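
A minimal sketch of one way to do this, assuming a Slurm release whose sacct supports the SubmitLine field (the 2022-05-04 thread in this list asks sacct for it). The wrapper name show_submit_line and the column width of 200 are illustrative choices, not part of Slurm.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the stored submit command line for one job ID.
# Assumes sacct supports the SubmitLine field on this installation.
show_submit_line() {
    # %200 widens the column so long command lines are not cut off at the default width
    sacct -j "$1" --noheader --format=SubmitLine%200
}

# Usage sketch (needs a cluster with accounting enabled):
#   show_submit_line 2869585
```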

[slurm-users] When using RequeueExit in Slurm.conf, can you limit the # of requeues?

2021-07-01 Thread David Henkemeyer
Hello, I am investigating Slurm's ability to do requeuing of jobs. I like the fact that I can set RequeueExit= in the slurm.conf file, since this will automatically requeue jobs that exit with the specified exit codes. But, is there a way to limit the # of requeues? Thanks David
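
One possible workaround, offered as a sketch and not as an answer from the list: Slurm exports SLURM_RESTART_COUNT to jobs that have been restarted or requeued, so the job script itself can stop returning a requeue-triggering exit code after a chosen number of attempts. This assumes the counter also increments for requeues caused by RequeueExit, which is worth verifying on your version. MAX_REQUEUES, the helper name final_exit_code, and the exit codes 99 and 1 are all hypothetical; 99 stands for a code listed in RequeueExit and 1 for a code that is not.

```shell
#!/usr/bin/env bash
# Hypothetical helper for use inside a batch script: choose the exit code
# to return after a failure, based on how often the job has been restarted.
final_exit_code() {
    local restarts="${SLURM_RESTART_COUNT:-0}"   # unset on the first run
    local max="${MAX_REQUEUES:-3}"               # illustrative limit
    if [ "$restarts" -lt "$max" ]; then
        echo 99    # assumed to be listed in RequeueExit: ask for another requeue
    else
        echo 1     # assumed not to be listed: fail without requeueing
    fi
}

# Usage sketch at the end of a job script (my_workload is a placeholder):
#   my_workload || exit "$(final_exit_code)"
```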

[slurm-users] New node w/ 3 GPUs is not accepting GPUs tasks

2021-06-23 Thread David Henkemeyer
Hello, I just added a 3rd node to my slurm partition (called "hsw5"), as we continue to enable Slurm in our environment. But the new node is not accepting jobs that require a GPU, despite the fact that it has 3 GPUs. The other node that has a GPU ("devops3") is accepting GPU jobs as expected.

[slurm-users] Question about adding and removing features in Slurm

2021-06-18 Thread David Henkemeyer
We are transitioning from PBS to Slurm. In PBS, we use the following syntax to add/remove properties to a node: qmgr -c "set node properties += " qmgr -c "set node properties -= " Is there a similar way to do this for Slurm? Or is it expected that the administrator will manually edit

Re: [slurm-users] Configless mode enabling issue

2021-05-07 Thread David Henkemeyer
> > ------ > *From:* slurm-users on behalf of > David Henkemeyer > *Sent:* Friday, May 7, 2021 2:41:41 PM > *To:* slurm-users@lists.schedmd.com > *Subject:* [slurm-users] Configless mode enabling issue > > Hello all. My team is enabling

[slurm-users] Configless mode enabling issue

2021-05-07 Thread David Henkemeyer
Hello all. My team is enabling slurm (version 20.11.5) in our environment, and we got a controller up and running, along with 2 nodes. Everything was working fine. However, when we try to enable configless mode, I ran into a problem. The node that has a GPU is coming up in "drained" state, and

[slurm-users] slurmd -C vs lscpu - which do I use to populate slurm.conf?

2021-04-28 Thread David Henkemeyer
I'm working on populating slurm.conf on my nodes, and I noticed that slurmd -C doesn't agree with lscpu, in all cases, and I'm not sure why. Here is what lscpu reports: Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 And here is what slurmd -C is reporting:

[slurm-users] Questions about adding new nodes to Slurm

2021-04-27 Thread David Henkemeyer
Hello, I'm new to Slurm (coming from PBS), and so I will likely have a few questions over the next several weeks, as I work to transition my infrastructure from PBS to Slurm. My first question has to do with *adding nodes to Slurm*. According to the FAQ (and other articles I've read), you need