[slurm-users] Partition Preemption Configuration Question

2024-05-02 Thread Jason Simms via slurm-users
Hello all, The Slurm docs have me a bit confused... I want to enable job preemption on certain partitions but not others. I *presume* I would set PreemptType=preempt/partition_prio globally, but then on the partitions where I don't want jobs to be preempted, I would set
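
For reference, a minimal slurm.conf sketch of this arrangement (partition and node names are invented):

  # global settings
  PreemptType=preempt/partition_prio
  PreemptMode=REQUEUE
  # jobs here can be preempted by higher-PriorityTier partitions
  PartitionName=scavenger Nodes=node[01-10] PriorityTier=1 PreemptMode=REQUEUE
  # PreemptMode=OFF exempts this partition's jobs from preemption
  PartitionName=owners Nodes=node[01-10] PriorityTier=10 PreemptMode=OFF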

[slurm-users] Re: Trying to Track Down root Usage

2024-04-29 Thread Jason Simms via slurm-users
user root in place? > > sreport accounts resources reserved for a user as well (even if not > used by jobs) while sacct reports job accounting only. > > Best regards > Jürgen > > > * Jason Simms via slurm-users [240429 > 10:47]: > > Hello all, > > > > E
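
In concrete terms, the distinction Jürgen draws might look like this (dates are placeholders):

  # reserved + used time per user, as sreport counts it
  sreport cluster AccountUtilizationByUser start=2024-04-22 end=2024-04-29 -t hours
  # job-level accounting only, as sacct counts it
  sacct -a -X -S 2024-04-22 -E 2024-04-29 --format=User,JobID,Elapsed,CPUTimeRAW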

[slurm-users] Trying to Track Down root Usage

2024-04-29 Thread Jason Simms via slurm-users
Hello all, Each week, I generate an automated report of the top users by CPU hours. This week, for whatever reason, the user root accounted for a massive number of hours:
  Login  Proper Name  Used

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Jason Simms via slurm-users
As a related point, for this reason I mount /var/log separately from /. Ask me how I learned that lesson... Jason On Tue, Apr 16, 2024 at 8:43 AM Jeffrey T Frey via slurm-users < slurm-users@lists.schedmd.com> wrote: > AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n" > is
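
The idea, as an /etc/fstab sketch (device name is hypothetical); a runaway log can then fill only its own filesystem rather than /:

  /dev/mapper/vg0-varlog  /var/log  xfs  defaults,nodev,nosuid,noexec  0 2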

[slurm-users] Re: Enforcing relative resource restrictions in submission script

2024-02-28 Thread Jason Simms via slurm-users
Hello Matthew, You may be aware of this already, but most sites would make these kinds of checks/validations using job_submit.lua. I'm not an expert in that - though plenty of others on this list are - but I'm positive you could implement this type of validation logic. I'd like to say that I've
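
A minimal job_submit.lua sketch of this kind of check - the rule itself is an invented example, and job_desc field names vary somewhat across Slurm versions:

  function slurm_job_submit(job_desc, part_list, submit_uid)
     -- hypothetical rule: cap requested memory per CPU at 4 GB
     if job_desc.min_mem_per_cpu ~= nil and job_desc.min_mem_per_cpu > 4096 then
        slurm.log_user("Please request no more than 4 GB per CPU")
        return slurm.ERROR
     end
     return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
     return slurm.SUCCESS
  end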

[slurm-users] Re: pty jobs are killed when another job on the same node terminates

2024-02-28 Thread Jason Simms via slurm-users
Hello Thomas, I know I'm a few days late to this, so I'm wondering whether you've made any progress. We experience this, too, but in a different way. First, though, you may be aware, but you should use salloc rather than srun --pty for an interactive session. That's been the preferred method for
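
For reference, the salloc form of an interactive session (resource numbers arbitrary):

  salloc --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=8G --time=02:00:00

On recent Slurm, setting LaunchParameters=use_interactive_step in slurm.conf makes salloc open the shell on the allocated compute node rather than on the login node.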

[slurm-users] Re: Question about IB and Ethernet networks

2024-02-25 Thread Jason Simms via slurm-users
Hello Daniel, In my experience, if you have a high-speed interconnect such as IB, you would do IPoIB. You would likely still have a "regular" Ethernet connection for management purposes, and yes that means both an IB switch and an Ethernet switch, but that switch doesn't have to be anything
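
If it helps, a hedged sketch of bringing up IPoIB with NetworkManager (interface name and addressing are invented):

  nmcli connection add type infiniband ifname ib0 con-name ib0 \
        ipv4.method manual ipv4.addresses 10.10.0.101/24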

[slurm-users] Recover Batch Script Error

2024-02-16 Thread Jason Simms via slurm-users
Hello all, I've used the "scontrol write batch_script" command to output the job submission script from completed jobs in the past, but for some reason, no matter which job I specify, it tells me it is invalid. Any way to troubleshoot this? Alternatively, is there another way - even if a manual
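
For reference, the two retrieval paths (job ID is a placeholder):

  # works only while slurmctld still holds the job record (i.e. within MinJobAge)
  scontrol write batch_script 12345 job12345.sh
  # works for completed jobs, but only if AccountingStoreFlags=job_script
  # was set in slurm.conf (Slurm 21.08+) so slurmdbd stored the script
  sacct -j 12345 --batch-script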

[slurm-users] GPU Card Reservation?

2023-12-15 Thread Jason Simms
Hello all, At least at one point, I understood that it was not particularly possible, or at least not elegant, to provide priority preempt access to a specific GPU card. So, if a node has 4 GPUs, a researcher can preempt one or more of them as needed. Is this still the case? Or is there a

Re: [slurm-users] cpus-per-task behaviour of srun after 22.05

2023-10-22 Thread Jason Simms
Hello Michael, I don't have an elegant solution, but I'm writing mostly to +1 this. I didn't catch this in the release notes but am concerned if it is indeed the new behavior. Researchers use scripts that rely on --cpus-per-task (or -c) as part of, e.g., SBATCH directives. I suppose you could
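
The workaround most often cited for the 22.05 change - a sketch assuming a threaded single-task job:

  #SBATCH --cpus-per-task=8
  # srun no longer inherits --cpus-per-task from sbatch; re-export it
  export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
  srun ./my_threaded_app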

[slurm-users] Clean Up Scratch After Failed Job

2023-10-10 Thread Jason Simms
Hello all, Our template scripts for Slurm include a workflow to copy files to a scratch space prior to running a job, and then copying any output files, etc. back to the original submit directory on job completion, and then finally cleaning up (deleting) the scratch space before exiting. This
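
One common pattern is a shell trap, so cleanup also runs when the job fails partway - a sketch with invented paths:

  #!/bin/bash
  SCRATCH=/scratch/$USER/$SLURM_JOB_ID
  mkdir -p "$SCRATCH"
  # on any exit (success or failure): copy results back, then delete scratch
  trap 'cp -r "$SCRATCH"/out* "$SLURM_SUBMIT_DIR"/ 2>/dev/null; rm -rf "$SCRATCH"' EXIT
  cp input.dat "$SCRATCH"/ && cd "$SCRATCH"
  ./run_analysis input.dat

Note that a job killed with SIGKILL (e.g. a hard timeout after the grace period) never runs the trap, so a periodic sweep for orphaned scratch directories is still worthwhile.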

Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread Jason Simms
I personally don't think that we should assume users will always know which partitions are available to them. Ideally, of course, they would, but I think users should be able to submit a list of partitions they are willing to run their jobs on, and if one is forbidden
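
For reference, sbatch already accepts such a list and uses the partition that can start the job earliest:

  sbatch --partition=gpu,standard,scavenger job.sh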

Re: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

2023-09-08 Thread Jason Simms
Hello John, I also am keen to follow your progress, as this is something we would find extremely useful as well. Regards, Jason On Fri, Sep 8, 2023 at 4:47 AM John Snowdon wrote: > I've been needing to do this as part of some analysis work we are > undertaking to determine requirements for a

Re: [slurm-users] Decreasing time limit of running jobs (notification)

2023-07-06 Thread Jason Simms
a running job? > > > > > > On Thu, 6 Jul 2023, 18:16 Jason Simms, wrote: > >> An unfortunate example of the “with great power comes great >> responsibility” maxim. Linux will gleefully let you rm -fr your entire >> system, drop production databases, etc., p

Re: [slurm-users] Decreasing time limit of running jobs (notification)

2023-07-06 Thread Jason Simms
r than the time the >> job had already run, so it killed it immediately? >> >> On Jul 6, 2023, at 12:04 PM, Jason Simms wrote: >> >> No, not a bug, I would say. When the time limit is reached, that's it, >> job dies. I wouldn't be aware of a way to manage th

Re: [slurm-users] Decreasing time limit of running jobs (notification)

2023-07-06 Thread Jason Simms
No, not a bug, I would say. When the time limit is reached, that's it, job dies. I wouldn't be aware of a way to manage that. Once the time limit is reached, it wouldn't be a hard limit if you then had to notify the user and then... what? How long would you give them to extend the time? Wouldn't
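
For reference, the operation under discussion (job ID and limit are placeholders); if the new limit is shorter than the time the job has already run, Slurm kills the job right away:

  scontrol update JobId=12345 TimeLimit=02:00:00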

Re: [slurm-users] Distribute a single node resources across multiple partitons

2023-07-06 Thread Jason Simms
Hello Purvesh, I'm not an expert in this, but I expect a common question would be: why do you want to do this? More information would be helpful. On the surface, it seems like you could just allocate two full nodes to each partition. You must have a reason why that is unacceptable, however.
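
If splitting a node really is required, MaxCPUsPerNode is one knob for it - a slurm.conf sketch with invented names:

  NodeName=node01 CPUs=64 RealMemory=256000
  PartitionName=partA Nodes=node01 MaxCPUsPerNode=32
  PartitionName=partB Nodes=node01 MaxCPUsPerNode=32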

Re: [slurm-users] SLUG '23 Registration and Call for Papers

2023-05-30 Thread Jason Simms
Hello Victoria, Sorry to hear that remote attendance is not possible. Is it safe to assume, however, that it will be archived and viewable after the event? Warmest regards, Jason On Tue, May 30, 2023 at 3:00 PM Victoria Hobson wrote: > Hi Sean, > > That is correct. There will be no remote or

[slurm-users] Inaccurate Preemption Notification?

2023-04-24 Thread Jason Simms
Hello all, A user received an email from Slurm that one of his jobs was preempted. Normally when a job is preempted, the logs will show something like this:
  [2023-03-30T08:19:16.535] [25538.batch] error: *** JOB 25538 ON node07 CANCELLED AT 2023-03-30T08:19:16 DUE TO PREEMPTION ***

Re: [slurm-users] Resource LImits

2023-04-20 Thread Jason Simms
Hello Ole and Hoot, First, Hoot, thank you for your question. I've managed Slurm for a few years now and still feel like I don't have a great understanding about managing or limiting resources. Ole, thanks for your continued support of the user community with your documentation. I do wish not

Re: [slurm-users] Odd prolog Error?

2023-04-11 Thread Jason Simms
: >> 1) /opt/slurm/prolog.sh exists on the node(s) >> 2) the slurmd user is able to execute it >> >> I would connect to the node and try to run the command as the slurmd user. >> Also, ensure the user exists on the node, however you are propagating the >> uids.

Re: [slurm-users] Odd prolog Error?

2023-04-11 Thread Jason Simms
the slurmd user. > Also, ensure the user exists on the node, however you are propagating the > uids. > > Brian Andrus > > On 4/11/2023 9:48 AM, Jason Simms wrote: > > Hello all, > > Regularly I'm seeing array jobs fail, and the only log info from the > compute node is

[slurm-users] Odd prolog Error?

2023-04-11 Thread Jason Simms
Hello all, Regularly I'm seeing array jobs fail, and the only log info from the compute node is this:
  [2023-04-11T11:41:12.336] error: /opt/slurm/prolog.sh: exited with status 0x0100
  [2023-04-11T11:41:12.336] error: [job 26090] prolog failed status=1:0
  [2023-04-11T11:41:12.336] Job 26090 already
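
Status 0x0100 is a wait(2)-style code: exit status 1, no signal. One way to reproduce what slurmd does, assuming the slurmd user is named slurm:

  sudo -u slurm bash /opt/slurm/prolog.sh; echo "exit: $?"

Keep in mind the real prolog runs with various SLURM_* environment variables set, so a clean-shell run may behave differently.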

Re: [slurm-users] Troubles with cgroups

2023-03-21 Thread Jason Simms
e. > And if time permits we will check if it can be triggered with a vanilla > kernel. > > Regards, > Hermann > > On 3/17/23 21:34, Jason Simms wrote: > > Hello, > > > > This isn't precisely related, but I can say that we were having strange > > issu

Re: [slurm-users] Troubles with cgroups

2023-03-17 Thread Jason Simms
Hello, This isn't precisely related, but I can say that we were having strange issues with system load spiking to the point that the nodes became unresponsive and likewise needed a hard reboot. After several tests and working with our vendor, on nodes where we entirely disabled swap, the problems

Re: [slurm-users] priority access and QoS

2023-02-27 Thread Jason Simms
Hello all, I haven't found any guidance that seems to be the current "better practice," but this does seem to be a common use case. I imagine there are multiple ways to accomplish this goal. For example, you could assuredly do it with QoS, but you can likely also accomplish this with some other

Re: [slurm-users] Interactive jobs using "srun --pty bash" and MPI

2022-11-03 Thread Jason Simms
Oh hey this is fun, thanks for sharing. I hadn't seen this, but it works as advertised. Jason On Thu, Nov 3, 2022 at 12:31 AM Christopher Samuel wrote: > On 11/2/22 4:45 pm, Juergen Salk wrote: > > > However, instead of using `srun --pty bash´ for launching interactive > jobs, it > > is now
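
The preview is cut off, but the trick being praised here is likely (an assumption on my part) the interactive-step setting, after which a bare salloc drops you into a shell on the allocated node:

  # slurm.conf
  LaunchParameters=use_interactive_step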

Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9

2022-09-08 Thread Jason Simms
The oversight is perhaps understandable, since for most software, a given version XX.YY would be major version XX, minor version YY. But with Slurm it’s major version XX.YY and minor is XX.YY.zz On Thu, Sep 8, 2022 at 2:43 PM Ole Holm Nielsen wrote: > Paul is right! You may upgrade 18.08 to

[slurm-users] Heterogeneous GPU Node?

2022-06-23 Thread Jason Simms
Hello all, Slightly OT, but I'm hoping the hive mind here can share some advice. We have a GPU node with three RTX8000 GPUs installed. The node has a capacity of 8 cards in total. I have a researcher who possibly wants to add an A100. I recall asking our vendor a while back whether it's
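
If the mixed node does go ahead, gres.conf supports one line per GPU type - a sketch with assumed device paths:

  NodeName=gpu01 Name=gpu Type=rtx8000 File=/dev/nvidia[0-2]
  NodeName=gpu01 Name=gpu Type=a100 File=/dev/nvidia3

with the matching slurm.conf node entry carrying Gres=gpu:rtx8000:3,gpu:a100:1.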

Re: [slurm-users] Incorrect Number of GPUs?

2021-07-26 Thread Jason Simms
nge a NodeName line. > "scontrol reconfigure" doesn't do the trick. > > On Mon, Jul 26, 2021 at 12:49 PM Fulcomer, Samuel < > samuel_fulco...@brown.edu> wrote: > >> If you have a dual-root PCIe system you may need to specify the CPU/core >> affinity in gres.conf
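
For the record, picking up NodeName/gres.conf changes generally means restarting the daemons rather than reconfiguring:

  systemctl restart slurmctld   # on the controller
  systemctl restart slurmd      # on the affected node(s)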

[slurm-users] Incorrect Number of GPUs?

2021-07-26 Thread Jason Simms
Hello all, I have a GPU node with 3 identical GPUs (we started with two and recently added the third). Running nvidia-smi correctly shows that all three are recognized. My gres.conf file has only this line:
  NodeName=gpu01 File=/dev/nvidia[0-2] Type=quadro_8000 Name=gpu Count=3
And the relevant

[slurm-users] Priority Access to GPU?

2021-07-12 Thread Jason Simms
Dear all, I feel like I've attempted to track this down before but have never fully understood how to accomplish this. I have a GPU node with three GPU cards, one of which was purchased by a user. I want to provide priority access for that user to the card, while still allowing it to be used by

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Jason Simms
ant to make use of the > cluster. Let's keep the discussion on how to get slurm to do it, if that's > possible. > > On Fri, Jun 4, 2021 at 11:13 AM Jason Simms wrote: > >> Unpopular opinion: remove the failing GPU. >> >> JLS >> >> On Fri, Jun 4, 2021 at

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Jason Simms
Unpopular opinion: remove the failing GPU. JLS On Fri, Jun 4, 2021 at 2:07 PM Ahmad Khalifa wrote: > Because there are failing GPUs that I'm trying to avoid. > > On Fri, Jun 4, 2021 at 5:04 AM Stephan Roth > wrote: > >> On 03.06.21 07:11, Ahmad Khalifa wrote: >> > How to send a job to a

[slurm-users] QOS or Priority Access to GPU/GRES?

2021-04-27 Thread Jason Simms
Hello all, As usual, I have a super basic question, so thank you for your patience. I want to verify the correct syntax to configure a GPU for priority preempt access via a QOS, much like we are currently doing for a specified number of cores. When I have created a QOS in the past, I've so far
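
One plausible shape for it, hedged since the preview cuts off (QOS name and values are invented; QOS-based preemption also requires PreemptType=preempt/qos in slurm.conf):

  sacctmgr add qos gpu-owner
  sacctmgr modify qos gpu-owner set Priority=1000 GrpTRES=gres/gpu=1 \
           Preempt=normal PreemptMode=requeue
  sacctmgr modify user name=owner1 set qos+=gpu-owner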

Re: [slurm-users] Managing Multiple Dependencies

2021-03-03 Thread Jason Simms
> complex dependencies in workflows in other contexts. Snakemake should > support slurm. > > HTH, > Jan > > > On 02-03-2021 20:16, Jason Simms wrote: > > Hello all, > > > > I am relatively new to the nuances of handling complex dependencies in > > Slurm, so

[slurm-users] Managing Multiple Dependencies

2021-03-02 Thread Jason Simms
Hello all, I am relatively new to the nuances of handling complex dependencies in Slurm, so I'm hoping the hive mind can help. I have a user wanting to accomplish the following:
- submit one job
- submit multiple jobs that are dependent on the output from the first job (so they just
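
The usual building blocks for this, for reference:

  jid1=$(sbatch --parsable first.sh)
  # several jobs that each wait on the first
  jid2=$(sbatch --parsable --dependency=afterok:$jid1 second.sh)
  jid3=$(sbatch --parsable --dependency=afterok:$jid1 third.sh)
  # a final job that waits on all of them
  sbatch --dependency=afterok:$jid2:$jid3 final.sh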

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-26 Thread Jason Simms
We’re in the same boat. Extremely small cluster. $10k for support. We don’t need nearly that level of engagement, but there ya go. We’ve passed for now, but I’d like to have a support contract ideally. Jason On Tue, Jan 26, 2021 at 2:49 PM Robert Kudyba wrote: > > > On Mon, Jan 25, 2021 at
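
For anyone hitting the thread's original problem, the fix is an exclude line in the EPEL repo definition, e.g. in /etc/yum.repos.d/epel.repo:

  [epel]
  ...
  exclude=slurm*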

[slurm-users] QOS Verification and Management

2021-01-20 Thread Jason Simms
Dear all, I have two users on our cluster who "bought into" it, much like a condo model, by purchasing one single physical node each. For those users, I have attempted to configure two QOS levels, such that when they submit jobs and invoke the QOS, they will have preempt, priority access to

[slurm-users] Slurm Upgrade Philosophy?

2020-12-18 Thread Jason Simms
Hello all, Thanks to several helpful members on this list, I think I have a much better handle on how to upgrade Slurm. Now my question is, do most of you upgrade with each major release? I recognize that, normally, if something is working well, then don't upgrade it! In our case, we're running

Re: [slurm-users] Novice Slurm Upgrade Questions

2020-12-04 Thread Jason Simms
page as well: > https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms > > I hope this helps. > > /Ole > > > On 04-12-2020 20:36, Jason Simms wrote: > > Thank you for being such a helpful resource for All Things Slurm; I > > sincerely appreciate the helpful feed

[slurm-users] Novice Slurm Upgrade Questions

2020-12-04 Thread Jason Simms
Hello all, Thank you for being such a helpful resource for All Things Slurm; I sincerely appreciate the helpful feedback. Right now, we are running 20.02 and considering upgrading to 20.11 during our next maintenance window in January. This will be the first time we have upgraded Slurm, so
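
The broad sequence, as a sketch rather than a checklist (details are in SchedMD's upgrade guide):

  # 1. back up the accounting database, then upgrade slurmdbd first
  # 2. upgrade slurmctld on the controller
  # 3. upgrade slurmd on the compute nodes
  # slurmdbd/slurmctld may run newer than slurmd, never the reverse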

Re: [slurm-users] seff Not Calculating [FIXED?]

2020-11-18 Thread Jason Simms
for everyone, and I can't figure out why. Warmest regards, Jason On Wed, Nov 18, 2020 at 12:09 PM Peter Kjellström wrote: > On Wed, 18 Nov 2020 09:15:59 -0500 > Jason Simms wrote: > > > Dear Diego, > > > > A while back, I attempted to make some edits locally to see

Re: [slurm-users] seff Not Calculating [FIXED?]

2020-11-18 Thread Jason Simms
Dear Diego, A while back, I attempted to make some edits locally to see whether I could produce "better" results. Here is a comparison of the output of your latest version, and then mine:
  [root@hpc bin]# seff 24567
  Use of uninitialized value $hash{"2"} in division (/) at /bin/seff line 108,

Re: [slurm-users] Controlling access to idle nodes

2020-10-06 Thread Jason Simms
Hello David, I'm still relatively new at Slurm, but one way we handle this is that for users/groups who have "bought in" to the cluster, we use a QOS to provide them preemptible access to the set of resources provided by, e.g., a set number of nodes, but not the nodes themselves. That is, in one

Re: [slurm-users] Simple free for all cluster

2020-10-06 Thread Jason Simms
FWIW, I define the DefaultTime as 5 minutes, which effectively means that users must define a time limit for any "real" job. It helps users get into that habit, because in the absence of a DefaultTime, most will not even bother to think critically and carefully about what time limit is actually
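
In slurm.conf terms (partition details invented):

  PartitionName=general Nodes=node[01-20] DefaultTime=00:05:00 MaxTime=7-00:00:00 State=UP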

[slurm-users] seff Not Calculating

2020-09-11 Thread Jason Simms
Hello all, I've found that when I run seff, it fails to report calculated values, e.g.:
  Nodes: 1
  Cores per node: 20
  CPU Utilized: 00:00:00
  CPU Efficiency: 0.00% of 1-11:49:40 core-walltime
  Job Wall-clock time: 01:47:29
  Memory Utilized: 0.00 MB (estimated maximum)
  Memory Efficiency: 0.00% of

[slurm-users] Priority QOS with Preempt on Some Resources?

2020-09-01 Thread Jason Simms
Hello all, I have a couple of users, each of whom has contributed funds to purchase a node for the cluster, much like a condo system. Each node has 52 cores, so I'd like to provide each user with preempt access for up to 52 cores. I can configure that easily enough with a QOS for each user with
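
A hedged sketch of the per-user core cap half of this (QOS name invented; as above, QOS preemption needs PreemptType=preempt/qos set globally):

  sacctmgr add qos condo-user1
  sacctmgr modify qos condo-user1 set GrpTRES=cpu=52 Priority=1000 \
           Preempt=normal PreemptMode=requeue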

[slurm-users] Adding Users to Slurm's Database

2020-08-18 Thread Jason Simms
Hello everyone! We have a script that queries our LDAP server for any users that have an entitlement to use the cluster, and if they don't already have an account on the cluster, one is created for them. In addition, they need to be added to the Slurm database (in order to track usage, FairShare,
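
The sacctmgr half of such a script might look like this (account and cluster names are placeholders; -i answers the confirmation prompt automatically):

  u=newuser
  sacctmgr -i add user name="$u" account=general cluster=mycluster fairshare=1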

[slurm-users] Reservation vs. Draining for Maintenance?

2020-08-06 Thread Jason Simms
Hello all, Later this month, I will have to bring down, patch, and reboot all nodes in our cluster for maintenance. The two options available to set nodes into a maintenance mode seem to be either: 1) creating a system-wide reservation, or 2) setting all nodes into a DRAIN state. I'm not sure it
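
The reservation route, for reference (times and name are placeholders):

  scontrol create reservation reservationname=maint2020 \
           starttime=2020-08-24T08:00:00 duration=480 \
           users=root flags=maint,ignore_jobs nodes=ALL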

Re: [slurm-users] Reset FairShare?

2020-07-27 Thread Jason Simms
The only value currently supported is 0 (zero). This > is a settable specification only - it cannot be used as a filter to list > accounts. > > See: > > https://slurm.schedmd.com/sacctmgr.html > > -Paul Edmon- > On 7/27/2020 2:17 PM, Jason Simms wrote: > > Dear

[slurm-users] Reset FairShare?

2020-07-27 Thread Jason Simms
Dear all, Apologies for the basic question. I've looked around online for an answer to this, and I haven't found anything that has helped accomplish exactly what I want. That said, it is also probable that what I am asking isn't a best practice, or isn't actually necessary, etc. I'd welcome any
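
As the reply above notes, the only supported value is zero - e.g.:

  # wipe accumulated fairshare usage for one account
  sacctmgr modify account where name=condo1 set RawUsage=0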

Re: [slurm-users] [EXT] Jobs Immediately Fail for Certain Users

2020-07-07 Thread Jason Simms
f Melbourne, Victoria 3010 Australia > > > > On Wed, 8 Jul 2020 at 01:14, Jason Simms wrote: > >> *UoM notice: External email. Be cautious of links, attachments, or >> impersonation attempts.* >> -- >> Hello all, >> >&

[slurm-users] Jobs Immediately Fail for Certain Users

2020-07-07 Thread Jason Simms
Hello all, Two users on my system experience job failures every time they submit a job via sbatch. When I run their exact submission script, or when I create a local system user and launch from there, the jobs run fine. Here is an example of what I see in the slurmd log: