[slurm-users] Re: [EXT] Re: SLURM configuration help

2024-04-04 Thread Renfro, Michael via slurm-users
://slurm.schedmd.com/slurm.conf.html#OPT_DefMemPerCPU From: Alison Peterson Date: Thursday, April 4, 2024 at 11:58 AM To: Renfro, Michael Subject: Re: [EXT] Re: [slurm-users] SLURM configuration help External Email Warning This email originated from outside the university. Please use caution when

[slurm-users] Re: SLURM configuration help

2024-04-04 Thread Renfro, Michael via slurm-users
What does “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID” show? On one job we currently have that’s pending due to Resources, that job has requested 90 CPUs and 180 GB of memory as seen in its ReqTRES= value, but the node it wants to run on only has 37 CPUs available (seen by

[slurm-users] Re: SLURM configuration for LDAP users

2024-02-04 Thread Renfro, Michael via slurm-users
“An LDAP user can login to the login, slurmctld and compute nodes, but when they try to submit jobs, slurmctld logs an error about invalid account or partition for user.” Since I don’t think it was mentioned below, does a non-LDAP user get the same error, or does it work by default? We don’t

Re: [slurm-users] Slurp for sw builds

2024-01-03 Thread Renfro, Michael
You can attack this in a few different stages. A lot of what you’re interested in will be found at various university or national lab sites (I Googled “sbatch example” for the one below) 1. If you’re good with doing a “make -j” to parallelize a make compilation over multiple CPUs in a

Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Renfro, Michael
Is this Northwestern’s Quest HPC or another one? I know at least a few of the people involved with Quest, and I wouldn’t have thought they’d be in dire need of coaching. And to follow on with Davide’s point, this really sounds like a case for submitting multiple jobs with dependencies between

Re: [slurm-users] Guidance on which HPC to try our "OpenHPC or TrintyX " for novice

2023-10-03 Thread Renfro, Michael
I’d probably default to OpenHPC just for the community around it, but I’ll also note that TrinityX might not have had any commits in their GitHub for an 18-month period (unless I’m reading something wrong). On Oct 3, 2023, at 5:51 AM, John Joseph wrote:  External Email Warning This email

Re: [slurm-users] extended list of nodes allocated to a job

2023-08-17 Thread Renfro, Michael
Given a job ID: scontrol show hostnames $(scontrol show job some_job_id | grep ' NodeList=' | cut -d= -f2) | paste -sd, Maybe there’s something more built-in than this, but it gets the job done. From: slurm-users on behalf of Alain O' Miniussi Date: Thursday, August 17, 2023 at 7:46 AM To:
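For reuse, the one-liner above could be wrapped in a small shell function; this is only a sketch, and the function name is illustrative:

    job_hosts() {
      # print a comma-separated list of the hosts allocated to a job
      scontrol show hostnames $(scontrol show job "$1" | grep ' NodeList=' | cut -d= -f2) | paste -sd,
    }
    # usage: job_hosts 123456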

Re: [slurm-users] On the ability of coordinators

2023-05-17 Thread Renfro, Michael
If there’s a fairshare component to job priorities, and there’s a share assigned to each user under the account, wouldn’t the light user’s jobs move ahead of any of the heavy user’s pending jobs automatically? From: slurm-users on behalf of "Groner, Rob" Reply-To: Slurm User Community List

Re: [slurm-users] Allow regular users to make reservations

2022-08-08 Thread Renfro, Michael
Going in a completely different direction than you’d planned, but for the same goal, what about making a script (shell, Python, or otherwise) that could validate all the constraints and call the scontrol program if appropriate, and then run that script via “sudo” as one of the regular users?

Re: [slurm-users] Changing a user's default account

2022-08-05 Thread Renfro, Michael
This should work: sacctmgr add user someuser account=newaccount # adds user to new account sacctmgr modify user where user=someuser set defaultaccount=newaccount # change default sacctmgr remove user where user=someuser and account=oldaccount # remove from old account From: slurm-users on

Re: [slurm-users] Sharing a GPU

2022-04-03 Thread Renfro, Michael
Someone else may see another option, but NVIDIA MIG seems like the straightforward option. That would require both a Slurm upgrade and the purchase of MIG-capable cards. https://slurm.schedmd.com/gres.html#MIG_Management Would be able to host 7 users per A100 card, IIRC. On Apr 3, 2022, at

Re: [slurm-users] Performance with hybrid setup

2022-03-13 Thread Renfro, Michael
Slurm supports an l3cache_as_socket [1] parameter in recent releases. That would make an Epyc system, for example, appear to have many more sockets than physically exist, and that should help ensure threads in a single task share a cache. You’d want to run slurmd -C on a node with that setting

Re: [slurm-users] Can job submit plugin detect "--exclusive" ?

2022-02-22 Thread Renfro, Michael
For later reference, [1] should be the (current) authoritative source on data types for the job_desc values: some strings, some numbers, some booleans. [1] https://github.com/SchedMD/slurm/blob/4c21239d420962246e1ac951eda90476283e7af0/src/plugins/job_submit/lua/job_submit_lua.c#L450 From:

Re: [slurm-users] Fairshare within a single Account (Project)

2022-02-01 Thread Renfro, Michael
mstadt Tel: +49 6151 16-21469 Alarich-Weiss-Straße 10 64287 Darmstadt Office: L2|06 410 On 1/30/22 21:14, Renfro, Michael wrote: You can. We use: sacctmgr show assoc where account=researchgroup format=user,share to see current fairshare within the account, and: sacctmgr modif

Re: [slurm-users] Fairshare within a single Account (Project)

2022-01-30 Thread Renfro, Michael
You can. We use: sacctmgr show assoc where account=researchgroup format=user,share to see current fairshare within the account, and: sacctmgr modify user where name=someuser account=researchgroup set fairshare=N to modify a particular user's fairshare within the account.
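A worked sketch of those two commands with illustrative values (user jdoe, fairshare weight 10):

    sacctmgr show assoc where account=researchgroup format=user,share
    sacctmgr modify user where name=jdoe account=researchgroup set fairshare=10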

Re: [slurm-users] how to allocate high priority to low cpu and memory jobs

2022-01-25 Thread Renfro, Michael
Since there's only 9 factors to assign priority weights to, one way around this might be to set up separate partitions for high memory and low memory jobs (with a max memory allowed for the low memory partition), and then use partition weights to separate those jobs out. From: slurm-users on

Re: [slurm-users] Questions about default_queue_depth

2022-01-12 Thread Renfro, Michael
Not answering every question below, but for (1) we're at 200 on a cluster with a few dozen nodes and around 1k cores, as per https://lists.schedmd.com/pipermail/slurm-users/2021-June/007463.html -- there may be other settings in that email that could be beneficial. We had a lot of idle

Re: [slurm-users] work with sensitive data

2021-12-17 Thread Renfro, Michael
Untested, but given a common service account with a GPG key pair, a user with a GPG key pair, and the EncFS encrypted with a password, the user could encrypt a password with their own private key and the service account's public key, and leave it alongside the EncFS. If the service account is

Re: [slurm-users] Reserving cores without immediately launching tasks on all of them

2021-11-26 Thread Renfro, Michael
/a CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s From: slurm-users On Behalf Of Renfro, Michael Sent: Friday, November 26, 2021 8:15 AM To: Slurm User Community List Subject: [EXTERNAL] Re: [slurm-users] Reserving cores without immediately launching

Re: [slurm-users] Reserving cores without immediately launching tasks on all of them

2021-11-26 Thread Renfro, Michael
The end of the MPICH section at [1] shows an example using salloc [2]. Worst case, you should be able to use the output of “scontrol show hostnames” [3] and use that data to make mpiexec command parameters to run one rank per node, similar to what’s shown at the end of the synopsis section of

Re: [slurm-users] EXTERNAL-Re: Block jobs on GPU partition when GPU is not specified

2021-09-27 Thread Renfro, Michael
return slurm.ERROR end end end Fritz Ratnasamy Data Scientist Information Technology The University of Chicago Booth School of Business 5807 S. Woodlawn Chicago, Illinois 60637 Phone: +(1) 773-834-4556 On Mon, Sep 27, 2021 at 1:40 PM Renfro, Michael mailto:ren...@tntec

Re: [slurm-users] EXTERNAL-Re: Block jobs on GPU partition when GPU is not specified

2021-09-27 Thread Renfro, Michael
m.conf/ is there any Slurm service to restart after that? Thanks again Fritz Ratnasamy Data Scientist Information Technology The University of Chicago Booth School of Business 5807 S. Woodlawn Chicago, Illinois 60637 Phone: +(1) 773-834-4556 On Sat, Sep 25, 2021 at 11:08 AM Renfro, Michael mail

Re: [slurm-users] Block jobs on GPU partition when GPU is not specified

2021-09-25 Thread Renfro, Michael
If you haven't already seen it, there's an example Lua script from SchedMD at [1], and I've got a copy of our local script at [2]. Otherwise, in the order you asked: 1. That seems reasonable, but our script just checks if there's a gres at all. I don't *think* any gres other than gres=gpu

Re: [slurm-users] Regarding job in pending state

2021-09-16 Thread Renfro, Michael
If you're not the cluster admin, you'll want to check with them, but that should be related to a limit in how many node-hours an association (a unique combination of user, cluster, partition, and account) can have in running or pending state. Further jobs would get blocked to allow others' jobs

Re: [slurm-users] estimate queue time using 'sbatch --test-only'

2021-09-15 Thread Renfro, Michael
I can imagine at least the following causing differences in the estimated time and the actual start time: * If running users have overestimated their job times, and their jobs finish earlier than expected, the original estimate will be high. * If another user's job submission gets

Re: [slurm-users] scancel gpu jobs when gpu is not requested

2021-08-26 Thread Renfro, Michael
Not a solution to your exact problem, but we document partitions for interactive, debug, and batch, and have a job_submit.lua [1] that routes GPU-reserving jobs to gpu-interactive, gpu-debug, and gpu partitions automatically. Since our GPU nodes have extra memory slots, and have tended to run

Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-10 Thread Renfro, Michael
Did Diego's suggestion from [1] not help narrow things down? [1] https://lists.schedmd.com/pipermail/slurm-users/2021-August/007708.html From: slurm-users on behalf of Jack Chen Date: Tuesday, August 10, 2021 at 10:08 AM To: Slurm User Community List Subject: Re: [slurm-users] Compact

Re: [slurm-users] Slurm Scheduler Help

2021-06-11 Thread Renfro, Michael
Not sure it would work out to 60k queued jobs, but we're using: SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200 in our setup. bf_window is driven by our 30-day max job time, bf_resolution is at 5% of that time, and the other values
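As a sketch, the same line in a slurm.conf fragment with the arithmetic from the message spelled out in comments:

    # bf_window=43200 minutes = 30 days, matching the 30-day maximum job time
    # bf_resolution=2160 is 5% of bf_window, per the message above
    SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200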

Re: [slurm-users] Kill job when child process gets OOM-killed

2021-06-09 Thread Renfro, Michael
re Munich (HMGU) - From: slurm-users On Behalf Of Renfro, Michael Sent: Tuesday, 8 June 2021 20:12 To: Slurm User Community List Subject: Re: [slurm-users] Kill job when child process gets OOM-killed Any reason *not* to create an array of 100k jobs and let the scheduler just handle things? Current ve

Re: [slurm-users] Kill job when child process gets OOM-killed

2021-06-08 Thread Renfro, Michael
Any reason *not* to create an array of 100k jobs and let the scheduler just handle things? Current versions of Slurm support arrays of up to 4M jobs, and you can limit the number of jobs running simultaneously with the '%' specifier in your array= sbatch parameter. From: slurm-users on behalf
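A minimal sketch of that pattern (the array size, the throttle of 50, and the process_item command are illustrative; arrays this large also need MaxArraySize raised in slurm.conf):

    #!/bin/bash
    #SBATCH --array=1-100000%50   # 100k tasks, at most 50 running at once
    #SBATCH --time=00:10:00       # illustrative per-task limit
    ./process_item "$SLURM_ARRAY_TASK_ID"   # placeholder for the real work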

Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-05-14 Thread Renfro, Michael
Untested, but prior experience with cgroups indicates that if things are working correctly, even if your code tries to run as many processes as you have cores, those processes will be confined to the cores you reserve. Try a more compute-intensive worker function that will take some seconds or

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-12 Thread Renfro, Michael
could inquire at. [1] https://github.com/ubccr/xdmod/releases/tag/v9.5.0-rc.4 From: Diego Zuccato Date: Wednesday, May 12, 2021 at 8:37 AM To: Renfro, Michael Cc: Slurm User Community List Subject: Re: [slurm-users] Cluster usage, filtered by partition Il 12/05/21 13:30, Diego Zuccato ha s

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-12 Thread Renfro, Michael
://xdmod.ccr.buffalo.edu/ — may be the easiest way to explore it. On May 12, 2021, at 3:52 AM, Diego Zuccato wrote: Il 11/05/21 21:20, Renfro, Michael ha scritto: In a word, nothing that's guaranteed to be stable. I got my start from this reply on the XDMoD list in November 2019. Worked on 8.0: Tks for the hint

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Renfro, Michael
usage, filtered by partition On Tue, May 11, 2021 at 5:55 AM Renfro, Michael wrote: > > XDMoD [1] is useful for this, but it’s not a simple script. It does have some > user-accessible APIs if you want some report automation. I’m using that to > create a lightning-talk-style slide at

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Renfro, Michael
XDMoD [1] is useful for this, but it’s not a simple script. It does have some user-accessible APIs if you want some report automation. I’m using that to create a lightning-talk-style slide at [2]. [1] https://open.xdmod.org/ [2] https://github.com/mikerenfro/one-page-presentation-hpc On May

Re: [slurm-users] Testing Lua job submit plugins

2021-05-06 Thread Renfro, Michael
I’ve used the structure at https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5 to handle basic test/production branching. I can isolate the new behavior down to just a specific set of UIDs that way. Factoring out code into separate functions helps, too. I’ve seen others go so

Re: [slurm-users] [External] Slurm Configuration assistance: Unable to use srun after installation (slurm on fedora 33)

2021-04-19 Thread Renfro, Michael
You'll definitely need to get slurmd and slurmctld working before proceeding further. slurmctld is the Slurm controller mentioned when you do the srun. Though there's probably some other steps you can take to make the slurmd and slurmctld system services available, it might be simpler to do the

Re: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

2021-04-16 Thread Renfro, Michael
I can't speak to what happens on node failure, but I can at least get you a greatly simplified pair of scripts that will run only one copy on each node allocated: #!/bin/bash # notarray.sh #SBATCH --nodes=28 #SBATCH --ntasks-per-node=1 #SBATCH --no-kill echo "notarray.sh is running on
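The preview cuts the script off mid-echo; a plausible completion (the echo argument and the srun payload are assumptions, not recovered text):

    #!/bin/bash
    # notarray.sh
    #SBATCH --nodes=28
    #SBATCH --ntasks-per-node=1
    #SBATCH --no-kill
    echo "notarray.sh is running on $(hostname)"
    srun hostname   # assumed payload: launches one task on each of the 28 allocated nodes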

Re: [slurm-users] derived counters

2021-04-13 Thread Renfro, Michael
I'll never miss an opportunity to plug XDMoD for anyone who doesn't want to write custom analytics for every metric. I've managed to get a little bit into its API to extract current values for number of jobs completed and the number of CPU-hours provided, and insert those into a single slide

Re: [slurm-users] [External] Autoset job TimeLimit to fit in a reservation

2021-03-30 Thread Renfro, Michael
I'd probably write a shell function that would calculate the time required, and add it as a command-line parameter to sbatch. We do a similar thing for easier interactive shells in our /etc/profile.d folder on the login node: function hpcshell() { srun --partition=interactive $@ --pty bash -i
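The preview truncates the function; a completed sketch, assuming nothing follows beyond the closing brace:

    # dropped into /etc/profile.d/ so every login shell picks it up
    function hpcshell() {
      srun --partition=interactive "$@" --pty bash -i
    }
    # usage: hpcshell --cpus-per-task=4 --mem=8G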

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Renfro, Michael
Just a starting guess, but are you certain the MATLAB script didn’t try to allocate enormous amounts of memory for variables? That’d be about 16e9 floating point values, if I did the units correctly. On Mar 15, 2021, at 12:53 PM, Chin,David wrote:  External Email Warning This email

Re: [slurm-users] Managing Multiple Dependencies

2021-03-02 Thread Renfro, Michael
There may be prettier ways, but this gets the job done. Captures the output from each sbatch command to get a job ID, colon separates the ones in the second group, and removes the trailing colon before submitting the last job: #!/bin/bash JOB1=$(sbatch job1.sh | awk '{print $NF}') echo
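A sketch of the described pattern with illustrative script names (whether the middle jobs also depend on the first one is an assumption here):

    #!/bin/bash
    JOB1=$(sbatch job1.sh | awk '{print $NF}')      # capture the first job's ID
    DEPS=""
    for script in job2.sh job3.sh job4.sh; do       # second group of jobs
      DEPS+="$(sbatch --dependency=afterok:${JOB1} ${script} | awk '{print $NF}'):"
    done
    DEPS=${DEPS%:}                                  # strip the trailing colon
    sbatch --dependency=afterok:${DEPS} final.sh    # last job waits on the whole group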

Re: [slurm-users] using resources effectively?

2020-12-16 Thread Renfro, Michael
We have overlapping partitions for GPU work and some kinds non-GPU work (both large memory and regular memory jobs). For 28-core nodes with 2 GPUs, we have: PartitionName=gpu MaxCPUsPerNode=16 … Nodes=gpunode[001-004] PartitionName=any-interactive MaxCPUsPerNode=12 …

Re: [slurm-users] FairShare

2020-12-02 Thread Renfro, Michael
Yesterday, I posted

Re: [slurm-users] Doubts with Fairshare

2020-12-01 Thread Renfro, Michael
Harvard's Arts & Sciences Research Computing group has a good explanation of these columns at https://docs.rc.fas.harvard.edu/kb/fairshare/ -- might not answer your exact question, but it does go into how the FairShare column is calculated. From: slurm-users Date: Tuesday, December 1, 2020 at

Re: [slurm-users] sbatch overallocation

2020-10-10 Thread Renfro, Michael
I think the answer depends on why you’re trying to prevent the observed behavior: * Do you want to ensure that one job requesting 9 tasks (and 1 CPU per task) can’t overstep its reservation and take resources away from other jobs on those nodes? Cgroups [1] should be able to confine the

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Renfro, Michael
From any node you can run scontrol from, what does ‘scontrol show node GPUNODENAME | grep -i gres’ return? Mine return lines for both “Gres=” and “CfgTRES=”. From: slurm-users on behalf of Sajesh Singh Reply-To: Slurm User Community List Date: Thursday, October 8, 2020 at 3:33 PM To: Slurm

Re: [slurm-users] Simple free for all cluster

2020-10-02 Thread Renfro, Michael
Depending on the users who will be on this cluster, I'd probably adjust the partition to have a defined, non-infinite MaxTime, and maybe a lower DefaultTime. Otherwise, it would be very easy for someone to start a job that reserves all cores until the nodes get rebooted, since all they have to
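A sketch of the partition change being suggested (partition name, node list, and the specific limits are illustrative):

    PartitionName=all Nodes=node[001-010] MaxTime=7-00:00:00 DefaultTime=01:00:00 State=UP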

Re: [slurm-users] Running gpu and cpu jobs on the same node

2020-09-30 Thread Renfro, Michael
I could have missed a detail on my description, but we definitely don’t enable oversubscribe, or shared, or exclusiveuser. All three of those are set to “no” on all active queues. Current subset of slurm.conf and squeue output: = # egrep '^PartitionName=(gpu|any-interactive) '

Re: [slurm-users] Limit a partition or host to jobs less than 4 cores?

2020-09-30 Thread Renfro, Michael
Untested, but a combination of a QOS with MaxTRESPerJob=cpu=X and a partition that allows or denies that QOS may work. A job_submit.lua should be able to adjust the QOS of a submitted job, too. On 9/30/20, 10:50 AM, "slurm-users on behalf of Paul Edmon" wrote: External Email Warning

Re: [slurm-users] Mocking SLURM to debug job_submit.lua

2020-09-23 Thread Renfro, Michael
Not having a separate test environment, I put logic into my job_submit.lua to use either the production settings or the ones under development or testing, based off the UID of the user submitting the job: = function slurm_job_submit(job_desc, part_list, submit_uid) test_user_table = {}

Re: [slurm-users] Question/Clarification: Batch array multiple tasks on nodes

2020-09-01 Thread Renfro, Michael
We set DefMemPerCPU in each partition to approximately the amount of RAM in a node divided by the number of cores in the node. For heterogeneous partitions, we use a lower limit, and we always reserve a bit of RAM for the OS, too. So for a 64 GB node with 28 cores, we default to 2000 M per CPU,
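As a sketch, the 64 GB / 28-core example as a slurm.conf fragment (64000 MB / 28 cores ≈ 2285 MB per core, rounded down to 2000 to leave headroom for the OS; partition and node names are illustrative):

    PartitionName=batch Nodes=node[001-040] DefMemPerCPU=2000 State=UP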

Re: [slurm-users] Jobs getting StartTime 3 days in the future?

2020-08-31 Thread Renfro, Michael
One pending job in this partition should have a reason of “Resources”. That job has the highest priority, and if your job below would delay the highest-priority job’s start, it’ll get pushed back like you see here. On Aug 31, 2020, at 12:13 PM, Holtgrewe, Manuel wrote: Dear all, I'm seeing

Re: [slurm-users] Adding Users to Slurm's Database

2020-08-18 Thread Renfro, Michael
The PowerShell script I use to provision new users adds them to an Active Directory group for HPC, ssh-es to the management node to do the sacctmgr changes, and emails the user. Never had it fail, and I've looped over entire class sections in PowerShell. Granted, there are some inherent delays

Re: [slurm-users] scheduling issue

2020-08-14 Thread Renfro, Michael
We’ve run a similar setup since I moved to Slurm 3 years ago, with no issues. Could you share partition definitions from your slurm.conf? When you see a bunch of jobs pending, which ones have a reason of “Resources”? Those should be the next ones to run, and ones with a reason of “Priority” are

Re: [slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-08-07 Thread Renfro, Michael
I’ve only got 2 GPUs in my nodes, but I’ve always used non-overlapping CPUs= or COREs= settings. Currently, they’re: NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia[0-1] COREs=0-7,9-15 and I’ve got 2 jobs currently running on each node that’s available. So maybe: NodeName=c0005

Re: [slurm-users] Correct way to give srun and sbatch different MaxTime values?

2020-08-04 Thread Renfro, Michael
Untested, but you should be able to use a job_submit.lua file to detect if the job was started with srun or sbatch: * Check with (job_desc.script == nil or job_desc.script == '') * Adjust job_desc.time_limit accordingly Here, I just gave people a shell function "hpcshell", which

Re: [slurm-users] Internet connection loss with srun to a node

2020-08-02 Thread Renfro, Michael
Probably unrelated to slurm entirely, and most likely has to do with lower-level network diagnostics. I can guarantee that it’s possible to access Internet resources from a compute node. Notes and things to check: 1. Both ping and http/https are IP protocols, but are very different (ping isn’t

Re: [slurm-users] slurm array with non-numeric index values

2020-07-15 Thread Renfro, Michael
If the 500 parameters happened to be filenames, you could adapt something like the following (appropriated from somewhere else, but I can’t find the reference quickly): = #!/bin/bash # get count of files in this directory NUMFILES=$(ls -1 *.inp | wc -l) # subtract 1 as we have to use zero-based indexing (first
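The preview truncates the script; a sketch of the usual pattern, split into a submit wrapper and the array job itself (my_program is a placeholder):

    #!/bin/bash
    # submit.sh: size the array from the number of *.inp files (zero-based indexing)
    NUMFILES=$(ls -1 *.inp | wc -l)
    ZBNUMFILES=$((NUMFILES - 1))
    sbatch --array=0-${ZBNUMFILES} job.sh

    #!/bin/bash
    # job.sh: map this task's index back to a filename
    FILES=(*.inp)
    INPUT=${FILES[$SLURM_ARRAY_TASK_ID]}
    ./my_program "$INPUT"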

Re: [slurm-users] CPU allocation for the GPU jobs.

2020-07-13 Thread Renfro, Michael
“The SchedulerType configuration parameter specifies the scheduler plugin to use. Options are sched/backfill, which performs backfill scheduling, and sched/builtin, which attempts to schedule jobs in a strict priority order within each partition/queue.”

Re: [slurm-users] runtime priority

2020-06-30 Thread Renfro, Michael
There’s a --nice flag to sbatch and srun, at least. Documentation indicates it decreases priority by 100 by default. And untested, but it may be possible to use a job_submit.lua [1] to adjust nice values automatically. At least I can see a nice property in [2], which I assume means it'd be

Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-15 Thread Renfro, Michael
On Sat, Jun 13, 2020, 20:37 Renfro, Michael mailto:ren...@tntech.edu>> wrote: Will probably need more information to find a solution. To start, do you have separate partitions for GPU and non-GPU jobs? Do you have nodes without GPUs? On Jun 13, 2020, at 12:28 AM, navin srivastava mail

Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-13 Thread Renfro, Michael
Will probably need more information to find a solution. To start, do you have separate partitions for GPU and non-GPU jobs? Do you have nodes without GPUs? On Jun 13, 2020, at 12:28 AM, navin srivastava wrote: Hi All, In our environment we have GPU. so what i found is if the user having

Re: [slurm-users] Fairshare per-partition?

2020-06-12 Thread Renfro, Michael
I think that’s correct. From notes I’ve got for how we want to handle our fairshare in the future: Setting up a funded account (which can be assigned a fairshare): sacctmgr add account member1 Description="Member1 Description" FairShare=N Adding/removing a user to/from the funded

Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Renfro, Michael
subscribe should be sufficient. > If you can't spare a single node then a VM would do the job. > > -Paul Edmon- > > On 6/11/2020 9:28 AM, Renfro, Michael wrote: >> That’s close to what we’re doing, but without dedicated nodes. We have three >> back-end partitions (interacti

Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Renfro, Michael
That’s close to what we’re doing, but without dedicated nodes. We have three back-end partitions (interactive, any-interactive, and gpu-interactive), but the users typically don’t have to consider that, due to our job_submit.lua plugin. All three partitions have a default of 2 hours, 1 core, 2

Re: [slurm-users] Slurm Job Count Credit system

2020-06-01 Thread Renfro, Michael
Even without the slurm-bank system, you can enforce a limit on resources with a QOS applied to those users. Something like: = sacctmgr add qos bank1 flags=NoDecay,DenyOnLimit sacctmgr modify qos bank1 set grptresmins=cpu=1000 sacctmgr add account bank1 sacctmgr modify account name=bank1
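The preview cuts off mid-command; a sketch of how such a bank setup typically continues (the qos assignment and the user addition are assumptions, and jdoe is illustrative):

    sacctmgr add qos bank1 flags=NoDecay,DenyOnLimit
    sacctmgr modify qos bank1 set grptresmins=cpu=1000    # 1000 CPU-minutes of credit
    sacctmgr add account bank1
    sacctmgr modify account name=bank1 set qos=bank1      # assumed continuation
    sacctmgr add user jdoe account=bank1                  # assumed: attach users to the bank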

Re: [slurm-users] Ubuntu Cluster with Slurm

2020-05-13 Thread Renfro, Michael
I’d compare the RealMemory part of ’scontrol show node abhi-HP-EliteBook-840-G2’ to the RealMemory part of your slurm.conf: > Nodes which register to the system with less than the configured resources > (e.g. too little memory), will be placed in the "DOWN" state to avoid > scheduling jobs on

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-09 Thread Renfro, Michael
restart. Thanks. > On May 8, 2020, at 11:47 AM, Renfro, Michael wrote: > > Working on something like that now. From an SQL export, I see 16 jobs from > my user that have a state of 7. Both states 3 and 7 show up as COMPLETED in > sacct, and may also have some duplicate job en

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
f,to,pr" > # Get Slurm individual job accounting records using the "sacct" command > sacct $partitionselect -n -X -a -S $start_time -E $end_time -o $FORMAT > -s $STATE > > There are numerous output fields which you can inquire, see "sacct -e". > > /Ole

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
still get counted against the user's current requests. From: Ole Holm Nielsen Sent: Friday, May 8, 2020 9:27 AM To: slurm-users@lists.schedmd.com Cc: Renfro, Michael Subject: Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
re printed in detail by showuserlimits. These tools are available from https://github.com/OleHolmNielsen/Slurm_tools /Ole On 08-05-2020 15:34, Renfro, Michael wrote: > Hey, folks. I've had a 1000 CPU-day (1,440,000 CPU-minutes) GrpTRESMins > limit applied to each user for years. It generally

[slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
Hey, folks. I've had a 1000 CPU-day (1,440,000 CPU-minutes) GrpTRESMins limit applied to each user for years. It generally works as intended, but I have one user I've noticed whose usage is highly inflated relative to reality, causing the GrpTRESMins limit to be enforced much earlier than necessary:

Re: [slurm-users] Defining a default --nodes=1

2020-05-08 Thread Renfro, Michael
There are MinNodes and MaxNodes settings that can be defined for each partition listed in slurm.conf [1]. Set both to 1 and you should end up with the non-MPI partitions you want. [1] https://slurm.schedmd.com/slurm.conf.html From: slurm-users on behalf of
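A sketch of what that looks like for one partition (names are illustrative):

    PartitionName=serial Nodes=node[001-040] MinNodes=1 MaxNodes=1 State=UP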

Re: [slurm-users] how to restrict jobs

2020-05-06 Thread Renfro, Michael
> > Regards > Navin. > > > On Wed, May 6, 2020 at 7:47 PM Renfro, Michael wrote: > To make sure I’m reading this correctly, you have a software license that > lets you run jobs on up to 4 nodes at once, regardless of how many CPUs you > use? That is, you could run an

Re: [slurm-users] how to restrict jobs

2020-05-06 Thread Renfro, Michael
pecific > nodes? > i do not want to create a separate partition. > > is there any way to achieve this by any other method? > > Regards > Navin. > > > Regards > Navin. > > On Tue, May 5, 2020 at 7:46 PM Renfro, Michael wrote: > Haven’t done it yet mysel

Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-05 Thread Renfro, Michael
Aside from any Slurm configuration, I’d recommend setting up a modules [1 or 2] folder structure for CUDA and other third-party software. That handles LD_LIBRARY_PATH and other similar variables, reduces the chances for library conflicts, and lets users decide their environment on a per-job

Re: [slurm-users] how to restrict jobs

2020-05-05 Thread Renfro, Michael
ically updated the value based on usage? > > > Regards > Navin. > > > On Tue, May 5, 2020 at 7:00 PM Renfro, Michael wrote: > Have you seen https://slurm.schedmd.com/licenses.html already? If the > software is just for use inside the cluster, one Licenses= line in s

Re: [slurm-users] how to restrict jobs

2020-05-05 Thread Renfro, Michael
Have you seen https://slurm.schedmd.com/licenses.html already? If the software is just for use inside the cluster, one Licenses= line in slurm.conf plus users submitting with the -L flag should suffice. Should be able to set that license value is 4 if it’s licensed per node and you can run up
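A sketch of the two pieces (the license name abaqus and the count are illustrative):

    # slurm.conf: four seats of a cluster-local license
    Licenses=abaqus:4
    # submission: request one seat
    sbatch -L abaqus:1 job.sh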

Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-04 Thread Renfro, Michael
Assuming you need a scheduler for whatever size your user population is: do they need literal JupyterHub, or would they all be satisfied running regular Jupyter notebooks? On May 4, 2020, at 7:25 PM, Lisa Kay Weihl wrote: External Email Warning This email originated from outside the

Re: [slurm-users] one job at a time - how to set?

2020-04-30 Thread Renfro, Michael
d have to specify this when submitting, right? I.e. 'sbatch > --exclusive myjob.sh', if I understand correctly. Would there be a way to > simply enforce this, i.e. at the slurm.conf level or something? > > Thanks again! > > Rutger > > On Wed, Apr 29, 2020 at 10:06 PM Renfr

Re: [slurm-users] one job at a time - how to set?

2020-04-29 Thread Renfro, Michael
That’s a *really* old version, but https://slurm.schedmd.com/archive/slurm-15.08.13/sbatch.html indicates there’s an exclusive flag you can set. On Apr 29, 2020, at 1:54 PM, Rutger Vos wrote: . Hi, for a smallish machine that has been having degraded performance we want to implement a

Re: [slurm-users] One node is not used by slurm

2020-04-19 Thread Renfro, Michael
Someone else might see more than I do, but from what you’ve posted, it’s clear that compute-0-0 will be used only after other lower-weighted nodes are too full to accept a particular job. I assume you’ve already submitted a set of jobs requesting enough resources to fill up all the nodes, and

Re: [slurm-users] [EXTERNAL] Follow-up-slurm-users Digest, Vol 30, Issue 32

2020-04-17 Thread Renfro, Michael
Can’t speak for everyone, but I went to Slurm 19.05 some months back, and haven't had any problems with CUDA 10.0 or 10.1 (or 8.0, 9.0, or 9.1). > On Apr 17, 2020, at 8:46 AM, Lisa Kay Weihl wrote: > > External Email Warning > > This email originated from outside the university. Please use

Re: [slurm-users] Need to calculate total runtime/walltime for one year

2020-04-11 Thread Renfro, Michael
Unless I’m misreading it, you have a wall time limit of 2 days, and jobs that use up to 32 CPUs. So a total CPU time of up to 64 CPU-days would be possible for a single job. So if you want total wall time for jobs instead of CPU time, then you’ll want to use the Elapsed attribute, not CPUTime.
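A sketch of pulling wall time rather than CPU time out of sacct (the date range is illustrative):

    sacct -a -X -S 2019-04-01 -E 2020-04-01 --format=JobID,User,Elapsed,CPUTime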

Re: [slurm-users] Job are pending when plenty of resources available

2020-03-30 Thread Renfro, Michael
All of this is subject to scheduler configuration, but: what has job 409978 requested, in terms of resources and time? It looks like it's the highest priority pending job in the interactive partition, and I’d expect the interactive partition has a higher priority than the regress partition. As

Re: [slurm-users] Running an MPI job across two partitions

2020-03-23 Thread Renfro, Michael
Others might have more ideas, but anything I can think of would require a lot of manual steps to avoid mutual interference with jobs in the other partitions (allocating resources for a dummy job in the other partition, modifying the MPI host list to include nodes in the other partition, etc.).

Re: [slurm-users] Can slurm be configured to only run one job at a time?

2020-03-23 Thread Renfro, Michael
Rather than configure it to only run one job at a time, you can use job dependencies to make sure only one job of a particular type at a time. A singleton dependency [1, 2] should work for this. From [1]: #SBATCH --dependency=singleton --job-name=big-youtube-upload in any job script would
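A minimal sketch of a job script using the singleton pattern (the payload line is a placeholder; the job name is the one quoted above):

    #!/bin/bash
    #SBATCH --dependency=singleton
    #SBATCH --job-name=big-youtube-upload
    # only one job with this name and user runs at a time; later ones wait
    ./upload.sh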

Re: [slurm-users] Limit Number of Jobs per User in Queue?

2020-03-18 Thread Renfro, Michael
In addition to Sean’s recommendation, your user might want to use job arrays [1]. That’s less stress on the scheduler, and throughput should be equivalent to independent jobs. [1] https://slurm.schedmd.com/job_array.html -- Mike Renfro, PhD / HPC Systems Administrator, Information Technology

Re: [slurm-users] Upgrade paths

2020-03-11 Thread Renfro, Michael
The release notes at https://slurm.schedmd.com/archive/slurm-19.05.5/news.html indicate you can upgrade from 17.11 or 18.08 to 19.05. I didn’t find equivalent release notes for 17.11.7, but upgrades over one major release should work. > On Mar 11, 2020, at 2:01 PM, Will Dennis wrote: > >

Re: [slurm-users] Issue with "hetjob" directive with heterogeneous job submission script

2020-03-05 Thread Renfro, Michael
I’m going to guess the job directive changed between earlier releases and 20.02. A version of the page from last year [1] has no mention of hetjob, and uses packjob instead. On a related note, is there a canonical location for older versions of Slurm documentation? My local man pages are

Re: [slurm-users] Should there be a different gres.conf for each node?

2020-03-05 Thread Renfro, Michael
We have a shared gres.conf that includes node names, which should have the flexibility to specify node-specific settings for GPUs: = NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia0 COREs=0-7 NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia1 COREs=8-15 = See the

Re: [slurm-users] Problem with configuration CPU/GPU partitions

2020-02-28 Thread Renfro, Michael
When I made similar queues, and only wanted my GPU jobs to use up to 8 cores per GPU, I set Cores=0-7 and 8-15 for each of the two GPU devices in gres.conf. Have you tried reducing those values to Cores=0 and Cores=20? > On Feb 27, 2020, at 9:51 PM, Pavel Vashchenkov wrote: > > External Email

Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Renfro, Michael
If that 32 GB is main system RAM, and not GPU RAM, then yes. Since our GPU nodes are over-provisioned in terms of both RAM and CPU, we end up using the excess resources for non-GPU jobs. If that 32 GB is GPU RAM, then I have no experience with that, but I suspect MPS would be required. > On

Re: [slurm-users] Using "Nodes" on script - file ????

2020-02-12 Thread Renfro, Michael
Hey, Matthias. I’m having to translate a bit, so if I get a meaning wrong, please correct me. You should be able to set the minimum and maximum number of nodes used for jobs on a per-partition basis, or to set a default for all partitions. My most commonly used partition has:

Re: [slurm-users] Limits to partitions for users groups

2020-02-05 Thread Renfro, Michael
If you want to rigidly define which 20 nodes are available to the one group of users, you could define a 20-node partition for them, and a 35-node partition for the priority group, and restrict access by Unix group membership: PartitionName=restricted Nodes=node0[01-20] AllowGroups=ALL
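A sketch of the two partitions (node ranges and the group name are illustrative, and may not match the truncated original):

    PartitionName=restricted Nodes=node0[01-20] AllowGroups=ALL State=UP
    PartitionName=priority Nodes=node0[21-55] AllowGroups=prioritygroup State=UP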

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread Renfro, Michael
early > release of v18. > > Best regards, > David > > From: slurm-users on behalf of > Renfro, Michael > Sent: 31 January 2020 17:23:05 > To: Slurm User Community List > Subject: Re: [slurm-users] Longer queuing times for larger jobs > > I missed reading w

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread Renfro, Michael
s at the > expense of the small fry for example, however that is a difficult decision > that means that someone has got to wait longer for results.. > > Best regards, > David > From: slurm-users on behalf of > Renfro, Michael > Sent: 31 January 2020 13:27 > T

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread Renfro, Michael
Greetings, fellow general university resource administrator. Couple things come to mind from my experience: 1) does your serial partition share nodes with the other non-serial partitions? 2) what’s your maximum job time allowed, for serial (if the previous answer was “yes”) and non-serial

Re: [slurm-users] MaxJobs-limits

2020-01-29 Thread Renfro, Michael
> cgroups is the solution I suppose. > > On Tue, Jan 28, 2020 at 7:42 PM Renfro, Michael wrote: > For the first question: you should be able to define each node’s core count, > hyperthreading, or other details in slurm.conf. That would allow Slurm to > schedule (well-behaved) tas
