Re: [slurm-users] extended list of nodes allocated to a job

2023-08-17 Thread Greg Wickham
“sinfo” can expand compressed hostnames too:

$ sinfo -n lm602-[08,10] -O NodeHost -h
lm602-08
lm602-10

-Greg
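For scripting, `scontrol show hostnames lm602-[08,10]` performs the same expansion. The logic can also be sketched in Python — a minimal sketch handling only a single bracket group with comma items and zero-padded ranges (the function name is illustrative; the real Slurm hostlist syntax supports much more):

```python
import re

def expand_hostlist(expr: str) -> list:
    """Expand a compressed Slurm hostlist like 'lm602-[08,10]'.
    Minimal sketch: one bracket group, comma-separated items and
    simple zero-padded ranges only."""
    m = re.fullmatch(r"(.*)\[([^\]]+)\](.*)", expr)
    if not m:
        return [expr]  # no bracket group: already a plain hostname
    prefix, body, suffix = m.groups()
    hosts = []
    for item in body.split(","):
        if "-" in item:
            lo, hi = item.split("-")
            width = len(lo)  # preserve zero padding, e.g. '08'
            hosts.extend(f"{prefix}{i:0{width}d}{suffix}"
                         for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(f"{prefix}{item}{suffix}")
    return hosts

print(expand_hostlist("lm602-[08,10]"))  # ['lm602-08', 'lm602-10']
```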

Re: [slurm-users] [EXTERNAL] Re: slurmdbd database usage

2023-08-02 Thread Greg Wickham
Yup – Slurm is specifically tied to MySQL/MariaDB. To get around this I wrote a C++ application that extracts job records from Slurm using “sacct” and writes them into a PostgreSQL database. https://gitlab.com/greg.wickham/sminer The schema used in PostgreSQL is more
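The extraction side of this approach can be sketched as parsing `sacct`'s pipe-delimited output (`sacct -P --format=...`) into records ready for insertion into another database. This is an illustrative sketch, not the sminer implementation; the field list is an assumption:

```python
import csv
import io

def parse_sacct(pipe_output: str) -> list:
    """Parse pipe-delimited sacct output (e.g. `sacct -a -P
    --format=JobID,User,State`) into one dict per job record,
    suitable for a bulk INSERT into another database."""
    reader = csv.DictReader(io.StringIO(pipe_output), delimiter="|")
    return list(reader)

# Example input in the shape sacct -P produces (header row first).
sample = "JobID|User|State\n1001|alice|COMPLETED\n1002|bob|FAILED\n"
rows = parse_sacct(sample)
```

From here each dict maps directly onto an `INSERT ... VALUES` statement in whatever database you target.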

Re: [slurm-users] [EXTERNAL] Re: Job in "priority" status - resources available

2023-08-02 Thread Greg Wickham
Following on from what Michael said, the default Slurm configuration allocates only one job per node. If the a100_1g.10gb GRES devices are on the same node, make sure “SelectType=select/cons_res” is enabled (info at https://slurm.schedmd.com/cons_res.html) so that multiple jobs can share the node.
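A minimal slurm.conf fragment illustrating this — the node name, counts, and parameters below are assumptions for illustration, not from the post (on newer Slurm releases, select/cons_tres is the plugin intended for GPU GRES scheduling):

```
# slurm.conf — illustrative fragment; adapt names and counts to your site
SelectType=select/cons_res            # newer Slurm: select/cons_tres for GPU GRES
SelectTypeParameters=CR_Core_Memory   # share nodes at core granularity
GresTypes=gpu
NodeName=gpu01 Gres=gpu:a100_1g.10gb:7 CPUs=64 RealMemory=512000
```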

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-18 Thread Greg Wickham
ion on those. Are you just creating those files and then including them in slurm.conf? Rob From: slurm-users on behalf of Greg Wickham Sent: Wednesday, January 18, 2023 1:38 AM To: Slurm User Community List Subject: Re: [slurm-users] Maintaining slurm con

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-17 Thread Greg Wickham
Hi Rob, Slurm doesn’t have a “validate” parameter hence one must know ahead of time whether the configuration will work or not. In answer to your question – yes – on our site the Slurm configuration is altered outside of a maintenance window. Depending upon the potential impact of the change,

Re: [slurm-users] [EXTERNAL] SlurmDBD losing connection to the backend MariaDB

2022-11-01 Thread Greg Wickham
t slurmdbd? Not sure. I have intentionally run slurmdbd + mariadb on the second node because I didn't want to overload the primary slurmctld. I hope you all are getting the picture of how my setup is. Thanks, RC On 11/1/2022 10:40 AM, Greg Wickham wrote: Hi Richard, Slurmctld caches the upd

Re: [slurm-users] [EXTERNAL] SlurmDBD losing connection to the backend MariaDB

2022-10-31 Thread Greg Wickham
Hi Richard, Slurmctld caches the updates until slurmdbd comes back online. You can see how many records are pending for the database by using the “sdiag” command and looking for “DBD Agent queue size”. If this number grows significantly it means that slurmdbd isn’t available. -Greg On
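Checking that counter can be scripted for monitoring by scraping `sdiag` output. A minimal sketch — the sample text below imitates sdiag's layout and the threshold logic is an assumption, not from the post:

```python
import re

def dbd_agent_queue_size(sdiag_output: str) -> int:
    """Extract 'DBD Agent queue size' from `sdiag` output.
    A steadily growing value suggests slurmdbd is unreachable and
    slurmctld is buffering accounting records in memory."""
    m = re.search(r"DBD Agent queue size:\s*(\d+)", sdiag_output)
    if m is None:
        raise ValueError("DBD Agent queue size not found in sdiag output")
    return int(m.group(1))

# Example text in the shape sdiag prints (run e.g. via subprocess).
sample = "Server thread count: 3\nAgent queue size: 0\nDBD Agent queue size: 127\n"
queue = dbd_agent_queue_size(sample)
```

In practice you would feed this the output of `subprocess.run(["sdiag"], ...)` and alert when the value keeps climbing between samples.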

Re: [slurm-users] [EXTERNAL] Ideal NFS exported StateSaveLocation size.

2022-10-23 Thread Greg Wickham
Hi Richard, We have just over 400 nodes and the StateSaveLocation directory has ~600MB of data. The share for SlurmdSpoolDir is about 17GB used across the nodes, but this also includes logs for each node (without log files it’s < 1GB). -Greg On 24/10/2022, 07:19, "slurm-users" wrote:

Re: [slurm-users] [EXTERNAL] Re: gpu utilization of a reserved node

2022-05-07 Thread Greg Wickham
Hi Purvesh, With some caveats, you can do:

$ sacct -N -X -S -E -P --format=jobid,alloctres

and then post-process the results with a scripting language. The caveats? . . The -X above returns the job allocation, which in your case appears to be everything you need. However for a job or
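The post-processing step can be sketched by splitting the AllocTRES field into key/value pairs — useful for summing GPU allocations across jobs. A minimal sketch; the sample TRES string is illustrative:

```python
def parse_alloctres(tres: str) -> dict:
    """Split an AllocTRES string such as 'cpu=4,mem=16G,gres/gpu=2'
    (as printed by `sacct -X -P --format=jobid,alloctres`) into a dict."""
    out = {}
    for field in tres.split(","):
        if not field:
            continue  # tolerate empty fields / trailing commas
        key, _, val = field.partition("=")
        out[key] = val
    return out

tres = parse_alloctres("billing=4,cpu=4,mem=16G,gres/gpu=2,node=1")
```

Summing `int(parse_alloctres(row)["gres/gpu"])` over the sacct rows for a node then gives a rough GPU-allocation count for the period queried.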

Re: [slurm-users] [EXTERNAL] Re: Managing shared memory (/dev/shm) usage per job?

2022-04-06 Thread Greg Wickham
Hi John, Mark, We use a spank plugin https://gitlab.com/greg.wickham/slurm-spank-private-tmpdir (this was derived from other authors but modified for functionality required on site). It can bind tmpfs mount points into the user's cgroup allocation; additionally bind options can be provided (i.e.:

Re: [slurm-users] [EXTERNAL] how to locate the problem when slurm failed to restrict gpu usage of user jobs

2022-03-23 Thread Greg Wickham
If it's possible to see other GPUs from within a job, that means cgroups aren't being used. Look at Slurm's cgroup documentation (https://slurm.schedmd.com/cgroup.conf.html). With cgroups activated, `nvidia-smi` will only show the GPUs allocated to the job. -greg From:
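A minimal cgroup.conf fragment for this — illustrative only; consult the linked documentation for your Slurm version (slurm.conf must also set `ProctrackType=proctrack/cgroup` and `TaskPlugin=task/cgroup` for these constraints to take effect):

```
# cgroup.conf — illustrative fragment; see https://slurm.schedmd.com/cgroup.conf.html
ConstrainDevices=yes     # jobs only see the GPUs they were allocated
ConstrainCores=yes
ConstrainRAMSpace=yes
```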

Re: [slurm-users] Can job submit plugin detect "--exclusive" ?

2022-02-18 Thread Greg Wickham
Hi Chris, You mentioned “But trials using this do not seem to be fruitful so far.” . . why? In our job_submit.lua there is:

  if job_desc.shared == 0 then
    slurm.user_msg("exclusive access is not permitted with GPU jobs.")
    slurm.user_msg("Remove '--exclusive' from your job

Re: [slurm-users] [EXTERNAL] Re: Information about finished jobs

2021-06-14 Thread Greg Wickham
As others have commented, some information is lost when it is stored in the database. To keep historically accurate data, on each job run a script (refer to PrologSlurmctld in slurm.conf) that executes "scontrol show -d job " and drops the output into a local file. Using "PrologSlurmctld" is neat, as
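Such a prolog script can be sketched as below. Everything here is an assumption for illustration (archive path, file naming, and the helper name are not from the post); the command runner is injectable so the logic is testable without a live Slurm installation:

```python
import os
import subprocess

def archive_job(job_id: str, archive_dir: str, run=None) -> str:
    """Save the full `scontrol show -d job <id>` output to a local file
    and return the file's path. `run` may be overridden for testing;
    by default it shells out to scontrol."""
    if run is None:
        run = lambda jid: subprocess.run(
            ["scontrol", "show", "-d", "job", jid],
            capture_output=True, text=True, check=True,
        ).stdout
    os.makedirs(archive_dir, exist_ok=True)
    path = os.path.join(archive_dir, f"job_{job_id}.txt")
    with open(path, "w") as fh:
        fh.write(run(job_id))
    return path

# Slurm exports SLURM_JOB_ID into the PrologSlurmctld environment.
if "SLURM_JOB_ID" in os.environ:
    archive_job(os.environ["SLURM_JOB_ID"], "/var/spool/slurm/job-archive")
```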

Re: [slurm-users] [EXTERNAL] Re: Cluster usage, filtered by partition

2021-05-12 Thread Greg Wickham
Hi Diego, Disclaimer: A little bit of shameless self-promotion. We're using an application I wrote to inject Slurm accounting records into a PostgreSQL database. The data is extracted from Slurm using "sacct". From there it's possible to use SQL queries to mine the raw Slurm data.

Re: [slurm-users] Reset TMPDIR for All Jobs

2020-05-19 Thread Greg Wickham
Hi Erik, We use a private fork of https://github.com/hpc2n/spank-private-tmp It has worked quite well for us - jobs (or steps) don’t share a /tmp and during the prolog all files created for the job/step are deleted. Users absolutely cannot see each others temporary files so there’s no issue

Re: [slurm-users] QOS cutting off users before CPU limit is reached

2020-05-18 Thread Greg Wickham
Something to try . . If you restart “slurmctld” does the new QOS apply? We had a situation where slurmdbd was running as a different user than slurmctld and hence sacctmgr changes weren’t being reflected in slurmctld. -greg On 27 Apr 2020, at 12:57, Simon Andrews

[slurm-users] Musing: Can GPUs be restricted by changing ownership permissions?

2019-11-03 Thread Greg Wickham
-GPU nodes and a plethora of 1 GPU jobs - during heavy use the user may not have access to the GPU they require). Has anyone any experience with changing GPU permissions during prolog / epilogue? thanks, -greg -- Dr. Greg Wickham Advanced Computing Infrastructure Team Lead Advanced Computing

Re: [slurm-users] Anyone built PMIX 3.1.1 against Slurm 18.08.4?

2019-01-22 Thread Greg Wickham
ico <mdidomeni...@gmail.com> wrote: i've seen the same error, i don't think it's you. but i don't know what the cause is either; i didn't have time to look into it so i backed up to pmix 2.2.1 which seems to work fine On Tue, Jan 22, 2019 at 12:56 AM Greg Wickham

[slurm-users] Anyone built PMIX 3.1.1 against Slurm 18.08.4?

2019-01-21 Thread Greg Wickham
Hi All, I’m trying to build pmix 3.1.1 against slurm 18.08.4, however in the slurm pmix plugin I get a fatal error: pmixp_client.c:147:28: error: ‘flag’ undeclared (first use in this function) PMIX_VAL_SET(>value, flag, 0); Is there something wrong with my build environment?

Re: [slurm-users] maintenance partitions?

2018-10-05 Thread Greg Wickham
od partition to drain without affecting the node status in the maint partition. I don't believe I can do this though. I believe I have to change the slurm.conf and reconfigure to add/remove nodes from one partition or the other. If anyone has a better solution, i'd like to hea