Re: [slurm-users] extended list of nodes allocated to a job

2023-08-17 Thread Greg Wickham

“sinfo” can expand compressed hostnames too:

$ sinfo -n lm602-[08,10] -O NodeHost -h
lm602-08
lm602-10
$

   -Greg

From: slurm-users  on behalf of Alain O' 
Miniussi 
Date: Thursday, 17 August 2023 at 4:53 pm
To: Slurm User Community List 
Subject: [EXTERNAL] Re: [slurm-users] extended list of nodes allocated to a job
Hi Sean,

A colleague pointed me to the following commands:

#scontrol show hostname x[1000,1009,1029-1031]
x1000
x1009
x1029
x1030
x1031
#scontrol show hostlist x[1000,1009,1029,1030,1031]
x[1000,1009,1029-1031]
#
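
To do the same for the nodes of a specific job, the two can be combined, e.g.
(the job ID is illustrative):

$ scontrol show hostnames "$(squeue -h -j 12345 -o %N)"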


Alain Miniussi
DSI, Pôles Calcul et Genie Log.
Observatoire de la Côte d'Azur
Tél. : +33609650665

- On Aug 17, 2023, at 3:12 PM, Sean Mc Grath  wrote:

Hi Alain,

I don't know if slurm can do that natively. python-hostlist, 
https://www.nsc.liu.se/~kent/python-hostlist/,
 may provide the functionality you need. I have used it in the past to generate 
a list of hosts that can be looped over.

Hope that helps.

Sean

---
Sean McGrath
Senior Systems Administrator, IT Services


From: slurm-users  on behalf of Alain O' 
Miniussi 
Sent: Thursday 17 August 2023 13:44
To: Slurm User Community List 
Subject: [slurm-users] extended list of nodes allocated to a job

Hi,

I'm looking for a way to get the list of nodes where a given job is running in
an uncompressed way.
That is, I'd like to have node1,node2,node3 instead of node1-node3.
Is there a way to achieve that?
I need the information outside the script.

Thanks


Alain Miniussi
DSI, Pôles Calcul et Genie Log.
Observatoire de la Côte d'Azur
Tél. : +33609650665



Re: [slurm-users] [EXTERNAL] Re: slurmdbd database usage

2023-08-02 Thread Greg Wickham
Yup – Slurm is specifically tied to MySQL/MariaDB.

To get around this I wrote a C++ application that will extract job records
from Slurm using “sacct” and write them into a PostgreSQL database.

https://gitlab.com/greg.wickham/sminer

The schema used in PostgreSQL is more conducive to fast ad hoc queries than
using “sacct” directly.
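
For anyone rolling their own along the same lines, the raw extraction is
essentially just parsable “sacct” output, e.g. (dates and field list are
illustrative, not sminer's exact query):

$ sacct -a -X -P -S 2023-08-01 -E 2023-08-02 \
        --format=JobID,User,Account,Partition,AllocTRES,Submit,Start,End,State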

YMMV

   -greg

From: slurm-users  on behalf of Michael 
Gutteridge 
Date: Thursday, 3 August 2023 at 1:02 am
To: Slurm User Community List 
Subject: [EXTERNAL] Re: [slurm-users] slurmdbd database usage

Pretty sure that dog won't hunt.  There's not _just_ the tables, but I believe 
there's a bunch of queries and other database magic in slurmdbd that is 
specific to MySQL/MariaDB.

 - Michael


On Wed, Aug 2, 2023 at 2:33 PM Sandor <sansho...@gmail.com> wrote:
I am looking to track accounting and job data. Slurm requires the use of MySQL
or MariaDB. Has anyone created the needed tables within PostgreSQL and then had
slurmdbd write to it? Any problems?

Thank you in advance!
Sandor Felho


Re: [slurm-users] [EXTERNAL] Re: Job in "priority" status - resources available

2023-08-02 Thread Greg Wickham
Following on from what Michael said, the default Slurm configuration allocates
only one job per node. If the a100_1g.10gb GRES is on the same node, ensure
“SelectType=select/cons_res” is enabled (info at
https://slurm.schedmd.com/cons_res.html) to permit multiple jobs to use the
same node.

Also using “TaskPlugin=task/cgroup” is useful to ensure that users cannot
inadvertently access resources that were not allocated to them on the same node
(refer to the slurm.conf man page).
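
A quick way to check what a cluster is currently running with (the values shown
here are the ones being suggested, not necessarily your current output):

$ scontrol show config | grep -E 'SelectType|TaskPlugin'
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
TaskPlugin              = task/cgroup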

   -Greg

From: slurm-users  on behalf of Michael 
Gutteridge 
Date: Wednesday, 2 August 2023 at 5:22 pm
To: Slurm User Community List 
Subject: [EXTERNAL] Re: [slurm-users] Job in "priority" status - resources 
available
I'm not sure there's enough information in your message- Slurm version and 
configs are often necessary to make a more confident diagnosis.  However, the 
behaviour you are looking for (lower priority jobs skipping the line) is called 
"backfill".  There's docs here: 
https://slurm.schedmd.com/sched_config.html#backfill

It should be loaded and active by default which is why I'm not super confident 
here.  There may also be something else going on with the node configuration as 
it looks like 1596 would maybe need the same node?  Maybe there's not enough 
CPU or memory to accommodate both jobs (1596 and 1739)?
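
A quick way to confirm which scheduler is in use (output shown is what the
default configuration looks like):

$ scontrol show config | grep SchedulerType
SchedulerType           = sched/backfill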

HTH
 - Michael

On Wed, Aug 2, 2023 at 5:13 AM Cumer Cristiano <cristianomaria.cu...@unibz.it> wrote:

Hello,

I'm quite a newbie regarding Slurm. I recently created a small Slurm instance 
to manage our GPU resources. I have this situation:

 JOBID    STATE         TIME   ACCOUNT  PARTITION  PRIORITY     REASON  CPU  MIN_MEM  TRES_PER_NODE
  1739  PENDING         0:00  standard    gpu-low         5   Priority    1      80G  gres:gpu:a100_1g.10gb:1
  1738  PENDING         0:00  standard    gpu-low         5   Priority    1      80G  gres:gpu:a100-sxm4-80gb:1
  1737  PENDING         0:00  standard    gpu-low         5   Priority    1      80G  gres:gpu:a100-sxm4-80gb:1
  1736  PENDING         0:00  standard    gpu-low         5  Resources    1      80G  gres:gpu:a100-sxm4-80gb:1
  1740  PENDING         0:00  standard    gpu-low         1   Priority    1       8G  gres:gpu:a100_3g.39gb
  1735  PENDING         0:00  standard    gpu-low         1   Priority    8      64G  gres:gpu:a100-sxm4-80gb:1
  1596  RUNNING   1-13:26:45  standard    gpu-low         3       None    2      64G  gres:gpu:a100_1g.10gb:1
  1653  RUNNING     21:09:52  standard    gpu-low         2       None    1      16G  gres:gpu:1
  1734  RUNNING        59:52  standard    gpu-low         1       None    8      64G  gres:gpu:a100-sxm4-80gb:1
  1733  RUNNING      1:01:54  standard    gpu-low         1       None    8      64G  gres:gpu:a100-sxm4-80gb:1
  1732  RUNNING      1:02:39  standard    gpu-low         1       None    8      40G  gres:gpu:a100-sxm4-80gb:1
  1731  RUNNING      1:08:28  standard    gpu-low         1       None    8      40G  gres:gpu:a100-sxm4-80gb:1
  1718  RUNNING     10:16:40  standard    gpu-low         1       None    2       8G  gres:gpu:v100
  1630  RUNNING   1-00:21:21  standard    gpu-low         1       None    1      30G  gres:gpu:a100_3g.39gb
  1610  RUNNING   1-09:53:23  standard    gpu-low         1       None    2       8G  gres:gpu:v100



Job 1736 is in the PENDING state since there are no more available
a100-sxm4-80gb GPUs. The job priority starts to rise with time (priority 5) as
expected. Now another user submits job 1739 requesting a gres:gpu:a100_1g.10gb
GPU that is available, but the job is not starting since its priority is 1. This
is obviously not the desired outcome, and I believe I must change the scheduling
strategy. Could someone with more experience than me give me some hints?

Thanks, Cristiano


Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-18 Thread Greg Wickham
Hi Rob,

> Are you just creating those files and then including them in slurm.conf?

Yes.

We’re using puppet, but you could get the same results using jinja2.

The workflow we use is a little convoluted – the original YAML files are 
validated then JSON formatted data is written to intermediate files.

The schema of the JSON formatted files is rather verbose to match the 
capability of the template engine (this simplifies the template definition 
significantly)

When puppet runs it loads the JSON and then renders the Slurm files.

(The same YAML files are used to configure warewulf/ dnsmasq (DHCP) / bind / 
iPXE . . .)

Finally, it’s worth mentioning that the YAML files are managed by git, with a 
gitlab runner completing the validation phase before any files are published to 
production.
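
For illustration, a minimal pre-publish check of that kind might be no more
than (the file name and the use of PyYAML are assumptions):

$ python3 -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1]))' nodes.yaml \
      && echo "nodes.yaml parses cleanly"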

   -greg

From: slurm-users  on behalf of Groner, 
Rob 
Date: Wednesday, 18 January 2023 at 5:22 pm
To: Slurm User Community List 
Subject: [EXTERNAL] Re: [slurm-users] Maintaining slurm config files for test 
and production clusters
Generating the *.conf files from parseable/testable sources is an interesting 
idea.  You mention nodes.conf and partitions.conf.  I can't find any 
documentation on those.  Are you just creating those files and then including 
them in slurm.conf?

Rob


From: slurm-users  on behalf of Greg 
Wickham 
Sent: Wednesday, January 18, 2023 1:38 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] Maintaining slurm config files for test and 
production clusters


Hi Rob,

Slurm doesn’t have a “validate” parameter hence one must know ahead of time
whether the configuration will work or not.

In answer to your question – yes – on our site the Slurm configuration is
altered outside of a maintenance window.

Depending upon the potential impact of the change, it will either be made
silently (no announcement) or users are notified on Slack that there may be a
brief outage.

Slurm is quite resilient – if slurmctld is down, launching jobs will not happen
and user commands will fail. But all existing jobs will keep running.

Our users are quite tolerant as well – letting them know when a potential
change may impact their overall experience of the cluster seems to be
appreciated.

On our site the configuration files are not changed directly; rather, a
template engine is used – our Slurm configuration data is in YAML files, which
are then validated and processed to generate the slurm.conf / nodes.conf /
partitions.conf / topology.conf

This provides some surety that adding / removing nodes etc. won’t result in an
inadvertent configuration issue.

We have three clusters (one production, and two test) – all are managed the
same way.

Finally, using configuration templating it’s possible to spin up new clusters
quite quickly . . . The longest time is spent picking a new cluster name.

   -Greg



On 17/01/2023, 23:42, "slurm-users"  
wrote:



So, you have two equal sized clusters, one for test and one for production?
Our test cluster is a small handful of machines compared to our production.

We have a test slurm control node on a test cluster with a test slurmdbd host
and test nodes, all named specifically for test.  We don't want a situation
where our "test" slurm controller node is named the same as our "prod" slurm
controller node, because the possibility of mistake is too great.  ("I THOUGHT
I was on the test network")

Here's the ultimate question I'm trying to get answered: does anyone update
their slurm.conf file on production outside of an outage?  If so, how do you
KNOW the slurmctld won't barf on some problem in the file you didn't see (even
a mistaken character in there would do it)?  We're trying to move to a model
where we don't have downtimes as often, so I need to determine a reliable way
to continue to add features to Slurm without having to wait for the next
outage.  There's no way I know of to prove the slurm.conf file is good, except
by feeding it to slurmctld and crossing my fingers.

Rob



--


Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-17 Thread Greg Wickham
Hi Rob,

Slurm doesn’t have a “validate” parameter hence one must know ahead of time 
whether the configuration will work or not.

In answer to your question – yes – on our site the Slurm configuration is 
altered outside of a maintenance window.

Depending upon the potential impact of the change, it will either be made
silently (no announcement) or users are notified on Slack that there may be a
brief outage.

Slurm is quite resilient – if slurmctld is down, launching jobs will not happen 
and user commands will fail. But all existing jobs will keep running.

Our users are quite tolerant as well – letting them know when a potential 
change may impact their overall experience of the cluster seems to be 
appreciated.

On our site the configuration files are not changed directly; rather, a
template engine is used – our Slurm configuration data is in YAML files, which
are then validated and processed to generate the slurm.conf / nodes.conf /
partitions.conf / topology.conf

This provides some surety that adding / removing nodes etc. won’t result in an 
inadvertent configuration issue.

We have three clusters (one production, and two test) – all are managed the 
same way.

Finally, using configuration templating it’s possible to spin up new clusters 
quite quickly . . . The longest time is spent picking a new cluster name.

   -Greg

On 17/01/2023, 23:42, "slurm-users"  
wrote:

So, you have two equal sized clusters, one for test and one for production?  
Our test cluster is a small handful of machines compared to our production.

We have a test slurm control node on a test cluster with a test slurmdbd host 
and test nodes, all named specifically for test.  We don't want a situation 
where our "test" slurm controller node is named the same as our "prod" slurm 
controller node, because the possibility of mistake is too great.  ("I THOUGHT 
I was on the test network")

Here's the ultimate question I'm trying to get answered: does anyone update
their slurm.conf file on production outside of an outage?  If so, how do you
KNOW the slurmctld won't barf on some problem in the file you didn't see (even 
a mistaken character in there would do it)?  We're trying to move to a model 
where we don't have downtimes as often, so I need to determine a reliable way 
to continue to add features to slurm without having to wait for the next 
outage.  There's no way I know of to prove the slurm.conf file is good, except 
by feeding it to slurmctld and crossing my fingers.

Rob

--


Re: [slurm-users] [EXTERNAL] SlurmDBD losing connection to the backend MariaDB

2022-11-01 Thread Greg Wickham
Hi Richard,

While trying to respond I was looking into the manual pages, and while it does
appear that Slurm can support some kind of high availability (*), it doesn’t
seem simple.

With multiple slurmctld only one can be active at any time as they share state 
information. It’s not clear how they know about each other, so this may require 
STONITH(*).

With slurmdbd, there’s “AccountingStorageHost” and 
“AccountingStorageBackupHost”, again it’s not quite clear how these interact.

In slurmdbd.conf there is “StorageBackupHost” with the description:

. . . . It is up to the backup solution to enforce the coherency of the
accounting information between the two hosts. With clustered
database solutions (active/passive HA), you would not need to use
this feature. Default is none.
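
For orientation, the settings mentioned above live in the two configuration
files; a quick way to see what a site has set (paths and hostnames are
illustrative):

$ grep -E '^AccountingStorage(Backup)?Host' /etc/slurm/slurm.conf
AccountingStorageHost=dbd01
AccountingStorageBackupHost=dbd02
$ grep -E '^Storage(Backup)?Host' /etc/slurm/slurmdbd.conf
StorageHost=db01
StorageBackupHost=db02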

On our site we’re running only a simple setup. One VM with slurmctld and 
another VM with both slurmdbd+mariadbd.

Perhaps others who have dabbled with redundancy can reply.

   -greg

(* I say this trusting that the best way to get a response on the Internet is
to say something wrong and then wait for the avalanche of corrections.)

On 01/11/2022, 12:08, "slurm-users"  
wrote:


Hello Greg,

I have a two node set up. node1 is primary slurmctld + backup slurmdbd and 
node2 is primary slurmdbd + backup slurmctld and mysql database host.

My concern is: if node 2 goes down, then the backup slurmdbd will take over,
and then what will happen?

I have read that slurmctld can cache data, but what about slurmdbd? Not sure.

I have intentionally put slurmdbd + MariaDB on the second node because I
didn't want to overload the primary slurmctld.

I hope you all are getting the picture of how my set up is.

Thanks,

RC


On 11/1/2022 10:40 AM, Greg Wickham wrote:
Hi Richard,

Slurmctld caches the updates until slurmdbd comes back online.

You can see how many records are pending for the database by using the “sdiag” 
command and looking for “DBD Agent queue size”.

If this number grows significantly it means that slurmdbd isn’t available.

   -Greg

On 01/11/2022, 07:23, "slurm-users" <slurm-users-boun...@lists.schedmd.com>
wrote:

Hi,

Just for my info, I would like to know what happens when SlurmDBD loses
connection to the backend database, for example MariaDB.

Does it cache the accounting info and keep it until the DB comes back
up, or does it panic and shut down?

Thank you,

RC.




Re: [slurm-users] [EXTERNAL] SlurmDBD losing connection to the backend MariaDB

2022-10-31 Thread Greg Wickham
Hi Richard,

Slurmctld caches the updates until slurmdbd comes back online.

You can see how many records are pending for the database by using the “sdiag” 
command and looking for “DBD Agent queue size”.

If this number grows significantly it means that slurmdbd isn’t available.
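
For example (a queue size of zero is what a healthy system looks like):

$ sdiag | grep 'DBD Agent'
DBD Agent queue size: 0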

   -Greg

On 01/11/2022, 07:23, "slurm-users"  
wrote:

Hi,

Just for my info, I would like to know what happens when SlurmDBD loses
connection to the backend database, for example MariaDB.

Does it cache the accounting info and keep it until the DB comes back
up, or does it panic and shut down?

Thank you,

RC.



Re: [slurm-users] [EXTERNAL] Ideal NFS exported StateSaveLocation size.

2022-10-23 Thread Greg Wickham
Hi Richard,

We have just over 400 nodes and the StateSaveLocation directory has ~600MB of 
data.

The share for SlurmdSpoolDir is about 17GB used across the nodes, but this also 
includes logs for each node (without log files it’s < 1GB).
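
A quick way to check current usage on an existing system (paths are examples;
use whatever slurm.conf sets for StateSaveLocation and SlurmdSpoolDir):

$ du -sh /var/spool/slurmctld    # on the controller: StateSaveLocation
$ du -sh /var/spool/slurmd       # on a compute node: SlurmdSpoolDir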

   -Greg

On 24/10/2022, 07:19, "slurm-users"  
wrote:

Hi,

Is there a rule of thumb for the size of the directory that is NFS
exported and to be used as StateSaveLocation?

I have a two node Slurmctld setup and both will mount an NFS exported
directory as the state save location.

Let me know your thoughts.

Thanks & regards,

RC





Re: [slurm-users] [EXTERNAL] Re: gpu utilization of a reserved node

2022-05-07 Thread Greg Wickham
Hi Purvesh,

With some caveats, you can do:

$ sacct -N <nodename> -X -S <start> -E <end> -P --format=jobid,alloctres

And then post process the results with a scripting language.

The caveats? . . The -X above returns the job allocation, which in your
case appears to be everything you need. However, for a job or step that spans
multiple nodes Slurm doesn’t save in the database what specific resources were
allocated on each node.

“scontrol show job <jobid> -d” does display the node specific resource
allocations, but this information is discarded during summarisation to Slurmdbd.
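
A rough sketch of the post-processing (the node name, dates, and the assumption
that the GPU count appears as "gres/gpu=N" in AllocTRES are all illustrative):

$ sacct -N gpunode01 -X -S 2022-03-28 -E 2022-04-04 -P --noheader \
        --format=JobID,AllocTRES,ElapsedRaw |
  awk -F'|' '{
    n = 0
    if (match($2, /gres\/gpu=[0-9]+/)) n = substr($2, RSTART + 9, RLENGTH - 9)
    gpu_seconds += n * $3
  }
  END { printf "GPU hours: %.1f\n", gpu_seconds / 3600 }'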

   -Greg

From: slurm-users  on behalf of Purvesh 
Parmar 
Date: Thursday, 5 May 2022 at 5:10 am
To: slurm-users@lists.schedmd.com 
Subject: [EXTERNAL] Re: [slurm-users] gpu utilization of a reserved node
Hi,

We have a node given to a group that has both of its 2 GPUs in dedicated mode,
by setting a reservation on the node for 6 months. We want to find out the
weekly GPU-hours utilization of that particular reserved node. The node is not
in a separate partition.
The command below does not show the allocated GPU hours and also does not
cover a one-week duration.
sreport reservation utilization name=rizwan_res start=2022-03-28T10:00:00 
end=2022-04-03T10:00:00

Please help.

Regards,
Purvesh

On Sat, 30 Apr 2022 at 15:57, Purvesh Parmar <purveshp0...@gmail.com> wrote:
Hello,

We have a node given to a group that has 2 GPUs in dedicated mode, by setting
a reservation for 6 months. We want to find out the weekly GPU-hours
utilization of that particular reserved node. The node is not in a separate
partition.
The command below does not show the allocated GPU hours and also does not
cover a one-week duration.
sreport reservation utilization name=rizwan_res start=2022-03-28T10:00:00 
end=2022-04-03T10:00:00

Please help.

Regards,
Purvesh


Re: [slurm-users] [EXTERNAL] Re: Managing shared memory (/dev/shm) usage per job?

2022-04-06 Thread Greg Wickham
Hi John, Mark,

We use a spank plugin 
https://gitlab.com/greg.wickham/slurm-spank-private-tmpdir (this was derived 
from other authors but modified for functionality required on site).

It can bind tmpfs mount points to the user's cgroup allocation; additionally,
bind options can be provided (i.e. limit memory by size or by %, as supported
by tmpfs(5)).
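
For reference, the tmpfs(5) size options being referred to look like this when
used by hand (mount point and sizes are illustrative):

$ mount -t tmpfs -o size=4G  tmpfs /mnt/jobtmp    # cap by absolute size
$ mount -t tmpfs -o size=25% tmpfs /mnt/jobtmp    # cap as a % of physical RAM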

More information is in the README.md

  -Greg

On 05/04/2022, 23:17, "slurm-users"  
wrote:

I've thought-experimented this in the past, wanting to do the same thing but 
haven't found any way to get a /dev/shm or a tmpfs into a job's cgroups to be 
accounted against the job's allocation. The best I have come up with is 
creating a per-job tmpfs from a prolog, removing from epilog and setting its 
size to be some amount of memory that at least puts some restriction on how 
much damage the job could do. Another alternative is to only allow access to a 
memory filesystem if the job request is exclusive and takes the whole node. 
Crude, but effective at least to the point of preventing one job from killing 
others. If you happen to find a real solution, please post it :)
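
A rough sketch of that prolog/epilog approach (mount point and size are
assumptions, not a tested recipe):

# prolog
mkdir -p /run/shm_${SLURM_JOB_ID}
mount -t tmpfs -o size=16G tmpfs /run/shm_${SLURM_JOB_ID}
# epilog
umount /run/shm_${SLURM_JOB_ID} && rmdir /run/shm_${SLURM_JOB_ID}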

griznog

On Mon, Apr 4, 2022 at 10:19 AM Mark Coatsworth
<mark.coatswo...@vectorinstitute.ai> wrote:
Hi all,

We have a GPU cluster (Slurm 19.05.3) that typically runs large PyTorch jobs 
dependent on shared memory (/dev/shm). When our machines get busy, we often run 
into a problem where one job exhausts all the shared memory on a system, 
causing any other jobs landing there to fail immediately.

We're trying to figure out a good way to manage this resource. I know that 
Slurm counts shared memory as part of a job's total memory allocation, so we 
could use cgroups to OOM kill jobs that exceed this. But that doesn't prevent a 
user from just making a large request and exhausting it all anyway.

Does anybody have any thoughts or experience with setting real limits on shared 
memory, and either swapping it out or killing the job if this gets exceeded? 
One thought we had was to use a new generic resource (GRES). This is pretty 
easy to add in the configuration, but seems like it would be a huge task to 
write a plugin that actually enforces it.

Is this something where the Job Container plugin might be useful?

Any thoughts or suggestions would be appreciated,

Mark


Re: [slurm-users] [EXTERNAL] how to locate the problem when slurm failed to restrict gpu usage of user jobs

2022-03-23 Thread Greg Wickham
If it’s possible to see other GPUs within a job then that means that cgroups 
aren’t being used.

Look at the cgroup documentation of slurm 
(https://slurm.schedmd.com/cgroup.conf.html)

With cgroups activated an `nvidia-smi` will only show the GPU allocated to the 
job.
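
The key pieces are device constraining in cgroup.conf together with the cgroup
task plugin; a quick check (paths and output are illustrative):

$ grep -i ConstrainDevices /etc/slurm/cgroup.conf
ConstrainDevices=yes
$ scontrol show config | grep -E 'TaskPlugin|ProctrackType'
TaskPlugin              = task/cgroup
ProctrackType           = proctrack/cgroup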

   -greg

From: slurm-users  on behalf of 
taleinterve...@sjtu.edu.cn 
Date: Wednesday, 23 March 2022 at 5:50 pm
To: slurm-users@lists.schedmd.com 
Subject: [EXTERNAL] [slurm-users] how to locate the problem when slurm failed 
to restrict gpu usage of user jobs
Hi, all:

We found a problem where a Slurm job with an argument such as --gres gpu:1 is
not restricted in its GPU usage; the user can still see all GPU cards on the
allocated nodes.
Our GPU nodes have 4 cards each, with the following gres.conf:
> cat /etc/slurm/gres.conf
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia0 CPUs=0-15
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia1 CPUs=16-31
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia2 CPUs=32-47
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia3 CPUs=48-63

And for test, we submit simple job batch like:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=a100
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --gres=gpu:1
#SBATCH --reservation="gpu test"
hostname
nvidia-smi
echo end

Then in the output file nvidia-smi showed all 4 GPU cards, but we expected to
see only the 1 allocated GPU card.

The official Slurm documentation says it will set the CUDA_VISIBLE_DEVICES env
var to restrict the GPU cards available to the user. But we didn’t find such a
variable in the job environment. We only confirmed that it exists in the prolog
script environment, by adding the debug command “echo $CUDA_VISIBLE_DEVICES” to
the slurm prolog script.

So how does Slurm co-operate with the NVIDIA tools to make the job's user see
only its allocated GPU card? What are the requirements on the NVIDIA GPU
driver, CUDA toolkit or any other component for Slurm to correctly restrict GPU
usage?


Re: [slurm-users] Can job submit plugin detect "--exclusive" ?

2022-02-18 Thread Greg Wickham
Hi Chris,

You mentioned “But trials using this do not seem to be fruitful so far.” . . 
why?

In our job_submit.lua there is:

if job_desc.shared == 0 then
  slurm.user_msg("exclusive access is not permitted with GPU jobs.")
  slurm.user_msg("Remove '--exclusive' from your job submission script")
  return ESLURM_NOT_SUPPORTED
end

and testing:

$ srun --exclusive --time 00:10:00 --gres gpu:1 --pty /bin/bash -i
srun: error: exclusive access is not permitted with GPU jobs.
srun: error: Remove '--exclusive' from your job submission script
srun: error: Unable to allocate resources: Requested operation is presently 
disabled

In slurm.h the job_descriptor struct has:

uint16_t shared;/* 2 if the job can only share nodes with other
 *   jobs owned by that user,
 * 1 if job can share nodes with other jobs,
 * 0 if job needs exclusive access to the node,
 * or NO_VAL to accept the system default.
 * SHARED_FORCE to eliminate user control. */

If there’s a case where using “.shared” isn’t working please let us know.

   -Greg


From: slurm-users  on behalf of 
Christopher Benjamin Coffey 
Date: Saturday, 19 February 2022 at 3:17 am
To: slurm-users 
Subject: [EXTERNAL] [slurm-users] Can job submit plugin detect "--exclusive" ?
Hello!

The job_submit plugin doesn't appear to have a way to detect whether a user 
requested "--exclusive". Can someone confirm this? Going through the code: 
src/plugins/job_submit/lua/job_submit_lua.c I don't see anything related. 
Potentially "shared" could be possible in some way. But trials using this do 
not seem to be fruitful so far.

If a user requests --exclusive, I'd like to append "--exclude=<nodes>" onto
their job request to keep them off of certain nodes. For instance, we have our
gpu nodes in a default partition with a high priority so that jobs don't land 
on them until last. And this is the same for our highmem nodes. Normally this 
works fine, but if someone asks for "--exclusive" this will land on these nodes 
quite often unfortunately.

Any ideas? Of course, I could take these nodes out of the partition, yet I'd 
like to see if something like this would be possible.

Thanks! :)

Best,
Chris

--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167




Re: [slurm-users] [EXTERNAL] Re: Information about finished jobs

2021-06-14 Thread Greg Wickham
As others have commented, some information is lost when it is stored in the 
database.

To keep historically accurate data on the job, run a script (refer to
PrologSlurmctld in slurm.conf) that runs a "scontrol show -d job <jobid>" and
drops it into a local file.

Using " PrologSlurmctld" is neat, as it is executed on the slurmctld host when 
the job is being launched. (interestingly the job state will be shown as 
"CONFIGURING").

Side note - using "-d" with scontrol will provide accurate allocation of
resources on each node (specific CPUs, specific GPUs, and memory).
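
A minimal PrologSlurmctld sketch of that idea (the destination path is an
assumption):

#!/bin/bash
# PrologSlurmctld: snapshot the full job record at launch time
mkdir -p /var/log/slurm/job_records
scontrol show -d job "$SLURM_JOB_ID" > "/var/log/slurm/job_records/${SLURM_JOB_ID}.txt"
exit 0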

  -Greg

On 14/06/2021, 10:37, "slurm-users on behalf of Arthur Gilly" 
 wrote:

Hi all,

A related question, on my setup, scontrol show job displays the standard
output, standard error redirections as well as the wd, whereas this info is
lost after completion when sacct is required. Is this something that's
configurable so that this info is preserved with sacct?

Cheers,

A

-
Dr. Arthur Gilly
Head of Analytics
Institute of Translational Genomics
Helmholtz-Centre Munich (HMGU)
-

--



Re: [slurm-users] [EXTERNAL] Re: Cluster usage, filtered by partition

2021-05-12 Thread Greg Wickham

Hi Diego,

Disclaimer: A little bit of shameless self-promotion.

We're using an application I wrote to inject Slurm accounting records into a
PostgreSQL database. The data is extracted from Slurm using "sacct".

From there it's possible to use SQL queries to mine the raw slurm data.

https://gitlab.com/greg.wickham/sminer

This tool _only_ injects Slurm data into PostgreSQL, unlike XDMoD (which can do 
this and more).

However, a big benefit for us is that sminer can inject records into an
existing database (no need for a separate one).

CSV dumps can be obtained using native PostgreSQL commands.
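
For example, a CSV dump is a one-liner in psql (the database, table and column
names here are purely illustrative, not sminer's actual schema):

$ psql slurm_stats -c \
      "\copy (SELECT partition, sum(cpu_hours) FROM jobs GROUP BY partition) TO 'usage_by_partition.csv' CSV HEADER"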

Graphs are created using python scripts (querying the data) and then plotted 
with gnuplot.

  -Greg

—

-Original Message-
From: slurm-users  on behalf of Diego 
Zuccato 
Reply to: Slurm User Community List 
Date: Wednesday, 12 May 2021 at 11:57 am
To: Slurm User Community List , "Renfro, 
Michael" 
Subject: [EXTERNAL] Re: [slurm-users] Cluster usage, filtered by partition

Il 11/05/21 21:20, Renfro, Michael ha scritto:

> In a word, nothing that's guaranteed to be stable. I got my start from 
> this reply on the XDMoD list in November 2019. Worked on 8.0:
Tks for the hint.
XDMoD seems interesting and I'll try to have a look. But a scientific 
report w/o access to the bare numbers is definitely a no-no :)

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786




Re: [slurm-users] Reset TMPDIR for All Jobs

2020-05-19 Thread Greg Wickham
Hi Erik,

We use a private fork of https://github.com/hpc2n/spank-private-tmp

It has worked quite well for us - jobs (or steps) don’t share a /tmp and during 
the prolog all files created for the job/step are deleted.

Users absolutely cannot see each other's temporary files, so there’s no issue
even if they happen to have a hard-coded path, i.e. /tmp/myfiles.txt

   -Greg

On 12 May 2020, at 18:40, Ellestad, Erik <erik.elles...@ucsf.edu> wrote:

I wanted to set TMPDIR from /tmp to a per-job directory I create in local
/scratch/$SLURM_JOB_ID (for example).

This bug suggests I should be able to do this in a task-prolog.

https://bugs.schedmd.com/show_bug.cgi?id=2664

However adding the following to task-prolog doesn’t seem to affect the 
variables the job script is running with.

unset TMPDIR
export TMPDIR=/scratch/$SLURM_JOB_ID

It does work if it is done in the job script, rather than the task-prolog.

Am I missing something?

Erik

--
Erik Ellestad
Wynton Cluster SysAdmin
UCSF



Re: [slurm-users] QOS cutting off users before CPU limit is reached

2020-05-18 Thread Greg Wickham
Something to try . .

If you restart “slurmctld” does the new QOS apply?

We had a situation where slurmdbd was running as a different user than 
slurmctld and hence sacctmgr changes weren’t being reflected in slurmctld.
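
It can also help to compare what slurmctld itself currently has loaded against
what sacctmgr reports, e.g. (the user name is taken from the example below):

$ scontrol show assoc_mgr users=auser flags=assoc
$ scontrol show assoc_mgr flags=qos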

   -greg


On 27 Apr 2020, at 12:57, Simon Andrews <simon.andr...@babraham.ac.uk> wrote:

I’m trying to use QoS limits to dynamically change the number of CPUs a user is 
allowed to use on our cluster.  As far as I can see I’m setting the appropriate 
GrpTRES=cpu value and I can read that back, but then jobs are being stopped 
before the user has reached that limit.

In squeue I see loads of lines like:

166599    normal  nf-BISMARK_(288)   auser  PD   0:00      1 (QOSMaxCpuPerUserLimit)

..but if I run:

squeue -t running -p normal --format="%.12u %.2t %C "

Then the total for that user is 288 cores, but in the QoS configuration they 
should be allowed more.  If I run:

sacctmgr show user WithAssoc format=user%12,GrpTRES

..then I get:

auser  cpu=512

What am I missing?  Why is ‘auser’ not being allowed to use all 512 of their 
allowed CPUs before the QOS limit is kicking in?

Thanks for any help you can offer.

Simon.




[slurm-users] Musing: Can GPUs be restricted by changing ownership permissions?

2019-11-03 Thread Greg Wickham

Hi All,

We have a flotilla of GPUs all protected by cgroups.

But occasionally we have users who _must_ ssh into the node.

Of course if they ssh in, then the cgroup protection doesn’t work (yes - 
there’s a slurm plugin
to tie an ssh session to a cgroup, but that seems more problematic with 8-GPU 
nodes and
a plethora of 1 GPU jobs - during heavy use the user may not have access to the 
GPU they require).

Has anyone any experience with changing GPU permissions during prolog / 
epilogue?
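
For concreteness, the sort of thing being mused about would be a prolog/epilog
pair along these lines (device selection and ownership handling are
assumptions, not a tested approach):

# prolog: hand the allocated device to the job owner
chown "$SLURM_JOB_UID" /dev/nvidia0
# epilog: hand it back
chown root /dev/nvidia0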

thanks,

   -greg

--
Dr. Greg Wickham
Advanced Computing Infrastructure Team Lead
Advanced Computing Core Laboratory
King Abdullah University of Science and Technology
Building #1, Office #0124
greg.wick...@kaust.edu.sa +966 544 700 330
--

Re: [slurm-users] Anyone built PMIX 3.1.1 against Slurm 18.08.4?

2019-01-22 Thread Greg Wickham
I had some time this afternoon so dug into the source code (of slurm and pmix) 
and found the
issue.

In the file:

slurm-18.08.4/src/plugins/mpi/pmix/pmixp_client.c

line 147 (first instance):

PMIX_VAL_SET(&kvp->value, flag, 0);

“PMIX_VAL_SET” is a macro from /usr/include/pmix_common.h (version 2.2.1)

In version 3.1.1 it is missing.

Digging further it is pmix commit 47b8a8022a9d6cea8819c4365afd800b047c508e
(Sun Aug 12 11:27:28 2018 -0700) that removes the macros.

I don’t understand pmix enough to understand why the change was made.

  -Greg

On 22 Jan 2019, at 5:10 pm, Michael Di Domenico <mdidomeni...@gmail.com> wrote:

i've seen the same error, i don't think it's you.  but i don't know
what the cause is either, i didn't have time to look into it so i
backed up to pmix 2.2.1 which seems to work fine

On Tue, Jan 22, 2019 at 12:56 AM Greg Wickham <greg.wick...@kaust.edu.sa> wrote:


Hi All,

I’m trying to build pmix 3.1.1 against slurm 18.08.4, however in the slurm
pmix plugin I get a fatal error:

    pmixp_client.c:147:28: error: ‘flag’ undeclared (first use in this function)
    PMIX_VAL_SET(&kvp->value, flag, 0);

Is there something wrong with my build environment?

The variable ‘flag’ doesn’t appear to be defined anywhere in any file in
the directory slurm-18.08.4/src/plugins/mpi/pmix

thanks,

  -greg

--


--
Dr. Greg Wickham
Advanced Computing Infrastructure Team Lead
Advanced Computing Core Laboratory
King Abdullah University of Science and Technology
Building #1, Office #0124
greg.wick...@kaust.edu.sa +966 544 700 330
--



[slurm-users] Anyone built PMIX 3.1.1 against Slurm 18.08.4?

2019-01-21 Thread Greg Wickham

Hi All,

I’m trying to build pmix 3.1.1 against slurm 18.08.4, however in the slurm
pmix plugin I get a fatal error:

pmixp_client.c:147:28: error: ‘flag’ undeclared (first use in this function)
PMIX_VAL_SET(&kvp->value, flag, 0);
 
Is there something wrong with my build environment?

The variable ‘flag’ doesn’t appear to be defined anywhere in any file in
the directory slurm-18.08.4/src/plugins/mpi/pmix

thanks,

   -greg

--

Re: [slurm-users] maintenance partitions?

2018-10-05 Thread Greg Wickham

We use “maintenance” reservations to prevent nodes from receiving production 
jobs.

https://slurm.schedmd.com/reservations.html

Create a reservation with “flags=maint” and it will override other reservations
(if they exist).
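
For example (node names and attributes are illustrative):

$ scontrol create reservation reservationname=maint_rack3 nodes=node[101-104] \
      starttime=now duration=UNLIMITED flags=maint,ignore_jobs users=root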

   -greg

> On 5 Oct 2018, at 4:06 PM, Michael Di Domenico  wrote:
>
> Is anyone on the list using maintenance partitions for broken nodes?
> If so, how are you moving nodes between partitions?
>
> The situation with my machines at the moment, is that we have a steady
> stream of new jobs coming into the queues, but broken nodes as well.
> I'd like to fix those broken nodes and re-add them to a separate
> non-production pool so that user jobs don't match, but allow me to run
> maintenance jobs on the nodes to prove things are working before
> giving them back to the users
>
> if i simply mark nodes with downnodes= or scontrol update state=drain,
> slurm will prevent users from new jobs, but wont allow me to run jobs
> on the nodes
>
> Ideally, i'd like to have a prod partition and a maint partition,
> where the maint partition is set to exclusiveuser and i can set the
> status of a node in the prod partition to drain without affecting the
> node status in the maint partition.  I don't believe I can do this
> though.  I believe i have to change the slurm.conf and reconfigure to
> add/remove nodes from one partition or the other
>
> if anyone has a better solution, i'd like to hear it.
>

--
Dr. Greg Wickham
Advanced Computing Infrastructure Team Lead
Advanced Computing Core Laboratory
King Abdullah University of Science and Technology
Building #1, Office #0124
greg.wick...@kaust.edu.sa +966 544 700 330
--


